Combining both model-based and rule-based sentencizers #11107

Thank you @polm for your answer. I've actually used the overwrite setting before, but it didn't help. Let me work through an example to show what I'm doing and the outputs:

Experiment text:

example_text = """
That is a normal text. I've 

found online. I will use it to show the Sentencizer splits
"""

I will identify sentences using:

for row_id, row in enumerate(nlp(example_text).sents):
    print(row_id, row)

When I use the en_core_web_lg language model, I get:

0 
That is a normal text.
1 I've 

found online.
2 I will use it to show the Sentencizer splits

After adding the sentencizer:

def _add_custom_sentencizer(nlp):
    config = {"punct_chars": ["\n", "\n\n"], "overwrite": True}
    nlp.a…
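
For comparison, here is a minimal sketch of one way to layer a newline rule on top of another segmenter. It is not the code from this thread. The component name `newline_boundaries` is hypothetical, and a blank English pipeline with the default `sentencizer` stands in for en_core_web_lg so the example runs without downloading a model. The rule sets `is_sent_start` only on the token that follows a newline and leaves every other token unset, so the later component can still make its own decisions.

```python
import spacy
from spacy.language import Language


# Hypothetical component name; not taken from the original post.
@Language.component("newline_boundaries")
def newline_boundaries(doc):
    # Mark the token after any whitespace token that contains a newline
    # as a sentence start. All other tokens stay unset, so a later
    # component can still decide on them.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc


# Assumption: a blank pipeline plus the punctuation-based sentencizer
# stands in for the trained model here.
nlp = spacy.blank("en")
nlp.add_pipe("newline_boundaries")
# "overwrite" defaults to False, so the boundaries set above are kept.
nlp.add_pipe("sentencizer")

text = "That is a normal text. I've \n\nfound online. I will use it."
sentences = [sent.text.strip() for sent in nlp(text).sents]
for row_id, row in enumerate(sentences):
    print(row_id, row)
```

With a trained pipeline, the analogous step would probably be `nlp.add_pipe("newline_boundaries", before="parser")`, because the parser keeps sentence boundaries that were set before it runs. Checking `"\n" in token.text` rather than listing `"\n"` and `"\n\n"` also covers longer runs of newlines, which the tokenizer emits as a single whitespace token.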

Answer selected by antonpibm
Labels
feat / pipeline Feature: Processing pipeline and components
2 participants