Combining both model and rule based sentensizers #11107

antonpibm · 2022-07-10T09:01:35Z

Hi everyone.

When I iterate over the sentences in a doc using the .sents generator, I find that sentences are sometimes not split at the newline character (\n).
That may make sense for a general-purpose approach, but in my data a newline should always start a new sentence. I tried adding this rule with the sentencizer pipeline component, but it turns out that only one of the sentencizer or the model-based approach takes effect. I tried adding the sentencizer at both the beginning and the end of the pipeline.

Code for adding the new symbols:

    def _add_custom_sentencizer(nlp):
        config = {"punct_chars": ["\n", "\n\n"]}
        nlp.add_pipe("sentencizer", config=config, first=True)
    _add_custom_sentencizer(nlp)

Am I doing something wrong, or is there another way to add exceptions to the model-based sentence splitting?

Thanks,
Anton

polm · 2022-07-11T05:36:10Z

You should be able to use both the sentence recognizer and the sentencizer together without issue. When you say "only one can be used" what do you mean - do you get an error or something?

The issue might be that if you don't set the overwrite setting, the second component won't do anything, because it won't change the annotations of the first component. You can turn on overwrite like this:

# assume sentence recognizer is already added
config = {"punct_chars": ["\n", "\n\n"], "overwrite": True}
# don't put first=True - this should be after the recognizer
nlp.add_pipe("sentencizer", config=config)

antonpibm · 2022-07-11T07:04:12Z

Thank you @polm for your answer. I had actually tried the overwrite setting before, but it didn't help. Let me work through an example to show what I'm doing and the outputs:

Experiment text:

example_text = """
That is a normal text. I've 

found online. I will use it to show the Sentencizer splits
"""

I will identify sentences using:

for row_id, row in enumerate(nlp(example_text).sents):
    print(row_id, row)

With the en_core_web_lg pipeline I get:

0 
That is a normal text.
1 I've 

found online.
2 I will use it to show the Sentencizer splits

After adding the sentencizer:

def _add_custom_sentencizer(nlp):
    config = {"punct_chars": ["\n", "\n\n"], "overwrite": True}
    nlp.add_pipe("sentencizer", config=config)
_add_custom_sentencizer(nlp)

The order of pipeline components:

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'sentencizer']

The modified pipeline gives this output:

0 

1 That is a normal text. I've 


2 found online. I will use it to show the Sentencizer splits

When I say "only one can be used", I mean that I would expect the combination of both splitting by \n and by the model, something like:

0 
1 That is a normal text
2 I've
3 found online
4 I will use it to show the Sentencizer splits

antonpibm Jul 11, 2022
Author

@polm I saw you've mentioned the "sentence recognizer" should also be loaded.

When I check the components in the model, I'm getting:

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x199a341c0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1d0ee9580>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1ca27f200>),
 ('senter', <spacy.pipeline.senter.SentenceRecognizer at 0x1b8a99fa0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1bcd6fbc0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1640f3b40>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1ca27f5f0>)]

But the list of components gives:

nlp.pipe_names # ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
nlp.has_pipe("senter") # False

polm Jul 11, 2022

Ah, OK, I see what you mean about combining them. The issue is that when overwriting it overwrites every token, not just those that are definitely sentence starts.

There are a couple of ways to handle this, but here's a simple one - you can use this custom component instead of the sentencizer.

import spacy
from spacy.language import Language

@Language.component("line_sent")
def line_senter(doc):
    # Mark only newline tokens as sentence starts; every other token
    # keeps is_sent_start unset (None), so the parser can still decide.
    for tok in doc:
        if tok.text in ("\n", "\n\n"):
            tok.is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
# Run before the parser so it sees the preset boundaries.
nlp.add_pipe("line_sent", first=True)

text = """Hello. How are you? I see 

you have newlines. Where can I get some?"""
doc = nlp(text)

for ii, sent in enumerate(doc.sents):
    print(ii, sent.text.strip())

This works because tok.is_sent_start has three valid states: True, False, or None (unset). The sentencizer sets every token to True or False, but this component only sets some tokens to True and leaves the rest unset. The parser will generally respect anything already set to True or False when the doc reaches it (though see #7716).
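To make the three states concrete, here is a toy model in plain Python. It is only a sketch and not spaCy's implementation: the token list and the "model" predictions are written by hand, and the helper names (mark_newlines, fill_unset, overwrite_all) are made up for illustration. It contrasts a rule step that leaves other tokens unset with one that overwrites every token.

```python
from typing import List, Optional

NEWLINES = ("\n", "\n\n")


def mark_newlines(tokens: List[str]) -> List[Optional[bool]]:
    """Rule step: True on newline tokens, None (unset) everywhere else."""
    return [True if tok in NEWLINES else None for tok in tokens]


def fill_unset(flags: List[Optional[bool]], predicted: List[bool]) -> List[bool]:
    """Simplified model step: keep preset True/False, fill only None values."""
    return [pred if flag is None else flag for flag, pred in zip(flags, predicted)]


def overwrite_all(tokens: List[str]) -> List[bool]:
    """Simplified overwriting rule step: decides every token, so earlier
    annotations are discarded. A sentence starts at the first token and
    after each newline token."""
    return [i == 0 or tokens[i - 1] in NEWLINES for i in range(len(tokens))]


def starts(flags: List[bool]) -> List[int]:
    """Indices of tokens marked as sentence starts."""
    return [i for i, flag in enumerate(flags) if flag]


# Hand-written tokens and hypothetical model predictions (assumptions).
tokens = ["Hello", ".", "How", "are", "you", "?", "I", "see", "\n\n",
          "you", "have", "newlines", "."]
predicted = [True, False, True, False, False, False, True,
             False, False, False, False, False, False]

combined = fill_unset(mark_newlines(tokens), predicted)
overwritten = overwrite_all(tokens)

print(starts(combined))     # rule and model boundaries both survive
print(starts(overwritten))  # only the rule's boundaries remain
```

In this toy setup the first call keeps the model's boundaries at tokens 0, 2 and 6 and adds the newline boundary at token 8, while the overwriting variant ends up with boundaries only at tokens 0 and 9, which mirrors the output seen earlier with "overwrite": True.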

About the senter: it's included in pretrained pipelines by default, but it doesn't show up in nlp.pipe_names because it's disabled. The parser also sets sentence boundaries, so you don't need the senter if you're using the parser.

antonpibm Jul 11, 2022
Author

@polm Thank you very much for the clarification and code example. This works and I've verified that dependencies are also parsed correctly.

I'd like to ask: is this a bug? At first you mentioned that this should work.

polm Jul 12, 2022

Glad that solved your issue!

This is not a bug. I just hadn't actually done this before and hadn't fully thought through how the overwrite settings would interact.
