Adding custom Sentencizer leads to inconsistent state when parsing doc #3569

krlng · 2019-04-10T13:12:25Z

How to reproduce the behaviour

import spacy
from spacy.pipeline import Sentencizer

nlp2 = spacy.load('en_core_web_md')
sentencizer = Sentencizer()
# add_pipe appends the component last by default, i.e. after the parser
nlp2.add_pipe(sentencizer)
tokens = nlp2(u'apple peach banana')

Gives me:
ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

Your Environment

spaCy version: 2.1.3
Location: C:\Users<xx>\AppData\Local\Continuum\lib\site-packages\spacy
Platform: Windows-10-10.0.16299-SP0
Python version: 3.6.6

ines · 2019-04-10T13:29:17Z

This is expected behaviour, and it's the reason spaCy raises this error. By default, nlp.add_pipe adds the component last in the pipeline (so in this case, after the parser). If no sentence boundaries are set when the parser sees the Doc, it will assign them, and all of its predictions will be consistent with those sentence boundaries. If you then tried to change those boundaries (e.g. using a rule-based component like the sentencizer), you could easily end up with inconsistent linguistic annotations, like dependencies spanning sentence boundaries.

So if you want to use custom sentence boundaries, you'd have to apply them before the parser – e.g. via nlp.add_pipe(sentencizer, before="parser") or nlp.add_pipe(sentencizer, first=True). The parser will then take your custom boundaries into account as well and only assign dependencies that are consistent with the sentences. This can sometimes lead to an improvement in parsing accuracy.

Also see the docs here for details and examples: https://spacy.io/usage/linguistic-features#sbd-custom

krlng · 2019-04-11T07:36:51Z

Thank you ines, that makes it really clear 👍

lock · 2019-05-11T08:44:04Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the feat / parser (Feature: Dependency Parser), feat / pipeline (Feature: Processing pipeline and components), and usage (General spaCy usage) labels Apr 10, 2019

krlng closed this as completed Apr 11, 2019

lock bot locked as resolved and limited conversation to collaborators May 11, 2019
