Adding custom Sentencizer leads to inconsistent state when parsing doc #3569

Closed
krlng opened this issue Apr 10, 2019 · 3 comments
Labels: feat / parser, feat / pipeline, usage

Comments

krlng commented Apr 10, 2019

How to reproduce the behaviour

import spacy
from spacy.pipeline import Sentencizer

nlp2 = spacy.load('en_core_web_md')
sentencizer = Sentencizer()
nlp2.add_pipe(sentencizer)
tokens = nlp2(u'apple peach banana')  # raises the ValueError shown below

Gives me:
ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

Your Environment

spaCy version: 2.1.3
Location: C:\Users<xx>\AppData\Local\Continuum\lib\site-packages\spacy
Platform: Windows-10-10.0.16299-SP0
Python version: 3.6.6

Member

ines commented Apr 10, 2019

This is expected behaviour, and it is why spaCy raises this error. By default, nlp.add_pipe adds the component last in the pipeline (in this case, after the parser). If no sentence boundaries are set when the parser sees the Doc, the parser assigns them, and all of its predictions are consistent with those boundaries. If you then change the boundaries afterwards (e.g. with a rule-based component like the sentencizer), you can easily end up with inconsistent linguistic annotations, such as dependencies that span sentence boundaries.

So if you want to use custom sentence boundaries, you have to apply them before the parser, e.g. via nlp.add_pipe(sentencizer, before="parser") or nlp.add_pipe(sentencizer, first=True). The parser then takes your custom boundaries into account and only assigns dependencies that are consistent with those sentences. This can sometimes improve parsing accuracy.

Also see the docs here for details and examples: https://spacy.io/usage/linguistic-features#sbd-custom
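
To make the ordering argument concrete, here is a toy model in plain Python. It is not spaCy's implementation and does not import spaCy: ToyDoc, ToyPipeline, toy_parser and toy_sentencizer are hypothetical names invented for this sketch, and the "parser" only fills in missing boundaries and sets a flag. It only illustrates the rule described above, namely that sentence starts can be written before the parser has run but not after.

```python
class ToyDoc:
    """Hypothetical stand-in for a Doc: words, sentence starts, parsed flag."""

    def __init__(self, words):
        self.words = list(words)
        self.sent_starts = [None] * len(self.words)  # None = not decided yet
        self.is_parsed = False

    def set_sent_start(self, i, value):
        # Same idea as E043: boundaries are frozen once the doc is parsed.
        if self.is_parsed:
            raise ValueError("refusing to write sent_start on a parsed doc")
        self.sent_starts[i] = value


def toy_sentencizer(doc):
    """Rule-based: a sentence starts at token 0 and after each '.'."""
    for i in range(len(doc.words)):
        doc.set_sent_start(i, i == 0 or doc.words[i - 1] == ".")
    return doc


def toy_parser(doc):
    """Keeps boundaries that are already set, fills in the rest, marks parsed."""
    for i, value in enumerate(doc.sent_starts):
        if value is None:
            doc.sent_starts[i] = (i == 0)
    doc.is_parsed = True
    return doc


class ToyPipeline:
    """Hypothetical pipeline with an add_pipe that appends by default."""

    def __init__(self):
        self.pipes = []  # list of (name, component) pairs, in run order

    def add_pipe(self, component, name, before=None):
        if before is None:
            self.pipes.append((name, component))  # default: added last
        else:
            index = [n for n, _ in self.pipes].index(before)
            self.pipes.insert(index, (name, component))

    def __call__(self, words):
        doc = ToyDoc(words)
        for _, component in self.pipes:
            doc = component(doc)
        return doc


words = ["apple", "peach", ".", "banana"]

# Sentencizer appended last, i.e. after the parser: the write is refused.
late = ToyPipeline()
late.add_pipe(toy_parser, "parser")
late.add_pipe(toy_sentencizer, "sentencizer")
try:
    late(words)
    late_error = None
except ValueError as err:
    late_error = str(err)

# Sentencizer inserted before the parser: boundaries are set first and kept.
early = ToyPipeline()
early.add_pipe(toy_parser, "parser")
early.add_pipe(toy_sentencizer, "sentencizer", before="parser")
doc = early(words)
```

In the first pipeline the sentencizer runs on an already parsed document and the write is rejected. In the second, the custom boundaries are in place before the toy parser runs, so it leaves them alone.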

ines added the feat / parser, feat / pipeline, and usage labels on Apr 10, 2019
Author

krlng commented Apr 11, 2019

Thank you ines, that makes it really clear 👍

krlng closed this as completed on Apr 11, 2019

lock bot commented May 11, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this as resolved and limited conversation to collaborators on May 11, 2019