Matcher with "SENT_START": False works differently with Sentencizer vs. dependency parser #5287
Comments
This relates to #4775. I agree that it's confusing that one component sets non-sentence boundaries as [...]. As a temporary workaround, have you tried pulling in branch #5282 to check whether that would solve your issue, using [...]
It doesn't look like #5282 solves the issue. In fact, it looks like a completely different issue.
This is the same error as I got when trying the workarounds mentioned above. The issue seems to be that any type of extended pattern syntax (like
This issue seems to be independent of #5282. Just to confirm this, here is a self-contained code snippet, which breaks in the last line:
For [...] My proposed API is kind of clunky, but I think the easiest short-term solution is to normalize this value to [...]
How to reproduce the behaviour
Matcher with a token pattern containing "SENT_START": False gives the intended result with sentence boundaries set by the parser, but not if sentence boundaries are set by the Sentencizer.
Say we want to identify (and merge) sequences of tokens in Title Case, but obviously we do not want to include the first token in a sentence. Then the following token pattern works with spaCy version >= 2.2.4
printing
as expected.
But when we disable the parser and instead use Sentencizer to assign sentence boundaries, we get
where European Central Bank is not merged as an entity (full code is listed at the bottom).
Possible causes, solutions, and (currently impossible) workarounds
When a token is not at the beginning of a sentence, the dependency parser does not actually set token.is_sent_start (leaving it as None), while Sentencizer sets it to False.
(How exactly this leads to the difference in matching behavior is not 100% clear to me.)
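One possible explanation, offered as a sketch and not as a description of spaCy's actual code: assume the sentence-start flag is stored as an integer (`1` for `True`, `-1` for `False`, `0` for unset, i.e. `None`) and that the Matcher compares the pattern value after converting it to an integer. Under those assumptions, `False` in a pattern becomes `0` and matches only tokens whose flag was never set. Both function names below are hypothetical.

```python
from typing import Optional


def encode_sent_start(value: Optional[bool]) -> int:
    """Assumed internal encoding: True -> 1, False -> -1, None (unset) -> 0."""
    if value is None:
        return 0
    return 1 if value else -1


def naive_attr_match(pattern_value: bool, token_value: Optional[bool]) -> bool:
    """Hypothetical matcher check: compare int(pattern value) to the stored flag."""
    return int(pattern_value) == encode_sent_start(token_value)


# Parser-style annotation: non-initial tokens are left unset (None).
parser_flags = [True, None, None, None]
# Sentencizer-style annotation: non-initial tokens are explicitly False.
sentencizer_flags = [True, False, False, False]

# Which tokens would satisfy "SENT_START": False under the assumptions above?
parser_hits = [naive_attr_match(False, flag) for flag in parser_flags]
sentencizer_hits = [naive_attr_match(False, flag) for flag in sentencizer_flags]

print(parser_hits)
print(sentencizer_hits)
```

In this toy model the pattern matches every non-initial token in the parser-style annotation and none in the Sentencizer-style annotation, which is the asymmetry reported above.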
Trivalent sentence boundary marking and assigning is_sentenced
The behavior of the dependency parser is probably the better option, because with its trivalent logic multiple sentence segmentation components could complement each other in a given pipeline.
But just removing the following two lines
from Sentencizer.set_annotations would cause problems with Docs consisting of a single sentence: given how it is currently implemented, Doc.is_sentenced would be False.
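To make the single-sentence problem concrete, here is a toy model of the check. It assumes that `Doc.is_sentenced` reports `True` only when some token after the first carries an explicit flag (`True` or `False`), and that very short documents count as sentenced. `is_sentenced_sketch` is a hypothetical stand-in, not the library's implementation.

```python
from typing import List, Optional


def is_sentenced_sketch(flags: List[Optional[bool]]) -> bool:
    """Toy check: is any sentence-start flag after the first token explicitly set?"""
    if len(flags) < 2:
        # Assumption: a document with fewer than two tokens counts as sentenced.
        return True
    return any(flag is not None for flag in flags[1:])


# Current Sentencizer behavior: non-initial tokens are explicitly False.
current_single_sentence = [True, False, False]
# Trivalent marking: non-initial tokens are left unset (None).
trivalent_single_sentence = [True, None, None]
trivalent_two_sentences = [True, None, True, None]

print(is_sentenced_sketch(current_single_sentence))
print(is_sentenced_sketch(trivalent_single_sentence))
print(is_sentenced_sketch(trivalent_two_sentences))
```

Under these assumptions, only the single-sentence document with trivalent marking is reported as not sentenced, because nothing after its first token was ever set.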
(Currently impossible?) workarounds
Instead of specifying "SENT_START": False, we could try to write the pattern in terms of != True, which would include both False and None.
Unfortunately, this does not seem possible right now.
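The intended "not `True`" semantics can be illustrated in plain Python. This is a sketch of the goal only, not something the Matcher accepted at the time; the helper `title_case_runs`, its `min_len` parameter, and the example sentence are all made up for illustration.

```python
from typing import List, Optional, Tuple


def title_case_runs(
    words: List[str],
    sent_starts: List[Optional[bool]],
    min_len: int = 2,
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of Title Case runs, skipping sentence-initial tokens."""
    runs = []
    start = None
    for i, (word, flag) in enumerate(zip(words, sent_starts)):
        # "is not True" accepts both False and None, unlike an equality test against False.
        eligible = word.istitle() and flag is not True
        if eligible and start is None:
            start = i
        elif not eligible and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the token list.
    if start is not None and len(words) - start >= min_len:
        runs.append((start, len(words)))
    return runs


words = ["The", "president", "of", "the", "European", "Central", "Bank", "spoke", "."]
parser_style = [True] + [None] * 8        # non-initial tokens left unset
sentencizer_style = [True] + [False] * 8  # non-initial tokens explicitly False

print(title_case_runs(words, parser_style))
print(title_case_runs(words, sentencizer_style))
```

Both annotation styles give the same run here, covering "European Central Bank", because the check does not distinguish `False` from `None`.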
The following two approaches yield errors complaining that the complex right-hand side is not of type 'boolean' [0 -> SENT_START]:
(The latter issue may be related to #5281.)
Finally, I tried writing the pattern like
and the results were wild: in the above example it will (expectedly) yield the merged token the European Central Bank; but when a sentence starts with multiple tokens in Title Case, sentence boundaries get messed up (again expectedly: merging a match involving a sentence boundary invalidates the Spans marking sentences).
The full code to reproduce the second output above (using IS_SENT_START instead of SENT_START yields the same pattern):
Your Environment