Bugfix in noun chunks #5470
Conversation
I will add that I couldn't reproduce this with
Hmm, there's some nested looping here that makes me nervous. I think this is another place where we might be accidentally introducing quadratic behaviour. On long documents it could get really slow, and I think it's unnecessary? Noun chunks are contiguous, right? Like, they have to form a span. So we only need to test the boundaries of two chunks to ensure they don't overlap. We also know that we're iterating in order. So why can't we just check
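The argument above can be sketched without spaCy: if candidate chunks are produced in order of their start index, a single `prev_end` cursor is enough to drop overlaps in one linear pass. This is an illustrative standalone sketch, not code from the PR:

```python
def filter_overlaps(spans):
    """Drop spans that overlap a previously kept span.

    Assumes `spans` is an iterable of (start, end) pairs sorted by start,
    with `end` exclusive -- the same convention as the noun chunk code.
    Because the input is sorted, one cursor suffices: no nested loops.
    """
    prev_end = 0
    for start, end in spans:
        if start < prev_end:  # overlaps the last span we kept
            continue
        yield (start, end)
        prev_end = end
```

For example, `filter_overlaps([(0, 3), (2, 4), (3, 6), (5, 8)])` keeps `(0, 3)` and `(3, 6)` and drops the two spans that overlap them.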
Does this work? I'm assuming that

```python
def noun_chunks(obj):
    """
    Detect base noun phrases. Works on both Doc and Span.
    """
    # It follows the logic of the noun chunks finder of English language,
    # adjusted to some Greek language special characteristics.
    # obj tag corrects some DEP tagger mistakes.
    # Further improvement of the models will eliminate the need for this tag.
    labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
    doc = obj.doc  # Ensure works on both Doc and Span.
    if not doc.is_parsed:
        raise ValueError(Errors.E029)
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    nmod = doc.vocab.strings.add("nmod")
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for i, word in enumerate(obj):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            if word.pos == NOUN:
                for potential_nmod in word.rights:
                    if potential_nmod.dep == nmod:
                        yield word.left_edge.i, potential_nmod.i + 1, np_label
                        prev_end = potential_nmod.i + 1
                        break
                else:
                    yield word.left_edge.i, word.i + 1, np_label
                    prev_end = word.i + 1
        elif word.dep == conj:
            # covers the case: έχει όμορφα και έξυπνα παιδιά
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
                prev_end = word.i + 1
```
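The snippet above leans on Python's `for ... else` clause: the `else` body runs only when the loop finishes without hitting `break`, i.e. when no `nmod` child was found. A minimal standalone illustration of that pattern (names here are illustrative, not from the PR):

```python
def first_match(items, predicate, default):
    """Return the first item satisfying `predicate`, else `default`.

    Mirrors the for/else shape in the noun chunk iterator:
    `break` on a match, fall through to the `else` when nothing matched.
    """
    for item in items:
        if predicate(item):
            result = item
            break
    else:
        # Reached only if the loop completed without `break`.
        result = default
    return result
```

So `first_match([1, 4, 6], lambda x: x % 2 == 0, None)` returns `4`, while `first_match([1, 3, 5], lambda x: x % 2 == 0, None)` falls through to the `else` and returns `None`.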
I wish I could figure out the
The whole parse representation stuff went to hell when we added support for non-projective parses :(. We can't not support non-projective, but it makes life way harder.
I could reproduce the issue from #5458 with
You're right, that's much better. And your code snippet from your other post makes sense. I'll adjust accordingly!
# Conflicts:
#	spacy/lang/el/syntax_iterators.py
#	spacy/lang/en/syntax_iterators.py
#	spacy/lang/fa/syntax_iterators.py
#	spacy/lang/fr/syntax_iterators.py
#	spacy/lang/id/syntax_iterators.py
#	spacy/lang/nb/syntax_iterators.py
#	spacy/lang/sv/syntax_iterators.py
Add a similar fix as in explosion#5470 to prevent the German noun chunks iterator from producing overlapping spans.
Description
There was a bug in the code that was supposed to prevent overlapping noun chunks. It checked whether a token's `subtree` was not yet in a chunk, and if not, created a new chunk from `[word.left_edge.i, word.i + 1]`. But every once in a while, depending on the model and its predicted POS tags / deps, the tokens in the `subtree` would actually be entirely different from the tokens in that span, so the check would fail. Replaced the code with a more straightforward check on the actual tokens in the span being defined as a noun chunk.
Fixes #5458
Fixes #5393
Types of change
bug fix
Checklist