
Bugfix in noun chunks #5470

Merged: 6 commits merged into explosion:master on May 21, 2020
Conversation

svlandeg (Member)

Description

There was a bug in the code that was supposed to prevent overlapping noun chunks. It checked whether a token's subtree was not yet part of a chunk and, if it wasn't, created a new chunk spanning [word.left_edge.i, word.i + 1].

But every once in a while, depending on the model and its predicted POS tags / dependencies, the tokens in the subtree would be entirely different from the tokens in that span, so the check failed to prevent the overlap.

Replaced the code with a more straightforward check on the actual tokens in the span that is being defined as a noun chunk.

Fixes #5458
Fixes #5393
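The idea behind the fix can be sketched abstractly. The helpers below are hypothetical (not the actual spaCy patch): they check the actual token indices of a proposed span before accepting it, rather than relying on a token's subtree, which may contain different tokens than the span.

```python
# Hypothetical sketch of span-based overlap prevention. Spans are
# (start, end) index pairs with an exclusive end, and `seen` tracks
# every token index already claimed by an accepted chunk.

def spans_overlap_seen(start, end, seen):
    """Return True if any token index in [start, end) was already chunked."""
    return any(i in seen for i in range(start, end))

def add_chunk(chunks, seen, start, end):
    """Accept the span only if none of its actual tokens are taken yet."""
    if not spans_overlap_seen(start, end, seen):
        chunks.append((start, end))
        seen.update(range(start, end))
    return chunks

chunks, seen = [], set()
add_chunk(chunks, seen, 0, 3)  # accepted
add_chunk(chunks, seen, 2, 5)  # rejected: token 2 is already chunked
add_chunk(chunks, seen, 3, 6)  # accepted: starts after the first chunk ends
```

Checking the span's own indices makes the test immune to subtree/span mismatches, at the cost of per-token bookkeeping.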

Types of change

bug fix

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg added the labels bug (Bugs and behaviour differing from documentation) and feat / pipeline (Feature: Processing pipeline and components) on May 20, 2020
@adrianeboyd (Contributor)

I will add that I couldn't reproduce this on master with the exact same models, but I haven't figured out what might have changed. I guess it would have to be the parse trees somehow, given your test case?

@honnibal (Member)

Hmm, there's some nested looping here that makes me nervous. I think this is another place where we might be introducing accidentally quadratic behaviour. On long documents it could get really slow, and I think it's unnecessary?

Noun chunks are contiguous, right? Like, they have to form a span. So we only need to test the boundaries of two chunks to ensure they don't overlap. We also know that we're iterating in order. So why can't we just check chunk[0].i > prev[-1].i?

@honnibal (Member) commented May 21, 2020

Does this work? I'm assuming that word.left_edge.i <= word.i and that everything in word.rights is indeed to the right of the word. If these aren't true, then we've definitely got other problems, right?

from spacy.symbols import NOUN, PROPN, PRON
from spacy.errors import Errors


def noun_chunks(obj):
    """
    Detect base noun phrases. Works on both Doc and Span.
    """
    # It follows the logic of the noun chunks finder of the English language,
    # adjusted to some special characteristics of the Greek language.
    # The obj tag corrects some DEP tagger mistakes.
    # Further improvement of the models will eliminate the need for this tag.
    labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
    doc = obj.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    nmod = doc.vocab.strings.add("nmod")
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1  # last token index of the previous chunk (inclusive)
    for i, word in enumerate(obj):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            if word.pos == NOUN:
                for potential_nmod in word.rights:
                    if potential_nmod.dep == nmod:
                        yield word.left_edge.i, potential_nmod.i + 1, np_label
                        prev_end = potential_nmod.i
                        break
                else:
                    # No nmod child found: the chunk ends at the word itself
                    yield word.left_edge.i, word.i + 1, np_label
                    prev_end = word.i
            else:
                # PROPN / PRON heads: the chunk ends at the word itself
                yield word.left_edge.i, word.i + 1, np_label
                prev_end = word.i
        elif word.dep == conj:
            # covers the case: έχει όμορφα και έξυπνα παιδιά
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
                prev_end = word.i
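The nmod lookup in the snippet above relies on Python's for/else: the else branch runs only when the loop finishes without hitting break, i.e. when no right-hand child with an nmod dependency was found. A tiny standalone illustration of the idiom (names are hypothetical):

```python
# Demonstrates for/else: `else` runs only if the loop never breaks.

def first_match(items, target):
    for x in items:
        if x == target:
            result = "found " + x
            break
    else:
        result = "no match"
    return result

print(first_match(["obj", "nmod"], "nmod"))
print(first_match(["obj"], "nmod"))
```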

@adrianeboyd (Contributor)

I wish I could figure out the set_children_from_heads problem from #4497. Something in there with lefts/rights/edges is probably related.

@honnibal (Member)

The whole parse representation went to hell when we added support for non-projective parses :(. We can't not support non-projective parsing, but it makes life way harder.

@svlandeg (Member, Author) commented May 21, 2020

> I will add that I couldn't reproduce this on master with the exact same models, but I haven't figured out what might have changed. I guess it would have to be the parse trees somehow, given your test case?

I could reproduce the issue from #5458 with en_core_web_md (not with other models) and the latest code from master - that's what the unit test replicates. But yeah, it's very much dependent on the exact POS & dep tags that are being predicted.
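The invariant that unit test checks can be expressed model-free on (start, end) token-index pairs; assert_no_overlap below is a hypothetical helper for illustration, not part of the spaCy test suite. With a model installed one would feed it [(c.start, c.end) for c in doc.noun_chunks].

```python
# Verifies the chunk invariant: in-order spans must never overlap.
# `end` is exclusive, matching spaCy's Span.start / Span.end convention.

def assert_no_overlap(spans):
    """Raise AssertionError if any span starts before the previous one ends."""
    prev_end = 0
    for start, end in sorted(spans):
        assert start >= prev_end, f"span ({start}, {end}) overlaps previous"
        prev_end = end
    return True

assert_no_overlap([(0, 2), (2, 4), (5, 7)])  # adjacent spans are fine
```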

@svlandeg (Member, Author) commented May 21, 2020

> So we only need to test the boundaries of two chunks to ensure they don't overlap. We also know that we're iterating in order. So why can't we just check chunk[0].i > prev[-1].i?

You're right, that's much better. And your code snippet from your other post makes sense. I'll adjust accordingly!

# Conflicts:
#	spacy/lang/el/syntax_iterators.py
#	spacy/lang/en/syntax_iterators.py
#	spacy/lang/fa/syntax_iterators.py
#	spacy/lang/fr/syntax_iterators.py
#	spacy/lang/id/syntax_iterators.py
#	spacy/lang/nb/syntax_iterators.py
#	spacy/lang/sv/syntax_iterators.py
@honnibal honnibal merged commit 5ce02c1 into explosion:master May 21, 2020
@svlandeg svlandeg deleted the bugfix/noun-chunks branch May 21, 2020 21:31
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Sep 22, 2020
Add a similar fix as in explosion#5470 to prevent the German noun chunks iterator
from producing overlapping spans.
honnibal pushed a commit that referenced this pull request Sep 22, 2020
Add a similar fix as in #5470 to prevent the German noun chunks iterator
from producing overlapping spans.