
Bugfix in noun chunks #5470

Merged: 6 commits merged into explosion:master on May 21, 2020
Conversation

svlandeg (Member)

Description

There was a bug in the code that was supposed to prevent overlapping noun chunks. It checked whether a token's subtree was not yet part of a chunk and, if it wasn't, created a new chunk spanning [word.left_edge.i, word.i + 1].

But every once in a while, depending on the model and its predicted POS tags / dependencies, the tokens in the subtree would be entirely different from the tokens in that span, so the check failed to prevent the overlap.

Replaced the code with a more straightforward check on the actual tokens in the span that is being defined as a noun chunk.

Fixes #5458
Fixes #5393
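The idea behind the fix can be sketched abstractly. The helpers below are hypothetical (not the actual spaCy patch): they check the actual token indices of a proposed span before accepting it, rather than relying on a token's subtree, which may contain different tokens than the span.

```python
# Hypothetical sketch of span-based overlap prevention. Spans are
# (start, end) index pairs with an exclusive end, and `seen` tracks
# every token index already claimed by an accepted chunk.

def spans_overlap_seen(start, end, seen):
    """Return True if any token index in [start, end) was already chunked."""
    return any(i in seen for i in range(start, end))

def add_chunk(chunks, seen, start, end):
    """Accept the span only if none of its actual tokens are taken yet."""
    if not spans_overlap_seen(start, end, seen):
        chunks.append((start, end))
        seen.update(range(start, end))
    return chunks

chunks, seen = [], set()
add_chunk(chunks, seen, 0, 3)  # accepted
add_chunk(chunks, seen, 2, 5)  # rejected: token 2 is already chunked
add_chunk(chunks, seen, 3, 6)  # accepted: starts after the first chunk ends
```

Checking the span's own indices makes the test immune to subtree/span mismatches, at the cost of per-token bookkeeping.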

Types of change

bug fix

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg added the labels bug (Bugs and behaviour differing from documentation) and feat / pipeline (Feature: Processing pipeline and components) on May 20, 2020
@adrianeboyd (Contributor)

I will add that I couldn't reproduce this on master with the exact same models, but I haven't figured out what might have changed. I guess it would have to be the parse trees somehow, given your test case?

@honnibal (Member)

Hmm, there's some nested looping here that makes me nervous. I think this is another place where we might be introducing accidentally quadratic behaviour. On long documents it could get really slow, and I think it's unnecessary?

Noun chunks are contiguous, right? Like, they have to form a span. So we only need to test the boundaries of two chunks to ensure they don't overlap. We also know that we're iterating in order. So why can't we just check chunk[0].i > prev[-1].i?

@honnibal (Member) commented May 21, 2020

Does this work? I'm assuming that word.left_edge.i <= word.i and that everything in word.rights is indeed to the right of the word. If these aren't true, then we've definitely got other problems, right?

from spacy.symbols import NOUN, PROPN, PRON
from spacy.errors import Errors


def noun_chunks(obj):
    """
    Detect base noun phrases. Works on both Doc and Span.
    """
    # It follows the logic of the noun chunks finder of the English language,
    # adjusted to some special characteristics of the Greek language.
    # The obj tag corrects some DEP tagger mistakes.
    # Further improvement of the models will eliminate the need for this tag.
    labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
    doc = obj.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    nmod = doc.vocab.strings.add("nmod")
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1  # last token index of the previous chunk (inclusive)
    for i, word in enumerate(obj):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            if word.pos == NOUN:
                for potential_nmod in word.rights:
                    if potential_nmod.dep == nmod:
                        yield word.left_edge.i, potential_nmod.i + 1, np_label
                        prev_end = potential_nmod.i
                        break
                else:
                    # No nmod child found: the chunk ends at the word itself
                    yield word.left_edge.i, word.i + 1, np_label
                    prev_end = word.i
            else:
                # PROPN / PRON heads: the chunk ends at the word itself
                yield word.left_edge.i, word.i + 1, np_label
                prev_end = word.i
        elif word.dep == conj:
            # covers the case: έχει όμορφα και έξυπνα παιδιά
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
                prev_end = word.i
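The nmod lookup in the snippet above relies on Python's for/else: the else branch runs only when the loop finishes without hitting break, i.e. when no right-hand child with an nmod dependency was found. A tiny standalone illustration of the idiom (names are hypothetical):

```python
# Demonstrates for/else: `else` runs only if the loop never breaks.

def first_match(items, target):
    for x in items:
        if x == target:
            result = "found " + x
            break
    else:
        result = "no match"
    return result

print(first_match(["obj", "nmod"], "nmod"))
print(first_match(["obj"], "nmod"))
```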

@adrianeboyd (Contributor)

I wish I could figure out the set_children_from_heads problem from #4497. Something in there with lefts/rights/edges is probably related.

@honnibal (Member)

The whole parse representation went to hell when we added support for non-projective parses :(. We can't not support non-projective parsing, but it makes life way harder.

@svlandeg (Member, Author) commented May 21, 2020

> I will add that I couldn't reproduce this on master with the exact same models, but I haven't figured out what might have changed. I guess it would have to be the parse trees somehow, given your test case?

I could reproduce the issue from #5458 with en_core_web_md (not with other models) and the latest code from master - that's what the unit test replicates. But yeah, it's very much dependent on the exact POS & dep tags that are being predicted.
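The invariant that unit test checks can be expressed model-free on (start, end) token-index pairs; assert_no_overlap below is a hypothetical helper for illustration, not part of the spaCy test suite. With a model installed one would feed it [(c.start, c.end) for c in doc.noun_chunks].

```python
# Verifies the chunk invariant: in-order spans must never overlap.
# `end` is exclusive, matching spaCy's Span.start / Span.end convention.

def assert_no_overlap(spans):
    """Raise AssertionError if any span starts before the previous one ends."""
    prev_end = 0
    for start, end in sorted(spans):
        assert start >= prev_end, f"span ({start}, {end}) overlaps previous"
        prev_end = end
    return True

assert_no_overlap([(0, 2), (2, 4), (5, 7)])  # adjacent spans are fine
```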

@svlandeg (Member, Author) commented May 21, 2020

> So we only need to test the boundaries of two chunks to ensure they don't overlap. We also know that we're iterating in order. So why can't we just check chunk[0].i > prev[-1].i?

You're right, that's much better. And your code snippet from your other post makes sense. I'll adjust accordingly!

# Conflicts:
#	spacy/lang/el/syntax_iterators.py
#	spacy/lang/en/syntax_iterators.py
#	spacy/lang/fa/syntax_iterators.py
#	spacy/lang/fr/syntax_iterators.py
#	spacy/lang/id/syntax_iterators.py
#	spacy/lang/nb/syntax_iterators.py
#	spacy/lang/sv/syntax_iterators.py
@honnibal honnibal merged commit 5ce02c1 into explosion:master May 21, 2020
@svlandeg svlandeg deleted the bugfix/noun-chunks branch May 21, 2020 21:31
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Sep 22, 2020
Add a similar fix as in explosion#5470 to prevent the German noun chunks iterator
from producing overlapping spans.
honnibal pushed a commit that referenced this pull request Sep 22, 2020
Add a similar fix as in #5470 to prevent the German noun chunks iterator
from producing overlapping spans.