Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rule matcher example doesn't match #3862

Closed
dataframing opened this issue Jun 19, 2019 · 3 comments
Closed

Rule matcher example doesn't match #3862

dataframing opened this issue Jun 19, 2019 · 3 comments
Labels
docs Documentation and website

Comments

@dataframing
Copy link

dataframing commented Jun 19, 2019

Hi,

I noticed that an example for the Matcher doesn't seem to produce what the documentation suggests it should. Specifically, the following code block is expected to invoke the add_event_ent callback once the Google I/O pattern is matched (I've added two inline print statements for clarity):

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)

pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)

doc = nlp(u"This is a text about Google I/O.")
print([t.text for t in doc])
# ['This', 'is', 'a', 'text', 'about', 'Google', 'I', '/', 'O.']

matches = matcher(doc) # <<< Note: this does not invoke `print(entity.text)`
print(matches)
# []

However, it seems like (at least, on en_core_web_sm and en_core_web_md models) the input text is tokenized such that the trailing period is included with the O, which is outside of the rule's pattern.

Assuming it was a typo, I tried removing the period and spaCy raised a Cython error on Span initialization:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-cc1c83c2de13> in <module>
     24 print(f"Tokens: {[t.text for t in doc]}")
     25 
---> 26 matches = matcher(doc)
     27 print(f"Matches: {matches}")

matcher.pyx in spacy.matcher.Matcher.__call__()

<ipython-input-25-cc1c83c2de13> in add_event_ent(matcher, doc, i, matches)
     13     # Append entity to the doc's entity. (Don't overwrite doc.ents!)
     14     match_id, start, end = matches[i]
---> 15     entity = Span(doc, start, end, label="EVENT")
     16     doc.ents += (entity,)
     17     print(entity.text)

span.pyx in spacy.tokens.span.Span.__cinit__()

TypeError: an integer is required

I think it may just be that the EntityRecognizer picks up Google I/O. as a PRODUCT entity type, and the Matcher doesn't allow for overlapping entity types. In a fresh environment:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a text about Google I/O.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Google I/O.', 'PRODUCT')]

I tried looking around previous issues and #3287 seems to be the closest issue, to which this commit aimed to resolve. Sorry if this has already been resolved!

Which page or section is this issue related to?

https://spacy.io/usage/rule-based-matching#on_match

@BreakBB
Copy link
Contributor

BreakBB commented Jun 19, 2019

I can reproduce your issue with the latest spaCy commit in some way.

With "This is a text about Google I/O." I get "Google I/O." as "ORG", just as you do.
However if I remove the trailing period I get "Google" as "ORG" before running into:

ValueError: [E103] Trying to set conflicting doc.ents: '(5, 6, 'ORG')' and '(5, 9, 'EVENT')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

That confirms your assumption that overlapping entities are not allowed just as the docs you linked:

A very similar logic has been implemented in the built-in EntityRuler by the way. It also takes care of handling overlapping matches, which you would otherwise have to take care of yourself.

I can get the on_match executed when using a blank model:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)


def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)  # prints Google I/O


pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)
doc = nlp(u"This is a text about Google I/O")
matches = matcher(doc)

So the whole issue seems to be related to #3052 with inaccurate models/predictions.

@ines ines added the docs Documentation and website label Jun 19, 2019
@ines
Copy link
Member

ines commented Jun 19, 2019

Thanks, that's a good point! Previous models maybe didn't predict "Google" here, so the problem only started coming up in v2.1.x.

I think we should rewrite the example to use English() instead of the model. This will make the example more predictable, because there won't be any pre-set spans and you can actually see the result of the Matcher properly.

Btw, on a related note: For v2.1.4, I added a util.filter_spans helper to make it easier to filter overlapping spans. Previous versions of spaCy quietly swallowed overlaps, which wasn't good. The new version is stricter and raises the appropriate errors – but it also means that you'll see them and have to fix the underlying problem. We probably shouldn't include this in the example, but it might be useful to add a note that if you use a model, you may end up with overlapping spans, so you need to add some logic to filter them, or use a custom pipeline component that runs before the entity recognizer.

@ines ines closed this as completed in d361e38 Jun 26, 2019
@lock
Copy link

lock bot commented Jul 26, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jul 26, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
docs Documentation and website
Projects
None yet
Development

No branches or pull requests

3 participants