Rule matcher example doesn't match #3862

dataframing · 2019-06-19T01:07:31Z

Hi,

I noticed that an example for the Matcher doesn't seem to produce what the documentation suggests it should. Specifically, the following code block is expected to invoke the add_event_ent callback once the Google I/O pattern is matched (I've added two inline print statements for clarity):

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)

pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)

doc = nlp(u"This is a text about Google I/O.")
print([t.text for t in doc])
# ['This', 'is', 'a', 'text', 'about', 'Google', 'I', '/', 'O.']

matches = matcher(doc) # <<< Note: this does not invoke `print(entity.text)`
print(matches)
# []

However, it seems like (at least, on en_core_web_sm and en_core_web_md models) the input text is tokenized such that the trailing period is included with the O, which is outside of the rule's pattern.

Assuming it was a typo, I tried removing the period and spaCy raised a Cython error on Span initialization:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-cc1c83c2de13> in <module>
     24 print(f"Tokens: {[t.text for t in doc]}")
     25 
---> 26 matches = matcher(doc)
     27 print(f"Matches: {matches}")

matcher.pyx in spacy.matcher.Matcher.__call__()

<ipython-input-25-cc1c83c2de13> in add_event_ent(matcher, doc, i, matches)
     13     # Append entity to the doc's entity. (Don't overwrite doc.ents!)
     14     match_id, start, end = matches[i]
---> 15     entity = Span(doc, start, end, label="EVENT")
     16     doc.ents += (entity,)
     17     print(entity.text)

span.pyx in spacy.tokens.span.Span.__cinit__()

TypeError: an integer is required

I think it may just be that the EntityRecognizer picks up Google I/O. as a PRODUCT entity type, and the Matcher doesn't allow for overlapping entity types. In a fresh environment:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a text about Google I/O.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Google I/O.', 'PRODUCT')]

I tried looking around previous issues and #3287 seems to be the closest issue, to which this commit aimed to resolve. Sorry if this has already been resolved!

Which page or section is this issue related to?

https://spacy.io/usage/rule-based-matching#on_match

The text was updated successfully, but these errors were encountered:

BreakBB · 2019-06-19T06:15:05Z

I can reproduce your issue with the latest spaCy commit in some way.

With "This is a text about Google I/O." I get "Google I/O." as "ORG", just as you do.
However if I remove the trailing period I get "Google" as "ORG" before running into:

ValueError: [E103] Trying to set conflicting doc.ents: '(5, 6, 'ORG')' and '(5, 9, 'EVENT')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

That confirms your assumption that overlapping entities are not allowed just as the docs you linked:

A very similar logic has been implemented in the built-in EntityRuler by the way. It also takes care of handling overlapping matches, which you would otherwise have to take care of yourself.

I can get the on_match executed when using a blank model:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)


def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)  # prints Google I/O


pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)
doc = nlp(u"This is a text about Google I/O")
matches = matcher(doc)

So the whole issue seems to be related to #3052 with inaccurate models/predictions.

ines · 2019-06-19T08:06:30Z

Thanks, that's a good point! Previous models maybe didn't predict "Google" here, so the problem only started coming up in v2.1.x.

I think we should rewrite the example to use English() instead of the model. This will make the example more predictable, because there won't be any pre-set spans and you can actually see the result of the Matcher properly.

Btw, on a related note: For v2.1.4, I added a util.filter_spans helper to make it easier to filter overlapping spans. Previous versions of spaCy quietly swallowed overlaps, which wasn't good. The new version is stricter and raises the appropriate errors – but it also means that you'll see them and have to fix the underlying problem. We probably shouldn't include this in the example, but it might be useful to add a note that if you use a model, you may end up with overlapping spans, so you need to add some logic to filter them, or use a custom pipeline component that runs before the entity recognizer.

lock · 2019-07-26T13:49:27Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the docs Documentation and website label Jun 19, 2019

ines closed this as completed in d361e38 Jun 26, 2019

lock bot locked as resolved and limited conversation to collaborators Jul 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rule matcher example doesn't match #3862

Rule matcher example doesn't match #3862

dataframing commented Jun 19, 2019 •

edited

BreakBB commented Jun 19, 2019 •

edited

ines commented Jun 19, 2019

lock bot commented Jul 26, 2019

Rule matcher example doesn't match #3862

Rule matcher example doesn't match #3862

Comments

dataframing commented Jun 19, 2019 • edited

Which page or section is this issue related to?

BreakBB commented Jun 19, 2019 • edited

ines commented Jun 19, 2019

lock bot commented Jul 26, 2019

dataframing commented Jun 19, 2019 •

edited

BreakBB commented Jun 19, 2019 •

edited