PhraseMatcher does not match LEMMA #4100

fizban99 · 2019-08-09T08:41:19Z

According to the documentation, the PhraseMatcher allows setting a different attribute other than ORTH. It explicitely mentions LOWER, POS and DEP, but leaves the door open for other attributes.
The LEMMA attribute does not seem to work with the English model. This was mentioned in the gitter chat room.

How to reproduce the behaviour

If we take the exact example in the usage guide and just add attr="LEMMA" to the constructor, it does not find any match:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
          u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

we can verify that in this case the lemmas and the text should be the same for the matching terms and I would expect to get the same results with ORTH or with LEMMA in this specific example:

for tok in doc:
    print(tok.lemma_, tok.text)

If instead of the en_core_web_sm model we use an empty English model:

from spacy.lang.en import English
nlp = English()

We get that the terms match everything, which is also incorrect.

Your Environment

spaCy version: 2.1.8
Platform: Linux-4.15.0-52-generic-x86_64-with-debian-buster-sid
Python version: 3.7.3

The text was updated successfully, but these errors were encountered:

svlandeg · 2019-08-09T10:07:36Z

Hi @fizban99, thanks for the report! That does sound suspicious - let me look into it ;-)

adrianeboyd · 2019-08-09T10:34:19Z

The problem is that make_doc() doesn't set lemmas. You can look up the lookup lemmas with token.lemma_ from the returned doc but they're not actually set. It's a pretty confusing situation for users, who can't tell that the lemma isn't set.

I think an error is needed in PhraseMatcher to detect that you have an attribute that isn't set, like it warns that if you're only using ORTH, you don't need the full pipeline, but the reverse.

adrianeboyd · 2019-08-09T10:48:37Z

Sorry, I should have included the solution more explicitly for @fizban99 : use nlp() instead of nlp.make_doc() and it should be able to match lemmas.

svlandeg · 2019-08-09T10:50:04Z

@adrianeboyd (regarding your first comment): it's similar to the solution implemented for NORM attributes in PR #4080. Running token.lemma_ or token.norm_ has a fallback mechanism to return something useful when that field is not set on the underlying Cython structure, but then the results for the matching are confusing.

The other solution is to remedy this in the get_token_attr method. But that is not straightforward because then that function needs access to the vocab

adrianeboyd · 2019-08-09T11:05:22Z

Maybe it would be a good idea to just use nlp() in these examples and only introduce nlp.make_doc() later in the guide? (The problem then is that people who try the example as-is will immediately get the make_doc() warning/suggestion for ORTH.)

Also: do you want users to be choose whether they're matching on fallback lookup lemmas or restrict them to better lemmas when they are available in that model? Someone might want the option for lookup lemmas since it's faster, but it's also not a straightforward situation for a typical user.

ines · 2019-08-09T11:11:25Z

I think we should probably add a "warning"-type infobox mentinoning that if you want to match on token attributes that are set by other pipeline components, you need make sure to also run those components and not just call make_doc.

Maybe it would be a good idea to just use nlp() in these examples and only introduce nlp.make_doc() later in the guide?

The problem is that calling nlp if you're only matching lexical attributes is very inefficient. So for most cases, you don't want to be doing that – and we want to make sure that if users are copy-pasting the example code, they're not copy-pasting a super slow and inefficient solution.

Also: do you want users to be choose whether they're matching on fallback lookup lemmas or restrict them to better lemmas when they are available in that model? Someone might want the option for lookup lemmas since it's faster, but it's also not a straightforward situation for a typical user.

I dont think people should be matching on lookup lemmas only, because that does make things very unpredictable and confusing. Also, I think we've mentioned this before, but lookup lemmas were a mistake 😞 We want to get rid of them entirely for the languages that we have better lemmatization for, and move the lookup lists out of the library once we have #3971 implemented.

adrianeboyd · 2019-08-09T11:17:42Z

You still have the problem that users really really can't tell when lemmas are set. (I couldn't tell until I dug through a lot of code.)

The pipeline documentation isn't much help: https://spacy.io/usage/processing-pipelines

fizban99 · 2019-08-09T19:14:44Z

Thank you all for the clarifications. Really helpful. I agree with Ines that a warning would be nice. Plus update the documentation clarifying that when using other attributes to match (POS, DEP, LEMMA), a statistical model needs to be loaded. This means that those attributes cannot be matched with an empty model such as English() or using make_doc(). This might seem obvious but it was not for me and for others, since I believe most of us are learning lots of new concepts as well as learning spaCy itself, and things not always "click" that fast...

Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses explosion#4070 (also related: explosion#4063, explosion#4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

… skip]

* Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

lock · 2019-09-10T09:42:42Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

svlandeg added feat / matcher Feature: Token, phrase and dependency matcher usage General spaCy usage labels Aug 9, 2019

svlandeg added bug Bugs and behaviour differing from documentation and removed usage General spaCy usage labels Aug 9, 2019

ines added docs Documentation and website usage General spaCy usage and removed bug Bugs and behaviour differing from documentation labels Aug 9, 2019

svlandeg mentioned this issue Aug 10, 2019

Wrong lemmatization using only tokenization (en language) #4102

Closed

ines closed this as completed in 1362f79 Aug 11, 2019

adrianeboyd mentioned this issue Aug 12, 2019

Improve token pattern checking without validation #4105

Merged

3 tasks

polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019

Improve docs on phrase pattern attributes (closes explosion#4100) [ci…

855544b

… skip]

adrianeboyd mentioned this issue Aug 21, 2019

Unexpected result matching on LEMMA with Token matcher #4156

Closed

lock bot locked as resolved and limited conversation to collaborators Sep 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PhraseMatcher does not match LEMMA #4100

PhraseMatcher does not match LEMMA #4100

fizban99 commented Aug 9, 2019

svlandeg commented Aug 9, 2019

adrianeboyd commented Aug 9, 2019

adrianeboyd commented Aug 9, 2019

svlandeg commented Aug 9, 2019 •

edited

Loading

adrianeboyd commented Aug 9, 2019

ines commented Aug 9, 2019

adrianeboyd commented Aug 9, 2019

fizban99 commented Aug 9, 2019

lock bot commented Sep 10, 2019

PhraseMatcher does not match LEMMA #4100

PhraseMatcher does not match LEMMA #4100

Comments

fizban99 commented Aug 9, 2019

How to reproduce the behaviour

Your Environment

svlandeg commented Aug 9, 2019

adrianeboyd commented Aug 9, 2019

adrianeboyd commented Aug 9, 2019

svlandeg commented Aug 9, 2019 • edited Loading

adrianeboyd commented Aug 9, 2019

ines commented Aug 9, 2019

adrianeboyd commented Aug 9, 2019

fizban99 commented Aug 9, 2019

lock bot commented Sep 10, 2019

svlandeg commented Aug 9, 2019 •

edited

Loading