
PhraseMatcher does not match LEMMA #4100

Closed
fizban99 opened this issue Aug 9, 2019 · 9 comments
Labels: docs (Documentation and website), feat / matcher (Feature: Token, phrase and dependency matcher), usage (General spaCy usage)

Comments

@fizban99
Contributor

fizban99 commented Aug 9, 2019

According to the documentation, the PhraseMatcher allows matching on an attribute other than ORTH. It explicitly mentions LOWER, POS and DEP, but leaves the door open for other attributes.
The LEMMA attribute does not seem to work with the English model. This was mentioned in the gitter chat room.

How to reproduce the behaviour

If we take the exact example from the usage guide and just add attr="LEMMA" to the constructor, it finds no matches:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
          u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

We can verify that in this case the lemmas and the text are the same for the matching terms, so I would expect the same results with ORTH or with LEMMA in this specific example:

for tok in doc:
    print(tok.lemma_, tok.text)

If instead of the en_core_web_sm model we use an empty English model:

from spacy.lang.en import English
nlp = English()

We find that the terms match everything, which is also incorrect.
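
A minimal self-contained sketch of this second case (spaCy 2.1.x API as in the environment below; the doc text is arbitrary):

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

# Blank model: tokenizer only, no tagger, so lemmas are never set
nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
matcher.add("TerminologyList", None, nlp.make_doc("Barack Obama"))

doc = nlp("The cat sat on the mat.")
# Spurious matches even though the term never occurs: with no lemmas set,
# matching on LEMMA degenerates into matching on unset values
print(matcher(doc))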

Your Environment

  • spaCy version: 2.1.8
  • Platform: Linux-4.15.0-52-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.3
@svlandeg added the feat / matcher and usage labels Aug 9, 2019
@svlandeg
Member

svlandeg commented Aug 9, 2019

Hi @fizban99, thanks for the report! That does sound suspicious - let me look into it ;-)

@svlandeg added the bug label and removed the usage label Aug 9, 2019
@adrianeboyd
Contributor

The problem is that make_doc() doesn't set lemmas. token.lemma_ on the returned doc still returns lookup lemmas, but they're not actually set on the tokens. It's a pretty confusing situation for users, who can't tell that the lemma isn't set.

I think PhraseMatcher needs an error that detects when you're matching on an attribute that isn't set, the reverse of the existing warning that you don't need the full pipeline if you're only using ORTH.
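
A rough sketch of the kind of check being suggested (hypothetical helper, not spaCy's actual code; the attribute-to-component mapping follows the fix commit quoted further down):

# Hypothetical sketch of the proposed check, not spaCy's actual code:
# refuse attributes that depend on pipeline components the nlp object
# doesn't have, mirroring the existing ORTH-only warning in reverse.
REQUIRED_PIPES = {
    "LEMMA": ("tagger",),
    "POS": ("tagger",),
    "TAG": ("tagger",),
    "DEP": ("parser",),
}

def check_phrasematcher_attr(nlp, attr):
    for pipe in REQUIRED_PIPES.get(attr, ()):
        if pipe not in nlp.pipe_names:
            raise ValueError(
                "attr=%s requires a '%s' in the pipeline, but got: %s"
                % (attr, pipe, nlp.pipe_names)
            )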

@adrianeboyd
Contributor

Sorry, I should have included the solution more explicitly for @fizban99: use nlp() instead of nlp.make_doc() and it should be able to match lemmas.
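
Applied to the example above, it's a one-line change (sketch; this runs the full pipeline over each term, so it's slower than make_doc):

# Run the full pipeline on the pattern texts so lemmas are actually set,
# instead of nlp.make_doc(), which only tokenizes
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)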

@svlandeg
Member

svlandeg commented Aug 9, 2019

@adrianeboyd (regarding your first comment): it's similar to the solution implemented for NORM attributes in PR #4080. Running token.lemma_ or token.norm_ has a fallback mechanism to return something useful when that field is not set on the underlying Cython structure, but then the results for the matching are confusing.

The other solution is to remedy this in the get_token_attr method. But that is not straightforward, because that function would then need access to the vocab.

@adrianeboyd
Contributor

Maybe it would be a good idea to just use nlp() in these examples and only introduce nlp.make_doc() later in the guide? (The problem then is that people who try the example as-is will immediately get the make_doc() warning/suggestion for ORTH.)

Also: do you want users to be able to choose whether they're matching on fallback lookup lemmas, or restrict them to better lemmas when those are available in the model? Someone might want the option of lookup lemmas since it's faster, but it's also not a straightforward situation for a typical user.

@ines
Member

ines commented Aug 9, 2019

I think we should probably add a "warning"-type infobox mentioning that if you want to match on token attributes that are set by other pipeline components, you need to make sure to also run those components and not just call make_doc.

> Maybe it would be a good idea to just use nlp() in these examples and only introduce nlp.make_doc() later in the guide?

The problem is that calling nlp if you're only matching lexical attributes is very inefficient. So for most cases, you don't want to be doing that – and we want to make sure that if users are copy-pasting the example code, they're not copy-pasting a super slow and inefficient solution.
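
The trade-off in sketch form:

# Lexical attributes (ORTH, LOWER): the tokenizer alone is enough
patterns = [nlp.make_doc(text) for text in terms]  # fast: tokenizer only

# Annotation-based attributes (LEMMA, POS, DEP): the components that
# set them have to run over the pattern texts too
patterns = [nlp(text) for text in terms]  # slower: tagger, parser, NER all run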

> Also: do you want users to be able to choose whether they're matching on fallback lookup lemmas, or restrict them to better lemmas when those are available in the model? Someone might want the option of lookup lemmas since it's faster, but it's also not a straightforward situation for a typical user.

I don't think people should be matching on lookup lemmas only, because that does make things very unpredictable and confusing. Also, I think we've mentioned this before, but lookup lemmas were a mistake 😞 We want to get rid of them entirely for the languages that we have better lemmatization for, and move the lookup lists out of the library once we have #3971 implemented.

@ines added the docs and usage labels and removed the bug label Aug 9, 2019
@adrianeboyd
Contributor

You still have the problem that users really really can't tell when lemmas are set. (I couldn't tell until I dug through a lot of code.)

The pipeline documentation isn't much help: https://spacy.io/usage/processing-pipelines
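
One signal that does exist in spaCy 2.x is Doc.is_tagged, though nothing points users to it for this purpose (sketch, assuming the en_core_web_sm pipeline from above):

# Doc.is_tagged reports whether the tagger has run over the doc
doc = nlp.make_doc("Angela Merkel")
print(doc.is_tagged, doc[0].lemma_)  # False, but lemma_ still returns a lookup value

doc = nlp("Angela Merkel")
print(doc.is_tagged, doc[0].lemma_)  # True: lemmas are actually set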

@fizban99
Contributor Author

fizban99 commented Aug 9, 2019

Thank you all for the clarifications. Really helpful. I agree with Ines that a warning would be nice, plus a documentation update clarifying that matching on other attributes (POS, DEP, LEMMA) requires a loaded statistical model. This means those attributes cannot be matched with an empty model such as English(), or when using make_doc(). This might seem obvious, but it was not for me and for others: I believe most of us are learning lots of new concepts as well as learning spaCy itself, and things don't always "click" that fast...

@ines ines closed this as completed in 1362f79 Aug 11, 2019
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this issue Aug 12, 2019
polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019
ines added a commit that referenced this issue Aug 21, 2019
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>