PhraseMatcher does not match LEMMA #4100
Comments
Hi @fizban99, thanks for the report! That does sound suspicious - let me look into it ;-) |
The problem is that I think an error is needed in PhraseMatcher to detect that you have an attribute that isn't set, like it warns that if you're only using |
Sorry, I should have included the solution more explicitly for @fizban99 : use |
@adrianeboyd (regarding your first comment): it's similar to the solution implemented for The other solution is to remedy this in the |
Maybe it would be a good idea to just use Also: do you want users to be able to choose whether they're matching on fallback lookup lemmas or restrict them to better lemmas when they are available in that model? Someone might want the option for lookup lemmas since it's faster, but it's also not a straightforward situation for a typical user. |
I think we should probably add a "warning"-type infobox mentioning that if you want to match on token attributes that are set by other pipeline components, you need to make sure to also run those components and not just call
The problem is that calling
I don't think people should be matching on lookup lemmas only, because that does make things very unpredictable and confusing. Also, I think we've mentioned this before, but lookup lemmas were a mistake 😞 We want to get rid of them entirely for the languages that we have better lemmatization for, and move the lookup lists out of the library once we have #3971 implemented. |
You still have the problem that users really really can't tell when lemmas are set. (I couldn't tell until I dug through a lot of code.) The pipeline documentation isn't much help: https://spacy.io/usage/processing-pipelines |
Thank you all for the clarifications. Really helpful. I agree with Ines that a warning would be nice. Plus update the documentation clarifying that when using other attributes to match (POS, DEP, LEMMA), a statistical model needs to be loaded. This means that those attributes cannot be matched with an empty model such as |
Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses explosion#4070 (also related: explosion#4063, explosion#4100).
* Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema
* Check whether attribute value types are supported in general (as opposed to per attribute with full validation)
* Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages
* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP
* Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc()
* Support attr=TEXT for PhraseMatcher
* Add NORM to schema
* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
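To illustrate the kind of check the commit message describes (rejecting TAG, POS, LEMMA, or DEP when the component that sets them has not run, and reporting the problem as a ValueError), here is a minimal standalone sketch. It is not spaCy's code: the function name `check_attr_requirements`, the `REQUIRED_COMPONENTS` table, and in particular the assumption that LEMMA depends on the tagger are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from a match attribute to the pipeline component
# assumed to set it. The exact mapping is an assumption for this sketch.
REQUIRED_COMPONENTS: Dict[str, Tuple[str, ...]] = {
    "TAG": ("tagger",),
    "POS": ("tagger",),
    "LEMMA": ("tagger",),
    "DEP": ("parser",),
}


def check_attr_requirements(attr: str, pipe_names: Iterable[str]) -> None:
    """Raise ValueError if `attr` needs a component missing from `pipe_names`."""
    available = set(pipe_names)
    required = REQUIRED_COMPONENTS.get(attr.upper(), ())
    missing = [name for name in required if name not in available]
    if missing:
        raise ValueError(
            f"Matching on {attr.upper()} requires the pipeline component(s) "
            f"{missing}; process patterns and texts with the full pipeline "
            f"rather than with tokenization only."
        )


# ORTH needs no extra component, so an empty pipeline is accepted.
check_attr_requirements("ORTH", [])

# LEMMA with an empty pipeline is rejected with a descriptive error.
try:
    check_attr_requirements("LEMMA", [])
except ValueError as err:
    print(err)
```

Failing early like this turns a silent "no matches" or "everything matches" result into an actionable message, which is the behaviour the thread asks for.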
* Fix typo in rule-based matching docs
* Improve token pattern checking without validation
  Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100).
  * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema
  * Check whether attribute value types are supported in general (as opposed to per attribute with full validation)
  * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages
  * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP
  * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc()
  * Support attr=TEXT for PhraseMatcher
  * Add NORM to schema
  * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
* Remove unnecessary .keys()
* Rephrase error messages
* Add another type check to Matcher
  Add another type check to Matcher for more understandable error messages in some rare cases.
* Support phrase_matcher_attr=TEXT for EntityRuler
* Don't use spacy.errors in examples and bin scripts
* Fix error code
* Auto-format
  Also try get Azure pipelines to finally start a build :(
* Update errors.py
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
According to the documentation, the PhraseMatcher allows matching on an attribute other than ORTH. It explicitly mentions LOWER, POS and DEP, but leaves the door open for other attributes.
The LEMMA attribute does not seem to work with the English model. This was mentioned in the gitter chat room.
How to reproduce the behaviour
If we take the exact example in the usage guide and just add attr="LEMMA" to the constructor, it does not find any match:
We can verify that in this case the lemmas and the text should be the same for the matching terms, and I would expect to get the same results with ORTH or with LEMMA in this specific example:
If instead of the en_core_web_sm model we use an empty English model:
We get that the terms match everything, which is also incorrect.
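One plausible reading of both results, consistent with the comments in this thread but not verified against spaCy's source here, is that the matcher compares integer attribute IDs, and that an attribute no component has set carries the same placeholder ID on every token. The toy model below is a sketch under that assumption. It does not use spaCy; `UNSET`, `lemma_ids`, and `find_matches` are hypothetical names, and lowercasing stands in for real lemmatization.

```python
from typing import Dict, List, Sequence, Tuple

UNSET = 0  # assumed placeholder ID for an attribute that no component has set
_ids: Dict[str, int] = {}  # toy string-to-ID table; real IDs start at 1


def lemma_ids(words: Sequence[str], lemmatized: bool) -> List[int]:
    """Return one ID per token: UNSET everywhere unless a lemmatizer 'ran'."""
    if not lemmatized:
        return [UNSET] * len(words)
    # Lowercasing is only a stand-in for lemmatization in this sketch.
    return [_ids.setdefault(w.lower(), len(_ids) + 1) for w in words]


def find_matches(pattern: List[int], doc: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) spans where the ID sequence of `pattern` occurs in `doc`."""
    n = len(pattern)
    return [(i, i + n) for i in range(len(doc) - n + 1) if doc[i:i + n] == pattern]


text = ["Angela", "Merkel", "met", "Barack", "Obama"]
term = ["Barack", "Obama"]

# Pattern tokenized only, text fully processed: IDs never line up.
no_match = find_matches(lemma_ids(term, False), lemma_ids(text, True))

# Neither side has lemmas set: every two-token window looks identical.
match_all = find_matches(lemma_ids(term, False), lemma_ids(text, False))

# Both sides processed the same way: only the real occurrence matches.
correct = find_matches(lemma_ids(term, True), lemma_ids(text, True))

print(no_match)   # []
print(match_all)  # [(0, 2), (1, 3), (2, 4), (3, 5)]
print(correct)    # [(3, 5)]
```

Under this assumption, the first case corresponds to the report with en_core_web_sm (no match) and the second to the empty English model (everything matches); in both, the remedy suggested in the thread is to create patterns and texts with the same pipeline.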
Your Environment