Case insensitive PhraseMatcher #1579
Comments
Yes, that's definitely a good suggestion! The best way to implement this would be to add an option for setting a different attribute that the `PhraseMatcher` matches on: `matcher = PhraseMatcher(nlp.vocab, attr='LOWER')  # string representing attribute from spacy.attrs`. This would also allow a lot of other cool use cases - for example, you could pass in a …
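To make the suggestion concrete, here is a minimal pure-Python sketch of phrase matching on a configurable attribute. It does not use spaCy and is not how `PhraseMatcher` is implemented. The `ATTRS` table and the `find_phrases` helper are hypothetical names invented for illustration. They only borrow the `ORTH` / `LOWER` naming idea.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical attribute getters (illustration only, not spaCy's attrs).
ATTRS: Dict[str, Callable[[str], str]] = {
    "ORTH": lambda word: word,           # exact surface form
    "LOWER": lambda word: word.lower(),  # lowercased form
}


def find_phrases(
    tokens: Sequence[str],
    phrases: Sequence[Sequence[str]],
    attr: str = "ORTH",
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans whose `attr` values equal a phrase."""
    key = ATTRS[attr]
    # Normalise every phrase once, using the same attribute as the text.
    wanted = {tuple(key(w) for w in phrase) for phrase in phrases}
    lengths = sorted({len(p) for p in wanted})
    keyed = [key(t) for t in tokens]
    matches = []
    for start in range(len(keyed)):
        for n in lengths:
            end = start + n
            if end <= len(keyed) and tuple(keyed[start:end]) in wanted:
                matches.append((start, end))
    return matches


tokens = "I like bill clinton and Bill Clinton".split()
phrases = [["Bill", "Clinton"]]

exact = find_phrases(tokens, phrases)             # [(5, 7)]
lowered = find_phrases(tokens, phrases, "LOWER")  # [(2, 4), (5, 7)]
```

With the default attribute only the capitalised span matches. Switching to the lowercased attribute also picks up `bill clinton`.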
Sat down to try to implement this, and after some thought, there's no way to make it work unfortunately. A quick reminder of how this works (I'd forgotten): We set flags onto the `Lexeme` objects indicating that the word can start, end or continue the phrase. Then we use the matcher over these flag sequences. We access the flag from the token, by fetching it from the token's lexeme. For lower-case matching, we would have to look up the lexeme via the vocab, and check the flag there. I think this would be no more efficient than the …
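The flag mechanism described above can be sketched with a toy model. This is an assumption-laden illustration, not spaCy's code: `ToyLexeme`, `ToyVocab`, `add_phrase` and `flag_sequence` are hypothetical names, and the flags are plain strings rather than real lexeme flags. It shows why a differently cased token misses the flags, and where the extra vocab lookup would come in.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set


@dataclass
class ToyLexeme:
    text: str
    flags: Set[str] = field(default_factory=set)


class ToyVocab:
    """Maps each exact string to one shared lexeme (toy stand-in)."""

    def __init__(self) -> None:
        self._lexemes: Dict[str, ToyLexeme] = {}

    def __getitem__(self, text: str) -> ToyLexeme:
        if text not in self._lexemes:
            self._lexemes[text] = ToyLexeme(text)
        return self._lexemes[text]


def add_phrase(vocab: ToyVocab, words: Sequence[str]) -> None:
    """Flag each word's lexeme as able to start, continue or end a phrase."""
    last = len(words) - 1
    for i, word in enumerate(words):
        lexeme = vocab[word]
        if i == 0:
            lexeme.flags.add("BEGIN")
        if i == last:
            lexeme.flags.add("END")
        if 0 < i < last:
            lexeme.flags.add("INSIDE")


def flag_sequence(vocab: ToyVocab, tokens: Sequence[str]) -> List[List[str]]:
    """Read the flags through each token's own lexeme (exact string)."""
    return [sorted(vocab[token].flags) for token in tokens]


vocab = ToyVocab()
add_phrase(vocab, ["Bill", "Clinton"])

cased = flag_sequence(vocab, ["Bill", "Clinton"])    # flags are found
uncased = flag_sequence(vocab, ["bill", "clinton"])  # different lexemes, no flags

# Lower-case matching needs a second lookup per token: go from the token's
# text to the lexeme of its lowercased form, then read the flags there.
add_phrase(vocab, ["bill", "clinton"])
via_lower = [sorted(vocab[t.lower()].flags) for t in ["Bill", "CLINTON"]]
```

In this toy version the additional lookup is a dictionary access. The comment above is about the equivalent step in the real matcher, where the flag is normally read directly from the token's lexeme.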
Eventually, I added each pattern 4 times to
@eranhirs Yes, this is probably the best solution for now – even if you're adding 4 times as many patterns, it should still be significantly more efficient than adding one single token pattern.
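A rough sketch of that workaround in plain Python. The thread does not say which four casings were added, so the set used here (original, lower, upper, title) is an assumption, and `case_variants` is a hypothetical helper. Each resulting string would then be turned into its own pattern for the matcher.

```python
from typing import List


def case_variants(phrase: str) -> List[str]:
    """Return casing variants of a phrase, without duplicates.

    The choice of variants (original, lower, UPPER, Title) is an
    assumption made for this sketch.
    """
    candidates = [phrase, phrase.lower(), phrase.upper(), phrase.title()]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))


variants = case_variants("Bill Clinton")
# ['Bill Clinton', 'bill clinton', 'BILL CLINTON']
```

Note that this covers only the listed casings. A string such as `bill Clinton` would still go unmatched.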
It would be very helpful to have an option to use `PhraseMatcher` as case insensitive. For example, following on the `PhraseMatcher` example, this doc returns no matches, because of the lower c in clinton.

Your Environment