Case insensitive PhraseMatcher #1579

Closed
eranhirs opened this issue Nov 14, 2017 · 5 comments
Labels
enhancement Feature requests and improvements help wanted Contributions welcome!

Comments

@eranhirs

eranhirs commented Nov 14, 2017

It would be very helpful to have an option to make the PhraseMatcher case-insensitive.
For example, following on from the PhraseMatcher example, this doc returns no matches because of the lowercase c in clinton.

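from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("Phrase", None, nlp(u"Hilary Clinton"))  # spaCy 2.x signature: add(key, on_match, *docs)
results = matcher(nlp(u"Hilary clinton"))  # no matches: "clinton" differs in case from "Clinton"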

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.5
  • spaCy Version Used: 2.0.2
  • Environment Information:
@ines ines added enhancement Feature requests and improvements help wanted Contributions welcome! labels Nov 15, 2017
@ines
Member

ines commented Nov 15, 2017

Yes, that's definitely a good suggestion! The best way to implement this would be to add an option for setting a different attribute that the PhraseMatcher should match on – for example, LOWER instead of ORTH. The usage could look like this:

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')  # string representing attribute from spacy.attrs

This would also allow a lot of other cool use cases - for example, you could pass in a Doc and match phrases with the same part-of-speech tags or dependency labels. Using it with the SHAPE could be pretty powerful, too. Like, if you're matching phone numbers or something like that, you won't have to come up with complex token patterns. Instead, you simply feed the PhraseMatcher a bunch of examples.
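To make the SHAPE idea concrete, here is a rough pure-Python approximation of a word-shape function. It is a sketch for illustration only and is not taken from spaCy's source; the name `word_shape` and the cap of four repeated characters are assumptions made here. The point is that two different phone numbers collapse to the same shape, so one example could stand in for many.

```python
def word_shape(text):
    """Approximate a token's shape: letters -> x/X, digits -> d, other chars kept.

    Assumption for this sketch: runs of the same shape character are
    truncated after four repeats, so long words share one shape.
    """
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch
        # Count how many times in a row this shape character has appeared.
        run = run + 1 if s == last else 1
        last = s
        if run <= 4:
            shape.append(s)
    return "".join(shape)


if __name__ == "__main__":
    # Two different numbers, one shared shape.
    print(word_shape("555-0199"), word_shape("867-5309"))
    print(word_shape("Clinton"))
```

With shape-based phrase matching, a handful of example numbers would cover every number written in the same layout.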

@honnibal
Member

I sat down to try to implement this, and after some thought, there's unfortunately no way to make it work.

The PhraseMatcher relies on the fact that all tokens of the same type refer back to the same lexeme data. The types are indexed by the ORTH key, so that's the only key we can phrase-match over.

A quick reminder of how this works (I'd forgotten): We set flags onto the Lexeme objects indicating that the word can start, end or continue the phrase. Then we use the matcher over these flag sequences. We access the flag from the token, by fetching it from the token's lexeme.

For lower-case matching, we would have to look up the lexeme via the vocab, and check the flag there. I think this would be no more efficient than the Matcher.
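The mechanism described above can be illustrated with a toy model. This is a simplified sketch in plain Python, not spaCy's actual implementation: `ToyVocab`, `add_phrase` and `find_candidates` are hypothetical names, and the flags are plain strings. It shows why a lowercased token finds no flags: it resolves to a different lexeme entry than the capitalized one.

```python
class ToyVocab:
    """Maps each exact surface string (the ORTH) to one shared lexeme record."""

    def __init__(self):
        self.lexemes = {}

    def lexeme(self, orth):
        # Same string -> same record, so a flag set once is visible
        # from every token of that type.
        return self.lexemes.setdefault(orth, {"orth": orth, "flags": set()})


def add_phrase(vocab, words):
    """Flag each word's lexeme as able to start, continue or end a phrase."""
    last = len(words) - 1
    for i, word in enumerate(words):
        flags = vocab.lexeme(word)["flags"]
        if i == 0:
            flags.add("START")
        if i == last:
            flags.add("END")
        if 0 < i < last:
            flags.add("CONTINUE")


def find_candidates(vocab, tokens):
    """Return (start, end) spans whose flag sequence looks like a phrase."""
    flags = [vocab.lexeme(t)["flags"] for t in tokens]
    spans = []
    for start in range(len(tokens)):
        if "START" not in flags[start]:
            continue
        end = start
        while end < len(tokens):
            if "END" in flags[end]:
                spans.append((start, end + 1))
            # A token after the first must be able to continue the phrase.
            if end > start and "CONTINUE" not in flags[end]:
                break
            end += 1
    return spans


if __name__ == "__main__":
    vocab = ToyVocab()
    add_phrase(vocab, ["Hilary", "Clinton"])
    print(find_candidates(vocab, ["Hilary", "Clinton"]))  # exact case
    print(find_candidates(vocab, ["Hilary", "clinton"]))  # lowercase c
```

In this model, "clinton" gets its own lexeme record with no flags, so the flag sequence never completes. Making it case-insensitive would require an extra lookup of the lowercased form for every token, which is the cost described above.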

@eranhirs
Author

eranhirs commented Nov 23, 2017

In the end, I added each pattern to the PhraseMatcher four times, once per case variant:

  • Without change
  • Lower
  • Upper
  • Title
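A minimal sketch of this workaround, assuming a hypothetical helper named `case_variants` (it is not part of spaCy). It builds the four variants listed above and drops duplicates, since a phrase that is already title-cased equals its own title form.

```python
def case_variants(phrase):
    """Return the unchanged, lower, upper and title forms, without duplicates."""
    variants = [phrase, phrase.lower(), phrase.upper(), phrase.title()]
    # dict.fromkeys keeps the first occurrence of each variant, in order.
    return list(dict.fromkeys(variants))


if __name__ == "__main__":
    for variant in case_variants("Hilary Clinton"):
        print(variant)
```

With the spaCy 2.x signature used earlier in this thread, the variants could then be registered with something like `matcher.add("Phrase", None, *[nlp(v) for v in case_variants(text)])`. Note that this covers only these forms: a mixed-case spelling such as "hilary Clinton" would still not match.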

@ines
Member

ines commented Nov 26, 2017

@eranhirs Yes, this is probably the best solution for now – even if you're adding 4 times as many patterns, it should still be significantly more efficient than adding a single token pattern to the Matcher.

@ines ines closed this as completed Nov 26, 2017
@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018