Case insensitive PhraseMatcher #1579

eranhirs · 2017-11-14T22:23:28Z

It would be very helpful to have an option to make PhraseMatcher case-insensitive.
For example, following on from the PhraseMatcher example, this doc returns no matches because of the lowercase "c" in "clinton".

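from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("Phrase", None, nlp("Hilary Clinton"))
results = matcher(nlp(u"Hilary clinton"))  # no matches: "clinton" is not "Clinton"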

Your Environment

Operating System: Windows 10
Python Version Used: 3.5
spaCy Version Used: 2.0.2
Environment Information:


ines · 2017-11-15T11:53:38Z

Yes, that's definitely a good suggestion! The best way to implement this would be to add an option for setting a different attribute that the PhraseMatcher should match on – for example, LOWER instead of ORTH. The usage could look like this:

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')  # string representing attribute from spacy.attrs

This would also allow a lot of other cool use cases - for example, you could pass in a Doc and match phrases with the same part-of-speech tags or dependency labels. Using it with the SHAPE could be pretty powerful, too. Like, if you're matching phone numbers or something like that, you won't have to come up with complex token patterns. Instead, you simply feed the PhraseMatcher a bunch of examples.

honnibal · 2017-11-23T12:44:18Z

Sat down to try to implement this, and after some thought, there's no way to make it work unfortunately.

The PhraseMatcher relies on the fact that all tokens of the same type refer back to the same lexeme data. The types are indexed by the ORTH key, so that's the only key we can phrase-match over.

A quick reminder of how this works (I'd forgotten): We set flags onto the Lexeme objects indicating that the word can start, end or continue the phrase. Then we use the matcher over these flag sequences. We access the flag from the token, by fetching it from the token's lexeme.

For lower-case matching, we would have to look up the lexeme via the vocab, and check the flag there. I think this would be no more efficient than the Matcher.
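To make the flag mechanism described above a bit more concrete, here is a toy model in plain Python. It is not spaCy's implementation: the dict keyed by exact text is only a stand-in for lexemes indexed by ORTH, and the names `flag_phrase` and `find_flag_sequences` are made up for illustration.

```python
# Toy model only: a dict keyed by the exact token text stands in for the
# vocab's lexemes (which are indexed by ORTH). Not spaCy's actual code.
START, CONTINUE, END = "start", "continue", "end"


def flag_phrase(lexeme_flags, words):
    """Mark each word of a phrase as able to start, continue or end a phrase."""
    last = len(words) - 1
    for i, word in enumerate(words):
        flags = lexeme_flags.setdefault(word, set())
        if i == 0:
            flags.add(START)
        if i == last:
            flags.add(END)
        if 0 < i < last:
            flags.add(CONTINUE)


def find_flag_sequences(lexeme_flags, words):
    """Return (start, end) spans whose flags read START, CONTINUE*, END."""
    matches = []
    for start, word in enumerate(words):
        first_flags = lexeme_flags.get(word, set())
        if START not in first_flags:
            continue
        if END in first_flags:
            matches.append((start, start + 1))
        for end in range(start + 1, len(words)):
            flags = lexeme_flags.get(words[end], set())
            if END in flags:
                matches.append((start, end + 1))
            if CONTINUE not in flags:
                break
    return matches


lexeme_flags = {}
flag_phrase(lexeme_flags, ["Hilary", "Clinton"])

exact = find_flag_sequences(lexeme_flags, ["Hilary", "Clinton"])
lowered = find_flag_sequences(lexeme_flags, ["Hilary", "clinton"])
print(exact, lowered)
```

In this sketch "clinton" is a different key from "Clinton", so it carries no flags and the sequence never completes. Matching on the lowercase form would need an extra lookup per token to find the entry that does carry the flags, which is the cost described above.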

eranhirs · 2017-11-23T15:58:18Z

Eventually, I added each pattern 4 times to PhraseMatcher:

Without change
Lower
Upper
Title
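A minimal sketch of this workaround, for anyone who wants to copy it. The helper names `case_variants` and `add_case_variants` are hypothetical, and the `matcher.add(key, None, *docs)` call follows the spaCy 2.0 signature used earlier in this thread; `matcher` and `nlp` are assumed to be an existing PhraseMatcher and pipeline.

```python
def case_variants(phrase):
    """Return the phrase as written, plus lower, upper and title case.

    Duplicates are dropped while keeping the original order.
    """
    variants = []
    for candidate in (phrase, phrase.lower(), phrase.upper(), phrase.title()):
        if candidate not in variants:
            variants.append(candidate)
    return variants


def add_case_variants(matcher, nlp, key, phrases):
    """Add every casing variant of every phrase to an existing PhraseMatcher.

    Assumes the spaCy 2.0 call shape from this thread:
    matcher.add(key, on_match, *docs). Returns the number of patterns added.
    """
    docs = [nlp(variant) for phrase in phrases for variant in case_variants(phrase)]
    matcher.add(key, None, *docs)
    return len(docs)


print(case_variants("Hilary Clinton"))
```

Note that this only covers the four casings listed above, so something like "hilary Clinton" would still be missed, and `str.title()` capitalizes after apostrophes and hyphens, which may not be what you want for every name.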

ines · 2017-11-26T16:43:21Z

@eranhirs Yes, this is probably the best solution for now – even if you're adding 4 times as many patterns, it should still be significantly more efficient than adding one single token pattern.
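For comparison, the single token pattern alternative could be built roughly like this. It is a sketch: `lower_token_pattern` is a hypothetical helper, and it assumes that splitting on whitespace gives the same tokens the tokenizer would produce, which will not hold for phrases containing punctuation.

```python
def lower_token_pattern(phrase):
    """Build a Matcher-style token pattern that compares each token's LOWER attribute."""
    # Assumption: whitespace splitting mirrors the tokenizer for this phrase.
    return [{"LOWER": word.lower()} for word in phrase.split()]


pattern = lower_token_pattern("Hilary Clinton")
print(pattern)
```

With spaCy 2.0, as used in this thread, the pattern would then go to the rule-based Matcher rather than the PhraseMatcher, along the lines of `Matcher(nlp.vocab).add("Phrase", None, pattern)`.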

lock · 2018-05-08T06:55:21Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "enhancement" (Feature requests and improvements) and "help wanted" (Contributions welcome!) labels on Nov 15, 2017

ines closed this as completed Nov 26, 2017

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
