💫 Allow matching non-ORTH attributes in PhraseMatcher #2925

Merged
merged 5 commits into from Nov 15, 2018

ines commented Nov 14, 2018

Related issue: #1971

Description

This PR introduces the ability to match other token attributes in the PhraseMatcher – for example, to find sequences of part-of-speech tags that match an example sentence. On initialization, the attr keyword argument can specify an attribute string name.

matcher = PhraseMatcher(nlp.vocab, attr='POS')
matcher.add('PATTERN', None, nlp('I love cats'))
doc = nlp('yes, you hate dogs')
assert len(matcher(doc)) == 1  # matched 'you hate dogs' based on token.pos

Most useful use case: case-insensitive phrase matches! 🎉

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
matcher.add('PATTERN', None, nlp('iPhone X'))
doc = nlp('i have an iphone x')
assert len(matcher(doc)) == 1  # matched 'iphone x' based on token.lower

It even works for boolean flags:

matcher = PhraseMatcher(nlp.vocab, attr='IS_PUNCT')
matcher.add('PATTERN', None, nlp('hello!'))
doc = nlp('no.')
assert len(matcher(doc)) == 1  # matched 'no.' based on token.is_punct
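
The three examples above share one idea: the pattern and the text are compared by a chosen attribute rather than by surface form. A minimal pure-Python sketch of that idea follows. It does not use spaCy; `make_tokens` and `find_attr_matches` are hypothetical helpers, and tokens are modelled as plain dicts with hand-computed `ORTH`, `LOWER` and `IS_PUNCT` values.

```python
import string
from typing import Dict, List, Tuple

Token = Dict[str, object]  # hypothetical stand-in for a spaCy token


def make_tokens(words: List[str]) -> List[Token]:
    """Build toy tokens carrying a few attributes computed from the text."""
    return [
        {
            "ORTH": w,
            "LOWER": w.lower(),
            # True only for non-empty strings made up entirely of punctuation
            "IS_PUNCT": bool(w) and all(ch in string.punctuation for ch in w),
        }
        for w in words
    ]


def find_attr_matches(pattern: List[Token], doc: List[Token], attr: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans of doc whose attr values equal the pattern's."""
    target = [tok[attr] for tok in pattern]
    values = [tok[attr] for tok in doc]
    n = len(target)
    if n == 0:
        return []
    return [(i, i + n) for i in range(len(values) - n + 1) if values[i:i + n] == target]


pattern = make_tokens(["iPhone", "X"])
doc = make_tokens(["i", "have", "an", "iphone", "x"])
lower_spans = find_attr_matches(pattern, doc, "LOWER")  # case-insensitive comparison
orth_spans = find_attr_matches(pattern, doc, "ORTH")    # exact-text comparison

punct_spans = find_attr_matches(make_tokens(["hello", "!"]), make_tokens(["no", "."]), "IS_PUNCT")
```

Here the `LOWER` comparison finds the span covering 'iphone x' while the `ORTH` comparison finds nothing, and the `IS_PUNCT` comparison matches 'no.' because both sequences are (not punctuation, punctuation).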

Implementation details

Under the hood, the PhraseMatcher now creates a "phantom doc" whose words are the attribute values instead of each token's token.orth. For example, Doc(vocab, words=[token.pos_ for token in doc]).
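
To illustrate why the phantom doc is enough, here is a small pure-Python sketch, not the code from this PR. It assumes tokens are `(text, pos)` tuples with hand-assigned POS tags (no tagger is run), and `phantom_words` and `exact_matches` are hypothetical helpers. Once the words are replaced by their attribute values, an unchanged exact-text matcher does the rest.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (text, pos), tags assigned by hand for illustration


def phantom_words(tokens: List[TaggedToken]) -> List[str]:
    """Replace each token's text with its POS value, like words=[token.pos_ ...]."""
    return [pos for _text, pos in tokens]


def exact_matches(pattern_words: List[str], doc_words: List[str]) -> List[Tuple[int, int]]:
    """Plain exact-text phrase matching over word lists."""
    n = len(pattern_words)
    if n == 0:
        return []
    return [
        (i, i + n)
        for i in range(len(doc_words) - n + 1)
        if doc_words[i:i + n] == pattern_words
    ]


pattern = [("I", "PRON"), ("love", "VERB"), ("cats", "NOUN")]
doc = [("yes", "INTJ"), (",", "PUNCT"), ("you", "PRON"), ("hate", "VERB"), ("dogs", "NOUN")]

# Match the phantom versions, then map the span back onto the original tokens.
spans = exact_matches(phantom_words(pattern), phantom_words(doc))
matched_text = [[text for text, _pos in doc[start:end]] for start, end in spans]
```

Because the phantom doc has one word per original token, the span indices carry over to the original doc unchanged.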

Since matching requires setting flags on the lexemes, we create new tokens for non-ORTH attributes by concatenating the attribute name and value, so that the flags don't pollute existing lexemes. For example, if we match on POS, the phantom doc's tokens become 'matcher:POS-VERB' (a string that's very unlikely to occur as a real token) instead of just 'VERB'. This prevents the lexical entry for 'VERB' from receiving flags that can potentially cause false positives later on.
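
The effect of the prefix can be sketched in a few lines of plain Python. This is an illustration only: the sets below are a stand-in for flags set on shared lexemes, `phantom_key` is a hypothetical helper, and only the 'matcher:POS-VERB' format is taken from the description above.

```python
from typing import List, Set


def phantom_key(attr: str, value: str) -> str:
    """Build the phantom token text, e.g. 'matcher:POS-VERB'."""
    return "matcher:{}-{}".format(attr, value)


pos_values: List[str] = ["PRON", "VERB", "NOUN"]

# Naive approach: the flag lands on the ordinary lexeme 'VERB'.
naive_flagged: Set[str] = set(pos_values)

# Prefixed approach: the flag lands on a string that real text will not contain.
prefixed_flagged: Set[str] = {phantom_key("POS", v) for v in pos_values}

# Ordinary text that happens to contain the literal word 'VERB'.
text_words = ["the", "tag", "VERB", "marks", "actions"]

naive_hits = [w for w in text_words if w in naive_flagged]        # spurious hit
prefixed_hits = [w for w in text_words if w in prefixed_flagged]  # no hit
```

With the naive keys, the literal word 'VERB' in ordinary text looks like part of a pattern; with the prefixed keys it does not.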

Types of change

enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.
Usage: PhraseMatcher(nlp.vocab, attr='POS')
@ines ines changed the title from "Allow matching non-ORTH attributes in PhraseMatcher" to "💫 Allow matching non-ORTH attributes in PhraseMatcher" Nov 14, 2018
@honnibal honnibal merged commit e89708c into develop Nov 15, 2018
4 checks passed
continuous-integration/appveyor/branch AppVeyor build succeeded
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
@ines ines deleted the feature/phrasematcher-attrs branch Nov 15, 2018
2 participants