## Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that lets you build a library of token patterns and match them against a `Doc` object, returning a list of the matches found. A pattern can match on any token attribute, including the text and annotations, and you can add multiple patterns to the same matcher.

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
from spacy.matcher import Matcher

In [4]:
matcher = Matcher(nlp.vocab)

In [5]:
# SolarPower
pattern1 = [{'LOWER': 'solarpower'}]

# Solar-power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# Solar power
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

In [6]:
matcher.add('SolarPowerMatcher', [pattern1, pattern2, pattern3])

In [7]:
doc = nlp("The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing.")

In [8]:
found_matches = matcher(doc)

In [9]:
print(found_matches)  # each match is a (match_id, start, end) tuple

[(6604624467252227415, 1, 3), (6604624467252227415, 8, 9), (6604624467252227415, 11, 14)]


In [10]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]   # get string representation
    span = doc[start:end]                     # get the matched span
    print(match_id, string_id, start, end, span.text)

6604624467252227415 SolarPowerMatcher 1 3 Solar Power
6604624467252227415 SolarPowerMatcher 8 9 solarpower
6604624467252227415 SolarPowerMatcher 11 14 Solar-power
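
As a possible shortcut for the loop above, spaCy v3 (which the list-based `matcher.add` call used here already implies) can return `Span` objects directly through the `as_spans` argument. The sketch below is an illustration rather than part of the original notebook. It assumes a blank English pipeline in place of `en_core_web_sm`, on the grounds that these patterns only use tokenizer-level attributes.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, because LOWER and IS_PUNCT
# are lexical attributes that do not need a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("SolarPowerMatcher", [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
])

doc = nlp("The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing.")

# as_spans=True yields Span objects; span.label_ is the key passed to add()
spans = sorted(matcher(doc, as_spans=True), key=lambda s: s.start)
for span in spans:
    print(span.label_, span.start, span.end, span.text)

# The integer match_id is the hash of that key in the shared StringStore,
# so the lookup works in both directions.
match_id = nlp.vocab.strings["SolarPowerMatcher"]
print(match_id, nlp.vocab.strings[match_id])
```

Working with spans avoids slicing the `Doc` by hand, while the `(match_id, start, end)` tuples remain useful when only the token offsets are needed.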


In [11]:
matcher.remove('SolarPowerMatcher')

In [12]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
# 'OP': '*' lets the punctuation token match zero or more times
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]

In [13]:
matcher.add('SolarPowerMatcher', [pattern1, pattern2])

In [14]:
doc = nlp("Solar---power is solarpower!")

In [15]:
found_matches = matcher(doc)

In [16]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]   # get string representation
    span = doc[start:end]                     # get the matched span
    print(match_id, string_id, start, end, span.text)

6604624467252227415 SolarPowerMatcher 0 3 Solar---power
6604624467252227415 SolarPowerMatcher 4 5 solarpower
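
The `'OP'` key is a quantifier on a single token pattern. Besides `'*'` (zero or more), spaCy also accepts `'?'` (zero or one), `'+'` (one or more) and `'!'` (negation, match zero times). The following sketch, which is an illustration and not part of the original notebook, uses `'?'` so that one pattern covers both the spaced and the hyphenated form. The pattern name `SolarPower`, the sample sentence, and the use of a blank pipeline are assumptions made for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer only, no trained model needed
matcher = Matcher(nlp.vocab)

# 'OP': '?' makes the punctuation token optional (zero or one occurrence),
# so this single pattern covers "solar power" and "solar-power".
pattern = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "power"}]
matcher.add("SolarPower", [pattern])

doc = nlp("Solar power and solar-power are the same thing.")

# Collect (start, end, text) for every match, ordered by position
matched = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
print(matched)
```

Choosing `'?'` over `'*'` keeps the pattern strict: a run of several punctuation tokens between the two words would no longer match.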


___
## PhraseMatcher
An alternative to token patterns for rule-based matching, and often a more efficient one, is to match on _terminology lists_. In this case we convert each phrase in a list into a `Doc` object and pass those `Doc` objects to a `PhraseMatcher` as its patterns.

In [17]:
# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

In [18]:
# Import the PhraseMatcher class
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [19]:
with open('../text-files/reaganomics.txt', encoding='unicode_escape') as f:
    doc3 = nlp(f.read())

In [20]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass the list of Doc objects to the matcher (same list-based call as Matcher.add above):
matcher.add('VoodooEconomics', phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

In [21]:
# (match_id, start, end)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2986, 2990)]

<font color=green>The first four matches are where these terms are used in the definition of Reaganomics:</font>

In [22]:
doc3[:70]

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.
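
Because the cells above depend on a local text file, here is a self-contained sketch of the same idea that can be run anywhere. It is an illustration under stated assumptions: the sample sentence is made up, a blank English pipeline stands in for `en_core_web_sm`, and `attr="LOWER"` is added to show case-insensitive matching, which the original cells do not use.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: only the tokenizer is needed

# attr="LOWER" compares lowercase token forms, so matching ignores case
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

phrase_list = ["voodoo economics", "supply-side economics",
               "trickle-down economics", "free-market economics"]

# make_doc runs only the tokenizer, which is all a phrase pattern requires
phrase_patterns = [nlp.make_doc(text) for text in phrase_list]
matcher.add("VoodooEconomics", phrase_patterns)

# Hypothetical stand-in for the Reaganomics article
doc = nlp("Critics called it Voodoo Economics; advocates preferred free-market economics.")

# Order matches by start offset and pull out the matched text
found = [doc[start:end].text
         for _, start, end in sorted(matcher(doc), key=lambda m: m[1])]
print(found)
```

Using `nlp.make_doc` instead of `nlp(text)` skips the rest of the pipeline when building patterns, which can matter for long terminology lists.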


## Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:

In [23]:
doc3[665:685]  # Note that the fifth match starts at doc3[673]

same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian

In [24]:
doc3[2975:2995]  # The sixth match starts at doc3[2986]

lawsuits against institutions.[66] His policies became widely known as "trickle-down economics", due to the
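
The slices above hard-code their offsets. A small helper can compute the window from the match itself and keep it inside the document bounds. The function name `context_window` and its `width` parameter are hypothetical, and the demo runs on a plain list of token strings. A spaCy `Doc` supports `len()` and slicing in the same way, so the helper should accept one unchanged.

```python
def context_window(tokens, start, end, width=8):
    """Return tokens[start:end] widened by `width` items on each side.

    The bounds are clamped so that a match near the beginning does not
    produce a negative start index, which Python would count from the end.
    """
    left = max(0, start - width)
    right = min(len(tokens), end + width)
    return tokens[left:right]


# Hypothetical token list standing in for a Doc
tokens = ("he attracted a following from the supply - side "
          "economics movement , which formed").split()

# Pretend a match covers tokens 6..10 ("supply - side economics")
window = context_window(tokens, 6, 10, width=2)
print(" ".join(window))

# A match at the very start is clamped instead of wrapping around
near_start = context_window(tokens, 0, 2, width=5)
print(" ".join(near_start))
```

With real matches the call would look like `context_window(doc3, start, end)` for each `(match_id, start, end)` tuple.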

Another way is to use the sentence boundaries that the loaded pipeline has already set on the Doc (available through `doc3.sents`), then iterate through the sentences to the match point:

In [25]:
# Build a list of sentences
sents = list(doc3.sents)

# sentences contain start and end token values:
print(sents[0].start, sents[0].end)

0 35


In [26]:
# Iterate over the sentence list until the sentence end value exceeds a match start value:
for sent in sents:
    if matches[4][1] < sent.end:  # this is the fifth match, which starts at doc3[673]
        print(sent)
        break

At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-stimulus economics.
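
A shorter route to the same result is the `sent` property of a `Span`, which returns the sentence containing that span once sentence boundaries are set. The sketch below is an illustration with a made-up text. It assumes a blank English pipeline with spaCy's rule-based `sentencizer` component added, whereas `en_core_web_sm` sets sentence boundaries itself.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitting on punctuation

matcher = PhraseMatcher(nlp.vocab)
matcher.add("VoodooEconomics", [nlp.make_doc("voodoo economics")])

# Hypothetical sample text
doc = nlp("Reagan cut taxes. Critics called it voodoo economics. Others disagreed.")

# Span.sent is the sentence that contains the matched span
sentences = [doc[start:end].sent.text for _, start, end in matcher(doc)]
for sentence in sentences:
    print(sentence)
```

This avoids scanning the sentence list for every match, which may be preferable when a long document contains many matches.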
