# Matching and Vocabulary
So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

In this notebook we will identify and label specific phrases that match patterns we define ourselves.

## Rule-based matching
spaCy offers a rule-matching tool called `Matcher` that lets you build a library of token patterns, then match those patterns against a `Doc` object to return a list of found matches. You can match on any part of the token, including its text and annotations, and you can add multiple patterns to the same matcher.

In [4]:
import spacy
import en_core_web_sm
nlp=en_core_web_sm.load()


In [8]:
# import matcher library
from spacy.matcher import Matcher

# create a Matcher object
matcher=Matcher(nlp.vocab,validate=True)

# Create a doc text
doc=nlp(u"The Solar Power industry continues to grow as solarpower increases. Solar-Power is amazing.")


## Creating patterns

In [14]:
# Create three patterns to match the variants in the doc: 'solarpower', 'Solar Power' and 'Solar-Power'
pattern1=[{"LOWER":'solarpower'}]
pattern2=[{"LOWER":'solar'},{"LOWER":"power"}]
pattern3=[{"LOWER":'solar'},{"IS_PUNCT":True},{"LOWER":"power"}]

matcher.add('SolarPower',[pattern1,pattern2,pattern3])

Let's break this down:
* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern2` looks for two adjacent tokens that read 'solar' and 'power' in that order
* `pattern3` looks for three adjacent tokens, with a middle token that can be any punctuation.<font color=green>*</font>

<font color=green>\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>
<br>Once we define our patterns, we pass them into `matcher` under the name 'SolarPower'. The optional `on_match` callback is left unset, so it defaults to `None` (more on callbacks later).
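
As a rough preview of what a callback looks like, the sketch below registers a function through the `on_match` keyword of `Matcher.add`. The names `record_match` and `collected` are made up for illustration, and `spacy.blank("en")` stands in for `en_core_web_sm` only so the snippet runs without a model download. These patterns rely on lexical attributes alone, so the tokenizer is enough.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is sufficient here.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

collected = []  # hypothetical container filled by the callback


def record_match(matcher, doc, i, matches):
    """Called once per match; i indexes the current match in matches."""
    match_id, start, end = matches[i]
    collected.append(doc[start:end].text)


patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
]
matcher.add("SolarPower", patterns, on_match=record_match)

doc = nlp("The Solar Power industry continues to grow as solarpower increases. Solar-Power is amazing.")
matches = matcher(doc)  # running the matcher triggers the callback
print(collected)
```

spaCy passes the callback the matcher itself, the doc, the index of the current match, and the full list of matches, so the function can inspect or record each span as it is found.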

## Applying the matcher to the doc object

In [4]:
doc

The Solar Power industry continues to grow as solarpower increases. Solar-Power is amazing.

In [15]:
find_matches=matcher(doc)

In [16]:
print(find_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]


`matcher` returns a list of tuples. Each tuple contains an ID for the match, plus the start and end token indices that map to the span `doc[start:end]`.

In [17]:
for match_id, start, end in find_matches:
    string_id=nlp.vocab.strings[match_id]
    span=doc[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-Power
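
If the goal is simply to get the matched text, the loop above can be shortened. The sketch below assumes spaCy v3 or later, where calling the matcher with `as_spans=True` returns `Span` objects whose `label_` is the pattern name. It uses `spacy.blank("en")` rather than the notebook's model so that it is self-contained; `results` is an illustrative name.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, since the patterns use lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("SolarPower", [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
])

doc = nlp("The Solar Power industry continues to grow as solarpower increases. Solar-Power is amazing.")

# as_spans=True yields Span objects instead of (match_id, start, end) tuples.
spans = matcher(doc, as_spans=True)
results = [(span.label_, span.start, span.end, span.text) for span in spans]
for row in results:
    print(*row)
```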


In [18]:
matcher.remove("SolarPower")

### Setting pattern options and quantifiers
You can make token rules optional by passing an `'OP':'*'` key, which allows that token to match zero or more times. This lets us streamline our patterns list:

In [19]:
pattern1=[{"LOWER":'solarpower'}]
pattern2=[{"LOWER":'solar'},{"IS_PUNCT":True,"OP":"*"},{"LOWER":'power'}]
matcher.add('SolarPower',[pattern1,pattern2])

In [22]:
doc1=nlp(u"Solar---Power is solarpower")
find_matches=matcher(doc1)

In [23]:
find_matches

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]

In [24]:
for match_id,start,end in find_matches:
    string_id=nlp.vocab.strings[match_id]
    span=doc1[start:end]
    print(match_id,string_id,start,end,span.text)

8656102463236116519 SolarPower 0 3 Solar---Power
8656102463236116519 SolarPower 4 5 solarpower


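This found both forms: 'Solar---Power', where the run of hyphens is a single punctuation token between the two words, and the single token 'solarpower'. Because the punctuation token is optional, the same pattern also covers 'solar power' with no punctuation at all.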

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
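
To see how the quantifiers differ in practice, the sketch below runs the same three-token pattern with `?`, `+` and `*` on one sentence. The helper name `matched_texts` and the sample sentence are invented for illustration, and `spacy.blank("en")` is assumed to be enough because only lexical attributes are involved. The `!` operator is left out of this sketch.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline


def matched_texts(op, text):
    """Return the set of span texts matched when the punctuation token uses `op`."""
    matcher = Matcher(nlp.vocab)
    pattern = [
        {"LOWER": "solar"},
        {"IS_PUNCT": True, "OP": op},
        {"LOWER": "power"},
    ]
    matcher.add("SolarPower", [pattern])
    doc = nlp(text)
    return {doc[start:end].text for _, start, end in matcher(doc)}


text = "solar power, solar - power, solar - - power"

optional = matched_texts("?", text)      # zero or one punctuation token
one_or_more = matched_texts("+", text)   # at least one punctuation token
zero_or_more = matched_texts("*", text)  # any number of punctuation tokens

print(optional)
print(one_or_more)
print(zero_or_more)
```

Here the hyphens are separated by spaces, so each one is its own token; that is what makes the difference between `?` and `*` visible.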


### Be careful with lemmas!
If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to look for the *lemma* of 'powered' and expect it to be 'power'. This is not always the case! The lemma of the *adjective* 'powered' is still 'powered':

In [25]:
matcher.remove('SolarPower')

In [26]:
pattern1=[{"LOWER":"solarpower"}]
pattern2=[{"LOWER":'solar'},{"IS_PUNCT":True,"OP":"*"},{"LEMMA":'power'}]
pattern3=[{"LOWER":'solarpowered'}]
pattern4=[{"LOWER":'solar'},{"IS_PUNCT":True,"OP":"*"},{"LEMMA":'powered'}]

matcher.add('SolarPolar',[pattern1,pattern2,pattern3,pattern4])

In [30]:
doc2=nlp(u"solar-powered energy runs solar-powered cars")

find_matches=matcher(doc2)

In [31]:
find_matches

[(6168938391642580357, 0, 3), (6168938391642580357, 5, 8)]

In [32]:
for match_id, start, end in find_matches:
    string_id=nlp.vocab.strings[match_id]
    span=doc2[start:end]
    print(match_id,string_id,start,end,span.text)

6168938391642580357 SolarPolar 0 3 solar-powered
6168938391642580357 SolarPolar 5 8 solar-powered


In [33]:
matcher.remove("SolarPolar")
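
One way to sidestep the lemmatizer altogether is to list the surface forms explicitly. The sketch below assumes spaCy's extended pattern syntax, in which an attribute value can be a dictionary such as `{"IN": [...]}`. It uses `spacy.blank("en")` and an invented sentence, so it does not depend on what any particular model assigns as the lemma of 'powered'.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: no lemmatizer needed for this approach
matcher = Matcher(nlp.vocab)

# Accept either surface form for the final token instead of relying on LEMMA.
pattern = [
    {"LOWER": "solar"},
    {"IS_PUNCT": True, "OP": "*"},
    {"LOWER": {"IN": ["power", "powered"]}},
]
matcher.add("SolarPower", [pattern])

doc = nlp("Solar power and solar-powered cars")
found = {doc[start:end].text for _, start, end in matcher(doc)}
print(found)
```

The trade-off is that every form has to be enumerated by hand, whereas a lemma rule generalizes whenever the lemmatizer behaves as expected.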

## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
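
As a small illustration of matching on one of these attributes, the sketch below uses `LIKE_NUM` to find a quantity followed by a unit. The sentence and the pattern name `Wattage` are invented for the example, and `spacy.blank("en")` is assumed to be sufficient because `LIKE_NUM` is a lexical attribute that does not need a trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

# A number-like token (digits or a number word) followed by 'watts'.
pattern = [{"LIKE_NUM": True}, {"LOWER": "watts"}]
matcher.add("Wattage", [pattern])

doc = nlp("The array produces 250 watts, or about twenty watts per cell.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```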

___
## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match on terminology lists. In this case we convert each phrase in the list into a `Doc` object and pass those objects into a `PhraseMatcher` instead of writing token patterns.

In [36]:
# Import PhraseMatcher and create an instance
from spacy.matcher import PhraseMatcher
matcher=PhraseMatcher(nlp.vocab)

In [38]:
with open("../TextFiles/reaganomics.txt") as f:
    doc3=nlp(f.read())

In [40]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns=[nlp(text) for text in phrase_list]

# Pass the list of Doc objects into matcher
matcher.add("EconMatcher",phrase_patterns)

found_matches=matcher(doc3)

In [41]:
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 49, 53),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677),
 (3680293220734633682, 2987, 2991)]

In [43]:
for match_id, start, end in found_matches:
    string_id=nlp.vocab.strings[match_id]
    span=doc3[start:end]
    print(match_id,string_id,start,end,span)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2987 2991 trickle-down economics


In [44]:
for match_id, start, end in found_matches:
    string_id=nlp.vocab.strings[match_id]
    # Clamp at 0 so a match near the start of the doc cannot produce a negative index
    span=doc3[max(start-5,0):end+8]
    print(match_id,string_id,start,end,span)

3680293220734633682 EconMatcher 41 45 policies are commonly associated with supply-side economics, referred to as trickle-down economics
3680293220734633682 EconMatcher 49 53 economics, referred to as trickle-down economics or voodoo economics by political opponents, and
3680293220734633682 EconMatcher 54 56 trickle-down economics or voodoo economics by political opponents, and free-market
3680293220734633682 EconMatcher 61 65 by political opponents, and free-market economics by political advocates.

The four pillars
3680293220734633682 EconMatcher 673 677 attracted a following from the supply-side economics movement, which formed in opposition to Keynesian
3680293220734633682 EconMatcher 2987 2991 became widely known as "trickle-down economics", due to the significant cuts in
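
The matches above are case-sensitive, because `PhraseMatcher` compares the verbatim token text by default. The sketch below assumes the `attr` argument of `PhraseMatcher` to compare on `LOWER` instead, and builds the patterns with `nlp.make_doc`, which only runs the tokenizer. The sample sentence is invented for illustration and is not taken from `reaganomics.txt`.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

# attr="LOWER" compares lowercased token text, making the match case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

phrase_list = ["voodoo economics", "free-market economics"]
# make_doc tokenizes without running the rest of the pipeline.
matcher.add("EconMatcher", [nlp.make_doc(text) for text in phrase_list])

doc = nlp("Critics called it Voodoo Economics; advocates preferred Free-Market economics.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```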
