# Using Token Patterns to perform Rule Based Matching

In [1]:
import spacy

nlp = spacy.load(u"en_core_web_sm")

### Rule based matching

- spaCy offers a rule-matching tool called <b>Matcher</b> that lets you build a library of token patterns and then match those patterns against a Doc object to return a list of found matches. The idea is very similar to <b>Regular Expressions</b>.
<br><br>
- It can be seen as a more powerful version of <b>Regular Expressions</b>, because the patterns describe tokens rather than raw characters.
<br><br>
- You can match on any part of a token, including its text and annotations, and you can add multiple patterns to the same matcher.

In [2]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

Here the <b>Matcher</b> object is created with the current vocab object. Named sets of patterns can then be added to it or removed from it as needed.

In [3]:
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]

# Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]

# Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

In [4]:
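# spaCy v2 signature: add(key, on_match, *patterns); None means "no callback".
# In spaCy v3 the patterns are passed as a list instead:
#   matcher.add('SolarPower', [pattern1, pattern2, pattern3])
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)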

In [5]:
doc = nlp(u"The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing!")

In [6]:
found_matches = matcher(doc)

In [7]:
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]


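Each match is a tuple of `(match_id, start, end)`: the hash of the rule name, followed by the start and end token indices of the matched span (the end index is exclusive).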

In [8]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-power
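
For comparison with the regular-expression analogy above, the same three spellings can be found with Python's `re` module. This is only an illustrative sketch: the regex `solar[ -]?power` is an assumption chosen to mirror the three token patterns and is not part of the notebook. It works on characters, so it reports character offsets rather than token indices.

```python
import re

text = ("The Solar Power industry continues to grow as solarpower increases. "
        "Solar-power is amazing!")

# One regex for all three spellings: an optional single space or hyphen
# between "solar" and "power", matched case-insensitively.
solar_re = re.compile(r"solar[ -]?power", re.IGNORECASE)

# Collect (char_start, char_end, matched_text) for every hit.
regex_matches = [(m.start(), m.end(), m.group()) for m in solar_re.finditer(text)]

for char_start, char_end, found in regex_matches:
    print(char_start, char_end, found)
```

The regex has no access to token attributes such as `IS_PUNCT`, which is where the token-based Matcher goes further than plain regular expressions.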


<b>Removing the patterns when they are no longer needed:</b>

In [9]:
matcher.remove('SolarPower')

<b>Creating New Patterns:</b>

In [10]:
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]

# Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True, 'OP':'*'},{'LOWER':'power'}]

# 'OP':'*'      ==> the punctuation token may occur zero or more times between solar & power,
#                   e.g. solar-power, solar--power, and also solar power (no punctuation at all)



# # Solar power
# pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]
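
The effect of `'OP':'*'` can be checked in isolation. The sketch below is a minimal example under a few assumptions: it uses `spacy.blank("en")` (tokenizer only, no trained model) so that it runs without downloading `en_core_web_sm`, it passes the patterns as a list (the form accepted by spaCy v3 and by v2.2.2 or later), and the names `nlp_blank`, `op_matcher`, and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because LOWER and
# IS_PUNCT are lexical attributes that do not need a trained model.
nlp_blank = spacy.blank("en")
op_matcher = Matcher(nlp_blank.vocab)

# "solar", then zero or more punctuation tokens, then "power".
pattern = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}]
op_matcher.add("SolarPower", [pattern])  # list form: spaCy v3 / v2.2.2+

sample_doc = nlp_blank("solar power and solar-power and Solar--power")

# Sort by start index so the spans come out in document order.
op_spans = [sample_doc[start:end].text
            for _, start, end in sorted(op_matcher(sample_doc), key=lambda m: m[1])]
print(op_spans)
```

Because `*` also allows zero punctuation tokens, this single pattern covers the plain two-word spelling as well, which is why a separate `pattern3` is no longer required.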

In [11]:
matcher.add('SolarPower', None, pattern1, pattern2)

In [12]:
doc2 = nlp(u"Solar--power is solarpower yay!!!")

In [13]:
found_matches = matcher(doc2)

In [14]:
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]


In [15]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc2[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 0 3 Solar--power
8656102463236116519 SolarPower 4 5 solarpower


# Using terminology list to perform Rule Based Matching

### PhraseMatcher is a more efficient way to match a list of terms

In [16]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

In [18]:
with open(r'C:\Users\HARDIK\NLP END TO END\NLP_COURSE_HELP\TextFiles\reaganomics.txt') as f:
    doc3 = nlp(f.read())

In [19]:
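# NOTE: 'trikle-down' is a misspelling of 'trickle-down', so "trickle-down economics"
# is not matched in the outputs below even though it occurs in the text.
phrase_list = ['voodoo economics', 'supply-side economics', 'trikle-down economics', 'free-market economics']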

Convert each phrase into a Doc object:

In [20]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [21]:
phrase_patterns

[voodoo economics,
 supply-side economics,
 trikle-down economics,
 free-market economics]

In [22]:
type(phrase_patterns[0])

spacy.tokens.doc.Doc

In [23]:
matcher.add("EconMatcher", None, *phrase_patterns)

This unpacks the list and passes each Doc individually into the matcher as a pattern. The next step is to build the list of matches.

In [24]:
found_matches = matcher(doc3)

In [25]:
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677)]

In [26]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
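
Because `reaganomics.txt` is a local file, the run above cannot be reproduced as-is. The sketch below is a self-contained variant under stated assumptions: the sample sentence is pieced together from the context output printed in this section, `spacy.blank("en")` stands in for the full model, `nlp.make_doc` is used to tokenize the phrases without running a pipeline, and the patterns are passed as a list (spaCy v3 or v2.2.2 and later). The names `nlp_blank`, `phrase_matcher`, and `econ_spans` are illustrative only.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenization alone is sufficient, since PhraseMatcher
# compares token text by default.
nlp_blank = spacy.blank("en")
phrase_matcher = PhraseMatcher(nlp_blank.vocab)

phrases = ["voodoo economics", "supply-side economics",
           "trickle-down economics", "free-market economics"]

# make_doc only tokenizes, which is all a phrase pattern needs.
patterns = [nlp_blank.make_doc(phrase) for phrase in phrases]
phrase_matcher.add("EconMatcher", patterns)  # list form: spaCy v3 / v2.2.2+

sample = ("policies are commonly associated with supply-side economics, "
          "referred to as trickle-down economics or voodoo economics by "
          "political opponents, and free-market economics by political advocates.")
sample_doc = nlp_blank(sample)

# Sort by start index so the spans come out in document order.
econ_spans = [sample_doc[start:end].text
              for _, start, end in sorted(phrase_matcher(sample_doc), key=lambda m: m[1])]
print(econ_spans)
```

The phrase list here spells `trickle-down` correctly, so that phrase is found as well, unlike in the run above.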


In [27]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start-5:end+5]
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 policies are commonly associated with supply-side economics, referred to as trickle
3680293220734633682 EconMatcher 54 56 trickle-down economics or voodoo economics by political opponents, and
3680293220734633682 EconMatcher 61 65 by political opponents, and free-market economics by political advocates.


3680293220734633682 EconMatcher 673 677 attracted a following from the supply-side economics movement, which formed in
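
One caveat with `start-5`: when a match begins within the first five tokens, the start index becomes negative, and Python sequences count negative indices from the end. The sketch below shows the effect and a clamped alternative on a plain list. The helper `context_window` and the hand-written token list are hypothetical, and Doc slices are assumed here to follow the same negative-index convention as lists.

```python
def context_window(tokens, start, end, width=5):
    """Return tokens[start:end] plus up to `width` tokens on each side."""
    left = max(start - width, 0)             # never go below index 0
    right = min(end + width, len(tokens))    # never run past the last token
    return tokens[left:right]

# Hand-written stand-in for the first tokens of a short document.
tokens = ["Solar", "--", "power", "is", "solarpower", "yay"]
start, end = 0, 3  # a match covering "Solar -- power"

# Naive window: start - 5 is -5, which counts from the end of the list,
# so the window begins at index 1 and drops the first matched token.
naive = tokens[start - 5:end + 5]

# Clamped window: keeps the whole match plus the available context.
clamped = context_window(tokens, start, end)

print(naive)
print(clamped)
```

For the matches shown above this makes no difference, since every start index is well past 5; the clamp only matters for matches near the beginning of a document.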
