## Rule-based matching

In [2]:
import spacy

### Not only regex

spaCy's Matcher works not only on raw strings but also on Doc and Token objects. Rule-based matching lets you search for terms based on their text and on other lexical attributes. Rules can also draw on the model's predictions, e.g. find the word "duck" only if it is a verb, not a noun.

### Match patterns

Match patterns are lists of dictionaries. Each dictionary describes one token: its keys are token attribute names and its values are the values that the token must have.

In [3]:
# Import the Matcher class
from spacy.matcher import Matcher

# Load a pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Define a pattern: the token "iPhone" followed by the token "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher.
# The first argument is a unique ID that identifies which pattern was matched.
# The second argument is a list of patterns; here it holds a single pattern.
# An optional callback can be passed via the on_match keyword; we don't need one here.
matcher.add("IPHONE", [pattern])

# Create the document
doc = nlp("New iPhone X release date leaked")

# Get all the matches
matches = matcher(doc)

# Display the matched spans
for match_id, start, end in matches:
    match = doc[start:end]
    print(match.text)

iPhone X


In the example above, calling the matcher on a doc returns a list of tuples. Each tuple has three values:
- Match ID, which is the hash value of the pattern name
- Start index of the matched span
- End index of the matched span (exclusive)

The start and end indices can be used to slice the matched span out of the Doc object.
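
The slicing step can be illustrated without spaCy. The following is a minimal sketch in plain Python, assuming the document is just a list of token strings and using a made-up match ID. It shows that the end index is exclusive, so `tokens[start:end]` covers exactly the matched tokens.

```python
# Stand-in for a tokenized Doc: a plain list of token strings (assumption).
tokens = ["New", "iPhone", "X", "release", "date", "leaked"]

# Hand-written result shaped like the matcher's output: (match_id, start, end).
# The ID 1234 is a made-up placeholder, not a real spaCy hash.
matches = [(1234, 1, 3)]

# Slice each matched span out of the token list; end is exclusive,
# so start=1 and end=3 select the tokens at positions 1 and 2.
spans = [" ".join(tokens[start:end]) for _match_id, start, end in matches]
print(spans)
```

The number of tokens in a match is simply `end - start`, which is 2 for this span.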

### Matching lexical attributes and using operators and quantifiers

Operators and quantifiers let you define how often a token should be matched. They are added with the "OP" key.
For example, in a pattern that describes a token with the lemma "buy", a determiner and a noun, setting "OP" to "?" on the determiner makes it optional, so the pattern matches a form of "buy", an optional article and a noun.

"OP" can have one of four values:
- An "!" negates the token, so it's matched 0 times.
- A "?" makes the token optional, and matches it 0 or 1 times.
- A "+" matches a token 1 or more times.
- An "*" matches a token 0 or more times.

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.
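
To make the quantifier semantics concrete, here is a minimal plain-Python sketch of the three repetition operators ("?", "+" and "*"). It is a toy model, not spaCy's implementation: the tokens are hand-annotated dictionaries, `token_matches`, `match_ends` and `find_matches` are hypothetical helpers, and the negation operator "!" is deliberately left out. The example reuses the "buy" pattern described above.

```python
# Toy model of the "?", "+" and "*" operators. This is NOT spaCy's Matcher:
# tokens are hand-annotated dicts and all helper names are hypothetical.

def token_matches(token, spec):
    """True if the token has every attribute value listed in spec (ignoring "OP")."""
    return all(token.get(key) == value for key, value in spec.items() if key != "OP")


def match_ends(tokens, pattern, pos):
    """Return every index at which `pattern` can end when it starts at `pos`."""
    if not pattern:
        return {pos}
    spec, rest = pattern[0], pattern[1:]
    op = spec.get("OP")  # None means "match exactly once"
    if op not in (None, "?", "+", "*"):
        raise ValueError(f"operator {op!r} is not covered by this sketch")
    ends = set()
    if op in ("?", "*"):
        # Zero occurrences are allowed: skip this token description.
        ends |= match_ends(tokens, rest, pos)
    if pos < len(tokens) and token_matches(tokens[pos], spec):
        if op in ("+", "*"):
            # One occurrence consumed; any number of further ones may follow.
            ends |= match_ends(tokens, [dict(spec, OP="*")] + rest, pos + 1)
        else:
            # A single occurrence consumed; move on to the rest of the pattern.
            ends |= match_ends(tokens, rest, pos + 1)
    return ends


def find_matches(tokens, pattern):
    """Return sorted (start, end) pairs for every non-empty span the pattern covers."""
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in match_ends(tokens, pattern, start)
        if end > start
    )


# Hand-annotated tokens for "I bought a smartphone and am buying apps".
# The lemmas and tags are assigned by hand here, not predicted by a model.
tokens = [
    {"TEXT": "I", "LEMMA": "I", "POS": "PRON"},
    {"TEXT": "bought", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "a", "LEMMA": "a", "POS": "DET"},
    {"TEXT": "smartphone", "LEMMA": "smartphone", "POS": "NOUN"},
    {"TEXT": "and", "LEMMA": "and", "POS": "CCONJ"},
    {"TEXT": "am", "LEMMA": "be", "POS": "AUX"},
    {"TEXT": "buying", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "apps", "LEMMA": "app", "POS": "NOUN"},
]

# A form of "buy", an optional determiner, then a noun.
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

for start, end in find_matches(tokens, pattern):
    print(" ".join(token["TEXT"] for token in tokens[start:end]))
```

With these annotations the sketch finds "bought a smartphone" (determiner present) and "buying apps" (determiner absent). Because it enumerates every span a pattern can cover, an optional token at the end of a pattern yields two overlapping spans from the same start.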

- Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [4]:
doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


- Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag 'PROPN' (proper noun).

In [5]:
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


- Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).

In [6]:
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


- Write pattern1 so that it correctly matches all case-insensitive mentions of "Amazon" followed by a title-cased proper noun.
- Write pattern2 so that it correctly matches all case-insensitive mentions of "ad-free" followed by a noun.

In [7]:
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
