# 10. Rule-based Matching
## Q: Why not just regular expressions?
Ans:
<ul>
    <li> Match on Doc objects, not just strings</li>
    <li> Match on tokens and token attributes</li>
    <li> Use the model's predictions</li>
    <li> Example: "duck" (verb) vs. "duck" (noun)</li>
</ul>
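To make the "duck" example concrete, here is a minimal sketch that uses only Python's standard `re` module. The two sentences are made up for illustration. A regular expression sees only characters, so it finds "duck" in both sentences and cannot tell the verb from the noun.

```python
import re

# Illustrative sentences: "duck" is a verb in the first and a noun in the second.
sentences = [
    "You should duck when the ball comes.",
    "The duck swam across the pond.",
]

# A regex operates on raw strings, so it has no notion of part of speech.
word = re.compile(r"\bduck\b")
regex_hits = [bool(word.search(s)) for s in sentences]
print(regex_hits)  # [True, True]
```

A token pattern such as `[{'LOWER': 'duck', 'POS': 'VERB'}]` could separate the two uses, provided the loaded model tags the words correctly.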


## Match patterns
<ul>
    <li> Lists of dictionaries, one per token </li>
    <li> Match exact token texts <br/>
            <span style="padding-left: 50px; background:yellow;"> [{'TEXT': 'iPhone'}, {'TEXT': 'X'}] </span>
    </li>
    <li> Match lexical attributes <br/>
            <span style="padding-left: 50px; background:yellow;"> [{'LOWER': 'iphone'}, {'LOWER': 'x'}] </span>
    </li>
    <li> Match any token attributes <br/>
            <span style="padding-left: 50px; background:yellow;"> [{'LEMMA': 'buy'}, {'POS': 'NOUN'}] </span>
    </li>
</ul>
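The difference between `TEXT` and `LOWER` can be sketched as follows. This assumes spaCy v3, where `Matcher.add` takes a list of patterns, and uses a blank English pipeline (tokenizer only), since neither attribute needs model predictions. The rule names `IPHONE_EXACT` and `IPHONE_LOWER` and the example sentence are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough because TEXT and LOWER
# are lexical attributes that do not depend on a trained model.
nlp = spacy.blank("en")

# One matcher per pattern so the results can be compared side by side
exact = Matcher(nlp.vocab)
exact.add("IPHONE_EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

lower = Matcher(nlp.vocab)
lower.add("IPHONE_LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("I compared the IPHONE X with the iPhone X")

# Each match is a (match_id, start, end) tuple of token indices
exact_spans = [doc[start:end].text for _, start, end in exact(doc)]
lower_spans = [doc[start:end].text for _, start, end in lower(doc)]

print(exact_spans)  # only the exact spelling
print(lower_spans)  # both spellings, compared case-insensitively
```

Patterns that use `LEMMA` or `POS` would instead need a pipeline with the relevant trained components, such as `en_core_web_sm`.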

## Using the Matcher (1)

In [1]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher (spaCy v3 expects a list of patterns)
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', [pattern])

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

## Using the Matcher (2)

In [2]:
# Call the matcher on the doc
doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


<ul>
    <li> match_id: hash value of the pattern name </li>
    <li> start: token index where the matched span starts </li>
    <li> end: token index where the matched span ends (exclusive) </li>
</ul>
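Because `match_id` is a hash, it is not readable on its own. The sketch below shows one way to recover the pattern name by looking the hash up in the vocabulary's string store. It assumes spaCy v3 and a blank English pipeline, which is sufficient for a `TEXT` pattern.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, since the pattern uses TEXT only
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("New iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # The string store maps the hash back to the pattern name
    name = nlp.vocab.strings[match_id]
    # start and end are token indices; end is exclusive, as in a Python slice
    results.append((name, start, end, doc[start:end].text))

print(results)  # [('IPHONE_PATTERN', 1, 3, 'iPhone X')]
```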

## Matching lexical attributes

In [10]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

matcher.add('NEW_PATTERN', [pattern])

In [11]:
doc = nlp("2018 FIFA World Cup: France won!")

In [12]:
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


## Matching other token attributes

In [13]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
matcher.add('NEW_PATTERN_2', [pattern])
doc = nlp("I loved dogs but now I love cats more.")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)


loved dogs
love cats


## Using operators and quantifiers (1)

In [14]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
matcher.add('NEW_PATTERN_3', [pattern])
doc = nlp("I bought a smartphone. Now I'm buying apps.")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


## Using operators and quantifiers (2)
<table class="table">
    <thead>
        <tr>
        <th>Example</th>
        <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><code>{'OP': '!'}</code></td>
            <td>Negation: match 0 times</td>
        </tr>
        <tr>
            <td><code>{'OP': '?'}</code></td>
            <td>Optional: match 0 or 1 times</td>
         </tr>
        <tr>
            <td><code>{'OP': '+'}</code></td>
            <td>Match 1 or more times</td>
        </tr>
        <tr>
            <td><code>{'OP': '*'}</code></td>
            <td>Match 0 or more times</td>
        </tr>
    </tbody>
</table>
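The quantifiers in the table can be tried out with a small sketch like the one below. It assumes spaCy v3 and a blank English pipeline, and the helper `matched_texts` and the rule name `DEMO` are hypothetical names introduced only for this example. Without further options the matcher returns every possible match, so the results are collected in sets.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, since the patterns use LOWER only
nlp = spacy.blank("en")
doc = nlp("very very good")


def matched_texts(pattern):
    """Return the set of span texts that `pattern` matches in `doc`."""
    matcher = Matcher(nlp.vocab)
    matcher.add("DEMO", [pattern])
    return {doc[start:end].text for _, start, end in matcher(doc)}


# '+': "very" must occur at least once before "good"
plus = matched_texts([{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}])

# '?': "very" may occur once or not at all before "good"
optional = matched_texts([{"LOWER": "very", "OP": "?"}, {"LOWER": "good"}])

print(sorted(plus))
print(sorted(optional))
```

With `+`, the bare "good" is not matched because at least one "very" is required. With `?`, "very very good" is not matched because at most one "very" is allowed.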