# Making the Most of spaCy's Rule-Based Matcher

This is an abbreviated introduction from my blog post.

## Rule-based matching with `Matcher`
**✅👍 Why rule-based matching is awesome 👍✅**
- No training data required (good for smaller datasets)
- Match results give you the flexibility to return only the relevant sections of a document.
    - e.g. document classification works on whole documents. What if you want to find the one sentence that is relevant to your business case? The matcher lets you work with specific sentences and spans after matching.
- Fast!
- Easy to integrate into an NLP pipeline
- Composable and cheap
  - e.g. you could write a bunch of quick, simple rules and use boolean logic on the matches to find interesting match combinations.
  - e.g. you can write a really general rule to find relevant documents, then some mutually exclusive sub-rules that operate only on documents that match the general rule.
- Usually explainable to non-technical stakeholders
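
To make the "composable" point concrete, here is a minimal sketch that combines rule results with set operations. The rule names and document ids are made up for illustration; in practice each set would hold the ids of the documents where that rule matched.

```python
# Hypothetical rule results: rule name -> ids of documents the rule matched.
matches_by_rule = {
    "mentions_purchase": {1, 2, 3, 5, 8},  # broad, general rule
    "gives_reason": {2, 3, 8, 9},          # narrower rule
    "is_gift": {3, 4},                     # narrower rule
}

# AND: documents that mention a purchase and give a reason.
purchase_with_reason = (
    matches_by_rule["mentions_purchase"] & matches_by_rule["gives_reason"]
)

# AND NOT: of those, the ones that are not gifts.
reason_not_gift = purchase_with_reason - matches_by_rule["is_gift"]

# General rule first, sub-rule second: only count sub-rule hits
# on documents that already matched the general rule.
gift_purchases = matches_by_rule["is_gift"] & matches_by_rule["mentions_purchase"]

print(sorted(purchase_with_reason))  # [2, 3, 8]
print(sorted(reason_not_gift))       # [2, 8]
print(sorted(gift_purchases))        # [3]
```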

    
**⚠️ Why rule-based matching is challenging ⚠️**
- Good rules require lots of iteration and experimentation
- Writing rules requires some linguistics knowledge about parts-of-speech and dependency parsing
- Token matching operators are greedy
  - Requires frequent sanity checking, testing, and iterating to be sure the matcher is returning what you think it is
- Requires complementary tools for maximum benefit
  - You'll probably want to use an `on_match` callback to work with the full document or a span after a match. Often the match itself is just the start.
  - We train a text classification model on our matches to help us iterate. I found parameter tuning difficult and training very fragile due to the rare outcome rate for matches.
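
To see why a wildcard operator calls for sanity checks, here is a simplified pure-Python stand-in for a pattern shaped like `I bought [any tokens] because`. It is not spaCy's implementation, and `wildcard_spans` is a hypothetical helper; it only enumerates every token span that would satisfy such a pattern.

```python
from typing import List, Tuple


def wildcard_spans(
    tokens: List[str], first: str, verb: str, last: str
) -> List[Tuple[int, int]]:
    """Return every (start, end) span shaped like: first verb <any tokens> last."""
    spans = []
    for start in range(len(tokens) - 1):
        if tokens[start] == first and tokens[start + 1] == verb:
            # Any later occurrence of `last` closes a valid span.
            for end in range(start + 2, len(tokens)):
                if tokens[end] == last:
                    spans.append((start, end + 1))  # end is exclusive
    return spans


tokens = "I bought it because it was cheap . I kept it because it works".split()
for start, end in wildcard_spans(tokens, "I", "bought", "because"):
    print(" ".join(tokens[start:end]))
```

The second span it prints runs across the sentence boundary, which is the kind of result you only notice by reading the matches.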


## Our Workflow
![Pattern Match Workflow](https://i.postimg.cc/Vv18XQ7z/Pattern-Match-4.png)

## Our Use Case
Let's say we work at Amazon, and our boss wants us to explore a potential new feature for product pages: a list of reasons why people purchased that item. If we assume people write things like `I bought this because my old whisk broke` in their reviews, we could use the text from product reviews to initially build this feature.

We'll use a dataset of [Home and Kitchen product reviews](http://jmcauley.ucsd.edu/data/amazon/) as our initial data.

In [2]:
import gzip
import json
import random
from collections import Counter
from dataclasses import dataclass
from pprint import pprint
from statistics import stdev
from textwrap import shorten
from typing import List, Dict

import numpy as np
import spacy
from imblearn.under_sampling import RandomUnderSampler
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split
from spacy.lang.en import English as EnglishPipeline
from spacy.matcher import Matcher
from spacy.pipeline.pipes import TextCategorizer
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc, Span
from spacy.util import compounding, minibatch

## Download And Load Data

In [3]:
!wget -N http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz

--2019-07-18 07:23:04--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘reviews_Home_and_Kitchen_5.json.gz’ not modified on server. Omitting download.



In [4]:
with gzip.open("./reviews_Home_and_Kitchen_5.json.gz", "rb") as f:
    data = [json.loads(line) for line in f]

In [5]:
print("Sample:")
pprint(data[0])
print()
print("Review Snippets:")
for i, d in enumerate(data[:10]):
    text_snip = shorten(d["reviewText"], 200)
    print(f"Text {i}: {text_snip}")

Sample:
{'asin': '0615391206',
 'helpful': [0, 0],
 'overall': 5.0,
 'reviewText': 'My daughter wanted this book and the price on Amazon was the '
               'best.  She has already tried one recipe a day after receiving '
               'the book.  She seems happy with it.',
 'reviewTime': '10 19, 2013',
 'reviewerID': 'APYOBQE6M18AA',
 'reviewerName': 'Martin Schwartz',
 'summary': 'Best Price',
 'unixReviewTime': 1382140800}

Review Snippets:
Text 0: My daughter wanted this book and the price on Amazon was the best. She has already tried one recipe a day after receiving the book. She seems happy with it.
Text 1: I bought this zoku quick pop for my daughterr with her zoku quick maker. She loves it and have fun to make her own ice cream.
Text 2: There is no shortage of pop recipes available for free on the web, but I purchased the "Zoku Quick Pops" book, because Zoku has some good recipes for fruit pops on its blog. I was hoping there [...]
Text 3: This book is a must have if you 

## Creating the first match rule
A simple match would be someone saying `"I bought this because _____"`. Let's find some synonyms for `bought` to be sure we're capturing some variation.

## Find synonyms for "bought" to generate first rule


Here's our process with wordnet:
1. Identify all the possible synsets we could use
2. Choose the one with the closest definition to our word
3. Print all the lemmas from the hyponyms and hypernyms of that word.
4. See if any of these new lemmas would work as synonyms for your matching rule.

In [6]:
for synset in wordnet.synsets("buy", pos="v"):
    print(synset, synset.definition(), sep=":\t")
    print(", ".join(synset.lemma_names()))
    pprint(synset.hyponyms())
    pprint(synset.hypernyms())
    print("\n")

Synset('buy.v.01'):	obtain by purchase; acquire by means of a financial transaction
buy, purchase
[Synset('buy_back.v.01'),
 Synset('get.v.22'),
 Synset('impulse-buy.v.01'),
 Synset('pick_up.v.08'),
 Synset('subscribe.v.05'),
 Synset('take.v.33'),
 Synset('take_out.v.07'),
 Synset('take_over.v.05')]
[Synset('get.v.01')]


Synset('bribe.v.01'):	make illegal payments to in exchange for favors or influence
bribe, corrupt, buy, grease_one's_palms
[Synset('buy_off.v.01'), Synset('sop.v.01')]
[Synset('pay.v.01')]


Synset('buy.v.03'):	be worth or be capable of buying
buy
[]
[Synset('be.v.01')]


Synset('buy.v.04'):	acquire by trade or sacrifice or exchange
buy
[]
[Synset('get.v.01')]


Synset('buy.v.05'):	accept as true
buy
[]
[Synset('believe.v.01')]




In [7]:
reference_synset = wordnet.synset("buy.v.01")
print(reference_synset)
print(", ".join(reference_synset.lemma_names()))
pprint([l.lemma_names() for l in reference_synset.hyponyms()])
pprint([l.lemma_names() for l in reference_synset.hypernyms()])

Synset('buy.v.01')
buy, purchase
[['buy_back', 'repurchase'],
 ['get'],
 ['impulse-buy'],
 ['pick_up'],
 ['subscribe', 'subscribe_to', 'take'],
 ['take'],
 ['take_out', 'buy_food'],
 ['take_over', 'buy_out', 'buy_up']]
[['get', 'acquire']]


## Cycle 1
### Cycle 1: Define Initial Match Pattern
We're going to create one match rule to match text where individuals provide their rationale for purchasing. This text should match the following:

```
I [bought/purchased/got] [0 to inf tokens] because
```

This means it should match phrases like `I bought this because [...]`, references to a specific item like `I got the BlendMAX 5000 because [...]`, and other combinations like `I purchased this wonderful whisk last week because [...]`.
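
Before writing the spaCy pattern, it can help to pin down the target with a few example sentences. The sketch below uses a plain regular expression as a rough approximation of the intent. It is not the `Matcher` rule itself, and the sentences that should not match are made up.

```python
import re

# Rough regex version of: I [bought/purchased/got] [0 to inf tokens] because
TARGET = re.compile(r"\bI (?:bought|purchased|got)\b(?:\s+\S+)*?\s+because\b")

should_match = [
    "I bought this because my old whisk broke.",
    "I got the BlendMAX 5000 because it was on sale.",
    "I purchased this wonderful whisk last week because I needed one.",
]
should_not_match = [
    "My friend bought this because she liked it.",  # no first-person subject
    "I bought this last week.",                     # no stated reason
]

for text in should_match:
    assert TARGET.search(text), text
for text in should_not_match:
    assert not TARGET.search(text), text
```

A list like this doubles as a small regression test once the real rule is in place.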

### Storing Matches
We're going to use an [on_match callback](https://spacy.io/usage/rule-based-matching#on_match) to store the start and end token indices of the match as an [extension attribute](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the document.

In [6]:
nlp = spacy.load("en_core_web_md", disable=["ner"])
matcher = Matcher(nlp.vocab)


@dataclass
class MatchRule:
    name: str  # no spaces, needs to be callable as an attr
    patterns: List 

In [7]:
patterns_1 = [
    [
        {"ORTH": "I"},
        {"LOWER": {"IN": ["bought", "purchased", "got"]}},
        {"IS_ALPHA": True, "OP": "*"},
        {"LOWER": {"IN": ["because"]}},
    ]
]

initial_rule = MatchRule("initial_rule", patterns_1)

In [8]:
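def add_matches(matcher: Matcher, doc: Doc, i: int, matches: List):
    """on_match callback: store each unique (start, end) match on the doc."""
    match_id, start, end = matches[i]
    string_id = nlp.vocab.strings[match_id]
    # The extension attribute is named after the rule that matched.
    mlist = doc._.get(string_id)
    if (start, end) not in mlist:
        mlist.append((start, end))
        doc._.set(string_id, mlist)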


Doc.set_extension(initial_rule.name, default=[], force=True)
matcher.add(initial_rule.name, add_matches, *initial_rule.patterns)

### Cycle 1: Match a Subset of Reviews

In [9]:
subset_length = 5000

subset = [d["reviewText"] for d in data[:subset_length]]
parsed_subset = list(nlp.pipe(subset))

In [10]:
for _ in matcher.pipe(parsed_subset):
    pass  # leave it to the add_matches callback

#### Statistics & Sanity Check

In [11]:
total = sum(bool(d._.initial_rule) for d in parsed_subset)
length = len(parsed_subset)
print(f"{total} matches. {(total/length)*100:.2f}% of data")

59 matches. 1.18% of data


In [12]:
for d in parsed_subset:
    if bool(d._.initial_rule):
        for match in d._.initial_rule:
            print(d[match[0] : match[1]])

I bought it because
I bought it because
I bought it because
I bought this because
I bought this ice cream maker last summer and ended up not using it much because
I bought this because
I bought this because
I bought this ice cream maker for my roommate because
I bought this heater because
I bought it because
I bought a timer to use with it because
I got this because
I bought this because
I purchased one for my mother in law because
I bought this because
I bought this because
I bought this vacuum cleaner because
I bought this vacuum cleaner belt because
I bought this quality peeler because
I bought this because
I got this because
I bought this item because
I bought this because
I purchased this brush is because
I bought this because
I bought this one because
I bought this because
I bought this is because
I bought this initially because
I bought this opener for my son who has difficulty opening can because
I bought this spinner to replace an older model that was years old because
I purch

### Cycle 1: Train Textcat Model
Now we're going to train a model to predict which documents should match our rules. The idea is that the model will pick up on other patterns in the text that our rules missed. Then, treating the matched documents as the actual labels, we will inspect the unmatched documents that the model scores highly. These are the *False Negatives* of our rules.

The evaluation dataset is there as a sanity check to make sure that our model is training correctly. We're not using it to evaluate against specific metrics, so as long as the loss goes down and the metrics go up, we're in good shape.

We're breaking a few machine learning rules here in service of our utilitarian model. We are going to make predictions back on the same data we used to train the model. However, this is not a model we're ever going to use in production; we're using it simply as an exploratory tool. We are also only going to train for a few epochs. If we were more interested in high performance, we would fine-tune the model parameters and train for much longer. However, we really only need it to be directionally accurate.

If you think this idea of "labeling" data by writing rules is interesting and want to use it to actually build a production model, you may be interested in [weak supervision](http://ai.stanford.edu/blog/weak-supervision/).

This code pulls heavily from the [textcat example](https://github.com/explosion/spaCy/blob/develop/examples/training/train_textcat.py) and is modified to only classify a single category.

In [13]:
label = "BUYBECAUSE"

textcat = nlp.create_pipe(
    "textcat", config={"exclusive_classes": False, "architecture": "ensemble"}
)
nlp.add_pipe(textcat, last=True)
textcat.add_label(label)

1

In [14]:
rus = RandomUnderSampler(random_state=0, sampling_strategy=0.33)
X = parsed_subset
y = [bool(d._.initial_rule) for d in parsed_subset]
X_resampled, y_resampled = rus.fit_resample(np.array(parsed_subset).reshape(-1, 1), y)

all_texts = X_resampled[:, 0].tolist()
all_cats = [{label: outcome} for outcome in y_resampled]

In [15]:
data_splits = train_test_split(
    all_texts, all_cats, train_size=0.8, test_size=0.2, random_state=666
)
train_texts, dev_texts, train_cats, dev_cats = data_splits
print(f"Total: {len(all_texts)} ({len(train_texts)} train, {len(dev_texts)} eval)")
all_outcome = sum(c[label] for c in all_cats) / len(all_cats)
train_outcome = sum(c[label] for c in train_cats) / len(train_cats)
dev_outcome = sum(c[label] for c in dev_cats) / len(dev_cats)
print(
    f"Outcome Rate: {all_outcome:.3f} ({train_outcome:.3f} train, {dev_outcome:.3f} eval)"
)
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

Total: 237 (189 train, 48 eval)
Outcome Rate: 0.249 (0.254 train, 0.229 eval)
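
The totals above follow from the under-sampling ratio. As a rough check, assume the sampler keeps every matched document and about `matched / sampling_strategy` unmatched ones, rounded down. `undersampled_size` is a hypothetical helper written for this check, not part of `imblearn`.

```python
from typing import Tuple


def undersampled_size(n_matched: int, sampling_strategy: float) -> Tuple[int, float]:
    """Estimate dataset size and outcome rate after under-sampling.

    Assumes the ratio is matched / unmatched and that the unmatched
    count is rounded down.
    """
    n_unmatched = int(n_matched / sampling_strategy)
    total = n_matched + n_unmatched
    return total, n_matched / total


total, outcome_rate = undersampled_size(59, 0.33)
print(total, round(outcome_rate, 3))  # 237 0.249
```

With 59 matched documents and a ratio of 0.33, this reproduces the 237 documents and the 0.249 outcome rate printed above.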


### Model Training
The code below will train a model for 15 epochs on our training examples. I've modified the evaluation code to only apply to a single category. 

I had some difficulty finding the correct training parameters even after following the guidance [here](https://spacy.io/usage/training#textcat). Thus, I've left it as close as possible to the original textcat example. Sometimes this requires manually retraining the model if the training loop fails to "click". **If running this on your own, you may have to retrain the model or toggle the parameters.**

I also added a measure of the standard deviation of scores to check whether the model is training. Usually this provides some initial feedback about whether the model is training before the performance measures start to cohere. We're looking for this to be non-zero as we continue to train to demonstrate that the model isn't getting stuck and predicting the same score for every document.
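
As a small illustration of that check, the sketch below compares the standard deviation of two made-up sets of scores: one where the model is stuck and one where the scores have started to separate.

```python
from statistics import stdev

# Hypothetical scores for five evaluation documents.
stuck_scores = [0.25, 0.25, 0.25, 0.25, 0.25]     # same score for every document
learning_scores = [0.02, 0.10, 0.05, 0.91, 0.88]  # scores starting to separate

print(f"stuck:    {stdev(stuck_scores):.3f}")
print(f"learning: {stdev(learning_scores):.3f}")
```

A value that stays at zero across epochs suggests the model is predicting the same score for every document.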

In [16]:
def evaluate_single_category(
    tokenizer: Tokenizer,
    textcat: TextCategorizer,
    docs: List[Doc],
    eval_cats: List[Dict[str, int]],
    label: str,
) -> Dict[str, float]:
    docs = (tokenizer(doc.text) for doc in docs)
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    scores = []
    for i, doc in enumerate(textcat.pipe(docs)):
        score = doc.cats[label]
        gold = int(eval_cats[i][label])
        if score >= 0.5 and gold >= 0.5:
            tp += 1.0
        elif score >= 0.5 and gold < 0.5:
            fp += 1.0
        elif score < 0.5 and gold < 0.5:
            tn += 1
        elif score < 0.5 and gold >= 0.5:
            fn += 1
        scores.append(score)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    scores_std = stdev(scores)
    return {
        "textcat_p": precision,
        "textcat_r": recall,
        "textcat_f": f_score,
        "textcat_score_std": scores_std,
    }

In [17]:
n_iter = 15

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print("Training the model...")
    print("{:^5}\t{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F", "STD"))
    for i in range(n_iter):
        losses = {}
        random.shuffle(train_data)
        batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        with textcat.model.use_params(optimizer.averages):
            scores = evaluate_single_category(
                nlp.tokenizer, textcat, dev_texts, dev_cats, label
            )
        print(
            "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}\t{4:.3f}".format(  # print a simple table
                losses["textcat"],
                scores["textcat_p"],
                scores["textcat_r"],
                scores["textcat_f"],
                scores["textcat_score_std"],
            )
        )

Training the model...
LOSS 	  P  	  R  	  F  	 STD 
0.642	0.583	0.636	0.609	0.184
0.418	0.714	0.455	0.556	0.235
0.400	0.750	0.818	0.783	0.353
0.249	0.687	1.000	0.815	0.426
0.217	0.647	1.000	0.786	0.441
0.175	0.611	1.000	0.759	0.460
0.166	0.611	1.000	0.759	0.463
0.653	0.687	1.000	0.815	0.453
0.192	0.733	1.000	0.846	0.449
0.792	0.733	1.000	0.846	0.450
0.132	0.687	1.000	0.815	0.451
0.141	0.687	1.000	0.815	0.446
0.106	0.687	1.000	0.815	0.446
0.090	0.733	1.000	0.846	0.438
0.383	0.786	1.000	0.880	0.434
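
To make the P, R, and F columns concrete, here is the metric arithmetic on its own. The counts are illustrative: 11 true positives, 3 false positives, and 0 false negatives are one combination that reproduces the final row of the table above. They are not numbers read from the model.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = precision_recall_f1(tp=11, fp=3, fn=0)
print(f"P={p:.3f}, R={r:.3f}, F={f:.3f}")  # P=0.786, R=1.000, F=0.880
```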


#### Model Sanity Check

In [18]:
test_text_relevant = "I purchased this because I wanted a cool device."
test_text_irrelevant = "This has nothing to do with the category classification."

doc_relevant = nlp(test_text_relevant)
doc_irrelevant = nlp(test_text_irrelevant)

print(
    f"Relevant Score: {doc_relevant.cats[label]:.3f} :: Irrelevant Score: {doc_irrelevant.cats[label]:.3f}"
)

assert doc_relevant.cats[label] > doc_irrelevant.cats[label]
assert doc_relevant.cats[label] - doc_irrelevant.cats[label] > 0.1

Relevant Score: 0.937 :: Irrelevant Score: 0.004


In [19]:
for doc in textcat.pipe(parsed_subset):
    pass  # delegates to the predict and set_annotations methods.

### Cycle 1: Review *False Negatives*

In [20]:
fn_docs = [doc for doc in parsed_subset if not bool(doc._.initial_rule) and doc.text]
fn_docs_sorted = sorted(fn_docs, key=lambda d: d.cats[label], reverse=True)

for doc in fn_docs_sorted[:50]:
    print(f"{doc.cats[label]:.2f}")
    printed_because = False
    for sent in doc.sents:
        if "because" in sent.text:
            printed_because = True
            print("BECAUSE: \t", sent.text)
    if not printed_because:
        print("FULL: \t", doc.text)

0.98
FULL: 	 Corer works well for cupcakes, the main reason I bought it. Easy to clean, easy to use...beats the heck out of a knife. I haven't used it on fruit, and probably won't, so can't comment on traditional use. Sorry about that :)
0.98
FULL: 	 My last masher (from a respected and pricey brand name) couldn't take the pressure-literally!  One day I'm just mashing up a batch of potatoes and *snap&#34; it's time for the trash bin.  That's why I bought this masher from OXO.This one is a solid, heavy-gauge steel wire that goes all the way up into the sturdy handle.  I'm not sure I could break it if I tried!  Works great at mashing (just as good as one with square or circle holes), and the large grip makes it easy to hold even if you get a little butter on your hands (trust me, I have done it).
0.98
FULL: 	 OXO has provided a very nice product at a fair price one again.I buy more and more OXO products, at first I thought that they were expensive yet when you get these spatulas you will


After reviewing the top 50 false negatives, we've come up with some new additions to our rules. I've also documented some noted exclusions - patterns that might be what we're looking for, but will be out of scope for this iteration.

✅ **Rule Additions:**
- Add "We" to the first token set of the rule
- Add `ordered` to the verbs
- Accept lowercase "i" or "we" as initial pronouns
- Accept non-alpha tokens between the verb and `because` (`I bought this pastry scraper/chopper because` isn't matched because of the `/`)

🚫 **Noted Exclusions:**
- No reference to person or pronoun. `Bought this because the ratings were high.` Could do a rule without the pronoun, but would capture things like "My friends bought this..." which we don't want.

💭 **Other Ideas:**
- Include an optional adverb between `-PRON-` and `bought` to capture phrases like `I primarily bought this because`.
- Verify that matches do not span more than a single sentence. This counters the greedy nature of the matcher.
  - Additionally, let's make sure the match starts at the first token of a sentence to avoid some false positives.
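
The two sentence checks in the last idea can be sketched on plain token offsets, without spaCy. `keep_sentence_initial` and the offsets below are hypothetical. Unlike a callback that sees one match at a time, this version sorts the matches by length first so the shortest one per start token wins.

```python
from typing import Dict, List, Tuple

Offsets = Tuple[int, int]  # (start, end) token offsets, end exclusive


def keep_sentence_initial(
    matches: List[Offsets], sentences: List[Offsets]
) -> List[Offsets]:
    """Keep matches that start a sentence and end inside that same sentence.

    If several matches share a start token, only the shortest is kept.
    """
    kept: Dict[int, Offsets] = {}
    for start, end in sorted(matches, key=lambda m: m[1] - m[0]):
        for s_start, s_end in sentences:
            if start == s_start and end <= s_end and start not in kept:
                kept[start] = (start, end)
    return sorted(kept.values())


# Hypothetical offsets: two sentences covering tokens 0-7 and 8-13.
sentences = [(0, 8), (8, 14)]
matches = [(0, 4), (0, 12), (9, 12)]
print(keep_sentence_initial(matches, sentences))  # [(0, 4)]
```

Here `(0, 12)` is dropped because it crosses into the second sentence, and `(9, 12)` is dropped because it does not start a sentence.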

## Cycle 2
### Cycle 2: Update Match Pattern & Match Documents

In [21]:
def add_shortest_match(matcher: Matcher, doc: Doc, i: int, matches: List):
    """Store a match only if it starts a sentence and stays within that sentence."""
    match_id, start, end = matches[i]
    sent_boundaries = ((sent.start, sent.end) for sent in doc.sents)
    for (s_start, s_end) in sent_boundaries:
        if (start == s_start) and (end <= s_end):
            string_id = nlp.vocab.strings[match_id]
            mlist = doc._.get(string_id)
            # Keep only the first match seen for a given start token.
            starts = [s for s, e in mlist]
            if start not in starts:
                mlist.append((start, end))
                doc._.set(string_id, mlist)

In [22]:
patterns_2 = [
    [
        {"LOWER": {"IN": ["i", "we"]}},
        {"POS": "ADV", "OP": "?"},
        {"LOWER": {"IN": ["bought", "purchased", "got", "ordered"]}},
        {"LENGTH": {">=": 1}, "OP": "*"},
        {"LOWER": {"IN": ["because"]}},
    ]
]

second_rule = MatchRule("second_rule", patterns_2)
Doc.set_extension(second_rule.name, default=[], force=True)
matcher.add(second_rule.name, add_shortest_match, *second_rule.patterns)

for _ in matcher.pipe(parsed_subset):
    pass  # leave it to the callback

#### Statistics & Sanity Check

In [23]:
total = sum(bool(d._.second_rule) for d in parsed_subset)
length = len(parsed_subset)
print(f"{total} matches. {(total/length)*100:.2f}% of data")

77 matches. 1.54% of data


In [24]:
new_docs = (d for d in parsed_subset if d._.second_rule and not d._.initial_rule)
for doc in new_docs:
    for match in doc._.second_rule:
        print(doc[match[0] : match[1]])

I ordered this lesson plan because
I ordered the Cuisinart because
I ordered this vacuum because
We purchased this unit originally for our own use, because
I bought this 3670G 12 amp Eureka mainly for the tiled floors, cleaning the drapes and (because
I purchased a Hoover Wind Tunnel model UH70120 and bought these belts only because
I ordered those belts with the Hoover Tempo Widepath Upright Vacuum,  U5140-900 because
I ordered this particular ricer because
We got rid of the automatic can opener because
I ordered a pair of these tongs as a gift because
I bought these, believe it or not, because
I bought this for my daughter's birthday, because
I bought this while on a health kick, thinking it would make me eat more salad, and because
I bought this salad spinner from Bed Bath & Beyond because
I got so tired of wasting money on the prepackaged salads, because
I bought 2 of these because
I ordered this because
We bought this because
I bought the "Oxo SteeL Can Opener" because
We bought a

### Cycle 2: Train Textcat Model

In [25]:
label = "BUYBECAUSE"

if "textcat" in nlp.pipe_names:
    nlp.remove_pipe("textcat")

textcat = nlp.create_pipe(
    "textcat", config={"exclusive_classes": False, "architecture": "ensemble"}
)
nlp.add_pipe(textcat, last=True)
textcat.add_label(label)

1

In [26]:
rus = RandomUnderSampler(random_state=0, sampling_strategy=0.25)
X = parsed_subset
# Take advantage of the fact that an empty list is falsy
y = [bool(d._.second_rule) for d in parsed_subset]
X_resampled, y_resampled = rus.fit_resample(np.array(parsed_subset).reshape(-1, 1), y)

all_texts = X_resampled[:, 0].tolist()
all_cats = [{label: outcome} for outcome in y_resampled]

In [27]:
data_splits = train_test_split(
    all_texts, all_cats, train_size=0.8, test_size=0.2, random_state=666
)
train_texts, dev_texts, train_cats, dev_cats = data_splits
print(f"Total: {len(all_texts)} ({len(train_texts)} train, {len(dev_texts)} eval)")
all_outcome = sum(c[label] for c in all_cats) / len(all_cats)
train_outcome = sum(c[label] for c in train_cats) / len(train_cats)
dev_outcome = sum(c[label] for c in dev_cats) / len(dev_cats)
print(
    f"Outcome Rate: {all_outcome:.3f} ({train_outcome:.3f} train, {dev_outcome:.3f} eval)"
)
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

Total: 385 (308 train, 77 eval)
Outcome Rate: 0.200 (0.214 train, 0.143 eval)


#### Model Training

In [28]:
n_iter = 15

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print("Training the model...")
    print("{:^5}\t{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F", "STD"))
    for i in range(n_iter):
        losses = {}
        random.shuffle(train_data)
        batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        with textcat.model.use_params(optimizer.averages):
            scores = evaluate_single_category(
                nlp.tokenizer, textcat, dev_texts, dev_cats, label
            )
        print(
            "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}\t{4:.3f}".format(  # print a simple table
                losses["textcat"],
                scores["textcat_p"],
                scores["textcat_r"],
                scores["textcat_f"],
                scores["textcat_score_std"],
            )
        )

Training the model...
LOSS 	  P  	  R  	  F  	 STD 
0.774	0.692	0.818	0.750	0.308
0.504	0.562	0.818	0.667	0.329
0.461	0.556	0.909	0.690	0.360
0.387	0.667	0.909	0.769	0.307
0.206	0.588	0.909	0.714	0.344
0.182	0.556	0.909	0.690	0.366
0.131	0.611	1.000	0.759	0.358
0.140	0.600	0.818	0.692	0.354
0.103	0.625	0.909	0.741	0.357
0.073	0.571	0.727	0.640	0.348
0.087	0.600	0.818	0.692	0.354
0.088	0.600	0.818	0.692	0.352
0.033	0.562	0.818	0.667	0.360
0.050	0.529	0.818	0.643	0.361
0.038	0.500	0.818	0.621	0.365


#### Model Sanity Check

In [29]:
# Quick check that the model makes sensible predictions
test_text_relevant = "I purchased this because my old whisk broke."
test_text_irrelevant = "This has nothing to do with the category classification."

doc_relevant = nlp(test_text_relevant)
doc_irrelevant = nlp(test_text_irrelevant)

print(
    f"Relevant Score: {doc_relevant.cats[label]:.3f} :: Irrelevant Score: {doc_irrelevant.cats[label]:.3f}"
)

assert doc_relevant.cats[label] > doc_irrelevant.cats[label]
assert doc_relevant.cats[label] - doc_irrelevant.cats[label] > 0.1

Relevant Score: 0.990 :: Irrelevant Score: 0.034


In [30]:
for doc in textcat.pipe(parsed_subset):
    pass  # delegates to the predict and set_annotations methods.

### Cycle 2: Review *False Negatives*


In [31]:
fn_docs = [doc for doc in parsed_subset if not bool(doc._.second_rule) and doc.text]
fn_docs_sorted = sorted(fn_docs, key=lambda d: d.cats[label], reverse=True)

In [32]:
for doc in fn_docs_sorted[:50]:
    print(f"{doc.cats[label]:.2f}")
    printed_because = False
    for sent in doc.sents:
        if "because" in sent.text:
            printed_because = True
            print("BECAUSE: \t", sent.text)
    if not printed_because:
        print("FULL: \t", doc.text)

0.99
FULL: 	 Ok I bought this b/c I wanted a icing knife that was long enough to do the whole top of a 9 inch circle cake (what I typically bake). Unfortunately, while the knife is long like I wanted, it is too flimsy and flexible to do much else. I can't stir the frosting w/ it unless  I want to do a weak job, I can't frost the sides b/c of it's flimsiness, etc. So, OXO, please make a more sturdy long knife if you haven't already!
0.99
FULL: 	 I recently received a bread maker and the crust on the whole wheat loaf is quite thick, thus my butter knives just weren't up to the task.  Therefore, I figured it's time to get a real bread knife.  I bought this OXO since I have lots of their others products that I like, and the reviews were good.  I'm happy to say, that it's as great as everyone says.The knife cuts through bread loaves smoothly and cleanly.  I've washed it by hand and in the dishwasher and have never had any issues.  It feels good in the hand and balances nicely.All in all... 

✅ **Rule Additions:**
- Other prepositional reasons: `I bought this tool [for/to/as/because]`

🚫 **Noted Exclusions:**
- None

💭 **Other Ideas:**
- None

## Finalize Patterns
We can continue this cycle as long as we need to. Additional cycles will depend on your business problem, how consistent your data is, and how complex your match rules need to be. At this point we've reviewed the data twice and we'll build the final match rule for this problem.

Due to the complexity of language and the inconsistency of our data, we will still likely have some false positives, `I bought this at walgreens for much less...` being one example. In this case, the best place to handle these would be some additional logic in the `on_match` function, since the tail of the sentence contains the tokens after our match.
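
One way to sketch that extra logic is a small filter on the tokens that follow a match. The helper name and the word list below are hypothetical, and a real filter would need tuning against your own data.

```python
from typing import List

# Hypothetical tail openers that signal a price or quantity, not a reason.
NON_REASON_OPENERS = {"much", "less", "more", "half", "$", "free"}


def looks_like_reason(connective: str, tail_tokens: List[str]) -> bool:
    """Heuristic check on the tokens that follow a match."""
    if not tail_tokens:
        return False  # nothing after the connective, so no reason to show
    if connective == "because":
        return True  # treated as unambiguous in this sketch
    return tail_tokens[0].lower() not in NON_REASON_OPENERS


print(looks_like_reason("for", "much less than the list price".split()))  # False
print(looks_like_reason("for", "my daughter".split()))                    # True
print(looks_like_reason("because", "my old whisk broke".split()))         # True
```

A callback could run a check like this on the tail of the sentence before storing the match.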

In [33]:
patterns_3 = [
    [
        {"LOWER": {"IN": ["i", "we"]}},
        {"POS": "ADV", "OP": "?"},
        {"LOWER": {"IN": ["bought", "purchased", "got", "ordered"]}},
        {"LENGTH": {">=": 1}, "OP": "*"},
        {
            "LOWER": {"IN": ["for", "to", "as", "because"]},
            "POS": {"IN": ["ADP", "PART"]},
        },
    ]
]

third_rule = MatchRule("third_rule", patterns_3)
Doc.set_extension(third_rule.name, default=[], force=True)
if matcher.has_key(third_rule.name):
    matcher.remove(third_rule.name)
matcher.add(third_rule.name, add_shortest_match, *third_rule.patterns)

for _ in matcher.pipe(parsed_subset):
    pass  # leave it to the callback

#### Statistics and Sanity Check

In [34]:
total = sum(bool(d._.third_rule) for d in parsed_subset)
length = len(parsed_subset)
print(f"{total} matches. {(total/length)*100:.2f}% of data")

383 matches. 7.66% of data


In [35]:
new_docs = (
    d
    for d in parsed_subset
    if d._.third_rule and not (d._.initial_rule or d._.second_rule)
)
for doc in new_docs:
    for match in doc._.third_rule:
        span = Span(doc, match[0], match[1])
        docstr = f"{span} | {doc[span.end:span.sent.end]}"
        print(docstr)

I bought this zoku quick pop for | my daughterr with her zoku quick maker.
I got this as | a gift for my niece and she loved it.
I got this for | my daughter in law.
I bought this book to | help me out with decorating cakes and cupcakes as a beginner.
I bought other things in the book from Amazon, however, the list goes on too long to | keep mentioning everything.
I bought this book to | find out that I have the same one with just a different cover!
I purchased the Wilton Course 1 guide a few weeks before I took my class, I found the book very easy to | follow and easy to understand.  
I bought this for | my daughter.  
I got this book for | my daughter as the local craft store wouldn't let her take this class unless she took the other two first which she doesn't need.
I also bought the class kit to | go with it and it works great.
I got this for | my bf and he loves it.
I bought one for | myself and everybody that sees it loves it.  
I bought this for | my boyfriend and he just loved 

## Larger Match Test
We'll run this on the reviews for a single product to see what the full set of results looks like.

In [36]:
asin = "B00004SPEU" # coffee grinder
product_subset = [
    (d["reviewText"], d) for d in data if d["asin"] == asin
]

Doc.set_extension("metadata", default={}, force=True)

parsed_product_subset = []
for doc, context in nlp.pipe(product_subset, as_tuples=True):
    # FYI, I learned this great way to use attribute extensions 
    # to store the document metadata from the spaCy course.
    # Have you taken it yet? https://course.spacy.io/
    
    metadata = {k: v for k, v in context.items() if k != "reviewText"}
    doc._.set("metadata", metadata)
    parsed_product_subset.append(doc)

In [37]:
for _ in matcher.pipe(parsed_product_subset):
    pass  # leave it to the callback

In [39]:
total = sum(bool(d._.third_rule) for d in parsed_product_subset)
length = len(parsed_product_subset)
print(f"{total} matches. {(total/length)*100:.2f}% of data")

71 matches. 11.99% of data


In [40]:
product_matches = [
    d for d in parsed_product_subset if d._.metadata["asin"] == asin and d._.third_rule
]

product_matches = sorted(
    product_matches,
    key=lambda d: (d[d._.third_rule[0][1] - 1 : d._.third_rule[0][1] + 1].text),
)

print(f"Product: https://www.amazon.com/dp/{asin}")
print("====================")
print("Customers buy this product ...")
for doc in product_matches:
    for match in doc._.third_rule:
        span = Span(doc, match[0], match[1])
        sent = span.sent
        docstr = f"... {doc[span.end-1:sent.end]}"
        print(docstr)

Product: https://www.amazon.com/dp/B00004SPEU
Customers buy this product ...
... as Christmas presents because I have one like them and am very pleased with it so feel these three will be just as welcome in their new kitchens.
... as a travel accessory, and it will do me quite well.  
... as a replacement for a Salton model.
... as a Christmas gift for my son in law.  
... as a new coffee maker for serving guests of an annual retreat we host in our home.  
... as a spice grinder and are very happy.  
... as a christmas gift for a friend and it works great.
... as a spice grinder and a coffee grinder
... as a wedding gift...24 years ago.
... because it was cheap and had good reviews, and I have not been disappointed.
... because of the excellent reviews.  
... because of Amazon reviews but don't remember a particular comment regarding the mess caused by getting the grinds out of the unit.  
... because they both had good reviews.
... because we had been so happy with it.
... for 1 thing

## Conclusion
Rule-based matching is great if you need to explain your algorithm, find the position of specific matches within a document, or identify matching phrases quickly, and if you're comfortable with the linguistic attributes needed to write rules.