## spaCy’s Rule-Based Matching

Before we get started, let’s talk about Marti Hearst. She is a computational linguistics researcher and a professor in the School of Information at the University of California, Berkeley. How does she fit into this article? I can sense you wondering.

> Professor Hearst has done extensive research on the topic of information extraction. One of her most interesting studies focuses on building a set of text patterns that can be used to extract meaningful information from text. **These patterns are popularly known as “Hearst Patterns”**.

Let’s look at the example below:
![](https://i2.wp.com/s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/example_1.png?resize=703%2C127&ssl=1)

We can infer that “Gelidium” is a type of “red algae” just by looking at the structure of the sentence.

> In linguistic terms, “red algae” is the hypernym and “Gelidium” is its hyponym.

We can formalize this pattern as “X such as Y”, where X is the hypernym and Y is the hyponym. This is one of several Hearst Patterns. Here’s a list to give you an intuition behind the idea:

![hearst patterns](https://i0.wp.com/s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/hearst_patterns.png?resize=566%2C321&ssl=1)
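
To make the idea concrete before we bring in spaCy, here is a minimal sketch of the “X such as Y” pattern as a plain regular expression. The function name `extract_such_as` and the sample sentence are my own illustrations, not part of Hearst’s work or of spaCy. Notice that the regex only grabs the single word before “such as”; it has no way of knowing that “red” belongs to the hypernym, which is exactly why the rest of this section leans on POS and dependency tags.

```python
import re

# Naive surface pattern: one word, an optional comma, "such as", one word.
# This is an illustrative approximation, not a full Hearst Pattern.
SUCH_AS = re.compile(r"(\w+),? such as (\w+)", flags=re.IGNORECASE)


def extract_such_as(text):
    """Return (hypernym, hyponym) pairs found by the naive pattern."""
    return SUCH_AS.findall(text)


# Illustrative sentence (an assumption, written for this example)
pairs = extract_such_as("Agar is made from red algae such as Gelidium.")
print(pairs)  # [('algae', 'Gelidium')]
```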



Now let’s try to extract hypernym-hyponym pairs by using these patterns/rules. We will use spaCy’s rule-based matcher to perform this task.

First, we will import the required libraries:

In [1]:
import re 
import string 
import nltk 
import spacy 
import pandas as pd 
import numpy as np 
import math 
from tqdm import tqdm 

from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 

pd.set_option('display.max_colwidth', 200)

Next, load a spaCy model:

In [2]:
# load spaCy model
nlp = spacy.load("en_core_web_sm")

We are all set to mine information from text based on these Hearst Patterns.

- **Pattern: X such as Y**

In [3]:
# sample text 
text = "GDP in developing countries such as Vietnam will continue growing at a high rate." 

# create a spaCy object 
doc = nlp(text)

To be able to pull out the desired information from the above sentence, it is really important to understand its syntactic structure – things like the subject, object, modifiers, and parts-of-speech (POS) in the sentence.

We can easily explore these syntactic details in the sentence by using spaCy:

In [4]:
# print token, dependency, POS tag 
for tok in doc: 
    print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

GDP --> nsubj --> NOUN
in --> prep --> ADP
developing --> amod --> VERB
countries --> pobj --> NOUN
such --> amod --> ADJ
as --> prep --> ADP
Vietnam --> pobj --> PROPN
will --> aux --> VERB
continue --> ROOT --> VERB
growing --> xcomp --> VERB
at --> prep --> ADP
a --> det --> DET
high --> amod --> ADJ
rate --> pobj --> NOUN
. --> punct --> PUNCT


Have a look around the terms “such” and “as”. They are preceded by a noun (“countries”) and followed by a proper noun (“Vietnam”) that acts as the hyponym.

So, let’s create the required pattern using the POS tags and the lowercase token text:

In [5]:
#define the pattern 
pattern = [{'POS':'NOUN'}, 
           {'LOWER': 'such'}, 
           {'LOWER': 'as'}, 
           {'POS': 'PROPN'} ] #proper noun

Let’s extract the pattern from the text:

In [6]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 
matcher.add("matching_1", [pattern])  # list-of-patterns form; works on spaCy v2.2.2+ and v3

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 

print(span.text)

countries such as Vietnam


Nice! It works perfectly. However, if we could get “developing countries” instead of just “countries”, then the output would make more sense.

So, we will now also capture the modifier of the noun just before “such as” by using the code below:

In [7]:
# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'amod', 'OP':"?"}, # adjectival modifier
           {'POS':'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]

matcher.add("matching_1", [pattern])
matches = matcher(doc)

span = doc[matches[0][1]:matches[0][2]]
print(span.text)

developing countries such as Vietnam


Here, “developing countries” is the hypernym and “Vietnam” is the hyponym. Both of them are semantically related.

> _Note: The key ‘OP’: ‘?’ in the pattern above means that the modifier (‘amod’) can occur once or not at all._
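
A side effect of the optional token is that the matcher may report both the shorter and the longer span for the same sentence, so relying on `matches[0]` depends on the order of the results (and fails with an `IndexError` when nothing matches). Here is a small sketch, assuming the usual `(match_id, start, end)` tuples returned by the matcher, that picks the longest match and returns `None` when there is none. The helper name `longest_match` and the sample offsets are hypothetical.

```python
def longest_match(matches):
    """Return the (match_id, start, end) tuple that covers the most tokens.

    Returns None when the list of matches is empty.
    """
    if not matches:
        return None
    return max(matches, key=lambda m: m[2] - m[1])


# Hypothetical matcher output for the Vietnam sentence (token offsets):
# tokens 3..7 -> "countries such as Vietnam"
# tokens 2..7 -> "developing countries such as Vietnam"
example_matches = [(1, 3, 7), (1, 2, 7)]

best = longest_match(example_matches)
print(best)
```

With a real `doc`, the span would then be `doc[best[1]:best[2]]`. If your spaCy version ships `spacy.util.filter_spans`, that utility may serve a similar purpose for overlapping spans.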

In a similar manner, we can get several pairs from any piece of text:

- **Fruits** such as **apples**
- **Cars** such as **Ferrari**
- **Flowers** such as **rose**
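
Once a span is matched, splitting it at the cue phrase gives the pair itself. The helper below is a hedged sketch (the name `split_on_cue` is made up for this example) that works on the span text only and assumes the cue phrase appears exactly as written.

```python
def split_on_cue(span_text, cue="such as"):
    """Split matched text into (hypernym, hyponym) around the cue phrase.

    If the cue phrase is absent, the hyponym comes back as an empty string.
    """
    hypernym, _, hyponym = span_text.partition(f" {cue} ")
    return hypernym.strip(), hyponym.strip()


pair = split_on_cue("developing countries such as Vietnam")
print(pair)  # ('developing countries', 'Vietnam')
```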

Now let’s use some other Hearst Patterns to extract more hypernyms and hyponyms.

- **Pattern: X and/or other Y**

In [8]:
doc = nlp("Here is how you can keep your car and other vehicles clean.") 

# print dependency tags and POS tags
for tok in doc: 
    print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

Here --> advmod --> ADV
is --> ROOT --> VERB
how --> advmod --> ADV
you --> nsubj --> PRON
can --> aux --> VERB
keep --> csubj --> VERB
your --> poss --> ADJ
car --> dobj --> NOUN
and --> cc --> CCONJ
other --> amod --> ADJ
vehicles --> conj --> NOUN
clean --> oprd --> ADJ
. --> punct --> PUNCT


In [9]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 

#define the pattern 
pattern = [{'DEP':'amod', 'OP':"?"}, 
           {'POS':'NOUN'}, 
           {'LOWER': 'and', 'OP':"?"}, 
           {'LOWER': 'or', 'OP':"?"}, 
           {'LOWER': 'other'}, 
           {'POS': 'NOUN'}] 
           
matcher.add("matching_1", [pattern])

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

car and other vehicles


Let’s reuse the same pattern to capture “X or other Y”. Only the sentence changes; the rest of the code remains the same:

In [10]:
# replaced 'and' with 'or'
doc = nlp("Here is how you can keep your car or other vehicles clean.")

# Matcher class object 
matcher = Matcher(nlp.vocab) 

#define the pattern 
pattern = [{'DEP':'amod', 'OP':"?"}, 
           {'POS':'NOUN'}, 
           {'LOWER': 'and', 'OP':"?"}, 
           {'LOWER': 'or', 'OP':"?"}, 
           {'LOWER': 'other'}, 
           {'POS': 'NOUN'}] 
           
matcher.add("matching_1", [pattern])

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

car or other vehicles


Excellent – it works!
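
One caveat: since both “and” and “or” are optional in this pattern, it would also accept a phrase with neither word (“car other vehicles”) or with both in a row. If your spaCy version supports the `IN` predicate for token attributes (I believe it arrived in v2.1), a slightly tighter variant could require exactly one of the two conjunctions. The variable name `and_or_other` is just for illustration, and this is a sketch rather than the canonical way to write the rule.

```python
# Tighter "X and/or other Y" pattern: exactly one conjunction is required.
# Assumes a spaCy version whose Matcher understands the {"IN": [...]} predicate.
and_or_other = [
    {"DEP": "amod", "OP": "?"},          # optional adjectival modifier
    {"POS": "NOUN"},                     # hyponym (X)
    {"LOWER": {"IN": ["and", "or"]}},    # one of the two conjunctions
    {"LOWER": "other"},
    {"POS": "NOUN"},                     # hypernym (Y)
]

# It would be registered like the patterns above, for example:
# matcher.add("and_or_other", [and_or_other])
print(len(and_or_other))
```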

- **Pattern: X, including Y**

In [11]:
doc = nlp("Eight people, including two children, were injured in the explosion") 

for tok in doc: 
    print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

Eight --> nummod --> NUM
people --> nsubjpass --> NOUN
, --> punct --> PUNCT
including --> prep --> VERB
two --> nummod --> NUM
children --> pobj --> NOUN
, --> punct --> PUNCT
were --> auxpass --> VERB
injured --> ROOT --> VERB
in --> prep --> ADP
the --> det --> DET
explosion --> pobj --> NOUN


In [12]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 

#define the pattern 
pattern = [{'DEP':'nummod','OP':"?"}, # numeric modifier 
           {'DEP':'amod','OP':"?"}, # adjectival modifier 
           {'POS':'NOUN'}, 
           {'IS_PUNCT': True}, 
           {'LOWER': 'including'}, 
           {'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}]            

matcher.add("matching_1", [pattern])

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

Eight people, including two children


- **Pattern: X, especially Y**

In [13]:
doc = nlp("A healthy eating pattern includes fruits, especially whole fruits.") 

for tok in doc: 
    print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

A --> det --> DET
healthy --> amod --> ADJ
eating --> compound --> NOUN
pattern --> nsubj --> NOUN
includes --> ROOT --> VERB
fruits --> dobj --> NOUN
, --> punct --> PUNCT
especially --> advmod --> ADV
whole --> amod --> ADJ
fruits --> appos --> NOUN
. --> punct --> PUNCT


In [14]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 

#define the pattern 
pattern = [
           {'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}, 
           {'IS_PUNCT': True}, 
           {'LOWER': 'especially'}, 
           {'DEP':'nummod','OP':"?"}, 
           {'DEP':'amod','OP':"?"}, 
           {'POS':'NOUN'}]            

matcher.add("matching_1", [pattern])

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 
print(span.text)

fruits, especially whole fruits
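
The “including” and “especially” patterns differ only in the cue word, so a tiny factory function can build both. This is a sketch under that assumption; `cue_pattern` is a hypothetical helper, not a spaCy function, and it only produces the list of token dictionaries that you would then hand to the matcher.

```python
def cue_pattern(cue):
    """Build the 'X, <cue> Y' token pattern used above for a given cue word."""

    def noun_phrase():
        # optional numeric modifier, optional adjectival modifier, then a noun
        return [
            {"DEP": "nummod", "OP": "?"},
            {"DEP": "amod", "OP": "?"},
            {"POS": "NOUN"},
        ]

    return (
        noun_phrase()
        + [{"IS_PUNCT": True}, {"LOWER": cue.lower()}]
        + noun_phrase()
    )


including = cue_pattern("including")
especially = cue_pattern("especially")
print(including[4], especially[4])
```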


## Subtree Matching for Relation Extraction

Simple rule-based methods work well for information extraction tasks. However, they have a few drawbacks.

We have to be extremely creative to come up with new rules to capture different patterns. It is difficult to build patterns that generalize well across different sentences.

To enhance the rule-based methods for relation/information extraction, we should try to understand the dependency structure of the sentences at hand.

Let’s take a sample text and plot its dependency tree:

In [19]:
text = "Tableau was recently acquired by Salesforce." 

# Plot the dependency graph 
doc = nlp(text) 
displacy.render(doc, style='dep', jupyter=True)

Can you find any interesting relation in this sentence?

If you look at the entities in the sentence – Tableau and Salesforce – they are related by the term ‘acquired’. So, the relation I can extract from this sentence is “Salesforce acquired Tableau”, or, as a general pattern, “X acquired Y”.

Now consider this statement: “Careem, a ride-hailing major in the middle east, was acquired by Uber.”

Its dependency graph will look something like this:

![dependency tree nlp](https://i1.wp.com/s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/DG_2.png?resize=679%2C191&ssl=1)

Pretty scary, right?

Don’t worry! All we have to check is which dependency paths are common between multiple sentences. This method is known as Subtree matching.

For instance, if we compare this statement with the previous one:
![information extraction](https://i0.wp.com/s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/DG_3.png?resize=675%2C193&ssl=1)


![information extraction](https://i2.wp.com/s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/DG_4.png?resize=531%2C170&ssl=1)

We will just consider the common dependency paths and extract the entities and the relation (acquired) between them. Hence, the relations extracted from these sentences are:

- **Salesforce acquired Tableau**
- **Uber acquired Careem**

Let’s try to implement this technique in Python. We will again use spaCy as it makes it pretty easy to traverse a dependency tree.

We will start by taking a look at the dependency tags and POS tags of the words in the sentence:

In [21]:
text = "Tableau was recently acquired by Salesforce." 
doc = nlp(text) 

for tok in doc: 
    print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

Tableau --> nsubjpass --> PROPN
was --> auxpass --> VERB
recently --> advmod --> ADV
acquired --> ROOT --> VERB
by --> agent --> ADP
Salesforce --> pobj --> PROPN
. --> punct --> PUNCT


Here, the dependency tag for “Tableau” is _nsubjpass_, which stands for passive subject (as this is a passive sentence). The other entity, “Salesforce”, is the object of the preposition “by” (_pobj_), and the term “acquired” is the ROOT of the sentence, which means it connects the object and the subject.

Let’s define a function to perform subtree matching:

In [34]:
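def subtree_matcher(doc): 
    x = ''  # acquirer: the object (here, the object of "by")
    y = ''  # acquired entity: the passive subject

    # iterate through all the tokens in the input sentence 
    for tok in doc: 
        # extract the passive subject (e.g. nsubjpass)
        if tok.dep_.endswith("subjpass"): 
            y = tok.text 

        # extract the object (e.g. pobj, dobj); the last one found wins
        if tok.dep_.endswith("obj"): 
            x = tok.text 

    return x, y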

In this case, we just have to find all those sentences that:

- Have two entities, and
- Have the term “acquired” as the only ROOT in the sentence

We can then capture the subject and the object from the sentences. Let’s call the above function:

In [24]:
subtree_matcher(doc)

('Salesforce', 'Tableau')

Here, the first value is the acquirer (the object of “by”) and the second is the entity being acquired (the passive subject). Let’s use the same function, subtree_matcher(), to extract entities related by the same relation (“acquired”):

In [25]:
text_2 = "Careem, a ride hailing major in middle east, was acquired by Uber." 

doc_2 = nlp(text_2) 
subtree_matcher(doc_2)

('Uber', 'Careem')

Did you see what happened here? This sentence had more words and punctuation marks but still, our logic worked and successfully extracted the related entities.

But wait – what if I change the sentence from passive to active voice? Will our logic still work?

In [26]:
text_3 = "Salesforce recently acquired Tableau." 
doc_3 = nlp(text_3) 
subtree_matcher(doc_3)

('Tableau', '')

That’s not quite what we expected. The function has failed to capture ‘Salesforce’ and wrongly returned ‘Tableau’ as the acquirer.

So, what went wrong? Let’s look at the dependency tags of this sentence:

In [28]:
for tok in doc_3:    
    print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

Salesforce --> nsubj --> PROPN
recently --> advmod --> ADV
acquired --> ROOT --> VERB
Tableau --> dobj --> PROPN
. --> punct --> PUNCT


It turns out that the grammatical roles of ‘Salesforce’ and ‘Tableau’ are swapped in the active voice: ‘Salesforce’ is now the subject and ‘Tableau’ is the direct object. The dependency tag for the subject has also changed from ‘nsubjpass’ to ‘nsubj’, which indicates that the sentence is in the active voice.

We can use this property to modify our subtree matching function. Given below is the new function for subtree matching:

In [32]:
def new_subtree_matcher(doc):
    # the sentence is passive if any dependency tag ends with "subjpass"
    is_passive = any(tok.dep_.endswith("subjpass") for tok in doc)

    x = ''
    y = ''

    for tok in doc:
        if is_passive:
            # passive voice: the passive subject is the acquired entity (y)
            # and the object of "by" is the acquirer (x)
            if tok.dep_.endswith("subjpass"):
                y = tok.text

            if tok.dep_.endswith("obj"):
                x = tok.text

        else:
            # active voice: the subject is the acquirer (x)
            # and the object is the acquired entity (y)
            if tok.dep_.endswith("subj"):
                x = tok.text

            if tok.dep_.endswith("obj"):
                y = tok.text

    return x, y

Let’s try this new function on the active voice sentence:

In [33]:
new_subtree_matcher(doc_3)

('Salesforce', 'Tableau')

Great! The output is correct. Let’s pass the previous passive sentence to this function:

In [35]:
new_subtree_matcher(nlp("Tableau was recently acquired by Salesforce."))

('Salesforce', 'Tableau')

That’s exactly what we were looking for. We have made the function slightly more general. I would urge you to deep dive into the grammatical structure of different types of sentences and try to make this function more flexible.
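
As a starting point for that exercise, here is a hedged sketch that also checks the ROOT condition mentioned earlier and returns the relation along with the two entities. To keep it runnable without a language model, it works on `(text, dependency)` pairs copied from the outputs printed above rather than on a spaCy `Doc`. The function name `extract_relation` and the tuple representation are assumptions made for this example.

```python
def extract_relation(tokens, relation="acquired"):
    """Return (acquirer, relation, acquired) or None.

    `tokens` is a list of (text, dependency_label) pairs. The sentence
    qualifies only if `relation` is its single ROOT.
    """
    roots = [text for text, dep in tokens if dep == "ROOT"]
    if roots != [relation]:
        return None

    passive = any(dep.endswith("subjpass") for _, dep in tokens)
    agent, patient = "", ""
    for text, dep in tokens:
        if dep.endswith("subjpass"):
            patient = text              # passive subject is acquired
        elif dep.endswith("subj"):
            agent = text                # active subject is the acquirer
        elif dep.endswith("obj"):
            if passive:
                agent = text            # object of "by" is the acquirer
            else:
                patient = text          # direct object is acquired
    return agent, relation, patient


# Labels copied from the printed parses of the two Tableau sentences
passive_tokens = [("Tableau", "nsubjpass"), ("was", "auxpass"),
                  ("recently", "advmod"), ("acquired", "ROOT"),
                  ("by", "agent"), ("Salesforce", "pobj"), (".", "punct")]
active_tokens = [("Salesforce", "nsubj"), ("recently", "advmod"),
                 ("acquired", "ROOT"), ("Tableau", "dobj"), (".", "punct")]

print(extract_relation(passive_tokens))
print(extract_relation(active_tokens))
```

Both calls should yield the same triple, which is the point: the voice of the sentence no longer changes the extracted relation.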