This file contains notes on **EntityRuler** and **Matcher**.


-------

# EntityRuler


spaCy offers a few different methods for performing rules-based NER.

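The **`EntityRuler`** is a spaCy pipeline component, created by the `entity_ruler` factory, that allows one to create a set of patterns with corresponding labels.

A **factory** in spaCy is a registered function that creates a pipeline component. spaCy comes with a set of built-in factories, and `nlp.add_pipe` looks them up by their string name.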

The full documentation of spaCy EntityRuler can be found here: https://spacy.io/api/entityruler 
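
As a quick illustration of the factory idea described above, the sketch below checks that a factory is registered under the name `entity_ruler` and then adds the component to a pipeline. It assumes spaCy v3 and uses a blank English pipeline (`spacy.blank("en")`) so that no model download is needed; the blank pipeline is an assumption of this sketch, not part of the notes.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# has_factory reports whether a factory is registered under this name
print(nlp.has_factory("entity_ruler"))

# add_pipe looks up the factory by name and returns the created component
ruler = nlp.add_pipe("entity_ruler")

# The new component now appears in the pipeline
print(nlp.pipe_names)
```

The string passed to `add_pipe` is the factory name, not the class name, which is why the notes use `"entity_ruler"` rather than `EntityRuler`.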

-------
In the code below, we add a new pipe to spaCy’s off-the-shelf small English model.

The purpose of this EntityRuler is to correctly identify small villages in Poland.


In [2]:
import spacy

# build upon the small model
nlp = spacy.load("en_core_web_sm")

# sample text
text = "The village of Treblinka is in Poland. Treblinka was also an extermination camp."

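# add an EntityRuler to the pipeline, after the statistical "ner" component
# (placed here, it only labels tokens that "ner" left untagged, unless overwrite_ents is set)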
ruler = nlp.add_pipe("entity_ruler", after="ner")

patterns = [
    {"label": "GPE", "pattern": "Treblinka"}
]
ruler.add_patterns(patterns)

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
    


Treblinka GPE
Poland GPE
Treblinka GPE
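
Because the ruler above runs after `ner`, its result depends on what the earlier component already labeled. The sketch below illustrates this with the `overwrite_ents` setting, assuming spaCy v3. To stay runnable without a model download, it uses a blank pipeline and a first EntityRuler as a stand-in for an upstream component that mislabels the village. The component names `upstream_ruler` and `village_ruler` and the helper `treblinka_labels` are hypothetical names chosen for this example.

```python
import spacy


def treblinka_labels(overwrite: bool):
    """Return the entities found when a second ruler runs with or without overwriting."""
    nlp = spacy.blank("en")

    # Stand-in for an upstream component that labels the village incorrectly
    first = nlp.add_pipe("entity_ruler", name="upstream_ruler")
    first.add_patterns([{"label": "PERSON", "pattern": "Treblinka"}])

    # The ruler we care about; overwrite_ents controls whether it may replace
    # entities that were set earlier in the pipeline
    second = nlp.add_pipe(
        "entity_ruler",
        name="village_ruler",
        config={"overwrite_ents": overwrite},
    )
    second.add_patterns([{"label": "GPE", "pattern": "Treblinka"}])

    doc = nlp("The village of Treblinka is in Poland.")
    return [(ent.text, ent.label_) for ent in doc.ents]


print(treblinka_labels(overwrite=False))
print(treblinka_labels(overwrite=True))
```

With the default (`overwrite_ents` set to `False`) the earlier label is kept. An alternative to overwriting is to place the ruler before `ner` with `before="ner"`, so that its patterns are applied first.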


The spaCy EntityRuler also accepts token-based patterns: instead of a plain string, the `pattern` value can be a list of dictionaries, one per token. This allows the user to introduce a variety of complex rules and variations (via, among other things, RegEx).

In [3]:
#Sample text
text = "This is a sample number (555) 555-5555."


#Start from a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [
            {"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
            {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"},
        ],
    }
]
#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

(555) 555-5555 PHONE_NUMBER
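
The pattern above relies on the `SHAPE` attribute. As a sketch of the RegEx route mentioned earlier, the same phone number can be matched with the `REGEX` operator on the token text. This assumes spaCy v3 and the same tokenization as in the cell above; the variable names `three_digits`, `four_digits`, and `phone_ents` are introduced only for this example.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Each REGEX is applied to a single token's text, not to the whole string
three_digits = {"TEXT": {"REGEX": r"^\d{3}$"}}
four_digits = {"TEXT": {"REGEX": r"^\d{4}$"}}

patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [
            {"ORTH": "("}, three_digits, {"ORTH": ")"}, three_digits,
            {"ORTH": "-", "OP": "?"}, four_digits,
        ],
    }
]
ruler.add_patterns(patterns)

doc = nlp("This is a sample number (555) 555-5555.")
phone_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(phone_ents)
```

Note that the regular expressions work token by token, so a single expression cannot span the parentheses, digits, and hyphen at once.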


# spaCy Matcher


In [11]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]

matcher.add("EMAIL_ADDRESS", [pattern])
doc = nlp("This is an email address: wmattingly@aol.com")
matches = matcher(doc)

# each match is a tuple: (match_id, start token, end token)
print(matches)

# look up the string name behind the match_id hash
print(nlp.vocab[matches[0][0]].text)

[(16571425990740197027, 6, 7)]
EMAIL_ADDRESS
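
The raw match tuples are usually more useful once converted back into text. The sketch below, assuming spaCy v3, unpacks each `(match_id, start, end)` tuple, resolves the ID to its string name through `nlp.vocab.strings`, and slices the `Doc` to get the matched span. It uses a blank English pipeline instead of `en_core_web_sm`, which is an assumption made here so the example runs without a model download; `LIKE_EMAIL` is a lexical attribute and does not need a trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

doc = nlp("This is an email address: wmattingly@aol.com")

found = []
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # hash -> string name of the rule
    span = doc[start:end]                # token indices -> Span
    found.append((label, span.text))

print(found)
```

Unlike the EntityRuler, the Matcher does not write to `doc.ents`; it only returns the matches, and it is up to the caller to decide what to do with them.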
