# 5. Using spaCy's EntityRuler

## 5.1. Key Concepts

* pipe

* factory

* EntityRuler

* PhraseMatcher

* Matcher

## 5.2. Introduction to spaCy's EntityRuler

The Python library spaCy offers a few different methods for performing rules-based NER. One such method is via its EntityRuler.

The EntityRuler is a spaCy factory that allows one to create a set of patterns with corresponding labels. A factory in spaCy is a registered function, referenced by a string name such as "entity_ruler", that creates a pipeline component for a set task. In the case of the EntityRuler, the factory allows the user to create an EntityRuler, give it a set of instructions, and then use these instructions to find and label entities.
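
As a minimal sketch of the idea above, assuming spaCy 3.x, the snippet below looks up the factory by its string name, creates an EntityRuler from it on a blank English pipeline, and gives it one instruction. The blank pipeline is an assumption made here so that no trained model is needed; the pattern and sentence are the ones used later in this chapter.

```python
import spacy

# Assumption: spaCy 3.x. A blank pipeline has a tokenizer but no trained components.
nlp = spacy.blank("en")

# Factories are looked up by their registered string name.
factory_is_registered = nlp.has_factory("entity_ruler")

# add_pipe creates the component from the factory and returns it.
ruler = nlp.add_pipe("entity_ruler")

# Each instruction pairs a label with a pattern to look for.
ruler.add_patterns([{"label": "GPE", "pattern": "West Chestertenfieldville"}])

doc = nlp("West Chestertenfieldville was referenced in Mr. Deeds.")
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(factory_is_registered)
print(entities)
```

Because the blank pipeline has no statistical NER, the only entities found are the ones the patterns describe.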

Once the user has created the EntityRuler and given it a set of instructions, the user can then add it to the spaCy pipeline as a new pipe.

A pipe is an individual component of a pipeline. A pipeline's purpose is to take input data, perform some sort of operations on that input data, and then output the results of those operations, either as new data or as extracted metadata. In the case of spaCy, there are a few different pipes that perform different tasks: the tokenizer splits the text into individual tokens, the parser parses the text, and the NER identifies entities and labels them accordingly. All of this data is stored in the Doc object.
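
To make the relationship between pipes and the Doc object concrete, here is a small sketch, assuming spaCy 3.x. It uses a blank pipeline with two rule-based pipes (the built-in "sentencizer" and an "entity_ruler") instead of the trained parser and NER, so it runs without a model download; the two-sentence text is an illustrative assumption.

```python
import spacy

# Assumption: spaCy 3.x, blank English pipeline (tokenizer only).
nlp = spacy.blank("en")

# Each add_pipe call appends one pipe to the pipeline.
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "West Chestertenfieldville"}])

# The tokenizer always runs first and is not listed among the named pipes.
pipe_names = nlp.pipe_names

# Running the pipeline returns a single Doc that holds every pipe's output.
doc = nlp("West Chestertenfieldville is small. It is fictional.")
sentences = [sent.text for sent in doc.sents]            # set by the sentencizer
entities = [(ent.text, ent.label_) for ent in doc.ents]  # set by the entity ruler

print(pipe_names)
print(sentences)
print(entities)
```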

It is important to remember that pipelines are sequential. This means that components earlier in a pipeline affect what later components receive. Sometimes this sequence is essential, meaning later pipes depend on earlier pipes. At other times, this sequence is not essential, meaning later pipes can function without earlier pipes. It is important to keep this in mind as you create custom spaCy models (or any pipeline for that matter).
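
The effect of ordering can be sketched with two entity rulers that compete for the same tokens. This assumes spaCy 3.x and the EntityRuler's default behaviour of not overwriting entities that an earlier pipe has already set. The helper `run` and the pipe names `film_ruler` and `person_ruler` are hypothetical names chosen for this illustration.

```python
import spacy

text = "West Chestertenfieldville was referenced in Mr. Deeds."

# Hypothetical patterns: both rulers want to label the token "Deeds".
PATTERNS = {
    "film_ruler": [{"label": "FILM", "pattern": "Mr. Deeds"}],
    "person_ruler": [{"label": "PERSON", "pattern": "Deeds"}],
}

def run(order):
    """Build a blank pipeline with the rulers added in the given order."""
    nlp = spacy.blank("en")
    for name in order:
        # Two pipes from the same factory need distinct names.
        ruler = nlp.add_pipe("entity_ruler", name=name)
        ruler.add_patterns(PATTERNS[name])
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

# The ruler that runs first claims the tokens; the later one skips the overlap.
film_first = run(["film_ruler", "person_ruler"])
person_first = run(["person_ruler", "film_ruler"])

print(film_first)
print(person_first)
```

The same reasoning applies when an EntityRuler is placed before or after a trained NER pipe: whichever pipe runs first decides the labels for the tokens it claims.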

## 5.3. Demonstration of EntityRuler in Action

In [65]:
import spacy

In [66]:
nlp = spacy.load("en_core_web_sm")
text = "West Chestertenfieldville was referenced in Mr. Deeds."

In [67]:
doc = nlp(text)

In [68]:
for ent in doc.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


In [69]:
ruler = nlp.add_pipe("entity_ruler")

In [70]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [71]:
patterns = [
      {"label": "GPE", "pattern": "West Chestertenfieldville"}
]

In [72]:
ruler.add_patterns(patterns)

In [73]:
doc2 = nlp(text)
for ent in doc2.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


In [74]:
nlp2 = spacy.load("en_core_web_sm")

In [75]:
ruler = nlp2.add_pipe("entity_ruler", before="ner")

In [76]:
ruler.add_patterns(patterns)

In [77]:
doc = nlp2(text)

In [78]:
for ent in doc.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


In [79]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [80]:
nlp3 = spacy.load("en_core_web_sm")

In [81]:
ruler = nlp3.add_pipe("entity_ruler", before="ner")

In [82]:
patterns = [
    {"label": "GPE", "pattern": "West Chestertenfieldville"},
    {"label": "FILM", "pattern": "Mr. Deeds"}
]

In [83]:
ruler.add_patterns(patterns)

In [84]:
doc = nlp3(text)

In [85]:
for ent in doc.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Mr. Deeds FILM


## 5.4. Introducing Complex Rules and Variance to the EntityRuler (Advanced)

In some instances, the entities for a label vary in form but still follow a distinct pattern or set of patterns. One such example (included in the spaCy documentation) is phone numbers. In the United States, phone numbers take a few forms. The standard formal style is (xxx)-xxx-xxxx, but it is not uncommon to see xxx-xxx-xxxx or xxxxxxxxxx. If the owner of the phone number is giving that same number to someone outside the US, it is typically written with the country code, as +1(xxx)-xxx-xxxx.

If you are working with United States data, you can pass regular expressions (RegEx) to the pattern matcher to capture all of these forms.
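
As a rough illustration of what such an expression might look like, the sketch below uses Python's standard `re` module on raw text, not spaCy. The expression `PHONE_RE` is an assumption written for the four forms listed above and is not taken from the spaCy documentation. It is deliberately permissive, so it is a starting point and not a validator.

```python
import re

# Hypothetical expression covering the forms described above:
# optional "+1", optional parentheses around the area code,
# optional hyphen or space after the area code, optional hyphen before the last four digits.
PHONE_RE = re.compile(r"(?:\+1)?\(?\d{3}\)?[-\s]?\d{3}-?\d{4}")

forms = [
    "(555)-555-5555",
    "555-555-5555",
    "5555555555",
    "+1(555)-555-5555",
]

# fullmatch checks that the whole string is one phone number.
all_forms_match = all(PHONE_RE.fullmatch(form) for form in forms)

# findall pulls every phone-like substring out of running text.
found = PHONE_RE.findall("Call 555-555-5555 or +1(555)-555-5555.")

print(all_forms_match)
print(found)
```

Note that spaCy's token patterns apply a regular expression to one token at a time, so an expression like this one, which spans several tokens, is shown here on plain strings.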

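The spaCy EntityRuler also allows the user to introduce a variety of complex rules and variances (via, among other things, RegEx) by passing the rules to the pattern. Instead of a single string, a pattern can be a list of dictionaries, one per token, and each dictionary can test token attributes such as ORTH (the exact text) or SHAPE (the character shape, for example ddd for three digits), with OP controlling how many times a token may occur.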

In [86]:
text = "This is a sample number (555) 555-5555"

In [87]:
# Start from a blank English pipeline (tokenizer only, no trained components)
nlp4 = spacy.blank("en")

In [89]:
# Create the ruler and add it
ruler = nlp4.add_pipe("entity_ruler")

In [90]:
# List of entities and patterns
patterns = [
    {"label": "PHONE_NUMBER", "pattern": [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
                                          {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
]

In [91]:
# add patterns to ruler
ruler.add_patterns(patterns)

In [92]:
doc3 = nlp4(text)

In [94]:
# extract entities
for ent in doc3.ents:
  print(ent.text, ent.label_)

(555) 555-5555 PHONE_NUMBER
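
The pattern above covers only the parenthesised form. As a hedged sketch, assuming spaCy 3.x and its default English tokenizer (which splits digits around a hyphen, as the match above suggests), two further patterns could cover the xxx-xxx-xxxx and xxxxxxxxxx forms. The sample sentence is an illustrative assumption.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    # xxx-xxx-xxxx: five tokens, because each hyphen becomes its own token.
    {"label": "PHONE_NUMBER", "pattern": [
        {"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "ddd"},
        {"ORTH": "-"}, {"SHAPE": "dddd"}]},
    # xxxxxxxxxx: one token. SHAPE collapses long digit runs to "dddd",
    # so a token-level REGEX is used to require exactly ten digits.
    {"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"^\d{10}$"}}]},
]
ruler.add_patterns(patterns)

doc = nlp("Call 555-555-5555 or 5555555555 today.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

The REGEX operator is applied to a single token's text, which is why the ten-digit form can use it directly while the hyphenated form is described token by token.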
