# Entity Ruler

A **factory** in spaCy is a set of classes and functions that ships with spaCy and performs a specific task.

In the case of the **EntityRuler**, the factory allows the user to create an EntityRuler, give it a set of instructions, and then use these instructions to find and label entities.

\
**Once the user has created the EntityRuler and given it a set of instructions, the user can then add it to the spaCy pipeline as a new pipe.**

### Pipes and Pipelines

A pipe is an individual component of a pipeline.

There are different pipes that perform different tasks:
- **The tokenizer** splits the text into individual tokens.
- **The parser** parses the text, assigning dependency relations between tokens.
- **The NER** identifies entities and labels them accordingly.

\
**The sequence of a pipeline is crucial as later pipes can depend on earlier pipes.** This is important when creating custom spaCy pipelines.
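As a minimal sketch of how pipe order can be inspected and controlled, the snippet below uses a blank English pipeline (an assumption made here so that no trained model is required) and places one pipe before another with the `before` argument of `nlp.add_pipe`.

```python
import spacy

# A blank English pipeline has a tokenizer but no other pipes,
# so this sketch runs without downloading a trained model.
nlp = spacy.blank("en")

# By default, a new pipe is appended to the end of the pipeline.
nlp.add_pipe("sentencizer")

# `before=` places the new pipe ahead of an existing one.
nlp.add_pipe("entity_ruler", before="sentencizer")

# pipe_names lists the pipes in the order they will run.
print(nlp.pipe_names)
```

`nlp.pipe_names` is a quick way to confirm the order before running any text through the pipeline.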

### Creating EntityRuler as a new pipe
spaCy models do not come with a pre-built EntityRuler.

The user must create an EntityRuler as a new pipe, give it instructions, and then add it to the model.

In [2]:
import spacy

### Rules-based vs Machine Learning-based Approach

**Rules-based approach**

This works well when the rules can be written down explicitly and reliably return true positives.

One example is the different ways a date can be written.

e.g. "Jan 1 2024", "January 1 2024", "01/01/24", "1st January 2024", etc.
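To make the rules-based idea concrete, here is a hedged sketch that covers only two of the date formats listed above ("Jan 1 2024" and "January 1 2024"). The `months` list, the `DATE` label, and the sample sentence are illustrative assumptions; a real rule set would need more patterns to handle the other formats.

```python
import spacy

# Blank pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative subset of month spellings; extend as needed.
months = ["jan", "january"]

# One token pattern: a month name, then a day number, then a year number.
date_patterns = [
    {
        "label": "DATE",
        "pattern": [
            {"LOWER": {"IN": months}},
            {"IS_DIGIT": True},
            {"IS_DIGIT": True},
        ],
    }
]
ruler.add_patterns(date_patterns)

doc = nlp("We met on Jan 1 2024 and again on January 1 2024")
dates = [(ent.text, ent.label_) for ent in doc.ents]
print(dates)
```

Because each rule states exactly which token sequence counts as a date, every match is predictable, which is the main appeal of the rules-based approach.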

\
**Machine learning-based approach**

This is used when we cannot simply apply rules for what is or isn't a certain type of entity.

An example would be recognizing names.

We cannot possibly have a list of every possible first name, last name, prefix, and suffix in the world.

This is where the machine learning-based approach would be used instead.

In [17]:
nlp = spacy.load("en_core_web_sm")
text = "West Chestertenfieldville was referenced in Mr. Deeds."

In [18]:
doc = nlp(text)

In [19]:
for ent in doc.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


**In an older version, West Chestertenfieldville was labeled as a PERSON, not a GPE.**

Mr. Deeds is a movie, not a person, and the model only picked up "Deeds".

We need to fix this using an EntityRuler.

In [20]:
ruler = nlp.add_pipe("entity_ruler")

In [21]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

We see in the above output that the entity_ruler has now been added to the model as the last pipe, after the ner.

### Adding patterns to the pipeline

Patterns are passed as a list of dictionaries, each defining the label the EntityRuler assigns when it finds text matching a given pattern.

In [43]:
patterns = [
    {"label": "GPE", "pattern": "West Chestertenfieldville"},
    {"label": "FILM", "pattern": "Mr. Deeds"}
]

In [44]:
ruler.add_patterns(patterns)

In [45]:
doc2 = nlp(text)
for ent in doc2.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


### Order of pipes matters!

**The code above does not change the labels because the NER comes before the EntityRuler in the pipeline.**

The NER has already identified and labeled the entities by the time the EntityRuler runs, and by default the EntityRuler does not overwrite entities that are already set.

### What is the solution?
There are 2 ways to solve this.

1. We can give the EntityRuler the ability to overwrite the NER.

2. We can put the EntityRuler before the NER.
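The first option can be sketched with the `overwrite_ents` setting of the `entity_ruler` factory. To keep the example self-contained, it assumes a blank pipeline and uses a second EntityRuler (named `stand_in_ner` here, a hypothetical name) in place of the statistical NER; the pipe names and the `build` helper are illustrative only.

```python
import spacy

text = "West Chestertenfieldville was referenced in Mr. Deeds."


def build(overwrite):
    """Build a small pipeline; `overwrite` toggles overwrite_ents on the last pipe."""
    nlp = spacy.blank("en")

    # Stand-in for the NER: runs first and labels "Deeds" as PERSON.
    stand_in = nlp.add_pipe("entity_ruler", name="stand_in_ner")
    stand_in.add_patterns([{"label": "PERSON", "pattern": "Deeds"}])

    # The ruler we care about runs last, like the one added after "ner" above.
    ruler = nlp.add_pipe(
        "entity_ruler",
        name="film_ruler",
        config={"overwrite_ents": overwrite},
    )
    ruler.add_patterns([{"label": "FILM", "pattern": "Mr. Deeds"}])
    return nlp


results = {}
for overwrite in (False, True):
    doc = build(overwrite)(text)
    results[overwrite] = [(ent.text, ent.label_) for ent in doc.ents]
    print(overwrite, results[overwrite])
```

With `overwrite_ents` left at `False`, the later ruler skips any match that overlaps an existing entity; with `True`, its match replaces the overlapping entity. With a trained model, the same setting would be passed as `nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})`.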

In [46]:
nlp2 = spacy.load("en_core_web_sm")

In [47]:
ruler = nlp2.add_pipe("entity_ruler", before="ner")

In [48]:
ruler.add_patterns(patterns)

In [49]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [50]:
doc3 = nlp2(text)
for ent in doc3.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville GPE
Mr. Deeds FILM


**We can see that the entity_ruler now comes before the NER in the above pipeline.**

**We can see above that the entities are correctly recognized as GPE and as FILM.**

### spaCy can do a lot more
Patterns are not limited to exact string matches: a pattern can also describe a sequence of linguistic features, and any span matching those features is given the label we choose.
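As a rough illustration of a feature-based pattern, the sketch below matches the word "west" in any casing followed by a title-cased token. It assumes a blank pipeline, so only lexical attributes such as `LOWER` and `IS_TITLE` are available; attributes like `POS` or `LEMMA` would need a pipeline whose earlier pipes set them. The sample sentence is made up for the example.

```python
import spacy

# Blank pipeline: only lexical token attributes are available.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Each inner dictionary describes one token by its features, not its exact text.
ruler.add_patterns(
    [
        {
            "label": "GPE",
            "pattern": [{"LOWER": "west"}, {"IS_TITLE": True}],
        }
    ]
)

doc = nlp("She moved from West Chestertenfieldville to west Springfield")
feature_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(feature_ents)
```

A single feature-based pattern like this can cover place names that were never listed explicitly, at the cost of possible false positives.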