# spaCy EntityRuler

EntityRuler is a component in spaCy that allows us to include or modify named entities using pattern matching rules.

EntityRuler lets us add named entities to a Doc container. It can be used on its own or combined with EntityRecognizer (spaCy's pipeline component for statistical named-entity recognition) to boost accuracy.

We add named entities to a Doc container by defining entity patterns.

Entity patterns are dictionaries with two keys: "label", which specifies the label to assign to the entity when the pattern matches, and "pattern", which describes what to match (either a string or a list of token dictionaries).

The entity ruler accepts two types of patterns: phrase entity and token entity patterns. 

- A phrase entity pattern is used for exact string matches. For example:

  `{'label':'ORG','pattern':'Microsoft'}`

- A token entity pattern uses one dictionary to describe one token. For example:

  `{'label':'GPE','pattern':[{'LOWER':'san'},{'LOWER':'francisco'}]}`

Both kinds of patterns can be combined in a single list and added to the same EntityRuler.

The EntityRuler can be added to a spaCy pipeline with the `.add_pipe()` method by passing the name "entity_ruler". When the nlp object is called on a text, it finds matches in the Doc and adds them as entities to `doc.ents`, using the label specified in the pattern as the entity label.

In [1]:
# import required libraries
import spacy

In [2]:
# let us add entity ruler

nlp = spacy.blank('en')
entity_ruler = nlp.add_pipe('entity_ruler')

pattern = [{'label':'ORG', 'pattern':'Microsoft'},
           {'label':'GPE', 'pattern':[{'LOWER':'san'},{'LOWER':'francisco'}]}]

entity_ruler.add_patterns(pattern)

In [3]:
# let us now test above rule on a sample text
doc = nlp('Microsoft is hiring software developers in San Francisco.')

print([(ent.text, ent.label_) for ent in doc.ents])

[('Microsoft', 'ORG'), ('San Francisco', 'GPE')]


In [4]:
# let us now test the same rule on text with a lowercase "microsoft"
doc = nlp('microsoft is hiring software developers in San Francisco.')

print([(ent.text, ent.label_) for ent in doc.ents])

[('San Francisco', 'GPE')]


The code above does not match "microsoft" because the phrase pattern 'Microsoft' is an exact, case-sensitive string match, so the lowercase form is not recognized.
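
Besides rewriting the phrase pattern as a token pattern, the exact-match behaviour of string patterns can also be relaxed on the ruler itself. The sketch below assumes spaCy v3, where the entity ruler accepts a `phrase_matcher_attr` config option; the sample sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank('en')

# Assumption: spaCy v3, where 'phrase_matcher_attr' is an entity_ruler config option.
# 'LOWER' makes string (phrase) patterns match on the lowercase form of each token.
ruler = nlp.add_pipe('entity_ruler', config={'phrase_matcher_attr': 'LOWER'})

ruler.add_patterns([
    {'label': 'ORG', 'pattern': 'Microsoft'},
    {'label': 'GPE', 'pattern': 'San Francisco'},
])

doc = nlp('microsoft is hiring software developers in SAN FRANCISCO.')
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With this setting, both string patterns should match regardless of case, while the entity text keeps the casing found in the document.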

In [5]:
# let us redefine the pattern to be case insensitive

nlp = spacy.blank('en')
entity_ruler = nlp.add_pipe('entity_ruler')

pattern = [{'label':'ORG', 'pattern':[{'LOWER':'microsoft'}]},
           {'label':'GPE', 'pattern':[{'LOWER':'san'},{'LOWER':'francisco'}]}]

entity_ruler.add_patterns(pattern)

In [6]:
doc = nlp('Microsoft is hiring software developers in San Francisco.')

print([(ent.text, ent.label_) for ent in doc.ents])

[('Microsoft', 'ORG'), ('San Francisco', 'GPE')]


In [7]:
# Let us try with another example
nlp_sm = spacy.load('en_core_web_sm')
doc = nlp_sm('Taj Mahal is in Agra')

for ent in doc.ents:
    print('Text {0} is {1}'.format(ent.text, ent.label_))

Text Taj Mahal is PERSON
Text Agra is GPE


As we saw above, the existing model does not detect the entities correctly: it labels Taj Mahal as PERSON. We can correct this with the entity ruler.

In [8]:
# Let us add entity ruler to nlp pipeline
ruler = nlp_sm.add_pipe('entity_ruler', after='ner')

If we add the ruler after the existing ner component by setting the `after` argument of `.add_pipe()` to "ner", the entity ruler only adds entities to `doc.ents` if they don’t overlap with entities already predicted by the model.

In [9]:
# Let us test with the same text
patterns = [
    {'label': 'FAC', 'pattern': 'Taj Mahal'},
    {'label': 'GPE', 'pattern': 'Agra'}
]
ruler.add_patterns(patterns)

doc = nlp_sm('Taj Mahal is in Agra')

for ent in doc.ents:
    print('Text {0} is {1}'.format(ent.text, ent.label_))

Text Taj Mahal is PERSON
Text Agra is GPE


The output is unchanged: Taj Mahal is still labeled PERSON, because the ruler's match overlaps the entity already predicted by the model.

However, if we add the EntityRuler before the ner component by setting the `before` argument of `.add_pipe()` to "ner", the entity recognizer respects the entity spans already set by the ruler and adjusts its predictions around them. In our case, this lets Taj Mahal be recognized as FAC and can improve accuracy.

In [12]:
# Let us remove the entity ruler and add it again before ner
nlp_sm.remove_pipe('entity_ruler')
ruler = nlp_sm.add_pipe('entity_ruler', before='ner')

In [13]:
# Let us test with the same text
patterns = [
    {'label': 'FAC', 'pattern': 'Taj Mahal'},
    {'label': 'GPE', 'pattern': 'Agra'}
]
ruler.add_patterns(patterns)

doc = nlp_sm('Taj Mahal is in Agra')

for ent in doc.ents:
    print('Text {0} is {1}'.format(ent.text, ent.label_))

Text Taj Mahal is FAC
Text Agra is GPE
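
Moving the ruler before ner is one way to let the rules win. Another option, assuming spaCy v3, is to keep the ruler later in the pipeline and set its `overwrite_ents` config option so that its matches replace overlapping entities. The sketch below is self-contained and does not load a trained model: a first entity ruler (given the hypothetical name `stand_in_ner`) plays the role of the model that mislabels Taj Mahal, and the helper `build_pipeline` is introduced only for this illustration.

```python
import spacy

def build_pipeline(overwrite):
    """Blank pipeline with two rulers; the first imitates a model's wrong prediction."""
    nlp = spacy.blank('en')
    # Hypothetical stand-in for the statistical ner component: labels Taj Mahal as PERSON
    first = nlp.add_pipe('entity_ruler', name='stand_in_ner')
    first.add_patterns([{'label': 'PERSON', 'pattern': 'Taj Mahal'}])
    # The correcting ruler runs afterwards, like a ruler added with after='ner'
    second = nlp.add_pipe('entity_ruler', name='correcting_ruler',
                          config={'overwrite_ents': overwrite})
    second.add_patterns([
        {'label': 'FAC', 'pattern': 'Taj Mahal'},
        {'label': 'GPE', 'pattern': 'Agra'},
    ])
    return nlp

text = 'Taj Mahal is in Agra'
kept = [(e.text, e.label_) for e in build_pipeline(False)(text).ents]
replaced = [(e.text, e.label_) for e in build_pipeline(True)(text).ents]
print(kept)      # default: the earlier PERSON span is kept
print(replaced)  # overwrite_ents=True: the later ruler replaces it with FAC
```

The same config could be passed when adding the ruler after a trained ner component, in which case rule matches would take precedence over overlapping model predictions.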


In [19]:
# Get the NER component
ner = nlp_sm.get_pipe("ner")

# Print the entity labels
print("Entity labels in the model:")
print(ner.labels)

Entity labels in the model:
('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')
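
When choosing a label for a pattern, it can help to check what each of the model's labels is meant to cover. A minimal sketch using `spacy.explain`, which looks a label up in spaCy's built-in glossary; the choice of labels below is just the ones used in this walkthrough.

```python
import spacy

# Look up the glossary description for a few of the labels used above
descriptions = {}
for label in ('FAC', 'GPE', 'ORG', 'PERSON'):
    descriptions[label] = spacy.explain(label)
    print(label, '->', descriptions[label])
```

For labels that are not in the glossary, such as custom labels of our own, `spacy.explain` returns `None`.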
