#### Key Concepts
*> pipe*

*> factory*

*> EntityRuler*

*> PhraseMatcher*

*> Matcher*

#### Introduction to spaCy's **EntityRuler**
*spaCy offers a few different methods for performing rules-based NER. One of them is EntityRuler.*

***EntityRuler** is a spaCy factory that lets you create a set of patterns with corresponding labels. A **factory** is a set of classes and functions preloaded in spaCy that perform set tasks. With EntityRuler, the user creates a ruler, gives it a set of instructions, and uses these instructions to find and label entities. Once an EntityRuler is created, the user can add it to the spaCy pipeline as a new pipe.*

*A **pipe** is an individual component of a pipeline. A **pipeline** takes input data, performs some sort of operation on it, and then outputs the result either as new data or as extracted metadata. In spaCy, different pipes perform different tasks: the **tokenizer** splits the text into individual tokens, the **parser** parses the text, and the **NER** identifies entities and labels them accordingly.*
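
*As a small illustration of pipes and pipelines, the sketch below starts from a blank English pipeline (an assumption made here so that no trained model is needed) and adds a single built-in pipe, the `sentencizer`. Note that the tokenizer always runs first and is not listed in `nlp.pipe_names`.*

```python
import spacy

# Blank English pipeline: only the tokenizer, no pipes yet
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Add one built-in pipe that marks sentence boundaries
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc = nlp("Dhaka is the capital of Bangladesh. Chittagong is the port city.")

# The tokenizer produced the tokens; the sentencizer assigned the sentences
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
```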

*It is important to remember that pipelines are sequential. Sometimes the order is essential, because later pipes may depend on earlier pipes; sometimes it is not. Keep this in mind when creating a custom spaCy model.*

*The full documentation of spaCy EntityRuler can be found here: https://spacy.io/api/entityruler*

#### Demonstration of **EntityRuler**
*Here, we will introduce a new pipe into spaCy's off-the-shelf small English model.*

In [1]:
# Import the requisite library
import spacy

# Build upon the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The Dhaka University is situated in Dhaka, Bangladesh. And Chittagong is called the port city."

# Create the Doc object
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

The Dhaka University ORG
Dhaka GPE
Bangladesh GPE
Chittagong PERSON


*Often, off-the-shelf models fail in the domains in which we wish to deploy them because they have not been trained on domain-specific texts. We can resolve this either with an EntityRuler or by training a new model.*

*Now, let's fix the problem by giving the model instructions to correctly identify Chittagong. For simplicity, we will use spaCy's GPE label.*

In [2]:
# Import the libraries
import spacy

# Build the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

# Sample Text
text = "The Dhaka University is situated in Dhaka, Bangladesh. And Chittagong is called the port city."

# Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

# List of Entities and Patterns
patterns = [{"label":"GPE", "pattern":"Chittagong"}]
ruler.add_patterns(patterns)

doc = nlp(text)

# Extract Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

The Dhaka University ORG
Dhaka GPE
Bangladesh GPE
Chittagong PERSON


*If you executed the code above and got the same output, then you did everything correctly. The method has failed. Why? The answer comes back to the concept of pipelines. We created the EntityRuler and added it to the spaCy model's pipeline, but by default spaCy adds a new pipe to the end of the pipeline. The "ner" pipe has therefore already labeled Chittagong by the time the EntityRuler runs, and by default the EntityRuler does not overwrite existing entities.*

*To inspect the pipeline, let's use spaCy's **analyze_pipes()**.*

In [3]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

*This can be a bit difficult to read at first, but it shows the order in which our pipes are set up, along with a few other key pieces of information about each pipe. If we locate "ner", we notice that "entity_ruler" comes after it.*
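
*A more compact way to check the order is `nlp.pipe_names`, which lists only the pipe names in the order they run. The sketch below uses a blank pipeline with built-in rule-based pipes as stand-ins (an assumption, so that it runs without a trained model) and shows that a new pipe is appended to the end unless a position is given.*

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# No position given: the new pipe goes to the end of the pipeline
nlp.add_pipe("entity_ruler")
print(nlp.pipe_names)

# Explicit position: insert a pipe before an existing one
nlp.add_pipe("attribute_ruler", before="entity_ruler")
print(nlp.pipe_names)
```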

*For our EntityRuler to have primacy, it has to run before the "ner" pipe. The cell below instead places it after "ner", which is still the end of the pipeline, so the result does not change.*

In [4]:
# Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The Dhaka University is situated in Dhaka, Bangladesh. And Chittagong is called the port city."

# Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler", after="ner")

# List of Entities and Patterns
patterns = [{"label":"GPE", "pattern":"Chittagong"}]

ruler.add_patterns(patterns)

doc = nlp(text)
# Extract Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

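# Note: the label of Chittagong does not change. Even with after="ner", the
# EntityRuler still runs after the statistical NER, and by default it does
# not overwrite entities that are already set.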

The Dhaka University ORG
Dhaka GPE
Bangladesh GPE
Chittagong PERSON


In [5]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent
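
*The behavior above can be reproduced, and fixed, without a trained model. In the sketch below, a first EntityRuler named `ner_stand_in` is a hypothetical stand-in for the statistical "ner" pipe and always labels Chittagong as PERSON; the helper `label_for_chittagong` is likewise made up for illustration. It shows two ways to give the custom ruler primacy: placing it before the other pipe, or setting the EntityRuler option `overwrite_ents` to true.*

```python
import spacy

TEXT = "And Chittagong is called the port city."


def label_for_chittagong(position: dict, overwrite: bool = False) -> str:
    """Return the label Chittagong ends up with for a given ruler setup."""
    nlp = spacy.blank("en")

    # Hypothetical stand-in for "ner": always (wrongly) predicts PERSON
    stand_in = nlp.add_pipe("entity_ruler", name="ner_stand_in")
    stand_in.add_patterns([{"label": "PERSON", "pattern": "Chittagong"}])

    # The custom ruler carrying the correct label
    ruler = nlp.add_pipe(
        "entity_ruler",
        name="custom_ruler",
        config={"overwrite_ents": overwrite},
        **position,
    )
    ruler.add_patterns([{"label": "GPE", "pattern": "Chittagong"}])

    doc = nlp(TEXT)
    return doc.ents[0].label_


# After the stand-in, default settings: the existing entity is kept
print(label_for_chittagong({"after": "ner_stand_in"}))
# Before the stand-in: the ruler's entity is set first and is kept
print(label_for_chittagong({"before": "ner_stand_in"}))
# After the stand-in, but allowed to overwrite existing entities
print(label_for_chittagong({"after": "ner_stand_in"}, overwrite=True))
```

*With the trained model, the equivalent change would be `nlp.add_pipe("entity_ruler", before="ner")`.*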

#### Complex Rules and Variance to the EntityRuler
*In some instances, labels may show a set type of variance that follows a distinct pattern or set of patterns. One such example is phone numbers. In Bangladesh, phone numbers take a few forms. The standard formal form is (xxx)xxxxx-xxxxxx.*

*The spaCy EntityRuler allows the user to introduce a variety of complex rules and variance (via RegEx) by passing the rules to the pattern. For phone numbers, for example, RegEx expressions in the pattern can capture the different written forms. Many other arguments can be passed to the patterns; for a complete list, see: https://spacy.io/usage/rule-based-matching.*
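
*As a minimal sketch of a RegEx rule inside a pattern, the example below uses the `REGEX` operator on a token's text. The sample sentence and the 11-digit format are assumptions made for illustration. It also prints the token's shape, since spaCy truncates a shape after four identical characters, which makes `SHAPE` unsuitable for telling long digit runs apart.*

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# One token consisting of exactly 11 digits (assumed format)
patterns = [
    {"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"^\d{11}$"}}]}
]
ruler.add_patterns(patterns)

doc = nlp("Please call 01313406600 for any query.")

for ent in doc.ents:
    print(ent.text, ent.label_)

# Shapes are truncated: an 11-digit token has the shape "dddd"
print(doc[2].shape_)
```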

*In the example below, we adapt an example from the spaCy documentation to extract a phone number from a text. The same task can be done via RegEx as well.*

In [10]:
# Import the requisite library
import spacy

# Sample text
text = "Any query about Alchemy Software Limited, contact with the number (+88)01313-406600"

# Create a blank English pipeline
nlp = spacy.blank("en")

# Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

# List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [{"label":"PHONE_NUMBER", "pattern":[{"ORTH":"("}, {"ORTH":"+"}, {"SHAPE":"dd"}, {"ORTH":")"}, {"SHAPE":"ddddd"}, {"ORTH":"-", "OP":"?"}, {"SHAPE":"dddddd"}]}]

# Add patterns to Ruler
ruler.add_patterns(patterns)

# Create the doc
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)


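# Note: this pattern does not match, so nothing is printed. spaCy truncates
# a token's SHAPE after four identical characters, so "ddddd" and "dddddd"
# never occur; the tokenizer may also split the number differently than the
# pattern assumes. Print [t.text for t in doc] to check the actual tokens.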
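
*As noted above, the same task can also be done with RegEx. The sketch below is one possible approach, not the method from the spaCy documentation: it runs a regular expression over the whole text and converts each match into an entity with `Doc.char_span`, so it does not depend on how the tokenizer splits the number. The expression is an assumption based only on the sample text.*

```python
import re

import spacy

text = "Any query about Alchemy Software Limited, contact with the number (+88)01313-406600"

# Assumed format, taken from the sample text only: (+dd)ddddd-dddddd,
# with the hyphen optional
PHONE_RE = re.compile(r"\(\+\d{2}\)\d{5}-?\d{6}")

nlp = spacy.blank("en")
doc = nlp(text)

spans = []
for match in PHONE_RE.finditer(doc.text):
    start, end = match.span()
    # "expand" snaps the character range to whole tokens
    span = doc.char_span(start, end, label="PHONE_NUMBER", alignment_mode="expand")
    if span is not None:
        spans.append(span)

# Store the matches as the document's entities
doc.ents = spans

for ent in doc.ents:
    print(ent.text, ent.label_)
```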

#### Other Rules-Based Matching Techniques
*There are two other rules-based methods in spaCy: **Matcher** and **PhraseMatcher**. We have already met the **Matcher** in **Rules-Based Matching**. We will meet other, more complex rules-based matching methods later.*
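
*For a first impression of the PhraseMatcher, here is a minimal sketch on a blank pipeline. The key `CITY` and the term list are hypothetical choices made for this illustration.*

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Phrases are passed to the matcher as Doc objects
terms = ["Dhaka", "Chittagong"]
matcher.add("CITY", [nlp.make_doc(term) for term in terms])

doc = nlp(
    "The Dhaka University is situated in Dhaka, Bangladesh. "
    "And Chittagong is called the port city."
)

# Each match is a (match_id, start_token, end_token) triple
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```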