### Importance of Custom Rules

- spaCy's NLP models have been trained on millions of text samples, but natural-language text is so varied that no model can be trained on every kind of data.
- We often come across documents containing text that the model was not trained on. In such cases, we need custom rules to work with the data.

### Custom Rule 1: Expanding the named entities

- For example, the corpus that spaCy's English models were trained on defines a PERSON entity as just the person's name, without titles like "Mr." and "Dr.". This makes sense, because it makes it easier to resolve the entity type back to a knowledge base. But what if your application requires full names, including the titles?

- Example of a full name with a title: Mr. Ashish Mehta

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp("Dr. Alex Smith chaired first board meeting at Google")

In [4]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Google', 'ORG')]


- As the output above shows, spaCy did not include "Dr." in the PERSON entity, so we need a custom rule for this scenario.

In [13]:
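def add_title(doc):
    new_ents = []
    for ent in doc.ents:
        # Only PERSON entities that do not start at the first token are considered
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.", "Mrs", "Mrs."):
                # Extend the span one token to the left so it includes the title
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                # PERSON entity without a preceding title: keep it unchanged
                new_ents.append(ent)
        # Entities that fail the outer check are never appended, so they are dropped
    doc.ents = new_ents
    return doc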

In [14]:
nlp = spacy.load("en_core_web_sm")
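# Run the custom component right after the named entity recognizer.
# Passing the function object directly is the spaCy v2 calling style.
nlp.add_pipe(add_title, after="ner")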

In [15]:
doc = nlp("Dr. Alex Smith chaired first board meeting at Google")

In [16]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('Dr. Alex Smith', 'PERSON')]


- The rule-based function above rebuilds `doc.ents` from PERSON entities only, extending each one to include a preceding title. Entities with any other label (here 'first' and 'Google') are not copied over, so they no longer appear in the output.
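
If the other entities should survive, one option is to copy every entity through and only widen the PERSON spans that follow a title. The following is a minimal sketch under that assumption. The names `add_title_keep_all` and `TITLES` are illustrative and not part of the original notebook, and the `Doc` is built by hand with preset entities so the example runs without loading a trained model.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hypothetical constant: the same titles the rule above checks for
TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.", "Mrs", "Mrs."}

def add_title_keep_all(doc):
    """Widen PERSON entities to include a preceding title; keep all others as-is."""
    new_ents = []
    for ent in doc.ents:
        has_title = (
            ent.label_ == "PERSON"
            and ent.start != 0
            and doc[ent.start - 1].text in TITLES
        )
        if has_title:
            new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label_))
        else:
            # Every other entity (any label) is copied through unchanged
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Hand-built Doc with the entities the model reported earlier (assumed, not predicted here)
words = ["Dr.", "Alex", "Smith", "chaired", "first", "board", "meeting", "at", "Google"]
doc = Doc(Vocab(), words=words)
doc.ents = [
    Span(doc, 1, 3, label="PERSON"),
    Span(doc, 4, 5, label="ORDINAL"),
    Span(doc, 8, 9, label="ORG"),
]

doc = add_title_keep_all(doc)
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

The only structural change from the earlier function is that the `else` branch now covers every entity that is not a titled PERSON, so nothing is silently dropped.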

### Use of POS and Dependency Parsing 

In [17]:
nlp = spacy.load("en_core_web_sm")

In [18]:
doc = nlp("Alex Smith was working at Google")

In [21]:
displacy.render(doc, style="dep", options={"compact": True, "distance": 100})

- We will extract the organizations that a person works at or has previously worked at.

In [40]:
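def get_person_orgs(doc):
    person_entity = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entity:
        # The syntactic head of the person's name, e.g. the verb "working"
        head = ent.root.head
        if head.lemma_ == "work":
            # Prepositions attached to the verb, e.g. "at"
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                # Organizations governed by the preposition (single tokens after merge_entities)
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                # VBD is the tag for a past-tense verb
                print({"person": ent, "org": orgs, "past": head.tag_ == "VBD"})
    return doc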

In [41]:
from spacy.pipeline import merge_entities

In [42]:
nlp = spacy.load("en_core_web_sm")

In [43]:
nlp.add_pipe(merge_entities)

In [44]:
nlp.add_pipe(get_person_orgs)

In [45]:
doc = nlp("Alex Smith was working at Google")

{'person': Alex Smith, 'org': [Google], 'past': False}


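- The phrase "was working" is in the past tense, so `past` should be True, but the function returned False. The rule only checks the tag of the head verb "working", which is VBG, while the past tense is carried by the auxiliary "was".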

In [47]:
doc = nlp("Alex Smith worked at Google")

{'person': Alex Smith, 'org': [Google], 'past': True}


- For "worked", the head verb itself is tagged VBD, so `past` is correctly True.

### Modify the Rule

In [49]:
def get_person_orgs(doc):
    person_entity = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entity:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                # Auxiliary verbs attached to the head, e.g. "was" in "was working"
                aux = [token for token in head.children if token.dep_ == "aux"]
                past_aux = any(t.tag_ == "VBD" for t in aux)
                # Past if the verb itself is past tense (VBD), or if it is a
                # present participle (VBG) with a past-tense auxiliary
                past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
                print({"person": ent, "org": orgs, "past": past})
    return doc

In [50]:
from spacy.pipeline import merge_entities
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(merge_entities)
nlp.add_pipe(get_person_orgs)
doc = nlp("Alex Smith was working at Google")

{'person': Alex Smith, 'org': [Google], 'past': True}
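
The tense rule in the revised function can also be checked in isolation. The sketch below restates it over plain tag strings instead of spaCy tokens, so it runs without a model. The helper name `is_past` is hypothetical, and the tags for each phrase are written in by hand rather than predicted by a tagger.

```python
def is_past(head_tag, aux_tags):
    """Return True if the verb phrase is past tense under the rule above.

    head_tag: fine-grained tag of the main verb (e.g. "VBD", "VBG")
    aux_tags: fine-grained tags of the verb's auxiliary children
    """
    past_aux = any(tag == "VBD" for tag in aux_tags)
    # Simple past, or present participle with a past-tense auxiliary
    return head_tag == "VBD" or (head_tag == "VBG" and past_aux)

# Hand-written tags for three phrasings (assumed, not tagger output)
worked = is_past("VBD", [])            # "worked"
was_working = is_past("VBG", ["VBD"])  # "was working"
is_working = is_past("VBG", ["VBZ"])   # "is working"
print(worked, was_working, is_working)
```

Keeping the rule in a small pure function like this makes it easy to add further cases later without rerunning the whole pipeline.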
