# Named Entity Recognition

<div class='alert alert-success' style='margin:20px'> Named Entity Recognition (NER) locates named entity mentions in unstructured text and classifies them into predefined categories such as person names, organisations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
    
**Let's explore NER with spaCy**

In [1]:
# Import spaCy and load the small English model

import spacy
import en_core_web_sm
nlp=en_core_web_sm.load()



In [6]:
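# Print each entity's text, label, and label description
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            # spacy.explain returns None for labels it does not know, so wrap it in str()
            print(ent.text + '-' + ent.label_ + '-' + str(spacy.explain(ent.label_)))
    else:
        print("No entities found")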

In [7]:
doc=nlp(u"Hey, How are you?")
show_ents(doc)

No entities found


In [8]:
doc=nlp(u"May i go to Delhi to see India Gate?")
show_ents(doc)

Delhi-GPE-Countries, cities, states
India Gate-FAC-Buildings, airports, highways, bridges, etc.


## Entity annotations
`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The index of the span's *first* token in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The index of the token just *after* the span (exclusive stop index) in the Doc</td></tr>
<tr><td>`ent.start_char`</td><td>The character offset where the entity text *starts* in the Doc text</td></tr>
<tr><td>`ent.end_char`</td><td>The character offset just *after* the entity text (exclusive stop offset) in the Doc text</td></tr>
</table>



In [11]:
doc1=nlp(u"Should I buy 500 shares of Microsoft company?")
for ent in doc1.ents:
    print(ent.text, ent.start, ent.end,ent.start_char,ent.end_char,ent.label_)

500 3 4 13 16 CARDINAL
Microsoft 6 7 27 36 ORG
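
To make the two pairs of indices in this output concrete, here is a small sketch in plain Python that does not call spaCy. It assumes whitespace splitting as a stand-in for spaCy's tokenizer, which happens to agree for the tokens before the final `?`. The variable names are illustrative only, and the numbers are copied from the output above.

```python
text = "Should I buy 500 shares of Microsoft company?"

# Whitespace splitting is only an approximation of spaCy's tokenizer
# (spaCy would split the trailing "?" into its own token).
tokens = text.split()

# (start, end) are token indices; (start_char, end_char) are character offsets.
# Both pairs are half-open: the end position is not included.
rows = [
    ("500", 3, 4, 13, 16),
    ("Microsoft", 6, 7, 27, 36),
]

for ent_text, start, end, start_char, end_char in rows:
    token_slice = " ".join(tokens[start:end])   # slice by token index
    char_slice = text[start_char:end_char]      # slice by character offset
    print(ent_text, "| tokens:", token_slice, "| chars:", char_slice)
```

Slicing the token list and slicing the raw string both recover the same entity text, which is a quick way to check which kind of index you are holding.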


## NER Tags
Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>

___
## Adding a Named Entity to a Span
Normally we would have spaCy build a library of named entities by training it on several samples of text.<br>In this case, we only want to add one value:

In [22]:
doc=nlp(u"Tesla to build a U.K. factory for $60 million")
show_ents(doc)

U.K.-GPE-Countries, cities, states
$60 million-MONEY-Monetary values, including unit


As the output above shows, Tesla was not recognised as an entity, but we can add one for it manually.

In [23]:
from spacy.tokens import Span

In [24]:
ORG=doc.vocab.strings["ORG"]

In [25]:
ORG

383

In [26]:
doc.vocab[ORG].text

'ORG'

In [27]:
new_ent=Span(doc,0,1,label=ORG)

In [28]:
doc.ents=list(doc.ents) + [new_ent]

In [29]:
show_ents(doc)

Tesla-ORG-Companies, agencies, institutions, etc.
U.K.-GPE-Countries, cities, states
$60 million-MONEY-Monetary values, including unit


In [30]:
# Tesla is now tagged as an ORG entity.

### Adding Named Entities to all matching spans

In [38]:
doc=nlp(u"Our company created a brand new vaccum cleaner. "
        u"This new vaccum-cleaner is the best in show.")
show_ents(doc)

No entities found


In [39]:
from spacy.matcher import PhraseMatcher

In [40]:
matcher=PhraseMatcher(nlp.vocab)

phrase_list=['vaccum cleaner','vaccum-cleaner']

phrase_patterns=[nlp(text) for text in phrase_list]

matcher.add('newproduct', phrase_patterns)
found_matches=matcher(doc)

In [41]:
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]
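
Each match is a `(match_id, start, end)` tuple, where `start` and `end` are token indices into `doc`. A minimal sketch of unpacking them follows, using the values printed above as literal data so that it runs without spaCy. `match_ids` and `match_lengths` are names made up for this illustration.

```python
# Values copied from the output above
found_matches = [(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]

# Both tuples carry the same match_id because both patterns
# were added under the single key 'newproduct'.
match_ids = {match_id for match_id, _start, _end in found_matches}

# end is exclusive, so end - start is the number of tokens in each match.
match_lengths = [end - start for _match_id, start, end in found_matches]

print(match_ids)
print(match_lengths)
```

The second match is one token longer than the first, presumably because the hyphen in the hyphenated phrase is tokenized separately.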

In [46]:
matcher.remove('newproduct')

In [42]:
from spacy.tokens import Span

In [43]:
PROD=doc.vocab.strings["PRODUCT"]
PROD

386

In [44]:
new_ents=[Span(doc,match[1],match[2],label=PROD) for match in found_matches ]

doc.ents=list(doc.ents) + new_ents

In [45]:
show_ents(doc)   # The matched spans are now PRODUCT entities

vaccum cleaner-PRODUCT-Objects, vehicles, foods, etc. (not services)
vaccum-cleaner-PRODUCT-Objects, vehicles, foods, etc. (not services)
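
One thing to watch for when extending `doc.ents` this way: spaCy does not accept overlapping entity spans and raises an error if two of them share a token. The sketch below shows one way to filter candidate spans first. It works on plain `(start, end)` token-index tuples so that it runs without spaCy. `drop_overlaps` is a hypothetical helper, not part of spaCy; for real `Span` objects, `spacy.util.filter_spans` serves a similar purpose.

```python
def drop_overlaps(spans):
    """Keep longer spans first and skip any span that shares a token with a kept one.

    `spans` is an iterable of (start, end) token-index pairs with an exclusive end.
    """
    kept = []
    taken = set()  # token indices already covered by a kept span
    # Longest spans first; among equal lengths, the earliest start first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


# The two matches found above, plus a duplicate and an overlapping
# candidate that were invented for this illustration.
candidates = [(6, 8), (11, 14), (6, 8), (7, 9)]
clean = drop_overlaps(candidates)
print(clean)
```

Preferring the longest span is a design choice; other policies, such as keeping existing entities over new ones, may suit a given application better.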


### Counting Entities

In [47]:
doc=nlp(u"Originally i paid $20 for this toy car, but now this marked down to $14.6.")

In [48]:
[ent for ent in doc.ents if ent.label_=='MONEY']

[20, 14.6]

In [49]:
len([ent for ent in doc.ents if ent.label_=="MONEY"])

2
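
To count every label at once rather than one label at a time, `collections.Counter` works on any sequence of objects that have a `label_` attribute. The sketch below uses a namedtuple as a stand-in for spaCy spans, an assumption made so that the example runs without a model. The entity texts and labels are copied from outputs earlier in this notebook; with a real document you would pass `doc.ents` instead. `Ent` and `count_labels` are hypothetical names.

```python
from collections import Counter, namedtuple

# Stand-in for a spaCy Span: only the two attributes used here
Ent = namedtuple("Ent", ["text", "label_"])


def count_labels(ents):
    """Return a Counter mapping each entity label to its number of occurrences."""
    return Counter(ent.label_ for ent in ents)


# Entities taken from the Tesla sentence and the toy-car sentence
ents = [
    Ent("Tesla", "ORG"),
    Ent("U.K.", "GPE"),
    Ent("$60 million", "MONEY"),
    Ent("20", "MONEY"),
    Ent("14.6", "MONEY"),
]

label_counts = count_labels(ents)
print(label_counts.most_common())
```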

## Noun Chunks
`Doc.noun_chunks` are *base noun phrases*: token spans that include the noun and words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.<br>
Where `Doc.ents` rely on the **ner** pipeline component, `Doc.noun_chunks` are provided by the **parser**.

### `noun_chunks` components:
<table>
<tr><td>`.text`</td><td>The original noun chunk text.</td></tr>
<tr><td>`.root.text`</td><td>The original text of the word connecting the noun chunk to the rest of the parse.</td></tr>
<tr><td>`.root.dep_`</td><td>Dependency relation connecting the root to its head.</td></tr>
<tr><td>`.root.head.text`</td><td>The text of the root token's head.</td></tr>
</table>

In [51]:
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)

Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward
