## Named Entity Recognition (NER)

### What is a Named Entity?

**Any word or phrase that represents a person, organization, location, etc. is a named entity. Named entity recognition (NER) is a subtask of information extraction: the process of identifying the words in a given text that are named entities. It is also called entity identification or entity chunking.**

--- 

### Example

**"Apple acquired Zoom in China on Wednesday 6th May 2020"**
- Here the named entities are Apple, Zoom, China and Wednesday 6th May 2020.
- Named entity recognition is the task of identifying these words in the text.
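
One way to make the output of NER concrete is to record each entity together with its character span in the sentence. The sketch below does this by hand for the example above; the `find_spans` helper is hypothetical and uses plain string search, not an NER model.

```python
# Illustrative only: locate the entities listed above with plain string search.
sentence = "Apple acquired Zoom in China on Wednesday 6th May 2020"
entities = ["Apple", "Zoom", "China", "Wednesday 6th May 2020"]


def find_spans(text, phrases):
    """Return (phrase, start, end) for the first occurrence of each phrase."""
    spans = []
    for phrase in phrases:
        start = text.find(phrase)
        if start != -1:  # skip phrases that do not occur in the text
            spans.append((phrase, start, start + len(phrase)))
    return spans


spans = find_spans(sentence, entities)
for phrase, start, end in spans:
    print(f"{start:>2}-{end:<2} {phrase}")
```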
---

### Why is it important?

**To understand the meaning of a text, it is important to identify who did what to whom. Named entity recognition is the first step: it identifies the words that may represent the who, what and whom in the text, and so helps identify the major entities the text is talking about.**


**This means that any NLP task that involves automatically understanding text and acting on it typically needs NER in its pipeline.**

--- 


### Team ideation

There are several ways to build a model for NER. We will focus on two, NLTK and spaCy, and compare which gives better results, keeping in mind that neither is perfect.


---

### Approaches
1. NLTK
    a. Word-based segmentation
    b. Sentence-based segmentation: working sentence by sentence lets entities be identified from their context, rather than from the meaning of a single word alone.
2. spaCy
3. Using Stanford NLP NER


### Imports

In [29]:
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk import ne_chunk
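# spaCy
import spacy
from spacy import displacy

# Download the NLTK resources used by pos_tag and ne_chunk.
# word_tokenize and sent_tokenize also need the 'punkt' tokenizer data
# ('punkt_tab' on recent NLTK releases) if it is not already installed.
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')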

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ananthan2k/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/ananthan2k/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/ananthan2k/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

### Data

NOTE: For ease of debugging, we use a short sample text.

In [6]:
text = "Apple acquired Zoom in China on Wednesday 6th May 2020.\
This news has made Apple and Google stock jump by 5% on Dow Jones Index in the \
United States of America"

**NER using word segmentation**

In [7]:
words = word_tokenize(text)
words

['Apple',
 'acquired',
 'Zoom',
 'in',
 'China',
 'on',
 'Wednesday',
 '6th',
 'May',
 '2020.This',
 'news',
 'has',
 'made',
 'Apple',
 'and',
 'Google',
 'stock',
 'jump',
 'by',
 '5',
 '%',
 'on',
 'Dow',
 'Jones',
 'Index',
 'in',
 'the',
 'United',
 'States',
 'of',
 'America']
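
The token `'2020.This'` in the output above can be traced back to how `text` is written: a backslash at the end of a line inside a string literal joins the next line directly, so no space separates `2020.` from `This`. A minimal sketch of the effect, using plain `str.split` instead of `word_tokenize` so that it runs without NLTK data (the `fixed` variable is a hypothetical variant, not part of the notebook):

```python
# The same literal as in the notebook: the backslash joins lines with no space.
text = "Apple acquired Zoom in China on Wednesday 6th May 2020.\
This news has made Apple and Google stock jump by 5% on Dow Jones Index in the \
United States of America"

# Hypothetical variant with a space before the line continuation.
fixed = "Apple acquired Zoom in China on Wednesday 6th May 2020. \
This news has made Apple and Google stock jump by 5% on Dow Jones Index in the \
United States of America"

print("2020.This" in text.split())   # True: the two sentences are fused
print("2020.This" in fixed.split())  # False: "2020." and "This" are separate
```

With the space restored, a tokenizer has a chance to see the sentence boundary. The outputs in the rest of this notebook were produced with the original string.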

In [10]:
## POS tagging
pos_tags = pos_tag(words)
pos_tags

[('Apple', 'NNP'),
 ('acquired', 'VBD'),
 ('Zoom', 'NNP'),
 ('in', 'IN'),
 ('China', 'NNP'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('6th', 'CD'),
 ('May', 'NNP'),
 ('2020.This', 'CD'),
 ('news', 'NN'),
 ('has', 'VBZ'),
 ('made', 'VBN'),
 ('Apple', 'NNP'),
 ('and', 'CC'),
 ('Google', 'NNP'),
 ('stock', 'NN'),
 ('jump', 'NN'),
 ('by', 'IN'),
 ('5', 'CD'),
 ('%', 'NN'),
 ('on', 'IN'),
 ('Dow', 'NNP'),
 ('Jones', 'NNP'),
 ('Index', 'NNP'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('United', 'NNP'),
 ('States', 'NNPS'),
 ('of', 'IN'),
 ('America', 'NNP')]

**Next we chunk the tagged words. This is the step that classifies the POS-tagged words into named entities. `ne_chunk` can do this in two ways:**
1. `binary=True`: each chunk is either a named entity or not, and every named entity gets the single label 'NE'.
2. `binary=False`: each named entity is labelled with a specific type, such as PERSON, ORGANIZATION or GPE.

**Binary=True**

In [18]:
chunks = ne_chunk(pos_tags, binary=True)
for chunk in chunks:
    print(chunk)

(NE Apple/NNP)
('acquired', 'VBD')
('Zoom', 'NNP')
('in', 'IN')
(NE China/NNP)
('on', 'IN')
('Wednesday', 'NNP')
('6th', 'CD')
('May', 'NNP')
('2020.This', 'CD')
('news', 'NN')
('has', 'VBZ')
('made', 'VBN')
(NE Apple/NNP)
('and', 'CC')
(NE Google/NNP)
('stock', 'NN')
('jump', 'NN')
('by', 'IN')
('5', 'CD')
('%', 'NN')
('on', 'IN')
('Dow', 'NNP')
('Jones', 'NNP')
('Index', 'NNP')
('in', 'IN')
('the', 'DT')
(NE United/NNP States/NNPS)
('of', 'IN')
(NE America/NNP)


In [21]:
print(dir(chunks[0]))

['__add__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_frozen_class', '_get_node', '_label', '_parse_error', '_pformat_flat', '_repr_png_', '_set_node', 'append', 'chomsky_normal_form', 'clear', 'collapse_unary', 'convert', 'copy', 'count', 'draw', 'extend', 'flatten', 'freeze', 'fromlist', 'fromstring', 'height', 'index', 'insert', 'label', 'leaf_treeposition', 'leaves', 'node', 'pformat', 'pformat_latex_qtree', 'pop', 'pos', 'pprint', 'pretty_print', 'productions', 'remove', 'reverse', 'set_label', 'sor

In [26]:
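entities = []
labels = []
for chunk in chunks:
    # Named-entity chunks are Tree objects with a label;
    # all other tokens are plain (word, tag) tuples
    if hasattr(chunk, 'label'):
        # Join the words of a multi-word entity, e.g. "United States"
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())

# De-duplicate the (entity, label) pairs; set() does not preserve order
entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df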

|   | Entities | Labels |
|---|---|---|
| 0 | China | NE |
| 1 | Apple | NE |
| 2 | America | NE |
| 3 | Google | NE |
| 4 | United States | NE |
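
The loop above relies on the fact that `ne_chunk` returns a tree in which named-entity chunks are subtrees and all other tokens are plain `(word, tag)` tuples. A minimal sketch of the same extraction on a hand-built `nltk.Tree`, so that it runs without the tagger or chunker models; the small tree and the `extract_entities` helper are illustrative assumptions, not part of the notebook:

```python
from nltk import Tree

# Hand-built stand-in for a small part of the ne_chunk(..., binary=True) output.
chunks = Tree("S", [
    Tree("NE", [("Apple", "NNP")]),
    ("acquired", "VBD"),
    ("Zoom", "NNP"),
    ("in", "IN"),
    ("the", "DT"),
    Tree("NE", [("United", "NNP"), ("States", "NNPS")]),
])


def extract_entities(tree):
    """Collect (entity text, label) pairs from the top-level chunks of a tree."""
    pairs = []
    for chunk in tree:
        if isinstance(chunk, Tree):  # entity chunks are subtrees
            words = [word for word, tag in chunk.leaves()]
            pairs.append((" ".join(words), chunk.label()))
    return pairs


entity_pairs = extract_entities(chunks)
print(entity_pairs)
```

Checking `isinstance(chunk, Tree)` states the intent more directly than `hasattr(chunk, 'label')`, though both separate entity chunks from ordinary tokens here.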


**Binary=False**

In [28]:
chunks = ne_chunk(pos_tags, binary=False)  # label each entity with a specific type
for chunk in chunks:
    print(chunk)
    
entities =[]
labels =[]
for chunk in chunks:
    if hasattr(chunk,'label'):
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())
        
entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities","Labels"]
entities_df

(PERSON Apple/NNP)
('acquired', 'VBD')
(PERSON Zoom/NNP)
('in', 'IN')
(GPE China/NNP)
('on', 'IN')
('Wednesday', 'NNP')
('6th', 'CD')
('May', 'NNP')
('2020.This', 'CD')
('news', 'NN')
('has', 'VBZ')
('made', 'VBN')
(PERSON Apple/NNP)
('and', 'CC')
(ORGANIZATION Google/NNP)
('stock', 'NN')
('jump', 'NN')
('by', 'IN')
('5', 'CD')
('%', 'NN')
('on', 'IN')
(PERSON Dow/NNP Jones/NNP Index/NNP)
('in', 'IN')
('the', 'DT')
(GPE United/NNP States/NNPS)
('of', 'IN')
(GPE America/NNP)


|   | Entities | Labels |
|---|---|---|
| 0 | Google | ORGANIZATION |
| 1 | China | GPE |
| 2 | United States | GPE |
| 3 | America | GPE |
| 4 | Dow Jones Index | PERSON |
| 5 | Zoom | PERSON |
| 6 | Apple | PERSON |


**NER based on sentence segmentation**

In [31]:
entities = []
labels = []

sentences = sent_tokenize(text)
for sent in sentences:
    for chunk in ne_chunk(pos_tag(word_tokenize(sent)), binary=False):
        if hasattr(chunk,'label'):
            entities.append(' '.join(c[0] for c in chunk))
            labels.append(chunk.label())
            
entities_labels = list(set(zip(entities,labels)))

entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities","Labels"]
entities_df

|   | Entities | Labels |
|---|---|---|
| 0 | Google | ORGANIZATION |
| 1 | China | GPE |
| 2 | United States | GPE |
| 3 | America | GPE |
| 4 | Dow Jones Index | PERSON |
| 5 | Zoom | PERSON |
| 6 | Apple | PERSON |
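
The entity tables above come out in an arbitrary order because `set` does not preserve the order in which the pairs were found. If the order of first appearance matters, one option is `dict.fromkeys`, sketched below on the `(entity, label)` pairs as they appear in the printed `binary=False` chunks:

```python
# (entity, label) pairs in the order they appear in the binary=False output.
pairs = [
    ("Apple", "PERSON"),
    ("Zoom", "PERSON"),
    ("China", "GPE"),
    ("Apple", "PERSON"),
    ("Google", "ORGANIZATION"),
    ("Dow Jones Index", "PERSON"),
    ("United States", "GPE"),
    ("America", "GPE"),
]

# dict keys are unique and keep insertion order (Python 3.7+).
unique_pairs = list(dict.fromkeys(pairs))

for entity, label in unique_pairs:
    print(f"{entity:<16} {label}")
```

The resulting list can be passed to `pd.DataFrame` in the same way as `entities_labels`.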


### spaCy-based approach

In [32]:
# Download the spaCy model (uncomment and run once)
#!python -m spacy download en_core_web_sm

In [34]:
## Load the spacy model
nlp = spacy.load("en_core_web_sm")

In [37]:
## Feed the text
doc = nlp(text)
print([(X.text, X.label_) for X in doc.ents])

[('Apple', 'ORG'), ('Zoom', 'ORG'), ('China', 'GPE'), ('Wednesday 6th', 'DATE'), ('Apple', 'ORG'), ('5%', 'PERCENT'), ('Dow Jones Index', 'ORG'), ('the United States of America', 'GPE')]


In [40]:
from collections import Counter
labels = [x.label_ for x in doc.ents]
labels

['ORG', 'ORG', 'GPE', 'DATE', 'ORG', 'PERCENT', 'ORG', 'GPE']
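
`Counter` is imported in the cell above but not used. A short sketch of how it could tally the labels, using the list printed above:

```python
from collections import Counter

# Entity labels as printed by the cell above.
labels = ['ORG', 'ORG', 'GPE', 'DATE', 'ORG', 'PERCENT', 'ORG', 'GPE']

# Count how often each label occurs, most frequent first.
label_counts = Counter(labels)
print(label_counts.most_common())
# [('ORG', 4), ('GPE', 2), ('DATE', 1), ('PERCENT', 1)]
```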

In [44]:
sentences = list(doc.sents)
sentences

[Apple acquired Zoom in China on Wednesday 6th,
 May 2020.This news has made Apple and Google stock jump by 5% on Dow Jones Index in the United States of America]

In [47]:
# Reuse the already-parsed doc instead of running the pipeline again
displacy.render(doc, style='ent', jupyter=True)
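
To compare the two approaches on this sample, the entity strings reported above can be put side by side. The sketch below copies them from the NLTK `binary=False` table and from the spaCy output; it compares exact strings only, which is a simplifying assumption (for example, `United States` and `the United States of America` count as different entities).

```python
# Entity strings copied from the outputs above.
nltk_entities = {
    "Apple", "Zoom", "China", "Google",
    "Dow Jones Index", "United States", "America",
}
spacy_entities = {
    "Apple", "Zoom", "China", "Wednesday 6th", "5%",
    "Dow Jones Index", "the United States of America",
}

# Set operations give the overlap and the entities unique to each library.
found_by_both = nltk_entities & spacy_entities
only_nltk = nltk_entities - spacy_entities
only_spacy = spacy_entities - nltk_entities

print("Both:      ", sorted(found_by_both))
print("NLTK only: ", sorted(only_nltk))
print("spaCy only:", sorted(only_spacy))
```

On this text, spaCy labels Apple, Zoom and Dow Jones Index as ORG where NLTK labels them PERSON, and it also picks up a date and a percentage, but it misses Google.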