### **Named Entity Recognition (NER)**
###### **Identifying Named Entities in Text Data**

#### **What is a Named Entity?**
###### **A named entity is any word or phrase that represents a person, organization, location, etc. Named Entity Recognition is a subtask of Information Extraction: the process of identifying which words in a given text are named entities. It is also called Entity Identification or Entity Chunking.**
##### **Example**

###### **"The Padma Bridge which is situated in Bangladesh is opened by the Prime Minister on Saturday 25th June 2022"**

* Here, the named entities are: The Padma Bridge, Bangladesh, Prime Minister, Saturday, 25th June 2022
* Named Entity Recognition is the task of identifying these words in the text.
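To make the example concrete, the listed entities can be located in the sentence as character spans. This is only an illustration using plain string search over entities that are already known; it is not how an NER model finds them, and the variable names are chosen for this sketch:

```python
sentence = ("The Padma Bridge which is situated in Bangladesh is opened "
            "by the Prime Minister on Saturday 25th June 2022")
known_entities = ["The Padma Bridge", "Bangladesh", "Prime Minister",
                  "Saturday", "25th June 2022"]

# Record each entity together with its (start, end) character offsets
spans = []
for entity in known_entities:
    start = sentence.find(entity)      # -1 would mean "not found"
    end = start + len(entity)
    spans.append((entity, start, end))

for entity, start, end in spans:
    print(f"{entity!r} -> sentence[{start}:{end}]")
```

An NER system has to produce spans like these without being given the entity list in advance.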

#### **Why is it Important?**
###### **To understand the meaning of a given text, it is important to identify who did what to whom. Named Entity Recognition is the first step: it identifies the words that may represent the who, what, and whom in the text, and so surfaces the major entities the text is talking about. Any NLP task that automatically interprets text and acts on it needs Named Entity Recognition in its pipeline.**

#### **The Approaches**
* **Basic NLTK Algorithm**
    * with word segmentation
    * with sentence segmentation
* **Stanford NLP NER**
* **Using spaCy**

### **Let's Start**

##### **Import Dependencies**

In [2]:
import nltk
import pandas as pd
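The NLTK calls used below rely on data packages that are not bundled with the library. A minimal sketch for fetching them, assuming the classic package names; newer NLTK releases may ask for differently named packages, in which case the `LookupError` message names the one to download. `download_nltk_resources` is a hypothetical helper for this notebook, not an NLTK function:

```python
# Data packages assumed to back word_tokenize, pos_tag and ne_chunk
REQUIRED_RESOURCES = [
    "punkt",                        # sentence and word tokenizer models
    "averaged_perceptron_tagger",   # part-of-speech tagger
    "maxent_ne_chunker",            # named entity chunker
    "words",                        # word list used by the chunker
]

def download_nltk_resources(resources=REQUIRED_RESOURCES):
    """Download each resource and report which downloads succeeded."""
    import nltk  # imported lazily so defining the helper has no side effects
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_nltk_resources()` once before running the cells below should be enough; packages that are already present are not downloaded again.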

##### **Data**

In [3]:
text = "Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data."

##### **Named Entity Tagging using NLTK (Word Based)**

In [4]:
# Tokenize the words
words = nltk.word_tokenize(text)
print(words)

['Stanford', 'NER', 'is', 'a', 'Java', 'implementation', 'of', 'a', 'Named', 'Entity', 'Recognizer', '.', 'Named', 'Entity', 'Recognition', '(', 'NER', ')', 'labels', 'sequences', 'of', 'words', 'in', 'a', 'text', 'which', 'are', 'the', 'names', 'of', 'things', ',', 'such', 'as', 'person', 'and', 'company', 'names', ',', 'or', 'gene', 'and', 'protein', 'names', '.', 'It', 'comes', 'with', 'well-engineered', 'feature', 'extractors', 'for', 'Named', 'Entity', 'Recognition', ',', 'and', 'many', 'options', 'for', 'defining', 'feature', 'extractors', '.', 'Included', 'with', 'the', 'download', 'are', 'good', 'named', 'entity', 'recognizers', 'for', 'English', ',', 'particularly', 'for', 'the', '3', 'classes', '(', 'PERSON', ',', 'ORGANIZATION', ',', 'LOCATION', ')', ',', 'and', 'we', 'also', 'make', 'available', 'on', 'this', 'page', 'various', 'other', 'models', 'for', 'different', 'languages', 'and', 'circumstances', ',', 'including', 'models', 'trained', 'on', 'just', 'the', 'CoNLL', '20

In [5]:
# Parts-of-Speech Tagging
pos_tags = nltk.pos_tag(words)
print(pos_tags)

[('Stanford', 'NNP'), ('NER', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('Java', 'NNP'), ('implementation', 'NN'), ('of', 'IN'), ('a', 'DT'), ('Named', 'NNP'), ('Entity', 'NNP'), ('Recognizer', 'NNP'), ('.', '.'), ('Named', 'VBN'), ('Entity', 'NNP'), ('Recognition', 'NNP'), ('(', '('), ('NER', 'NNP'), (')', ')'), ('labels', 'VBZ'), ('sequences', 'NNS'), ('of', 'IN'), ('words', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('text', 'NN'), ('which', 'WDT'), ('are', 'VBP'), ('the', 'DT'), ('names', 'NNS'), ('of', 'IN'), ('things', 'NNS'), (',', ','), ('such', 'JJ'), ('as', 'IN'), ('person', 'NN'), ('and', 'CC'), ('company', 'NN'), ('names', 'RB'), (',', ','), ('or', 'CC'), ('gene', 'NN'), ('and', 'CC'), ('protein', 'JJ'), ('names', 'NNS'), ('.', '.'), ('It', 'PRP'), ('comes', 'VBZ'), ('with', 'IN'), ('well-engineered', 'JJ'), ('feature', 'NN'), ('extractors', 'NNS'), ('for', 'IN'), ('Named', 'NNP'), ('Entity', 'NNP'), ('Recognition', 'NNP'), (',', ','), ('and', 'CC'), ('many', 'JJ'), ('options', 'NNS'), (

In [6]:
# Check NLTK description of the Tags
nltk.help.upenn_tagset('NNS')

NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
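To decide which tags are worth looking up, it can help to count how often each one occurs. A small sketch using only the first few `(word, tag)` pairs copied from the output above, so it runs without the tagger; in the notebook the full `pos_tags` list could be passed instead:

```python
from collections import Counter

# First eight (word, tag) pairs from the pos_tag output above
sample_tags = [('Stanford', 'NNP'), ('NER', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
               ('Java', 'NNP'), ('implementation', 'NN'), ('of', 'IN'), ('a', 'DT')]

# Count tags only, ignoring the words themselves
tag_counts = Counter(tag for _word, tag in sample_tags)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

Each tag printed here can then be passed to `nltk.help.upenn_tagset` as in the cell above.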


###### **NE Chunk**

In [7]:
chunks = nltk.ne_chunk(pos_tags, binary=True)   # Either NE or not NE
for chunk in chunks:
    print(chunk)

(NE Stanford/NNP)
('NER', 'NNP')
('is', 'VBZ')
('a', 'DT')
('Java', 'NNP')
('implementation', 'NN')
('of', 'IN')
('a', 'DT')
(NE Named/NNP Entity/NNP Recognizer/NNP)
('.', '.')
('Named', 'VBN')
(NE Entity/NNP Recognition/NNP)
('(', '(')
(NE NER/NNP)
(')', ')')
('labels', 'VBZ')
('sequences', 'NNS')
('of', 'IN')
('words', 'NNS')
('in', 'IN')
('a', 'DT')
('text', 'NN')
('which', 'WDT')
('are', 'VBP')
('the', 'DT')
('names', 'NNS')
('of', 'IN')
('things', 'NNS')
(',', ',')
('such', 'JJ')
('as', 'IN')
('person', 'NN')
('and', 'CC')
('company', 'NN')
('names', 'RB')
(',', ',')
('or', 'CC')
('gene', 'NN')
('and', 'CC')
('protein', 'JJ')
('names', 'NNS')
('.', '.')
('It', 'PRP')
('comes', 'VBZ')
('with', 'IN')
('well-engineered', 'JJ')
('feature', 'NN')
('extractors', 'NNS')
('for', 'IN')
(NE Named/NNP Entity/NNP Recognition/NNP)
(',', ',')
('and', 'CC')
('many', 'JJ')
('options', 'NNS')
('for', 'IN')
('defining', 'VBG')
('feature', 'NN')
('extractors', 'NNS')
('.', '.')
('Included', 'VBN')
(

In [8]:
entities = []
labels = []
for chunk in chunks:
    # Named-entity chunks are Tree objects with a label(); plain tokens are (word, tag) tuples
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())

entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

| | Entities | Labels |
|---|---|---|
| 0 | Stanford | NE |
| 1 | Named Entity Recognition | NE |
| 2 | LOCATION | NE |
| 3 | NER | NE |
| 4 | Named Entity Recognizer | NE |
| 5 | PERSON | NE |
| 6 | ORGANIZATION | NE |
| 7 | English | NE |
| 8 | Entity Recognition | NE |
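The same extraction loop is repeated for every approach in this notebook, so it may be worth wrapping in a function. A possible sketch follows; `extract_entities` is a hypothetical helper, not part of NLTK. It also swaps `set` for `dict.fromkeys`, because a `set` does not preserve insertion order and the row order of the resulting table can change between runs:

```python
def extract_entities(tree):
    """Return unique (entity_text, label) pairs in first-seen order."""
    pairs = []
    for node in tree:
        # Entity chunks expose label(); ordinary tokens are (word, tag) tuples
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            pairs.append((text, node.label()))
    # dict.fromkeys drops duplicates while keeping the original order
    return list(dict.fromkeys(pairs))
```

With this helper, a cell could reduce to `pd.DataFrame(extract_entities(chunks), columns=["Entities", "Labels"])`.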


In [9]:
chunks = nltk.ne_chunk(pos_tags, binary=False)   # Assign a specific entity type (PERSON, ORGANIZATION, GPE, ...)
for chunk in chunks:
    print(chunk)

(PERSON Stanford/NNP)
(ORGANIZATION NER/NNP)
('is', 'VBZ')
('a', 'DT')
('Java', 'NNP')
('implementation', 'NN')
('of', 'IN')
('a', 'DT')
('Named', 'NNP')
('Entity', 'NNP')
('Recognizer', 'NNP')
('.', '.')
('Named', 'VBN')
(PERSON Entity/NNP Recognition/NNP)
('(', '(')
(ORGANIZATION NER/NNP)
(')', ')')
('labels', 'VBZ')
('sequences', 'NNS')
('of', 'IN')
('words', 'NNS')
('in', 'IN')
('a', 'DT')
('text', 'NN')
('which', 'WDT')
('are', 'VBP')
('the', 'DT')
('names', 'NNS')
('of', 'IN')
('things', 'NNS')
(',', ',')
('such', 'JJ')
('as', 'IN')
('person', 'NN')
('and', 'CC')
('company', 'NN')
('names', 'RB')
(',', ',')
('or', 'CC')
('gene', 'NN')
('and', 'CC')
('protein', 'JJ')
('names', 'NNS')
('.', '.')
('It', 'PRP')
('comes', 'VBZ')
('with', 'IN')
('well-engineered', 'JJ')
('feature', 'NN')
('extractors', 'NNS')
('for', 'IN')
(PERSON Named/NNP Entity/NNP Recognition/NNP)
(',', ',')
('and', 'CC')
('many', 'JJ')
('options', 'NNS')
('for', 'IN')
('defining', 'VBG')
('feature', 'NN')
('extrac

In [10]:
entities = []
labels = []
for chunk in chunks:
    # Named-entity chunks are Tree objects with a label(); plain tokens are (word, tag) tuples
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())

entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

| | Entities | Labels |
|---|---|---|
| 0 | Stanford | PERSON |
| 1 | Named Entity Recognition | PERSON |
| 2 | CoNLL | ORGANIZATION |
| 3 | NER | ORGANIZATION |
| 4 | LOCATION | ORGANIZATION |
| 5 | PERSON | ORGANIZATION |
| 6 | Entity Recognition | PERSON |
| 7 | English | GPE |
| 8 | ORGANIZATION | ORGANIZATION |


##### **Named Entity Tagging using NLTK (Sentence Based)**

In [11]:
entities = []
labels = []

# Split the text into sentences, then tokenize, tag and chunk each sentence separately
sentences = nltk.sent_tokenize(text)
for sent in sentences:
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))
            labels.append(chunk.label())

entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

| | Entities | Labels |
|---|---|---|
| 0 | Stanford | PERSON |
| 1 | Named Entity Recognition | PERSON |
| 2 | CoNLL | ORGANIZATION |
| 3 | NER | ORGANIZATION |
| 4 | LOCATION | ORGANIZATION |
| 5 | PERSON | ORGANIZATION |
| 6 | Entity Recognition | PERSON |
| 7 | English | GPE |
| 8 | ORGANIZATION | ORGANIZATION |
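For this text, the word-based and sentence-based runs report the same nine (entity, label) pairs. One way to check that kind of claim is to compare the two tables as sets. The sketch below hard-codes the rows shown in the two outputs above so it runs on its own; in the notebook the sets could be built from the DataFrames instead:

```python
# Rows from the word-based (binary=False) result
word_based = {
    ("Stanford", "PERSON"), ("Named Entity Recognition", "PERSON"),
    ("CoNLL", "ORGANIZATION"), ("NER", "ORGANIZATION"),
    ("LOCATION", "ORGANIZATION"), ("PERSON", "ORGANIZATION"),
    ("Entity Recognition", "PERSON"), ("English", "GPE"),
    ("ORGANIZATION", "ORGANIZATION"),
}
# Rows from the sentence-based result
sentence_based = {
    ("Stanford", "PERSON"), ("Named Entity Recognition", "PERSON"),
    ("CoNLL", "ORGANIZATION"), ("NER", "ORGANIZATION"),
    ("LOCATION", "ORGANIZATION"), ("PERSON", "ORGANIZATION"),
    ("Entity Recognition", "PERSON"), ("English", "GPE"),
    ("ORGANIZATION", "ORGANIZATION"),
}

# Set difference lists the pairs found by one approach but not the other
only_word = word_based - sentence_based
only_sentence = sentence_based - word_based
print("Only in word-based:", only_word)
print("Only in sentence-based:", only_sentence)
```

Both differences are empty here, but on other texts the two approaches can disagree, since the tagger then sees different context around each token.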


##### **Stanford NLP NER**
###### **Installation and Configuration: https://blog.manash.io/configuring-stanford-parser-and-stanford-ner-tagger-with-nltk-in-python-on-windows-f685483c374a**

In [12]:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import os

In [14]:
# Point NLTK at the Java executable first (adjust the path to your installation)
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Paths to the pre-trained 3-class English model and the Stanford NER jar
model = "C:/Users/Abs_Sayem/packages/nlp/stanford-ner-2020-11-17/classifiers/english.all.3class.distsim.crf.ser.gz"
jar = "C:/Users/Abs_Sayem/packages/nlp/stanford-ner-2020-11-17/stanford-ner.jar"

st = StanfordNERTagger(model, jar, encoding='utf-8')

In [15]:
tokenized_text = nltk.word_tokenize(text)
classified_text = st.tag(tokenized_text)

classified_text_df = pd.DataFrame(classified_text)
classified_text_df.drop_duplicates(keep='first', inplace=True)
classified_text_df.reset_index(drop=True, inplace=True)
classified_text_df.columns = ["Entities", "Labels"]
classified_text_df

LookupError: 

===========================================================================
NLTK was unable to find the java file!
Use software specific configuration paramaters or set the JAVAHOME environment variable.
===========================================================================
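The `LookupError` above is raised when NLTK cannot locate a Java executable, which typically means the path assigned to `JAVAHOME` does not exist on the machine running the notebook. A minimal sketch for checking the path before tagging; `find_java` is a hypothetical helper, not part of NLTK, and it assumes that falling back to whatever `java` is on the system `PATH` is acceptable:

```python
import os
import shutil

def find_java(candidate=None):
    """Return a path to a Java executable, or None if none is found."""
    # Prefer the explicitly configured path when it points at a real file
    if candidate and os.path.isfile(candidate):
        return candidate
    # Otherwise fall back to a java binary on the system PATH, if any
    return shutil.which("java")

java_path = find_java("C:/Program Files/Java/jdk1.8.0_161/bin/java.exe")
if java_path is None:
    print("No Java executable found; install a JDK/JRE or correct the path.")
else:
    os.environ["JAVAHOME"] = java_path
```

Re-running the tagging cell after `JAVAHOME` points at an existing executable should get past this error, provided the model and jar paths are also valid.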