# **Named Entity Recognition**

Named entity recognition (NER) is often the first step in information extraction.

It seeks to locate and classify named entities in text into predefined categories such as the names of people, organizations, locations, expressions of time, numbers, percentages, etc.

## **NER : Pretrained**

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load('en')

**nlp = spacy.load('en') :** load( ) loads a trained model, and 'en' is the name of the model to load (here the shortcut for the English model).

**sent :** the processed document returned by calling nlp on our text; the entities are extracted from it.

**sent.ents :** a tuple of all the entities recognised by the model.

**word.label_ :** the label (PERSON / GPE / DATE) that the model assigned to that entity.

**word.text :** the text of the entity, e.g. 'MS Dhoni' or '2007'.

In [None]:
sent= nlp("Mahendra Singh Dhoni (born 7 July 1981), commonly known as MS Dhoni, is international cricketer who captained the Indian national team in limited-overs formats from 2007 to 2016 and in Test cricket from 2008 to 2014. Under his captaincy, India won the 2007 ICC World Twenty20, the 2010 and 2016 Asia Cups, the 2011 ICC Cricket World Cup and the 2013 ICC Champions Trophy. A right-handed middle-order batsman and wicketkeeper, Dhoni is one of the highest run scorers in One Day Internationals (ODIs) with more than 10,000 runs scored and considered an effective finisher in limited-overs formats. He is also regarded by some as one of the best wicketkeepers in modern limited-overs international cricket. He made his ODI debut in December 2004 against Bangladesh and played his first Test a year later against Sri Lanka. Dhoni has been the recipient of many awards, including the ICC ODI Player of the Year award in 2008 and 2009 (the first player to win the award twice), the Rajiv Gandhi Khel Ratna award in 2007, the Padma Shri, India's fourth highest civilian honor, in 2009 and the Padma Bhushan, India's third highest civilian honor, in 2018.He was named as the captain of the ICC World Test XI in 2009, 2010 and 2013. He has also been selected a record 8 times in ICC World ODI XI teams, 5 times as captain. The Indian Territorial Army conferred the honorary rank of Lieutenant Colonel to Dhoni on 1 November 2011. He is the second Indian cricketer after Kapil Dev to receive this honor.")


for word in sent.ents:
  print(word.text,word.label_)

Mahendra Singh Dhoni PERSON
7 July 1981 DATE
MS Dhoni PERSON
Indian NORP
2007 DATE
2016 DATE
2008 DATE
2014 DATE
India GPE
2007 DATE
ICC World ORG
Twenty20 GPE
2010 DATE
2016 DATE
Asia Cups WORK_OF_ART
2011 CARDINAL
ICC Cricket World Cup EVENT
2013 CARDINAL
ICC Champions Trophy EVENT
Dhoni PERSON
one CARDINAL
One Day DATE
more than 10,000 CARDINAL
some as one CARDINAL
ODI ORG
December 2004 DATE
Bangladesh ORG
first ORDINAL
a year later DATE
Sri Lanka GPE
Dhoni PERSON
2008 DATE
2009 DATE
first ORDINAL
Gandhi Khel Ratna PERSON
2007 DATE
India GPE
fourth ORDINAL
2009 DATE
the Padma Bhushan PRODUCT
India GPE
third ORDINAL
2018.He CARDINAL
the ICC World Test XI ORG
2009 DATE
2010 DATE
2013 DATE
8 CARDINAL
ICC World ODI XI ORG
5 CARDINAL
The Indian Territorial Army ORG
Dhoni PERSON
1 November 2011 DATE
second ORDINAL
Indian NORP
Kapil Dev PERSON
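
The flat listing above is easier to scan once the entities are grouped by label. The following is a minimal sketch in plain Python, using a handful of (text, label) pairs copied from the output; `group_by_label` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the printed output above.
pairs = [
    ("Mahendra Singh Dhoni", "PERSON"),
    ("7 July 1981", "DATE"),
    ("MS Dhoni", "PERSON"),
    ("India", "GPE"),
    ("Sri Lanka", "GPE"),
    ("India", "GPE"),
    ("Kapil Dev", "PERSON"),
]

def group_by_label(pairs):
    """Collect entity texts under their label, keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:  # skip repeated mentions
            grouped[label].append(text)
    return dict(grouped)

grouped = group_by_label(pairs)
for label, texts in grouped.items():
    print(f"{label:{10}} - {', '.join(texts)}")
```

With a real document, the same helper could be fed `[(word.text, word.label_) for word in sent.ents]`.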


In [None]:
displacy.render(sent ,style='ent',jupyter=True)

In [None]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

## **NER :  Custom**

In [None]:
!python -m spacy info


spaCy version    2.2.4                         
Location         /usr/local/lib/python3.7/dist-packages/spacy
Platform         Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python version   3.7.12                        
Models           en                            



In [None]:
import spacy
import pickle
import random

In [None]:
# one more example, using the pre-trained model loaded above as nlp

for ent in nlp("Apple is looking at buying U.K. startup for $1 billion").ents:
  print(f'{ent.label_.upper():{10}} - {ent.text}')

ORG        - Apple
GPE        - U.K.
MONEY      - $1 billion


We can use spaCy's explain method to look up what an entity label means.

In [None]:
print(f'PERSON - {spacy.explain("PERSON")}')
print(f'GPE    - {spacy.explain("GPE")}')
print(f'DATE   - {spacy.explain("DATE")}')
print(f'MONEY  - {spacy.explain("MONEY")}')

PERSON - People, including fictional
GPE    - Countries, cities, states
DATE   - Absolute or relative dates or periods
MONEY  - Monetary values, including unit


In [None]:
import os
# os.walk() expects a directory and yields nothing for a file path, so check the file directly.
cv_path = '/content/drive/MyDrive/Colab Notebooks/data/Alice Clark CV.pdf'
print(cv_path, os.path.isfile(cv_path))

In [None]:
import os
# Same check for the pickled training data.
train_path = '/content/drive/MyDrive/Colab Notebooks/data/train_data.pkl'
print(train_path, os.path.isfile(train_path))

Because our résumé is a PDF, we will extract its text using PyMuPDF. PyPDF2 is another option.

In [None]:
pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.18.19-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.4 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.18.19


In [None]:
import fitz  # PyMuPDF is imported under the name fitz
fname = '/content/drive/MyDrive/Colab Notebooks/data/Alice Clark CV.pdf'
doc = fitz.open(fname)
alice_cv = ""
for page in doc:
  # getText() is the method name in PyMuPDF 1.18; newer releases call it get_text()
  alice_cv = alice_cv + str(page.getText())

print(alice_cv)

# We extracted the text from the PDF file using PyMuPDF and saved it in the alice_cv variable.

Alice Clark 
AI / Machine Learning 
 
Delhi, India Email me on Indeed 
• 
20+ years of experience in data handling, design, and development 
• 
Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to 
data warehousing and business intelligence 
• 
Database: Experience in database designing, scalability, back-up and recovery, writing and 
optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes. 
Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure, 
Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake 
analytics(U-SQL) 
Willing to relocate anywhere 
 
WORK EXPERIENCE 
Software Engineer 
Microsoft – Bangalore, Karnataka 
January 2000 to Present 
1. Microsoft Rewards Live dashboards: 
Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping 
online. Microsoft Rewards members can earn points when searching with Bing, bro

In [None]:
# Now we can run our test data through the pre-trained spaCy model and see how well it works.

test = spacy.load('en')
ts = test(" ".join(alice_cv.split('\n'))) # split the text on '\n' and rejoin the pieces with spaces

**ts = test( )** passes the text to test (the spaCy English model object), which predicts POS tags, named entities, and other annotations and returns them in ts.

All of these annotations are stored in this variable. We will extract only the named entities and inspect the result manually.
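
The split-and-rejoin step removes newlines but keeps any spaces that surrounded them, so runs of spaces can remain in the text passed to the model. The sketch below illustrates this on a short excerpt in the style of the extracted CV and shows one possible alternative, `" ".join(raw.split())`, which collapses every whitespace run. The alternative is offered as an option, not as what the notebook uses.

```python
# Short excerpt in the style of the extracted CV text, with newlines as extracted.
raw = "Alice Clark \nAI / Machine Learning \n \nDelhi, India Email me on Indeed \n"

joined = " ".join(raw.split('\n'))   # the approach used in the cell above
collapsed = " ".join(raw.split())    # alternative: collapse every whitespace run

print(repr(joined))
print(repr(collapsed))
```

Note that if character offsets were annotated against one form of the text, the same form has to be used at training time, since collapsing whitespace shifts the offsets.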

In [None]:
# Here, we are only extracting all PERSON named entities.

for ent in ts.ents:
  if ent.label_.upper() == 'PERSON':
    print(f'{ent.label_.upper():{10}} - {ent.text}')

PERSON     - Alice Clark
PERSON     - Stored Procedures
PERSON     - Cloud
PERSON     - Document DB
PERSON     - Web Job
PERSON     - Web App
PERSON     - Bing


Our pre-trained model performed poorly on the test data. Only one PERSON entity ('Alice Clark') is an actual name; the others are skills and product names.

Below, most ORG entities are also incorrect: apart from 'Microsoft' (and arguably 'Indian Institute of Technology'), the spans are skills, product names, or headings.

In [None]:
# Here, we are only extracting all ORG named entities.

for ent in ts.ents:
  if ent.label_.upper() == 'ORG':
    print(f'{ent.label_.upper():{10}} - {ent.text}')

ORG        - AI / Machine Learning
ORG        - star/snow flake scema
ORG        - SQL
ORG        - Microsoft Azure
ORG        - SQL Azure
ORG        - Stream Analytics
ORG        - Power BI
ORG        - Power BI
ORG        - SQL
ORG        - Microsoft
ORG        - Microsoft
ORG        - Microsoft
ORG        - Microsoft
ORG        - Microsoft Edge
ORG        - Microsoft
ORG        - Microsoft
ORG        - PBI
ORG        - Technology/Tools
ORG        - Indian Institute of Technology
ORG        - Big Data


### **Hence, in these situations it is better to use a custom NER model.**

To get useful entities here, we train a spaCy model on manually labelled data and build a custom NER.

Because we tested on resume data, the training data must also be manually labelled resumes. We obtained our training data from the internet, but you can produce your own to suit your needs.

In [None]:
with open('/content/drive/MyDrive/Colab Notebooks/data/train_data.pkl', 'rb') as f:  # the file is closed once loaded
  train_data = pickle.load(f)
print(f"Training data consists of {len(train_data)} manually labelled resumes.")

Training data consists of 200 manually labelled resumes.


In [None]:
# Checking the format of one training example

train_data[100]

('Puneet Bhandari SAP SD lead - Microsoft IT  Pune, Maharashtra - Email me on Indeed: indeed.com/r/Puneet-Bhandari/c9002fa44d6760bd  Willing to relocate: Anywhere  WORK EXPERIENCE  SAP SD lead  Microsoft IT -  August 2010 to Present  Team Size: 8 Duration: Seven months  Scope: * Enhancement of Mexico invoicing process as per the current regulations * Requirement gathering from third party and client on new process * Responsible for implementing the changes in system  Area of Exposure: * Understand the AS-IS process and develop to- Be design document to meet the business and Government requirement * Requirement gathering for all SD process for client * Developed solution blueprint and Process Design Documents for OTC 3-way and 1-way invoice processes * Interacting with third party to gather requirements from their end * Creating functional specification and Gap analysis document for different country implementation with client * Design test scripts for functional unit testing (FUT), Int

**The structure of our training data**

Our training data is a list of 200 entries, one per resume, and each entry is a tuple with two parts/indices.

The first index [0] holds the full text of the resume (name, degree, designation, companies worked for, and so on).

The second index [1] is a dictionary with a single key, 'entities'.
The value of the 'entities' key is a list of tuples, each holding a start offset, an end offset, and a label.

For example, in (0, 15, 'Name'), 0 is the start character offset and 15 is the end offset (exclusive) of the text labelled 'Name', which is 'Puneet Bhandari'. Every other tuple likewise has a start offset, an end offset, and a label.
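
The offsets behave like Python slice bounds, which can be checked directly. This is a minimal sketch on a shortened excerpt of the resume text shown above. Only (0, 15, 'Name') is taken from the text; the 'Designation' span and the `spans_to_text` helper are hypothetical, added purely for illustration.

```python
# Shortened excerpt of the resume text shown above.
text = "Puneet Bhandari SAP SD lead - Microsoft IT  Pune, Maharashtra"

# (0, 15, 'Name') comes from the description above. The 'Designation' span is a
# hypothetical illustration; its offsets are computed here, not read from the real data.
designation = "SAP SD lead"
start = text.find(designation)
annotation = {"entities": [(0, 15, "Name"),
                           (start, start + len(designation), "Designation")]}

def spans_to_text(text, annotation):
    """Return (label, substring) pairs; end offsets are exclusive, like slices."""
    return [(label, text[s:e]) for s, e, label in annotation["entities"]]

decoded = spans_to_text(text, annotation)
for label, value in decoded:
    print(f"{label:{12}} - {value}")
```

Computing offsets with `str.find` and `len`, rather than counting characters by hand, is one way to avoid off-by-one errors when producing your own training data.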

In [None]:
# Load a blank spaCy model, since we want to train our own NER.
# spacy.blank('en') creates an empty pipeline for the given language class, here English.

nlp = spacy.blank('en')

In [None]:
# Creating a function to train our model

def train_model(train_data):

  if 'ner' not in nlp.pipe_names:  # check whether NER is already in the pipeline
    ner = nlp.create_pipe('ner')   # create the NER pipe if it is missing
    nlp.add_pipe(ner, last=True)   # add the NER pipe at the end
  else:
    ner = nlp.get_pipe('ner')      # otherwise reuse the existing NER pipe

  for _, annotation in train_data:      # one resume at a time from the 200 training resumes
    for ent in annotation['entities']:  # one tuple at a time, e.g. (0, 15, 'Name')
      ner.add_label(ent[2])             # register only the label, e.g. 'Name'

  # After the loop above, every custom label in the training data is registered.

  other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']  # all pipes except NER
  with nlp.disable_pipes(*other_pipes):  # disable the other pipes, as we only train NER
    optimizer = nlp.begin_training()

    for itn in range(10):  # train the model for 10 iterations
      print('Starting iteration ' + str(itn))
      random.shuffle(train_data)  # shuffle the data in every iteration
      losses = {}
      for text, annotations in train_data:
        try:
          nlp.update(
              [text],         # batch of texts
              [annotations],  # batch of annotations
              drop=0.2,       # dropout rate, makes it harder to memorise the data
              sgd=optimizer,  # callable to update weights
              losses=losses)  # dictionary updated with the loss, keyed by pipeline component
        except Exception as e:
          pass  # examples that make nlp.update raise are skipped silently
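
The try/except in the training loop silently skips any example that nlp.update rejects, so it can be worth checking the annotations beforehand. One possible cause, stated here as an assumption rather than something verified on this dataset, is overlapping entity spans, which a spaCy NER component cannot represent. The sketch below is plain Python; `find_overlaps` and the sample spans are hypothetical.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that overlap each other."""
    ordered = sorted(entities, key=lambda span: (span[0], span[1]))
    overlaps = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                break  # sorted by start, so no later span can overlap current
            overlaps.append((current, other))
    return overlaps

# Hypothetical annotations: the first list is clean, the second has a clash.
clean = [(0, 15, "Name"), (16, 27, "Designation")]
clash = [(0, 15, "Name"), (10, 27, "Designation")]

print(find_overlaps(clean))
print(find_overlaps(clash))
```

Running such a check over `annotation['entities']` for every training example would show how many resumes are affected before any are dropped during training.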

In [None]:
# pass train data to function.

train_model(train_data)

Starting iteration 0
Starting iteration 1
Starting iteration 2
Starting iteration 3
Starting iteration 4
Starting iteration 5
Starting iteration 6
Starting iteration 7
Starting iteration 8
Starting iteration 9


In [None]:
# Saving our trained model to re-use.

nlp.to_disk('nlp_model')

In [None]:
# Loading our trained model

nlp_model = spacy.load('nlp_model')

In [None]:
# Checking all the custom NER labels the model has learned

nlp_model.get_pipe('ner').labels

('College Name',
 'Companies worked at',
 'Degree',
 'Designation',
 'Email Address',
 'Graduation Year',
 'Location',
 'Name',
 'Skills',
 'UNKNOWN',
 'Years of Experience')

In [None]:
doc = nlp_model(" ".join(alice_cv.split('\n')))
for ent in doc.ents:
  print(f'{ent.label_.upper():{20}} - {ent.text}')

NAME                 - Alice Clark
LOCATION             - Delhi
DESIGNATION          - Software Engineer
COMPANIES WORKED AT  - Microsoft –
COMPANIES WORKED AT  - Microsoft
COMPANIES WORKED AT  - Microsoft
DEGREE               - Indian Institute of Technology – Mumbai
GRADUATION YEAR      - 2001
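
The custom model's output still contains repeated mentions and a stray dash ('Microsoft –'). A light post-processing step can turn it into a cleaner record. This is a minimal sketch using the (label, text) pairs copied from the output above; `build_record` is a hypothetical helper, and stripping dashes at the edges is an assumption about what counts as noise.

```python
# (label, text) pairs copied from the printed output above.
entities = [
    ("NAME", "Alice Clark"),
    ("LOCATION", "Delhi"),
    ("DESIGNATION", "Software Engineer"),
    ("COMPANIES WORKED AT", "Microsoft –"),
    ("COMPANIES WORKED AT", "Microsoft"),
    ("COMPANIES WORKED AT", "Microsoft"),
    ("DEGREE", "Indian Institute of Technology – Mumbai"),
    ("GRADUATION YEAR", "2001"),
]

def build_record(entities):
    """Group entity texts by label, trimming edge dashes and dropping repeats."""
    record = {}
    for label, text in entities:
        cleaned = text.strip(" –-")  # only trims the edges; interior dashes stay
        values = record.setdefault(label, [])
        if cleaned not in values:
            values.append(cleaned)
    return record

record = build_record(entities)
for label, values in record.items():
    print(f"{label:{20}} - {', '.join(values)}")
```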
