### Context
Following on from data collection through web scraping, data cleaning, and analysis of text metrics, let's do some Named Entity Recognition (NER). NER is widely used across many industries to automate information extraction, for example identifying and extracting text elements such as organisations, people's names, and locations.

Let's try both the nltk and spacy methods on a test sentence.

In [9]:
# import libraries
import pandas as pd
from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk
import spacy

In [10]:
sentence = "Grand Canyon National Park, in Arizona, is home to much of the immense Grand Canyon, with its layered bands of red rock revealing millions of years of geological history. Viewpoints include Mather Point, Yavapai Observation Station and architect Mary Colter’s Lookout Studio and her Desert View Watchtower."

In [11]:
print(sentence)

Grand Canyon National Park, in Arizona, is home to much of the immense Grand Canyon, with its layered bands of red rock revealing millions of years of geological history. Viewpoints include Mather Point, Yavapai Observation Station and architect Mary Colter’s Lookout Studio and her Desert View Watchtower.


In [12]:
for sent in sent_tokenize(sentence):
    # Chunk the tags that were tagged on the words of each sentence
    for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
            

GPE Grand
PERSON Canyon National Park
GPE Arizona
PERSON Mather Point
PERSON Yavapai Observation Station
PERSON Lookout Studio


In [3]:
# load spacy model
nlp = spacy.load('en_core_web_sm')

In [93]:
doc_test = nlp(sentence)
for ent in doc_test.ents:
    print(ent.text, ent.label_)

Grand Canyon National Park LOC
Arizona GPE
Grand Canyon GPE
millions of years DATE
Yavapai Observation Station ORG
Mary Colter’s PERSON
Desert View Watchtower PERSON
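
To see where the two outputs agree, the printed results above can be lined up by entity text. The snippet below is only a sketch: it copies the entity/label pairs from the two printouts by hand into plain dictionaries (the names `nltk_ents` and `spacy_ents` are made up for illustration) and compares them with set operations.

```python
# Entity -> label pairs copied by hand from the nltk output above
nltk_ents = {
    "Grand": "GPE",
    "Canyon National Park": "PERSON",
    "Arizona": "GPE",
    "Mather Point": "PERSON",
    "Yavapai Observation Station": "PERSON",
    "Lookout Studio": "PERSON",
}
# Entity -> label pairs copied by hand from the spacy output above
spacy_ents = {
    "Grand Canyon National Park": "LOC",
    "Arizona": "GPE",
    "Grand Canyon": "GPE",
    "millions of years": "DATE",
    "Yavapai Observation Station": "ORG",
    "Mary Colter’s": "PERSON",
    "Desert View Watchtower": "PERSON",
}

# Entities whose text was extracted identically by both tools
shared = sorted(set(nltk_ents) & set(spacy_ents))
# Shared entities that also received the same label
same_label = [e for e in shared if nltk_ents[e] == spacy_ents[e]]
# Entities found by one tool only
only_nltk = sorted(set(nltk_ents) - set(spacy_ents))
only_spacy = sorted(set(spacy_ents) - set(nltk_ents))

print("Found by both:", shared)
print("Same label   :", same_label)
print("Only nltk    :", only_nltk)
print("Only spacy   :", only_spacy)
```

On this sentence only two spans match exactly ("Arizona" and "Yavapai Observation Station"), and only "Arizona" receives the same label from both tools.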


Either method is fine for this exercise. Let's go with spacy first.

In [58]:
Ent = []
Labels = []

for ent in doc_test.ents:
    Ent.append(ent.text)
    Labels.append(ent.label_)
    
ent_df = pd.DataFrame({'Entity': Ent,
                      'Label':Labels})
ent_df

Unnamed: 0,Entity,Label
0,Grand Canyon National Park,LOC
1,Arizona,GPE
2,Grand Canyon,GPE
3,millions of years,DATE


Now let's apply it to the speech data. Each speech generates its own dataframe of identified entities and labels, so we can collect these dataframes in a dictionary keyed by the speech's position in the data.

In [59]:
df = pd.read_csv('df.csv', index_col=0)
df

Unnamed: 0,speech,sent count,words per sent,avg syll per word,Dale-Chall score,SMOG score,FleschReadingEase
0,"Your excellencies, delegates, ladies and gentl...",40,16.525,1.5,8.11,11.4,63.19
1,"Thank you, Professor Klaus Schwab, Hilde Schwa...",32,17.96875,1.5,8.48,11.3,61.67
2,"Your excellencies, ladies, and gentlemen. ‘We ...",28,12.857143,1.5,8.27,9.8,66.84


In [4]:
# function to return a dictionary of dataframes containing entities and labels
def SpacyEnt_dict(frameCol):
    # instantiate empty dictionary
    Result_dict_ = {}
    # loop through the speeches, identify entities and labels, and store them in individual dataframes
    for i in range(len(frameCol)):
        Ent_ = []
        Labels_ = []
        doc_ = nlp(frameCol[i])
        for ent in doc_.ents:
            Ent_.append(ent.text)
            Labels_.append(ent.label_)
        frame_ = pd.DataFrame({'Entity': Ent_,
                               'Label':Labels_})
        Result_dict_[i] = frame_
    return Result_dict_

In [88]:
# apply onto the speech data
SpacySpeech_dict = SpacyEnt_dict(df['speech'])

In [82]:
# COP26 speech
SpacySpeech_dict[0]

Unnamed: 0,Entity,Label
0,the next two weeks,DATE
1,between 180 and 300,CARDINAL
2,"just over 10,000 years ago",DATE
3,Earth,LOC
4,first,ORDINAL
5,"the last 10,000 years",DATE
6,one,CARDINAL
7,Today,DATE
8,today,DATE
9,Earth,LOC


In [83]:
# Davos speech
SpacySpeech_dict[1]

Unnamed: 0,Entity,Label
0,Klaus Schwab,PERSON
1,Hilde Schwab,ORG
2,the World Economic Forum,ORG
3,"12,000-year",DATE
4,today,DATE
5,one,CARDINAL
6,Holocene,ORG
7,The Garden of Eden,ORG
8,first,ORDINAL
9,1979,DATE


In [84]:
# Poland speech
SpacySpeech_dict[2]

Unnamed: 0,Entity,Label
0,the United Nations’,ORG
1,the UN Charter,PRODUCT
2,thousands of years,DATE
3,The United Nations,ORG
4,Paris,GPE
5,the United Nations,ORG
6,People’s Seat,ORG
7,today,DATE
8,the last two weeks,DATE
9,the ‘Voice of the People’:,ORG
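
The COP26 rows displayed above (the first of the three tables) are mostly dates and numbers rather than names. A minimal sketch of how one might summarise and filter the labels follows, with those ten rows copied in by hand. `NUMERIC_LABELS` is a hypothetical helper set chosen for illustration; it is not something spacy defines, although the label strings in it are ones the English models emit.

```python
from collections import Counter

# Rows copied by hand from the COP26 table shown above
rows = [
    ("the next two weeks", "DATE"),
    ("between 180 and 300", "CARDINAL"),
    ("just over 10,000 years ago", "DATE"),
    ("Earth", "LOC"),
    ("first", "ORDINAL"),
    ("the last 10,000 years", "DATE"),
    ("one", "CARDINAL"),
    ("Today", "DATE"),
    ("today", "DATE"),
    ("Earth", "LOC"),
]

# How often each label occurs in the displayed rows
label_counts = Counter(label for _, label in rows)
print(label_counts.most_common())

# Hypothetical set of temporal/numeric labels to drop (an assumption for this sketch)
NUMERIC_LABELS = {"DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "MONEY", "QUANTITY"}

# Keep only rows whose label is not temporal or numeric
named = [(text, label) for text, label in rows if label not in NUMERIC_LABELS]
print(named)
```

On these ten rows the filter leaves only the two "Earth" mentions, which suggests filtering by label before reading too much into raw entity counts.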


Let's try nltk next. Since we now have a better understanding of the overall approach, we adopt a similar approach to the one above and wrap it in a function.

In [7]:
# function to return a dictionary of dataframes containing entities and labels
def NltkEnt_dict(frameCol):
    # instantiate empty dictionary
    Result_dict_ = {}
    # loop through the speeches, identify entities and labels, and store them in individual dataframes
    for i in range(len(frameCol)):
        Ent_ = []
        Labels_ = []
        for sent in sent_tokenize(frameCol[i]):
            # Chunk the tags that were tagged on the words of each sentence
            for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
                if hasattr(chunk, 'label'):
                    # join multi-word entities with a space, as in the test sentence above
                    Ent_.append(' '.join(c[0] for c in chunk))
                    Labels_.append(chunk.label())

        # build the dataframe once per speech, after all its sentences are processed
        frame_ = pd.DataFrame({'Entity': Ent_, 'Label': Labels_})
        Result_dict_[i] = frame_
    return Result_dict_

In [105]:
# apply onto speech data
NltkSpeech_dict = NltkEnt_dict(df['speech'])

In [106]:
# COP26 speech
NltkSpeech_dict[0]

Unnamed: 0,Entity,Label
0,Earth,PERSON
1,Celsius,PERSON
2,Earth,LOCATION
3,Affordable,GPE
4,Nature,GPE


In [107]:
# Davos speech
NltkSpeech_dict[1]

Unnamed: 0,Entity,Label
0,Professor Klaus Schwab,PERSON
1,Hilde Schwab,PERSON
2,Davos,GPE
3,Global,GPE
4,Holocene,ORGANIZATION
5,Garden,ORGANIZATION
6,Eden,GPE
7,Anthropocene,ORGANIZATION
8,Humans,GPE
9,Blue,ORGANIZATION


In [108]:
# Poland speech
NltkSpeech_dict[2]

Unnamed: 0,Entity,Label
0,United Nations,ORGANIZATION
1,UN,ORGANIZATION
2,Climate,PERSON
3,United Nations,ORGANIZATION
4,Paris,GPE
5,United Nations,ORGANIZATION
6,People,ORGANIZATION
7,Seat,GPE
8,Time,GPE
9,Sir,PERSON


In this simple example, we explored NER using both nltk and spacy. Results vary with the underlying algorithms used in these packages and with the complexity of the text inputs.
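
Part of the variation is only naming: in the outputs above nltk uses LOCATION and ORGANIZATION where spacy uses LOC and ORG. The sketch below assumes a hand-written mapping `NLTK_TO_SPACY` and a hypothetical helper `labels_by_entity` (neither comes from either library) and puts the COP26 rows shown earlier on a common label scheme before comparing them.

```python
from collections import defaultdict

# Hand-written mapping (an assumption for this sketch); other labels pass through unchanged
NLTK_TO_SPACY = {"LOCATION": "LOC", "ORGANIZATION": "ORG"}

def labels_by_entity(rows, mapping=None):
    """Group (entity, label) rows into {entity: set of labels}, renaming labels via mapping."""
    mapping = mapping or {}
    grouped = defaultdict(set)
    for text, label in rows:
        grouped[text].add(mapping.get(label, label))
    return dict(grouped)

# Rows copied by hand from the COP26 tables shown earlier
nltk_rows = [
    ("Earth", "PERSON"), ("Celsius", "PERSON"), ("Earth", "LOCATION"),
    ("Affordable", "GPE"), ("Nature", "GPE"),
]
spacy_rows = [
    ("the next two weeks", "DATE"), ("between 180 and 300", "CARDINAL"),
    ("just over 10,000 years ago", "DATE"), ("Earth", "LOC"),
    ("first", "ORDINAL"), ("the last 10,000 years", "DATE"),
    ("one", "CARDINAL"), ("Today", "DATE"), ("today", "DATE"), ("Earth", "LOC"),
]

nltk_labels = labels_by_entity(nltk_rows, NLTK_TO_SPACY)
spacy_labels = labels_by_entity(spacy_rows)

# Compare labels only for entity texts that both tools extracted
for text in sorted(set(nltk_labels) & set(spacy_labels)):
    overlap = nltk_labels[text] & spacy_labels[text]
    print(text, sorted(nltk_labels[text]), sorted(spacy_labels[text]), "overlap:", sorted(overlap))
```

On these rows "Earth" is the only entity text the two tools share: spacy labels it LOC both times, while nltk labels it LOC once (after renaming) and PERSON once.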

That's it for now. Hope you had fun reading.