# Entity Recognition Using NLP and Word Frequency with Counter

The epic high-fantasy film trilogy The Lord of the Rings has touched the hearts of many, and the scripts of the three movies have played a great part in the trilogy's success. This project aims to better understand the scripts by using Natural Language Processing (NLP), which is the automatic manipulation of natural language, such as speech and text, by software (Brownlee, 2019).

This project has two parts: entity classification using NLP, and word-frequency counting with Counter. It uses the dataset ‘lotr_scripts.csv’, downloaded from https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data.

For the first part of this project, we apply entity classification only to the dialogue, since it consists of sentences whose words can be classified further. We will use spaCy, an open-source software library for advanced natural language processing written in Python (Choi et al., 2015). spaCy's trained pipeline models are built with artificial neural networks.

In [1]:
#Installing spaCy
!pip3 install spacy



One such trained pipeline is ‘en_core_web_sm’, an English pipeline optimized for CPU. It contains the components tok2vec, tagger, parser, attribute_ruler, lemmatizer, and ner, so common NLP tasks can be run without training a model ourselves.

In [3]:
#import the necessary libraries and packages
import spacy
from spacy import displacy
import pandas as pd
from collections import Counter
#download the trained pipeline model
!python -m spacy download en_core_web_sm



✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
#Using pandas to read the CSV file for visualisation purposes only.
lotr_visualisation = pd.read_csv('lotr_scripts.csv')
lotr_visualisation.head(10)

The dataset shown above contains the characters, their lines, and the movie each line is from.

In [87]:
lotr = pd.read_csv("lotr_scripts.csv")
lotr = lotr['dialog']
#showcasing the first 5 dialogues in the database
lotr.head(5)

0    Oh Smeagol Ive got one! , Ive got a fish Smeag...
1       Pull it in! Go on, go on, go on, pull it in!  
2                                             Arrghh! 
3                                            Deagol!  
4                                            Deagol!  
Name: dialog, dtype: object

In [4]:
#using the trained pipeline model 'en_core_web_sm'
nlp = spacy.load('en_core_web_sm')

## First Part - Entity Classification using NLP

Two spaCy features are central to entity classification: the nlp() call on the loaded pipeline and the .ents attribute. Calling nlp() on a line tokenizes it and runs every pipeline component, including the named entity recognizer, and returns a Doc object. The .ents attribute of that Doc then holds just the words or phrases the model considers entities. To better visualize the lines alongside the recognized entities, we shall make use of displaCy, spaCy's open-source visualizer, which can render named entities.

We focus on two lines from the dataset. The first, labelled sent1, is displayed as follows:

In [5]:
#first sentence to be analyzed
sent1 = nlp(lotr[100])
print(sent1.text)

    My dear Sam, you cannot always be torn in two. , You will     have to be one and whole for many years. You have so much to enjoy and to be     and to do. Your part in the story will go on.  


In [86]:
#printing the entities in the 101st line, from the movie 'The Return of the King'
print(sent1.ents)

(Sam, two, many years)


In [7]:
#visualizing sent1 with displaCy ('distance' is an option of the 'dep' style only, so it is omitted here)
displacy.render(sent1, style='ent', jupyter=True)

The same exercise is done on the 40th line in the database, shown as follows:

In [8]:
#our second sentence
sent2 = nlp(lotr[39])
print(sent2.text)
print(sent2.ents)
displacy.render(sent2, style='ent', jupyter=True)

    ,And thus     it was a Fourth Age of Middle Earth began. And the Fellowship of the Ring,     though eternally bound by friendship and love was ended.            Thirteen months to the day since Gandalf sent us on our long journey we find     ourselves looking upon a familiar sight.  
(a Fourth Age, Middle Earth, the Fellowship of the Ring, Thirteen months to the day)


### What have we got?

It seems that more entities are identified in this line. We shall use this line for further analysis in the next part. 

Another common way to practise NLP is with the Natural Language Toolkit (NLTK), a suite of Python libraries and programs for symbolic and statistical NLP for English.
For this project, the NLTK resources used are ‘punkt’, ‘averaged_perceptron_tagger’, and ‘maxent_ne_chunker’. 'punkt' is a tokenizer that divides a text into a list of sentences, and is trained using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences <https://www.kite.com/python/docs/nltk.punkt>. 'averaged_perceptron_tagger' is one of many taggers available in the NLTK library. It is used to tag words with their parts of speech (POS). 'maxent_ne_chunker' is a chunker tool that contains two pre-trained English named entity chunkers trained on an ACE corpus. Chunking is an alternative to parsing that provides a partial syntactic structure of a sentence, with a limited tree depth, as opposed to full parsing. Despite being more limited than parsing, it is more robust and is sufficient for this project.

A key step in NLP is preprocessing, which cleans the text data and prepares it to be fed to a model (Agrawal, 2021). In this project, preprocessing consists of using word_tokenize to split the sentences into words, and then using pos_tag to attach a part-of-speech tag to each of these tokens.

In [88]:
#using nltk
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')  #word list that the named entity chunker relies on
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/callistastephineyu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/callistastephineyu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/callistastephineyu/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


In [10]:
###preprocessing the text

#creating a function for preprocessing
def preprocess_sent(sent):
    text = word_tokenize(sent)
    tags = pos_tag(text)
    return tags

sent2_preprocessed = preprocess_sent(sent2.text)
#to display the preprocessed sentence 2
sent2_preprocessed

[(',', ','),
 ('And', 'CC'),
 ('thus', 'RB'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('a', 'DT'),
 ('Fourth', 'NNP'),
 ('Age', 'NNP'),
 ('of', 'IN'),
 ('Middle', 'NNP'),
 ('Earth', 'NNP'),
 ('began', 'VBD'),
 ('.', '.'),
 ('And', 'CC'),
 ('the', 'DT'),
 ('Fellowship', 'NNP'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Ring', 'NNP'),
 (',', ','),
 ('though', 'IN'),
 ('eternally', 'RB'),
 ('bound', 'VBN'),
 ('by', 'IN'),
 ('friendship', 'NN'),
 ('and', 'CC'),
 ('love', 'NN'),
 ('was', 'VBD'),
 ('ended', 'VBN'),
 ('.', '.'),
 ('Thirteen', 'JJ'),
 ('months', 'NNS'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('day', 'NN'),
 ('since', 'IN'),
 ('Gandalf', 'NNP'),
 ('sent', 'VBD'),
 ('us', 'PRP'),
 ('on', 'IN'),
 ('our', 'PRP$'),
 ('long', 'JJ'),
 ('journey', 'NN'),
 ('we', 'PRP'),
 ('find', 'VBP'),
 ('ourselves', 'PRP'),
 ('looking', 'VBG'),
 ('upon', 'IN'),
 ('a', 'DT'),
 ('familiar', 'JJ'),
 ('sight', 'NN'),
 ('.', '.')]

What do we have here? It is a list of tuples, each containing a word from the sentence and its corresponding part-of-speech (POS) tag. Now that the line has been preprocessed, it is ready to be chunked. The output of the chunking process is the list of recognized entities, along with their labels from the NLTK library.
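Before chunking, the tagged list can already be explored with plain Python. The sketch below is only an illustration, not part of the original notebook: it copies a few (word, tag) pairs from the output above and joins consecutive 'NNP' tokens into candidate names. The helper name proper_noun_runs is hypothetical.

```python
from itertools import groupby

# A few (word, tag) pairs copied from the pos_tag output above
tagged = [
    ("a", "DT"), ("Fourth", "NNP"), ("Age", "NNP"), ("of", "IN"),
    ("Middle", "NNP"), ("Earth", "NNP"), ("began", "VBD"),
    ("since", "IN"), ("Gandalf", "NNP"), ("sent", "VBD"),
]

def proper_noun_runs(tagged_words):
    """Join consecutive NNP tokens into candidate names (hypothetical helper)."""
    runs = []
    # groupby splits the list into alternating runs of NNP and non-NNP tokens
    for is_nnp, group in groupby(tagged_words, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            runs.append(" ".join(word for word, _ in group))
    return runs

candidates = proper_noun_runs(tagged)
print(candidates)  # ['Fourth Age', 'Middle Earth', 'Gandalf']
```

This only groups proper nouns; it assigns no labels such as PERSON or GPE, which is what the chunker adds in the next step.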

In [11]:
#named entities
named_entities = nltk.ne_chunk(sent2_preprocessed)
print(named_entities)

(S
  ,/,
  And/CC
  thus/RB
  it/PRP
  was/VBD
  a/DT
  (ORGANIZATION Fourth/NNP Age/NNP)
  of/IN
  (GPE Middle/NNP Earth/NNP)
  began/VBD
  ./.
  And/CC
  the/DT
  (ORGANIZATION Fellowship/NNP)
  of/IN
  the/DT
  Ring/NNP
  ,/,
  though/IN
  eternally/RB
  bound/VBN
  by/IN
  friendship/NN
  and/CC
  love/NN
  was/VBD
  ended/VBN
  ./.
  Thirteen/JJ
  months/NNS
  to/TO
  the/DT
  day/NN
  since/IN
  (PERSON Gandalf/NNP)
  sent/VBD
  us/PRP
  on/IN
  our/PRP$
  long/JJ
  journey/NN
  we/PRP
  find/VBP
  ourselves/PRP
  looking/VBG
  upon/IN
  a/DT
  familiar/JJ
  sight/NN
  ./.)


### What is the conclusion here?
We can see that the Fellowship is recognised as an organisation and Gandalf as a person. Middle Earth is labelled GPE (geo-political entity), the label NLTK uses for places such as countries and cities. 'Fourth Age', however, is also labelled as an organisation, so the chunker is not always right. The NLTK POS tag 'NNP' indicates a proper noun. The definitions of the other tags displayed above can be found at https://www.guru99.com/pos-tagging-chunking-nltk.html.
The entity recognition done in spaCy seems to give a different output from NLTK. Why is that so? To answer this question, we first need to understand how these two libraries were built. NLTK is a string processing library, where each function takes strings as input and returns a processed string. In contrast, spaCy takes an object-oriented approach: each function returns objects instead of strings or arrays (Kakarla, 2019). The two libraries also ship different pretrained models with different label sets, so some disagreement is to be expected. In terms of usability, both seem to get the job done well!


## Second Part - Word Frequency with Counter

Now we find ourselves in the second part of this project, which is to identify the character that spoke the most lines throughout the three movies. We will then see if the character that speaks the most is the main character or not.

To do this, the CSV file is read with pandas once more, this time keeping only the ‘char’ column, which names the character speaking each line. Each character therefore appears many times, and counting those repeats with Counter from Python's collections module shows which characters spoke the most lines.

The names in the column are collected into the list ‘characters’ and passed to Counter, which accepts any iterable or mapping. The most_common() method then returns the names that occur most often in the ‘char’ column, and the top 10 are displayed.

In [53]:
#loading the LOTR script CSV file again, this time only extracting the characters and not their lines. 
lotr2 = pd.read_csv("lotr_scripts.csv")
lotr2 = lotr2['char']
lotr2.head(5)
characters = [char for char in lotr2]
count = Counter(characters)
top_ten = count.most_common(10)
print(top_ten)

[('FRODO', 225), ('SAM', 216), ('GANDALF', 204), ('ARAGORN', 185), ('PIPPIN', 163), ('MERRY', 137), ('GOLLUM', 133), ('GIMLI', 116), ('THEODEN', 110), ('FARAMIR', 65)]


It seems that the character ‘Frodo’ speaks the most lines (225 across the three movies), followed by his best friend and gardener, Sam, and then Gandalf the wizard.
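To put these counts in proportion, the minimal sketch below rebuilds a Counter from the numbers printed above (Counter also accepts a mapping) and expresses the top three as shares. Note the assumption: the shares are relative to the ten listed speakers only, not to the whole script, because the total number of lines is not shown here.

```python
from collections import Counter

# Counts copied from the top-ten output above
line_counts = Counter({
    "FRODO": 225, "SAM": 216, "GANDALF": 204, "ARAGORN": 185, "PIPPIN": 163,
    "MERRY": 137, "GOLLUM": 133, "GIMLI": 116, "THEODEN": 110, "FARAMIR": 65,
})

# most_common(n) returns (item, count) pairs sorted by descending count
top_three = line_counts.most_common(3)

# Total over these ten speakers only (an assumption of this sketch)
total_top_ten = sum(line_counts.values())

for name, count in top_three:
    print(f"{name}: {count} lines, {count / total_top_ten:.1%} of the top-ten total")
```

Frodo, Sam and Gandalf are close together, so the ranking says more about a shared lead than about one dominant speaker.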

Frodo may be the main character in the Lord of the Rings, but which characters are mentioned the most in the dialogues of the three movies? To answer this, we shall use some features from the NLTK library.

In [57]:
#which characters are mentioned the most throughout the three movies?
script = [line for line in lotr]
#to demonstrate what script is:
print(script[:5])

['Oh Smeagol Ive got one! , Ive got a fish Smeagol, Smeagol!    ', 'Pull it in! Go on, go on, go on, pull it in! \xa0', 'Arrghh! ', 'Deagol! \xa0', 'Deagol! \xa0']


We also keep in mind that the scripts contain stop words such as “a”, “the”, “is”, and “are”, which we do not want in the list of most common words. Hence, we import the ‘stopwords’ corpus from the NLTK library. RegexpTokenizer is also imported, as it splits a string into substrings according to a regular expression that we specify. Here we use the pattern r'\w+', which matches runs of letters, digits, and underscores, so punctuation and whitespace are dropped. The result is a list of individual words instead of sentences.

To further clean the tokenized list, we create a new list called words_clean that contains all the tokenized words that are not stop words. The token ‘xa0’, left over from the no-break (hard) space character '\xa0' in the data, is also removed so that words_clean is free from unwanted symbols.

In [73]:
#Importing stop words and Regexp for punctuation removal
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
stop_words = nltk.corpus.stopwords.words("english")


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/callistastephineyu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [85]:
#tokenizing the script
tokenizer = RegexpTokenizer(r'\w+')
tokenized = tokenizer.tokenize(str(script))
print(tokenized[:20])

['Oh', 'Smeagol', 'Ive', 'got', 'one', 'Ive', 'got', 'a', 'fish', 'Smeagol', 'Smeagol', 'Pull', 'it', 'in', 'Go', 'on', 'go', 'on', 'go', 'on']


In [89]:
#cleaning words
# 'xa0' appears quite often in the tokenized list.
#It is left over from the no-break (hard) space '\xa0', which str(script) writes out as the escaped text \xa0. A simple way to remove it is as follows:
words_clean = [word for word in tokenized if word.lower() not in stop_words and word !='xa0']
print(words_clean[:20])

['Oh', 'Smeagol', 'Ive', 'got', 'one', 'Ive', 'got', 'fish', 'Smeagol', 'Smeagol', 'Pull', 'Go', 'go', 'go', 'pull', 'Arrghh', 'Deagol', 'Deagol', 'Deagol', 'Give']


As demonstrated, the stop words are gone from the tokenized list! 
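A note on where 'xa0' comes from: str(script) converts the whole list to its printed form, in which the no-break space is written as the four characters \xa0, and the \w+ pattern then picks up 'xa0' as a word. The sketch below illustrates this with two lines copied from the script sample, using the standard-library re.findall with the same pattern as a stand-in for RegexpTokenizer. Joining the lines with " ".join() is an alternative suggested here, not what the notebook does.

```python
import re

# Two lines copied from the script sample above; '\xa0' is a no-break space
script = ["Pull it in! Go on, go on, go on, pull it in! \xa0", "Deagol! \xa0"]

# str(list) uses repr(), which writes the no-break space as the text \xa0
as_repr = str(script)
tokens_from_repr = re.findall(r"\w+", as_repr)

# Joining the lines keeps the real character, which \w+ does not match
as_text = " ".join(script)
tokens_from_text = re.findall(r"\w+", as_text)

print("xa0" in tokens_from_repr)  # True
print("xa0" in tokens_from_text)  # False
```

With the joined text, no separate filter for 'xa0' would be needed.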

Again, Counter is used to count the number of occurrences of each word in words_clean. The top ten words are shown below:

In [84]:
count2 = Counter(words_clean)
top_ten_words = count2.most_common(10)
print(top_ten_words)

[('Frodo', 147), ('us', 110), ('one', 87), ('go', 86), ('must', 83), ('come', 76), ('Gandalf', 68), ('would', 68), ('back', 67), ('know', 67)]
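Two caveats apply to this count. Counter is case-sensitive, so 'Go' and 'go' are tallied separately, and the list mixes character names with ordinary words. The sketch below shows one possible refinement on a small hypothetical sample (the real input would be words_clean): it lowercases every token and keeps only tokens found in a set of character names. The variable names and the sample are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample of cleaned tokens; the real list is words_clean
words = ["Frodo", "go", "Go", "Gandalf", "frodo", "Sam", "GANDALF", "Frodo"]

# Speaker names taken from the 'char' counts earlier in this notebook
character_names = {"frodo", "sam", "gandalf", "aragorn", "pippin", "merry"}

# Lowercase first so 'Go'/'go' and 'Frodo'/'frodo' share one count
all_counts = Counter(word.lower() for word in words)

# Keep only tokens that are character names
mention_counts = Counter(
    word.lower() for word in words if word.lower() in character_names
)

print(all_counts["go"])               # 2
print(mention_counts.most_common(2))  # [('frodo', 3), ('gandalf', 2)]
```

Applied to the full script, this would give a ranking of name mentions only, which answers the question of which characters are mentioned the most more directly.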


### Wrapping up

Interesting! Once stop words are removed, the word spoken most often throughout the trilogy is ‘Frodo’! His name is mentioned 147 times across the three movies, so Frodo really is central to the story. Gandalf’s name is called more often than Sam’s, though. I wonder why that is...? You may infer the answer by reading the Lord of the Rings character analysis on this website: https://literariness.org/2021/02/18/the-lord-of-the-rings-character-analysis/

By now, we have successfully performed entity recognition and word frequency counting. These are just two simple examples of the many things we can achieve through Natural Language Processing! 



### References
Choi et al., 2015. It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool.

Agrawal, R., 2021. Must known techniques for text preprocessing in NLP. Analytics Vidhya. Available at: https://www.analyticsvidhya.com/blog/2021/06/must-known-techniques-for-text-preprocessing-in-nlp/ [Accessed January 18, 2022]. 

Brownlee, J., 2019. What is natural language processing? Machine Learning Mastery. Available at: https://machinelearningmastery.com/natural-language-processing/ [Accessed January 18, 2022]. 

Kakarla, S., 2021. Natural language processing: NLTK vs spacy. ActiveState. Available at: https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/ [Accessed January 18, 2022]. 