## TI3160TU: Natural Language Processing - NLP Tagging (Part-of-Speech and Named Entity Recognition) Lab

In this hands-on lab, we will explore how we can perform NLP tagging, particularly focusing on Part-of-Speech (PoS) and Named Entity Recognition (NER). To demonstrate how we can perform PoS and NER tagging, we are going to use two popular libraries (NLTK and SpaCy). We will demonstrate the applicability of these methods on data obtained from Reddit. This lab consists of the following parts:

0. **Loading dataset from Reddit**
1. **Performing Part of Speech Tagging** 
2. **Performing Named Entity Recognition**

### 0. Loading dataset from Reddit

For the purposes of this lab, we will use comments posted on the /r/politics subreddit on Reddit.

In [1]:
# we need the library json as the reddit data is stored in line-delimited json objects
# (one json object in each line, with each line representing a Reddit comment)
import json

# function to load all comment data into a list of strings
# Input: the path of the file including our data
# Output: a list of strings including the body of the Reddit comments
def load_reddit_comment_data(data_directory):
    
    comments_data = [] # list object that will store the loaded Reddit comments
    
    # we first open the file that includes our dataset
    with open(data_directory, 'r', encoding='utf-8') as f:
        # iterate the file, reading it line by line
        for line in f:
            # parse the JSON object on this line into a Python dictionary
            data = json.loads(line)
            
            # append the comment
            comments_data.append(data['body'])
    
    # the method returns all the loaded Reddit comments
    return comments_data
            
# our data is stored in this file
data_dir = './comments_politics_sample.ndjson'
# let's load our dataset into memory
reddit_data = load_reddit_comment_data(data_dir)
print("Successfully loaded Reddit comments! Our dataset includes %d Reddit comments!" %len(reddit_data))

Successfully loaded Reddit comments! Our dataset includes 10000 Reddit comments!
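
The loader above assumes that every line is a valid JSON object with a `body` field. Real Reddit exports can contain blank lines or comments whose text has been replaced by a placeholder. The following is a minimal, more defensive sketch. The function name `iter_comment_bodies` and the decision to skip `[deleted]` and `[removed]` bodies are assumptions made for illustration, not part of the lab's dataset description.

```python
import io
import json

# Hypothetical helper (not part of the lab code): yields comment bodies from
# any iterable of NDJSON lines and skips lines that cannot be used.
def iter_comment_bodies(lines, skip_placeholders=("[deleted]", "[removed]")):
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore malformed lines instead of crashing
        if not isinstance(record, dict):
            continue  # a valid JSON value that is not an object
        body = record.get("body")
        # keep only real text bodies (assumption: placeholders are noise)
        if isinstance(body, str) and body not in skip_placeholders:
            yield body

# Small in-memory example standing in for the real file
sample = io.StringIO(
    '{"body": "First comment"}\n'
    '\n'
    '{"body": "[deleted]"}\n'
    'not valid json\n'
    '{"author": "someone"}\n'
    '{"body": "Second comment"}\n'
)
bodies = list(iter_comment_bodies(sample))
print(bodies)
```

With the real file, the same helper could be applied to an open file handle, for example `list(iter_comment_bodies(f))` inside the `with open(...)` block.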


### 1. Performing Part of Speech Tagging

#### 1.1 Using NLTK

Let's start with PoS tagging using the NLTK library. By default, NLTK uses the Penn Treebank tagset (see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).



In [2]:
import nltk # importing nltk library
from nltk.tokenize import word_tokenize # method to tokenize text

# example for demonstration purposes
example = "Google, headquartered in Mountain View (1600 Amphitheatre Pkwy, Mountain View, CA 940430), unveiled the new Android phone for $799 at the Consumer Electronic Show. Sundar Pichai said in his keynote that users love their new Android phones."


# method to perform PoS tagging using NLTK
# INPUT: Raw text
# Output: A list of tuples including each word and its respective PoS tag
def pos_tagging_nltk(text):
    
    # first we tokenize the raw text
    tokenized_text = nltk.word_tokenize(text)
    
    # extract the PoS tags
    pos_tags = nltk.pos_tag(tokenized_text)
    
    # return the PoS tags
    return pos_tags

# lets run the method and explore the output
pos_tagging_nltk(example)

[('Google', 'NNP'),
 (',', ','),
 ('headquartered', 'VBD'),
 ('in', 'IN'),
 ('Mountain', 'NNP'),
 ('View', 'NNP'),
 ('(', '('),
 ('1600', 'CD'),
 ('Amphitheatre', 'NNP'),
 ('Pkwy', 'NNP'),
 (',', ','),
 ('Mountain', 'NNP'),
 ('View', 'NNP'),
 (',', ','),
 ('CA', 'NNP'),
 ('940430', 'CD'),
 (')', ')'),
 (',', ','),
 ('unveiled', 'VBD'),
 ('the', 'DT'),
 ('new', 'JJ'),
 ('Android', 'NNP'),
 ('phone', 'NN'),
 ('for', 'IN'),
 ('$', '$'),
 ('799', 'CD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('Consumer', 'NNP'),
 ('Electronic', 'NNP'),
 ('Show', 'NNP'),
 ('.', '.'),
 ('Sundar', 'NNP'),
 ('Pichai', 'NNP'),
 ('said', 'VBD'),
 ('in', 'IN'),
 ('his', 'PRP$'),
 ('keynote', 'NN'),
 ('that', 'IN'),
 ('users', 'NNS'),
 ('love', 'VBP'),
 ('their', 'PRP$'),
 ('new', 'JJ'),
 ('Android', 'NNP'),
 ('phones', 'NNS'),
 ('.', '.')]

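NLTK also supports other tagsets. For instance, let's try PoS tagging with the universal tagset. Note that NLTK's `universal` tagset is the coarse 12-tag set that preceded the Universal Dependencies tags (see https://universaldependencies.org/u/pos/), which is why proper nouns are reported as `NOUN` and punctuation as `.` in the output of this cell.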

In [3]:
# make sure that we have downloaded the universal tagset
nltk.download('universal_tagset', quiet=True)

# method to perform PoS tagging using NLTK and a specific tagset
# INPUT: Raw text and the tagset that we want to use
# Output: A list of tuples including each word and its respective PoS tag
def pos_tagging_nltk_tagset(text, tagset):
    tokenized_text = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokenized_text, tagset=tagset)
    return pos_tags

pos_tagging_nltk_tagset(example, 'universal')

[('Google', 'NOUN'),
 (',', '.'),
 ('headquartered', 'VERB'),
 ('in', 'ADP'),
 ('Mountain', 'NOUN'),
 ('View', 'NOUN'),
 ('(', '.'),
 ('1600', 'NUM'),
 ('Amphitheatre', 'NOUN'),
 ('Pkwy', 'NOUN'),
 (',', '.'),
 ('Mountain', 'NOUN'),
 ('View', 'NOUN'),
 (',', '.'),
 ('CA', 'NOUN'),
 ('940430', 'NUM'),
 (')', '.'),
 (',', '.'),
 ('unveiled', 'VERB'),
 ('the', 'DET'),
 ('new', 'ADJ'),
 ('Android', 'NOUN'),
 ('phone', 'NOUN'),
 ('for', 'ADP'),
 ('$', '.'),
 ('799', 'NUM'),
 ('at', 'ADP'),
 ('the', 'DET'),
 ('Consumer', 'NOUN'),
 ('Electronic', 'NOUN'),
 ('Show', 'NOUN'),
 ('.', '.'),
 ('Sundar', 'NOUN'),
 ('Pichai', 'NOUN'),
 ('said', 'VERB'),
 ('in', 'ADP'),
 ('his', 'PRON'),
 ('keynote', 'NOUN'),
 ('that', 'ADP'),
 ('users', 'NOUN'),
 ('love', 'VERB'),
 ('their', 'PRON'),
 ('new', 'ADJ'),
 ('Android', 'NOUN'),
 ('phones', 'NOUN'),
 ('.', '.')]
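
Comparing the two NLTK outputs, the universal tags look like a many-to-one collapse of the Penn Treebank tags: `NNP`, `NN`, and `NNS` all become `NOUN`, and every punctuation tag becomes `.`. The sketch below reproduces that collapse for the tags that occur in our example. `PTB_TO_UNIVERSAL` and `to_universal` are hand-written, hypothetical names for illustration; this is a partial table, not NLTK's own mapping resource.

```python
# Hand-written, partial mapping from Penn Treebank tags to coarse universal
# tags. It only covers tags seen in the example sentence of this lab.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VBD": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "IN": "ADP", "DT": "DET", "JJ": "ADJ", "CD": "NUM",
    "PRP": "PRON", "PRP$": "PRON",
    ",": ".", ".": ".", "(": ".", ")": ".", "$": ".",
}

# Convert (word, Penn tag) pairs to (word, universal tag) pairs.
# Tags missing from the table fall back to "X" (the "other" category).
def to_universal(tagged_tokens, unknown="X"):
    return [(word, PTB_TO_UNIVERSAL.get(tag, unknown)) for word, tag in tagged_tokens]

# A few pairs copied from the Penn Treebank output above
penn_tagged = [
    ("Google", "NNP"), (",", ","), ("headquartered", "VBD"),
    ("in", "IN"), ("his", "PRP$"), ("$", "$"),
]
print(to_universal(penn_tagged))
```

The result for these tokens agrees with the universal-tagset output listed above, which illustrates how much detail (for example, proper versus common nouns) the coarse tagset gives up.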

#### 1.2 Using SpaCy

In [4]:
# first we import the spacy library
import spacy

# SpaCy offers many models for a variety of NLP tasks. For this lab we use "en_core_web_sm", which is one of the smaller English models
nlp = spacy.load('en_core_web_sm')

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Method that extracts the PoS tags for each word 
# Input: The spacy model that we are going to use as well as the input text
# Output: a list of tuples including the word and its PoS tag
def pos_tagging_spacy(model, text):
    
    # list to store the output
    output = []
    
    # run our model on the input text
    doc = model(text)
    
    # iterate the results and store the tags in the output
    for token in doc:
        output.append((token.text, token.pos_))
    
    # return the output 
    return output

# run the PoS tagging method on the example and observe the output
pos_tagging_spacy(nlp, example)


[('Google', 'PROPN'),
 (',', 'PUNCT'),
 ('headquartered', 'VERB'),
 ('in', 'ADP'),
 ('Mountain', 'PROPN'),
 ('View', 'PROPN'),
 ('(', 'PUNCT'),
 ('1600', 'NUM'),
 ('Amphitheatre', 'PROPN'),
 ('Pkwy', 'PROPN'),
 (',', 'PUNCT'),
 ('Mountain', 'PROPN'),
 ('View', 'PROPN'),
 (',', 'PUNCT'),
 ('CA', 'PROPN'),
 ('940430', 'NUM'),
 (')', 'PUNCT'),
 (',', 'PUNCT'),
 ('unveiled', 'VERB'),
 ('the', 'DET'),
 ('new', 'ADJ'),
 ('Android', 'PROPN'),
 ('phone', 'NOUN'),
 ('for', 'ADP'),
 ('$', 'SYM'),
 ('799', 'NUM'),
 ('at', 'ADP'),
 ('the', 'DET'),
 ('Consumer', 'PROPN'),
 ('Electronic', 'PROPN'),
 ('Show', 'PROPN'),
 ('.', 'PUNCT'),
 ('Sundar', 'PROPN'),
 ('Pichai', 'PROPN'),
 ('said', 'VERB'),
 ('in', 'ADP'),
 ('his', 'PRON'),
 ('keynote', 'NOUN'),
 ('that', 'SCONJ'),
 ('users', 'NOUN'),
 ('love', 'VERB'),
 ('their', 'PRON'),
 ('new', 'ADJ'),
 ('Android', 'PROPN'),
 ('phones', 'NOUN'),
 ('.', 'PUNCT')]
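
The method above reads `token.pos_`, which holds SpaCy's coarse-grained tag. Each token also exposes a fine-grained tag through `token.tag_`, which for the English models follows a Penn Treebank style and is therefore easier to compare with the default NLTK output. A minimal sketch, assuming `model` is a loaded SpaCy pipeline such as `nlp`; the function name `pos_tagging_spacy_detailed` is hypothetical.

```python
# Sketch: return (text, coarse tag, fine-grained tag) for every token.
# `model` is expected to be a loaded SpaCy pipeline (e.g., the `nlp` object).
def pos_tagging_spacy_detailed(model, text):
    # run the pipeline on the raw text
    doc = model(text)
    # token.pos_ is the coarse tag, token.tag_ the fine-grained one
    return [(token.text, token.pos_, token.tag_) for token in doc]
```

It can be called in the same way as the method above, e.g., `pos_tagging_spacy_detailed(nlp, example)`.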

The SpaCy library also offers some nice visualization tools. Let's see how it visualizes the dependency graph and the PoS tags of this simple example.

In [5]:
# render the results 
from spacy import displacy
doc = nlp(example)
displacy.render(doc, style='dep',jupyter=True)

Now that we have methods to extract PoS tags using either NLTK or SpaCy, let's run them on all the comments in our Reddit dataset. As an example, let's find the most popular (word, PoS tag) pairs in our dataset.

In [6]:
# helper methods for counting stuff
from collections import Counter

# lets use the Pandas library for visualizing the results 
import pandas as pd

# lists to store our results (words and PoS tags for each library)
nltk_pos_results = []
spacy_pos_results = []

# iterate through the posts in our Reddit dataset
for post in reddit_data:
    
    # use our method to extract PoS tags using the NLTK library
    post_pos_tags_nltk = pos_tagging_nltk(post)
    
    # use our method to extract PoS tags using the SpaCy library
    post_pos_tags_spacy = pos_tagging_spacy(nlp, post)
    
    # store the results of this post to the lists we defined outside of the loop
    nltk_pos_results.extend(post_pos_tags_nltk)
    spacy_pos_results.extend(post_pos_tags_spacy)
    
    
# now we have a list of tuples including all the words and the identified PoS tags
# we need to count the most popular ones and report the results

# first we count all instances and extract the 10 most common pair (word-PoS tag)
counts = Counter(nltk_pos_results).most_common(10)

# flatten the results: each item is a nested tuple ((word, pos-tag), count) and we want a flat tuple (word, pos-tag, count)
counts_flatten = [(a, b, c) for (a, b), c in counts]

# create a Pandas Dataframe with the results including the words, pos tags, and the counts of each pair 
nltk_popular_tags = pd.DataFrame(counts_flatten, columns=['Word (NLTK)', 'PoS tag (NLTK)', 'Count (NLTK)'])

#  we count all instances and extract the 10 most common pair (word-PoS tag). this is for the spacy results
counts = Counter(spacy_pos_results).most_common(10)

# flatten the nested tuples again
counts_flatten = [(a, b, c) for (a, b), c in counts]

# create a dataframe with the results
spacy_popular_tags = pd.DataFrame(counts_flatten, columns=['Word (SpaCy)', 'PoS tag (SpaCy)', 'Count (SpaCy)'])

# concatenate both dataframes (counts from NLTK and SpaCy results)
popular_pos_tags = pd.concat([nltk_popular_tags, spacy_popular_tags], axis=1)
popular_pos_tags

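|   | Word (NLTK) | PoS tag (NLTK) | Count (NLTK) | Word (SpaCy) | PoS tag (SpaCy) | Count (SpaCy) |
|---|---|---|---|---|---|---|
| 0 | . | . | 15289 | . | PUNCT | 15114 |
| 1 | the | DT | 10288 | the | DET | 10292 |
| 2 | , | , | 9030 | , | PUNCT | 9026 |
| 3 | to | TO | 7164 | a | DET | 5636 |
| 4 | a | DT | 5646 | and | CCONJ | 5291 |
| 5 | and | CC | 5269 | to | PART | 5001 |
| 6 | of | IN | 4900 | of | ADP | 4875 |
| 7 | I | PRP | 3967 | I | PRON | 3984 |
| 8 | is | VBZ | 3721 | is | AUX | 3718 |
| 9 | in | IN | 3465 | in | ADP | 3436 |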


Inspect the above dataframe and think about what meaningful insights we can extract from the top 10 words and PoS tags. What differences do you observe across the libraries? Why are the results slightly different, given that we use the same input dataset?

Overall, this example highlights the need to always validate and evaluate how well each method works. 
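
The dataframe above counts (word, PoS tag) pairs, so it is dominated by punctuation and function words. To see which tags are most frequent regardless of the word, we can count the tags alone. The sketch below does this for any list of (word, tag) tuples, such as `nltk_pos_results` or `spacy_pos_results`. The function name `tag_distribution` and the small hand-made sample are assumptions for illustration.

```python
from collections import Counter

# Count how often each tag occurs, ignoring the words themselves.
# Returns the top_n tags as (tag, count, share of all tokens).
def tag_distribution(tagged_tokens, top_n=10):
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(tag_counts.values())
    return [(tag, count, count / total) for tag, count in tag_counts.most_common(top_n)]

# Tiny hand-made sample in the same (word, tag) format as our results lists
sample = [
    ("the", "DET"), ("new", "ADJ"), ("Android", "PROPN"), ("phone", "NOUN"),
    (".", "PUNCT"), ("the", "DET"), ("users", "NOUN"), ("love", "VERB"),
    ("phones", "NOUN"), (".", "PUNCT"),
]
for tag, count, share in tag_distribution(sample, top_n=3):
    print(f"{tag}: {count} ({share:.0%})")
```

Because the two libraries use different tagsets by default, the tag-level counts are only directly comparable if both are mapped to a common tagset first.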

### **Exercise:** Preprocess the Reddit dataset to remove the punctuation characters from the input posts. Then, re-run the PoS tagging methods and generate the new top 10 most popular pairs of words/PoS-tags. What changes do you observe between the models with and without punctuation?

In [7]:
# Insert your code here:

### 2. Performing Named Entity Recognition

In this part of the lab, we are going to perform Named Entity Recognition using the NLTK and SpaCy libraries.

#### 2.1. Performing Named Entity Recognition using NLTK

In [9]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

def extract_entities_nltk(text):
    #tokenize the text first 
    words = word_tokenize(text)
    
    # extract PoS tags
    pos_tags = pos_tag(words)
    
    # NLTK uses PoS tags to identify entities
    tree = ne_chunk(pos_tags)
    
    # Extract named entities from the tree
    entities = []
    for subtree in tree.subtrees():
        if subtree.label() in ['PERSON', 'ORGANIZATION', 'GPE', 'LOCATION', 'DATE', 'TIME', 'MONEY']:
            entity = " ".join([word for word, tag in subtree.leaves()])
            entities.append((entity, subtree.label()))
    # return the results
    return entities


# Extract entities
extract_entities_nltk(example)


[('Google', 'GPE'),
 ('Mountain View', 'GPE'),
 ('Mountain View', 'PERSON'),
 ('Consumer Electronic Show', 'ORGANIZATION'),
 ('Sundar Pichai', 'PERSON')]

As the results show, NLTK is not very accurate at NER. In this simple example, Google is labeled as a Geo-Political Entity (GPE), and Mountain View is labeled once as a GPE and once as a Person. This is mainly because NLTK relies heavily on PoS tags to identify named entities. In most cases, other models or libraries are used for NER, such as SpaCy, which offers more accurate models. Let's see how we can use SpaCy for this task.

#### 2.2. Performing Named Entity Recognition using SpaCy

In [10]:
# method to extract NER entities using the SpaCy library
# Input: the SpaCy NLP model that we are using and the input text
# Output: A list of tuples that includes all the recognized NER entities (text span + NER tag)
def extract_ner_entities_spacy(model, text):
    
    # Process the text using spaCy NLP pipeline
    doc = model(text)
    
    # Extract entities and store them in a list of tuples
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    
    # return the result
    return entities

# lets run our method to the simple example and see the output
extract_ner_entities_spacy(nlp, example)


[('Google', 'ORG'),
 ('Mountain View', 'GPE'),
 ('1600', 'CARDINAL'),
 ('Mountain View', 'GPE'),
 ('CA 940430', 'ORG'),
 ('Android', 'ORG'),
 ('799', 'MONEY'),
 ('the Consumer Electronic Show', 'ORG'),
 ('Sundar Pichai', 'PERSON'),
 ('Android', 'ORG')]

SpaCy also provides visualization tools for entities. Let's see how we can use them to visualize the NER results.

In [11]:
# method to extract NER entities and visualize them using SpaCy
# Input: the SpaCy NLP model that we are using and the input text
# Output: a visualization of the NER tags
def ner_spacy_visualization(model, text):
    doc = model(text)
    displacy.render(doc, style='ent', jupyter=True)

# run our method on the example
ner_spacy_visualization(nlp, example)

Now let's run the entity recognition pipeline on our Reddit dataset and find the ten most popular entities. For this task, we will only use the method that leverages the SpaCy library.

In [12]:
# list to store our results (text spans and NER tags)
spacy_ner_results = []

# iterate through the posts in our Reddit dataset
for post in reddit_data:
    
    # use our method to extract NER tags using the SpaCy library
    post_ner_tags_spacy = extract_ner_entities_spacy(nlp, post)
    # store the results of this post to the list we defined outside of the loop
    spacy_ner_results.extend(post_ner_tags_spacy)
    
    
# now we have a list of tuples including all the words and the identified NER tags
# we need to count the most popular ones and report the results

# first we count all instances and extract the 10 most common pair (text-NER tag)
counts = Counter(spacy_ner_results).most_common(10)

# flatten the results: each item is a nested tuple ((text, ner-tag), count) and we want a flat tuple (text, ner-tag, count)
counts_flatten = [(a, b, c) for (a, b), c in counts]

# create a Pandas Dataframe with the results including the text spans, ner tags, and the counts of each pair 
ner_popular_tags = pd.DataFrame(counts_flatten, columns=['Text', 'NER tag', 'Count'])
ner_popular_tags

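|   | Text | NER tag | Count |
|---|---|---|---|
| 0 | Trump | ORG | 740 |
| 1 | Biden | PERSON | 337 |
| 2 | Republicans | NORP | 234 |
| 3 | US | GPE | 207 |
| 4 | America | GPE | 196 |
| 5 | one | CARDINAL | 188 |
| 6 | first | ORDINAL | 162 |
| 7 | Texas | GPE | 162 |
| 8 | GOP | ORG | 144 |
| 9 | Russia | GPE | 144 |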


The most popular entity is Trump, with 740 occurrences in the dataset. But why is it recognized as an organization (ORG)? The reason lies in the SpaCy model we are using. The model was trained on older data from news articles and the Web, where "Trump" was more likely to appear in mentions of the Trump Organization than as a reference to Donald Trump, the person. In our more recent Reddit dataset, most occurrences actually refer to Trump the person. Again, these results highlight the need to validate the model and make sure that it is appropriate for the data at hand.
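
One simple way to start validating the labels is to check how the same text span is distributed across NER tags, for instance how often "Trump" is tagged as `ORG` versus `PERSON`. The sketch below groups (text, NER tag) pairs, in the same format as `spacy_ner_results`, by text. The function name `label_breakdown` is hypothetical, and the sample pairs are made up for illustration; they are not results from our dataset.

```python
from collections import Counter, defaultdict

# Group (text, label) pairs by text and count the labels for each text.
# Returns a dict-like object: text -> Counter of labels.
def label_breakdown(entity_pairs):
    per_text = defaultdict(Counter)
    for text, label in entity_pairs:
        per_text[text][label] += 1
    return per_text

# Made-up sample pairs (NOT taken from the Reddit results)
sample_pairs = [
    ("Trump", "ORG"), ("Trump", "ORG"), ("Trump", "PERSON"),
    ("Biden", "PERSON"), ("Trump", "GPE"),
]
breakdown = label_breakdown(sample_pairs)
print(breakdown["Trump"].most_common())
```

Running `label_breakdown(spacy_ner_results)["Trump"]` on the real results would show whether the model labels the same name inconsistently, which is a useful starting point for manual inspection.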

SpaCy offers larger models that are more accurate but also need more processing power. For instance, to increase accuracy we can use the "en_core_web_trf" model, which is based on transformers.
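
With heavier models, calling the pipeline once per comment can become slow. SpaCy pipelines provide a `pipe` method that processes an iterable of texts in batches, which may reduce the overhead. A minimal sketch, assuming `model` is a loaded SpaCy pipeline; the function name `count_entities_batched` is hypothetical and the default batch size of 256 is an arbitrary choice that would need tuning.

```python
from collections import Counter

# Sketch: count (text, NER tag) pairs over many texts using batched processing.
# `model` is expected to be a loaded SpaCy pipeline (e.g., the `nlp` object).
def count_entities_batched(model, texts, batch_size=256, top_n=10):
    counts = Counter()
    # model.pipe streams documents in batches instead of one call per text
    for doc in model.pipe(texts, batch_size=batch_size):
        counts.update((ent.text, ent.label_) for ent in doc.ents)
    # return the top_n pairs as ((text, label), count)
    return counts.most_common(top_n)
```

For our dataset this could be called as `count_entities_batched(nlp, reddit_data)`, and the result can be flattened into a dataframe in the same way as before.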

### **Exercise:** Preprocess the Reddit dataset to remove the case (i.e., everything in lower-case). Then, re-run the NER tagging method and generate the new top 10 most popular pairs of text spans/NER tags. What changes do you observe between the models with and without case?


In [13]:
# Insert your code here:
