# Dan Crouthamel – SMU NLP Course — Homework 4

## Assignment Objectives

1. Run one of the part-of-speech (POS) taggers available in Python.
   * Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
   * Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.
2. Run a different tagger in Python. Process the same two sentences from Question 1.
   * Does it produce the same or different output?
   * Explain any differences as best as you can.
3. In a news article from this week's news, find a random sentence of at least 10 words.
   * Looking at the Penn tag set, manually POS tag the sentence yourself.
   * Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
   * Explain any differences between the two taggers and your manual tagging as much as you can.




https://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag

## Solution

### Library Imports

In [2]:
import pandas as pd
import nltk
import spacy

### Question 1 and 2

For **Question 1**, we'll use the `nltk.pos_tag` function to tag parts of speech. Under the covers, it uses an averaged Perceptron tagger with a pretrained model that was trained and tested on parts of a Wall Street Journal corpus. For more information, see the links below.

https://stackoverflow.com/questions/32016545/how-does-nltk-pos-tag-work/41384824
https://www.kaggle.com/nltkdata/averaged-perceptron-tagger

For **Question 2**, we will use the Spacy tagger and then compare the results with the output from Question 1. The model we will use for Spacy is 'en_core_web_sm', a small English pipeline trained on written web text (blogs, news, and comments).

https://spacy.io/api/tagger  
https://spacy.io/models

Let's define a long text and a short text. The long text is taken from a recent Wall Street Journal article. The short text is something I came up with to see whether the taggers get it right.

**Long Text** -> "More U.S. workers are quitting their jobs than at any time in at least two decades, signaling optimism among many professionals while also adding to the struggle companies face trying to keep up with the economic recovery."  

**Short Text** -> "Women need men like fish need bicycles."

In [3]:
sentence = "More U.S. workers are quitting their jobs than at any time in at least two decades, signaling optimism among many professionals while also adding to the struggle companies face trying to keep up with the economic recovery."

# Tokenize the sentence
text = nltk.word_tokenize(sentence)

# Create a data frame and add the Word column
columns = ['Word','NLTK TAG','Spacy Tag']
df = pd.DataFrame(columns=columns)
df['Word'] = text

# Create NLTK Tag Column derived from pos_tag
col_vals = [tag[1] for tag in nltk.pos_tag(text)]
df['NLTK TAG'] = col_vals

# Create Spacy Tag Column derived from Spacy
sp = spacy.load('en_core_web_sm')
col_vals = [word.tag_ for word in sp(sentence)]
df['Spacy Tag'] = col_vals

df

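| Row | Word | NLTK TAG | Spacy Tag |
|---|---|---|---|
| 0 | More | JJR | JJR |
| 1 | U.S. | NNP | NNP |
| 2 | workers | NNS | NNS |
| 3 | are | VBP | VBP |
| 4 | quitting | VBG | VBG |
| 5 | their | PRP$ | PRP$ |
| 6 | jobs | NNS | NNS |
| 7 | than | IN | IN |
| 8 | at | IN | IN |
| 9 | any | DT | DT |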


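The only difference between the two taggers is in row 25, the word 'to' in "adding to the struggle". NLTK tags it as TO, but Spacy tags it as IN. IN is the preposition tag, and 'to' is used as a preposition here, so Spacy's answer is reasonable. I'm not sure why Spacy didn't use TO. A likely explanation is a difference in tagging conventions rather than an error: the original Penn Treebank guidelines tag every occurrence of 'to' as TO, whereas the corpus behind Spacy's English models appears to reserve TO for the infinitive marker.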
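The code cell for the long sentence fills the Word column from `nltk.word_tokenize` and the Spacy Tag column from Spacy's own tokenization, so the columns only line up when both tokenizers split the sentence the same way. Below is a minimal sketch of a check that could run before building the data frame. The helper name `first_token_mismatch` and the token lists are hypothetical; the lists are written by hand for illustration and are not output from either library.

```python
from itertools import zip_longest


def first_token_mismatch(tokens_a, tokens_b):
    """Return the index of the first position where the token lists differ.

    Returns None when both lists contain the same tokens in the same order.
    A length difference counts as a mismatch at the first missing position.
    """
    for index, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return index
    return None


# Hand-written token lists (assumptions for illustration only).
tokens_one = ["U.S.", "workers", "quit"]
tokens_two = ["U.S", ".", "workers", "quit"]

print(first_token_mismatch(tokens_one, tokens_one))  # identical lists
print(first_token_mismatch(tokens_one, tokens_two))  # lists that split "U.S." differently
```

If the check returns an index, the two tag columns describe different tokens from that row onward, and comparing them row by row would be misleading.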

In [4]:
sentence = "Women need men like fish need bicycles."

# Tokenize the sentence
text = nltk.word_tokenize(sentence)

# Create a data frame and add the Word column
columns = ['Word','NLTK TAG','Spacy Tag']
df = pd.DataFrame(columns=columns)
df['Word'] = text

# Create NLTK Tag Column derived from pos_tag
col_vals = [tag[1] for tag in nltk.pos_tag(text)]
df['NLTK TAG'] = col_vals

# Create Spacy Tag Column derived from Spacy
sp = spacy.load('en_core_web_sm')
col_vals = [word.tag_ for word in sp(sentence)]
df['Spacy Tag'] = col_vals

df.T

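| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Word | Women | need | men | like | fish | need | bicycles | . |
| NLTK TAG | NNS | VBP | NNS | IN | JJ | NN | NNS | . |
| Spacy Tag | NNS | VBP | NNS | IN | NN | VBP | NNS | . |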


For the short sentence, I tried to come up with something that I thought could trip the taggers up. NLTK did not get the POS correct for 'fish' (tagged JJ) or for the second 'need' (tagged NN), but Spacy tagged both correctly. I suspect this is primarily due to differences in the data the Perceptron and Spacy taggers were trained on.
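Spotting disagreements by eye gets harder as the sentence grows, so the comparison can also be done in code. This is a small sketch, not part of the original notebook: `tag_disagreements` is a hypothetical helper, and the tag lists are copied by hand from the table above instead of being recomputed with NLTK or Spacy.

```python
def tag_disagreements(words, tags_a, tags_b):
    """Return (index, word, tag_a, tag_b) for every position where the tags differ."""
    if not (len(words) == len(tags_a) == len(tags_b)):
        raise ValueError("words and both tag lists must have the same length")
    return [
        (index, word, tag_a, tag_b)
        for index, (word, tag_a, tag_b) in enumerate(zip(words, tags_a, tags_b))
        if tag_a != tag_b
    ]


# Values copied from the short-sentence table above.
words = ["Women", "need", "men", "like", "fish", "need", "bicycles", "."]
nltk_tags = ["NNS", "VBP", "NNS", "IN", "JJ", "NN", "NNS", "."]
spacy_tags = ["NNS", "VBP", "NNS", "IN", "NN", "VBP", "NNS", "."]

for index, word, nltk_tag, spacy_tag in tag_disagreements(words, nltk_tags, spacy_tags):
    print(f"{index}: {word!r} NLTK={nltk_tag} Spacy={spacy_tag}")
# 4: 'fish' NLTK=JJ Spacy=NN
# 5: 'need' NLTK=NN Spacy=VBP
```

The same helper could be applied to the `NLTK TAG` and `Spacy Tag` columns of the data frame by passing them in as lists.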

### Question 3

The sentence I chose is from the article linked below.

"Last December, as part of the omnibus spending and coronavirus relief package, Congress stipulated a report conducted by multiple agencies must be handed over this month with detailed analysis of UAP sightings by U.S. military members."

https://www.washingtonpost.com/washington-post-live/2021/06/08/ufos-national-security-with-luis-elizondo-former-director-advanced-aerospace-threat-identification-program-aatip/

In the code below I add a column with my manual tagging, done with my limited grammar skills and the Penn Treebank tag definitions linked below. I then combine it with the output from NLTK `pos_tag` and Spacy.

https://www.sketchengine.eu/penn-treebank-tagset/

In [5]:
sentence = "Last December, as part of the omnibus spending and coronavirus relief package, Congress stipulated a report conducted by multiple agencies must be handed over this month with detailed analysis of UAP sightings by U.S. military members."

# Tokenize the sentence
text = nltk.word_tokenize(sentence)

# Create a data frame and add the Word column
columns = ['Word','NLTK TAG','Spacy Tag',"Dan's Tag"]
df = pd.DataFrame(columns=columns)
df['Word'] = text

# Create NLTK Tag Column derived from pos_tag
col_vals = [tag[1] for tag in nltk.pos_tag(text)]
df['NLTK TAG'] = col_vals

# Create Spacy Tag Column derived from Spacy
sp = spacy.load('en_core_web_sm')
col_vals = [word.tag_ for word in sp(sentence)]
df['Spacy Tag'] = col_vals

# Manual tags, one per token (line continuations are implicit inside brackets)
df["Dan's Tag"] = ['JJ', 'NN', ',', 'IN', 'NN', 'IN', 'DT', 'JJ', 'NN', 'CC',
                   'NN', 'NN', 'NN', ',', 'NNP', 'VBD', 'DT', 'NN', 'VBD', 'IN',
                   'JJ', 'NNS', '??', 'VB', 'VB', 'IN', 'DT', 'NN', 'IN', 'JJ',
                   'NN', 'IN', 'NNP', 'NNS', 'IN', 'NNP', 'JJ', 'NNS', '.']

df

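| Row | Word | NLTK TAG | Spacy Tag | Dan's Tag |
|---|---|---|---|---|
| 0 | Last | JJ | JJ | JJ |
| 1 | December | NNP | NNP | NN |
| 2 | , | , | , | , |
| 3 | as | IN | IN | IN |
| 4 | part | NN | NN | NN |
| 5 | of | IN | IN | IN |
| 6 | the | DT | DT | DT |
| 7 | omnibus | JJ | JJ | JJ |
| 8 | spending | NN | NN | NN |
| 9 | and | CC | CC | CC |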


Above we see that the NLTK and Spacy taggers returned the same results. However, my tagging is not the same, primarily because I have forgotten some rules of grammar. Here are some of the things I missed.

* December - Classified as a noun (NN), instead of a proper noun (NNP).
* Conducted - Classified as a past tense verb (VBD), instead of a past participle (VBN).
* Must - I didn't know how to classify this. Spacy and pos_tag returned modal (MD); modals are a kind of helping (auxiliary) verb.
* Over - Classified as a preposition (IN), instead of a particle (RP).
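To put a number on how closely the manual tagging matches a tagger, the share of matching tags can be computed. The sketch below is an illustration under stated assumptions: `agreement` is a hypothetical helper, and the lists contain only the ten rows displayed in the table above (copied by hand), so the result describes those rows and not the full sentence.

```python
def agreement(reference_tags, other_tags):
    """Return the fraction of positions where the two tag lists match."""
    if len(reference_tags) != len(other_tags):
        raise ValueError("tag lists must have the same length")
    if not reference_tags:
        raise ValueError("tag lists must not be empty")
    matches = sum(ref == other for ref, other in zip(reference_tags, other_tags))
    return matches / len(reference_tags)


# Rows 0-9 of the table above; NLTK and Spacy agree on all ten of these rows.
manual_tags = ["JJ", "NN", ",", "IN", "NN", "IN", "DT", "JJ", "NN", "CC"]
tagger_tags = ["JJ", "NNP", ",", "IN", "NN", "IN", "DT", "JJ", "NN", "CC"]

print(f"{agreement(manual_tags, tagger_tags):.0%}")  # 90%
```

Only 'December' differs in these ten rows, which gives nine matches out of ten.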

What is a particle as a part of speech? It is a word that doesn't fit into one of the traditional eight parts of speech (noun, verb, pronoun, adjective, preposition, adverb, interjection, or conjunction). See the link below.

So to conclude, I'd be more comfortable relying on the taggers to do the work for me :)

https://www.gingersoftware.com/content/particle-grammar/
