## "Parts of Speech"
#### _(la grammaire, qui sait régenter jusqu'aux rois - Molière)_

#### MSDS 7337 - Natural Language Processing - Homework 04  
##### Patrick McDevitt  
##### 14-Oct-2018  

***     

For this project we are requested to :

1. Run one of the part-of-speech (POS) taggers available in Python.
    1. Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
    2. Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly.
        1. Show the input and output.
        2. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.
2. Run a different POS tagger in Python. Process the same two sentences from question 1.
    1. Does it produce the same or different output ?
    2. Explain any differences as best you can.
3. In a news article from this week’s news, find a random sentence of at least 10 words.
    1. Looking at the Penn tag set, manually POS tag the sentence yourself.
    2. Now run the same sentences through both taggers that you implemented for questions 1 and 2.
        1. Did either of the taggers produce the same results as you had created manually ?
    3. Explain any differences between the two taggers and your manual tagging as much as you can.

In [7]:
#
# ... file : pos_tagger.py
#
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# ...
# ... msds 7337 NLP
# ... homework 04
# ... part of speech tagging
# ... pmcdevitt@smu.edu
# ... 10-oct-2018
# ...
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# ... load packages
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import os
import re
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.probability import ConditionalFreqDist

In [8]:
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# ... some directory and file name definitions
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

home_dir = "/home/mcdevitt/_ds/_smu/_src/nlp/homework_04/"

os.chdir(home_dir)

penn_pos = pd.read_csv("penn_treebank.csv")

In [9]:
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# ... import some packages
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

import pandas as pd
import nltk
import pattern
from nltk.corpus import PlaintextCorpusReader
from nltk.probability import ConditionalFreqDist
from pattern.en import tag

In [10]:
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# ... function to call POS taggers (nltk and pattern)
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

def pos_tagger (phrase, tagger) : 

# ... look-up table of penn treebank tags and their descriptions

    penn_pos = pd.read_csv("penn_treebank.csv")

    if tagger == 'nltk' :

# ... nltk pos tagger

        tokens = nltk.word_tokenize(phrase)
        tagged = nltk.pos_tag(tokens)

    elif tagger == 'pattern' :

# ... pattern pos tagger

        tagged = tag(phrase)

    else :

# ... unknown tagger name : fail clearly instead of hitting an undefined df_dscr at the return

        raise ValueError("tagger must be 'nltk' or 'pattern', got : " + repr(tagger))

# ... attach the penn treebank description to each (word, pos) pair

    df_tag = pd.DataFrame(tagged, columns = ['word', 'pos'])
    df_dscr = pd.merge(df_tag, penn_pos, on = 'pos', how = 'left')
    df_dscr.columns = ['word', 'pos', 'descr']

    return (df_dscr)

In [11]:
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# ... define some sentences
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

long_sentence = """The power of numbers is never more evident 
than when we use them to speculate on the time of our dying.""".replace('\n', '')

short_sentence = "I was cured all right."

news_sentence = """His Republican opponent, the incumbent Bruce Rauner, 
has made much of the toilet issue in the final months of the gubernatorial campaign, 
running an ad calling Pritzker the porcelain prince of tax avoidance.""".replace('\n', '')
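
As a quick sanity check on the sentence lengths quoted in this report (21, 5, and 33 words), the sketch below counts word tokens with a simple regular expression. The helper name `count_words` is hypothetical, and the regex is a simplification that ignores punctuation; it is not the tokenization used by either tagger.

```python
import re

def count_words(sentence):
    """Count alphabetic word tokens, ignoring punctuation (a simplification)."""
    return len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence))

# The three sentences used in this report, written on one logical line each.
long_sentence = ("The power of numbers is never more evident "
                 "than when we use them to speculate on the time of our dying.")
short_sentence = "I was cured all right."
news_sentence = ("His Republican opponent, the incumbent Bruce Rauner, "
                 "has made much of the toilet issue in the final months "
                 "of the gubernatorial campaign, running an ad calling "
                 "Pritzker the porcelain prince of tax avoidance.")

for name, sentence in [("long", long_sentence),
                       ("short", short_sentence),
                       ("news", news_sentence)]:
    print(name, count_words(sentence))
```

Punctuation tokens are counted separately by the taggers, which is why the tagged tables have more rows than these word counts.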

***  

### 1a - POS tagger in python - long sentence

*** 

- Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly.
- Show the input and output.

__I agree with the tagging returned by the NLTK tagger for this sentence.__
    

In [13]:
# ... nltk pos tagger for long sentence

print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')
print ('\t1.a - Long sentence : \n\n', long_sentence)
print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')

df_long_nltk = pos_tagger(long_sentence, "nltk")

print ('\tNLTK pos tagged phrase : \n')

df_long_nltk


 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	1.a - Long sentence : 

 The power of numbers is never more evident than when we use them to speculate on the time of our dying.

 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	NLTK pos tagged phrase : 



| | word | pos | descr |
|-----|-----|-----|-----|
| 0 | The | DT | Determiner |
| 1 | power | NN | Noun, singular or mass |
| 2 | of | IN | Preposition or subordinating conjunction |
| 3 | numbers | NNS | Noun, plural |
| 4 | is | VBZ | Verb, 3rd person singular present |
| 5 | never | RB | Adverb |
| 6 | more | RBR | Adverb, comparative |
| 7 | evident | JJ | Adjective |
| 8 | than | IN | Preposition or subordinating conjunction |
| 9 | when | WRB | Wh-adverb |



### 1b - POS tagger in python - short sentence

*** 

Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly.
    
- Show the input and output.
- Explain your conjecture as to why the tagger might have been less than perfect with this sentence.

For this case, we chose the sentence : "I was cured all right."

The results of the NLTK POS tagger are shown in the table below.

We can observe that NLTK tags the two words of the adverb _all right_ separately, as :
* all 	DT 	(Determiner)
* right NN 	(Noun, singular or mass)

We can appreciate the difficulty in identifying the adverbial phrase : it is built from two words that, taken individually, are more often used the way NLTK tagged them, i.e., _all_ as a determiner and _right_, which has many senses (noun, adjective, adverb, verb) from which to make a tagging choice.

Interestingly, WordNet lists only 3 senses of _all_, and the 3rd of these is as an adverb, in phrases like 'all right'. So, even though the adverbial reading is a well-documented option, NLTK did not find the correct tags in this case.
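
One way to see why _all_ and _right_ can end up as DT and NN is to think of a most-frequent-tag baseline, which ignores context entirely. The sketch below only illustrates that idea : the counts are invented, the names `train_most_frequent_tag` and `tag_words` are hypothetical, and NLTK's default `pos_tag` is a trained perceptron model, not this baseline.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) observations, invented for illustration only.
toy_tagged_corpus = [
    ("all", "DT"), ("all", "DT"), ("all", "DT"), ("all", "RB"),
    ("right", "NN"), ("right", "NN"), ("right", "JJ"), ("right", "RB"),
    ("cured", "VBN"), ("was", "VBD"), ("i", "PRP"),
]

def train_most_frequent_tag(pairs):
    """Map each word to the tag it carries most often in the training pairs."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word independently of its neighbours; unknown words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]

model = train_most_frequent_tag(toy_tagged_corpus)
print(tag_words(["I", "was", "cured", "all", "right"], model))
```

Because each word is tagged in isolation, the adverbial use of _all right_ can never win here unless it is also the most frequent use of each word on its own.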

In [14]:
# ... nltk pos tagger for short sentence

print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')
print ('\t1.b - Short sentence : ', short_sentence)
print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')

df_short_nltk = pos_tagger(short_sentence, "nltk")

print ('\tNLTK pos tagged phrase : \n')

df_short_nltk


 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	1.b - Short sentence :  I was cured all right.

 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	NLTK pos tagged phrase : 



| | word | pos | descr |
|-----|-----|-----|-----|
| 0 | I | PRP | Personal pronoun |
| 1 | was | VBD | Verb, past tense |
| 2 | cured | VBN | Verb, past participle |
| 3 | all | DT | Determiner |
| 4 | right | NN | Noun, singular or mass |
| 5 | . | . | |


***

### 2a - part of speech tagger - pattern - long sentence

*** 
    
2. Run a different POS tagger in Python. Process the same two sentences from question 1.  
    a. Does it produce the same or different output?  
    b. Explain any differences as best you can.  


#### The output of the two taggers for the long sentence is shown in the table below.

The sentence is 21 words in length. The first 20 words in the sentence are tagged identically by the two taggers (nltk and pattern).  

The last word : _dying_ is tagged :
- NN  (Noun, singular or mass) by the nltk tagger, and
- VBG (Verb, gerund or present participle) by the pattern tagger

Given that the phrase containing _dying_ is _of our dying_, we can see that _dying_ is used here as a noun, since it follows the possessive _our_. Since the word is the _-ing_ form of the verb _to die_, we can appreciate that _verb, gerund_ is a natural default choice of tag, but that choice ignores the fact that the word is preceded by a possessive.
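
The reasoning above amounts to a contextual rule : an _-ing_ form that directly follows a possessive is more likely a noun than a verb. The sketch below expresses that as a single rule in the spirit of transformation-based tagging. The function name is hypothetical, the tags for the words before _dying_ are assumed for illustration, and this is not the actual logic of either library.

```python
def retag_gerund_after_possessive(tagged):
    """Hypothetical rule: change VBG to NN when the previous tag is PRP$."""
    result = []
    prev_tag = None
    for word, tag in tagged:
        if tag == "VBG" and prev_tag == "PRP$":
            tag = "NN"
        result.append((word, tag))
        prev_tag = tag
    return result

# End of the long sentence; only the VBG tag for "dying" is taken from the
# pattern output reported here, the other tags are assumed.
pattern_like = [("of", "IN"), ("our", "PRP$"), ("dying", "VBG"), (".", ".")]
print(retag_gerund_after_possessive(pattern_like))
```

A rule this blunt would also fire on true gerunds such as _our dying young_, so it should be read as an illustration of using context rather than as a fix.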


In [16]:
# ... pattern pos tagger for long sentence

print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')
print ('\t1.a - Long sentence : \n\n', long_sentence)
print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')

df_long_ptrn = pos_tagger(long_sentence, "pattern")

print ('\tNLTK & Pattern Parts of Speech : \n')

#df_long_ptrn

df_long = pd.DataFrame()

df_long = pd.concat([df_long_nltk, df_long_ptrn], axis = 1)
df_long.columns = ['word', 'NLTK POS', 'Descr NLTK', 'word', 'Pattern POS', 'Descr Pattern']


df_long


 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	1.a - Long sentence : 

 The power of numbers is never more evident than when we use them to speculate on the time of our dying.

 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	NLTK & Pattern Parts of Speech : 



| | word | NLTK POS | Descr NLTK | word | Pattern POS | Descr Pattern |
|-----|-----|-----|-----|-----|-----|-----|
| 0 | The | DT | Determiner | The | DT | Determiner |
| 1 | power | NN | Noun, singular or mass | power | NN | Noun, singular or mass |
| 2 | of | IN | Preposition or subordinating conjunction | of | IN | Preposition or subordinating conjunction |
| 3 | numbers | NNS | Noun, plural | numbers | NNS | Noun, plural |
| 4 | is | VBZ | Verb, 3rd person singular present | is | VBZ | Verb, 3rd person singular present |
| 5 | never | RB | Adverb | never | RB | Adverb |
| 6 | more | RBR | Adverb, comparative | more | RBR | Adverb, comparative |
| 7 | evident | JJ | Adjective | evident | JJ | Adjective |
| 8 | than | IN | Preposition or subordinating conjunction | than | IN | Preposition or subordinating conjunction |
| 9 | when | WRB | Wh-adverb | when | WRB | Wh-adverb |


***

### 2b - part of speech tagger - pattern - short sentence

*** 
    
2. Run a different POS tagger in Python. Process the same two sentences from question 1.  
    a. Does it produce the same or different output?  
    b. Explain any differences as best you can.  


#### The output of the two taggers for the short sentence is shown in the table below.

The sentence is 5 words in length. The first 4 words in the sentence are tagged identically by the two taggers (nltk and pattern).  

The last word : _right_ is tagged :
- NN  (Noun, singular or mass) by the nltk tagger, and
- RB  (Adverb) by the pattern tagger

In this case, the __pattern__ tagger correctly identifies _right_ as an adverb in this sentence.
As discussed above, WordNet assigns the adverb POS to one of the three senses of _all_, the one used in this context. So, we can assume that the __pattern__ tagger takes the preceding _all_ into account and recognizes that _right_ following _all_ can be used as an adverb.
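
The comparison between the two taggers can also be done programmatically rather than by eye. The sketch below is a minimal version of that check, using the tags reported for the short sentence; the helper name `tag_differences` is hypothetical, and it assumes both taggers tokenized the sentence identically.

```python
def tag_differences(tags_a, tags_b):
    """Return (index, word, tag_a, tag_b) wherever two aligned taggings disagree."""
    if len(tags_a) != len(tags_b):
        raise ValueError("taggings must cover the same number of tokens")
    diffs = []
    for i, ((word_a, tag_a), (word_b, tag_b)) in enumerate(zip(tags_a, tags_b)):
        if word_a != word_b:
            raise ValueError("token mismatch at position %d" % i)
        if tag_a != tag_b:
            diffs.append((i, word_a, tag_a, tag_b))
    return diffs

# Tags as reported in this section for "I was cured all right."
nltk_tags = [("I", "PRP"), ("was", "VBD"), ("cured", "VBN"),
             ("all", "DT"), ("right", "NN"), (".", ".")]
pattern_tags = [("I", "PRP"), ("was", "VBD"), ("cured", "VBN"),
                ("all", "DT"), ("right", "RB"), (".", ".")]

print(tag_differences(nltk_tags, pattern_tags))
```

The explicit token check matters because two taggers may split a sentence differently, in which case a position-by-position comparison would be misleading.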


In [17]:
# ... pattern pos tagger for short sentence

print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')
print ('\t1.a - short sentence : ', short_sentence)
print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')

df_short_ptrn = pos_tagger(short_sentence, "pattern")

print ('\tNLTK & Pattern Parts of Speech : \n')

#df_short_ptrn

df_short = pd.DataFrame()

df_short = pd.concat([df_short_nltk, df_short_ptrn], axis = 1)
df_short.columns = ['word', 'NLTK POS', 'Descr NLTK', 'word', 'Pattern POS', 'Descr Pattern']


df_short


 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	1.a - short sentence :  I was cured all right.

 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	NLTK & Pattern Parts of Speech : 



| | word | NLTK POS | Descr NLTK | word | Pattern POS | Descr Pattern |
|-----|-----|-----|-----|-----|-----|-----|
| 0 | I | PRP | Personal pronoun | I | PRP | Personal pronoun |
| 1 | was | VBD | Verb, past tense | was | VBD | Verb, past tense |
| 2 | cured | VBN | Verb, past participle | cured | VBN | Verb, past participle |
| 3 | all | DT | Determiner | all | DT | Determiner |
| 4 | right | NN | Noun, singular or mass | right | RB | Adverb |
| 5 | . | . | | . | . | |


### 3 - News article sentence

In a news article from this week’s news, find a random sentence of at least 10 words.  

    a. Looking at the Penn tag set, manually POS tag the sentence yourself.
    b. Now run the same sentences through both taggers that you implemented for questions 1 and 2.
    
__Did either of the taggers produce the same results as you had created manually ?__

The sentence from the news article is 33 words in length. For 5 of the 33 words, the hand tagging and at least one of the taggers disagree about the part of speech.

The results of the hand-tagged, NLTK-tagged, and Pattern-tagged responses are shown in the table below.

Differences are observed for the following words :  

|word | Hand tag | NLTK | Pattern |  
|-----|-------|------|-----|  
|incumbent | NN | JJ | JJ |
|much | RB | RB | JJ |
|toilet | JJ | NN | NN |
|porcelain | JJ | NN | NN |
|tax | JJ | NN | NN |

    c. Explain any differences between the two taggers and your manual tagging as much as you can.
    
Four of the 5 disagreements are noun / adjective confusions. In the first case, NLTK and Pattern tagged _incumbent_ as an adjective, when I read _incumbent_ as a noun in this sentence. We can imagine the source of the confusion, as _incumbent_ is often used as an adjective, as in the phrase _incumbent governor_, but in this case _incumbent_ is followed by a proper noun. In the other noun / adjective cases, the confusion runs in the other direction : words I tagged as adjectives were tagged as nouns. The words _toilet_, _porcelain_, and _tax_ are typically nouns, but in this case they are used to modify other nouns : _toilet issue_, _porcelain prince_, and _tax avoidance_. Note that the Penn Treebank tagging guidelines tag nouns used as modifiers as NN rather than JJ, so on these three words both taggers follow the tag set's convention, while my hand tags reflect the words' function in the sentence.

In the remaining case, the Pattern tagger identified _much_ as an adjective. I cannot offer a firm explanation for this. Five of the seven senses of _much_ in WordNet are adverbs, so adverb would be a good default choice. If _much_ were an adjective, one would expect it to be directly followed by a noun; in this case, it directly follows a verb and is followed by a prepositional phrase (_of the toilet issue_). The more obvious choice for _much_ in the setting of this sentence seems to be adverb.
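
To make the pattern in these disagreements easier to see, the sketch below tallies each (hand tag, tagger tag) pair from the disagreement table above. The helper name `confusion_counts` is hypothetical; the rows are copied from that table, and the agreement figures assume, as stated above, that the remaining 28 of the 33 words are tagged identically by all three.

```python
from collections import Counter

# (word, hand tag, NLTK tag, Pattern tag), copied from the disagreement table.
disagreements = [
    ("incumbent", "NN", "JJ", "JJ"),
    ("much", "RB", "RB", "JJ"),
    ("toilet", "JJ", "NN", "NN"),
    ("porcelain", "JJ", "NN", "NN"),
    ("tax", "JJ", "NN", "NN"),
]

def confusion_counts(rows, tagger_column):
    """Count (hand tag, tagger tag) pairs where the tagger differs from the hand tag."""
    counts = Counter()
    for row in rows:
        hand, predicted = row[1], row[tagger_column]
        if hand != predicted:
            counts[(hand, predicted)] += 1
    return counts

total_words = 33
nltk_confusions = confusion_counts(disagreements, 2)
pattern_confusions = confusion_counts(disagreements, 3)

for name, confusions in [("NLTK", nltk_confusions), ("Pattern", pattern_confusions)]:
    agreed = total_words - sum(confusions.values())
    print(name, dict(confusions), "agreement with hand tags :", agreed, "/", total_words)
```

Counted this way, NLTK differs from the hand tags on 4 words and Pattern on 5, with the JJ / NN pair accounting for 3 of the differences for each tagger.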

In [19]:
# ... nltk and pattern pos taggers for news sentence, compared with hand tags


df_hand_tag = pd.read_csv("news_sentence_tagged.csv")

print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')
print ('\t3 - News article sentence : \n\n', news_sentence)
print ('\n -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-\n')

df_news_nltk = pos_tagger(news_sentence, "nltk")
df_news_ptrn = pos_tagger(news_sentence, "pattern")

print ('\tNLTK & Pattern Parts of Speech : \n')

#df_news_ptrn

df_news = pd.DataFrame()

df_news = pd.concat([df_hand_tag, df_news_nltk, df_news_ptrn], axis = 1)
df_news.columns = ['word', 'Hand_Tagged_Descr', 'Hand_Tagged_POS', 'word-nltk', 'NLTK_POS', 
                   'nltk_descr', 'word-ptrn', 'Pattern_POS', 'pattern_descr']

df_news.drop(['Hand_Tagged_Descr', 'word-nltk', 'nltk_descr', 'word-ptrn', 'pattern_descr'],
             axis = 1, inplace = True)
df_news


 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	3 - News article sentence : 

 His Republican opponent, the incumbent Bruce Rauner, has made much of the toilet issue in the final months of the gubernatorial campaign, running an ad calling Pritzker the porcelain prince of tax avoidance.

 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

	NLTK & Pattern Parts of Speech : 



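| | word | Hand_Tagged_POS | NLTK_POS | Pattern_POS |
|-----|-----|-----|-----|-----|
| 0 | His | PRP\$ | PRP\$ | PRP\$ |
| 1 | Republican | JJ | JJ | JJ |
| 2 | opponent | NN | NN | NN |
| 3 | , | | , | , |
| 4 | the | DT | DT | DT |
| 5 | incumbent | NN | JJ | JJ |
| 6 | Bruce | NNP | NNP | NNP |
| 7 | Rauner | NNP | NNP | NNP |
| 8 | , | | , | , |
| 9 | has | VBZ | VBZ | VBZ |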


### End_of_file