# Part of Speech tagging
- Taggers:
    - NLTK
    - Spacy
    - Build Your Own Tagger
    - n-gram tagger

In [2]:
import nltk
import numpy as np
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import pandas as pd
import re
import matplotlib.pyplot as plt
from pprint import pprint

pd.options.display.max_colwidth = 200
%matplotlib inline


## Question 1:
### Run one of the part-of-speech (POS) taggers available in Python. 

1.	Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
2.	Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. 
    - Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.


As a geologist, I decided to look for the required long and short sentences in the Wikipedia description of sedimentary rocks.

In [3]:
import wikipedia as wk

In [4]:
#Load the description of sedimentary rocks from Wikipedia
sed_rk = wk.page("Sedimentary rock")
sed_rk_txt = sed_rk.content

In [5]:
#Let us see the text
sed_rk_txt

'Sedimentary rocks are types of rock that are formed by the accumulation or deposition of mineral or organic particles at the Earth\'s surface, followed by cementation. Sedimentation is the collective name for processes that cause these particles to settle in place. The particles that form a sedimentary rock are called sediment, and may be composed of geological detritus (minerals) or biological detritus (organic matter). The geological detritus originated from weathering and erosion of existing rocks, or from the solidification of molten lava blobs erupted by volcanoes. The geological detritus is transported to the place of deposition by water, wind, ice or mass movement, which are called agents of denudation. Biological detritus was formed by bodies and parts (mainly shells) of dead aquatic organisms, as well as their fecal mass, suspended in water and slowly piling up on the floor of water bodies (marine snow). Sedimentation may also occur as dissolved minerals precipitate from wate

##### Text preprocessing and normalization
- The text contains \n newline characters as well as other non-alphanumeric markers (e.g. ==) that are irrelevant to the actual text.
    - We need to remove these.

In [6]:
#remove newline/tab characters and '==' heading markers
def remove_irrelevant_characters(text):
    text = re.sub(r'[\n\r\t]', ' ', text) #newlines, carriage returns, tabs
    text = re.sub(r'==', ' ', text) #wiki heading markers
    text = re.sub(r' +', ' ', text) #collapse repeated spaces
    text = text.strip()
    return text

sed_3 = remove_irrelevant_characters(sed_rk_txt)
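
Wikipedia section headings come as runs of two or more `=` (for example `== Origin ==` or `=== Clastic ===`), so replacing only `==` can leave a stray `=` behind. Below is a minimal sketch of a variant that removes whole runs. `clean_wiki_text` and the sample string are hypothetical, and this is an alternative to the function above, not a description of it:

```python
import re

def clean_wiki_text(text):
    """Replace whitespace control characters and '=' heading runs with spaces."""
    text = re.sub(r"[\n\r\t]", " ", text)  # newlines, carriage returns, tabs
    text = re.sub(r"={2,}", " ", text)     # heading markers: '==', '===', ...
    text = re.sub(r" +", " ", text)        # collapse repeated spaces
    return text.strip()

# Hypothetical sample in the style of the Wikipedia page content
sample = "Intro text.\n\n\n== Classification ==\n=== Clastic ===\nClastic rocks form from fragments."
cleaned = clean_wiki_text(sample)
print(cleaned)
```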


In [7]:
sed_3

'Sedimentary rocks are types of rock that are formed by the accumulation or deposition of mineral or organic particles at the Earth\'s surface, followed by cementation. Sedimentation is the collective name for processes that cause these particles to settle in place. The particles that form a sedimentary rock are called sediment, and may be composed of geological detritus (minerals) or biological detritus (organic matter). The geological detritus originated from weathering and erosion of existing rocks, or from the solidification of molten lava blobs erupted by volcanoes. The geological detritus is transported to the place of deposition by water, wind, ice or mass movement, which are called agents of denudation. Biological detritus was formed by bodies and parts (mainly shells) of dead aquatic organisms, as well as their fecal mass, suspended in water and slowly piling up on the floor of water bodies (marine snow). Sedimentation may also occur as dissolved minerals precipitate from wate

In [9]:
#Sentence tokenize to extract long and short sentences
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sed_rk1 = punkt_st.tokenize(sed_3)
len(sed_rk1)

320

In [10]:
#Sanity check: Let us see the first 3 sentences
sed_rk1[:3]

["Sedimentary rocks are types of rock that are formed by the accumulation or deposition of mineral or organic particles at the Earth's surface, followed by cementation.",
 'Sedimentation is the collective name for processes that cause these particles to settle in place.',
 'The particles that form a sedimentary rock are called sediment, and may be composed of geological detritus (minerals) or biological detritus (organic matter).']

#### 1A
Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.

In [47]:
#Separate the 'long' (>10 words) sentences from the short ones.
#Note: this tokenizes the raw text (sed_rk_txt), not the cleaned text (sed_3).

sed_rk_long_sent = [] #list of long sentences
sed_rk_short = [] #list of short sentences
for sent in punkt_st.tokenize(sed_rk_txt):
    if len(sent.split()) > 10: #condition for long sentences
        sed_rk_long_sent.append(sent)
    else: #everything else counts as short
        sed_rk_short.append(sent)
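
To look for the longest and shortest candidates rather than taking the first match, the sentences can be ranked by word count. This is a minimal sketch, assuming whitespace-separated words as in the loop above. `rank_by_length` and the sample list are hypothetical and stand in for the lists built in the notebook:

```python
def rank_by_length(sentences, longest_first=True):
    """Return (word_count, sentence) pairs sorted by word count."""
    counted = [(len(s.split()), s) for s in sentences]
    return sorted(counted, key=lambda pair: pair[0], reverse=longest_first)

# Hypothetical sample; in the notebook this would be the tokenized Wikipedia text.
sample = [
    "Larger, well-preserved fossils are relatively rare.",
    "Sedimentation is the collective name for processes that cause these particles to settle in place.",
    "Rocks weather.",
]

longest = rank_by_length(sample)[0]
shortest = rank_by_length(sample, longest_first=False)[0]
print(longest)
print(shortest)
```

Each candidate still has to be checked by eye, since sentence length says nothing about whether the tags are correct.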

In [14]:
len(sed_rk_long_sent) #how many long sentences do we have

290

In [15]:
sed_rk_long_sent[0] #see the first sentence as example

"Sedimentary rocks are types of rock that are formed by the accumulation or deposition of mineral or organic particles at the Earth's surface, followed by cementation."

### Part of speech tagger 1

#### Using the nltk tagger
- Part of Speech tagging algorithms take in words and tag them accordingly. 
    - Therefore we need to word-tokenize the text before feeding it.

In [16]:
#object function to tokenize the text

default_wt = nltk.word_tokenize

In [18]:
#Select the first long sentence
#Word tokenize the sentence and POS it
nltk_pos_tagged_long = nltk.pos_tag(default_wt(sed_rk_long_sent[0])) #tag the long sentence
sed_long_tag = pd.DataFrame(nltk_pos_tagged_long, columns=['Word', 'POS tag']) #Display it
sed_long_tag

Unnamed: 0,Word,POS tag
0,Sedimentary,JJ
1,rocks,NNS
2,are,VBP
3,types,NNS
4,of,IN
5,rock,NN
6,that,WDT
7,are,VBP
8,formed,VBN
9,by,IN


#### 1B
Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output.

In [21]:
#Select a short sentence (index 8 in the list of short sentences)
short_sent = sed_rk_short[8]

In [22]:
#nltk_pos_tagged_short = nltk.pos_tag(default_wt(sed_rk_short[8]))
nltk_pos_tagged_short = nltk.pos_tag(default_wt(short_sent))
sed_short_tag = pd.DataFrame(nltk_pos_tagged_short, columns=['Word', 'POS tag']).T
sed_short_tag 

Unnamed: 0,0,1,2,3,4,5,6,7
Word,Larger,",",well-preserved,fossils,are,relatively,rare,.
POS tag,NNP,",",JJ,NNS,VBP,RB,JJ,.


For this sentence:
>'Larger, well-preserved fossils are relatively rare.'

- 'Larger' is a comparative adjective (JJR), but the nltk tagger labeled it as a proper noun (NNP).

- I think this is because 'Larger' begins the sentence and is therefore capitalized, which makes it look like a proper noun.
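
One way to probe this conjecture is to tag the sentence again with the first word lowercased and see which tags change. The sketch below assumes a tagger exposed as a function from a token list to `(word, tag)` pairs, such as `nltk.pos_tag`. `compare_case_variants` and `toy_tagger` are hypothetical names, and the toy tagger only imitates a capitalization bias so that the example runs without NLTK data:

```python
def compare_case_variants(tokens, tag_fn):
    """Tag tokens as given and with the first token lowercased.

    Returns (position, original_pair, lowercased_pair) for every
    position where the two taggings disagree on the tag.
    """
    lowered = [tokens[0].lower()] + list(tokens[1:])
    original_tags = tag_fn(list(tokens))
    lowered_tags = tag_fn(lowered)
    return [
        (i, orig, low)
        for i, (orig, low) in enumerate(zip(original_tags, lowered_tags))
        if orig[1] != low[1]
    ]

def toy_tagger(tokens):
    """Stand-in tagger (an assumption, not NLTK): capitalized words become NNP."""
    return [(tok, "NNP" if tok[:1].isupper() else "JJR") for tok in tokens]

tokens = ["Larger", ",", "well-preserved", "fossils"]
changes = compare_case_variants(tokens, toy_tagger)
print(changes)
```

In the notebook, passing `nltk.pos_tag` as `tag_fn` together with the word-tokenized short sentence would show whether 'Larger' keeps the NNP tag once it is lowercased.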


## Question 2
- Run a different POS tagger in Python. Process the same two sentences from question 1.
    - Does it produce the same or different output?
    - Explain any differences as best you can.

### Other part of speech taggers (BYOT, n-gram, spaCy)

#### B.Y.O.T: Build your own tagger

*Reference: Sarkar, Dipanjan. Text Analytics with Python (2019).*

- We leverage some classes provided by NLTK. 
- To evaluate the performance of our taggers, we use some test data from the treebank corpus in NLTK. 
- We will also be using some training data for training some of our taggers. 
- To start with, we will first get the necessary data for training and evaluating the taggers by reading in the tagged treebank corpus.
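
Evaluating a tagger on held-out data comes down to token-level accuracy: the share of tokens whose predicted tag equals the gold tag. The sketch below shows that computation, assuming a tagger given as a function from a word list to `(word, tag)` pairs. `token_accuracy`, `always_nn`, and the two gold sentences are hypothetical and not part of NLTK:

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0

def always_nn(words):
    """Baseline that tags every token as a singular noun."""
    return [(w, "NN") for w in words]

# Hypothetical gold data in the same (word, tag) format as the treebank sentences
gold = [
    [("the", "DT"), ("board", "NN")],
    [("a", "DT"), ("director", "NN"), (".", ".")],
]
score = token_accuracy(always_nn, gold)
print(score)
```

A trivial baseline like `always_nn` is a useful reference point when reading the accuracy of a trained tagger.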

In [55]:
from nltk.corpus import treebank
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
print(train_data[0])

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]


In [37]:
from nltk.classify import NaiveBayesClassifier, MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

In [41]:
#Fit the Naive Bayes classifier to the treebank pos tagged corpus.
nbt = ClassifierBasedPOSTagger(train=train_data, classifier_builder=NaiveBayesClassifier.train)


In [42]:
# evaluate tagger on test data and unknown (our short sentence)
print(nbt.evaluate(test_data))
print(nbt.tag(default_wt(sed_rk_short[8])))

0.9306806079969019
[('Larger', 'NNP'), (',', ','), ('well-preserved', 'VBD'), ('fossils', 'NNS'), ('are', 'VBP'), ('relatively', 'RB'), ('rare', 'JJ'), ('.', '.')]


In [45]:
#Run the short sentence through the designed tagger
nbt_tag_short = nbt.tag(default_wt(short_sent))
sed_tag_short_naiveB = pd.DataFrame(nbt_tag_short, columns=['Word', 'POS tag_NaiveB']).T
sed_tag_short_naiveB

Unnamed: 0,0,1,2,3,4,5,6,7
Word,Larger,",",well-preserved,fossils,are,relatively,rare,.
POS tag_NaiveB,NNP,",",VBD,NNS,VBP,RB,JJ,.


In [46]:
##Run the long sentence through the designed tagger
nbt_tag = nbt.tag(default_wt(sed_rk_long_sent[0]))
sed_tag_long_naiveB = pd.DataFrame(nbt_tag, columns=['Word', 'POS tag_NaiveB']).T
sed_tag_long_naiveB

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
Word,Sedimentary,rocks,are,types,of,rock,that,are,formed,by,...,at,the,Earth,'s,surface,",",followed,by,cementation,.
POS tag_NaiveB,NNP,NNS,VBP,NNS,IN,NN,WDT,VBP,VBN,IN,...,IN,DT,JJ,POS,NN,",",VBN,IN,CD,.


##### N gram taggers

In [49]:
# regex tagger
from nltk.tag import RegexpTagger

In [50]:
# define regex tag patterns (tried in order; the first match wins)
patterns = [
    (r'.*ing$', 'VBG'), # gerunds
    (r'.*ed$', 'VBD'), # simple past
    (r'.*es$', 'VBZ'), # 3rd singular present
    (r'.*ould$', 'MD'), # modals
    (r'.*\'s$', 'NN$'), # possessive nouns (NN$ is a Brown corpus tag; Penn Treebank tags 's as POS)
    (r'.*s$', 'NNS'), # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers (decimal point escaped)
    (r'.*', 'NN') # nouns (default)
]

rt = RegexpTagger(patterns)

In [51]:
## N gram taggers
from nltk.tag import UnigramTagger
ut = UnigramTagger(train_data)

In [52]:
# testing performance of unigram tagger on both test and long sentence
print(ut.evaluate(test_data))  #86% accuracy on the test data
ngram_tag = ut.tag(default_wt(sed_rk_long_sent[0])) #Performance on 'unknown' data (my 'long' sentence)
sed_tag_ngram = pd.DataFrame(ngram_tag, columns=['Word', 'POS tag_Ngram']).T
sed_tag_ngram

0.8607803272340013


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
Word,Sedimentary,rocks,are,types,of,rock,that,are,formed,by,...,at,the,Earth,'s,surface,",",followed,by,cementation,.
POS tag_Ngram,,,VBP,NNS,IN,NN,IN,VBP,VBN,IN,...,IN,DT,,POS,,",",VBD,IN,,.


In [54]:
# testing performance of the unigram tagger on the test data and the short sentence
print(ut.evaluate(test_data))
ngram_tag = ut.tag(default_wt(sed_rk_short[8]))
sed_tag_ngram = pd.DataFrame(ngram_tag, columns=['Word', 'POS tag_Ngram']).T
sed_tag_ngram

0.8607803272340013


Unnamed: 0,0,1,2,3,4,5,6,7
Word,Larger,",",well-preserved,fossils,are,relatively,rare,.
POS tag_Ngram,,",",,,VBP,RB,JJ,.


#### Spacy tagger

In [33]:
import spacy

In [34]:
import en_core_web_sm



In [35]:
nlp = spacy.load('en_core_web_sm') #the tagger, parser, and NER components are enabled by default

In [36]:
sentence = "US unveils world's most powerful supercomputer, beats China."

In [37]:
sentence_nlp = nlp(sed_rk_long_sent[0])

In [38]:
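#spaCy's tag_ attribute also uses the Penn Treebank notation for POS tagging.
#I did not even need to word-tokenize the sentence; spaCy tokenizes it itself.
#Note: word.pos returns an integer ID (e.g. 84); word.pos_ gives the readable label (e.g. ADJ).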
spacy_pos_tagged = [(word, word.tag_, word.pos) for word in sentence_nlp]
pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS tag.spacy', 'Tag type']).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
Word,Sedimentary,rocks,are,types,of,rock,that,are,formed,by,...,at,the,Earth,'s,surface,",",followed,by,cementation,.
POS tag.spacy,JJ,NNS,VBP,NNS,IN,NN,WDT,VBP,VBN,IN,...,IN,DT,NNP,POS,NN,",",VBN,IN,NN,.
Tag type,84,92,87,92,85,92,95,87,100,85,...,85,90,96,94,92,97,100,85,92,97


### 2A,B: Comparison of tags
All the taggers use the Penn Treebank tag set.
- nltk and spaCy produced the same POS tags for the long sentence.
- 'Sedimentary' is labeled as JJ by the nltk and spaCy taggers but as NNP by Naive Bayes.
    - The nltk/spaCy tag is more correct: 'sedimentary' qualifies the rock, so it is an adjective (JJ).
- 'Earth' is labeled as NNP by nltk and spaCy but as JJ by Naive Bayes.
    - NNP is more correct because 'Earth' is a proper noun here (followed by the possessive marker 's), not an adjective.

- Compared to nltk and Naive Bayes, the n-gram (unigram) tagger produced the worst POS tags.
    - It left several words untagged even in the 6-word short sentence ('Larger', 'well-preserved', and 'fossils'), because a unigram tagger without a backoff has no tag for words it did not see in training.

- I think the errors in the tags produced by BYOT and the n-gram tagger stem from the limited corpus I trained them on (only 3,500 tagged sentences from the treebank sample).
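
The blank tags in the unigram output are words the tagger never saw during training. Without a fallback it has nothing to return for them. The sketch below is a from-scratch illustration of a unigram tagger with an optional default tag, not NLTK's implementation. `train_unigram`, `tag_unigram`, and the tiny training set are hypothetical:

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, words, default=None):
    """Look up each word; unseen words get `default` (None mimics no backoff)."""
    return [(w, model.get(w, default)) for w in words]

# Hypothetical training data in the treebank (word, tag) format
train = [
    [("fossils", "NNS"), ("are", "VBP"), ("rare", "JJ"), (".", ".")],
    [("rocks", "NNS"), ("are", "VBP"), ("old", "JJ"), (".", ".")],
]
model = train_unigram(train)
sentence = ["Larger", "fossils", "are", "rare", "."]

no_backoff = tag_unigram(model, sentence)
with_default = tag_unigram(model, sentence, default="NN")
print(no_backoff)
print(with_default)
```

In NLTK the same idea is available through the `backoff` argument of the sequential taggers, for example `UnigramTagger(train_data, backoff=rt)` with the regex tagger defined earlier.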


## Question 3
- In a news article from this week’s news, find a random sentence of at least 10 words.
    - a.	Looking at the Penn tag set, manually POS tag the sentence yourself.
    - b.	Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
    - c.	Explain any differences between the two taggers and your manual tagging as much as you can.
    
Source of news sentence:  

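<a href="https://www.reuters.com/business/biden-supply-chain-strike-force-target-china-trade-2021-06-08/">https://www.reuters.com/business/biden-supply-chain-strike-force-target-china-trade-2021-06-08/</a>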

'Officials also said the Department of Commerce is considering initiating a Section 232 investigation into the national security impact of neodymium magnet imports used in motors and other industrial applications, which the United States largely obtains from China.'

In [154]:
news = 'Officials also said the Department of Commerce is considering initiating a Section 232 investigation into the national security impact of neodymium magnet imports used in motors and other industrial applications, which the United States largely obtains from China.'

In [155]:
#news_nlp = nlp(news)
#spacy_pos_tagged1 = [(word, word.tag_, word.pos) for word in news_nlp]
#spacy_pos_tagged= [(word, word.tag_) for word in news_nlp]
#news_pos_tagged_df = pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS tag.spacy'])
#news_pos_tagged_df

#news_pos_tagged_df.to_csv('news_pos_tagged_df.csv', index = False, header=True)

## 3A
#### Manually tag the news text.

Manual tagging:
(Officials, NNS)
(also, RB) 
(said, VBD)
(the, DT)
(Department, NNP)
(of, IN)
(Commerce, NNP)
(is, VBZ)
(considering, VBG)
(initiating, VBG)
(a, DT)
(Section, NN)
(232, JJ)
(investigation, NN)
(into, IN)
(the, DT)
(national, JJ)
(security, JJ)
(impact, NN)
(of, IN)
(neodymium, JJ)
(magnet, JJ)
(imports, NNS)
(used, VBN)
(in, IN)
(motors, NNS)
(and, CC)
(other, JJ)
(industrial, JJ)
(applications, NNS)
(,)
(which, WDT)
(the, DT)
(United, NNP)
(States, NNP)
(largely, RB)
(obtains, VBZ)
(from, IN)
(China, NNP)
(.)

In [156]:
manual_tag = [('Officials', 'NNS'),
('also', 'RB'),
('said', 'VBD'),
('the', 'DT'),
('Department', 'NNP'),
('of', 'IN'),
('Commerce', 'NNP'),
('is','VBZ'),
('considering', 'VBG'),
('initiating', 'VBG'),
('a', 'DT'),
('Section', 'NN'),
('232', 'JJ'),
('investigation', 'NN'),
('into', 'IN'),
('the', 'DT'),
('national', 'JJ'),
('security', 'JJ'),
('impact', 'NN'),
('of', 'IN'),
('neodymium', 'JJ'),
('magnet', 'JJ'),
('imports', 'NNS'),
('used', 'VBN'),
('in', 'IN'),
('motors', 'NNS'),
('and', 'CC'),
('other', 'JJ'),
('industrial', 'JJ'),
('applications', 'NNS'),
(',', 'Punctuation'),
('which', 'WDT'),
('the', 'DT'),
('United', 'NNP'),
('States', 'NNP'),
('largely', 'RB'),
('obtains', 'VBZ'),
('from', 'IN'),
('China', 'NNP'),
('.', 'Punctuation')
]

## 3B
Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?

In [157]:
#Display the automatic POS tags next to the manual tags for comparison.
manual_tag_list = [tag for word, tag in manual_tag] #manual POS tags
news_tag_list = [word for word, tag in manual_tag] #words of the news sentence
nltk_tag_list = [tag for word, tag in nltk.pos_tag(default_wt(news))] #nltk POS tags
spacy_tag_list = [token.tag_ for token in nlp(news)] #spaCy POS tags

#Build a dataframe to compare the tags (all four lists must have the same length)
pos_tagged_df = pd.DataFrame({'Word.news': news_tag_list, 'Spacy.tag':spacy_tag_list, 'nltk.tag': nltk_tag_list, 'Manual.tag': manual_tag_list})
pos_tagged_df

Unnamed: 0,Word.news,Spacy.tag,nltk.tag,Manual.tag
0,Officials,NNS,NNS,NNS
1,also,RB,RB,RB
2,said,VBD,VBD,VBD
3,the,DT,DT,DT
4,Department,NNP,NNP,NNP
5,of,IN,IN,IN
6,Commerce,NNP,NNP,NNP
7,is,VBZ,VBZ,VBZ
8,considering,VBG,VBG,VBG
9,initiating,VBG,VBG,VBG


### 3B and C
#### Comment on the tags
1. Both taggers produced the same result
2. More than 90% of the words received the same part-of-speech tag from the taggers and from my manual tagging.
3. The tags that differ belong to words that can also be ambiguous for human annotators.
4. If 'Section 232' were treated as a single phrase, the taggers might have tagged '232' as a modifier of 'Section', as I did.
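
The agreement figure in point 2 can be checked directly from the tag columns of `pos_tagged_df`. This is a minimal sketch, assuming two tag lists aligned token by token. `agreement` and the automatic tags in the five-token sample are hypothetical:

```python
def agreement(tags_a, tags_b):
    """Return the share of matching tags and the positions that differ."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag lists must be aligned and of equal length")
    diffs = [i for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b]
    rate = 1 - len(diffs) / len(tags_a) if tags_a else 0.0
    return rate, diffs

# Tags for "a Section 232 investigation into"; the automatic tags are made up
manual = ["DT", "NN", "JJ", "NN", "IN"]
automatic = ["DT", "NN", "CD", "NN", "IN"]

rate, diffs = agreement(manual, automatic)
print(f"{rate:.0%} agreement, differences at positions {diffs}")
```

Listing the differing positions makes it easy to go back to the sentence and judge whether the tagger or the manual tag is the more defensible choice.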