### DS7337 NLP - HW 4
#### David Wei

# Homework4

<u>**HW 4:**</u>

[book link](http://www.nltk.org/book/)

1.	Run one of the part-of-speech (POS) taggers available in Python. 
    - a. Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
    - b. Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.

2.	Run a different POS tagger in Python. Process the same two sentences from question 1.
    - a. Does it produce the same or different output?
    - b. Explain any differences as best you can.

3.	In a news article from this week’s news, find a random sentence of at least 10 words.
    - a. Looking at the Penn tag set, manually POS tag the sentence yourself.
    - b. Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
    - c. Explain any differences between the two taggers and your manual tagging as much as you can.



In [None]:
#!pip install -U textblob
#!pip install requests
#!pip install bs4
# !pip install spacy
# !pip install -U varname

In [52]:
# python
import os
import numpy as np
import requests
import time
import re
from urllib import request
import urllib.request
import pandas as pd
# nltk
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import ToktokTokenizer
# nltk corpus
from nltk.corpus import brown
# POS taggers
from textblob import TextBlob
import spacy
# viz
from IPython.display import Image, HTML
import matplotlib.pyplot as plt
# sklearn
from sklearn.preprocessing import minmax_scale
# data mine
from bs4 import BeautifulSoup
from string import punctuation

In [57]:
##### Global Variables #####
toktok = ToktokTokenizer()

long_sentence = '''It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.'''
short_sentence = '''please start with me'''
short_sentence2 = '''start please with me'''

print("# of words in long_sentence: "+str(len(toktok.tokenize(long_sentence))))
print("# of words in short_sentence: "+str(len(toktok.tokenize(short_sentence))))

# storing tokenized variables in-memory
long_token = toktok.tokenize(long_sentence)
short_token = toktok.tokenize(short_sentence)
short_token2 = toktok.tokenize(short_sentence2)

# of words in long_sentence: 70
# of words in short_sentence: 4


## Using NLTK pos_tag

In [4]:
long_POS = nltk.pos_tag(long_token)
short_POS = nltk.pos_tag(short_token)
short_POS2 = nltk.pos_tag(short_token2)
# print(long_POS)
# print('='*80)
print(short_POS)
print('='*80)
print(short_POS2)

[('please', 'VB'), ('start', 'NN'), ('with', 'IN'), ('me', 'PRP')]
[('start', 'JJ'), ('please', 'NN'), ('with', 'IN'), ('me', 'PRP')]


For this example, I reused my earlier corpus example from Charles Dickens's classic novel, 'A Tale of Two Cities'. The opening sentence of the book is famous not only for its length but also for the poetic way it describes the duality of the political and social climate of the time. Running the NLTK pos_tagger, which NLTK recommends by default as its out-of-the-box tagger, seems to capture most of the word tokens accurately: each comma-separated "it was the ..." clause is correctly labeled PRP, VBD and DT.

For the shorter text, it took a few tries to find a sentence that the NLTK pos_tagger tags incorrectly. I could not find one myself, so I used one that a user was asking about online. In 'start please with me', the pos_tagger reads 'please' as a noun (NN), though in reality it could be used as an adverb (RB), verb (VB) or interjection (UH). My guess as to why the tagger labels 'please' as a noun is its position in the sentence: it comes directly after 'start', which is itself incorrectly labeled as an adjective (JJ), so 'please' becomes the "subject" of the sentence because the tagger does not recognize the re-ordering. In the original order, 'please start with me', 'please' is tagged as a verb (VB) but 'start' is tagged as a noun (NN).


Source: https://stackoverflow.com/questions/35737099/why-is-pos-tag-in-nltk-tagging-please-as-nn
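
The mis-tags discussed above can be flagged programmatically. The following is a minimal sketch, not part of the original notebook: the tagger output is copied from the cell above, while the `ACCEPTABLE` table and the `find_mistags` helper are hypothetical names introduced here. The acceptable tags for 'please' come from the discussion above, and treating 'start' as a base-form verb (VB) is my own assumption for these two imperative sentences.

```python
# NLTK output copied from the notebook cell above.
nltk_short = [('please', 'VB'), ('start', 'NN'), ('with', 'IN'), ('me', 'PRP')]
nltk_short2 = [('start', 'JJ'), ('please', 'NN'), ('with', 'IN'), ('me', 'PRP')]

# Assumed acceptable Penn tags per word for these two sentences (a judgment call).
ACCEPTABLE = {
    'please': {'RB', 'VB', 'UH'},
    'start': {'VB'},
    'with': {'IN'},
    'me': {'PRP'},
}

def find_mistags(tagged, acceptable):
    """Return the (word, tag) pairs whose tag is not in the acceptable set.

    Words missing from `acceptable` are never flagged.
    """
    return [(word, tag) for word, tag in tagged
            if tag not in acceptable.get(word.lower(), {tag})]

print(find_mistags(nltk_short, ACCEPTABLE))   # [('start', 'NN')]
print(find_mistags(nltk_short2, ACCEPTABLE))  # [('start', 'JJ'), ('please', 'NN')]
```

Allowing a set of tags per word, rather than a single gold tag, reflects that 'please' is genuinely ambiguous out of context.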

## Testing with TextBlob and spaCy

In [5]:
from textblob import TextBlob

long_POS_textbob = TextBlob(long_sentence)
short_POS_textbob = TextBlob(short_sentence)
short_POS2_textbob = TextBlob(short_sentence2)
# print(long_POS)
# print('='*80)
print(short_POS_textbob.tags)
print('='*80)
print(short_POS2_textbob.tags)

[('please', 'VB'), ('start', 'NN'), ('with', 'IN'), ('me', 'PRP')]
[('start', 'JJ'), ('please', 'NN'), ('with', 'IN'), ('me', 'PRP')]


In [51]:
# python -m spacy download en_core_web_sm
spacey = spacy.load("en_core_web_sm")

# function to extract (token, Penn tag) pairs from a spaCy Doc
def getPOS(sentence):
    # tokens are kept as spaCy Token objects, which is why they print without quotes
    return [(token, token.tag_) for token in sentence]

long_POS_spacey = spacey(long_sentence)
short_POS_spacey = spacey(short_sentence)
short_POS2_spacey = spacey(short_sentence2)
# print(long_POS)
# print('='*80)
print(getPOS(short_POS_spacey))
print('='*80)
print(getPOS(short_POS2_spacey))

[(please, 'UH'), (start, 'VB'), (with, 'IN'), (me, 'PRP')]
[(start, 'VB'), (please, 'UH'), (with, 'IN'), (me, 'PRP')]


After finding a short sentence that NLTK tags incorrectly, we test the same short sentences using two different POS taggers: TextBlob and spaCy. Based on the tagged results, TextBlob performs very similarly to NLTK pos_tag; in this simple test there is no difference between the two. The spaCy tagger, however, avoids the errors made by both prior taggers: it labels 'please' as an interjection (UH) and 'start' as a verb (VB) in both word orders. After further research into the differences between NLTK and spaCy, we found that many experts in the field consider spaCy the "industrial strength" Python NLP library, geared towards performance. From a POS tagging perspective, the source below reports that spaCy outperforms NLTK in word tokenization and POS tagging, but not in sentence tokenization. It attributes this to NLTK splitting the text at the sentence level, whereas spaCy constructs a syntactic tree for each sentence.

Source: https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2
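
The tagger-to-tagger comparison above can also be done token by token instead of by eye. This is a rough sketch under two assumptions: both taggers produced the same tokens in the same order, and the spaCy tokens have been converted to plain strings (the notebook keeps them as Token objects). `diff_tags` is a hypothetical helper name, and the tag lists are copied from the outputs above.

```python
def diff_tags(tags_a, tags_b):
    """List (word, tag_a, tag_b) for positions where two taggers disagree.

    Assumes both inputs use the same tokenization.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError('token counts differ; align the tokenizations first')
    diffs = []
    for (word_a, tag_a), (_, tag_b) in zip(tags_a, tags_b):
        if tag_a != tag_b:
            diffs.append((str(word_a), tag_a, tag_b))
    return diffs

# Outputs for 'please start with me', copied from the cells above.
nltk_out = [('please', 'VB'), ('start', 'NN'), ('with', 'IN'), ('me', 'PRP')]
spacy_out = [('please', 'UH'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')]

print(diff_tags(nltk_out, spacy_out))
# [('please', 'VB', 'UH'), ('start', 'NN', 'VB')]
```

Since the TextBlob output is identical to the NLTK output for both short sentences, `diff_tags` returns an empty list for that pair.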

## Comparing NLTK vs spaCy with latest news snippet

News article: https://finance.yahoo.com/news/state-crypto-congressional-hearings-ramping-133000991.html

In this next section, we take some of the latest news from the cryptocurrency market, run a sentence through both of our POS taggers (NLTK and spaCy), and compare the results. I chose crypto news because its vocabulary is relatively "modern", which lets us see how up to date the vocabularies used by NLTK and spaCy are.

For our testing purpose we will observe the sentence below:
> "It feels like Congress is starting to look a bit more closely at crypto"

To begin, we first POS tag the sentence manually:

> ('It', 'PRP'), ('feels', 'VBZ'), ('like', 'IN'), ('Congress', 'NNP'), ('is', 'VBZ'), ('starting', 'VBG'), ('to', 'TO'), ('look', 'VB'), ('a', 'DT'), ('bit', 'NN'), ('more', 'RBR'), ('closely', 'RB'), ('at', 'IN'), ('crypto', 'NN')



In [74]:
news_sentence_1 = '''It feels like Congress is starting to look a bit more closely at crypto'''
news_sentence_2 = '''Last week, U.S. Sen. Elizabeth Warren (D-Mass.), a former presidential contender and a longstanding advocate for consumer protections, hosted a Senate subcommittee hearing on cryptocurrencies.'''
news = [news_sentence_1, news_sentence_2]


nltk_pos = []
spacy_pos = []

def pos_compare(show_results=False):
    # validate the flag once, before doing any work
    if show_results not in (True, False, None):
        raise ValueError('wrong parameter provided')

    # start from empty lists so repeated calls do not accumulate duplicates
    nltk_pos.clear()
    spacy_pos.clear()

    for num, sentence in enumerate(news, start=1):
        # tokenize the sentence for NLTK (spaCy runs its own tokenizer)
        news_token = toktok.tokenize(sentence)

        # POS tag each sentence
        nltk_pos_result = nltk.pos_tag(news_token)
        spacey_pos_result = getPOS(spacey(sentence))

        nltk_pos.append(nltk_pos_result)
        spacy_pos.append(spacey_pos_result)

        # printing results parameter (set to False for final);
        # prints only the tags of the current sentence
        if show_results:
            print('news_sentence_'+str(num)+':')
            print('# of words: ', len(news_token))
            print('nltk pos_tag: \n'+str(nltk_pos_result))
            print('\n')
            print('spaCy: \n'+str(spacey_pos_result))
            print('='*100)
    return nltk_pos, spacy_pos

In [76]:
pos_results = pos_compare(show_results=False) # suppressing results for final writeup

Comparing my manually tagged sentence with the output of both the NLTK pos_tagger and the spaCy tagger, we can observe that all the tagged tokens match and are accurate. One uncertainty remains: 'crypto' is naturally a noun, and both taggers labeled it as a noun, but I am unsure whether they tagged the word correctly on its merits or simply defaulted to a noun because they do not know what 'crypto' is.
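
To make the manual annotation easier to read and check, the Penn tags it uses can be spelled out next to each token. The sketch below is an illustration only: `PENN_TAGS` is a hand-written glossary covering just the tags in the manual annotation (not the full Penn tag set), and `describe` is a hypothetical helper name.

```python
# Glossary for the Penn Treebank tags used in the manual annotation only.
PENN_TAGS = {
    'PRP': 'personal pronoun',
    'VBZ': 'verb, 3rd person singular present',
    'IN': 'preposition or subordinating conjunction',
    'NNP': 'proper noun, singular',
    'VBG': 'verb, gerund or present participle',
    'TO': 'to',
    'VB': 'verb, base form',
    'DT': 'determiner',
    'NN': 'noun, singular or mass',
    'RBR': 'adverb, comparative',
    'RB': 'adverb',
}

# Manual tags copied from the annotation above.
manual_tags = [
    ('It', 'PRP'), ('feels', 'VBZ'), ('like', 'IN'), ('Congress', 'NNP'),
    ('is', 'VBZ'), ('starting', 'VBG'), ('to', 'TO'), ('look', 'VB'),
    ('a', 'DT'), ('bit', 'NN'), ('more', 'RBR'), ('closely', 'RB'),
    ('at', 'IN'), ('crypto', 'NN'),
]

def describe(tagged, glossary=PENN_TAGS):
    """Attach a human-readable description to every (word, tag) pair."""
    return [(word, tag, glossary.get(tag, 'unknown tag')) for word, tag in tagged]

for word, tag, meaning in describe(manual_tags):
    print(f'{word:10s} {tag:4s} {meaning}')
```

A tag that is missing from the glossary shows up as 'unknown tag', which is a quick way to spot a typo in a hand-written annotation.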

Since the first sentence did not produce any differing results, I pulled a second sentence (news_sentence_2) to test. Most of the POS tags were once again the same, but the following difference was interesting.

NLTK:
> ('D-Mass.', 'NNP')

spaCy:
> (D, 'NNP'), (-, 'HYPH'), (Mass., 'NNP')

We can see that the default tokenizer in spaCy's pipeline is a bit more granular: it splits 'D-Mass.' (Democrat, Massachusetts) into three separate tokens, whereas the NLTK tokenizer keeps it as a single element. Aside from the different tokenization methods, however, both POS taggers produce similar results for the two sentences compared here.

spaCy pipeline documentation: https://spacy.io/usage/processing-pipelines
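
Because the two tokenizers split 'D-Mass.' differently, a position-by-position comparison of the tag lists would drift out of step after that token. One possible workaround, sketched here under the assumption that the two tokenizations differ only in where they split (no changed characters, no whitespace inside tokens), is to group the finer tokens under the coarser token they spell out. `align_tokens` is a hypothetical helper, not something either library provides.

```python
def align_tokens(coarse, fine):
    """Group fine-grained tokens under the coarse token they concatenate to.

    Raises ValueError if the fine tokens do not spell out the coarse ones.
    """
    groups = []
    j = 0  # index of the next unused fine token
    for token in coarse:
        pieces = []
        built = ''
        while built != token:
            if j >= len(fine) or not token.startswith(built + fine[j]):
                raise ValueError(f'cannot align {token!r}')
            built += fine[j]
            pieces.append(fine[j])
            j += 1
        groups.append((token, pieces))
    if j != len(fine):
        raise ValueError('unmatched fine tokens remain')
    return groups

# Tokens as reported above: one NLTK token versus three spaCy tokens.
print(align_tokens(['D-Mass.'], ['D', '-', 'Mass.']))
# [('D-Mass.', ['D', '-', 'Mass.'])]
```

With the groups in hand, the tags of the finer tokens can be compared against the single tag of the coarser token, for example by checking whether any of them is NNP.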