# NLP - Homework 4
### Miguel Bonilla

1. [Run First POS Tagger](#1.-Run-First-POS-Tagger)
    - [a. Long Sentence](#a.-Long-Sentence)
    - [b. Short Sentence](#b.-Short-Sentence)
2. [Run Second POS Tagger](#2.-Run-Second-POS-Tagger)
    - [a. Does it Produce the Same Outcome?](#a.-Does-it-Produce-the-Same-Outcome?)
    - [b. Explain Any Differences](#b.-Explain-Any-Differences)
3. [Random Sentence From Article](#3.-Random-Sentence-From-Article)
    - [a. Manual Tagging with Penn Tagset](#a.-Manual-Tagging-with-Penn-Tagset:)
    - [b. Run Sentence Through Both Taggers](#b.-Run-Sentence-Through-Both-Taggers)
    - [c. Explain the Differences](#c.-Explain-the-Differences)

1.	Run one of the part-of-speech (POS) taggers available in Python. 
    a.	Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.  
    b.	Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.

2.	Run a different POS tagger in Python. Process the same two sentences from question 1.
    a.	Does it produce the same or different output?  
    b.	Explain any differences as best you can.

3.	In a news article from this week’s news, find a random sentence of at least 10 words.
    a.	Looking at the Penn tag set, manually POS tag the sentence yourself.  
    b.	Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?  
    c.	Explain any differences between the two taggers and your manual tagging as much as you can.


In [1]:
import nltk
from nltk.tokenize import word_tokenize

## 1. Run First POS Tagger

#### a. Long Sentence

In [2]:
### 43 word long sentence, from Russian Thinkers by Isaiah Berlin
sent_long = '''Although the most extreme forms of this faith, with their dehumanising visions of individuals as instruments of abstract historical forces, have led to criminal perversions of political practice, Berlin emphasises that the faith itself cannot be dismissed as the product of sick minds.'''

In [3]:
print(sent_long)

Although the most extreme forms of this faith, with their dehumanising visions of individuals as instruments of abstract historical forces, have led to criminal perversions of political practice, Berlin emphasises that the faith itself cannot be dismissed as the product of sick minds.


In [4]:
## Tag the tokens with NLTK's pos_tag function, which uses a version of the Penn Treebank tagset
word_long_tokens = word_tokenize(sent_long)
long_tags = nltk.pos_tag(word_long_tokens)
print(long_tags)

[('Although', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('extreme', 'JJ'), ('forms', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('faith', 'NN'), (',', ','), ('with', 'IN'), ('their', 'PRP$'), ('dehumanising', 'JJ'), ('visions', 'NNS'), ('of', 'IN'), ('individuals', 'NNS'), ('as', 'IN'), ('instruments', 'NNS'), ('of', 'IN'), ('abstract', 'JJ'), ('historical', 'JJ'), ('forces', 'NNS'), (',', ','), ('have', 'VBP'), ('led', 'VBN'), ('to', 'TO'), ('criminal', 'JJ'), ('perversions', 'NNS'), ('of', 'IN'), ('political', 'JJ'), ('practice', 'NN'), (',', ','), ('Berlin', 'NNP'), ('emphasises', 'VBZ'), ('that', 'IN'), ('the', 'DT'), ('faith', 'NN'), ('itself', 'PRP'), ('can', 'MD'), ('not', 'RB'), ('be', 'VB'), ('dismissed', 'VBN'), ('as', 'IN'), ('the', 'DT'), ('product', 'NN'), ('of', 'IN'), ('sick', 'JJ'), ('minds', 'NNS'), ('.', '.')]


The default tagger, surprisingly, tags this 43-word sentence correctly, even though the sentence is written in UK English and contains words that are spelled differently in American English. Perhaps most surprisingly, it correctly tagged "dehumanising" and "abstract" as adjectives based on the context.

#### b. Short Sentence

In [5]:
sent_short = '''John was alone'''
print(sent_short)

John was alone


In [6]:
word_short_tokens = word_tokenize(sent_short)
short_tags = nltk.pos_tag(word_short_tokens)
short_tags

[('John', 'NNP'), ('was', 'VBD'), ('alone', 'RB')]

In this case, the tagger incorrectly tagged the word "alone" as an adverb (RB). In this context it is an adjective (JJ): it follows the linking verb "was" and describes "John", rather than modifying a verb, an adjective, or another adverb. One plausible explanation is that "alone" appears more often as an adverb in the data the tagger was trained on, so the model favors RB even in this position.

## 2. Run Second POS Tagger

In [7]:
from nltk.corpus import brown
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

In [8]:
tagger = UnigramTagger(treebank.tagged_sents())

In [None]:
# Helper that tags an already tokenized sentence with the unigram tagger and prints the result.
def pos_unigram(sent):
    tags = tagger.tag(sent)
    print(tags)

In [10]:
pos_unigram(word_long_tokens)

[('Although', 'IN'), ('the', 'DT'), ('most', 'JJS'), ('extreme', None), ('forms', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('faith', 'NN'), (',', ','), ('with', 'IN'), ('their', 'PRP$'), ('dehumanising', None), ('visions', None), ('of', 'IN'), ('individuals', 'NNS'), ('as', 'IN'), ('instruments', 'NNS'), ('of', 'IN'), ('abstract', None), ('historical', 'JJ'), ('forces', 'NNS'), (',', ','), ('have', 'VBP'), ('led', 'VBN'), ('to', 'TO'), ('criminal', 'JJ'), ('perversions', None), ('of', 'IN'), ('political', 'JJ'), ('practice', 'NN'), (',', ','), ('Berlin', 'NNP'), ('emphasises', None), ('that', 'IN'), ('the', 'DT'), ('faith', 'NN'), ('itself', 'PRP'), ('can', 'MD'), ('not', 'RB'), ('be', 'VB'), ('dismissed', 'VBN'), ('as', 'IN'), ('the', 'DT'), ('product', 'NN'), ('of', 'IN'), ('sick', None), ('minds', None), ('.', '.')]


In [None]:
pos_unigram(word_short_tokens)

#### a. Does it Produce the Same Outcome?

For the longer sentence, the unigram tagger produces a different outcome from the tags in part 1: several words were left untagged (`None`). A multi-step tagging process that chains the unigram tagger to a default tagger as a backoff would have ensured that no words were left untagged.
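The backoff idea can be sketched without NLTK. The block below is a minimal pure-Python illustration, not NLTK's implementation: `train_unigram`, `tag_unigram`, and the toy training data are hypothetical names and values invented for this example. In NLTK itself, the same effect is usually obtained by passing a `DefaultTagger` as the `backoff` argument of `UnigramTagger`.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, tokens, default=None):
    """Look each token up in the model; unseen words receive `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]

# Toy training data, invented for illustration (not the treebank corpus).
toy_train = [
    [("the", "DT"), ("faith", "NN"), ("of", "IN"), ("minds", "NNS")],
    [("the", "DT"), ("product", "NN")],
]
model = train_unigram(toy_train)
tokens = ["the", "sick", "faith"]

# Without a fallback, the unseen word "sick" stays untagged (None).
no_backoff = tag_unigram(model, tokens)
# With a default tag as fallback, every token receives some tag.
with_backoff = tag_unigram(model, tokens, default="NN")

print(no_backoff)
print(with_backoff)
```

A default tag guarantees coverage, not correctness: here "sick" would be labeled a noun, so the fallback only removes the `None` entries.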

The short sentence was tagged exactly as it was in part 1. It should be noted, however, that both results are incorrect, since, as mentioned previously, the word "alone" is an adjective in the context of the short sentence.

#### b. Explain Any Differences

For the longer sentence, in the word sequence "the most extreme", the unigram tagger tags "most" as a superlative adjective (JJS) and leaves "extreme" untagged. Because a unigram tagger looks at each word in isolation, it cannot tell that "most" is modifying "extreme", which is itself an adjective, so "most" is an adverb (RBS) in this case. A bigram tagger, which also takes the preceding tag into account, would have a better chance of tagging that sequence properly.
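To make the role of context concrete, here is a small pure-Python sketch of a tagger that conditions on the previous tag and falls back to the word alone. It is an illustration under stated assumptions, not NLTK's `BigramTagger`: the function names and the three toy training sentences are invented for this example, with the tags chosen so that "most" is usually JJS but is RBS after a determiner.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Count tags per (previous tag, word) context and per word alone."""
    by_context = defaultdict(Counter)
    by_word = defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"  # marker for the start of a sentence
        for word, tag in sent:
            by_context[(prev, word)][tag] += 1
            by_word[word][tag] += 1
            prev = tag
    return by_context, by_word

def tag_bigram(model, tokens):
    """Prefer the (previous tag, word) context; back off to the word alone."""
    by_context, by_word = model
    prev, out = "<s>", []
    for tok in tokens:
        if (prev, tok) in by_context:
            tag = by_context[(prev, tok)].most_common(1)[0][0]
        elif tok in by_word:
            tag = by_word[tok].most_common(1)[0][0]
        else:
            tag = None  # word never seen in training
        out.append((tok, tag))
        prev = tag if tag is not None else "<unk>"
    return out

# Toy training data, invented for illustration.
toy_train = [
    [("most", "JJS"), ("people", "NNS"), ("agree", "VBP")],
    [("most", "JJS"), ("voters", "NNS"), ("agree", "VBP")],
    [("the", "DT"), ("most", "RBS"), ("extreme", "JJ"), ("forms", "NNS")],
]
model = train_bigram(toy_train)

# The word-only choice for "most" is its most frequent tag overall.
unigram_choice = model[1]["most"].most_common(1)[0][0]
# With the preceding determiner as context, the choice changes.
tagged = tag_bigram(model, ["the", "most", "extreme", "voters"])

print(unigram_choice)
print(tagged)
```

In this toy setup the word-only lookup picks JJS for "most", while the context-aware lookup picks RBS after "the". The last token, "voters", was never seen after an adjective, so it is resolved by the word-only fallback.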

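The words "dehumanising", "visions", and "abstract" were all tagged properly in part 1 (two adjectives and a plural noun), but none of them was tagged by the unigram tagger. A unigram tagger can only tag words it saw during training, so words that do not occur in the treebank sample come back as `None`. The tagger from part 1 uses a more sophisticated approach: by default, NLTK's `pos_tag` relies on a pretrained averaged perceptron model, which looks at features such as the surrounding words and the previously assigned tags, so it can make a determination based on the preceding and following words.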

## 3. Random Sentence From Article

In [16]:
### News sentence, 13 words long, from "Jimmy Carter's Presidency Was Not What You Think" (The New York Times).
news_sent = '''He decided to use power righteously, ignore politics and do the right thing.'''

#### a. Manual Tagging with Penn Tagset:

Word | Tag
-----|----
He | PRP (Personal pronoun)
decided | VBD (Verb, past tense)
to | TO (Infinitive 'to')
use | VB (Verb, base form)
power | NN (Noun, singular)
righteously | RB (Adverb)
ignore | VB (Verb, base form)
politics | NNS (Noun, plural)
and | CC (Coordinating conjunction)
do | VB (Verb, base form)
the | DT (Determiner)
right | JJ (Adjective)
thing | NN (Noun, singular)

#### b. Run Sentence Through Both Taggers

In [18]:
##tokenize sentence
word_news_tokens = word_tokenize(news_sent)

#tagger from part 1
news_tags = nltk.pos_tag(word_news_tokens)
print(news_tags)

[('He', 'PRP'), ('decided', 'VBD'), ('to', 'TO'), ('use', 'VB'), ('power', 'NN'), ('righteously', 'RB'), (',', ','), ('ignore', 'NN'), ('politics', 'NNS'), ('and', 'CC'), ('do', 'VBP'), ('the', 'DT'), ('right', 'JJ'), ('thing', 'NN'), ('.', '.')]


In [17]:
### tagger from part 2
pos_unigram(word_news_tokens)

[('He', 'PRP'), ('decided', 'VBD'), ('to', 'TO'), ('use', 'NN'), ('power', 'NN'), ('righteously', None), (',', ','), ('ignore', None), ('politics', 'NNS'), ('and', 'CC'), ('do', 'VBP'), ('the', 'DT'), ('right', 'NN'), ('thing', 'NN'), ('.', '.')]


Neither of the taggers produced exactly the same results as my manually tagged sentence. The first tagger tagged "ignore" as a noun (NN), when in this case it is a verb. Additionally, it tagged the word "do" as VBP (verb, non-3rd person singular present), where I used VB. Since "do" is coordinated with "use" and "ignore" under the infinitive "to", the base form VB appears to be the better fit, although I am not fully certain.

The unigram tagger incorrectly tagged "use" as a noun when it is a verb, and tagged "right" as a noun rather than an adjective. It also left "righteously" and "ignore" untagged, and, like the first tagger, it tagged "do" as VBP.
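The comparison above can also be done programmatically. The sketch below copies the manual tags from part 3a and the two tagger outputs from part 3b, then lists every disagreement. The helper `mismatches` is a hypothetical name introduced for this example. Punctuation is skipped because it was not tagged by hand.

```python
# Manual tags from part 3a (punctuation was not tagged by hand).
manual = [("He", "PRP"), ("decided", "VBD"), ("to", "TO"), ("use", "VB"),
          ("power", "NN"), ("righteously", "RB"), ("ignore", "VB"),
          ("politics", "NNS"), ("and", "CC"), ("do", "VB"), ("the", "DT"),
          ("right", "JJ"), ("thing", "NN")]

# Output of nltk.pos_tag, copied from part 3b.
tagger1 = [("He", "PRP"), ("decided", "VBD"), ("to", "TO"), ("use", "VB"),
           ("power", "NN"), ("righteously", "RB"), (",", ","), ("ignore", "NN"),
           ("politics", "NNS"), ("and", "CC"), ("do", "VBP"), ("the", "DT"),
           ("right", "JJ"), ("thing", "NN"), (".", ".")]

# Output of the unigram tagger, copied from part 3b.
tagger2 = [("He", "PRP"), ("decided", "VBD"), ("to", "TO"), ("use", "NN"),
           ("power", "NN"), ("righteously", None), (",", ","), ("ignore", None),
           ("politics", "NNS"), ("and", "CC"), ("do", "VBP"), ("the", "DT"),
           ("right", "NN"), ("thing", "NN"), (".", ".")]

def mismatches(gold, predicted, skip=(",", ".")):
    """Return (word, gold_tag, predicted_tag) for every disagreement."""
    words_only = [(w, t) for w, t in predicted if w not in skip]
    if [w for w, _ in gold] != [w for w, _ in words_only]:
        raise ValueError("token sequences do not line up")
    return [(w, g, p) for (w, g), (_, p) in zip(gold, words_only) if g != p]

print(mismatches(manual, tagger1))
print(mismatches(manual, tagger2))
```

With these inputs, the first tagger differs from the manual tags on two words ("ignore" and "do") and the unigram tagger on five ("use", "righteously", "ignore", "do", and "right").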

#### c. Explain the Differences