# POS, Stem, and Ngrams: Oh My!
<img src="images/fdr.jpg" height="200" width="200" align="left">
<center>
<h3>In This Worksheet</h3> We will parse a new text using the methods described up to this point and introduce some new ways to characterize words and sentences in texts.
<h3>The Data</h3> <strong>Pearl Harbor Address to the Nation</strong><br><i>President Franklin Delano Roosevelt, December 8th, 1941</i><br>
This is FDR's call to war to the American people after Pearl Harbor.<br>
https://youtu.be/3VqQAf74fsE
</center>

First, let's start by doing some imports and then loading up the FDR data set we created in the last exercise, using pandas's read_csv method.

In [28]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
import string

In [27]:
fp = 'data/FDR-PearlHarbor_parsed.csv'
speech = pd.read_csv(fp, index_col=0)
speech.head()

Unnamed: 0,is_punct,is_stop,sent_id,token
0,False,False,1,mr.
1,False,False,1,vice
2,False,False,1,president
3,True,False,1,","
4,False,False,1,mr.


## Ngrams
We will cover ngrams first because they are the easiest to visualize with our existing data.  An ngram is a sequence of n tokens that occur consecutively in a text; a bigram (2gram), for example, is a pair of adjacent tokens.

In [58]:
ex_tokens = ['there', 'is', 'a', 'dog', 'in', 'my', 'purse']
for ngram in nltk.ngrams(ex_tokens, 2):
    print(ngram)

('there', 'is')
('is', 'a')
('a', 'dog')
('dog', 'in')
('in', 'my')
('my', 'purse')
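
To make the sliding window concrete, here is a minimal sketch of an ngram generator that uses only the standard library. `simple_ngrams` is a hypothetical helper written for illustration, not NLTK's implementation, and it assumes the tokens are held in a list.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # Build n copies of the list, each shifted one position further along,
    # then zip them so that slot i lines up tokens i, i+1, ..., i+n-1.
    shifted = [tokens[i:] for i in range(n)]
    return list(zip(*shifted))

ex_tokens = ['there', 'is', 'a', 'dog', 'in', 'my', 'purse']
bigrams = simple_ngrams(ex_tokens, 2)
trigrams = simple_ngrams(ex_tokens, 3)

print(bigrams[0], bigrams[-1])
print(len(bigrams), len(trigrams))
```

A list of k tokens yields k - n + 1 ngrams, which is why the seven tokens above give six bigrams and five trigrams.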


We have seen ngram-related concepts at work previously when we looked at context and collocations.

In [59]:
#make a text out of list of tokens from DataFrame
text = nltk.Text(speech.token.tolist())

Remember that when we generated the contexts of a word, we got back a FreqDist of all of the (preceding word, following word) pairs that surrounded it.

In [60]:
contexts = nltk.ContextIndex(text.tokens)
contexts._word_to_contexts['japanese']

FreqDist({(',', 'forces'): 3,
          ('after', 'air'): 1,
          ('the', 'ambassador'): 1,
          ('the', 'attacked'): 2,
          ('the', 'empire'): 1,
          ('the', 'government'): 2})

We can do this manually as well using ngrams.  Let's look for the word 'japanese' and count up all of its contexts by checking every trigram in which it is the middle token.

In [61]:
from collections import Counter

context_dict = Counter()
for gram3 in nltk.ngrams(text.tokens, 3):
    if gram3[1] == 'japanese':
        context_dict.update([(gram3[0],gram3[2])])
context_dict

Counter({(',', 'forces'): 3,
         ('after', 'air'): 1,
         ('the', 'ambassador'): 1,
         ('the', 'attacked'): 2,
         ('the', 'empire'): 1,
         ('the', 'government'): 2})

Collocations are pairs of words that commonly occur together in the same window.  NLTK's BigramCollocationFinder finds them by sliding a window over the tokens of the text.  To look at the context around each pair ourselves, we will only consider the 4grams as an example.

In [72]:
text.collocations()

united states; last night; december 7th; forces attacked; japanese
forces; japanese government; japanese attacked


In [73]:
manual_coll_list = [('united', 'states'), ('last', 'night'), ('december', '7th'), ('forces', 'attacked'), 
                    ('japanese', 'forces'), ('japanese', 'government'), ('japanese', 'attacked')]
coll_dict = {}
for gram4 in nltk.ngrams(text.tokens, 4):
    for coll in manual_coll_list:
        term1,term2 = coll
        if term1 in gram4 and term2 in gram4:
            coll_dict[coll] = coll_dict.get(coll,[]) + [gram4]
            
print(coll_dict[('japanese','attacked')])

[(',', 'japanese', 'forces', 'attacked'), ('japanese', 'forces', 'attacked', 'hong'), (',', 'japanese', 'forces', 'attacked'), ('japanese', 'forces', 'attacked', 'guam'), (',', 'japanese', 'forces', 'attacked'), ('japanese', 'forces', 'attacked', 'the'), (',', 'the', 'japanese', 'attacked'), ('the', 'japanese', 'attacked', 'wake'), ('japanese', 'attacked', 'wake', 'island'), (',', 'the', 'japanese', 'attacked'), ('the', 'japanese', 'attacked', 'midway'), ('japanese', 'attacked', 'midway', 'island')]
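
The window idea behind collocations can be sketched directly: count every ordered pair of tokens in which the second token falls within a fixed distance of the first. `window_pairs` is a hypothetical helper, and the token list is a toy sequence assembled from phrases in the output above. The sketch assumes that a window of 2 means adjacent tokens, and it leaves out the scoring and filtering that a real collocation finder performs.

```python
from collections import Counter

def window_pairs(tokens, window=2):
    """Count ordered pairs (left, right) where right appears at most
    window - 1 positions after left (hypothetical helper)."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # tokens[i + 1:i + window] holds the next window - 1 tokens
        for right in tokens[i + 1:i + window]:
            counts[(left, right)] += 1
    return counts

# Toy sequence for illustration only, not the full speech
toy_tokens = [',', 'japanese', 'forces', 'attacked', 'guam',
              ',', 'the', 'japanese', 'attacked', 'wake', 'island']

adjacent = window_pairs(toy_tokens, window=2)
wider = window_pairs(toy_tokens, window=3)

print(adjacent[('japanese', 'attacked')])
print(wider[('japanese', 'attacked')])
```

With the wider window, 'japanese ... attacked' is counted even when 'forces' sits between the two words, so the pair is seen twice instead of once.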


## Stemming
Stemming is a crude way of shortening a word so that the various inflected forms of a word do not prevent us from identifying them as the same word.

For example, two words appear in our speech, 'attacked' and 'attack'.  In most cases, we want these to be understood as the same token, 'attack'.  NLTK's PorterStemmer can help us do this.

In [97]:
stemmer = nltk.PorterStemmer()
print(stemmer.stem('attacked'))
print(stemmer.stem('attack'))

attack
attack


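We will see that in certain cases, though, the stemmer will fail to do what we want.  For example, 'is' and 'are' are both forms of 'be', but the stemmer leaves them as two different tokens.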

In [98]:
print(stemmer.stem('is'))
print(stemmer.stem('are'))

is
are
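
Why does this happen? A stemmer works by rewriting suffixes, and irregular forms such as 'is' and 'are' share no suffix pattern with 'be'. The following is a deliberately tiny sketch of suffix stripping to show the idea. It is not the Porter algorithm, and `toy_stem` and its suffix list are invented for illustration.

```python
# Suffixes to strip, checked in this order (an invented, minimal list)
SUFFIXES = ('ing', 'ed', 's')

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    # No rule applied, so the word is returned as it came in
    return word

for word in ['attacked', 'attacks', 'attack', 'is', 'are']:
    print(word, '->', toy_stem(word))
```

The regular forms all collapse to 'attack', while 'is' and 'are' pass through untouched because no suffix rule can turn them into 'be'.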


## Lemmatizing
NLTK's default lemmatizer is the WordNet lemmatizer.  It uses WordNet's morphy function to find the lemma for a word, given the word's part of speech (the default is noun).  The available parts of speech are:

|POS|Representation|
|---|---|
|ADJ|'a'|
|ADJ_SAT|'s'|
|ADV|'r'|
|NOUN|'n'|
|VERB|'v'|

In [100]:
lemmater = nltk.WordNetLemmatizer()
print(lemmater.lemmatize('is', pos='v'))
print(lemmater.lemmatize('are', pos='v'))

be
be


A few more examples:

In [113]:
print(lemmater.lemmatize('halves'))
print(lemmater.lemmatize('foci'))
print(lemmater.lemmatize('polarizing', pos='v'))
print(lemmater.lemmatize('wandering', pos='v'))

half
focus
polarize
wander


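The lemmatizer does less than you might hope on certain words, though.  It only undoes inflection, so derived forms such as 'merrily' and 'additional' come back exactly as they went in, and any word that WordNet does not find is also returned unchanged.  In these cases it is less useful than our stemmer.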

In [116]:
print(lemmater.lemmatize('merrily', pos='r'))
print(lemmater.lemmatize('additional', 'a'))

merrily
additional
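
A rough way to picture what a dictionary-based lemmatizer does: look up irregular forms first, then try suffix rewrites and accept a result only if it is a known lemma. The sketch below is loosely modelled on that idea and is not WordNet's actual procedure. `toy_lemmatize` and all three lookup tables are invented for illustration.

```python
# Toy data invented for illustration; WordNet's real lists are far larger.
EXCEPTIONS = {
    'v': {'is': 'be', 'are': 'be'},
    'n': {'halves': 'half', 'foci': 'focus'},
}
SUFFIX_RULES = {
    'v': [('ing', 'e'), ('ing', ''), ('ed', ''), ('s', '')],
    'n': [('s', '')],
}
KNOWN_LEMMAS = {
    'v': {'be', 'polarize', 'wander', 'attack'},
    'n': {'half', 'focus', 'attack'},
}

def toy_lemmatize(word, pos='n'):
    # 1. Irregular forms are looked up directly.
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    # 2. Try each suffix rewrite; keep the first result that is a known lemma.
    for old, new in SUFFIX_RULES.get(pos, []):
        if word.endswith(old):
            candidate = word[:-len(old)] + new
            if candidate in KNOWN_LEMMAS.get(pos, set()):
                return candidate
    # 3. Nothing matched: give the word back unchanged.
    return word

for word, pos in [('are', 'v'), ('polarizing', 'v'), ('wandering', 'v'),
                  ('halves', 'n'), ('merrily', 'r')]:
    print(word, pos, '->', toy_lemmatize(word, pos))
```

Because the toy tables contain nothing for adverbs, 'merrily' falls through to step 3 and comes back unchanged, mirroring the behaviour above.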


One way around this is to use the WordNet synset, which is the associated set of 'cognitive synonyms' for the word. 

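Here we use 'merrily.r.01' to grab the synset for 'merrily' with part of speech 'r' (adverb).  The final number is a 1-based sense number, so '01' asks for the first sense.

In [123]:
from nltk.corpus import wordnet as wn

# name format is 'lemma.pos.nn': the first (01) adverb ('r') sense of 'merrily'
synset = wn.synset('merrily.r.01')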
print(synset)

Synset('happily.r.01')


We then grab all of its lemmas associated with the synset.

In [124]:
lemmas = synset.lemmas()
print(lemmas)

[Lemma('happily.r.01.happily'), Lemma('happily.r.01.merrily'), Lemma('happily.r.01.mirthfully'), Lemma('happily.r.01.gayly'), Lemma('happily.r.01.blithely'), Lemma('happily.r.01.jubilantly')]


Once you get to a lemma, you gain access to a whole new set of attributes, including antonyms, pertainyms, and derivationally related forms.  The point here is that the WordNet module is extremely powerful, but it is also very complex, and standardizing a way to interact with it, all to acquire the best lemma for your word, may not be simple.  This is why you may just want to use a stemmer.  But we will play with WordNet a little more later.

## POS Tagging
We had to identify the POS in order to properly lemmatize certain words using the WordNet lemmatizer.  But how do we get parts of speech in an automated fashion?  One option is to write your own model, but NLTK also offers a pre-trained, built-in POS tagger.  We have already discussed some of the features that these models take into account, so let's see how the POS tagger works.

In [152]:
from nltk import pos_tag

sample_tokens = speech.token.tolist()[:10]
print(pos_tag(sample_tokens))

[('mr.', 'JJ'), ('vice', 'NN'), ('president', 'NN'), (',', ','), ('mr.', 'NN'), ('speaker', 'NN'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT')]


### What do all of these tags mean?
The tags depend on the tagset; the ones shown here come from the Penn Treebank tagset:

|Tag|Description|Example|
|---|---|---|
|CC|conjunction, coordinating|and, or, but|
|CD|cardinal number|five, three, 13%|
|DT|determiner|the, a, these |
|EX|existential there|there were six boys |
|FW|foreign word|mais |
|IN|conjunction, subordinating or preposition|of, on, before, unless |
|JJ|adjective|nice, easy|
|JJR|adjective, comparative|nicer, easier|
|JJS|adjective, superlative|nicest, easiest |
|LS|list item marker| |
|MD|verb, modal auxiliary|may, should |
|NN|noun, singular or mass|tiger, chair, laughter |
|NNS|noun, plural|tigers, chairs, insects |
|NNP|noun, proper singular|Germany, God, Alice |
|NNPS|noun, proper plural|we met two Christmases ago |
|PDT|predeterminer|both his children |
|POS|possessive ending|'s|
|PRP|pronoun, personal|me, you, it |
|PRP\$|pronoun, possessive|my, your, our |
|RB|adverb|extremely, loudly, hard  |
|RBR|adverb, comparative|better |
|RBS|adverb, superlative|best |
|RP|adverb, particle|about, off, up |
|SYM|symbol|None|
|TO|infinitival to|what to do? |
|UH|interjection|oh, oops, gosh |
|VB|verb, base form|think |
|VBZ|verb, 3rd person singular present|she thinks |
|VBP|verb, non-3rd person singular present|I think |
|VBD|verb, past tense|they thought |
|VBN|verb, past participle|a sunken ship |
|VBG|verb, gerund or present participle|thinking is fun |
|WDT|wh-determiner|which, whatever, whichever |
|WP|wh-pronoun, personal|what, who, whom |
|WP\$|wh-pronoun, possessive|whose, whosever |
|WRB|wh-adverb|where, when |
|.|punctuation mark, sentence closer|.;?* |
|,|punctuation mark, comma|, |
|:|punctuation mark, colon|: |
|(|contextual separator, left paren|( |
|)|contextual separator, right paren|) |

So the tags assigned to our first ten tokens are:

|Token|POS|Interpretation|
|--|--|--|
|mr.|JJ|adjective|
|vice|NN|noun singular|
|president|NN|noun singular|
|,|,|,|
|mr.|NN|noun singular|
|speaker|NN|noun singular|
|,|,|,|
|members|NNS|noun plural|
|of|IN|conjunction or preposition|
|the|DT|determiner|

Notice that mr. receives two different tags in two different positions: JJ the first time and NN the second.  This indicates that the tagger considers the context of a word (the tokens around it), and not just the word itself, when determining its part of speech.
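
One way to picture how context enters the decision: a tagger typically sees a bundle of features for each position, and some of those features come from neighbouring tokens. The sketch below builds such a bundle for the two occurrences of mr. in our sample. `token_features` and the particular features it extracts are hypothetical and are not the features NLTK's tagger actually uses.

```python
def token_features(tokens, i):
    """Describe the token at position i together with its neighbours."""
    return {
        'word': tokens[i],
        'suffix2': tokens[i][-2:],
        # Placeholder markers stand in for missing neighbours at the edges
        'prev_word': tokens[i - 1] if i > 0 else '<START>',
        'next_word': tokens[i + 1] if i < len(tokens) - 1 else '<END>',
    }

sample = ['mr.', 'vice', 'president', ',', 'mr.', 'speaker',
          ',', 'members', 'of', 'the']

first_mr = token_features(sample, 0)
second_mr = token_features(sample, 4)

print(first_mr)
print(second_mr)
```

The 'word' and 'suffix2' features are identical for both occurrences, but 'prev_word' and 'next_word' differ, which gives a model room to assign different tags to the same word.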

We can get the part of speech for a single token by writing a little function!

In [154]:
def get_pos(token):
    return pos_tag([token])[0][1]

get_pos('mr.')

'NN'

But we are much better off tagging multiple tokens at once, so that the tagger has more context to work with.

In [156]:
tokens = speech.token.tolist()
pos_tags = pos_tag(tokens)
just_tags = [x[1] for x in pos_tags]

And we can create a new feature column for our tokens called 'pos' that stores the POS for each token.

In [159]:
# assign the list directly: wrapping it in pd.Series would align on index labels rather than row order
speech['pos'] = just_tags
speech.head()

Unnamed: 0,is_punct,is_stop,sent_id,token,pos
0,False,False,1,mr.,JJ
1,False,False,1,vice,NN
2,False,False,1,president,NN
3,True,False,1,",",","
4,False,False,1,mr.,NN


In [162]:
speech.pos.value_counts()

NN      111
IN       71
JJ       62
DT       60
NNS      36
,        30
.        26
CC       24
RB       22
VB       20
PRP      15
VBN      14
PRP$     14
VBD      14
VBP      13
TO       11
MD       11
VBZ       6
CD        5
:         4
VBG       3
WDT       3
WRB       2
EX        1
Name: pos, dtype: int64

In [163]:
noun_pos = ['NN', 'NNS', 'NNP', 'NNPS']
speech[speech.pos.isin(noun_pos)].token.value_counts().head(10)

forces        6
states        6
attack        5
people        5
japan         5
yesterday     4
night         4
i             4
government    3
nation        3
Name: token, dtype: int64
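
Now that every token has a tag, the tags can be connected back to the `pos` argument that the WordNet lemmatizer expects. The sketch below maps Penn Treebank tags onto the WordNet letters from the table in the Lemmatizing section. `penn_to_wordnet` is a hypothetical helper, and falling back to noun for every other tag is an assumption made here for simplicity.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('JJ'):   # JJ, JJR, JJS
        return 'a'
    if tag.startswith('RB'):   # RB, RBR, RBS
        return 'r'
    if tag.startswith('VB'):   # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    # Assumption: everything else, including all NN* tags and punctuation,
    # falls back to noun, which is also the lemmatizer's default.
    return 'n'

tagged = [('mr.', 'JJ'), ('vice', 'NN'), ('president', 'NN'),
          (',', ','), ('members', 'NNS')]
wordnet_ready = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_ready)
```

Each (token, letter) pair could then be passed along as `lemmater.lemmatize(token, pos=letter)`.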