# An analysis of the State of the Union speeches - Part 2

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import shelve
import nltk
from nltk.stem import PorterStemmer

plt.style.use('seaborn-dark')  # this style is named 'seaborn-v0_8-dark' in matplotlib >= 3.6
plt.rcParams['figure.figsize'] = (10, 6)

Let's start by loading some of the data created in the previous part, so we can continue where we left off:

In [15]:
addresses = pd.read_hdf('results/df1.h5', 'addresses')
with shelve.open('results/vars1') as db:
    speeches = db['speeches']

Let's double-check that we're getting the full set of speeches:

In [16]:
print(addresses.shape)
print(len(speeches))

(227, 3)
227


## Basic text analysis

Let's ask a few basic questions about this text, by populating our `addresses` dataframe with some extra information. As a reminder, so far we have:

In [6]:
addresses.head()

|   | president | title | date |
|---|-----------|-------|------|
| 0 | George Washington | State of the Union Address | 1790-01-08 |
| 1 | George Washington | State of the Union Address | 1790-12-08 |
| 2 | George Washington | State of the Union Address | 1791-10-25 |
| 3 | George Washington | State of the Union Address | 1792-11-06 |
| 4 | George Washington | State of the Union Address | 1793-12-03 |


Now, let's add the following information to this DF:

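* `n_words`: number of words in the speech, excluding stop words and punctuation
* `n_words_all`: number of tokens in the speech, including stop words and punctuation
* `n_uwords`: number of *unique* words in the speech
* `n_swords`: number of *unique, stemmed* words in the speech
* `n_chars`: number of characters in the speech
* `n_sent`: number of sentences in the speech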
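
To make these definitions concrete, here is a minimal sketch on a toy two-sentence text. It relies only on the standard library: the regex tokenizer, the sentence splitter and the `toy_stem` suffix stripper are crude, hypothetical stand-ins rather than what NLTK does, and stop words are not removed.

```python
import re

def toy_stem(word):
    """Crude suffix stripper, a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_stats(text):
    """Compute the five quantities for one text with naive tokenization."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_words": len(words),
        "n_uwords": len(set(words)),
        "n_swords": len({toy_stem(w) for w in words}),
        "n_chars": len(text),
        "n_sent": len(sentences),
    }

stats = toy_stats("The states are united. The state is uniting!")
print(stats)
# {'n_words': 8, 'n_uwords': 7, 'n_swords': 5, 'n_chars': 44, 'n_sent': 2}
```

Stemming merges "states" with "state" and "united" with "uniting", which is why `n_swords` is smaller than `n_uwords`.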

For this level of complexity, it's probably best if we go with NLTK. Remember that `speeches` is our list of all the speeches, indexed in the same way as the `addresses` dataframe:

In [12]:
from nltk.corpus import stopwords
import string

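def clean_word_tokenize(doc):
    """Custom word tokenizer which removes stop words and punctuation.
    
    Parameters
    ----------
    doc : string
        A document to be tokenized
        
    Returns
    -------
    tokens : list of str
        Lower-cased tokens that passed the stop-word and punctuation filter
    """
    # combine stop words and punctuation into a set for fast membership tests
    stop = set(stopwords.words("english")) | set(string.punctuation)
    
    # filter out stop words and punctuation, then send to lower case; the
    # comparison happens before lower-casing, so capitalized stop words
    # such as "The" are kept
    tokens = [token.lower() for token in nltk.word_tokenize(doc) if token not in stop]

    return tokens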
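
The order of the two steps in the list comprehension matters: a token is checked against the stop list first and lower-cased afterwards. A minimal sketch of the difference, using a hypothetical three-word stop list and `str.split` in place of the NLTK stop words and tokenizer:

```python
STOP = {"the", "of", "and"}  # hypothetical miniature stop list

def filter_then_lower(tokens):
    """Same order as clean_word_tokenize: compare first, lower-case after."""
    return [t.lower() for t in tokens if t not in STOP]

def lower_then_filter(tokens):
    """Alternative order: lower-case first, then compare."""
    return [t for t in (tok.lower() for tok in tokens) if t not in STOP]

tokens = "The state of the Union".split()
print(filter_then_lower(tokens))  # ['the', 'state', 'union']
print(lower_then_filter(tokens))  # ['state', 'union']
```

With the first ordering, a sentence-initial "The" survives the filter, so the word counts include some capitalized stop words.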
    

In [13]:
stemmer = PorterStemmer()

# number of unique stemmed words in the third speech
len(set(stemmer.stem(word) for word in clean_word_tokenize(speeches[2])))


626

In [9]:
len(set(clean_word_tokenize(speeches[4])))

714

In [10]:
len(speeches[2])

14203

Now we compute these quantities for each speech. We'll also need the set of unique, stemmed words for each speech later, to construct the complete term-document matrix across all speeches.

In [11]:
addresses['n_sent'] = [len(nltk.sent_tokenize(speech)) for speech in speeches]
addresses['n_words_all'] = [len(nltk.word_tokenize(speech)) for speech in speeches]
addresses['n_words'] = [len(clean_word_tokenize(speech)) for speech in speeches]
addresses['n_uwords'] = [len(set(clean_word_tokenize(speech))) for speech in speeches]
addresses['n_swords'] = [len(set(stemmer.stem(word) for word in clean_word_tokenize(speech))) for speech in speeches]
addresses['n_chars'] = [len(speech) for speech in speeches]                             

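
The cell above runs the tokenizer on every speech several times, which adds up on the full corpus. One possible alternative, sketched here under assumptions, is a single pass that tokenizes each speech once and also keeps the set of unique stems. The `summarize` helper is hypothetical, and the `str.split` tokenizer and `rstrip` "stemmer" are stand-ins so that the sketch runs without NLTK data; in the notebook they would be `clean_word_tokenize` and `stemmer.stem`.

```python
def summarize(docs, tokenize, stem):
    """Return (rows, stem_sets): per-document counts and the unique stems."""
    rows, stem_sets = [], []
    for doc in docs:
        tokens = tokenize(doc)                 # tokenize once, reuse below
        stems = {stem(tok) for tok in tokens}  # unique stemmed words
        rows.append({
            "n_words": len(tokens),
            "n_uwords": len(set(tokens)),
            "n_swords": len(stems),
            "n_chars": len(doc),
        })
        stem_sets.append(stems)
    return rows, stem_sets

# Stand-in tokenizer and stemmer, for illustration only
rows, stem_sets = summarize(
    ["we the people", "the people people"],
    tokenize=str.split,
    stem=lambda w: w.rstrip("s"),
)
print(rows[1])  # {'n_words': 3, 'n_uwords': 2, 'n_swords': 2, 'n_chars': 17}
```

The list of dictionaries can be turned into dataframe columns with `pd.DataFrame(rows)`, and `stem_sets` is the per-speech vocabulary needed for a term-document matrix.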

Let's look at a summary of these quantities:

In [None]:
addresses.head()

In [None]:
pd.options.display.precision = 2
addresses.describe()

## Visualizing characteristics of the speeches

Now we explore some of the relationships between the speeches, their authors, and time.

First, let's look at how the properties of the speeches change over time:

In [None]:
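# Plot each speech property against the date of the address, one panel per property.
# Assumes the n_* columns computed above are present in `addresses`.
cols = ['n_sent', 'n_words_all', 'n_words', 'n_uwords', 'n_swords', 'n_chars']
dates = pd.to_datetime(addresses['date'])
fig, axes = plt.subplots(3, 2, figsize=(12, 10), sharex=True)
for ax, col in zip(axes.flat, cols):
    sns.scatterplot(x=dates, y=addresses[col], ax=ax)
    ax.set_title(col)
    ax.set_xlabel('')
    ax.set_ylabel('')
fig.tight_layout()

#plt.savefig("fig/speech_changes.png")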

Now for the distributions by president:

In [None]:
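# One possible view: a box per president showing the spread of unique-word counts
fig, ax = plt.subplots(figsize=(10, 12))
sns.boxplot(data=addresses, x='n_uwords', y='president', ax=ax)
plt.savefig("fig/speech_characteristics.png");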

## Intermediate results storage

Since this may have taken a while, we now serialize the results we have for further use. Note that we don't overwrite our original dataframe file, so we can load both (even though in this notebook we reused the name `addresses`):

In [17]:
speech_words = [nltk.word_tokenize(speech) for speech in speeches]
speeches_cleaned = [clean_word_tokenize(speech) for speech in speeches]

In [18]:
addresses.to_hdf('results/df2.h5', key='addresses')

with shelve.open('results/vars2') as db:
    db['speech_words'] = speech_words
    db['speeches_cleaned'] = speeches_cleaned
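
A later notebook can read these variables back by opening the same shelf. The round trip is sketched below on a temporary file so that it runs on its own; the `vars_demo` file name and the two toy token lists are hypothetical stand-ins for `results/vars2` and the real data.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vars_demo")  # hypothetical file name

    # Write: same pattern as the cell above
    with shelve.open(path) as db:
        db["speeches_cleaned"] = [["fellow", "citizens"], ["union", "strong"]]

    # Read: reopen the shelf and look the variable up by key
    with shelve.open(path) as db:
        stored_keys = sorted(db.keys())
        restored = db["speeches_cleaned"]

print(stored_keys)  # ['speeches_cleaned']
print(restored)     # [['fellow', 'citizens'], ['union', 'strong']]
```

Values are pickled on write, so nested lists of strings such as the token lists come back unchanged.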