# An analysis of the State of the Union speeches

**Authors:** Sarah Johnson, Chitwan Kaudan, Nadav Tadelis

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import shelve
import nltk
from nltk.stem import PorterStemmer

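# 'seaborn-dark' was renamed 'seaborn-v0_8-dark' in matplotlib 3.6; use whichever name is available
style = 'seaborn-v0_8-dark' if 'seaborn-v0_8-dark' in plt.style.available else 'seaborn-dark'
plt.style.use(style)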
plt.rcParams['figure.figsize'] = (10, 6)

## Abstract

.. brief summary of your conclusions here

## Introduction & Exploring Data Gaps

For our dataset (data/stateoftheunion1970-2017) we used Project Gutenberg's EBook of Complete State of the Union Addresses, with speeches after 2002 pulled from UCSB's The American Presidency Project.

We first examined our dataset to get a sense of which types of speeches are included, which presidents are represented, and whether there are significant gaps in the timeline.

In [8]:
# Load the data frame of speech titles, president names, and dates produced in notebook 1
addresses = pd.read_hdf('results/df1.h5', 'addresses')

We looked into what types of speeches were included in our dataset. 

In [5]:
addresses['title'].unique()  

array([' State of the Union Address', ' Address on Administration Goals',
       ' Address on Administration Goals (Budget Message)',
       ' Address Before a Joint Session of Congress'], dtype=object)

In [6]:
addresses.loc[addresses['title'] != ' State of the Union Address']

| | president | title | date |
|---|---|---|---|
| 197 | George H.W. Bush | Address on Administration Goals | 1989-02-09 |
| 201 | William J. Clinton | Address on Administration Goals | 1993-02-17 |
| 209 | George W. Bush | Address on Administration Goals (Budget Message) | 2001-02-27 |
| 218 | Barack Obama | Address Before a Joint Session of Congress | 2009-02-24 |
| 226 | Donald J. Trump | Address Before a Joint Session of Congress | 2017-02-27 |


Although these speeches are not called State of the Union Addresses, a quick Google search reveals that they were all delivered before a Joint Session of Congress around the time of year a State of the Union Address is typically delivered. They also served the same purpose: they informed Congress and the public of the state of the nation and laid out the administration's top priorities for the coming year. For our analysis, we treated these speeches as functionally equivalent to State of the Union Addresses.
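The titles in the output above each begin with a space, so any comparison against a title has to include it. The following is a minimal sketch of one way to normalize them, using a toy data frame rather than `addresses`; the column names `title_clean` and `is_sotu_title` are hypothetical and not part of our data.

```python
import pandas as pd

# Toy stand-in for the 'title' column; the leading spaces mirror the output above.
toy = pd.DataFrame({
    'title': [' State of the Union Address', ' Address on Administration Goals'],
})

# Strip surrounding whitespace so comparisons do not depend on the stray space.
toy['title_clean'] = toy['title'].str.strip()

# Flag speeches whose title is literally "State of the Union Address".
toy['is_sotu_title'] = toy['title_clean'].eq('State of the Union Address')

print(toy[['title_clean', 'is_sotu_title']])
```

Speeches flagged `False` are the ones that need a manual check like the one described above.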



Next we examined which presidents are represented in our dataset and which are not.

In [7]:
# List the presidents in order of first appearance so they can be
# cross-referenced against a complete, ordered list of presidents.
presidents = addresses['president'].unique()
pd.Series(presidents, index=range(1, len(presidents) + 1))

1           George Washington
2                  John Adams
3            Thomas Jefferson
4               James Madison
5                James Monroe
6           John Quincy Adams
7              Andrew Jackson
8            Martin van Buren
9                  John Tyler
10                 James Polk
11             Zachary Taylor
12           Millard Fillmore
13            Franklin Pierce
14             James Buchanan
15            Abraham Lincoln
16             Andrew Johnson
17           Ulysses S. Grant
18        Rutherford B. Hayes
19          Chester A. Arthur
20           Grover Cleveland
21          Benjamin Harrison
22           William McKinley
23         Theodore Roosevelt
24            William H. Taft
25             Woodrow Wilson
26             Warren Harding
27            Calvin Coolidge
28             Herbert Hoover
29      Franklin D. Roosevelt
30            Harry S. Truman
31       Dwight D. Eisenhower
32            John F. Kennedy
33          Lyndon B. Johnson
34        

Donald J. Trump is our 45th President, yet our dataset includes only 42 presidents. The presidents missing from this list are William Henry Harrison and James A. Garfield, who both died within months of taking office and so never had a chance to deliver a State of the Union Address. In addition, Grover Cleveland served two non-consecutive terms, so he is counted as both the 22nd and the 24th president. Together these account for the difference of three.
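The cross-referencing can also be done programmatically. This is a minimal sketch, not the code we ran: `missing_from` is a hypothetical helper, the reference list is a short hand-typed excerpt rather than a complete list of presidents, and its spellings are assumed to match those in the dataset.

```python
import pandas as pd

def missing_from(observed, reference):
    """Return the names in `reference` that never appear in `observed`, in reference order."""
    seen = set(observed)
    return [name for name in reference if name not in seen]

# Hypothetical excerpt of a complete, ordered list of presidents.
reference = ['Martin van Buren', 'William Henry Harrison', 'John Tyler', 'James Polk']

# Toy stand-in for addresses['president']; the real column has one row per speech.
observed = pd.Series(['Martin van Buren', 'Martin van Buren', 'John Tyler', 'James Polk'])

missing = missing_from(observed.unique(), reference)
print(missing)
```

Exact string matching is fragile here: a difference such as "James Polk" versus "James K. Polk" would be reported as a missing president, so the reference spellings need to be checked by hand.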

Next we plotted speech number against delivery date to check whether there are large timeline gaps in our dataset.

![Data Gap](fig/timeline.png)
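Gaps can also be listed numerically instead of being read off the plot. The sketch below is an illustration under stated assumptions: `find_gaps` and its `min_years` threshold are hypothetical, and the dates are synthetic rather than taken from our dataset. In principle the same function could be given `addresses['date']`.

```python
import pandas as pd

def find_gaps(dates, min_years=2.0):
    """Return consecutive-date pairs that are more than `min_years` apart."""
    ordered = pd.to_datetime(pd.Series(dates)).sort_values().reset_index(drop=True)
    # Difference between each date and the one before it, in (approximate) years.
    gap_years = ordered.diff().dt.days / 365.25
    mask = gap_years > min_years  # the first entry is NaN and compares as False
    return pd.DataFrame({
        'previous': ordered.shift(1)[mask],
        'next': ordered[mask],
        'gap_years': gap_years[mask].round(1),
    }).reset_index(drop=True)

# Synthetic yearly dates with one deliberate hole between 2002 and 2007.
toy_dates = ['2000-01-01', '2001-01-01', '2002-01-01', '2007-01-01', '2008-01-01']
gaps = find_gaps(toy_dates)
print(gaps)
```

The two-year threshold is arbitrary; because speeches are normally about a year apart, any value comfortably above one year would flag the same kind of hole.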

The gap from 1893 to 1897 corresponds to President Cleveland's second term. A quick Google search revealed that President Cleveland did deliver written State of the Union Addresses during those years, but they are not included in our dataset. Suspecting a formatting error, we took a closer look at all of President Cleveland's speeches and found that his 1889 State of the Union Address ends unexpectedly mid-sentence. It appears that our data cut off part of that speech and omitted the speeches from his second term. We kept these discrepancies in mind when analyzing these speeches.
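A speech that stops mid-sentence can be flagged with a simple heuristic: check whether its last non-whitespace character is sentence-ending punctuation. This is only a sketch. `ends_abruptly` and `TERMINATORS` are hypothetical names, the sample strings are made up rather than quoted from any speech, and we found the truncated speech by reading the text, not by running this check.

```python
# Characters we assume may legitimately end a speech (closing quotes and brackets included).
TERMINATORS = ('.', '!', '?', '"', "'", ')', ']')

def ends_abruptly(text):
    """Heuristic: True if non-empty text does not end with sentence-ending punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and not stripped.endswith(TERMINATORS)

# Made-up sample strings, not excerpts from the dataset.
samples = {
    'complete': 'This sentence is complete.\n',
    'truncated': 'This sentence stops in the',
}
flags = {name: ends_abruptly(text) for name, text in samples.items()}
print(flags)
```

The check will miss a speech that happens to be cut off right after a period, so it is better suited to producing a list of candidates for manual review than to serving as a final test.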

## Speech Features Analysis Over Time



![](fig/feature_dists.png)

![Speech Changes Over Time](fig/speech_changes.png)

![Speech Changes By President](fig/speech_characteristics.png)

## Sentiment Analysis