# State of the Union Addresses

The State of the Union Address is an annual address given by the President of the United States to the Congress of the United States.

The first address was a speech given by George Washington in 1790. Some of the addresses have been given as a speech, others have been given as a letter, while some have been given as some combination of both a speech and a letter. 

In this lab we will consider the collection of addresses as a dataset and see if we can detect differences and similarities between the addresses.

A text file containing 214 State of the Union addresses can be obtained from Project Gutenberg. We have downloaded this file and processed it into the file sotu.csv, located in this folder.

Here is how the csv file was created. The Project Gutenberg file contains a lot of header and trailer information besides the original addresses, so the file needs to be processed. It was first split into 214 separate files, each containing just one address. These files were then used to create bigram vectors.

A bigram vector of a text counts how many times each two-word sequence appears in the text. For example, the bigram vector for the text

     'to be or not to be'
     
 would be 
 
    to be  : 2
    be or  : 1
    or not : 1
    not to : 1
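The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the script that produced sotu.csv: the helper name `bigram_vector`, the lowercasing, and the split on whitespace are all assumptions, since the lab does not say how the addresses were tokenized.

```python
from collections import Counter

def bigram_vector(text):
    """Count each pair of adjacent words in text (hypothetical helper)."""
    # Assumption: words are separated by whitespace and case is ignored.
    words = text.lower().split()
    # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
    return Counter(zip(words, words[1:]))

counts = bigram_vector("to be or not to be")
for (first, second), n in counts.most_common():
    print(f"{first} {second} : {n}")
```

A `Counter` stores only the bigrams that actually occur, which is convenient because most of the possible two-word combinations never appear in any one text.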
    
A bigram vector is created for each address. For example, the bigram vector for George Washington's first address contains 926 bigrams. The top 10 bigrams are

    of the : 10
    to the : 13
    in the : 7
    to be : 6
    will be : 5
    of our : 5
    from the : 5
    united states : 4
    the united : 4
    that the : 4


The top ten bigrams for Abraham Lincoln's first address are

    of the : 149
    to the : 56
    it is : 27
    for the : 26
    by the : 22
    have been : 21
    and the : 21
    the union : 19
    that the : 18

As you can see, there are some similarities and some differences. 

These 214 bigram vectors can be considered as 214 multivariate data points and we can conduct many statistical tests on these. 

Our data set currently has a very large number of features, or dimensions. (Think of all the possible two-word combinations.) To be able to view the data set, we need to reduce the number of dimensions while still retaining most of the variation in the data. One popular technique for this kind of dimensionality reduction is Principal Component Analysis, or PCA for short. PCA determines the directions in which a sample varies the most. In general there will still be many components after conducting PCA. The csv file contains just the first three components. Think of this as reducing our data set to a low-dimensional one whose dimensions are the directions in which the data varies the most.
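To make the idea concrete, here is a small sketch of PCA computed with NumPy's singular value decomposition. The matrix `X` is a made-up toy example (4 texts, 3 bigram counts), not data from the addresses, and this is only one common way to compute the components; the lab does not state which tool was used to build sotu.csv.

```python
import numpy as np

# Toy count matrix: 4 texts (rows) x 3 bigram features (columns).
# These numbers are made up for illustration only.
X = np.array([
    [10.0, 2.0, 0.0],
    [8.0, 3.0, 1.0],
    [1.0, 9.0, 4.0],
    [0.0, 10.0, 5.0],
])

# Center each feature so the components describe variation about the mean.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project each text onto the first two directions: two PCA features per text.
components = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each direction.
explained = S**2 / np.sum(S**2)

print(components.shape)
print(explained)
```

Each row of `components` plays the same role as the PCA columns in the csv file: a short list of coordinates that summarizes a much longer bigram vector.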

The sotu.csv data set contains the following 8 features

1) the name of the President
2) the year of the address
3) the delivery of the address (s for speech, w for written letter, b for both, and o for other)
4) the party affiliation of the president (I for Independent, DR for Democratic-Republican, F for Federalist, W for Whig, D for Democratic, R for Republican)
5) the first PCA component
6) the second PCA component
7) the third PCA component
8) the length of the address
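Because Python counts from zero, feature 1 in the list above is column index 0 and feature 8 is column index 7. The sketch below shows one way to give those positions names when reading rows. The in-memory sample stands in for sotu.csv: its header names and both data rows are placeholders invented for illustration, not values from the real file.

```python
import csv
import io

# Column positions follow the feature list above (0-based indices).
NAME, YEAR, DELIVERY, PARTY, PCA1, PCA2, PCA3, LENGTH = range(8)

# Stand-in for sotu.csv. The header names and both rows are placeholders,
# not values taken from the real file.
sample = io.StringIO(
    "president,year,delivery,party,pca1,pca2,pca3,length\n"
    "Example President,1900,w,R,0.5,-1.25,2.0,12000\n"
    "Another President,1950,s,D,-0.75,0.5,-1.0,5000\n"
)

reader = csv.reader(sample, delimiter=",")
next(reader)  # skip the header row

rows = []
for line in reader:
    rows.append({
        "name": line[NAME],
        "year": float(line[YEAR]),
        "delivery": line[DELIVERY],
        "party": line[PARTY],
        "pca": [float(line[PCA1]), float(line[PCA2]), float(line[PCA3])],
        "length": float(line[LENGTH]),
    })

print(rows[0]["year"], rows[0]["delivery"], rows[0]["length"])
```

Named indices make it easier to see that `line[1]` is the year, `line[2]` the delivery code, and `line[7]` the length when those expressions appear in the plotting code.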

Later in this lab, we will plot the data set in 3D using features 2, 5, and 6.

Let's begin our analysis by plotting the length as a function of the year.


In [None]:
import csv
import numpy as np
import matplotlib.pyplot as plt

filename = "sotu.csv"
file = open(filename,"r")
reader = csv.reader(file, delimiter=",")

fig = plt.figure()
ax = fig.add_subplot()
fig.set_size_inches(5,5, forward=True)

colormap = {'s':'black','w':'black','o':'black','b':'black'}

next(reader)

for line in reader:
    # print(line[1],line[7],line[3])
    x = float(line[1])   # year of the address
    y = float(line[7])   # length of the address
    col = colormap[line[2]]   # color chosen by delivery type
    ax.scatter(x,y,marker="o",color=col)
    
file.close()
plt.show()

As we can see, the earliest addresses (from 1790 to around 1820) were all about the same length. After that, the addresses grew longer and their lengths began to vary significantly. The later addresses all seem to be very similar in length, except for one outlier. This outlier is by far the longest address and was the last address given by President Jimmy Carter, in 1981.
Notice the following line of code

colormap = {'s':'black','w':'black','o':'black','b':'black'}

This defines a map (a Python dictionary). The map assigns a color to each data point based on the value of its delivery feature: speech, written, other, or both.
Currently these are all set to black. Try changing these to 'blue' for speech, 'red' for written, and 'black' for other and both.
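As a quick check of how the lookup works, the snippet below builds the dictionary with the colors suggested above and prints the color chosen for a few delivery codes. It is a standalone sketch and does not read sotu.csv.

```python
# Colors suggested in the text: blue for speech, red for written,
# black for other and both.
colormap = {'s': 'blue', 'w': 'red', 'o': 'black', 'b': 'black'}

# In the plotting loop, each row's delivery code (column index 2) is the key.
for delivery in ['s', 'w', 'b']:
    print(delivery, '->', colormap[delivery])
```

A delivery code that is missing from the dictionary would raise a `KeyError`, so all four codes need an entry even if several share a color.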

Now run the code again to replot the data set.

In the cell below, comment on the variation that you now see in the data set. Are there differences or similarities based on delivery?


Now let's plot the first two PCA features as a function of the year.

In [None]:
import csv
import numpy as np
import matplotlib.pyplot as plt

filename = "sotu.csv"

file = open(filename,'r')
csv_reader = csv.reader(file, delimiter=',')

colormap = {'s':'black','w':'black','o':'black','b':'black'}
sizemap = {'I':10,'F':10,'DR':10,'W':10,'D':10,'R':10}

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
fig.set_size_inches(10, 10, forward=True)

next(csv_reader)
for line in csv_reader:
    x = float(line[1])
    y = float(line[4])
    z = float(line[5])
    col = colormap[line[2]]
    size = sizemap[line[3]]
    ax.scatter(x,y,z,marker="o", s=size,color=col)

file.close()
plt.show()

Notice the following two lines of code

colormap = {'s':'black','w':'black','o':'black','b':'black'}
sizemap = {'I':10,'F':10,'DR':10,'W':10,'D':10,'R':10}

We again have a colormap, which colors data points based on delivery, and we have added a sizemap, which changes the size of a data point based on party affiliation: Independent, Federalist, Democratic-Republican, Whig, Democratic, and Republican.
Currently the colors are all set to black and the sizes are set to 10. 

As before, try changing the colors. Also adjust the sizes, maybe 5 for 'D' and 40 for 'R'.
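When choosing sizes, it helps to know that Matplotlib's `scatter` interprets `s` as the marker area in points squared, not as a diameter. The short sketch below uses the suggested values of 5 and 40 to show what that means for how wide the markers look.

```python
import math

# Sizes suggested in the text: 5 for 'D' and 40 for 'R'; the rest stay at 10.
sizemap = {'I': 10, 'F': 10, 'DR': 10, 'W': 10, 'D': 5, 'R': 40}

# scatter treats s as marker area (points squared), so the visible
# diameter scales with the square root of s.
ratio = math.sqrt(sizemap['R'] / sizemap['D'])
print(f"R markers are about {ratio:.2f} times as wide as D markers")
```

So an eightfold difference in `s` shows up as markers a little under three times as wide, which is usually enough to tell the two parties apart without the larger points hiding the smaller ones.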

Now run the code again to replot the data set.

In the cell below, comment on the variation that you now see in the data set. Are there differences or similarities based on delivery (written or speech) and/or party affiliation?