# Lede Algorithms -- Assignment 2

In this assignment you will use all your text analysis skills to analyze the U.S. State of the Union speeches in the 20th century. 

First, load `state-of-the-union.csv`. This is is a standard CSV file with one speech per row. There are two columns: the year of the speech, and the text of the speech. 

In [1]:
# Some stuff you'll need
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import math

In [2]:
# load 'state-of-the-union.csv'
speeches=pd.read_csv('state-of-the-union.csv')
speeches.head()

Unnamed: 0,year,text
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...


We will work with ony those speeches in the 20th century, so start by filtering out only the rows with a year between 1900 and 1999

In [3]:
state_speeches=speeches[(speeches['year']>=(1900))&(speeches['year']<=(1999))]
state_speeches.reset_index(inplace=True)

In [4]:
#data cleaning
state_speeches= state_speeches.drop(columns=['index'])

In [5]:
state_speeches.head()

Unnamed: 0,year,text
0,1900,\nState of the Union Address\nWilliam McKinley...
1,1901,\nState of the Union Address\nTheodore Rooseve...
2,1902,\nState of the Union Address\nTheodore Rooseve...
3,1903,\nState of the Union Address\nTheodore Rooseve...
4,1905,\nState of the Union Address\nTheodore Rooseve...


The first step in your analysis task will be to tokenize each document in this set and create a dataframe of tf-idf vectors. We're going to need to tokenize first, so write (or cut and paste!) a tokenizer function that takes a string and returns a list of standardized tokens.

In [6]:
def tokenize(s):
    blob = TextBlob(s.lower()) 
    words = [token for token in blob.words if len(token)>2]  
    return words

In [7]:
for documents in state_speeches['text']:
    tokenize(documents)
    
documents

'\nState of the Union Address\nWilliam J. Clinton\nJanuary 19, 1999\n\nMr. Speaker, Mr. Vice President, members of Congress, honored guests, my\nfellow Americans:\n\nTonight I have the honor of reporting to you on the State of the Union.\n\nLet me begin by saluting the new speaker of the House and thanking him\nespecially tonight for extending an invitation to two guests sitting in the\ngallery with Mrs. Hastert. Lyn Gibson and Wei Ling Chestnut are the widows\nof the two brave Capitol Hill police officers who gave their lives to\ndefend freedom\'s house.\n\nMr. Speaker, at your swearing in you asked us all to work together in a\nspirit of civility and bipartisanship. Mr. Speaker, let\'s do exactly that.\n\nTonight, I stand before you to report that America has created the longest\npeacetime economic expansion in our history. With nearly 18 million new\njobs, wages rising at more than twice the rate of inflation, the highest\nhomeownership in history, the smallest welfare roles in 30 y

Good stuff. Now use this to create a matrix of tf-idf vectors for the document set.

In [8]:
#New Count Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)

matrix = vectorizer.fit_transform(state_speeches.text)

# The easiest way to see what happenned is to make a dataframe
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
tfidf.head()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005191,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010902,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018962,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.014207,0.0,0.0,0.0


You're probably going to want a way to print out the most highly weighted terms this as well, so we'll use print_sorted_vector from the lesson notebook:

In [9]:
def print_sorted_vector(v):
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:20]
    print('\n'.join([str(x) for x in sorted_list]))

Print out a few of the State of The Union vectors for individual speeches to get a sense of what's happening here.

In [10]:
print_sorted_vector(tfidf.iloc[10])

('government', 0.21073172220930012)
('1911', 0.18110916439111122)
('united', 0.18068613565416997)
('estimates', 0.180545369401533)
('states', 0.17717755457565965)
('department', 0.14153778659459598)
('american', 0.13922289923788575)
('commercial', 0.13037904085769578)
('1910', 0.12648224960815058)
('diplomatic', 0.11972148923645383)
('china', 0.11726892616703778)
('international', 0.10552529175491292)
('tariff', 0.10351825927399781)
('year', 0.1023697788513866)
('1912', 0.10139847293001968)
('convention', 0.09634202703142629)
('loan', 0.09489173766783132)
('countries', 0.09213279577764368)
('court', 0.09089104280877965)
('arbitration', 0.08670782432828367)


Now sum the vectors for each decade, and print out the results. Do you see any themes? Can you connect the terms to major historical events? (wars, the great depression, assassinations, the civil rights movement, Watergate…)

for decade in range(1900, 1990, 10):
    print('Decade between: '+ str(decade) + 'and' + str(decade+10))
    state_speeches=speeches[(speeches['year']>=(decade))&(speeches['year']<=(decade+10))]
    vectorizer = CountVectorizer(stop_words='english', tokenizer=tokenize)
    matrix = vectorizer.fit_transform(state_speeches.text)
    tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
    print_sorted_vector(tfidf.iloc[7])

In [11]:
decades = []

for decade in range(1900,2000,10):
    decades.append(tfidf[(speeches['year']>=decade)&(speeches['year']<=decade+10)].sum())

  after removing the cwd from sys.path.


In [12]:
decades

['70              0.0
 '76              0.0
 '82              0.0
 '86              0.0
 '89              0.0
 '90              0.0
 'follow          0.0
 'forties         0.0
 'll              0.0
 're              0.0
 'round           0.0
 'the             0.0
 'thirties        0.0
 'til             0.0
 'to              0.0
 'trusts          0.0
 'twenties        0.0
 've              0.0
 000              0.0
 000,000          0.0
 0111             0.0
 031              0.0
 1,000            0.0
 1,000,000        0.0
 1,000,000,000    0.0
 1,005            0.0
 1,010            0.0
 1,022,000,000    0.0
 1,026,000        0.0
 1,034,000,000    0.0
                 ... 
 you're           0.0
 you've           0.0
 young            0.0
 younger          0.0
 youngest         0.0
 youngsters       0.0
 youth            0.0
 youthful         0.0
 yugoslavia       0.0
 yukon            0.0
 zablocki-and     0.0
 zarfos           0.0
 zeal             0.0
 zealand          0.0
 zealous  

Which two decades are most similar, according to the cosine similarity of their average vectors? You will need to use a double loop that compares every pair of decades and finds the pair with the smallest distance.

In [13]:
#Saved those summed vectors into a list,  then put that list into a new dataframe. 
sou = pd.DataFrame(decades)
sou.head()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
#Now I have to write a function that will show the Document similarity to compare different rows 
#in my new dataframe, each of which was the summed vector for a decade
def doc2vec_count(s):
    tokens = tokenize(s)
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

In [15]:
state_speeches.text.head(3)

0    \nState of the Union Address\nWilliam McKinley...
1    \nState of the Union Address\nTheodore Rooseve...
2    \nState of the Union Address\nTheodore Rooseve...
Name: text, dtype: object

In [21]:
def doc_distance(a_vec,b_vec):
    a_vec = a_vec/math.sqrt(sum([x*x for x in a_vec]))
    b_vec = b_vec/math.sqrt(sum([x*x for x in b_vec]))
    similarity = a_vec.dot(b_vec)
    return 1-similarity
def dij(i,j):
    
    return doc_distance(sou.iloc[i], sou.iloc[j])

In [22]:
smallest_value=100

for i in range(len(sou)):
    for j in range(i+1,10):
        if dij(i,j) < smallest_value:
            smallest_value = dij(i,j)
        smallest_index = (i,j)
print (smallest_index, smallest_value)

(8, 9) 100


# Write a 500 word (max) article on what U.S. presidents discussed in their SOTU speeches in the 20th century. You should obviously use your tf-idf analysis as a primary source *but* you will not be able to complete this without actually reading some of the speeches, and comparing them to other historical references.

Turn in this notebook, with your article below.
    

Women were there, but no president took care to mention them. 

During the so-called Progressive Era, the period between 1900 and 1920, women were nowhere to be found in the constitutionally required address to Congress by any president of the United States. Alas, the word "women" was barely mentioned at all during the rest of the century. 

Even though significant changes were taking place in the country, the 19th amendment granted women the right to vote included, it seems like the chiefs of state prioritized other issues and events. The beginning of the epoch was characterized by the drive to effect reforms and proposals to radically alter the government and its relationship to the people it served. During that time, government, law, public, and country were the most used themes in the State of the Union addresses.

In the 1920's, Presidents’ speeches spoke about reestablishing and strengthen trust in the State's institution. There was a palpable level of corruption throughout the federal government and the US Senate. However, "corruption" was not mentioned. 

It was in 1913 that president Woodrow Wilson decided to give the State of the Union Address himself. It was well received by the American people and only few presidents after him reverted to the 19th century tradition of sending the discourse. Almost a hundred years before, Thomas Jefferson started the tradition of not personally delivering the constitutionally required speeches before Congress.

