##text analysis 

In this assignment I  will use all your text analysis skills to analyze the U.S. State of the Union speeches in the 20th century. 

First, load `state-of-the-union.csv`. This is is a standard CSV file with one speech per row. There are two columns: the year of the speech, and the text of the speech. 

In [1]:
# Some stuff I 'll need
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
import math 
import numpy as np
from nltk import WordNetLemmatizer
from nltk.stem import SnowballStemmer

In [2]:
# load 'state-of-the-union.csv'
S = pd.read_csv('state-of-the-union.csv')
S.head(120)

Unnamed: 0,year,text
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...
5,1794,\nState of the Union Address\nGeorge Washingto...
6,1795,\nState of the Union Address\nGeorge Washingto...
7,1796,\nState of the Union Address\nGeorge Washingto...
8,1797,\nState of the Union Address\nJohn Adams\nNove...
9,1798,\nState of the Union Address\nJohn Adams\nDece...


We will work with ony those speeches in the 20th century, so start by filtering out only the rows with a year between 1900 and 1999

In [3]:
W = S.iloc[111:212].reset_index()
W

Unnamed: 0,index,year,text
0,111,1900,\nState of the Union Address\nWilliam McKinley...
1,112,1901,\nState of the Union Address\nTheodore Rooseve...
2,113,1902,\nState of the Union Address\nTheodore Rooseve...
3,114,1903,\nState of the Union Address\nTheodore Rooseve...
4,115,1905,\nState of the Union Address\nTheodore Rooseve...
5,116,1905,\nState of the Union Address\nTheodore Rooseve...
6,117,1906,\nState of the Union Address\nTheodore Rooseve...
7,118,1907,\nState of the Union Address\nTheodore Rooseve...
8,119,1908,\nState of the Union Address\nTheodore Rooseve...
9,120,1909,\nState of the Union Address\nWilliam H. Taft\...


In [4]:
W.text[0].split('\n')

['',
 'State of the Union Address',
 'William McKinley',
 'December 3, 1900',
 '',
 'To the Senate and House of Representatives:',
 '',
 'At the outgoing of the old and the incoming of the new century you begin',
 'the last session of the Fifty-sixth Congress with evidences on every hand',
 'of individual and national prosperity and with proof of the growing',
 'strength and increasing power for good of Republican institutions. Your',
 'countrymen will join with you in felicitation that American liberty is more',
 'firmly established than ever before, and that love for it and the',
 'determination to preserve it are more universal than at any former period',
 'of our history.',
 '',
 'The Republic was never so strong, because never so strongly entrenched in',
 'the hearts of the people as now. The Constitution, with few amendments,',
 'exists as it left the hands of its authors. The additions which have been',
 'made to it proclaim larger freedom and more extended citizenship. Popular'

In [None]:
lemmatized_words = [x.lemmatize() for x in TextBlob(W.text[0]).words]
lemmatized_words

The first step in your analysis task will be to tokenize each document in this set and create a dataframe of tf-idf vectors. We're going to need to tokenize first, so write (or cut and paste!) a tokenizer function that takes a string and returns a list of standardized tokens.

In [8]:
def tokenize(s):
    blob = TextBlob(s.lower())
    words = [token for token in blob.words if len(token)>2]
    return words

In [9]:
def doc2vec_count(s):
    tokens = tokenize(s)
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

Good stuff. Now use this to create a matrix of tf-idf vectors for the document set.

In [10]:
# tfidf = something
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)

matrix = vectorizer.fit_transform(W.text)

# The easiest way to see what happenned is to make a dataframe
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
tfidf.tail()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09711,0.203049,...,0.0,0.0,0.0,0.0,0.0,0.016243,0.0,0.0,0.0,0.0
97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046757,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012439,0.0,0.0
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01139,0.056952,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012121,0.0,0.0
99,0.0,0.0,0.0,0.0,0.0,0.0,0.019749,0.0,0.075136,0.150272,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011422,0.0,0.0
100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08045,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010701,0.0,0.0


You're probably going to want a way to print out the most highly weighted terms this as well, so we'll use print_sorted_vector from the lesson notebook:

In [11]:
def print_sorted_vector(v):
    # this "lambda" thing is an anonymous function, google me to unluck bonus coding knowledge
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:20]
    print('\n'.join([str(x) for x in sorted_list]))

Print out a few of the State of The Union vectors for individual speeches to get a sense of what's happening here.

In [12]:
print_sorted_vector(tfidf.iloc[3])

('isthmus', 0.32268228104755603)
('panama', 0.24044097196069691)
('colombia', 0.20158251161468396)
('states', 0.1668534661246448)
('treaty', 0.1612442055652871)
('united', 0.1502690521124975)
('canal', 0.14873667275399705)
('colombian', 0.1389298521409286)
('government', 0.1312966409886614)
('revolution', 0.1020056888525717)
('congress', 0.0996786940484891)
('granada', 0.08827468158646427)
('territory', 0.08806002780461358)
('year', 0.08789105584244485)
('1903', 0.08769490823091025)
('property', 0.08390750330350058)
('public', 0.08355556383190199)
('commerce', 0.08262357270839105)
('riot', 0.07480838192203848)
('department', 0.07465128921126148)


Now sum the vectors for each decade, and print out the results. Do you see any themes? Can you connect the terms to major historical events? (wars, the great depression, assassinations, the civil rights movement, Watergate…)

In [47]:
decades = []
for year in range(1900,2000,10):
    decade_year_sum = tfidf[(year <= W['year']) & (W['year'] <= year + 10)].sum()
    length = math.sqrt(sum([x*x for x in decade_year_sum]))
    decades_vet = decade_year_sum / length 
    decades.append(decades_vet)
    

In [48]:
decades = pd.DataFrame(decades)

In [49]:
decades

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.007801,0.000689,0.001447,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.005094,0.0,0.0,0.0,0.0,0.0,0.010039,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.003454,0.0,0.003013,0.001266,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007507,0.0,0.0,...,0.0,0.0,0.0,0.0,0.004507,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00627,0.0,0.0,...,0.0,0.0,0.0,0.0,0.004507,0.0,0.000872,0.000426,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.002489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002105,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.005261,0.005261,0.0,0.0,0.0,0.0,0.0,0.0,0.014746,0.0,...,0.004157,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.000661,0.003828,0.009031,0.003615,0.0,0.0,0.074056,0.16711,...,0.0,0.0,0.001322,0.000661,0.0,0.0,0.004894,0.016161,0.0,0.003785
9,0.0,0.0,0.0,0.0,0.009158,0.0,0.002732,0.0,0.067536,0.153547,...,0.0,0.0,0.0,0.0,0.0,0.002247,0.002823,0.015044,0.0,0.0


In [53]:
#to see the first decade,1910s.
print_sorted_vector(decades.iloc[1])

('government', 0.19908051270506308)
('congress', 0.1443608407850869)
('shall', 0.13554139456874822)
('great', 0.13497272484508469)
('states', 0.12949624429430415)
('war', 0.12699275176613226)
('country', 0.1266529917392772)
('united', 0.10887843630172067)
('men', 0.10666575285720918)
('present', 0.10407976042649078)
('people', 0.10375603169363752)
('world', 0.09809137622342767)
('necessary', 0.09536428161588009)
('time', 0.0944946198241971)
('make', 0.09096819062627874)
('law', 0.08787486894526979)
('american', 0.08714667036615828)
('peace', 0.08422112131863582)
('commerce', 0.08372052104196002)
('year', 0.07991345140150717)


In [54]:
# 1940s
print_sorted_vector(decades.iloc[4])

('war', 0.2690012359018265)
('world', 0.21291008049003537)
('people', 0.13499478274053048)
('congress', 0.1324987955056536)
('nations', 0.12628282926559323)
('government', 0.12259931319805573)
('production', 0.12072237140573214)
('national', 0.12009151835565528)
('nation', 0.11875756490872132)
('united', 0.11551557323027101)
('peace', 0.11011818965515321)
('great', 0.1072463033012767)
('year', 0.10700956096124681)
('economic', 0.1002746838024043)
('program', 0.09364371718706141)
('shall', 0.0915633723579339)
('american', 0.08467844677838987)
('time', 0.08388599974929697)
('new', 0.08367700381611703)
('security', 0.08256423236594566)


In [55]:
#1950
print_sorted_vector(decades.iloc[5])

('world', 0.2374287815533759)
('free', 0.16766965951925122)
('nations', 0.1464153166469469)
('people', 0.135455645230432)
('congress', 0.1338576278658556)
('government', 0.13333601747588394)
('economic', 0.12934291923626257)
('military', 0.1225998241439423)
('defense', 0.11743544334281862)
('year', 0.11304834899705381)
('security', 0.11218804750599864)
('peace', 0.111505114977177)
('program', 0.11126905997073924)
('shall', 0.10898249743990725)
('strength', 0.1082550511913779)
('new', 0.10787032149159408)
('freedom', 0.10200223810454569)
('federal', 0.1005485665855903)
('nation', 0.09440053360962569)
('communist', 0.09118108696769708)


In [56]:
#1970s
print_sorted_vector(decades.iloc[7])

('world', 0.20737390050613808)
('new', 0.19764312479547014)
('people', 0.19693818717123593)
('america', 0.1890197604578252)
('government', 0.16923424547697408)
('congress', 0.16820096805848506)
('year', 0.14843464425489356)
('years', 0.14512675053599877)
('american', 0.13748239952671168)
('nation', 0.1254209591675926)
('federal', 0.12254621934479018)
('peace', 0.1206623163085216)
('energy', 0.11554562890359628)
('americans', 0.10653412080344565)
('programs', 0.10233572050058687)
('inflation', 0.10194398283229193)
('states', 0.09127297527773406)
('great', 0.09117066985542382)
('union', 0.08882803044289582)
('oil', 0.08602594056814264)


In [57]:
#1980s
print_sorted_vector(decades.iloc[8])

('america', 0.22479257096753813)
("'ve", 0.19897015090868875)
('tonight', 0.170840060109173)
("'re", 0.16711002370986575)
('people', 0.14342074207650063)
('world', 0.13864146341010788)
('let', 0.12701718448916102)
('new', 0.12144121594567085)
('soviet', 0.1213740320333906)
("n't", 0.12069809095872185)
('government', 0.11802835006320174)
('american', 0.1136746613666042)
('years', 0.11218738925343287)
('year', 0.10887235434821538)
('budget', 0.10602962002841851)
('freedom', 0.1028344173386279)
('congress', 0.09699848804627718)
('time', 0.09358147104650073)
('future', 0.09236135171448262)
('peace', 0.09176449173294648)


In [58]:
#1990s
print_sorted_vector(decades.iloc[9]) 

('people', 0.20462353985996382)
('america', 0.17921905443485922)
('tonight', 0.17389543012937514)
("n't", 0.17265570245183942)
('new', 0.15838478389044908)
("'re", 0.1535467203780513)
('world', 0.14142316303842692)
('children', 0.13252658640459836)
('american', 0.13225753952770936)
('work', 0.1273028532934855)
('know', 0.12725321399418796)
('americans', 0.12391897624833517)
('year', 0.12043792064326989)
('let', 0.11526906067013634)
("'ve", 0.11470161186467857)
('years', 0.10807562954709918)
('jobs', 0.1029427882046634)
('make', 0.09989026442949378)
('congress', 0.09752068778436071)
('parents', 0.09746729167109876)


Which two decades are most similar, according to the cosine similarity of their average vectors? You will need to use a double loop that compares every pair of decades and finds the pair with the smallest distance.

In [59]:
def doc_distance(a_vec,b_vec):
  
    similarity = a_vec.dot(b_vec)
 
    return 1-similarity

def dij(i,j):
    return doc_distance(decades.iloc[i], decades.iloc[j])


In [60]:
smallest_number = 1

for i in range(len(decades)):
    for j in range(i+1,len(decades)):
        if dij(i, j) < smallest_number:
            smallest_number = dij(i, j)
            smallest_index = (i,j)
            
print(smallest_index,smallest_number)

(8, 9) 0.1948297913368341


In [52]:
dij(8,9)

0.5521829552689532

In [157]:
The_1970s = W.iloc[70:80].text
The_1970s

70    \nState of the Union Address\nLyndon B. Johnso...
71    \nState of the Union Address\nRichard Nixon\nJ...
72    \nState of the Union Address\nRichard Nixon\nJ...
73    \nState of the Union Address\nRichard Nixon\nJ...
74    \nState of the Union Address\nRichard Nixon\nF...
75    \nState of the Union Address\nRichard Nixon\nJ...
76    \nState of the Union Address\nGerald R. Ford\n...
77    \nState of the Union Address\nGerald R. Ford\n...
78    \nState of the Union Address\nGerald R. Ford\n...
79    \nState of the Union Address\nJimmy Carter\nJa...
Name: text, dtype: object

In [158]:
W.iloc[75].text.split('\n')

['',
 'State of the Union Address',
 'Richard Nixon',
 'January 30, 1974',
 '',
 'Mr. Speaker, Mr. President, my colleagues in the Congress, our',
 'distinguished guests, my fellow Americans:',
 '',
 'We meet here tonight at a time of great challenge and great opportunities',
 'for America. We meet at a time when we face great problems at home and',
 'abroad that will test the strength of our fiber as a nation. But we also',
 'meet at a time when that fiber has been tested, and it has proved strong.',
 '',
 'America is a great and good land, and we are a great and good land because',
 'we are a strong, free, creative people and because America is the single',
 'greatest force for peace anywhere in the world. Today, as always in our',
 'history, we can base our confidence in what the American people will',
 'achieve in the future on the record of what the American people have',
 'achieved in the past.',
 '',
 'Tonight, for the first time in 12 years, a President of the United States',
 

In [168]:
The_1950s = W.iloc[50:60].text
The_1950s

50    \nState of the Union Address\nHarry S. Truman\...
51    \nState of the Union Address\nHarry S. Truman\...
52    \nState of the Union Address\nHarry S. Truman\...
53    \nState of the Union Address\nDwight D. Eisenh...
54    \nState of the Union Address\nDwight D. Eisenh...
55    \nState of the Union Address\nDwight D. Eisenh...
56    \nState of the Union Address\nDwight D. Eisenh...
57    \nState of the Union Address\nDwight D. Eisenh...
58    \nState of the Union Address\nDwight D. Eisenh...
59    \nState of the Union Address\nDwight D. Eisenh...
Name: text, dtype: object

In [169]:
W.iloc[55].text.split('\n')

['',
 'State of the Union Address',
 'Dwight D. Eisenhower',
 'January 6, 1955',
 '',
 'Mr. President, Mr. Speaker, Members of the Congress:',
 '',
 'First, I extend cordial greetings to the 84th Congress. We shall have much',
 'to do together; I am sure that we shall get it done--and, that we shall do',
 'it in harmony and good will.',
 '',
 'At the outset, I believe it would be well to remind ourselves of this great',
 'fundamental in our national life: our common belief that every human being',
 'is divinely endowed with dignity and worth and inalienable rights. This',
 'faith, with its corollary--that to grow and flourish people must be',
 'free--shapes the interests and aspirations of every American.',
 '',
 'From this deep faith have evolved three main purposes of our Federal',
 'Government:',
 '',
 'First, to maintain justice and freedom among ourselves and to champion them',
 'for others so that we may work effectively for enduring peace;',
 '',
 'Second, to help keep our econo

Write a 500 word (max) article on what U.S. presidents discussed in their SOTU speeches in the 20th century. You should obviously use your tf-idf analysis as a primary source *but* you will no be able to complete this without actually reading some of the speeches, and comparing them to other historical references.

Turn in this notebook, with your article below.
    

(your SOTU article here)