# Fingerprinting Presidential Speeches

The goal of this project is to create a model that can predict which president gave a speech, based solely on a transcript of that speech.

To begin, we import all the modules and functions needed in this notebook. Most of these have been used before in this class, except for `import_ipynb`, which allows us to import functions from other Jupyter notebooks. If any of these imports raises an error, you may need to run `pip install [name]` from your command line.

In [1]:
import pandas as pd
import import_ipynb
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.decomposition import PCA
import nltk

These are nltk packages required for the code to work.

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('twitter_samples')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

Finally, we import the functions we wrote in other notebooks that are used here. 

In [2]:
# Function Imports
from word_frequency import word_frequency
from average_named_years import named_years
from year_from_wars import year_from_wars
from Sentiment_Analysis import positivity_score, build_sentiment_model
from Some_functions_Copy import mreplace, sentlen, wordlen, avesylls, SWprop, readlvl, sentcount, wordcount
from word_pca_Copy import PCAphrases

importing Jupyter notebook from word_frequency.ipynb
importing Jupyter notebook from average_named_years.ipynb
importing Jupyter notebook from year_from_wars.ipynb
importing Jupyter notebook from Sentiment_Analysis.ipynb
importing Jupyter notebook from Some_functions_Copy.ipynb


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jrnoo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


importing Jupyter notebook from word_pca_Copy.ipynb


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jrnoo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here is what each function imported above does (and where you can find more details, such as the function's definition):

`word_frequency` takes in a data frame and a number $n$. The data frame must have a column called "Transcript". It returns a series giving, for each transcript, the fraction of that transcript made up of its $n$ most common words (optionally after removing stop words, via the `remove_stopwords` argument). It is found in the notebook `word_frequency.ipynb`.
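
To make the idea concrete, here is a minimal sketch of this kind of statistic for a single transcript. It is not the implementation in `word_frequency.ipynb`; the function name `top_n_share` and the simple regex tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def top_n_share(text, n):
    """Fraction of the words in `text` accounted for by its n most common words.

    Hypothetical helper for illustration only; the notebook's own
    word_frequency may tokenize and filter words differently.
    """
    # Lowercase and treat runs of letters/apostrophes as words (an assumption).
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return float("nan")
    counts = Counter(words)
    top = sum(count for _, count in counts.most_common(n))
    return top / len(words)

print(top_n_share("the people and the congress and the nation", 2))
```

A speech that leans heavily on a handful of words scores close to 1, while a speech with a varied vocabulary scores lower.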

`named_years` also takes in a data frame with a "Transcript" column. It returns a series of the average year mentioned in each speech, scaled by dividing by 2020. If no years are mentioned in a speech, a NaN is put in that speech's row. It can be found in `average_named_years.ipynb`.

`year_from_wars` is similar - it returns an average year based on the wars mentioned in the speech. In this case, it is a scaled version of the start date of all wars mentioned in the speech. It is found in `year_from_wars.ipynb`.

`positivity_score`, found in `Sentiment_Analysis.ipynb`, takes in the transcript of a single speech, as well as a classifier, which is generated by `build_sentiment_model`. It returns an estimate of how positive the sentiment expressed in the speech is.

The following functions are all found in `Some_functions.ipynb`. They are imported from `Some_functions_Copy.ipynb`, which is just a copy of `Some_functions.ipynb` with everything except the function definitions removed. This was done because `import_ipynb` imports a function by running the entire notebook, and the original notebook contains cells used to test the functions that take a long time to run.
- `mreplace` takes in a string of text and two lists of strings (which must be of equal length). It replaces every occurrence of a string in the first list with the corresponding string from the second list and returns the result. There is also an optional parameter that allows you to set a maximum number of times for each string to be replaced.
- `sentlen` takes in a series of strings and returns a series of the average sentence length of each string.
- `wordlen` takes in a series of strings and returns a series of the average word length of each string.
- `avesylls` takes in a series of strings and returns a series of the average number of syllables per word for each string.
- `SWprop` takes in a series of strings and returns a series of the proportion of stop words in each string.
- `readlvl` takes in a series of strings and returns a series of Flesch Reading Ease scores of each string. The Flesch Reading Ease score is a measure of how easy or confusing a text is to read.
- `sentcount` takes in a series of strings and returns a series of the number of sentences in each string.
- `wordcount` takes in a series of strings and returns a series of the number of words in each string.
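
For reference, the standard Flesch Reading Ease formula combines words per sentence and syllables per word, the same quantities that `sentlen` and `avesylls` are described as measuring. The sketch below applies the formula to raw counts; the function name is hypothetical, and how `readlvl` actually counts sentences and syllables is not shown here.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease score from raw counts.

    Higher scores indicate text that is easier to read.
    """
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# 100 words in 5 sentences with 150 syllables in total.
print(round(flesch_reading_ease(100, 5, 150), 3))
```

Longer sentences and longer words both lower the score, so formal 18th-century prose would be expected to score lower than short conversational remarks.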

`PCAphrases` takes in a data frame that includes only a "President" and a "Transcript" column, as well as a number $n$ (the length of each phrase) and the number of features you want. It creates a list of all $n$-word phrases (also known as $n$-grams) in any of the transcripts, and then finds how many times each phrase appears in each transcript. This creates hundreds of features, which are reduced to the desired number of features via principal component analysis. The output is a data frame of the resulting PCA features. The function can be found in the `word_pca.ipynb` notebook (it is imported from a copy of said notebook with everything except the function definitions removed, for the same reason as the functions from `Some_functions.ipynb`). Because of this function's long run time, we were unable to use it for phrases of 2 or more words. In other words, we only used it to find PCA features based on the frequency of individual words, rather than the frequency of $n$-grams with $n \geq 2$.
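
The general technique (count $n$-grams per transcript, then compress the counts with PCA) can be sketched with scikit-learn as follows. This is not the code in `word_pca.ipynb`: the function name `ngram_pca_features`, the use of `CountVectorizer`, and the toy sentences are all assumptions for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

def ngram_pca_features(transcripts, n=1, num_features=2):
    """Count every n-word phrase per transcript, then compress with PCA.

    Hypothetical stand-in for PCAphrases; the real function may tokenize,
    filter, or scale the counts differently.
    """
    vectorizer = CountVectorizer(ngram_range=(n, n))
    # Rows are transcripts, columns are distinct n-grams.
    counts = vectorizer.fit_transform(transcripts).toarray()
    pca = PCA(n_components=num_features)
    return pca.fit_transform(counts)

speeches = [
    "fellow citizens of the senate",
    "thank you very much everybody",
    "fellow citizens of the house",
    "thank you thank you very much",
]
features = ngram_pca_features(speeches, n=1, num_features=2)
print(features.shape)
```

The number of distinct phrases grows quickly with $n$, which is consistent with the long run times reported for multi-word phrases.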

## Reading in Data File

In [5]:
orig_data = pd.read_csv("archive/presidential_speeches.csv")

names = ['George Washington', 'Donald Trump'] 
data = orig_data[orig_data['President'].isin(names)]
data = data.reset_index(drop = True)
data.tail()

Unnamed: 0,Date,President,Party,Speech Title,Summary,Transcript,URL
35,2019-01-19,Donald Trump,Republican,Remarks about the US Southern Border,President Donald Trump speaks about what he se...,"Just a short time ago, I had the honor of pres...",https://millercenter.org/the-presidency/presid...
36,2019-02-05,Donald Trump,Republican,State of the Union Address,"In his second State of the Union Address, Pres...","Madam Speaker, Mr. Vice President, Members of ...",https://millercenter.org/the-presidency/presid...
37,2019-02-15,Donald Trump,Republican,Speech Declaring a National Emergency,President Donald Trump declares a national eme...,"Thank you very much, everybody. Before we begi...",https://millercenter.org/the-presidency/presid...
38,2019-09-24,Donald Trump,Republican,Remarks at the United Nations General Assembly,President Donald Trump speaks to the 74th sess...,"Thank you very much. Mr. President, Mr. Secret...",https://millercenter.org/the-presidency/presid...
39,2019-09-25,Donald Trump,Republican,Press Conference,President Donald Trump holds a press conferenc...,"Thank you very much. Thank you. Well, thank yo...",https://millercenter.org/the-presidency/presid...


## Adding Features to DataFrame

Now we use the functions imported previously to add numerical features to our data frame.

In [6]:
data['Word Frequency'] = word_frequency(data, n = 10, remove_stopwords = True)
data['Named Years'] = named_years(data)
data['Years from Wars'] = year_from_wars(data)

In [7]:
classifier = build_sentiment_model()
#Since the positivity_score function only takes in one transcript and returns a single value, we make a list of 
#this function applied to every value in the series of transcripts.
data['Positivity Score'] = [positivity_score(x, classifier) for x in data['Transcript']]

In [8]:
data['Sentence Length'] = sentlen(data['Transcript'])
data['Word Length'] = wordlen(data['Transcript'])
data['Syllables per word'] = avesylls(data['Transcript'])
data['Stop Word Proportion'] = SWprop(data['Transcript'])
data['No. of Words'] = wordcount(data['Transcript'])
data['No. of Sentences'] = sentcount(data['Transcript'])

In [9]:
data['Reading Level'] = readlvl(data['Transcript'])

This next cell takes a while to run, usually in the range of 5 to 10 minutes, so don't worry if it doesn't finish right away.

In [10]:
data_PCA = data[['President', 'Transcript']] #PCA for single-word phrases (i.e. individual words)
PCAfeatures = PCAphrases(data_PCA, n = 1, numfeatures = 10)
PCAfeatures.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.548262,0.537188,0.379417,-0.092071,-0.000782,-0.281186,0.121612,0.580681,-0.498966,0.027098
1,-1.129736,-0.141785,0.00214,0.018123,0.521896,-0.55503,0.137725,-0.034745,-0.439306,-0.120725
2,-0.837219,0.302708,0.205634,0.066993,0.194485,-0.122272,-0.11562,0.215518,-0.170122,-0.178689
3,-0.495651,0.601764,0.272803,-0.045744,-0.016413,0.331705,-0.256244,-0.126577,0.063974,0.545111
4,-0.53713,0.239488,0.637453,0.467366,-0.632728,0.419387,0.952085,0.386188,0.954305,-1.859382


In [11]:
#Appending the PCA features to the original data frame
for i in range(10):
    data[f'Word PCA {i}'] = PCAfeatures[i]

Now that we have added all of the features to the data frame, this is what we are left with:

In [12]:
data

Unnamed: 0,Date,President,Party,Speech Title,Summary,Transcript,URL,Word Frequency,Named Years,Years from Wars,...,Word PCA 0,Word PCA 1,Word PCA 2,Word PCA 3,Word PCA 4,Word PCA 5,Word PCA 6,Word PCA 7,Word PCA 8,Word PCA 9
0,1789-04-30,George Washington,Unaffiliated,First Inaugural Address,Washington calls on Congress to avoid local an...,Fellow Citizens of the Senate and the House of...,https://millercenter.org/the-presidency/presid...,0.090491,,,...,-0.548262,0.537188,0.379417,-0.092071,-0.000782,-0.281186,0.121612,0.580681,-0.498966,0.027098
1,1789-10-03,George Washington,Unaffiliated,Thanksgiving Proclamation,"At the request of Congress, Washington establi...",Whereas it is the duty of all Nations to ackno...,https://millercenter.org/the-presidency/presid...,0.166667,,,...,-1.129736,-0.141785,0.00214,0.018123,0.521896,-0.55503,0.137725,-0.034745,-0.439306,-0.120725
2,1790-01-08,George Washington,Unaffiliated,First Annual Message to Congress,"In a wide ranging speech, President Washington...",Fellow Citizens of the Senate and House of Rep...,https://millercenter.org/the-presidency/presid...,0.088608,,,...,-0.837219,0.302708,0.205634,0.066993,0.194485,-0.122272,-0.11562,0.215518,-0.170122,-0.178689
3,1790-12-08,George Washington,Unaffiliated,Second Annual Message to Congress,Washington focuses on commerce in his second a...,Fellow citizens of the Senate and House of Rep...,https://millercenter.org/the-presidency/presid...,0.087025,,,...,-0.495651,0.601764,0.272803,-0.045744,-0.016413,0.331705,-0.256244,-0.126577,0.063974,0.545111
4,1790-12-29,George Washington,Unaffiliated,Talk to the Chiefs and Counselors of the Senec...,The President reassures the Seneca Nation that...,"I the President of the United States, by my ow...",https://millercenter.org/the-presidency/presid...,0.187107,,,...,-0.53713,0.239488,0.637453,0.467366,-0.632728,0.419387,0.952085,0.386188,0.954305,-1.859382
5,1791-10-25,George Washington,Unaffiliated,Third Annual Message to Congress,Washington praises the success of the new bank...,"I meet you, upon the present occasion, with th...",https://millercenter.org/the-presidency/presid...,0.101808,,,...,0.227922,1.44964,0.53258,-0.081933,-1.194568,1.272612,0.55696,-0.206723,0.363695,0.463402
6,1792-04-05,George Washington,Unaffiliated,Veto Message on Congressional Redistricting,President Washington returns a congressional r...,Gentlemen of the House of Representatives: I h...,https://millercenter.org/the-presidency/presid...,0.4,,,...,-1.328691,-0.08603,0.144293,-0.201699,0.214338,-0.281198,-0.041421,0.420639,-0.227211,-0.047755
7,1792-11-06,George Washington,Unaffiliated,Fourth Annual Message to Congress,,"Fellow Citizens of the Senate, and of the Hous...",https://millercenter.org/the-presidency/presid...,0.07967,,,...,0.179989,1.294261,0.430948,-0.20607,-0.609417,0.785066,0.146704,0.248295,0.256458,0.383606
8,1792-12-12,George Washington,Unaffiliated,Proclamation Against Crimes Against the Cherok...,Offering a reward for the capture of American ...,"Whereas I have received authentic information,...",https://millercenter.org/the-presidency/presid...,0.212121,,,...,-1.310242,-0.109463,0.145677,-0.200503,0.255762,-0.381351,0.028166,0.331568,-0.259182,-0.039684
9,1793-03-04,George Washington,Unaffiliated,Second Inaugural Address,"In a simple, brief speech, Washington expresse...",Fellow Citizens: I am again called upon by the...,https://millercenter.org/the-presidency/presid...,0.213115,,,...,-1.35381,-0.098117,0.184436,-0.222127,0.281323,-0.382783,-0.189949,0.43021,-0.272515,0.078999


This cell will save the data frame as a csv. That way, after we have saved it the first time, we won't have to re-run all of the cells involved in creating the data frame every time we start a new kernel.

In [13]:
data.to_csv('features_df.csv')

# Machine Learning using Dataframe 

In [14]:
data = pd.read_csv('features_df.csv',index_col=0) # Reloads the data frame from the csv
data.head() 

Unnamed: 0,Date,President,Party,Speech Title,Summary,Transcript,URL,Word Frequency,Named Years,Years from Wars,...,Word PCA 0,Word PCA 1,Word PCA 2,Word PCA 3,Word PCA 4,Word PCA 5,Word PCA 6,Word PCA 7,Word PCA 8,Word PCA 9
0,1789-04-30,George Washington,Unaffiliated,First Inaugural Address,Washington calls on Congress to avoid local an...,Fellow Citizens of the Senate and the House of...,https://millercenter.org/the-presidency/presid...,0.090491,,,...,-0.548262,0.537188,0.379417,-0.092071,-0.000782,-0.281186,0.121612,0.580681,-0.498966,0.027098
1,1789-10-03,George Washington,Unaffiliated,Thanksgiving Proclamation,"At the request of Congress, Washington establi...",Whereas it is the duty of all Nations to ackno...,https://millercenter.org/the-presidency/presid...,0.166667,,,...,-1.129736,-0.141785,0.00214,0.018123,0.521896,-0.55503,0.137725,-0.034745,-0.439306,-0.120725
2,1790-01-08,George Washington,Unaffiliated,First Annual Message to Congress,"In a wide ranging speech, President Washington...",Fellow Citizens of the Senate and House of Rep...,https://millercenter.org/the-presidency/presid...,0.088608,,,...,-0.837219,0.302708,0.205634,0.066993,0.194485,-0.122272,-0.11562,0.215518,-0.170122,-0.178689
3,1790-12-08,George Washington,Unaffiliated,Second Annual Message to Congress,Washington focuses on commerce in his second a...,Fellow citizens of the Senate and House of Rep...,https://millercenter.org/the-presidency/presid...,0.087025,,,...,-0.495651,0.601764,0.272803,-0.045744,-0.016413,0.331705,-0.256244,-0.126577,0.063974,0.545111
4,1790-12-29,George Washington,Unaffiliated,Talk to the Chiefs and Counselors of the Senec...,The President reassures the Seneca Nation that...,"I the President of the United States, by my ow...",https://millercenter.org/the-presidency/presid...,0.187107,,,...,-0.53713,0.239488,0.637453,0.467366,-0.632728,0.419387,0.952085,0.386188,0.954305,-1.859382


## SVM

### Imputing the Data

To make things simpler, we delete all non-numerical columns and change the Presidential labels to numbers.

These functions will be used to reclassify some columns of the dataframe.

In [15]:
# Numerically classify presidents
def reclassify_labels(data):
    presidents_to_classify = pd.unique(data['President'])
    
    # Map each distinct label to an integer in order of first appearance. The exact
    # value for a president depends on how many presidents we are trying to classify.
    codes = {name: p for p, name in enumerate(presidents_to_classify)}
    # Assign the whole column at once rather than cell by cell with chained indexing.
    data['President'] = data['President'].map(codes)
    return data

In [16]:
# Scale each feature by dividing by its column maximum, so the largest value in each
# column becomes 1 (assuming that maximum is positive).
def normalize(data):
    
    for column in data:
        if column != 'President': # We don't want to rescale the presidential labels.
            data[column] = data[column]/(data[column].max())
    
    return data
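
Dividing by the column maximum maps the largest value to 1 but does not bound negative values at -1. The toy frame below (made-up numbers) shows this, alongside a variant that divides by the largest absolute value and therefore keeps every value in [-1, 1]. The name `scale_by_max_abs` is hypothetical, and the variant is offered only as an alternative, not as what this notebook uses.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [-4.0, 1.0, 2.0]})

# Same rule as normalize(): divide each column by its maximum.
by_max = toy / toy.max()

def scale_by_max_abs(df):
    """Divide each column by its largest absolute value (hypothetical helper)."""
    return df / df.abs().max()

by_max_abs = scale_by_max_abs(toy)
print(by_max["b"].tolist())      # [-2.0, 0.5, 1.0]
print(by_max_abs["b"].tolist())  # [-1.0, 0.25, 0.5]
```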

In [17]:
pres_data = data.drop(columns = ['Date', 'Party', 'Speech Title', 'Summary', 'Transcript', 'URL'])

Replace any missing data with the column means.

In [18]:
pres_data['Named Years'] = pres_data['Named Years'].fillna(pres_data['Named Years'].mean())
pres_data['Years from Wars'] = pres_data['Years from Wars'].fillna(pres_data['Years from Wars'].mean())

Using the functions from above, we normalize the data and numerically classify the Presidents.

In [19]:
pres_data = normalize(pres_data)
pres_data = reclassify_labels(pres_data)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['President'][i] = p


### Separate the Features from the Labels

In [20]:
features = pres_data.drop(columns = 'President')
labels = pres_data['President'].astype(int)
features.head()

Unnamed: 0,Word Frequency,Named Years,Years from Wars,Positivity Score,Sentence Length,Word Length,Syllables per word,Stop Word Proportion,No. of Words,No. of Sentences,...,Word PCA 0,Word PCA 1,Word PCA 2,Word PCA 3,Word PCA 4,Word PCA 5,Word PCA 6,Word PCA 7,Word PCA 8,Word PCA 9
0,0.226227,0.995087,0.986486,0.916667,0.156414,0.949739,0.963421,0.974121,0.157686,0.030496,...,-0.19259,0.212659,0.139568,-0.022193,-0.000265,-0.079083,0.038194,0.233328,-0.166497,0.013991
1,0.416667,0.995087,0.986486,1.0,0.284686,0.912893,0.924069,0.945692,0.047834,0.005083,...,-0.396846,-0.056129,0.000787,0.004369,0.176848,-0.156101,0.043255,-0.013961,-0.146589,-0.06233
2,0.221519,0.995087,0.986486,0.809524,0.105709,0.985286,1.0,0.955462,0.093248,0.026684,...,-0.294092,0.119835,0.075642,0.016148,0.065902,-0.034389,-0.036313,0.086599,-0.056767,-0.092257
3,0.217563,0.995087,0.986486,0.725,0.091688,0.940159,0.928386,0.981746,0.154058,0.050826,...,-0.174109,0.238224,0.10035,-0.011026,-0.005562,0.093292,-0.080478,-0.050861,0.021347,0.28144
4,0.467767,0.995087,0.986486,0.72,0.076462,0.867136,0.839041,0.975943,0.154168,0.060991,...,-0.188679,0.094807,0.234487,0.112655,-0.214404,0.117952,0.299019,0.155177,0.318436,-0.959997


### Split the Data into Test and Training Sets

The training set is 75% of the data; the remaining 25% is held out for testing.

In [21]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, train_size = 0.75, random_state = 0)
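
With only a few dozen speeches, a purely random split can leave one president under-represented in the test set. `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both parts. The sketch below uses synthetic features and labels rather than the real data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 40 samples, two balanced classes, 3 random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.array([0] * 20 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0, stratify=y
)
# Class counts in the training and test parts.
print(np.bincount(y_train), np.bincount(y_test))
```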

### Finding the Best Parameters for Model Using Grid Search

In [22]:
param_grid = {'kernel':('linear', 'rbf'),
              'C': [0.01, 0.1, 1, 10],
              'gamma': [1e-200, 1e-100, 1e-10, 1e-1]}
clf = GridSearchCV(SVC(), param_grid)

clf = clf.fit(train_features, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)

Best estimator found by grid search:
SVC(C=1, gamma=1e-200, kernel='linear')
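
Two details of this search are worth keeping in mind. The linear kernel ignores `gamma`, so the value reported in the best estimator has no effect, and for the RBF kernel values as small as 1e-200 make the kernel essentially constant. A sketch of a more conventional search is shown below. It assumes synthetic data and a parameter grid chosen purely for illustration, and it places scaling inside a `Pipeline` so that preprocessing is fit only on the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the speech features.
X, y = make_classification(n_samples=80, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0, stratify=y
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# A list of dicts lets gamma be searched only where it has an effect.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.01, 0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.01, 0.1, 1, 10],
     "svc__gamma": [1e-3, 1e-2, 1e-1, 1]},
]
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))
```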


### Judging the Accuracy of the Model

In [30]:
pred_labels = clf.predict(test_features) #Running the model on the testing data
# Getting the confusion matrix and classification report
conf_mat = confusion_matrix(test_labels, pred_labels)
class_rep = classification_report(test_labels, pred_labels)
#formatting the confusion matrix and classification report:
# First, we want to replace the numerical labels in the class rep with the names of presidents. The extra space at
#the end prevents actual numerical data from being replaced by a president's name.
preslist = ['Washington ','Trump '] 
# We pad each number with spaces (one fewer than the length of the president's name) so that the replaced text has
#the same width as the name and everything is still aligned properly
replacelist = [' '*(len(preslist[i])-2)+f'{i} ' for i in range(len(preslist))]
class_rep = mreplace(class_rep,replacelist,preslist,max=1) #replace all the substrings in replacelist with preslist
#The confusion matrix will be turned into a DF with a multi-index so that it is easy to tell what each row and col
#represent.
cols = pd.MultiIndex.from_product([['Predicted Speaker'],preslist])
rows = pd.MultiIndex.from_product([['Actual Speaker'],preslist])
print(class_rep)
pd.DataFrame(conf_mat,index=rows,columns=cols)

              precision    recall  f1-score   support

  Washington       1.00      0.83      0.91         6
       Trump       0.80      1.00      0.89         4

    accuracy                           0.90        10
   macro avg       0.90      0.92      0.90        10
weighted avg       0.92      0.90      0.90        10



Unnamed: 0_level_0,Unnamed: 1_level_0,Predicted Speaker,Predicted Speaker
Unnamed: 0_level_1,Unnamed: 1_level_1,Washington,Trump
Actual Speaker,Washington,5,1
Actual Speaker,Trump,0,4
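
The figures in the classification report can be recovered by hand from the confusion matrix: precision divides each diagonal entry by its column sum, and recall divides it by its row sum. The short sketch below uses the four counts shown above; the variable names are only illustrative.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted.
conf = [[5, 1],   # actual Washington
        [0, 4]]   # actual Trump
names = ["Washington", "Trump"]

total = sum(sum(row) for row in conf)
correct = sum(conf[i][i] for i in range(len(conf)))
accuracy = correct / total

metrics = {}
for i, name in enumerate(names):
    predicted_as_i = sum(row[i] for row in conf)  # column sum
    actually_i = sum(conf[i])                     # row sum
    metrics[name] = {
        "precision": conf[i][i] / predicted_as_i,
        "recall": conf[i][i] / actually_i,
    }

print(round(accuracy, 2))
for name, m in metrics.items():
    print(name, round(m["precision"], 2), round(m["recall"], 2))
```

With a test set of only 10 speeches, a single misclassified speech moves the accuracy by 10 percentage points.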


# Adding Another President to Model

## Reading in Data and Adding Features to Dataset

In [31]:
FDR_data = orig_data[orig_data['President'] == 'Franklin D. Roosevelt']
FDR_data = FDR_data.reset_index(drop = True)
FDR_data.tail()

Unnamed: 0,Date,President,Party,Speech Title,Summary,Transcript,URL
44,1944-06-12,Franklin D. Roosevelt,Democratic,Opening Fifth War Loan Drive,"Less than a week after D-Day, Roosevelt calls ...",Ladies and Gentlemen: All our fighting men ove...,https://millercenter.org/the-presidency/presid...
45,1944-07-20,Franklin D. Roosevelt,Democratic,Democratic National Convention,President Roosevelt accepts the Democratic Par...,I have already indicated to you why I accept t...,https://millercenter.org/the-presidency/presid...
46,1945-01-20,Franklin D. Roosevelt,Democratic,Fourth Inaugural Address,Franklin Delano Roosevelt makes a brief addres...,"Mr. Chief Justice, Mr. Vice President, my frie...",https://millercenter.org/the-presidency/presid...
47,1945-02-11,Franklin D. Roosevelt,Democratic,Joint Statement with Churchill and Stalin on t...,,THE DEFEAT OF GERMANY We have considered and d...,https://millercenter.org/the-presidency/presid...
48,1945-03-01,Franklin D. Roosevelt,Democratic,Address to Congress on Yalta,President Roosevelt reports on his meeting wit...,I hope that you will pardon me for this unusua...,https://millercenter.org/the-presidency/presid...


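We use the same functions as before so that the FDR dataset contains the same features as the original. Note that `PCAphrases` is fit on the FDR transcripts alone here, so its `Word PCA` columns are not guaranteed to share axes with those computed for Washington and Trump.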

In [32]:
FDR_data['Word Frequency'] = word_frequency(FDR_data, n = 10, remove_stopwords = True)
FDR_data['Named Years'] = named_years(FDR_data)
FDR_data['Years from Wars'] = year_from_wars(FDR_data)

In [33]:
classifier = build_sentiment_model()
FDR_data['Positivity Score'] = [positivity_score(x, classifier) for x in FDR_data['Transcript']]

In [34]:
FDR_data['Sentence Length'] = sentlen(FDR_data['Transcript'])
FDR_data['Word Length'] = wordlen(FDR_data['Transcript'])
FDR_data['Syllables per word'] = avesylls(FDR_data['Transcript'])
FDR_data['Stop Word Proportion'] = SWprop(FDR_data['Transcript'])
FDR_data['No. of Words'] = wordcount(FDR_data['Transcript'])
FDR_data['No. of Sentences'] = sentcount(FDR_data['Transcript'])

In [35]:
FDR_data['Reading Level'] = readlvl(FDR_data['Transcript'])

In [36]:
# Once again, this cell takes a while to run.
FDR_data_PCA = FDR_data[['President', 'Transcript']]
PCAfeatures = PCAphrases(FDR_data_PCA, n = 1, numfeatures = 10)
PCAfeatures.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.8723,-0.257656,-0.128421,-0.334871,-0.916542,-0.37905,-0.009807,0.171485,-0.390563,0.446454
1,-0.91566,-0.63999,-0.430109,0.572777,0.377286,-0.42778,-0.097,-0.263978,-0.154148,-0.483508
2,0.092083,-0.635472,1.106116,0.035093,0.090576,0.615359,0.046649,0.78751,0.747307,0.606194
3,-0.074718,-0.398439,0.875905,0.98827,0.481078,-0.94254,0.215479,0.133054,0.04551,-0.14454
4,-0.236318,-0.790374,0.362645,0.65338,0.325163,-0.999776,0.949917,-0.570474,0.552336,-0.100446


In [37]:
for i in range(10):
    FDR_data[f'Word PCA {i}'] = PCAfeatures[i]

In [38]:
FDR_data.head()

Unnamed: 0,Date,President,Party,Speech Title,Summary,Transcript,URL,Word Frequency,Named Years,Years from Wars,...,Word PCA 0,Word PCA 1,Word PCA 2,Word PCA 3,Word PCA 4,Word PCA 5,Word PCA 6,Word PCA 7,Word PCA 8,Word PCA 9
0,1933-03-04,Franklin D. Roosevelt,Democratic,First Inaugural Address,President Franklin Delano Roosevelt delivers t...,"President Hoover, Mr. Chief Justice, my friend...",https://millercenter.org/the-presidency/presid...,0.08,,,...,-0.8723,-0.257656,-0.128421,-0.334871,-0.916542,-0.37905,-0.009807,0.171485,-0.390563,0.446454
1,1933-03-12,Franklin D. Roosevelt,Democratic,On the Banking Crisis,"By the time of Roosevelt's inauguration, nearl...",I want to talk for a few minutes with the peop...,https://millercenter.org/the-presidency/presid...,0.156404,,,...,-0.91566,-0.63999,-0.430109,0.572777,0.377286,-0.42778,-0.097,-0.263978,-0.154148,-0.483508
2,1933-05-07,Franklin D. Roosevelt,Democratic,On Progress During the First Two Months,"Sixty days into the ""First Hundred Days"" Roose...",On a Sunday night a week after my Inauguration...,https://millercenter.org/the-presidency/presid...,0.082985,,,...,0.092083,-0.635472,1.106116,0.035093,0.090576,0.615359,0.046649,0.78751,0.747307,0.606194
3,1933-07-24,Franklin D. Roosevelt,Democratic,On the National Recovery Administration,Roosevelt defends the New Deal at the end of t...,After the adjournment of the historical specia...,https://millercenter.org/the-presidency/presid...,0.091231,,,...,-0.074718,-0.398439,0.875905,0.98827,0.481078,-0.94254,0.215479,0.133054,0.04551,-0.14454
4,1933-10-22,Franklin D. Roosevelt,Democratic,On Economic Progress,In the midst of discouraging economic news rep...,It is three months since I have talked with th...,https://millercenter.org/the-presidency/presid...,0.10023,0.956931,,...,-0.236318,-0.790374,0.362645,0.65338,0.325163,-0.999776,0.949917,-0.570474,0.552336,-0.100446


In [39]:
FDR_data.to_csv('FDR_features_df.csv')

### Imputing FDR Data

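After combining the FDR data with that of the other presidents, we fill in any missing values, normalize the data, and numerically categorize the presidents using the same methods as above. Note that `pres_data` was already normalized in the two-president section, so its rows are scaled a second time here, whereas the FDR rows are scaled only once.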

In [40]:
FDR_data = pd.read_csv('FDR_features_df.csv',index_col=0) # Reloads the data frame from the csv

In [41]:
full_data = pd.concat([pres_data, FDR_data], ignore_index = True)

In [42]:
full_data = full_data.drop(columns = ['Date', 'Party', 'Speech Title', 'Summary', 'Transcript', 'URL'])

In [43]:
full_data['Named Years'] = full_data['Named Years'].fillna(full_data['Named Years'].mean())
full_data['Years from Wars'] = full_data['Years from Wars'].fillna(full_data['Years from Wars'].mean())

In [44]:
full_data = normalize(full_data)
full_data = reclassify_labels(full_data)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['President'][i] = p


In [45]:
full_data

Unnamed: 0,President,Word Frequency,Named Years,Years from Wars,Positivity Score,Sentence Length,Word Length,Syllables per word,Stop Word Proportion,No. of Words,...,Word PCA 0,Word PCA 1,Word PCA 2,Word PCA 3,Word PCA 4,Word PCA 5,Word PCA 6,Word PCA 7,Word PCA 8,Word PCA 9
0,0,0.226227,0.995087,0.986486,0.916667,0.004757,0.185863,0.593942,0.974121,0.000026,...,-0.070020,0.092706,0.053464,-0.006965,-0.000088,-0.034305,0.012482,0.093401,-0.047858,0.005265
1,0,0.416667,0.995087,0.986486,1.000000,0.008659,0.178652,0.569682,0.945692,0.000008,...,-0.144281,-0.024469,0.000302,0.001371,0.058650,-0.067714,0.014136,-0.005589,-0.042136,-0.023457
2,0,0.221519,0.995087,0.986486,0.809524,0.003215,0.192819,0.616493,0.955462,0.000015,...,-0.106923,0.052240,0.028976,0.005068,0.021856,-0.014917,-0.011867,0.034665,-0.016317,-0.034720
3,0,0.217563,0.995087,0.986486,0.725000,0.002789,0.183988,0.572344,0.981746,0.000026,...,-0.063301,0.103850,0.038441,-0.003460,-0.001844,0.040468,-0.026300,-0.020359,0.006136,0.105917
4,0,0.467767,0.995087,0.986486,0.720000,0.002326,0.169697,0.517263,0.975943,0.000026,...,-0.068598,0.041330,0.089824,0.035354,-0.071105,0.051166,0.097720,0.062117,0.091532,-0.361286
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,2,0.112346,0.960726,0.977145,0.551282,0.640665,0.886540,0.863390,0.506999,0.272335,...,-0.400912,0.321695,-0.189655,0.072517,0.099294,0.103555,-0.144542,-0.105355,0.041193,0.215175
85,2,0.112455,0.959406,0.977145,0.573529,0.793474,0.890021,0.877477,0.532694,0.294049,...,-0.320325,0.235751,0.024903,0.267182,0.166363,0.361985,0.060006,0.095675,-0.176046,-0.170430
86,2,0.152941,0.884653,0.977145,0.625000,0.656262,0.825702,0.801111,0.543672,0.092989,...,-0.733904,0.146780,-0.036112,0.068634,-0.206420,-0.139768,-0.183143,0.317865,-0.156056,0.046671
87,2,0.146056,0.974857,0.977145,0.507937,0.912904,1.000000,1.000000,0.489309,0.333333,...,-0.192242,-0.013236,-0.635365,0.122112,0.080074,1.000000,0.172139,-0.028810,0.381483,0.589089
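
In the table above, the Sentence Length of row 0 is 0.004757, compared with 0.156414 in the two-president feature table. This appears to be because the Washington and Trump rows were divided by a column maximum twice, the second time by a maximum taken over unscaled FDR values. One way to keep every president on the same scale is to concatenate the unscaled frames first and normalize once. The sketch below uses made-up numbers and a single feature column to show the idea.

```python
import pandas as pd

# Made-up, unscaled values standing in for the real feature columns.
first_two = pd.DataFrame({
    "President": ["George Washington", "Donald Trump"],
    "Sentence Length": [30.0, 12.0],
})
fdr = pd.DataFrame({
    "President": ["Franklin D. Roosevelt"],
    "Sentence Length": [24.0],
})

# Concatenate the raw frames first, then scale every row by the same maximum.
combined = pd.concat([first_two, fdr], ignore_index=True)
feature_cols = combined.columns.drop("President")
combined[feature_cols] = combined[feature_cols] / combined[feature_cols].max()
print(combined)
```

In this notebook, that would mean building the combined frame from the saved `features_df.csv` and `FDR_features_df.csv` files rather than from the already-normalized `pres_data`.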


### Separate the Features from the Labels

In [46]:
features = full_data.drop(columns = 'President')
labels = full_data['President'].astype(int)
features.head()

Unnamed: 0,Word Frequency,Named Years,Years from Wars,Positivity Score,Sentence Length,Word Length,Syllables per word,Stop Word Proportion,No. of Words,No. of Sentences,...,Word PCA 0,Word PCA 1,Word PCA 2,Word PCA 3,Word PCA 4,Word PCA 5,Word PCA 6,Word PCA 7,Word PCA 8,Word PCA 9
0,0.226227,0.995087,0.986486,0.916667,0.004757,0.185863,0.593942,0.974121,2.6e-05,0.000116,...,-0.07002,0.092706,0.053464,-0.006965,-8.8e-05,-0.034305,0.012482,0.093401,-0.047858,0.005265
1,0.416667,0.995087,0.986486,1.0,0.008659,0.178652,0.569682,0.945692,8e-06,1.9e-05,...,-0.144281,-0.024469,0.000302,0.001371,0.05865,-0.067714,0.014136,-0.005589,-0.042136,-0.023457
2,0.221519,0.995087,0.986486,0.809524,0.003215,0.192819,0.616493,0.955462,1.5e-05,0.000101,...,-0.106923,0.05224,0.028976,0.005068,0.021856,-0.014917,-0.011867,0.034665,-0.016317,-0.03472
3,0.217563,0.995087,0.986486,0.725,0.002789,0.183988,0.572344,0.981746,2.6e-05,0.000193,...,-0.063301,0.10385,0.038441,-0.00346,-0.001844,0.040468,-0.0263,-0.020359,0.006136,0.105917
4,0.467767,0.995087,0.986486,0.72,0.002326,0.169697,0.517263,0.975943,2.6e-05,0.000231,...,-0.068598,0.04133,0.089824,0.035354,-0.071105,0.051166,0.09772,0.062117,0.091532,-0.361286


### Split the Data into Testing and Training Sets

In [47]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, train_size = 0.75, random_state = 0)

### Pass Full Data to Model

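`clf` is still the `GridSearchCV` object defined above, so calling `fit` on the new training data re-runs the grid search over the same parameter grid and refits the best estimator for the three-president problem. We then test it as before.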

In [48]:
clf = clf.fit(train_features, train_labels)

In [50]:
pred_labels = clf.predict(test_features)
# Once again, creating a confusion matrix and classification report
conf_mat = confusion_matrix(test_labels, pred_labels)
class_rep = classification_report(test_labels, pred_labels)
# Formatting the confusion matrix and classification report in the same way as before
preslist = ['Washington ','Trump ','Roosevelt ']
replacelist = [' '*(len(preslist[i])-2)+f'{i} ' for i in range(len(preslist))]
class_rep = mreplace(class_rep,replacelist,preslist,max=1)
cols = pd.MultiIndex.from_product([['Predicted Speaker'],preslist])
rows = pd.MultiIndex.from_product([['Actual Speaker'],preslist])
print(class_rep)
pd.DataFrame(conf_mat,index=rows,columns=cols)

              precision    recall  f1-score   support

  Washington       1.00      1.00      1.00         6
       Trump       1.00      1.00      1.00         5
   Roosevelt       1.00      1.00      1.00        12

    accuracy                           1.00        23
   macro avg       1.00      1.00      1.00        23
weighted avg       1.00      1.00      1.00        23



Unnamed: 0_level_0,Unnamed: 1_level_0,Predicted Speaker,Predicted Speaker,Predicted Speaker
Unnamed: 0_level_1,Unnamed: 1_level_1,Washington,Trump,Roosevelt
Actual Speaker,Washington,6,0,0
Actual Speaker,Trump,0,5,0
Actual Speaker,Roosevelt,0,0,12
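
As an alternative to replacing the numeric labels in the report text with `mreplace`, `classification_report` accepts a `target_names` argument that prints class names directly. The sketch below uses made-up labels, not the model's actual predictions, and assumes the 0, 1, 2 encoding used in this notebook.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration; 0, 1, 2 follow the notebook's encoding.
names = ["Washington", "Trump", "Roosevelt"]
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2]

# target_names puts the president names straight into the report.
report = classification_report(y_true, y_pred, target_names=names)
print(report)

conf = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=[0, 1, 2]),
    index=pd.MultiIndex.from_product([["Actual Speaker"], names]),
    columns=pd.MultiIndex.from_product([["Predicted Speaker"], names]),
)
print(conf)
```

Passing `labels` to `confusion_matrix` fixes the row and column order, so the names line up even if a class happens to be missing from the test set.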


In [None]:
# Find repository of random presidential speeches and send through model?

In [None]:
# Beautify notebook (confusion matrix)
# Add graphs
# Correlation matrix
# Annotate notebook