In this notebook, I will walk through an analysis of speeches by the first three American presidents: George Washington, John Adams, and Thomas Jefferson.  

These presidents were chosen for several reasons.  They were contemporaries, and so in theory wrote their speeches using similar language patterns.  The fact that they delivered these speeches more than 200 years ago adds an extra layer of challenge, as many of their words and phrases are uncommon in modern written communication.  Lastly, though they were all Founding Fathers, their levels of education, political philosophies, and temperaments were very different.  Ultimately, I will try to predict which of the three wrote a given sentence.  But first there is data collection and pre-processing to do.

I begin by importing the necessary libraries:

In [22]:
import pandas as pd
import nltk.tokenize
import subprocess

Included in the same folder as this Jupyter notebook is a Python file named "speech_collection.py", which scrapes the text of speeches by our three presidents from the Miller Center at the University of Virginia (https://millercenter.org/the-presidency/presidential-speeches).

The speeches are then saved as three dataframes, one for each president, with a column for the date of the speech and a column for the text of the speech.  These dataframes are then saved as .csv files and loaded into the notebook below.

In [3]:
speech_output = subprocess.run(["python", "speech_collection.py"], capture_output=True, text=True)

In [4]:
speech_output.stdout


"DataFrames saved to ('washington_speeches.csv', 'adams_speeches.csv', 'jefferson_speeches.csv')\n"
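One caveat with calling a script this way: by default `subprocess.run` does not raise an error if the script fails, so a broken scrape could go unnoticed. The sketch below shows one way to guard against that with `check=True`, and uses `sys.executable` so the script runs under the same interpreter as the notebook. The inline `-c` command is only a stand-in for "speech_collection.py" so that the example runs on its own.

```python
import subprocess
import sys

# Stand-in for speech_collection.py: an inline command that prints one line.
# sys.executable is the interpreter that is running this notebook.
result = subprocess.run(
    [sys.executable, "-c", "print('DataFrames saved')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output as str rather than bytes
    check=True,           # raise CalledProcessError on a non-zero exit code
)

print(result.stdout)
```

If the script exits with an error, the `CalledProcessError` that is raised carries the captured `stderr`, which is usually the quickest way to see what went wrong.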

In [5]:
washington_df = pd.read_csv('washington_speeches.csv')
adams_df = pd.read_csv('adams_speeches.csv')
jefferson_df = pd.read_csv('jefferson_speeches.csv')

The first few rows of each of these dataframes appear as such:

In [6]:
washington_df.head()

Unnamed: 0,Speech_Date,Speech_Text
0,1789-04-30,Fellow Citizens of the Senate and the House of...
1,1789-10-03,Whereas it is the duty of all Nations to ackno...
2,1790-01-08,Fellow Citizens of the Senate and House of Rep...
3,1790-12-08,Fellow citizens of the Senate and House of Re...
4,1790-12-29,"I the President of the United States, by my o..."
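The `Unnamed: 0` column in the output above is most likely the dataframe index that was written out when the .csv files were saved. As a small sketch (using a toy dataframe and an in-memory buffer rather than the real files, both of which are assumptions for illustration), passing `index_col=0` to `pd.read_csv` reads that column back as the index, and `parse_dates` converts the date column to a datetime type:

```python
import io

import pandas as pd

# Toy frame standing in for one of the scraped speech tables.
toy_df = pd.DataFrame({
    "Speech_Date": ["1789-04-30", "1789-10-03"],
    "Speech_Text": ["Fellow Citizens of the Senate", "Whereas it is the duty"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer)  # the index is written as an unnamed first column

# Default read: the saved index comes back as a regular column.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Read the first column as the index and parse the dates.
buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0, parse_dates=["Speech_Date"])

print(list(default_read.columns))
print(list(indexed_read.columns))
```

The rest of this notebook works the same either way, since none of the processing steps depend on the extra column.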


The following cell applies a series of processing steps that convert the dataframes of presidential speeches into a more usable form.  
-First we convert all uppercase letters to lowercase in each speech dataframe
-Then we tokenize each speech into sentence-sized tokens and word-sized tokens, each stored in a new column of the dataframe.

In [33]:
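df_list = [washington_df, adams_df, jefferson_df]

def remove_punct(tokenized_speech):
    # Keep only tokens made up entirely of alphabetic characters
    return [word for word in tokenized_speech if word.isalpha()]

for df in df_list:
    # Lowercase the full text of every speech
    df['Speech_Text'] = df['Speech_Text'].apply(str.lower)
    # Split each speech into a list of sentences and a list of words
    df['Sentence_Tokens'] = df['Speech_Text'].apply(nltk.sent_tokenize)
    df['Word_Tokens'] = df['Speech_Text'].apply(nltk.word_tokenize)
    # Drop punctuation and other non-alphabetic tokens from the word list
    df['Word_Tokens'] = df['Word_Tokens'].apply(remove_punct)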

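Now that each speech has been separated into sentence and word tokens, some more interesting analysis can be done.  Note that a remove_punct function was included to eliminate any punctuation that is treated as a word by the nltk.word_tokenize function.  Because it keeps only tokens for which str.isalpha() is true, it also drops numbers and any token containing a hyphen or an apostrophe.  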
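To make the effect of the `isalpha()` filter concrete, here is a small sketch on a hand-written token list. The tokens are illustrative and are not actual `nltk.word_tokenize` output; the point is only to show which kinds of tokens survive the filter.

```python
def remove_punct(tokenized_speech):
    # Same filter as in the notebook: keep purely alphabetic tokens only
    return [word for word in tokenized_speech if word.isalpha()]

# Hand-written tokens for illustration (not real tokenizer output)
sample_tokens = ["in", "1789", ",", "the", "well-being", "of",
                 "the", "people", "'s", "government", "."]

kept = remove_punct(sample_tokens)
dropped = [token for token in sample_tokens if not token.isalpha()]

print(kept)     # ['in', 'the', 'of', 'the', 'people', 'government']
print(dropped)  # ['1789', ',', 'well-being', "'s", '.']
```

Whether losing numbers and hyphenated words matters depends on the analysis; for simple word-frequency comparisons it is probably acceptable.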

Our dataframes now appear as such:

In [35]:
washington_df.head()

Unnamed: 0,Speech_Date,Speech_Text,Sentence_Tokens,Word_Tokens
0,1789-04-30,fellow citizens of the senate and the house of...,[fellow citizens of the senate and the house o...,"[fellow, citizens, of, the, senate, and, the, ..."
1,1789-10-03,whereas it is the duty of all nations to ackno...,[whereas it is the duty of all nations to ackn...,"[whereas, it, is, the, duty, of, all, nations,..."
2,1790-01-08,fellow citizens of the senate and house of rep...,[fellow citizens of the senate and house of re...,"[fellow, citizens, of, the, senate, and, house..."
3,1790-12-08,fellow citizens of the senate and house of re...,[ fellow citizens of the senate and house of r...,"[fellow, citizens, of, the, senate, and, house..."
4,1790-12-29,"i the president of the united states, by my o...","[ i the president of the united states, by my ...","[i, the, president, of, the, united, states, b..."


where elements in the new columns are lists filled with sentences or words, respectively.
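Since the eventual goal is to predict who wrote a given sentence, the list-valued `Sentence_Tokens` column will at some point need to become one row per sentence with an author label. The following is only a sketch of one possible way to do that with `Series.explode`, using toy frames in place of the real ones; the names `toy_frames`, `Sentence`, and `President` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the per-president frames built above
toy_frames = {
    "washington": pd.DataFrame(
        {"Sentence_Tokens": [["first sentence.", "second sentence."]]}
    ),
    "adams": pd.DataFrame(
        {"Sentence_Tokens": [["another sentence."]]}
    ),
}

labeled = []
for president, frame in toy_frames.items():
    # explode() turns each list element into its own row
    sentences = (
        frame["Sentence_Tokens"]
        .explode()
        .rename("Sentence")
        .reset_index(drop=True)
        .to_frame()
    )
    sentences["President"] = president  # label every sentence with its author
    labeled.append(sentences)

# One combined frame: one row per sentence, with its label
sentence_df = pd.concat(labeled, ignore_index=True)
print(sentence_df)
```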