# Was GWB Really That Bad of a Public Speaker?

In [111]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import spacy
import nltk
from nltk.corpus import state_union, stopwords
import re
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from collections import Counter

In [2]:
nltk.download('state_union')
nlp = spacy.load('en')  # spaCy 3+ dropped the 'en' shortcut; load 'en_core_web_sm' there instead

[nltk_data] Downloading package state_union to
[nltk_data]     /Users/brien/nltk_data...
[nltk_data]   Package state_union is already up-to-date!


## What do I mean?
George W. Bush (GWB) was known as a president with a less-than-eloquent way with words. But when his State of the Union (SOTU) addresses are stripped to the bare bones, are they really that different from what other presidents say?

Let's look at his and Clinton's SOTU addresses and try to differentiate between the two using logistic regression.

This project could be taken further by analyzing a few key Republican and Democratic presidents to see whether party affiliation can be predicted from what a president says in a SOTU address. After all, they should be telling the truth, and in this case words speak louder than actions.

In [88]:
def text_cleaner(text):
    # Get rid of the title, special punctuation, and the indicator
    # that the president is speaking.
    # Drop all-caps words of 2-10 letters (the title lines).
    text = re.sub(r"([A-Z]{2,10}\s)", '', text)
    text = re.sub(r'(PRESIDENT:)', '', text)
    text = re.sub(r"(CLINTON'S)", '', text)
    text = re.sub(r'--', ' ', text)
    # Remove bracketed annotations; the raw string avoids invalid-escape warnings.
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text

def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(1000)]

def bow_features(sentences, common_words):
    
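    # Scaffold the data frame and initialize counts to zero.
    # Newer pandas versions reject sets as column labels or indexers,
    # so convert the vocabulary to a list first.
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['speech_sentence'] = sentences['sentence']
    df['speaking_president'] = sentences['president']
    df.loc[:, columns] = 0
    print(df.describe())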
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['speech_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

## Data Loading

In [80]:
#manually loading 2001 due to his second address to congress following 9/11
gwb = state_union.raw('2001-GWBush-1.txt')
gwb += state_union.raw('2001-GWBush-2.txt')

#looping through the rest of the years to get all speeches in one text
for year in ['2002', '2003', '2004', '2005', '2006']:
    gwb += state_union.raw('{}-GWBush.txt'.format(year))
print(len(gwb))

197272


In [81]:
clinton = ""
office_years = ['1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000']
for year in office_years:
    clinton += state_union.raw(year + '-Clinton.txt')
print(len(clinton))

345929


In [6]:
clinton[:200]

"PRESIDENT BILL CLINTON'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nFebruary 17, 1993 \n\nMr. President, Mr. Speaker, Members of the House and the Senate, distinguished A"

In [82]:
#cleaning the text in preparation for parsing
gwb = text_cleaner(gwb)
gwb[:200]

"February 27, 2001 Mr. Speaker, Mr. Vice President, members of Congress: It's a great privilege to be here to outline a new budget and a new approach for governing our great country. I thank you for yo"

In [83]:
clinton = text_cleaner(clinton)
clinton[:200]

'A February 17, 1993 Mr. President, Mr. Speaker, Members of the House and the Senate, distinguished Americans here as visitors in this Chamber, as am I. It is nice to have a fresh excuse for giving a l'
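
The stray leading `A` in the cleaned Clinton text comes from the first pattern in `text_cleaner`: it only removes runs of two to ten capital letters followed by whitespace, so the one-letter word in the title survives. A minimal sketch on a made-up header (the `sample` string is hypothetical, modeled on the raw text shown earlier; the function body copies the cleaner, with the bracket pattern written as a raw string):

```python
import re

def text_cleaner(text):
    # Same substitutions as the notebook's cleaner.
    text = re.sub(r"([A-Z]{2,10}\s)", '', text)   # all-caps words of 2-10 letters
    text = re.sub(r'(PRESIDENT:)', '', text)      # speaker indicator
    text = re.sub(r"(CLINTON'S)", '', text)       # possessive left over from the title
    text = re.sub(r'--', ' ', text)               # double hyphens become spaces
    text = re.sub(r"\[.*?\]", "", text)           # bracketed annotations
    return ' '.join(text.split())                 # collapse whitespace

# Hypothetical sample, not taken from the corpus.
sample = ("PRESIDENT BILL CLINTON'S ADDRESS BEFORE A JOINT SESSION\n \n"
          "February 17, 1993 \n\nThank you. [Applause] We must act--now.")

cleaned = text_cleaner(sample)
print(cleaned)
```

Note that `CLINTON'S` escapes the first pattern because the apostrophe, not whitespace, follows the capital letters, which is why it needs its own substitution.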

In [84]:
#parsing the document, this cell can take a little longer to run
gwb_doc = nlp(gwb)
clinton_doc = nlp(clinton)

## Grouping into Sentences

In [85]:
#loading sentences into a dataframe to store BoW features
gwb_sents = [[sent, 'GWB'] for sent in gwb_doc.sents]
clinton_sents = [[sent, 'CLINTON'] for sent in clinton_doc.sents]
sentences = pd.DataFrame(gwb_sents + clinton_sents, columns=['sentence', 'president'])
sentences.head()

| | sentence | president |
|---|---|---|
| 0 | (February, 27, ,, 2001, Mr., Speaker, ,, Mr., ... | GWB |
| 1 | (I, thank, you, for, your, invitation, to, spe... | GWB |
| 2 | (I, know, Congress, had, to, formally, invite,... | GWB |
| 3 | ((, Laughter, ., )) | GWB |
| 4 | (So, ,, Mr., Vice, President, ,, I, appreciate... | GWB |


In [86]:
gwb_bow = bag_of_words(gwb_doc)
clinton_bow = bag_of_words(clinton_doc)

common_words = set(gwb_bow + clinton_bow)
print(len(gwb_bow))
print(len(clinton_bow))
print(len(common_words))

1000
1000
1366
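
These counts also say how much vocabulary the two presidents share: two lists of 1000 words whose union has 1366 entries must overlap in 1000 + 1000 - 1366 = 634 words. A small sketch of the same bookkeeping on toy token lists (the `top_words` helper and the tokens are made up for illustration; the notebook itself counts spaCy lemmas):

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent tokens, mirroring bag_of_words without spaCy."""
    return [word for word, _ in Counter(tokens).most_common(n)]

# Toy token lists standing in for the two parsed documents.
gwb_top = top_words("tax tax tax job job school".split(), 2)
clinton_top = top_words("freedom freedom freedom tax tax war".split(), 2)

# The union removes duplicates, so the shortfall is the number of shared words.
vocab = set(gwb_top + clinton_top)
shared = len(gwb_top) + len(clinton_top) - len(vocab)
print(sorted(vocab), shared)

# Same arithmetic applied to the notebook's printed sizes.
notebook_shared = 1000 + 1000 - 1366
print(notebook_shared)
```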


In [21]:
print(sentences['sentence'].iloc[279:280])

279    (Forty, years, ago, ,, and, then, 20, years, a...
Name: sentence, dtype: object


## Creating word features

In [89]:
word_counts = bow_features(sentences, common_words)

         300  immigration  Senator  uncertainty     8  level   yes  religious  \
count   5407         5407     5407         5407  5407   5407  5407       5407   
unique     1            1        1            1     1      1     1          1   
top        0            0        0            0     0      0     0          0   
freq    5407         5407     5407         5407  5407   5407  5407       5407   

        maintain  school         ...           end  Eric  recent  Mississippi  \
count       5407    5407         ...          5407  5407    5407         5407   
unique         1       1         ...             1     1       1            1   
top            0       0         ...             0     0       0            0   
freq        5407    5407         ...          5407  5407    5407         5407   

        tackle  thank  direction  turn  \
count     5407   5407       5407  5407   
unique       1      1          1     1   
top          0      0          0     0   
freq      5407   540

In [90]:
word_counts.describe()

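| | 300 | immigration | Senator | uncertainty | 8 | level | yes | religious | maintain | school | ... | end | Eric | recent | Mississippi | tackle | thank | direction | turn | speech_sentence | speaking_president |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | ... | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 | 5407 |
| unique | 2 | 2 | 3 | 2 | 2 | 3 | 2 | 2 | 2 | 5 | ... | 3 | 2 | 2 | 2 | 3 | 3 | 3 | 2 | 5407 | 2 |
| top | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (That, 's, about, 90, percent, of, the, firms,... | CLINTON |
| freq | 5399 | 5399 | 5396 | 5401 | 5400 | 5389 | 5397 | 5389 | 5396 | 5257 | ... | 5340 | 5402 | 5398 | 5401 | 5401 | 5286 | 5394 | 5369 | 1 | 3199 |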
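
Updating one cell at a time with `df.loc[i, word] += 1` tends to be slow on a frame of this size (5407 rows). One possible alternative, sketched here with plain Python lists standing in for spaCy sentences, is to count each sentence once with `Counter` and build all rows up front. The `count_rows` helper, `vocab`, and the toy sentences are assumptions for illustration, not part of the original notebook:

```python
from collections import Counter

def count_rows(sentences, vocab):
    """Build one row of word counts per sentence, in vocab column order."""
    vocab_set = set(vocab)  # fast membership checks
    rows = []
    for tokens in sentences:
        # Count only the tokens that belong to the vocabulary.
        counts = Counter(t for t in tokens if t in vocab_set)
        # A Counter returns 0 for words it has not seen.
        rows.append([counts[word] for word in vocab])
    return rows

# Toy data: three "sentences" of already-lemmatized tokens.
vocab = ["job", "tax", "freedom"]
sentences = [["tax", "tax", "job", "the"], ["freedom", "war"], []]

rows = count_rows(sentences, vocab)
print(rows)
```

The resulting rows could then be handed to `pd.DataFrame(rows, columns=vocab)` in a single step instead of filling the frame cell by cell.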


## Train Test Split

In [105]:
word_counts = word_counts.fillna(0)

word_counts.loc[:, 'is_gwb'] = 0
word_counts.loc[word_counts['speaking_president'] == 'GWB', 'is_gwb'] = 1

X = word_counts.loc[:, list(common_words)]  # newer pandas rejects a set as an indexer
Y = word_counts.loc[:, 'is_gwb']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.35, random_state=34)

print(len(X_train))
print(len(X_test))

3514
1893


## Logistic Regression

In [107]:
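# liblinear supports both the l1 and l2 penalties searched below.
lr = LogisticRegression(class_weight='balanced', max_iter=300, tol=0.001, solver='liblinear')

params = {
    # C must be strictly positive, so the lower bound is a small value rather than 0.
    'C': Real(1e-6, 1, 'uniform'),
    'penalty': Categorical(['l1', 'l2']),
}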

opt = BayesSearchCV(
    lr,
    params,
    cv=5,
    n_iter=10,
    random_state=254,
    verbose=1,
    scoring='precision'
)

opt.fit(X_train, Y_train)

print(opt.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished

{'C': 0.6609486803003481, 'penalty': 'l2'}




In [108]:
lr = LogisticRegression(
    class_weight='balanced',
    max_iter=300,
    tol=0.001,
    penalty='l2',
    C=0.66
)

lr.fit(X_train, Y_train)

Y_train_pred = lr.predict(X_train)
Y_test_pred = lr.predict(X_test)

print('__________Training Statistics__________')
print(confusion_matrix(Y_train, Y_train_pred))
print(classification_report(Y_train, Y_train_pred))

print('\n__________Test Statistics__________')
print(confusion_matrix(Y_test, Y_test_pred))
print(classification_report(Y_test, Y_test_pred))

__________Training Statistics__________
[[1955  139]
 [ 178 1242]]
              precision    recall  f1-score   support

           0       0.92      0.93      0.93      2094
           1       0.90      0.87      0.89      1420

   micro avg       0.91      0.91      0.91      3514
   macro avg       0.91      0.90      0.91      3514
weighted avg       0.91      0.91      0.91      3514


__________Test Statistics__________
[[902 203]
 [225 563]]
              precision    recall  f1-score   support

           0       0.80      0.82      0.81      1105
           1       0.73      0.71      0.72       788

   micro avg       0.77      0.77      0.77      1893
   macro avg       0.77      0.77      0.77      1893
weighted avg       0.77      0.77      0.77      1893





Overfitting is an issue: accuracy falls from 0.91 on the training set to 0.77 on the test set. Still, the model does a reasonable job of classifying the speaker, with a test precision of 0.73 for GWB. It does better at identifying Clinton (precision 0.80), but that could just be down to the class imbalance, since Clinton accounts for 3199 of the 5407 sentences.
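
The figures in the classification report can be recomputed directly from the test confusion matrix, and comparing accuracy with a majority-class guess gives a sense of how much the model has actually learned. A short sketch using only the numbers printed above (the variable names are mine):

```python
# Test-set confusion matrix from the logistic regression cell.
# Rows are true labels (0 = Clinton, 1 = GWB), columns are predictions.
tn, fp = 902, 203
fn, tp = 225, 563
total = tn + fp + fn + tp

gwb_precision = tp / (tp + fp)      # of sentences predicted GWB, how many were GWB
gwb_recall = tp / (tp + fn)         # of true GWB sentences, how many were found
clinton_precision = tn / (tn + fn)  # same idea for the Clinton class
accuracy = (tp + tn) / total

# Baseline: always guess the majority class (Clinton).
baseline = (tn + fp) / total

print(f"GWB precision:     {gwb_precision:.2f}")
print(f"GWB recall:        {gwb_recall:.2f}")
print(f"Clinton precision: {clinton_precision:.2f}")
print(f"Accuracy:          {accuracy:.2f}")
print(f"Majority baseline: {baseline:.2f}")
```

On these numbers the model beats the always-Clinton baseline by roughly 19 percentage points on the test set.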

## Random Forest

In [109]:
rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, Y_train)

Y_train_pred = rfc.predict(X_train)
Y_test_pred = rfc.predict(X_test)

print('__________Training Statistics__________')
print(confusion_matrix(Y_train, Y_train_pred))
print(classification_report(Y_train, Y_train_pred))

print('\n__________Test Statistics__________')
print(confusion_matrix(Y_test, Y_test_pred))
print(classification_report(Y_test, Y_test_pred))



__________Training Statistics__________
[[2081   13]
 [  62 1358]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      2094
           1       0.99      0.96      0.97      1420

   micro avg       0.98      0.98      0.98      3514
   macro avg       0.98      0.98      0.98      3514
weighted avg       0.98      0.98      0.98      3514


__________Test Statistics__________
[[958 147]
 [332 456]]
              precision    recall  f1-score   support

           0       0.74      0.87      0.80      1105
           1       0.76      0.58      0.66       788

   micro avg       0.75      0.75      0.75      1893
   macro avg       0.75      0.72      0.73      1893
weighted avg       0.75      0.75      0.74      1893



The vanilla random forest has serious overfitting issues, far worse than logistic regression: 0.98 accuracy on the training set against 0.75 on the test set. However, I think some tuning could fix this and boost our test set precision.

In [129]:
rfc = ensemble.RandomForestClassifier(
    max_features=None, 
    n_jobs=-2, 
    class_weight='balanced_subsample'
)

params = {
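    # Pass transform by keyword: the third positional argument of Integer
    # differs between scikit-optimize versions.
    'n_estimators': Integer(100, 200, transform='normalize'),
    'criterion': Categorical(['entropy', 'gini']),
    'max_depth': Integer(6, 100, transform='normalize')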
}

opt = BayesSearchCV(
    rfc,
    params,
    cv=5,
    n_iter=10,
    random_state=254,
    verbose=1
)

opt.fit(X_train, Y_train)

print(opt.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.6min finished

{'criterion': 'entropy', 'max_depth': 54, 'n_estimators': 175}


In [132]:
rfc = ensemble.RandomForestClassifier(
    criterion='entropy',
    n_estimators=175,
    max_features=None,
    max_depth=54,
    verbose=1,
    n_jobs=-2,
    class_weight='balanced_subsample'
)

rfc.fit(X_train, Y_train)

Y_train_pred = rfc.predict(X_train)
Y_test_pred = rfc.predict(X_test)

print('__________Training Statistics__________')
print(confusion_matrix(Y_train, Y_train_pred))
print(classification_report(Y_train, Y_train_pred))

print('\n__________Test Statistics__________')
print(confusion_matrix(Y_test, Y_test_pred))
print(classification_report(Y_test, Y_test_pred))

[Parallel(n_jobs=-2)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-2)]: Done 175 out of 175 | elapsed:   22.6s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.0s
[Parallel(n_jobs=3)]: Done 175 out of 175 | elapsed:    0.1s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.0s


__________Training Statistics__________
[[2092    2]
 [ 316 1104]]
              precision    recall  f1-score   support

           0       0.87      1.00      0.93      2094
           1       1.00      0.78      0.87      1420

   micro avg       0.91      0.91      0.91      3514
   macro avg       0.93      0.89      0.90      3514
weighted avg       0.92      0.91      0.91      3514


__________Test Statistics__________
[[983 122]
 [336 452]]
              precision    recall  f1-score   support

           0       0.75      0.89      0.81      1105
           1       0.79      0.57      0.66       788

   micro avg       0.76      0.76      0.76      1893
   macro avg       0.77      0.73      0.74      1893
weighted avg       0.76      0.76      0.75      1893



[Parallel(n_jobs=3)]: Done 175 out of 175 | elapsed:    0.0s finished


Well, that was unexpected... Hyperparameter tuning barely changed the random forest's test performance: accuracy moved from 0.75 to 0.76 and GWB precision from 0.76 to 0.79, while GWB recall stayed near 0.57. Even varying the scoring method within BayesSearchCV to precision, recall, or f1 performed worse than the results displayed above.
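
One way to compare the three models on overfitting is the gap between training and test accuracy, recomputed here from the confusion matrices printed above (a sketch; the `accuracy` helper and the dictionary layout are mine):

```python
def accuracy(cm):
    """Accuracy from a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)

# (train matrix, test matrix) for each model, copied from the notebook output.
results = {
    "logistic regression": ([[1955, 139], [178, 1242]], [[902, 203], [225, 563]]),
    "random forest (default)": ([[2081, 13], [62, 1358]], [[958, 147], [332, 456]]),
    "random forest (tuned)": ([[2092, 2], [316, 1104]], [[983, 122], [336, 452]]),
}

gaps = {}
for name, (train_cm, test_cm) in results.items():
    train_acc, test_acc = accuracy(train_cm), accuracy(test_cm)
    gaps[name] = train_acc - test_acc
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f} gap={gaps[name]:.2f}")
```

On these numbers the tuned forest's gap is smaller than the default forest's, though still larger than logistic regression's.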

## Conclusion

With all of the above in mind, I can draw a few conclusions.

1.) Logistic regression is the model to go with here: it overfits less and performs better on the test set (GWB f1 of 0.72 versus 0.66 for either random forest).

2.) Regarding the question initially posed... GWB and Clinton do use noticeably different vocabulary in their speeches. Once everything is distilled to root words, individual sentences can be attributed to the right speaker about 77% of the time on the test set, well above the roughly 58% that always guessing Clinton would achieve. *whew* It may seem like politicians all say the same things, but the data shows otherwise.

3.) The next step for this project would be to bring in the rest of the SOTU corpus and try some unsupervised techniques to group presidents along party lines. From there it would be interesting to train models on those clusters to predict whether a speaking politician aligns with a specific party based on their speech content.