# Using NLP to Understand 60 years of Congressional Addresses

Every year in January/February the President stands before the U.S. and Congress to give their State of The Union address (SOTU). Normally it is a rambling affair with lots of pauses for *(Applause.)* and *(Laughter.)* and the more than occasional *(Boo.)*. While everyone in the room is intensely analyzing the President and waiting for a political agenda to be revealed there is another group, the general population who just want to hear what the state of the union is. We want to know how our country is fairing in the World and on our home soil, but that is rarely what is delivered.

In this notebook I want to dig into the SOTU's from the last 60 years and use unsupervised and supervised NLP workflows to determine if party politics is at the forefront of the addresses. I want to see if the words the presidents use just pander to their party or if they are shooting straight and delivering the State of The Union that the rest of the country wants to hear.

As I work through the problem I will use the nltk state_union corpus and spacy module to tokenize the words in the speeches to try and predict first the speaker and secondly their political party. Before gettig to the supervised learning approach of classification, I want to try and cluster the sentences together using tf-idf vectors with SVD to perform latent semantic analysis. Once the components are generated I will use the component scores for each sentence to try and predict the speaker and their political affiliation.

In [386]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import spacy
import scipy
import nltk
from nltk.corpus import state_union, stopwords
import re
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from collections import Counter

sns.set_style('white')

In [2]:
nltk.download('state_union')
nlp = spacy.load('en')

[nltk_data] Downloading package state_union to
[nltk_data]     /Users/brien/nltk_data...
[nltk_data]   Package state_union is already up-to-date!


## Functions

In [338]:
def speech_load_raw(president, year_lst):
    pres = ""
    for year in year_lst:
        pres += state_union.raw('{}-{}.txt'.format(year, president))
    print('{}: {} words'.format(president, len(pres)))
    return pres

def speech_load_sents(president, year_lst):
    pres = state_union.sents('{}-{}.txt'.format(year_lst[0], president))[2:]
    for year in year_lst[1:]:
        pres += state_union.sents('{}-{}.txt'.format(year, president))[2:]
    print('{}: {} sentences'.format(president, len(pres)))
    return pres

def text_cleaner(text):
    # getting read of the title, special punctuation and the indicator
    # that the president is speaking
    text = re.sub(r"([A-Z]{2,10}\s)", '', text)
    text = re.sub(r'(PRESIDENT:)', '', text)
    pattern = r"(CLINTON'S|S. TRUMAN'S|D. EISENHOWER'S|F. KENNEDY'S|B. JOHNSON'S|R. FORD'S|CARTER'S|REAGAN'S|H.W. BUSH'S)"
    text = re.sub(pattern, '', text)
    text = re.sub(r"((Applause.)|(Laughter.))", '', text)
    text = re.sub(r'(-|--)',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text

def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_.lower()
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(1000)]

def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['speech_sentence'] = sentences['sentence']
    df['speaking_president'] = sentences['president']
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['speech_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 10 == 0:
            print("Processing row {}".format(i))
            
    return df

def bag_of_pos(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.pos_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(20)]

def amount_punc(sentence):
    #selecting only the punctuation in the sentence
    all_punc = [token
                for token in sentence
                if token.is_punct]
    #returning the amount of punctuation
    return len(all_punc)

def pos_features(sentences, common_pos):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_pos)
    df['text_sentence'] = sentences
    df.loc[:, common_pos] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        pos = [token.pos_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.pos_ in common_pos
                 )]
        
        # Populate the row with word counts.
        for word in pos:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

## Unsupervised Clustering with Tfidf

### Loading Raw Sentences

In [129]:
#loading in the raw sentences from the Corpus
truman = speech_load_sents('Truman', range(1945, 1951, 1))
ike = speech_load_sents('Eisenhower', range(1953, 1960, 1))
jfk = speech_load_sents('Kennedy', range(1961, 1963, 1))

lbj = speech_load_sents('Johnson', [1964, 1966, 1967, 1968, 1969])
#odd year with 2 speeches
lbj += state_union.sents('1965-Johnson-1.txt')
lbj += state_union.sents('1965-Johnson-2.txt')
print('Johnson: {} sentences'.format(len(lbj)))

nixon = speech_load_sents('Nixon', range(1970, 1974, 1))
ford = speech_load_sents('Ford', range(1975, 1977, 1))
carter = speech_load_sents('Carter', range(1978, 1980, 1))
reagan = speech_load_sents('Reagan', range(1981, 1988, 1))

bush = speech_load_sents('Bush', [1989, 1990, 1992])
bush += state_union.sents('1991-Bush-1.txt')
bush += state_union.sents('1991-Bush-2.txt')
print('Bush: {} sentences'.format(len(bush)))

clinton = speech_load_sents('Clinton', range(1993, 2000, 1))

gwb = speech_load_sents('GWBush', range(2002, 2006, 1))
gwb += state_union.sents('2001-GWBush-1.txt')
gwb += state_union.sents('2001-GWBush-2.txt')
print('GWBush: {} sentences'.format(len(gwb)))

Truman: 2325 sentences
Eisenhower: 2065 sentences
Kennedy: 496 sentences
Johnson: 1231 sentences
Johnson: 1694 sentences
Nixon: 618 sentences
Ford: 496 sentences
Carter: 411 sentences
Reagan: 1634 sentences
Bush: 811 sentences
Bush: 1223 sentences
Clinton: 2612 sentences
GWBush: 1215 sentences
GWBush: 1809 sentences


In [396]:
presidents = {
    'Truman':truman, 
    'Eisenhower':ike, 
    'Kennedy':jfk, 
    'Johnson':lbj, 
    'Nixon':nixon, 
    'Ford':ford, 
    'Carter':carter, 
    'Reagan':reagan, 
    'Bush':bush, 
    'Clinton':clinton, 
    'GWBush':gwb
}

party = {
    'Truman':'DEM', 
    'Eisenhower':'REP', 
    'Kennedy':'DEM', 
    'Johnson':'DEM', 
    'Nixon':'REP', 
    'Ford':'REP', 
    'Carter':'DEM', 
    'Reagan':'REP', 
    'Bush':'REP', 
    'Clinton':'DEM', 
    'GWBush':'REP'
}
pres_sentences = [[' '.join(sentence), pres] for pres in presidents for sentence in presidents[pres]]

sentences = pd.DataFrame(pres_sentences, columns=['sotu_sentence', 'sotu_president'])

for pres in party:
    sentences.loc[sentences['sotu_president'] == pres, 'pol_party'] = party[pres]

sentences.head()

Unnamed: 0,sotu_sentence,sotu_president,pol_party
0,"Mr . Speaker , Mr . President , Members of the...",Truman,DEM
1,"Only yesterday , we laid to rest the mortal re...",Truman,DEM
2,"At a time like this , words are inadequate .",Truman,DEM
3,The most eloquent tribute would be a reverent ...,Truman,DEM
4,"Yet , in this decisive hour , when world event...",Truman,DEM


In [397]:
X = sentences.loc[:, 'sotu_sentence']
Y = sentences.loc[:, ['sotu_president', 'pol_party']]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=6)

print('Number of rows in Training:', len(X_train))
print('Number of rows in Test:', len(X_test))

Number of rows in Training: 11537
Number of rows in Test: 3846


## Clusters on Training Set

### Tfidf Vectorization of Sentences

In [240]:
vectorizer = TfidfVectorizer(
    max_df=0.5, # drop words that occur in more than half the time
    min_df=0.001, 
    stop_words='english', # do not include stop words
    lowercase=True,
    use_idf=True,
    norm=u'l2', 
    smooth_idf=True 
)

sotu_tfidf = vectorizer.fit_transform(X_train)

print('Number of features:', sotu_tfidf.get_shape()[1])

Number of features: 1662


In [241]:
print('Number of words excluded based on vectorizer limiters:', len(vectorizer.stop_words_))

Number of words excluded based on vectorizer limiters: 8803


In [245]:
sotu_tfidf_csr = sotu_tfidf.tocsr()

n = sotu_tfidf_csr.shape[0]
tfidf_by_sent = [{} for _ in range(0,n)]
words = vectorizer.get_feature_names()

for i,j in zip(*sotu_tfidf_csr.nonzero()):
    tfidf_by_sent[i][words[j]] = sotu_tfidf_csr[i, j]

print(X_train.iloc[20])
print(tfidf_by_sent[20])

We will continue to cooperate with the Federal Reserve Board , seeking a steady policy that ensures price stability without keeping interest rates artificially high or needlessly holding down growth .
{'continue': 0.21098489877764123, 'cooperate': 0.3246295181264091, 'federal': 0.1861786313991985, 'reserve': 0.30127398128042426, 'board': 0.3307057280782114, 'seeking': 0.3103425586602814, 'steady': 0.2981272800699197, 'policy': 0.22307052525154417, 'price': 0.24407583976465427, 'stability': 0.27991168258486615, 'keeping': 0.2898832147161841, 'rates': 0.24120298351361513, 'high': 0.22685937155722882, 'growth': 0.22080203737152976}


### Feature/Dimensionality Reduction

In [279]:
#Feature reduction, input for svd is number of clusters
svd = TruncatedSVD(400)
lsa = make_pipeline(svd, Normalizer(copy=False))

# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(sotu_tfidf)

variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of paragraphs our solution considers similar, for the first five identified topics
sents_by_component = pd.DataFrame(X_train_lsa, index=X_train)
for i in range(5):
    print('\nComponent {}:'.format(i))
    print(sents_by_component.loc[:,i].sort_values(ascending=False)[0:10])

Percent variance captured by all components: 57.179293631926285

Component 0:
sotu_sentence
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
( Applause .)    0.999801
Name: 0, dtype: float64

Component 1:
sotu_sentence
It is a time to build , to build the America within reach , an America where everybody has a chance to get ahead with hard work ; where every citizen can live in a safe community ; where families are strong , schools are good , and all our young people can go on to college ; an America where scientists find cures for diseases , from diabetes to Alzheimer ' s to AIDS ; an America where every child can stretch a hand across a keyboard and reach every book ever written , every painting ever painted , every symphony ever composed ; where government provides opportunity and citizens honor the r

Here are the general themes of the sentences for the top 5 components from the SVD.

comp 0: applause
comp 1: rallying, building, strength, improvement America
comp 2: world comments
comp 3: sentences about people
comp 4: 'Merica

### Classification Modeling based on Unsupervised Clusters

#### Logistic Regression

In [298]:
lr = LogisticRegression(penalty='l1')

X_train_unsup = sents_by_component
Y_train_unsup = Y_train.loc[:, 'pol_party']

lr.fit(X_train_unsup, Y_train_unsup)

Y_train_pred_unsup = lr.predict(X_train_unsup)

print('__________Training Statistics__________')
print(confusion_matrix(Y_train_unsup, Y_train_pred_unsup))
print(classification_report(Y_train_unsup, Y_train_pred_unsup))



__________Training Statistics__________
[[3570 2056]
 [1943 3968]]
              precision    recall  f1-score   support

         DEM       0.65      0.63      0.64      5626
         REP       0.66      0.67      0.66      5911

   micro avg       0.65      0.65      0.65     11537
   macro avg       0.65      0.65      0.65     11537
weighted avg       0.65      0.65      0.65     11537



For some reason I could not choose 'elasticnet' for the logistic regression model. It says that the only penalties available are l1 or l2, even though elasticnet is a valid option based on the sklearn docs. Could be an sklearn version issue, but I am not sure. Given that I wanted to try the SGDClassifier as it did have an option for elasticnet and given the number of features, I thought that would be the best way to handle classification. The one thing that I do want to do with the logistic regression model is set it up for multiclass predictions to see how well the unsupervised clusters can identify the authors.

In [310]:
lr = LogisticRegression(penalty='l2', solver='newton-cg', multi_class='multinomial')

X_train_unsup = sents_by_component
Y_train_unsup = Y_train.loc[:, 'sotu_president']

lr.fit(X_train_unsup, Y_train_unsup)

Y_train_pred_unsup_mcl = lr.predict(X_train_unsup)

print('__________Training Statistics__________')
print(confusion_matrix(Y_train_unsup, Y_train_pred_unsup_mcl))
print(classification_report(Y_train_unsup, Y_train_pred_unsup_mcl))

__________Training Statistics__________
[[ 178    1  334   91    2   72   99    2    5   90   61]
 [  15   17   77   45    2   16   21    1    2   51   58]
 [  51    1 1432   74    1  114  114    2    6  106   82]
 [  19    3  147  867    1   34   90    3    4   73  333]
 [   9    2  105   61   20   22   38    1    5   38   48]
 [  39    1  313   75    2  706   75    1    6   88   60]
 [  29    1  280  160    2   52  487    2    4   68  161]
 [   8    0   54  119    0   15   46   24    1   30   84]
 [  22    1  111   71    1   13   63    2   68   46   48]
 [  52    4  338  134    3   70   98    1    7  420  114]
 [  19    1  134  275    1   40   77    2    9   39 1114]]
              precision    recall  f1-score   support

        Bush       0.40      0.19      0.26       935
      Carter       0.53      0.06      0.10       305
     Clinton       0.43      0.72      0.54      1983
  Eisenhower       0.44      0.55      0.49      1574
        Ford       0.57      0.06      0.10       

The model actually doesn't do too bad with classifying the presidents in regards to the precision, but there are a lot of issues with class imbalance in the dataset. Some presidents do seem to do better than others. GWBush has the best performance followed by JFK. Both presidents were known for their speaking abilities, for bad and good reasons respectively.

In [311]:
lr = LogisticRegression(penalty='l2', multi_class='multinomial', class_weight='balanced')

params = {
    'solver':Categorical(['newton-cg', 'sag', 'saga', 'lbfgs']),
    'C':Real(0.001, 1, 'uniform')
}

opt = BayesSearchCV(
    lr,
    params,
    cv=5,
    n_iter=7,
    random_state=45,
    verbose=1
)

opt.fit(X_train_unsup, Y_train_unsup)
print(opt.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   32.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   24.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   27.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.7s finished


{'C': 0.6579478988568594, 'solver': 'newton-cg'}


In [312]:
lr = LogisticRegression(penalty='l2', solver='newton-cg', multi_class='multinomial', C=0.6579)

X_train_unsup = sents_by_component
Y_train_unsup = Y_train.loc[:, 'sotu_president']

lr.fit(X_train_unsup, Y_train_unsup)

Y_train_pred_unsup_mcl = lr.predict(X_train_unsup)

print('__________Training Statistics__________')
print(confusion_matrix(Y_train_unsup, Y_train_pred_unsup_mcl))
print(classification_report(Y_train_unsup, Y_train_pred_unsup_mcl))

__________Training Statistics__________
[[ 155    0  356   99    2   71   96    1    2   90   63]
 [  14    5   85   47    0   14   22    1    3   51   63]
 [  42    0 1456   76    1  108  104    1    5  102   88]
 [  14    3  159  867    0   30   86    1    2   63  349]
 [   8    1  113   65    9   18   36    0    3   38   58]
 [  33    0  336   77    2  687   73    0    6   86   66]
 [  23    0  305  161    2   45  472    1    3   68  166]
 [   7    0   61  124    0   14   43   12    0   33   87]
 [  20    0  125   78    1   13   68    0   43   47   51]
 [  41    2  363  146    2   62   95    1    3  408  118]
 [  16    1  144  281    1   35   80    1    4   36 1112]]
              precision    recall  f1-score   support

        Bush       0.42      0.17      0.24       935
      Carter       0.42      0.02      0.03       305
     Clinton       0.42      0.73      0.53      1983
  Eisenhower       0.43      0.55      0.48      1574
        Ford       0.45      0.03      0.05       

Hyperparameter tuning helped smooth out some of the precisions and even increased GWBush and Kennedy. Some of the smoothing is likely due to changing the class_weight to balanced. Since performance is still low overall, let's take a look at the SGD Classifier and see if we can use the elasticnet penalty to better results.

#### SGD Classifier

In [315]:
sgdc = SGDClassifier(
    loss='log',
    penalty='elasticnet',
    tol=0.00001,
    n_iter_no_change=10,
    eta0=0.001,
    max_iter=250
)

params = {
    'learning_rate':Categorical(['optimal', 'constant', 'adaptive']),
    'alpha':Real(1e-10, 0.1, 'uniform'),
    'l1_ratio':Real(0, 1, 'uniform')
}

opt = BayesSearchCV(
    sgdc,
    params,
    cv=5,
    n_iter=7,
    random_state=54,
    verbose=1
)

opt.fit(X_train_unsup, Y_train_unsup)
print(opt.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   27.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   32.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  5.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   40.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   44.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   41.5s finished


{'alpha': 4.015624554416855e-05, 'l1_ratio': 0.07108300499461508, 'learning_rate': 'constant'}




The BayesSearch params chose an l1_ratio close to 0 which is effectively an L2 penalty so I assume that the results of the SGDClassifier will be pretty close to the LR model.

In [316]:
sgdc = SGDClassifier(
    loss='log',
    penalty='elasticnet',
    tol=0.00001,
    n_iter_no_change=10,
    eta0=0.001,
    learning_rate='constant',
    alpha=4e-5,
    l1_ratio=0.071
)

sgdc.fit(X_train_unsup, Y_train_unsup)

Y_train_pred_unsup_sgdc = sgdc.predict(X_train_unsup)

print('__________Training Statistics__________')
print(sgdc.n_iter_)
print(confusion_matrix(Y_train_unsup, Y_train_pred_unsup_sgdc))
print(classification_report(Y_train_unsup, Y_train_pred_unsup_sgdc))

__________Training Statistics__________
1000
[[ 160    1  339  100    2   75   94    2    7   91   64]
 [  16    9   85   46    0   15   21    1    3   49   60]
 [  45    1 1440   77    1  111  108    2    6  102   90]
 [  16    3  158  844    0   36   84    2    2   70  359]
 [   9    1  108   61   13   22   37    1    5   40   52]
 [  32    1  318   79    2  698   74    0    7   87   68]
 [  24    0  290  159    2   50  474    1    4   72  170]
 [   8    0   58  123    0   15   44   17    1   31   84]
 [  20    1  118   77    1   12   64    0   59   45   49]
 [  45    4  349  142    2   69   93    1    5  417  114]
 [  18    1  142  274    1   38   74    1    6   38 1118]]
              precision    recall  f1-score   support

        Bush       0.41      0.17      0.24       935
      Carter       0.41      0.03      0.06       305
     Clinton       0.42      0.73      0.53      1983
  Eisenhower       0.43      0.54      0.47      1574
        Ford       0.54      0.04      0.07  



It looks like SGD with an elasticnet penalty didn't quite perform as well as Logistic Regression with an L2 penalty. However, they are very close to one another in performance. 

One last thing... lets see if the tuned SGD model can predict party affiliations any better than before.

In [317]:
X_train_unsup = sents_by_component
Y_train_unsup = Y_train.loc[:, 'pol_party']

sgdc = SGDClassifier(
    loss='log',
    penalty='elasticnet',
    tol=0.00001,
    n_iter_no_change=10,
    eta0=0.001,
    learning_rate='constant',
    alpha=4e-5,
    l1_ratio=0.071
)

sgdc.fit(X_train_unsup, Y_train_unsup)

Y_train_pred_unsup_sgdc = sgdc.predict(X_train_unsup)

print('__________Training Statistics__________')
print('Number of Iterations:', sgdc.n_iter_)
print(confusion_matrix(Y_train_unsup, Y_train_pred_unsup_sgdc))
print(classification_report(Y_train_unsup, Y_train_pred_unsup_sgdc))

__________Training Statistics__________
474
[[3533 2093]
 [1885 4026]]
              precision    recall  f1-score   support

         DEM       0.65      0.63      0.64      5626
         REP       0.66      0.68      0.67      5911

   micro avg       0.66      0.66      0.66     11537
   macro avg       0.66      0.65      0.65     11537
weighted avg       0.66      0.66      0.65     11537



Performance is similar to the LR model that was done previously, which isn't too surprising again because of the penalty being essentially the same.

It's time to repeat the process for the test set and run the LR model with an L2 penalty to predict authorship and political affiliation.

## Clusters on Test Set

### Tfidf Vectorization of Sentences

In [318]:
vectorizer = TfidfVectorizer(
    max_df=0.5, # drop words that occur in more than half the time
    min_df=0.001, 
    stop_words='english', # do not include stop words
    lowercase=True,
    use_idf=True,
    norm=u'l2', 
    smooth_idf=True 
)

sotu_tfidf_test = vectorizer.fit_transform(X_test)

print('Number of features:', sotu_tfidf.get_shape()[1])
print('Number of words excluded based on vectorizer limiters:', len(vectorizer.stop_words_))

sotu_tfidf_csr_test = sotu_tfidf_test.tocsr()

n = sotu_tfidf_csr_test.shape[0]
tfidf_by_sent_test = [{} for _ in range(0,n)]
words = vectorizer.get_feature_names()

for i,j in zip(*sotu_tfidf_csr_test.nonzero()):
    tfidf_by_sent_test[i][words[j]] = sotu_tfidf_csr_test[i, j]

print(X_test.iloc[20])
print(tfidf_by_sent_test[20])

Number of features: 1662
Number of words excluded based on vectorizer limiters: 4677
This is not the sole responsibility of any one branch of our government .
{'responsibility': 0.5418620140275172, 'branch': 0.7476104822337042, 'government': 0.3840105787713807}


### Feature/Dimensionality Reduction

In [323]:
#Feature reduction, input for svd is number of clusters
svd = TruncatedSVD(400)
lsa = make_pipeline(svd, Normalizer(copy=False))

# Run SVD on the training data, then project the training data.
X_test_lsa = lsa.fit_transform(sotu_tfidf_test)

variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of paragraphs our solution considers similar, for the first five identified topics
sents_by_component_test = pd.DataFrame(X_test_lsa, index=X_test)
for i in range(5):
    print('\nComponent {}:'.format(i))
    print(sents_by_component_test.loc[:,i].sort_values(ascending=False)[0:10])

Percent variance captured by all components: 60.34728350241241

Component 0:
sotu_sentence
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
( Applause .)    0.999223
Name: 0, dtype: float64

Component 1:
sotu_sentence
That is why my call upon the Congress today is for a high statesmanship , so that in the years to come Americans will look back and say because it withstood the intense pressures of a political year , and achieved such great good for the American people and for the future of this Nation , this was truly a great Congress .                                                                                                                                                                                           0.387325
I thank the Vice President for his leadership and the Congress for its support

Component themes for the top 5 components in the test set. Aside from the applause component none of them are the same as the training set. This isn't surprising considering the tfidf scores were done separately. The term and document frequencies for the training and test will be different since the dataset sizes are different. On top of that, since speech timings could affect the content covered, the train/test split altered how the clusters will turn out.

Comp 0: Applause
Comp 1: Strength and action by the nation and congress
Comp 2: ummm 'World Peace!'
Comp 3: people, people, people
Comp 4: statements of togetherness

### Classification Modeling on Test

#### Logistic Regression

In [324]:
X_test_unsup = sents_by_component_test
Y_test_unsup = Y_test.loc[:, 'sotu_president']

In [325]:
lr = LogisticRegression(penalty='l2', multi_class='multinomial', class_weight='balanced')

params = {
    'solver':Categorical(['newton-cg', 'sag', 'saga', 'lbfgs']),
    'C':Real(0.001, 1, 'uniform')
}

opt = BayesSearchCV(
    lr,
    params,
    cv=5,
    n_iter=7,
    random_state=45,
    verbose=1
)

opt.fit(X_test_unsup, Y_test_unsup)
print(opt.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    8.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.3s finished


{'C': 0.6579478988568594, 'solver': 'newton-cg'}


In [326]:
lr = LogisticRegression(
    penalty='l2', 
    multi_class='multinomial', 
    class_weight='balanced',
    C=0.657,
    solver='newton-cg'
)

lr.fit(X_test_unsup, Y_test_unsup)

Y_test_pred_unsup_mcl = lr.predict(X_test_unsup)

print('__________Training Statistics__________')
print(confusion_matrix(Y_test_unsup, Y_test_pred_unsup_mcl))
print(classification_report(Y_test_unsup, Y_test_pred_unsup_mcl))

__________Training Statistics__________
[[126  19  27   6  23  14  23  14  19   9   8]
 [  1  79   2   2   8   1   2   4   3   2   2]
 [ 56  34 290  17  25  45  46  26  39  30  21]
 [ 22  21  13 229  21   9  24  46  20  21  65]
 [  7   7   2   0  96   3   7   7  10   1   7]
 [ 31  28  36  13  26 222  19  16  23  20   9]
 [ 30  28  26  29  26  15 182  31  27  25  29]
 [  1   1   1   4   2   1   5  92   5   0   3]
 [  5   7   4   6   4   3   8   6 114   8   7]
 [ 28  31  41  24  30  11  22  21  26 142  17]
 [ 11  30  16  69  34  11  29  37  31  13 333]]
              precision    recall  f1-score   support

        Bush       0.40      0.44      0.42       288
      Carter       0.28      0.75      0.40       106
     Clinton       0.63      0.46      0.53       629
  Eisenhower       0.57      0.47      0.51       491
        Ford       0.33      0.65      0.43       147
      GWBush       0.66      0.50      0.57       443
     Johnson       0.50      0.41      0.45       448
     Kenn

The multiclass prediction of the speaker was less consistent than the training set and overall performed worse. This does make sense considering the component score for each sentence is different and derived from different tfidf scores. It makes sense that the training set overfits to the speaker in this case when you consider that the frequencies of each word will be different from training to test.

If the tfidf scores and sentence vectors were done on the whole dataset and then broken into training and test, then I think the multiclass prediction and clusters would have been more stable. Let's check on the party affiliations before moving onto solely supervised techniques for speaker and party predictions.

In [327]:
X_test_unsup = sents_by_component_test
Y_test_unsup = Y_test.loc[:, 'pol_party']

lr = LogisticRegression(penalty='l1')

lr.fit(X_test_unsup, Y_test_unsup)

Y_test_pred_unsup = lr.predict(X_test_unsup)

print('__________Training Statistics__________')
print(confusion_matrix(Y_test_unsup, Y_test_pred_unsup))
print(classification_report(Y_test_unsup, Y_test_pred_unsup))



__________Training Statistics__________
[[1325  587]
 [ 638 1296]]
              precision    recall  f1-score   support

         DEM       0.67      0.69      0.68      1912
         REP       0.69      0.67      0.68      1934

   micro avg       0.68      0.68      0.68      3846
   macro avg       0.68      0.68      0.68      3846
weighted avg       0.68      0.68      0.68      3846



Precision and recall are very similar to training and shows no overfit. I think the party prediction works better than the speaker prediction because there isn't a class imbalance, party policy lines and speaking points have remained relatively consistent over time, and a presidents ability to please their party probably takes over a good majority of their SOTU address. 

## Classification Modeling sans clusters

I think the next step is to try predicting the political affiliations without the SVD clusters and latent semantic analysis (LSA). Considering the lemma, POS, and punc frequency worked so well with identifying Clinton of GWB in an earlier analysis of the same dataset, I want to try it with the larger dataset to predict political affiliations. 

### Data Loading

Some presidents have multiple speeches recorded in the same year for various reasons. GWBush did after the September 11th attacks, LBJ gave a special session talk for equal voting rights, and Bush Sr. gave a special session talk on the end of the Gulf War. While these aren't technically State of The Union (SOTU) addresses, they do represent the presidents personal affiliations and are representative of how they handled their time in office.

Due to the total size of the dataset being > 1.5 million words, I will likely need to take subsets of the speeches for computational resource considerations. As well, I will need to balance the classes some and make sure each president is properly represented in the dataset. At the end of the day we are trying to group presidents out based on their words, not if they were removed from office & the VP took over with only 3 years to speek (looking at you Gerald & Nixon).

In [34]:
truman = speech_load('Truman', range(1945, 1951, 1))
ike = speech_load('Eisenhower', range(1953, 1960, 1))
jfk = speech_load('Kennedy', range(1961, 1963, 1))

lbj = speech_load('Johnson', [1964, 1966, 1967, 1968, 1969])
#odd year with 2 speeches
lbj += state_union.raw('1965-Johnson-1.txt')
lbj += state_union.raw('1965-Johnson-2.txt')
print('Johnson: {} words'.format(len(lbj)))

nixon = speech_load('Nixon', range(1970, 1974, 1))
ford = speech_load('Ford', range(1975, 1977, 1))
carter = speech_load('Carter', range(1978, 1980, 1))
reagan = speech_load('Reagan', range(1981, 1988, 1))

bush = speech_load('Bush', [1989, 1990, 1992])
bush += state_union.raw('1991-Bush-1.txt')
bush += state_union.raw('1991-Bush-2.txt')
print('Bush: {} words'.format(len(bush)))

clinton = speech_load('Clinton', range(1993, 2000, 1))

gwb = speech_load('GWBush', range(2002, 2006, 1))
gwb += state_union.raw('2001-GWBush-1.txt')
gwb += state_union.raw('2001-GWBush-2.txt')
print('GWBush: {} words'.format(len(gwb)))

Truman: 301567 words
Eisenhower: 266595 words
Kennedy: 75014 words
Johnson: 144866 words
Johnson: 190589 words
Nixon: 83599 words
Ford: 54372 words
Carter: 46053 words
Reagan: 188406 words
Bush: 78229 words
Bush: 117973 words
Clinton: 293677 words
GWBush: 119011 words
GWBush: 163861 words


### Data Cleaning & Parsing

In [35]:
#cleaning all the speeches text & removing the titles
truman = text_cleaner(truman)
ike = text_cleaner(ike)
jfk = text_cleaner(jfk)
lbj = text_cleaner(lbj)
nixon = text_cleaner(nixon)
ford = text_cleaner(ford)
carter = text_cleaner(carter)
reagan = text_cleaner(reagan)
bush = text_cleaner(bush)
clinton = text_cleaner(clinton)
gwb = text_cleaner(gwb)

In [36]:
#This cell could take a couple minutes to run
truman_doc = nlp(truman)
ike_doc = nlp(ike)
jfk_doc = nlp(jfk)
lbj_doc = nlp(lbj)
nixon_doc = nlp(nixon)
ford_doc = nlp(ford)
carter_doc = nlp(carter)
reagan_doc = nlp(reagan)
bush_doc = nlp(bush)
clinton_doc = nlp(clinton)
gwb_doc = nlp(gwb)

### Sentence Grouping for Token Analysis

In [96]:
truman_sents = [[sents, 'Truman'] for sents in truman_doc.sents]
ike_sents = [[sents, 'Eisenhower'] for sents in ike_doc.sents]
jfk_sents = [[sents, 'Kennedy'] for sents in jfk_doc.sents]
lbj_sents = [[sents, 'Johnson'] for sents in lbj_doc.sents]
nixon_sents = [[sents, 'Nixon'] for sents in nixon_doc.sents]
ford_sents = [[sents, 'Ford'] for sents in ford_doc.sents]
carter_sents = [[sents, 'Carter'] for sents in carter_doc.sents]
reagan_sents = [[sents, 'Reagan'] for sents in reagan_doc.sents]
bush_sents = [[sents, 'Bush'] for sents in bush_doc.sents]
clinton_sents = [[sents, 'Clinton'] for sents in clinton_doc.sents]
gwb_sents = [[sents, 'GWBush'] for sents in gwb_doc.sents]

print('Truman Sentences:', len(truman_sents))
print('Ike Sentences:', len(ike_sents))
print('JFK Sentences:', len(jfk_sents))
print('LBJ Sentences:', len(lbj_sents))
print('Nixon Sentences:', len(nixon_sents))
print('Ford Sentences:', len(ford_sents))
print('Carter Sentences:', len(carter_sents))
print('Reagan Sentences:', len(reagan_sents))
print('Bush Sentences:', len(bush_sents))
print('Clinton Sentences:', len(clinton_sents))
print('GWBush Sentences:', len(gwb_sents))

Truman Sentences: 2353
Ike Sentences: 2080
JFK Sentences: 511
LBJ Sentences: 1715
Nixon Sentences: 648
Ford Sentences: 505
Carter Sentences: 415
Reagan Sentences: 1680
Bush Sentences: 1268
Clinton Sentences: 2689
GWBush Sentences: 1577


In [None]:
# test code to randomly select 500 sentences from each presidents corpus
# decided to keep just the first 500 so that each pres would be compared 
# during their same seniority level in office

#test = np.asarray(truman_sents)[:,::2].flatten() 
#test = np.random.choice(test, 500)
#test

In [330]:
#combining sentences and labels into a df and limiting
sentences_long = pd.DataFrame(
    truman_sents + 
    ike_sents + 
    jfk_sents + 
    lbj_sents + 
    nixon_sents +
    ford_sents +
    carter_sents +
    reagan_sents +
    bush_sents +
    clinton_sents +
    gwb_sents,
    columns=['sentence', 'president']
)

#confirming class balance dictated by speeker, Carter is low, but wanted to retain
#as much info as possible
sentences_long.president.value_counts()

Clinton       2689
Truman        2353
Eisenhower    2080
Johnson       1715
Reagan        1680
GWBush        1577
Bush          1268
Nixon          648
Kennedy        511
Ford           505
Carter         415
Name: president, dtype: int64

### Handling Class Imbalance (?)

In [341]:
#combining sentences and labels into a df and limiting
sentences = pd.DataFrame(
    truman_sents[:500] + 
    ike_sents[:500] + 
    jfk_sents[:500] + 
    lbj_sents[:500] + 
    nixon_sents[:500] +
    ford_sents[:500] +
    carter_sents +
    reagan_sents[:500] +
    bush_sents[:500] +
    clinton_sents[:500] +
    gwb_sents[:500],
    columns=['sentence', 'president']
)

#confirming class balance dictated by speeker, Carter is low, but wanted to retain
#as much info as possible
sentences.president.value_counts()

Bush          500
Eisenhower    500
Truman        500
GWBush        500
Ford          500
Kennedy       500
Reagan        500
Johnson       500
Clinton       500
Nixon         500
Carter        415
Name: president, dtype: int64

### Word Counts, POS, & Punc Counts

For the sentences I am using the shortened dataset so that the authors are balanced and for computational reasons. Looking for the lemmas in the sentence and then counting them in the sentence is not a fast task given the size of the common words list as well as the length of the sentences in the speeches. Since we are dealing with transcriptions of SOTU addresses, there are some major run-on sentences. Also, the speech writing uses a single or double dash (- or --) to indicate a pause or a break in the speach, which I replaced with a blank space during cleaning. That replacement could have used a period in some cases, but with the size of the raw dataset, it would be difficult to come up with a concrete rule for the re.sub script to follow.

In [337]:
truman_bow = bag_of_words(truman_doc)
ike_bow = bag_of_words(ike_doc)
jfk_bow = bag_of_words(jfk_doc)
lbj_bow = bag_of_words(lbj_doc)
nixon_bow = bag_of_words(nixon_doc)
ford_bow = bag_of_words(ford_doc)
carter_bow = bag_of_words(carter_doc)
reagan_bow = bag_of_words(reagan_doc)
bush_bow = bag_of_words(bush_doc)
clinton_bow = bag_of_words(clinton_doc)
gwb_bow = bag_of_words(gwb_doc)

common_words = set(
    truman_bow + 
    ike_bow + 
    jfk_bow + 
    lbj_bow + 
    nixon_bow + 
    ford_bow + 
    carter_bow +
    reagan_bow +
    bush_bow +
    clinton_bow +
    gwb_bow
)

len(common_words)

3084

In [342]:
word_count = bow_features(sentences, common_words)
word_count.head()

Processing row 0
Processing row 10
Processing row 20
Processing row 30
Processing row 40
Processing row 50
Processing row 60
Processing row 70
Processing row 80
Processing row 90
Processing row 100
Processing row 110
Processing row 120
Processing row 130
Processing row 140
Processing row 150
Processing row 160
Processing row 170
Processing row 180
Processing row 190
Processing row 200
Processing row 210
Processing row 220
Processing row 230
Processing row 240
Processing row 250
Processing row 260
Processing row 270
Processing row 280
Processing row 290
Processing row 300
Processing row 310
Processing row 320
Processing row 330
Processing row 340
Processing row 350
Processing row 360
Processing row 370
Processing row 380
Processing row 390
Processing row 400
Processing row 410
Processing row 420
Processing row 430
Processing row 440
Processing row 450
Processing row 460
Processing row 470
Processing row 480
Processing row 490
Processing row 500
Processing row 510
Processing row 520
Proc

Processing row 4160
Processing row 4170
Processing row 4180
Processing row 4190
Processing row 4200
Processing row 4210
Processing row 4220
Processing row 4230
Processing row 4240
Processing row 4250
Processing row 4260
Processing row 4270
Processing row 4280
Processing row 4290
Processing row 4300
Processing row 4310
Processing row 4320
Processing row 4330
Processing row 4340
Processing row 4350
Processing row 4360
Processing row 4370
Processing row 4380
Processing row 4390
Processing row 4400
Processing row 4410
Processing row 4420
Processing row 4430
Processing row 4440
Processing row 4450
Processing row 4460
Processing row 4470
Processing row 4480
Processing row 4490
Processing row 4500
Processing row 4510
Processing row 4520
Processing row 4530
Processing row 4540
Processing row 4550
Processing row 4560
Processing row 4570
Processing row 4580
Processing row 4590
Processing row 4600
Processing row 4610
Processing row 4620
Processing row 4630
Processing row 4640
Processing row 4650


Unnamed: 0,-pron-,passage,favor,prosper,assault,analysis,ease,individual,worthy,prior,...,neutrality,issue,servant,unhindered,executive,operation,saddle,scholarship,speech_sentence,speaking_president
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(A, April, 16, ,, 1945, Mr., Speaker, ,, Mr., ...",Truman
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Only, yesterday, ,, we, laid, to, rest, the, ...",Truman
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(At, a, time, like, this, ,, words, are, inade...",Truman
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(The, most, eloquent, tribute, would, be, a, r...",Truman
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Yet, ,, in, this, decisive, hour, ,, when, wo...",Truman


In [343]:
truman_pos = bag_of_pos(truman_doc)
ike_pos = bag_of_pos(ike_doc)
jfk_pos = bag_of_pos(jfk_doc)
lbj_pos = bag_of_pos(lbj_doc)
nixon_pos = bag_of_pos(nixon_doc)
ford_pos = bag_of_pos(ford_doc)
carter_pos = bag_of_pos(carter_doc)
reagan_pos = bag_of_pos(reagan_doc)
bush_pos = bag_of_pos(bush_doc)
clinton_pos = bag_of_pos(clinton_doc)
gwb_pos = bag_of_pos(gwb_doc)

common_pos = set(
    truman_pos +
    ike_pos +
    jfk_pos +
    lbj_pos +
    nixon_pos +
    ford_pos +
    carter_pos +
    reagan_pos +
    bush_pos +
    clinton_pos +
    gwb_pos
)

print(len(common_pos))

16


In [344]:
pos = pos_features(sentences['sentence'], common_pos)

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450
Processing row 500
Processing row 550
Processing row 600
Processing row 650
Processing row 700
Processing row 750
Processing row 800
Processing row 850
Processing row 900
Processing row 950
Processing row 1000
Processing row 1050
Processing row 1100
Processing row 1150
Processing row 1200
Processing row 1250
Processing row 1300
Processing row 1350
Processing row 1400
Processing row 1450
Processing row 1500
Processing row 1550
Processing row 1600
Processing row 1650
Processing row 1700
Processing row 1750
Processing row 1800
Processing row 1850
Processing row 1900
Processing row 1950
Processing row 2000
Processing row 2050
Processing row 2100
Processing row 2150
Processing row 2200
Processing row 2250
Processing row 2300
Processing row 2350
Processing row 2400
Processing row 2450
Processing row 2500
Pro

In [345]:
master_df = pd.concat([word_count, pos], axis=1)
master_df.drop('text_sentence', inplace=True, axis=1)
master_df = master_df.reset_index(drop=True)

In [347]:
master_df['punc_count'] = pd.Series(amount_punc(i) for i in master_df['speech_sentence'])

### Modeling

#### Multiclass Prediction of President

In [383]:
X = master_df.drop(['speech_sentence', 'speaking_president', 'pol_party'], axis=1)
Y = master_df.loc[:, 'speaking_president']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=86)

print(len(X_train))
print(len(X_test))

4061
1354


In [360]:
# going with logistic regression again so that we can compare
# unsupervised component scores to normal features

#hard coding multinomial and penalty due to solver restrictions for multinomial models
lr = LogisticRegression(penalty='l2', multi_class='multinomial', tol=1e-3)

params = {
    'C':Real(0.001, 1, 'uniform'),
    'solver':Categorical(['newton-cg', 'sag', 'saga', 'lbfgs'])
}

opt = BayesSearchCV(
    lr,
    params,
    cv=5,
    n_iter=5,
    random_state=645,
    verbose=1
)

opt.fit(X_train, Y_train)
print(opt.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  8.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 17.8min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   19.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   59.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.1min finished


{'C': 0.6048282012084276, 'solver': 'newton-cg'}


In [384]:
lr = LogisticRegression(
    penalty='l2',
    multi_class='multinomial',
    C=0.604,
    solver='newton-cg'
)

lr.fit(X_train, Y_train)

Y_train_pred_mcl = lr.predict(X_train)
Y_test_pred_mcl = lr.predict(X_test)

print('__________Training Statistics__________\n')
print(confusion_matrix(Y_train, Y_train_pred_mcl))
print(classification_report(Y_train, Y_train_pred_mcl))

print('\n__________Test Statistics__________\n')
print(confusion_matrix(Y_test, Y_test_pred_mcl))
print(classification_report(Y_test, Y_test_pred_mcl))

__________Training Statistics__________

[[311   4  16   4   5   6   4   3   5   6   4]
 [ 14 245  14   7   7   7   5   3   9   4   7]
 [ 27   4 300   0  10   5   1   2   6   4   5]
 [ 11   7   6 326   5   2  11   0   3   5  12]
 [ 18   3  12   4 306   5   6   2   9   6   6]
 [ 31   3   4   3   5 304   6   0   2   4   5]
 [ 22   3  10   6   7  10 293   5  10   6   6]
 [ 27   3   4   6   2   4   7 316   3   3   4]
 [ 13   6   7   2   8   6   3   5 317   5   9]
 [ 22   6  18   6   5   6   5   4   7 278   3]
 [ 15   6   4  11   5   3   5   4   6   1 317]]
              precision    recall  f1-score   support

        Bush       0.61      0.85      0.71       368
      Carter       0.84      0.76      0.80       322
     Clinton       0.76      0.82      0.79       364
  Eisenhower       0.87      0.84      0.85       388
        Ford       0.84      0.81      0.82       377
      GWBush       0.85      0.83      0.84       367
     Johnson       0.85      0.78      0.81       378
     Ken

Overfit, overfit, overfit. There is a problem with class imbalance and there is also a problem with how that was handled. By diminishing the number of rows to just the first 500, there is a chance that there are words in the training set that are never used in the test set, or vice versa. If the test said includes things not seen in training then it will not be able to predict the speaker very well. I think that is the issue with this dataset. More lines and a more sophisticated way of handling class imbalance are key here.

With that in mind, the raw dataset includes 1.5 million words that were distilled down to the most common ~3000. That means I am only using 0.2% if the potential data to build these models (that 1.5 mil does include stop words), which is just to say there is a tremendous amount of potential here. 

However, when comparing these results to the unsupervised approach earlier, our training scores were much higher. Yes, the overfit does hurt, but I think there is more promise with this dataset when going the more conventional feature generation route. 

Let's see if we can do better predicting party affiliations.

#### Binary Classifier of Political Party

In [362]:
#assigning political party affiliations to each president
party = {
    'Truman':'DEM', 
    'Eisenhower':'REP', 
    'Kennedy':'DEM', 
    'Johnson':'DEM', 
    'Nixon':'REP', 
    'Ford':'REP', 
    'Carter':'DEM', 
    'Reagan':'REP', 
    'Bush':'REP', 
    'Clinton':'DEM', 
    'GWBush':'REP'
}
for pres in party:
    master_df.loc[master_df['speaking_president'] == pres, 'pol_party'] = party[pres]

#checking for class balance    
master_df['pol_party'].value_counts()

REP    3000
DEM    2415
Name: pol_party, dtype: int64

In [364]:
X = master_df.drop(['speech_sentence', 'speaking_president', 'pol_party'], axis=1)
Y = master_df.loc[:, 'pol_party']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=86)

print(len(X_train))
print(len(X_test))

4061
1354


In [365]:
lr = LogisticRegression()

params = {
    'penalty':Categorical(['l1', 'l2']),
    'C':Real(0.01, 1, 'uniform'),
    'solver':Categorical(['saga', 'liblinear']), #only these solvers can handle both penalties
    
}

opt = BayesSearchCV(
    lr,
    params,
    cv=5,
    n_iter=7,
    random_state=645,
    verbose=1
)

opt.fit(X_train, Y_train)
print(opt.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    8.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.4min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.8min finished


{'C': 0.7219406141637149, 'penalty': 'l2', 'solver': 'liblinear'}


In [366]:
lr = LogisticRegression(
    penalty='l2',
    C=0.7219,
    solver='liblinear'
)

lr.fit(X_train, Y_train)

Y_train_pred = lr.predict(X_train)
Y_test_pred = lr.predict(X_test)

print('__________Training Statistics__________\n')
print(confusion_matrix(Y_train, Y_train_pred))
print(classification_report(Y_train, Y_train_pred))

print('\n__________Test Statistics__________\n')
print(confusion_matrix(Y_test, Y_test_pred))
print(classification_report(Y_test, Y_test_pred))

__________Training Statistics__________

[[1377  443]
 [ 218 2023]]
              precision    recall  f1-score   support

         DEM       0.86      0.76      0.81      1820
         REP       0.82      0.90      0.86      2241

   micro avg       0.84      0.84      0.84      4061
   macro avg       0.84      0.83      0.83      4061
weighted avg       0.84      0.84      0.84      4061


__________Test Statistics__________

[[326 269]
 [202 557]]
              precision    recall  f1-score   support

         DEM       0.62      0.55      0.58       595
         REP       0.67      0.73      0.70       759

   micro avg       0.65      0.65      0.65      1354
   macro avg       0.65      0.64      0.64      1354
weighted avg       0.65      0.65      0.65      1354



Overfit isn't as bad for the party predictions and precision and recall are both looking good. There is something to say in regards to party politics over the last 60 years, these presidents really stuck to their talking points! It is apparent that the most common words used in the SOTU addresses can be traced back to the two political parties even when using a subset of the data. Also, considering that the common_word list only retained words that were present in all of the speeches and I limited them to only the top words used, means that party divisions and political affiliations are at the forefront of these addresses.

It's very telling that the party prediction was so effective for each of the sentences. Given the inclusion of the whole dataset and a much larger set of words, the overfit issue shown here will likely be less of a problem. There is a 160 sentence swing between the DEM sentences and the Rep sentences in the test set. When the test set is only 1300 sentences long, that means the imbalance is >10%.

## Conclusion's of Unsupervised vs. Supervised

Even though there is a lot of data from their speeches missing in comparison to the unsupervised models, the performance is still better for the normal supervised approaches. A future iteration of this project should be to build a word2vec model and use that to look for similarity between current day politicians/presidents and compare them back to the past. It would be interesting to see if there are any political similarities outside of the model once a connection is made. 