# Presidential Speeches: Topic Modeling
This notebook is used for creating topics for the text on 991 Presidental speeches that span all US Presidents from George Washington to Donald Trump mid-term 2019.

This notebook is building off EDA, Pre-Prossesing & Sentiment work done in `potus_speech_eda_sentiment.ipynb` found in the same folder as this worksheet.

### Setup
Install modules and import libraries to run this notebook >

In [308]:
# install gensim if needed by un-hashing and running:
# pip install --upgrade gensim

In [579]:
import pickle

import pandas as pd
import numpy as np

from gensim import matutils, models
import scipy.sparse

from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF

import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import TreebankWordTokenizer


In [580]:
# download out-of-nltk-box stop words -- they'll be used later
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Get data for analysis

In [581]:
# open pickle of file of with clean Presidential speech transcripts'
#  in the document-term matrix

with open('pickle/transcript_cv_dtm.pickle','rb') as read_file:
    transcripts_dtm = pickle.load(read_file)

-
### Topic Modeling: Attempt #1
**LDA with Count Vectorizer.**</br>
First basic attempt at topic modeling to set baseline of what we'll need to likely improve upon.

In [582]:
# make the document-term matrix into a term-document matrix by transposing it

transcripts_tdm = transcripts_dtm.T

In [583]:
# verify shape correct

transcripts_tdm.shape

(37928, 991)

In [584]:
# Turn document-term matrix into gensim format to work with LDS model

sparse_counts = scipy.sparse.csr_matrix(transcripts_tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [585]:
# Open previously pickled cv to create dictionary for gensim

with open('pickle/cv_stop.pickle','rb') as read_file:
    cv = pickle.load(read_file)

In [586]:
# create dictionary for gensim

id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [587]:
# LDA for num_topics = 2

lda1_2 = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=20)
lda1_2.print_topics()

[(0,
  '0.012*"states" + 0.011*"government" + 0.008*"united" + 0.006*"congress" + 0.005*"public" + 0.005*"country" + 0.004*"people" + 0.004*"shall" + 0.004*"great" + 0.004*"state"'),
 (1,
  '0.010*"people" + 0.007*"world" + 0.005*"american" + 0.005*"new" + 0.005*"president" + 0.005*"america" + 0.005*"years" + 0.004*"time" + 0.004*"peace" + 0.004*"country"')]

In [588]:
# LDA for num_topics = 3

lda1_3 = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=20)
lda1_3.print_topics()

[(0,
  '0.010*"people" + 0.007*"world" + 0.006*"president" + 0.005*"new" + 0.005*"american" + 0.005*"america" + 0.005*"years" + 0.005*"time" + 0.004*"peace" + 0.004*"know"'),
 (1,
  '0.015*"states" + 0.011*"government" + 0.010*"united" + 0.007*"congress" + 0.005*"state" + 0.005*"public" + 0.005*"shall" + 0.005*"country" + 0.004*"people" + 0.004*"great"'),
 (2,
  '0.009*"government" + 0.005*"congress" + 0.005*"states" + 0.005*"great" + 0.004*"year" + 0.004*"law" + 0.004*"people" + 0.004*"country" + 0.004*"american" + 0.004*"united"')]

In [None]:
# LDA for num_topics = 6
# Uping to 6 and doing 50 passes to give this the best chance possible

lda1_4 = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=6, passes=50)
lda1_4.print_topics()

-
### Topic Modeling: Attempt #2
**LDA with Count Vectorizer with more tuned parameters.**</br>
Lemmatizing the text.  Adding stop words.  Using more parameters for vectorization to tune output.

In [None]:
# Create a function to lemmatize the text

def lem_text(text, tokenizer, stemmer):
    cleaned_text = []
    for word in text:
        cleaned_words = []
        for word in tokenizer.tokenize(word):
            stem_word = stemmer.lemmatize(word)
            cleaned_words.append(stem_word)
        cleaned_text.append(' '.join(cleaned_words))
    return cleaned_text

In [507]:
# Lemmatize the cleaned transcripts

transcripts_clean_rd2_lem = lem_text(transcripts_clean_rd2.Transcript, TreebankWordTokenizer(), WordNetLemmatizer())


In [508]:
# Additional stop words that I want to remove to improve results of my topics
#  note words continuously were added to this list after seeing results from topics

add_stop_words = ['government','know','want','thats','mr',
                  'going','year','make','shall','let', 'subject',
                  'say', 'think','way','president','said'
                 ]

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

In [509]:
# Get clean text

with open('pickle/transcripts_clean_rd2.pickle','rb') as read_file:
    transcripts_clean_rd2 = pickle.load(read_file)

For this LDA model attempt, we're still using CountVectorizer, but will add extra parameters to:
- add new stop words
- set max df to get rid of too commonly used words
- set min df to get rid of words not used often enough to impact topics

In [510]:
# Create a new document-term matrix using new parameters

cv2 = CountVectorizer(stop_words=stop_words, min_df=.1, max_df=.8)
transcript_cv2 = cv2.fit_transform(transcripts_clean_rd2_lem)
transcript_cv_dtm_v2 = pd.DataFrame(transcript_cv2.toarray(), columns=cv2.get_feature_names())
transcript_cv_dtm_v2.head()

Unnamed: 0,abandon,ability,able,abroad,absence,absolute,absolutely,abundant,abuse,accept,...,worth,worthy,written,wrong,yes,yesterday,yield,york,young,youre
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,1,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,1,0


In [511]:
# Create the gensim corpus
corpus2 = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(transcript_cv_dtm_v2.T))

# Create the vocabulary dictionary
id2word2 = dict((v, k) for k, v in cv2.vocabulary_.items())

In [512]:
# LDA for num_topics = 2

lda2_2 = models.LdaModel(corpus=corpus2, id2word=id2word2, num_topics=2, passes=20)
lda2_2.print_topics()

[(0,
  '0.014*"american" + 0.012*"world" + 0.009*"america" + 0.008*"new" + 0.007*"peace" + 0.006*"war" + 0.006*"work" + 0.006*"right" + 0.005*"life" + 0.005*"today"'),
 (1,
  '0.010*"congress" + 0.010*"law" + 0.008*"power" + 0.007*"public" + 0.006*"act" + 0.006*"duty" + 0.005*"present" + 0.005*"war" + 0.005*"right" + 0.005*"citizen"')]

In [513]:
# LDA for num_topics = 3

lda2_3 = models.LdaModel(corpus=corpus2, id2word=id2word2, num_topics=3, passes=10)
lda2_3.print_topics()

[(0,
  '0.011*"congress" + 0.010*"law" + 0.009*"power" + 0.008*"public" + 0.007*"act" + 0.007*"duty" + 0.006*"constitution" + 0.006*"citizen" + 0.006*"right" + 0.006*"war"'),
 (1,
  '0.014*"american" + 0.012*"world" + 0.009*"america" + 0.008*"new" + 0.007*"peace" + 0.006*"war" + 0.006*"right" + 0.006*"work" + 0.005*"life" + 0.005*"today"'),
 (2,
  '0.008*"congress" + 0.008*"law" + 0.006*"american" + 0.006*"public" + 0.005*"work" + 0.005*"service" + 0.005*"national" + 0.005*"business" + 0.004*"condition" + 0.004*"department"')]

In [514]:
# LDA for num_topics = 4

lda2_4 = models.LdaModel(corpus=corpus2, id2word=id2word2, num_topics=4, passes=20)
lda2_4.print_topics()

[(0,
  '0.013*"law" + 0.012*"power" + 0.010*"constitution" + 0.010*"congress" + 0.009*"right" + 0.008*"act" + 0.008*"duty" + 0.007*"citizen" + 0.006*"public" + 0.006*"war"'),
 (1,
  '0.015*"american" + 0.010*"america" + 0.009*"new" + 0.009*"work" + 0.008*"job" + 0.007*"tax" + 0.007*"congress" + 0.006*"right" + 0.006*"need" + 0.006*"child"'),
 (2,
  '0.010*"congress" + 0.008*"public" + 0.008*"law" + 0.006*"present" + 0.005*"service" + 0.005*"department" + 0.005*"power" + 0.005*"duty" + 0.004*"condition" + 0.004*"act"'),
 (3,
  '0.018*"world" + 0.013*"peace" + 0.012*"american" + 0.012*"war" + 0.008*"america" + 0.008*"new" + 0.007*"men" + 0.007*"freedom" + 0.007*"force" + 0.006*"life"')]

In [221]:
# LDA for num_topics = 5

lda2_5 = models.LdaModel(corpus=corpus2, id2word=id2word2, num_topics=5, passes=20)
lda2_5.print_topics()

[(0,
  '0.008*"congress" + 0.007*"law" + 0.006*"public" + 0.006*"power" + 0.005*"act" + 0.004*"duty" + 0.004*"present" + 0.004*"citizen" + 0.004*"war" + 0.003*"treaty"'),
 (1,
  '0.009*"law" + 0.006*"men" + 0.005*"right" + 0.005*"power" + 0.004*"public" + 0.004*"business" + 0.004*"work" + 0.004*"american" + 0.004*"constitution" + 0.003*"man"'),
 (2,
  '0.013*"president" + 0.006*"congress" + 0.006*"american" + 0.004*"world" + 0.004*"problem" + 0.004*"program" + 0.004*"policy" + 0.004*"new" + 0.004*"national" + 0.004*"federal"'),
 (3,
  '0.011*"american" + 0.010*"america" + 0.007*"new" + 0.007*"world" + 0.005*"work" + 0.005*"right" + 0.005*"job" + 0.004*"president" + 0.004*"child" + 0.004*"way"'),
 (4,
  '0.012*"world" + 0.012*"peace" + 0.011*"war" + 0.006*"american" + 0.006*"men" + 0.006*"force" + 0.005*"freedom" + 0.005*"free" + 0.005*"new" + 0.004*"right"')]

#### Takeaways from Attempt #2 LDA w/CV parameter models:
- **lda2_4**: Seems the easiest to draw themes from.  With more (5 topics) starts to get a bit more abstract.  This will be considered for final topic selection.  Updated to run with 30 passes to refine results.

-
### Topic Modeling: Attempt #3
**LDA with TFIDF with more tuned parameters.**</br>
Lemmatizing the text.  Adding stop words.  Using more parameters for vectorization to tune output.

In [285]:
# Create a new document-term matrix using TFIDF
#   using lemmatized words as done in Attempt #2

tf1 = TfidfVectorizer(stop_words=stop_words, min_df=.1, max_df=.8)
transcript_tf1 = tf1.fit_transform(transcripts_clean_rd2_lem)
transcript_cv_dtm_tf1 = pd.DataFrame(transcript_tf1.toarray(), columns=tf1.get_feature_names())
transcript_cv_dtm_tf1.head()

Unnamed: 0,abandon,ability,able,abroad,absence,absolute,absolutely,abundant,abuse,accept,...,worth,worthy,written,wrong,yes,yesterday,yield,york,young,youre
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.051571,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.064086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.030715,0.037082,0.0,0.0,0.0,0.11377,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.044687,0.0,0.0,0.0,0.0,0.039271,0.031724,0.0


In [255]:
# Create the gensim corpus
corpus3 = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(transcript_cv_dtm_tf1.T))

# Create the vocabulary dictionary
id2word3 = dict((v, k) for k, v in tf1.vocabulary_.items())

In [259]:
# LDA for num_topics = 2

tf1_2 = models.LdaModel(corpus=corpus3, id2word=id2word3, num_topics=2, passes=20)
tf1_2.print_topics()

[(0,
  '0.007*"american" + 0.007*"world" + 0.006*"america" + 0.004*"new" + 0.004*"peace" + 0.004*"president" + 0.004*"war" + 0.004*"today" + 0.004*"life" + 0.003*"work"'),
 (1,
  '0.006*"law" + 0.005*"congress" + 0.004*"public" + 0.004*"power" + 0.004*"duty" + 0.004*"constitution" + 0.003*"treaty" + 0.003*"act" + 0.003*"citizen" + 0.003*"present"')]

#### Takeaways:
Looking very similiar to the CountVectorizer options, but we'll try higher number of topics to see if that changes

In [260]:
# LDA for num_topics = 4

tf1_4 = models.LdaModel(corpus=corpus3, id2word=id2word3, num_topics=4, passes=20)
tf1_4.print_topics()

[(0,
  '0.001*"law" + 0.001*"president" + 0.001*"indian" + 0.001*"said" + 0.001*"congress" + 0.001*"territory" + 0.001*"person" + 0.001*"jurisdiction" + 0.001*"american" + 0.001*"power"'),
 (1,
  '0.001*"president" + 0.001*"america" + 0.001*"day" + 0.001*"oath" + 0.001*"peace" + 0.001*"person" + 0.001*"world" + 0.001*"american" + 0.001*"act" + 0.001*"proclamation"'),
 (2,
  '0.007*"american" + 0.007*"world" + 0.006*"america" + 0.004*"new" + 0.004*"peace" + 0.004*"president" + 0.004*"war" + 0.004*"today" + 0.004*"life" + 0.003*"freedom"'),
 (3,
  '0.006*"law" + 0.005*"congress" + 0.005*"public" + 0.004*"power" + 0.004*"duty" + 0.004*"constitution" + 0.003*"act" + 0.003*"treaty" + 0.003*"citizen" + 0.003*"present"')]

#### Takeaways:
Definitely now more distinction from what came out of CountVectorizer options, although numbers all seem very low except for a few.  Not certain this will be best option, but will continue to explore.  Let's try taking it down to 3 topics for a middle-ground.

In [263]:
# LDA for num_topics = 3

tf1_3 = models.LdaModel(corpus=corpus3, id2word=id2word3, num_topics=3, passes=20)
tf1_3.print_topics()

[(0,
  '0.001*"president" + 0.001*"day" + 0.001*"american" + 0.001*"prayer" + 0.001*"peace" + 0.001*"public" + 0.001*"religious" + 0.001*"america" + 0.001*"right" + 0.001*"men"'),
 (1,
  '0.007*"american" + 0.007*"world" + 0.006*"america" + 0.004*"new" + 0.004*"peace" + 0.004*"president" + 0.004*"war" + 0.004*"today" + 0.004*"life" + 0.003*"work"'),
 (2,
  '0.006*"law" + 0.005*"congress" + 0.005*"public" + 0.004*"power" + 0.004*"duty" + 0.004*"constitution" + 0.004*"act" + 0.004*"treaty" + 0.003*"citizen" + 0.003*"present"')]

#### Takeaways:
Still pretty low perentages and also don't see as clear of topics coming out to this.  Will not likely use models that came out of Attempt #3 with TFIDF.

-
### Topic Modeling: Attempt #4
**LDA with Count Vectorizer with tuned parameters + bigrams**</br>
Keeping parameters used in Attempt #2, but now also adding in option for bigrams.

In [490]:
# Create a new document-term matrix, now adding bigram option

cv3 = CountVectorizer(stop_words=stop_words, ngram_range=(1,2), min_df=.1, max_df=.8)
transcript_cv3 = cv3.fit_transform(transcripts_clean_rd2_lem)
transcript_cv_dtm_v3 = pd.DataFrame(transcript_cv3.toarray(), columns=cv3.get_feature_names())
transcript_cv_dtm_v3.head()

Unnamed: 0,abandon,ability,able,abroad,absence,absolute,absolutely,abundant,abuse,accept,...,worthy,written,wrong,yes,yesterday,yield,york,young,young people,youre
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,1,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,0,0


In [491]:
# Create the gensim corpus
corpus4 = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(transcript_cv_dtm_v3.T))

# Create the vocabulary dictionary
id2word4 = dict((v, k) for k, v in cv3.vocabulary_.items())

In [501]:
# LDA for num_topics = 2

lda4_2 = models.LdaModel(corpus=corpus4, id2word=id2word4, num_topics=2, passes=20)
lda4_2.print_topics()

[(0,
  '0.013*"american" + 0.011*"world" + 0.009*"america" + 0.008*"new" + 0.007*"peace" + 0.006*"war" + 0.006*"work" + 0.006*"right" + 0.005*"life" + 0.005*"today"'),
 (1,
  '0.010*"congress" + 0.009*"law" + 0.007*"power" + 0.007*"public" + 0.006*"act" + 0.006*"duty" + 0.005*"present" + 0.005*"war" + 0.005*"right" + 0.005*"citizen"')]

#### Takeaways:
- No bigrams came out in the topics and looks similar to results without bigrams tried in Attempt #2.  We'll try increasing size of topics to see if that reveals any.

In [290]:
# LDA for num_topics = 4

lda4_4 = models.LdaModel(corpus=corpus4, id2word=id2word4, num_topics=4, passes=20)
lda4_4.print_topics()

[(0,
  '0.014*"law" + 0.013*"constitution" + 0.012*"power" + 0.011*"president" + 0.010*"right" + 0.009*"congress" + 0.009*"act" + 0.007*"union" + 0.007*"duty" + 0.006*"authority"'),
 (1,
  '0.014*"american" + 0.014*"world" + 0.011*"america" + 0.009*"new" + 0.009*"peace" + 0.008*"war" + 0.007*"president" + 0.006*"life" + 0.006*"right" + 0.006*"today"'),
 (2,
  '0.010*"congress" + 0.008*"public" + 0.008*"law" + 0.006*"power" + 0.006*"present" + 0.005*"duty" + 0.005*"war" + 0.005*"treaty" + 0.005*"citizen" + 0.005*"act"'),
 (3,
  '0.009*"american" + 0.009*"congress" + 0.009*"business" + 0.009*"work" + 0.008*"tax" + 0.008*"president" + 0.007*"federal" + 0.006*"need" + 0.006*"national" + 0.006*"program"')]

#### Takeaways:
- Still no bigrams, but is taking a slightly different approach than Attempt #2 for topics.  Could be interesting to test.  Will continue to up topics to see what happens.

In [291]:
# LDA for num_topics = 6

lda4_6 = models.LdaModel(corpus=corpus4, id2word=id2word4, num_topics=6, passes=20)
lda4_6.print_topics()

[(0,
  '0.018*"law" + 0.012*"right" + 0.010*"constitution" + 0.009*"men" + 0.008*"question" + 0.007*"court" + 0.007*"man" + 0.007*"congress" + 0.007*"act" + 0.006*"person"'),
 (1,
  '0.016*"american" + 0.012*"america" + 0.010*"new" + 0.009*"president" + 0.008*"work" + 0.008*"job" + 0.006*"tax" + 0.006*"world" + 0.006*"right" + 0.006*"congress"'),
 (2,
  '0.018*"world" + 0.014*"peace" + 0.013*"war" + 0.011*"american" + 0.009*"president" + 0.007*"force" + 0.007*"freedom" + 0.007*"new" + 0.007*"america" + 0.006*"men"'),
 (3,
  '0.009*"congress" + 0.008*"law" + 0.007*"department" + 0.007*"american" + 0.007*"report" + 0.006*"work" + 0.006*"service" + 0.005*"secretary" + 0.005*"legislation" + 0.005*"increase"'),
 (4,
  '0.011*"congress" + 0.010*"power" + 0.009*"public" + 0.008*"duty" + 0.007*"law" + 0.007*"act" + 0.007*"citizen" + 0.007*"war" + 0.006*"treaty" + 0.006*"present"'),
 (5,
  '0.009*"business" + 0.009*"congress" + 0.008*"national" + 0.008*"public" + 0.006*"law" + 0.006*"federal" +

#### Takeaways:
- Some interesting topics starting to appear here.  A few are a few fuzzy, so we'll try to play around with topic size more -- but this is a strong consideration for topics.

In [499]:
# LDA for num_topics = 7

lda4_7 = models.LdaModel(corpus=corpus4, id2word=id2word4, num_topics=7, passes=30)
lda4_7.print_topics()

[(0,
  '0.018*"law" + 0.015*"constitution" + 0.015*"right" + 0.013*"question" + 0.010*"congress" + 0.009*"house" + 0.008*"election" + 0.008*"slavery" + 0.008*"act" + 0.007*"union"'),
 (1,
  '0.014*"world" + 0.013*"peace" + 0.009*"vietnam" + 0.009*"american" + 0.009*"soviet" + 0.009*"war" + 0.007*"force" + 0.007*"new" + 0.006*"military" + 0.006*"security"'),
 (2,
  '0.019*"world" + 0.016*"war" + 0.013*"american" + 0.012*"peace" + 0.012*"men" + 0.010*"life" + 0.009*"america" + 0.009*"freedom" + 0.007*"right" + 0.007*"free"'),
 (3,
  '0.009*"law" + 0.008*"congress" + 0.008*"work" + 0.007*"business" + 0.006*"service" + 0.006*"public" + 0.006*"national" + 0.006*"american" + 0.005*"commission" + 0.005*"legislation"'),
 (4,
  '0.017*"american" + 0.013*"america" + 0.010*"new" + 0.009*"job" + 0.009*"work" + 0.007*"tax" + 0.007*"child" + 0.007*"world" + 0.006*"help" + 0.006*"need"'),
 (5,
  '0.016*"public" + 0.016*"power" + 0.009*"duty" + 0.009*"bank" + 0.008*"constitution" + 0.007*"congress" + 

In [500]:
# LDA for num_topics = 8

lda4_8 = models.LdaModel(corpus=corpus4, id2word=id2word4, num_topics=8, passes=30)
lda4_8.print_topics()

[(0,
  '0.016*"right" + 0.011*"men" + 0.010*"constitution" + 0.010*"man" + 0.009*"law" + 0.008*"life" + 0.008*"free" + 0.008*"slavery" + 0.007*"principle" + 0.007*"question"'),
 (1,
  '0.016*"american" + 0.011*"america" + 0.010*"job" + 0.009*"new" + 0.009*"work" + 0.009*"tax" + 0.007*"congress" + 0.007*"child" + 0.006*"family" + 0.006*"right"'),
 (2,
  '0.019*"law" + 0.014*"act" + 0.012*"officer" + 0.011*"person" + 0.010*"power" + 0.009*"duty" + 0.009*"congress" + 0.009*"authority" + 0.007*"department" + 0.007*"constitution"'),
 (3,
  '0.012*"congress" + 0.008*"law" + 0.007*"department" + 0.006*"increase" + 0.006*"report" + 0.006*"public" + 0.005*"american" + 0.005*"service" + 0.005*"legislation" + 0.005*"present"'),
 (4,
  '0.013*"vietnam" + 0.011*"american" + 0.010*"war" + 0.008*"south" + 0.008*"peace" + 0.007*"world" + 0.006*"question" + 0.006*"believe" + 0.005*"policy" + 0.005*"action"'),
 (5,
  '0.009*"business" + 0.009*"law" + 0.008*"work" + 0.008*"men" + 0.007*"national" + 0.007

In [498]:
# LDA for num_topics = 9

lda4_9 = models.LdaModel(corpus=corpus4, id2word=id2word4, num_topics=9, passes=20)
lda4_9.print_topics()

[(0,
  '0.012*"life" + 0.012*"right" + 0.011*"world" + 0.011*"men" + 0.009*"american" + 0.008*"man" + 0.008*"america" + 0.007*"freedom" + 0.007*"day" + 0.007*"citizen"'),
 (1,
  '0.012*"power" + 0.011*"congress" + 0.010*"public" + 0.009*"law" + 0.008*"duty" + 0.008*"act" + 0.007*"war" + 0.007*"citizen" + 0.006*"constitution" + 0.005*"present"'),
 (2,
  '0.010*"congress" + 0.008*"law" + 0.007*"department" + 0.006*"service" + 0.006*"report" + 0.005*"secretary" + 0.005*"american" + 0.005*"public" + 0.005*"treaty" + 0.005*"present"'),
 (3,
  '0.021*"constitution" + 0.018*"slavery" + 0.015*"law" + 0.014*"right" + 0.013*"question" + 0.013*"slave" + 0.012*"union" + 0.012*"territory" + 0.010*"congress" + 0.009*"free"'),
 (4,
  '0.017*"american" + 0.014*"america" + 0.011*"new" + 0.009*"work" + 0.008*"job" + 0.008*"world" + 0.007*"child" + 0.006*"right" + 0.006*"help" + 0.006*"family"'),
 (5,
  '0.010*"question" + 0.009*"dont" + 0.008*"congress" + 0.008*"believe" + 0.008*"governor" + 0.007*"hous

#### Takeaways:
- With 9 topics things start to lose focus, but 8 and especially 7 has an interesting mix.  8 is getting a bit too narrow in certain topics though (e.g. vietnam war) so 7 is going to be what will be focused on.  Updated number of passes on the model to refine results.

-
### Topic Modeling: Attempt #5
**LSA with Count Vectorizer** </br>
While LDA normally works best with large text -- such as presidential speeches -- and was why that model was first used, we'll also test LSA to see if chance it performs better than our LDA.

In [311]:
# Create a new document-term matrix using parameters that worked well for LDA as starting point

cv5 = CountVectorizer(stop_words=stop_words, ngram_range=(1,2), min_df=.1, max_df=.8)
transcript_cv5 = cv5.fit_transform(transcripts_clean_rd2_lem)
transcript_cv_dtm_v5 = pd.DataFrame(transcript_cv5.toarray(), columns=cv5.get_feature_names())
transcript_cv_dtm_v5.head()

Unnamed: 0,abandon,ability,able,abroad,absence,absolute,absolutely,abundant,abuse,accept,...,worthy,written,wrong,yes,yesterday,yield,york,young,young people,youre
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,1,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,0,0


In [325]:
# Run in LSA model for 2 topics and then print variance ratio

lsa1_2 = TruncatedSVD(2)
doc_topic_lsa1_2 = lsa1_2.fit_transform(transcript_cv_dtm_v5)
lsa1_2.explained_variance_ratio_

array([0.26345906, 0.09600439])

In [319]:
# create function to capture the words under each topic

def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [322]:
# Examine first 10 words that fit under the topics

display_topics(lsa1_2, cv5.get_feature_names(), 10)


Topic  0
congress, law, american, power, public, new, war, right, president, act

Topic  1
american, president, world, america, job, new, tax, dont, help, program


In [326]:
# Run in LSA model for 4 topics and then print variance ratio

lsa1_4 = TruncatedSVD(4)
doc_topic_lsa1_4 = lsa1_4.fit_transform(transcript_cv_dtm_v5)
lsa1_4.explained_variance_ratio_

array([0.26345906, 0.09600439, 0.04510286, 0.03413083])

In [327]:
# Examine first 10 words that fit under the topics

display_topics(lsa1_4, cv5.get_feature_names(), 10)


Topic  0
congress, law, american, power, public, new, war, right, president, act

Topic  1
american, president, world, america, job, new, tax, dont, help, program

Topic  2
slavery, slave, constitution, compromise, territory, right, principle, union, free, man

Topic  3
president, governor, dont, constitution, question, said, secretary, power, senate, bank


#### Takeaways:
- Starting to see some topics in these LSA models, but LDA was providing more ways to distinguish the topics and are easier to interpret for the speeches, so will continue with LDA.

-
### Topic Modeling: Attempt #6
**NMF with Count Vectorizer** </br>
As noted with LSA attempt, while LDA normally works best with large text -- such as presidential speeches -- and was why that model was first used, we'll also test NMF to see if chance it performs better than our LDA.

In [330]:
# Create a new document-term matrix using parameters that worked well for LDA as starting point

cv6 = CountVectorizer(stop_words=stop_words, ngram_range=(1,2), min_df=.1, max_df=.8)
transcript_cv6 = cv6.fit_transform(transcripts_clean_rd2_lem)
transcript_cv_dtm_v6 = pd.DataFrame(transcript_cv6.toarray(), columns=cv6.get_feature_names())
transcript_cv_dtm_v6.head()

Unnamed: 0,abandon,ability,able,abroad,absence,absolute,absolutely,abundant,abuse,accept,...,worthy,written,wrong,yes,yesterday,yield,york,young,young people,youre
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,1,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,0,0


In [334]:
nmf1_4 = NMF(8)
doc_topic_nmf1_4 = nmf1_4.fit_transform(transcript_cv6)

display_topics(nmf1_4, cv6.get_feature_names(), 10)


Topic  0
congress, treaty, war, mexico, public, citizen, report, present, secretary, new

Topic  1
american, america, new, job, work, tax, child, congress, family, need

Topic  2
slavery, slave, compromise, right, territory, principle, free, law, man, question

Topic  3
president, question, dont, governor, said, believe, thing, like, tax, problem

Topic  4
world, peace, war, american, force, america, freedom, new, men, life

Topic  5
power, constitution, law, public, congress, duty, act, bank, right, union

Topic  6
law, work, business, men, american, court, national, public, congress, department

Topic  7
examination, service, person, commission, rule, place, officer, appointment, test, general




#### Takeaways:
- Had to increase number of topics to start seeing any sort of distinction and there is some similiarity of what is seen in the past models for topics, but LDA still coming out ahead in terms of allowing for being distinction of topics and understanding weight of words to use to create topics.

-
## Mapping Topics to Each Speech
To see what speech was ranked with which first ranking topic and also ensure a proper distribution of the topics across the speeches, we'll map the topics to the speeches in a master dataframe for analysis.

In [515]:
# open master dataframe with speech information -- we'll be adding the topics to this

potus_speech_master = pd.read_csv('csv/potus_speech_sentiment.csv')

In [516]:
# remove unnecessary first column repeating index

potus_speech_master.drop(columns='Unnamed: 0', inplace= True)

-<br/>
Now we'll create a few functions that will be used to give us the 'top topic' (greatest weight) for each document in our LDA models that we're considering >

In [517]:
# function to order topics for each document in decending weight value

def all_topics_ordered(all_topics_df): 
    '''
    Takes in a dataframe with topic tuples ([topic],[weight of topic])
    that are in individual columns for each row/document in the dataframe.  
    For every row/document the function creates a dataframe
    of all topic tuples for that row/document and sorts them by the 
    descending weight value of the topic, meaning the first
    tuple listed will have the greatest weight of association
    to the row/document.
    '''
    all_speech_topics = []
    for index, row in all_topics_df.iterrows(): 
        speech_topics = []
        for topic in row:
            if topic != None:
                speech_topics.append(topic)
        speech_topics.sort(key=lambda x:x[1], reverse=True)
        all_speech_topics.append(speech_topics)
    return pd.DataFrame(all_speech_topics)

In [518]:
# Create a function to place topics in the master dataframe for speeches

def first_topic(topics, master_df):
    '''
    Function takes in topic speech rankings from model and 
    places the first topic for each speech in the master
    dataframe for the speeches.
    ---
    Inputs: 
    -- topics = all speeches' topic rankings
    -- master_df = dataframe with all data on speeches
    
    Output:
    Master dataframe with two new columns:
    -- Topic: topic category
    -- Topic_Percent: percentage that speech fits in that category
    
    '''
    topic_cat = []
    topic_percent = []
    for x,y in topics[0]:
        topic_cat.append(x)
        topic_percent.append(y)

    master_df['Topic'] = topic_cat
    master_df['Topic_Percent'] = topic_percent
    
    return master_df

-<br>
Now we'll look at the topic breakdowns for the topics we're considering using >

## Topics Narrowing, Review, & Selection

### lda2_4: LDA, CV, 4 Topics

In [519]:
# create a listing of all of the topic rankings for each speech
corpus_transformed_lda2_4 = lda2_4[corpus2]

# make it into a dataframe
df_speech_topics_lda2_4 = pd.DataFrame(corpus_transformed_lda2_4)

In [520]:
# apply the function above to the dataframe created so that 
#  the topic with the greatest 'weight' will be listed first

ordered_topics_lda2_4 = all_topics_ordered(df_speech_topics_lda2_4)

In [521]:
# add first topic listing to master dataframe

potus_speech_master_lda2_4 = first_topic(ordered_topics_lda2_4, potus_speech_master)

In [522]:
# look at distribution of topics across speeches

potus_speech_master_lda2_4.Topic.value_counts()

3    320
0    256
2    217
1    198
Name: Topic, dtype: int64

### Takaways:
- Fairly nice distribution.  This is good, but will see if we get good distribution for additional topics so we can have more disinction between speeches within reason of not getting too narrow or overwhleming.

### lda4_7: LDA, CV, 7 Topics
We'll look at using the 7 topic model that showed nice results with fairly disinct topics

In [523]:
# create a listing of all of the topic rankings for each speech
corpus_transformed_lda4_7 = lda4_7[corpus4]

# make it into a dataframe
df_speech_topics_lda4_7 = pd.DataFrame(corpus_transformed_lda4_7)

In [524]:
# apply the function above to the dataframe created so that 
#  the topic with the greatest 'weight' will be listed first

ordered_topics_lda4_7 = all_topics_ordered(df_speech_topics_lda4_7)

In [525]:
# now pull the first topic listed (one with greatest weight) and 
#   pull it into our master dataframe with all speeches' information

potus_speech_master_lda4_7 = first_topic(ordered_topics_lda4_7, potus_speech_master)

In [526]:
potus_speech_master_lda4_7.Topic.value_counts()

6    235
4    196
2    183
1    139
3     95
5     93
0     50
Name: Topic, dtype: int64

This is a nice distribution of topics.  Let's assign topic titles to each topic and then add those into the dataframe as well.  We can then examine if the topic names properly respresent the speeches >

In [538]:
# create new column replacing numerical topics with topic descriptions
#   these are topic descriptions selected based on words under each topic

topic_categories = potus_speech_master_lda4_7['Topic'].replace({
    0: 'Law, constitution, & rights', 
    1: 'World peace with war & force',
    2: 'War with American freedom',
    3: 'Work and business',
    4: 'American jobs and family help & needs',
    5: 'Public power and duty',
    6: 'Laws, treaties, and action'
})


Now for our analysis, we want to add in major historical periods in American history to the master dataframe to see if there are any relationships or trends with speech topics.  They will be as follows:
- 1789 - 1799: Establishment of new democratic nation
- 1800 - 1860: Settlement and expansion
- 1861 - 1865: American Civil War
- 1865 - 1890: Reconstruction Era & Guilded Age
- 1890 - 1913: Progressive Era
- 1914 - 1918: World War I / Progressive Era
- 1919 - 1928: Roaring Twenties / Progressive Era
- 1929 - 1932: Great Depression
- 1933 - 1938: Great Depression/New Deal
- 1939 - 1945: World War II
- 1946 - 1953: Cold War
- 1954 - 1964: Cold War / Civil Rights Movement
- 1965 - 1968: Cold & Vietnam Wars / Civil Rights Movement
- 1969 - 1972: Cold & Vietnam Wars
- 1973 - 1975: Energy Crisis/Cold & Vietnam Wars
- 1976 - 1980: Energy Crisis/Cold War
- 1981 - 1991: Reagan Era / Cold War
- 1992 - 2000: Neoconservative / Dot-Com Era
- 2001 - 2006: 'War on Terror'
- 2007 - 2009: Great Recession / 'War on Terror'
- 2010 - 2019: 'War on Terror'


In [547]:
# create empty new row to input historical periods

potus_speech_master_lda4_7['Historical_Period'] = np.nan

In [569]:
# now add all historical periods for their appropriate date range

for index in range(0,len(potus_speech_master_lda4_7)):
    if index <= 27:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1789-1799: New Democratic Nation'
    elif index > 27 and index <= 209:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1800-1860: Settlement & Expansion'
    elif index > 210 and index <= 237:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1861-1865: American Civil War'
    elif index > 237 and index <= 337:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1865-1890: Reconstruction Era & Guilded Age'
    elif index > 337 and index <= 419:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1890-1913: Progressive Era'
    elif index > 419 and index <= 451:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1914-1918: World War I / Progressive Era'
    elif index > 451 and index <= 483:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1919-1928: Roaring Twenties / Progressive Era'
    elif index > 483 and index <= 512:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1929-1932: Great Depression'
    elif index > 512 and index <= 531:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1933-1938: Great Depression/New Deal'
    elif index > 531 and index <= 567:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1939-1945: World War II'
    elif index > 567 and index <= 584:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1946-1953: Cold War'
    elif index > 584 and index <= 652:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1954-1964: Cold War / Civil Rights Movement'
    elif index > 652 and index <= 699:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1965-1968: Cold & Vietnam Wars / Civil Rights Movement'
    elif index > 699 and index <= 716:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1969-1972: Cold & Vietnam Wars'
    elif index > 716 and index <= 736:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1973-1975:: Energy Crisis/Cold & Vietnam Wars'
    elif index > 736 and index <= 763:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1976-1980: Energy Crisis/Cold War'
    elif index > 763 and index <= 837:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1981-1991: Reagan Era / Cold War'
    elif index > 837 and index <= 881:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = '1992-2000: Neoconservative / Dot-Com Era'
    elif index > 881 and index <= 911:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = "2001-2006: 'War on Terror'"
    elif index > 911 and index <= 932:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = "2007-2009: Great Recession / 'War on Terror'"
    else:
        potus_speech_master_lda4_7.loc[index,'Historical_Period'] = "2010-2019: Ongoing 'War on Terror'"

In [573]:
# Let's now see how many speeches fall in each era

potus_speech_master_lda4_7['Historical_Period'].value_counts()

1800-1860: Settlement & Expansion                         182
1865-1890: Reconstruction Era & Guilded Age               100
1890-1913: Progressive Era                                 82
1981-1991: Reagan Era / Cold War                           74
1954-1964: Cold War / Civil Rights Movement                68
2010-2019: Ongoing 'War on Terror'                         59
1965-1968: Cold & Vietnam Wars / Civil Rights Movement     47
1992-2000: Neoconservative / Dot-Com Era                   44
1939-1945: World War II                                    36
1914-1918: World War I / Progressive Era                   32
1919-1928: Roaring Twenties / Progressive Era              32
2001-2006: 'War on Terror'                                 30
1929-1932: Great Depression                                29
1789-1799: New Democratic Nation                           28
1976-1980: Energy Crisis/Cold War                          27
1861-1865: American Civil War                              27
2007-200

In [578]:
## UPDATE THIS TO INCLUDE THE VERSION OF THE FINAL MODEL SELECTED

# Save the master database for all speeches with sentiment & topics as csv
#  to be pulled into Tableau for visualization

potus_speech_master_lda4_7.to_csv('csv/potus_speech_master_topic_sentiment.csv')