# Topic Modeling of the American Presidential Inaugurations dataset with LDA

## Set up

### Install necessary packages

In [5]:
pip install --upgrade opendatasets numpy pandas scipy scikit-learn matplotlib seaborn gensim

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22
Note: you may need to restart the kernel to use updated packages.


### Import packages

In [180]:
import os
import numpy as np
import pandas as pd
import gensim as gm
import seaborn as sns
import opendatasets as od
import re
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import TfidfVectorizer
from matplotlib import pyplot as plt

### Download Data
For this step, we need the `kaggle` API, which was installed above alongside `opendatasets`. Please follow these instructions:
1. Create a Kaggle account.
2. Navigate to https://kaggle.com/`username`/account -> `API` -> `Create New API Token`.
3. Move the downloaded `kaggle.json` file to `~/.kaggle/`, either with a command such as `mv ~/Downloads/kaggle.json ~/.kaggle/` from the CLI or with your machine's GUI.
4. Open `kaggle.json`; you should see something like `{"username":"[USERNAME]","key":"[KEY]"}`. Copy `KEY` to your clipboard.
5. Run the command below and follow the prompts.
6. Done! You should see `inaug_speeches.csv` in the `../data` directory.
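
Before pasting credentials into the prompt, it can help to confirm that the token file parses and contains both fields from step 4. The sketch below is only an illustration: `load_kaggle_credentials` and `mask` are hypothetical helper names, not part of `opendatasets` or the `kaggle` package, and the default path assumes the file was moved as in step 3.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path=None):
    """Return (username, key) read from a Kaggle API token file.

    Hypothetical helper; assumes the JSON layout shown in step 4.
    """
    if path is None:
        path = Path.home() / ".kaggle" / "kaggle.json"
    with open(path, encoding="utf-8") as fh:
        token = json.load(fh)
    return token["username"], token["key"]


def mask(key, visible=4):
    """Hide all but the last few characters so the key is safe to print."""
    return "*" * max(len(key) - visible, 0) + key[-visible:]


if __name__ == "__main__":
    default_path = Path.home() / ".kaggle" / "kaggle.json"
    # Only read the file if it is actually there
    if default_path.exists():
        username, key = load_kaggle_credentials(default_path)
        print(username, mask(key))
```

Masking the key before printing keeps it out of saved notebook output.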

In [14]:
os.makedirs('../data',exist_ok=True)
od.download('https://www.kaggle.com/datasets/adhok93/presidentialaddress',data_dir='../data',force=True)
os.rename('../data/presidentialaddress/inaug_speeches.csv','../data/inaug_speeches.csv')
os.rmdir('../data/presidentialaddress')

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

  aaprasad


Your Kaggle Key:

  ································


Downloading presidentialaddress.zip to ../data/presidentialaddress


100%|██████████| 275k/275k [00:00<00:00, 10.7MB/s]







### Helper Functions

In [251]:
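def word2topic(word_topic_probs,words,n_terms=5):
    """Map each topic index to its n_terms highest-weighted words, best first.
    `words` must be a NumPy array so it can be indexed with an index array."""
    # argsort is ascending, so the last n_terms columns hold the largest weights
    top_k_inds = np.argsort(word_topic_probs,axis=-1)[:,-1*n_terms:]
    # Reverse each row so the highest-weighted word comes first
    top_k_inds = np.flip(top_k_inds,axis=-1)
    terms = {i:words[top_k_inds[i,:]].tolist() for i in range(len(word_topic_probs))}
    return terms

def topics2docs(doc_topic_probs,n_topics=3):
    """Map each document index to its n_topics highest-weighted topic indices, best first."""
    top_k_inds = np.argsort(doc_topic_probs,axis=-1)[:,-1*n_topics:]
    top_k_inds = np.flip(top_k_inds,axis=-1)
    topics = {i:top_k_inds[i,:] for i in range(len(doc_topic_probs))}
    return topics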
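
To see what the argsort-and-flip step in these helpers does, here is a small standalone sketch on a toy topic-word matrix. The vocabulary, the weights, and the name `top_k_indices` are made up for illustration; only the indexing pattern mirrors the helpers above.

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k largest entries in each row, largest first."""
    order = np.argsort(scores, axis=-1)      # ascending order per row
    return np.flip(order[:, -k:], axis=-1)   # keep the last k, then reverse


# Hypothetical 2-topic, 4-word example
vocab = np.array(["liberty", "union", "tax", "peace"])
topic_word = np.array([[0.10, 0.60, 0.05, 0.25],
                       [0.40, 0.10, 0.30, 0.20]])

top = top_k_indices(topic_word, k=2)
top_terms = {i: vocab[row].tolist() for i, row in enumerate(top)}
print(top_terms)  # {0: ['union', 'peace'], 1: ['liberty', 'tax']}
```

Indexing `vocab` with an integer array only works because it is a NumPy array, which is also why the helpers expect the output of `get_feature_names_out()` rather than a plain list.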

In [278]:
def preprocess(dataset='../data/inaug_speeches.csv'):
    speeches_df = pd.read_csv(dataset,encoding='latin1')
    # Drop the first column of the CSV
    speeches_df = speeches_df.drop(columns=list(speeches_df.keys())[0])
    # Encode the address label as 1 or 2. NOTE: the 'first' test is case-sensitive,
    # so a capitalized label such as 'First ...' falls through to 2.
    speeches_df['Inaugural Address'] = speeches_df['Inaugural Address'].apply(lambda x: 1 if x == 'Inaugural Address' or 'first' in x else 2)
    # Replace HTML-like tags with a space
    speeches_df['text'] = speeches_df['text'].apply(lambda x: re.sub(r'<.*?>', ' ', x))
    # Trim tabs and newlines at the ends, then remove non-breaking spaces
    speeches_df['text'] = speeches_df['text'].apply(lambda x: x.strip('\t')).apply(lambda x: x.strip('\n'))
    speeches_df['text'] = speeches_df['text'].apply(lambda x: x.strip('\t')).apply(lambda x: x.replace(u'\xa0',''))
    speeches = speeches_df['text'].to_list()
    # Build the TF-IDF document-term matrix, ignoring English stop words
    tfidf = TfidfVectorizer(input='content',stop_words='english')
    X = tfidf.fit_transform(speeches)
    # Vocabulary as a NumPy array, aligned with the columns of X
    words = tfidf.get_feature_names_out()
    return speeches_df,X,tfidf,words
speeches_df,X,tfidf,words = preprocess()
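
The text-cleaning steps inside `preprocess` can be tried in isolation on a single string. This is a minimal sketch that applies the same operations in the same order; `clean_text` and the sample string are hypothetical and not taken from the dataset.

```python
import re


def clean_text(raw):
    """Apply the same cleaning steps as preprocess() to one string."""
    text = re.sub(r"<.*?>", " ", raw)                 # each HTML-like tag becomes a space
    text = text.strip("\t").strip("\n").strip("\t")   # trim tabs/newlines at the ends only
    return text.replace("\xa0", "")                   # remove non-breaking spaces


# Made-up sample containing a tab, two tags, a non-breaking space and a newline
sample = "\t<p>Fellow\xa0Citizens:</p> I AM again called\n"
cleaned = clean_text(sample)
print(repr(cleaned))  # ' FellowCitizens:  I AM again called'
```

Note that removing `\xa0` outright fuses the two words it separated. If that is not wanted, replacing it with a regular space (`' '`) would keep the tokens apart for the vectorizer.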

In [279]:
speeches_df

| | Name | Inaugural Address | Date | text |
|---|---|---|---|---|
| 0 | George Washington | 2 | Thursday, April 30, 1789 | Fellow-Citizens of the Senate and o... |
| 1 | George Washington | 2 | Monday, March 4, 1793 | Fellow Citizens:  I AM again calle... |
| 2 | John Adams | 1 | Saturday, March 4, 1797 | WHEN it was first perceived, in ea... |
| 3 | Thomas Jefferson | 2 | Wednesday, March 4, 1801 | Friends and Fellow-Citizens: CALL... |
| 4 | Thomas Jefferson | 2 | Monday, March 4, 1805 | PROCEEDING, fellow-citizens, to th... |
| 5 | James Madison | 2 | Saturday, March 4, 1809 | UNWILLING to depart from examples ... |
| 6 | James Madison | 2 | Thursday, March 4, 1813 | ABOUT to add the solemnity of an o... |
| 7 | James Monroe | 2 | Tuesday, March 4, 1817 | I SHOULD be destitute of feeling i... |
| 8 | James Monroe | 2 | Monday, March 5, 1821 | Fellow-Citizens: I SHALL not atte... |
| 9 | John Quincy Adams | 1 | Friday, March 4, 1825 | IN compliance with an usage coeval... |
9,John Quincy Adams,1,"Friday, March 4, 1825",IN compliance with an usage coeval...


In [295]:
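N_TOPICS = 20
# No random_state is set, so the fitted topics vary from run to run
lda = LDA(n_components=N_TOPICS)
# Despite the name, fit_transform returns the document-topic distribution (one row per speech)
logits = lda.fit_transform(X)
# Index of the highest-weighted topic for each speech
cluster_labels = np.argmax(logits,axis=1)
# (number of topics, vocabulary size)
lda.components_.shape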

(20, 8864)
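
The cell above fits LDA on TF-IDF weights without a fixed seed. LDA is formulated over word counts, and scikit-learn's own topic-extraction example pairs it with `CountVectorizer`, so a count-based, seeded variant may be worth comparing. The sketch below uses a made-up four-sentence corpus and two topics purely for illustration; it is not the notebook's pipeline.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the speeches
docs = [
    "liberty and union for the people",
    "the people demand liberty and justice",
    "taxes and tariffs fund the government",
    "government revenue comes from tariffs and taxes",
]

# Raw term counts instead of TF-IDF weights
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# A fixed random_state makes the fitted topics repeatable across runs
lda_counts = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda_counts.fit_transform(counts)   # each row sums to 1

# components_ holds unnormalized topic-word weights; normalize rows to get distributions
topic_word = lda_counts.components_ / lda_counts.components_.sum(axis=1, keepdims=True)

print(doc_topic.shape, topic_word.shape)
```

Whether counts give more interpretable topics than TF-IDF on the inaugural speeches is something to check empirically rather than assume.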

In [296]:
N_TERMS = 5
topic_words = word2topic(lda.components_,words,n_terms=N_TERMS)
print(f'Top {N_TERMS} words occurring in each of the {N_TOPICS} topics')
for topic,terms in topic_words.items():
    print(f'Topic {topic}: {terms}')

Top 5 words occurring in each of the 20 topics
Topic 0: ['occasionally', 'designed', 'slightest', 'evidently', 'unequaled']
Topic 1: ['dollar', 'micah', 'paying', 'deal', 'pleasing']
Topic 2: ['amendment', 'interstate', 'familiar', 'stirred', 'studied']
Topic 3: ['decline', 'charter', 'intellectual', 'necessarily', 'strongly']
Topic 4: ['measures', 'sense', 'course', 'look', 'resources']
Topic 5: ['learned', 'regards', 'trend', 'mistakes', 'aright']
Topic 6: ['represents', 'array', 'definite', 'polls', 'experiences']
Topic 7: ['strives', 'rhetoric', 'moon', 'skills', 'angry']
Topic 8: ['decline', 'charter', 'intellectual', 'necessarily', 'strongly']
Topic 9: ['stricken', 'inculcate', 'proportion', 'roman', 'necessarily']
Topic 10: ['don', 'breeze', 'door', 'word', 'blowing']
Topic 11: ['belligerent', 'paternalism', 'degradation', 'entitle', 'discouragement']
Topic 12: ['world', 'america', 'new', 'freedom', 'let']
Topic 13: ['wished', 'currents', 'drawn', 'singular', 'outside']
Topic 14:

In [298]:
N_TOPICS_PER_DOC=5
print(f'Top {N_TOPICS_PER_DOC} topics appearing in each of the {X.shape[0]} documents')
topic_docs = topics2docs(logits,N_TOPICS_PER_DOC)
for doc,topics in topic_docs.items():
    topics = ', '.join([str(topic) for topic in topics])
    name = speeches_df.iloc[doc]['Name']
    ia = speeches_df.iloc[doc]['Inaugural Address']
    print(f'{name}\'s Inaugural Address #{ia}: {topics}')

Top 5 topics appearing in each of the 58 documents
George Washington's Inaugural Address #2: 14, 18, 4, 12, 2
George Washington's Inaugural Address #2: 14, 16, 4, 12, 15
John Adams's Inaugural Address #1: 14, 1, 4, 12, 11
Thomas Jefferson's Inaugural Address #2: 4, 14, 12, 15, 17
Thomas Jefferson's Inaugural Address #2: 14, 15, 4, 12, 2
James Madison's Inaugural Address #2: 14, 11, 4, 12, 2
James Madison's Inaugural Address #2: 4, 14, 12, 18, 15
James Monroe's Inaugural Address #2: 14, 4, 12, 15, 9
James Monroe's Inaugural Address #2: 14, 10, 4, 12, 16
John Quincy Adams's Inaugural Address #1: 14, 4, 12, 16, 10
Andrew Jackson's Inaugural Address #2: 4, 14, 12, 17, 3
Andrew Jackson's Inaugural Address #2: 14, 9, 4, 12, 2
Martin Van Buren's Inaugural Address #1: 14, 0, 4, 12, 5
William Henry Harrison's Inaugural Address #1: 14, 9, 4, 12, 2
James Knox Polk's Inaugural Address #1: 14, 4, 12, 16, 2
Zachary Taylor's Inaugural Address #1: 14, 4, 2, 12, 13
Franklin Pierce's Inaugural Address #