NMF vs LDA Analysis with American Presidents' Inauguration Speeches 

In [59]:
# All Imports
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.feature_extraction import text
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer


STEP 1: Download the data and load it into a DataFrame.

Data downloaded from: https://www.kaggle.com/datasets/adhok93/presidentialaddress

In [44]:
raw_data_df = pd.read_csv('/content/drive/MyDrive/inaug_speeches.csv', encoding='cp1252')
pd.set_option('display.max_colwidth', 150)
raw_data_df.head()

,Unnamed: 0,Name,Inaugural Address,Date,text
0,4,George Washington,First Inaugural Address,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and of the House of Representatives: AMONG the vicissitudes incident to life no event could have fille...
1,5,George Washington,Second Inaugural Address,"Monday, March 4, 1793",Fellow Citizens: I AM again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occas...
2,6,John Adams,Inaugural Address,"Saturday, March 4, 1797","WHEN it was first perceived, in early times, that no middle course for America remained between unlimited submission to a foreign le..."
3,7,Thomas Jefferson,First Inaugural Address,"Wednesday, March 4, 1801","Friends and Fellow-Citizens: CALLED upon to undertake the duties of the first executive office of our country, I avail myself of th..."
4,8,Thomas Jefferson,Second Inaugural Address,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to that qualification which the Constitution requires before my entrance on the charge again conferred ..."


STEP 2: Not every president served more than one term, so we consider only each president's first inaugural address.

In [22]:
# np.unique(raw_data_df[['Name','Inaugural Address']],axis=0)
# raw_data_df.groupby(['Name','Inaugural Address']).size()

display(raw_data_df[['Name','Inaugural Address']])

,Name,Inaugural Address
0,George Washington,First Inaugural Address
1,George Washington,Second Inaugural Address
2,John Adams,Inaugural Address
3,Thomas Jefferson,First Inaugural Address
4,Thomas Jefferson,Second Inaugural Address
5,James Madison,First Inaugural Address
6,James Madison,Second Inaugural Address
7,James Monroe,First Inaugural Address
8,James Monroe,Second Inaugural Address
9,John Quincy Adams,Inaugural Address


In [23]:
display(raw_data_df.groupby(['Name']).size().reset_index(name='counts'))

,Name,counts
0,Abraham Lincoln,2
1,Andrew Jackson,2
2,Barack Obama,2
3,Benjamin Harrison,1
4,Bill Clinton,2
5,Calvin Coolidge,1
6,Donald J. Trump,1
7,Dwight D. Eisenhower,2
8,Franklin D. Roosevelt,4
9,Franklin Pierce,1
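
The counts above show that several presidents have more than one address in the data. A minimal sketch of how the "first address only" rule could be applied is given here. It assumes the rows are in chronological order, so the first row per `Name` is the first address. `toy_df` and `first_only_df` are hypothetical names used only for illustration, and the toy rows stand in for the real CSV.

```python
import pandas as pd

# Toy stand-in for raw_data_df (hypothetical rows, same column names as the CSV)
toy_df = pd.DataFrame({
    "Name": ["George Washington", "George Washington", "John Adams"],
    "Inaugural Address": [
        "First Inaugural Address",
        "Second Inaugural Address",
        "Inaugural Address",
    ],
    "text": ["speech one", "speech two", "speech three"],
})

# Assumption: rows are chronological, so keep="first" retains the first address
first_only_df = (
    toy_df.drop_duplicates(subset="Name", keep="first")
          .reset_index(drop=True)
)
print(first_only_df[["Name", "Inaugural Address"]])
```

If the rows were not already ordered by date, sorting on a parsed `Date` column before dropping duplicates would be the safer route.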


Limit the DataFrame to two columns: the president's name and the speech text.

In [75]:
raw_data_df=raw_data_df[['Name',"text"]]
raw_data_df = raw_data_df.set_index('Name')
raw_data_df.head()

Name,text
George Washington,fellow citizens of the senate and of the house of representatives among the vicissitudes incident to life no event could have fille...
George Washington,fellow citizens i am again called upon by the voice of my country to execute the functions of its chief magistrate when the occas...
John Adams,when it was first perceived in early times that no middle course for america remained between unlimited submission to a foreign le...
Thomas Jefferson,friends and fellow citizens called upon to undertake the duties of the first executive office of our country i avail myself of th...
Thomas Jefferson,proceeding fellow citizens to that qualification which the constitution requires before my entrance on the charge again conferred ...


Data cleaning:

In [76]:
import re
import string


def clean_raw_data(text):
    '''Lowercase the text, remove text in square brackets,
    replace punctuation with spaces, and remove words containing digits.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)  # bracketed annotations
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)  # tokens containing digits
    return text


# Clean the speech text
raw_data_df["text"] = raw_data_df["text"].apply(clean_raw_data)
# Visually inspect
raw_data_df.head()

Name,text
George Washington,fellow citizens of the senate and of the house of representatives among the vicissitudes incident to life no event could have fille...
George Washington,fellow citizens i am again called upon by the voice of my country to execute the functions of its chief magistrate when the occas...
John Adams,when it was first perceived in early times that no middle course for america remained between unlimited submission to a foreign le...
Thomas Jefferson,friends and fellow citizens called upon to undertake the duties of the first executive office of our country i avail myself of th...
Thomas Jefferson,proceeding fellow citizens to that qualification which the constitution requires before my entrance on the charge again conferred ...
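
To make the effect of each cleaning step concrete, here is a small worked example on a single made-up sentence. `clean_text_sketch` is a hypothetical stand-alone copy of the same three substitutions, and the sample string is invented for illustration. It is not taken from the speeches.

```python
import re
import string


def clean_text_sketch(text):
    """Hypothetical stand-alone version of the cleaning steps described above."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)                               # drop [bracketed] notes
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)   # punctuation -> space
    text = re.sub(r"\w*\d\w*", " ", text)                              # drop tokens with digits
    return text


# Invented sample sentence, not from the dataset
sample = "Fellow-Citizens [Applause]: In 1789, the 1st Congress met!"
tokens = clean_text_sketch(sample).split()
print(tokens)
```

Because punctuation is replaced with a space rather than deleted, hyphenated words such as "Fellow-Citizens" split into two tokens instead of fusing into one.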


In [56]:
import nltk
nltk.download('stopwords')

stop_words = nltk.corpus.stopwords.words('english')
stop_words[0:10]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [77]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

def tokenize_words(txt):
  tokenized= word_tokenize(txt)
  return tokenized

def remove_stopwords(tokenized_txt):
  cleaned_text_array = [word for word in tokenized_txt if word not in stop_words]
  return cleaned_text_array

def word_lemmatizer(tokenized_text_without_stopwords):
  wordnet_lemmatizer = WordNetLemmatizer()
  lemmatized = [wordnet_lemmatizer.lemmatize(word) for word in tokenized_text_without_stopwords]
  return ' '.join(lemmatized)

raw_data_df['tokenized_text']=raw_data_df['text'].apply(tokenize_words)

raw_data_df['tokenized_text_without_stopwords'] = raw_data_df['tokenized_text'].apply(lambda x: remove_stopwords(x))

raw_data_df['lemmatized_tok_txt_wo_stopwords'] = raw_data_df['tokenized_text_without_stopwords'].apply(lambda x: word_lemmatizer(x))

raw_data_df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Name,text,tokenized_text,tokenized_text_without_stopwords,lemmatized_tok_txt_wo_stopwords
George Washington,fellow citizens of the senate and of the house of representatives among the vicissitudes incident to life no event could have fille...,"[fellow, citizens, of, the, senate, and, of, the, house, of, representatives, among, the, vicissitudes, incident, to, life, no, event, could, have...","[fellow, citizens, senate, house, representatives, among, vicissitudes, incident, life, event, could, filled, greater, anxieties, notification, tr...",fellow citizen senate house representative among vicissitude incident life event could filled greater anxiety notification transmitted order recei...
George Washington,fellow citizens i am again called upon by the voice of my country to execute the functions of its chief magistrate when the occas...,"[fellow, citizens, i, am, again, called, upon, by, the, voice, of, my, country, to, execute, the, functions, of, its, chief, magistrate, when, the...","[fellow, citizens, called, upon, voice, country, execute, functions, chief, magistrate, occasion, proper, shall, arrive, shall, endeavor, express,...",fellow citizen called upon voice country execute function chief magistrate occasion proper shall arrive shall endeavor express high sense entertai...
John Adams,when it was first perceived in early times that no middle course for america remained between unlimited submission to a foreign le...,"[when, it, was, first, perceived, in, early, times, that, no, middle, course, for, america, remained, between, unlimited, submission, to, a, forei...","[first, perceived, early, times, middle, course, america, remained, unlimited, submission, foreign, legislature, total, independence, claims, men,...",first perceived early time middle course america remained unlimited submission foreign legislature total independence claim men reflection le appr...
Thomas Jefferson,friends and fellow citizens called upon to undertake the duties of the first executive office of our country i avail myself of th...,"[friends, and, fellow, citizens, called, upon, to, undertake, the, duties, of, the, first, executive, office, of, our, country, i, avail, myself, ...","[friends, fellow, citizens, called, upon, undertake, duties, first, executive, office, country, avail, presence, portion, fellow, citizens, assemb...",friend fellow citizen called upon undertake duty first executive office country avail presence portion fellow citizen assembled express grateful t...
Thomas Jefferson,proceeding fellow citizens to that qualification which the constitution requires before my entrance on the charge again conferred ...,"[proceeding, fellow, citizens, to, that, qualification, which, the, constitution, requires, before, my, entrance, on, the, charge, again, conferre...","[proceeding, fellow, citizens, qualification, constitution, requires, entrance, charge, conferred, duty, express, deep, sense, entertain, new, pro...",proceeding fellow citizen qualification constitution requires entrance charge conferred duty express deep sense entertain new proof confidence fel...


In [95]:
additional_stop_words = ["fellow", "america", "today", "thing"]
# TfidfVectorizer expects stop_words as a list in recent scikit-learn releases
stop_words_aggr = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))

tfidf = TfidfVectorizer(stop_words=stop_words_aggr, ngram_range=(1, 1), max_df=0.8, min_df=0.01)
# Fit the vectorizer on the speeches and transform them into a TF-IDF matrix
tfidf_transformed = tfidf.fit_transform(raw_data_df.text)

# Column labels come from the fitted vocabulary (get_feature_names_out in scikit-learn >= 1.0)
data_with_words_as_columns = pd.DataFrame(tfidf_transformed.toarray(), columns=tfidf.get_feature_names_out())
# Use the presidents' names as the index so they appear in the output
data_with_words_as_columns.index = raw_data_df.index

data_with_words_as_columns.head()





Name,abandon,abandoned,abandonment,abate,abdicated,abeyance,abhorring,abide,abiding,abilities,...,yorktown,young,younger,youngest,youth,youthful,zeal,zealous,zealously,zone
George Washington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
George Washington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
John Adams,0.0,0.03245,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.028445,0.0,0.0,0.0
Thomas Jefferson,0.037264,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0358,0.0,0.0,0.0
Thomas Jefferson,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.086495,0.0,0.0,0.0
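
The weights in the matrix above can be hard to interpret without seeing how they arise. The sketch below recomputes TF-IDF by hand on three invented mini-documents, using what scikit-learn documents as its default scheme: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The documents and all variable names are hypothetical, and the sketch ignores the `max_df`, `min_df` and stop-word filtering used in the notebook.

```python
import math
from collections import Counter

# Invented mini-corpus, only for illustration
docs = [
    "union states constitution",
    "peace freedom union",
    "peace peace freedom",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(doc) for doc in tokenized]
for name, row in zip(docs, matrix):
    print(name, [round(v, 3) for v in row])
```

Terms that occur in fewer documents ("constitution") receive a larger idf than terms shared across documents ("union"), which is why rare words can dominate a speech's row.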


In [96]:
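def display_topics(model, feature_names, num_top_words, topic_names=None):
  '''Print the num_top_words highest-weighted terms for each topic in model.components_.'''
  for ix, topic in enumerate(model.components_):
    if not topic_names or not topic_names[ix]:
      print("\nTopic ", ix)
    else:
      print("\nTopic: '", topic_names[ix], "'")
    # argsort is ascending, so the reversed slice yields the largest weights first
    top_indices = topic.argsort()[:-num_top_words - 1:-1]
    print(", ".join([feature_names[i] for i in top_indices]))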
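
The slice `topic.argsort()[:-num_top_words - 1:-1]` is compact but easy to misread. A small worked example with made-up weights shows what it selects. The feature names and weights here are invented for illustration.

```python
import numpy as np

# Invented vocabulary and topic weights
feature_names = ["union", "peace", "war", "freedom"]
topic = np.array([0.1, 0.7, 0.3, 0.9])
num_top_words = 2

# argsort gives indices in ascending weight order: [0, 2, 1, 3]
ascending = topic.argsort()
# Walking backwards from the end and stopping before position -(num_top_words + 1)
# keeps the indices of the num_top_words largest weights, largest first
top_idx = ascending[:-num_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)
```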

In [101]:
nmf_model = NMF(n_components=20)
# Fit NMF to the document-term matrix 'V' and
# extract the document-topic matrix 'W'
nmf_topic = nmf_model.fit_transform(data_with_words_as_columns)
# Show the top words per topic from the topic-term matrix 'H' (nmf_model.components_)
display_topics(nmf_model, tfidf.get_feature_names_out(), 5)




Topic  0
states, union, constitution, state, foreign

Topic  1
americans, work, american, children, freedom

Topic  2
freedom, peoples, peace, democracy, know

Topic  3
public, business, congress, laws, law

Topic  4
let, peace, help, role, sides

Topic  5
public, duties, congress, interests, branches

Topic  6
war, british, massacre, savage, cruel

Topic  7
studied, stirred, familiar, things, women

Topic  8
arrive, upbraidings, willingly, violated, incurring

Topic  9
problems, regards, republic, tasks, aright

Topic  10
learned, test, trend, peace, mistakes

Topic  11
change, covenant, man, union, mastery

Topic  12
union, states, general, powers, preservation

Topic  13
dollar, paying, deal, specie, debt

Topic  14
public, peace, state, happiness, reason

Topic  15
public, providential, immutable, impressions, ought

Topic  16
action, helped, wished, leadership, national

Topic  17
war, woe, offenses, god, offense

Topic  18
proposition, santo, domingo, transit, territory

Topic  
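
For readers who want to see what the factorization V ≈ W H means numerically, here is a minimal NumPy sketch on a small random matrix. It uses the classic multiplicative update rules for the Frobenius-norm objective, chosen here only because they fit in a few lines. scikit-learn's `NMF` uses a coordinate-descent solver by default, so this is an illustration of the idea, not a reproduction of the run above. All sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative "document-term" matrix: 6 documents, 8 terms (assumed sizes)
V = rng.random((6, 8))
k = 3                      # number of topics (assumed)
W = rng.random((6, k))     # document-topic weights
H = rng.random((k, 8))     # topic-term weights
eps = 1e-10                # guards against division by zero

errors = []
for _ in range(200):
    # Multiplicative updates keep W and H non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(V - W @ H))

print("first error:", round(errors[0], 4), "last error:", round(errors[-1], 4))
```

Each row of `H` plays the role of one topic in the listing above: sorting a row by weight and reading off the largest entries gives that topic's top words.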

