## Problem Statement

The goal is to analyze transcripts of presidential State of the Union addresses and identify the topics and themes presidents used to address the American people. We focus on the [Presidential-Speeches](https://millercenter.org/the-presidency/presidential-speeches) collection provided by the Miller Center website.

## Scraping the Data

The links file lists the URL of every State of the Union address in our sample, which we scrape with BeautifulSoup. We save the scraped transcripts in a separate folder as Python pickle files.

In [1]:
# load the links data to a dataframe
import pandas as pd
df = pd.read_csv('C:\\Users\\alanl\\Desktop\\nlp_project\\links.csv')
df.head()

Unnamed: 0,President,Link,Year,Party
0,John F. Kennedy,https://millercenter.org/the-presidency/presid...,1961,D
1,John F. Kennedy,https://millercenter.org/the-presidency/presid...,1962,D
2,John F. Kennedy,https://millercenter.org/the-presidency/presid...,1963,D
3,Lyndon B. Johnson,https://millercenter.org/the-presidency/presid...,1964,D
4,Lyndon B. Johnson,https://millercenter.org/the-presidency/presid...,1965,D


In [2]:
# there are 52 state of the union speeches we will extract
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 4 columns):
President    52 non-null object
Link         52 non-null object
Year         52 non-null int64
Party        52 non-null object
dtypes: int64(1), object(3)
memory usage: 1.8+ KB


In [3]:
# we will test out the url for the 1962 state of the union address by John F. Kennedy
url = df.Link[1]
print(url)

https://millercenter.org/the-presidency/presidential-speeches/january-11-1962-state-union-address


In [4]:
# scrape one state of the union speech as a sample to see if the process works
import requests
from bs4 import BeautifulSoup
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")
text = [p.text for p in soup.find(class_="transcript-inner").find_all('p')]

# print out the beginning and the ending of the speech to make sure everything is right
print(text[0])
print(text[-2])

Mr. Vice President, my old colleague from Massachusetts and your new Speaker, John McCormack, Members of the 87th Congress, ladies and gentlemen:
A year ago, in assuming the tasks of the Presidency, I said that few generations, in all history, had been granted the role of being the great defender of freedom in its hour of maximum danger. This is our good fortune; and I welcome it now as I did a year ago. For it is the fate of this generation-of you in the Congress and of me as President--to live with a struggle we did not start, in a world we did not make. But the pressures of life are not always distributed by choice. And while no nation has ever faced such a challenge, no nation has ever been so ready to seize the burden and the glory of freedom.


In [5]:
# import pickle to store the transcripts
import pickle

# create a new function that turns url into transcripts
def url_to_transcript(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    text = [p.text for p in soup.find(class_="transcript-inner").find_all('p')]
    print(url)
    return text

urls = df.Link.tolist()
print(urls[0])

https://millercenter.org/the-presidency/presidential-speeches/january-30-1961-state-union
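One fragile spot in `url_to_transcript` is that `soup.find(...)` returns `None` when a page has no `transcript-inner` element, so the chained `.find_all('p')` would raise an `AttributeError`. The sketch below separates parsing from fetching and returns an empty list in that case. The helper name `parse_transcript` and the sample HTML are assumptions made for illustration; only the `transcript-inner` class comes from the notebook.

```python
from bs4 import BeautifulSoup

def parse_transcript(html):
    """Return the paragraph texts inside the transcript container, or [] if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="transcript-inner")
    if container is None:
        # page layout differs or the request returned an error page
        return []
    return [p.get_text() for p in container.find_all("p")]

# made-up HTML that only mimics the container class used above
sample_html = """
<div class="transcript-inner">
  <h3>Transcript</h3>
  <p>Mr. Speaker:</p>
  <p>The state of the Union is strong.</p>
</div>
"""

paragraphs = parse_transcript(sample_html)
missing = parse_transcript("<p>no transcript here</p>")
print(paragraphs)
print(missing)
```

Because the parser takes an HTML string, it can be checked offline before running the full scrape over all the links.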


In [6]:
speech_list = []

for i in range(len(df)):
    speech = str(df.President[i]) + str('_') + str(df.Year[i])
    speech_list.append(speech)

print(speech_list[0])

John F. Kennedy_1961


In [7]:
transcripts = [url_to_transcript(u) for u in urls]

https://millercenter.org/the-presidency/presidential-speeches/january-30-1961-state-union
https://millercenter.org/the-presidency/presidential-speeches/january-11-1962-state-union-address
https://millercenter.org/the-presidency/presidential-speeches/january-14-1963-state-union-address
https://millercenter.org/the-presidency/presidential-speeches/january-8-1964-state-union
https://millercenter.org/the-presidency/presidential-speeches/january-4-1965-state-union
https://millercenter.org/the-presidency/presidential-speeches/january-12-1966-state-union
https://millercenter.org/the-presidency/presidential-speeches/january-10-1967-state-union-address
https://millercenter.org/the-presidency/presidential-speeches/january-17-1968-state-union-address
https://millercenter.org/the-presidency/presidential-speeches/january-14-1969-state-union-address
https://millercenter.org/the-presidency/presidential-speeches/january-22-1970-state-union-address
https://millercenter.org/the-presidency/presidential-s

In [8]:
# create a folder for the transcripts
!mkdir transcripts

In [9]:
# dump each transcript as a pickle file (the .txt extension is kept so the load step can find the files)
for i, c in enumerate(speech_list):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [10]:
# load all the transcripts again into a self-created dictionary called data 
data = {}
for i, c in enumerate(speech_list):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
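The save and load steps can also be written without the `!mkdir` shell command, which can fail when the folder already exists. The following is a minimal sketch of the same round trip using `pathlib`; the function names, the `.pkl` extension, and the sample transcript are assumptions for illustration, and it writes to a temporary directory rather than the `transcripts` folder.

```python
import pickle
import tempfile
from pathlib import Path

def save_transcripts(folder, transcripts_by_name):
    """Pickle each transcript (a list of paragraphs) into its own file."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # no error if the folder exists
    for name, paragraphs in transcripts_by_name.items():
        with open(folder / (name + ".pkl"), "wb") as f:
            pickle.dump(paragraphs, f)

def load_transcripts(folder):
    """Rebuild the {name: paragraphs} dictionary from the pickle files."""
    return {
        path.stem: pickle.loads(path.read_bytes())
        for path in Path(folder).glob("*.pkl")
    }

# made-up sample in the same key format as speech_list
sample = {"John F. Kennedy_1961": ["Mr. Speaker:", "Thank you."]}

with tempfile.TemporaryDirectory() as tmp:
    save_transcripts(tmp, sample)
    restored = load_transcripts(tmp)

print(restored == sample)
```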

In [11]:
# look at the keys within data
data.keys()

dict_keys(['John F. Kennedy_1961', 'John F. Kennedy_1962', 'John F. Kennedy_1963', 'Lyndon B. Johnson_1964', 'Lyndon B. Johnson_1965', 'Lyndon B. Johnson_1966', 'Lyndon B. Johnson_1967', 'Lyndon B. Johnson_1968', 'Lyndon B. Johnson_1969', 'Richard M. Nixon_1970', 'Richard M. Nixon_1971', 'Richard M. Nixon_1972', 'Richard M. Nixon_1974', 'Gerald Ford_1975', 'Gerald Ford_1976', 'Gerald Ford_1977', 'Jimmy Carter_1978', 'Jimmy Carter_1979', 'Jimmy Carter_1980', 'Ronald Reagan_1982', 'Ronald Reagan_1983', 'Ronald Reagan_1984', 'Ronald Reagan_1985', 'Ronald Reagan_1986', 'Ronald Reagan_1987', 'Ronald Reagan_1988', 'George H.W. Bush_1990', 'George H.W. Bush_1991', 'George H.W. Bush_1992', 'Bill Clinton_1994', 'Bill Clinton_1995', 'Bill Clinton_1996', 'Bill Clinton_1997', 'Bill Clinton_1998', 'Bill Clinton_1999', 'Bill Clinton_2000', 'George W. Bush_2002', 'George W. Bush_2003', 'George W. Bush_2004', 'George W. Bush_2005', 'George W. Bush_2006', 'George W. Bush_2007', 'George W. Bush_2008', '

In [13]:
# Check the first 3 lines of George H.W. Bush 1990 speech to confirm everything is correct
data['George H.W. Bush_1990'][:3]

['Mr. President, Mr. Speaker, members of the United States Congress:',
 'I return as a former President of the Senate and a former member of this great House. And now, as President, it is my privilege to report to you on the state of the Union.',
 "Tonight I come not to speak about the state of the government, not to detail every new initiative we plan for the coming year nor to describe every line in the budget. I'm here to speak to you and to the American people about the state of the Union, about our world—the changes we've seen, the challenges we face—and what that means for America."]

### Additional questions

One open question about storing the data this way is whether we need to group the speeches by president. My current opinion is no, since a president's State of the Union addresses may have different themes each year. Our current goal in this NLP topic analysis is to figure out intent and national direction rather than presidential leadership style.
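Keeping one entry per speech does not rule out grouping later, because each key already encodes the president and the year. As a sketch, assuming the `President_Year` key format used in this notebook (the helper name `group_by_president` is hypothetical):

```python
from collections import defaultdict

def group_by_president(keys):
    """Map each president to the list of years found in 'President_Year' keys."""
    grouped = defaultdict(list)
    for key in keys:
        # split on the last underscore so names containing dots or spaces stay intact
        president, year = key.rsplit("_", 1)
        grouped[president].append(int(year))
    return dict(grouped)

keys = ["John F. Kennedy_1961", "John F. Kennedy_1962", "Lyndon B. Johnson_1964"]
by_president = group_by_president(keys)
print(by_president)
```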

## Convert data from Dictionary to DataFrame

In [14]:
# The current format within the data dictionary is key: president_year, value: list of paragraphs
# we need to change it to key: president_year, value: a single string
# create a new function that joins the list of paragraphs into one string
def combine_text(list_of_text):
    combined_text = ' '.join(list_of_text)
    return combined_text

# utilize the combine_text function to combine everything
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [15]:
# each value is now a one-element list holding the whole speech as a single string, so its length is 1
len(data_combined['George H.W. Bush_1990'])

1

In [16]:
# We can convert this into a pandas dataframe
pd.set_option('max_colwidth',150)
data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df[:4]

Unnamed: 0,transcript
Barack Obama_2010,"Madam Speaker, Vice President Biden, Members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to ..."
Barack Obama_2011,"Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans:\n\r\n Tonight I want to begin by congratula..."
Barack Obama_2012,"Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: Last month, I went to Andrews Air Force Base an..."
Barack Obama_2013,"\r\nMr. Speaker, Mr. Vice President, members of Congress, fellow citizens: \r\n \r\nFifty-one years ago, John F. Kennedy declared to this chamber..."


## Clean the Data

Since we are going to use bag-of-words methods on our dataset, we will clean the data in the following rounds.

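<b>First Round</b>: lowercase; remove text in square brackets; remove punctuation; remove words containing numbers.

<b>Second Round</b>: remove additional punctuation (curly quotes, ellipses) and nonsensical text such as newline characters.

<b>Third Round</b>: stemming 'looked' --> 'look' or lemmatization 'geese' --> 'goose'; the code below removes stop words and lemmatizes with spaCy.

<b>Fourth Round</b>: untokenize, joining the lemmas back into a single string.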

In [17]:
# First Round of Cleaning
import re
import string

def clean_text_round1(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # raw strings avoid invalid-escape warnings
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
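To see what the first round does in practice, the sketch below applies the same four steps to a made-up sentence (the sample text is an assumption, not taken from a transcript). Note the side effect on hyphenated words: deleting punctuation joins the two halves, which likely explains vocabulary entries such as `zeroemission` in the document-term matrix built later.

```python
import re
import string

def clean_text_round1(text):
    """Same steps as the first cleaning round in this notebook."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                                # drop [bracketed] notes
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # drop ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                               # drop words containing digits
    return text

sample = "In 1962 [Applause] we built zero-emission cars."
cleaned = clean_text_round1(sample)

# split() is used only to ignore the leftover runs of spaces
print(cleaned.split())
```

If joined words like these turn out to matter for the topic analysis, one option is to replace punctuation with a space instead of an empty string; that would change the vocabulary shown in the outputs of this notebook.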

In [18]:
# Second Round of Cleaning
def clean_text_round2(text):
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\r','',text)
    return text

round2 = lambda x: clean_text_round2(x)

In [22]:
# Third round of cleaning
import en_core_web_sm
nlp = en_core_web_sm.load()

def clean_text_round3(text):
    text = nlp(text) #tokenize the data
    text = [w.lemma_ for w in text if not w.is_stop] #lemmatize words that are not stop words
    return text

round3 = lambda x: clean_text_round3(x)

In [23]:
# Fourth round of cleaning
# load the untokenizer function to convert everything back to string again once stemming and lemmatization is complete
# reference code provided by commonsense, please check out his/her brilliant github
# https://github.com/commonsense/metanl/blob/master/metanl/token_utils.py
def clean_text_round4(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

round4 = lambda x: clean_text_round4(x)

In [24]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean = pd.DataFrame(data_clean.transcript.apply(round3))
data_clean = pd.DataFrame(data_clean.transcript.apply(round4))
data_clean[:3]

Unnamed: 0,transcript
Barack Obama_2010,madam speaker vice president biden member congress distinguish guest fellow americans constitution declare time time president shall congress info...
Barack Obama_2011,mr speaker mr vice president member congress distinguish guest fellow americans tonight want begin congratulate man woman congress new sp...
Barack Obama_2012,mr speaker mr vice president member congress distinguish guest fellow americans month go andrews air force base welcome home troop serve iraq o...


### Additional questions
1. Stemming seems to stem words incorrectly: 'president' becomes 'presid' and 'force' becomes 'forc'. Is there any way to fix this?


2. We are considering n-grams, but the sample code below pairs every two adjacent words, and some combinations don't make sense, such as "presid member" and "congress distinguish". How can n-grams be deployed in a smarter way?

In [25]:
#example code for bigrams 

#import nltk
#from nltk.tokenize import word_tokenize
#from nltk.util import ngrams 

#sentences = ["To Sherlock Holmes she is always the woman.", "I have seldom heard him mention her under any other name."]

#bigrams = []
#for sentence in sentences:
#    sequence = word_tokenize(sentence) 
#    bigrams.extend(list(ngrams(sequence, 2)))

#freq_dist = nltk.FreqDist(bigrams)
#prob_dist = nltk.MLEProbDist(freq_dist)
#number_of_bigrams = freq_dist.N()
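One common heuristic for the n-gram question, offered as a sketch rather than the approach this notebook settles on, is to keep only bigrams that recur across the corpus, since accidental pairings tend to appear once while real phrases repeat. The helper name `frequent_bigrams`, the `min_count` threshold, and the toy token lists are all assumptions for illustration.

```python
from collections import Counter

def frequent_bigrams(token_lists, min_count=2):
    """Count adjacent word pairs over all documents and keep the recurring ones."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its successor within the same document
        counts.update(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}

# toy documents, already cleaned and tokenized
docs = [
    "vice president member congress".split(),
    "vice president fellow americans".split(),
    "member congress fellow americans".split(),
]

kept = frequent_bigrams(docs)
print(kept)
```

Pairs such as `('president', 'member')` occur only once here and are dropped, while `('vice', 'president')` survives.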


## Storing the Data

The data generated will be in two standard text formats for further analysis:
1. <b>Document-Term Matrix</b> - a matrix containing word counts
2. <b>Corpus</b> - a collection of text
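To make the two formats concrete, here is a minimal pure-Python sketch on a made-up two-document corpus. The corpus maps a document name to its text, and the document-term matrix holds one row of word counts per document, with one column per vocabulary word. This only illustrates the idea; it is not how `CountVectorizer` is implemented.

```python
from collections import Counter

# corpus: document name -> cleaned text (toy data)
corpus = {
    "speech_a": "union strong union free",
    "speech_b": "economy strong",
}

# vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({word for text in corpus.values() for word in text.split()})

# document-term matrix: one row of counts per document
dtm = {}
for name, text in corpus.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocab]

print(vocab)
print(dtm)
```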

In [26]:
# this dataframe currently uses president_year as its index; add the same label as a regular column
data_df['full_name'] = data_df.index
data_df.head()

Unnamed: 0,transcript,full_name
Barack Obama_2010,"Madam Speaker, Vice President Biden, Members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to ...",Barack Obama_2010
Barack Obama_2011,"Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans:\n\r\n Tonight I want to begin by congratula...",Barack Obama_2011
Barack Obama_2012,"Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: Last month, I went to Andrews Air Force Base an...",Barack Obama_2012
Barack Obama_2013,"\r\nMr. Speaker, Mr. Vice President, members of Congress, fellow citizens: \r\n \r\nFifty-one years ago, John F. Kennedy declared to this chamber...",Barack Obama_2013
Barack Obama_2014,"Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Today in America, a teacher spent extra time with a student who needed ...",Barack Obama_2014


In [27]:
# create the pickle for later usage
data_df.to_pickle("corpus.pkl")

In [28]:
# create a document-term matrix using CountVectorizer and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions used get_feature_names()
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm.head()

Unnamed: 0,aaron,abandon,abandonment,abbas,abdication,abduction,abhorrent,abide,ability,abilityour,...,zarfos,zarqawi,zeitchik,zero,zeroemission,zerooverall,zimbabwe,zion,zone,zoom
Barack Obama_2010,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
Barack Obama_2011,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Barack Obama_2012,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Barack Obama_2013,0,0,0,0,0,0,0,0,3,0,...,0,0,0,0,0,0,0,0,0,0
Barack Obama_2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [29]:
# create the pickle for later usage
data_dtm.to_pickle("dtm.pkl")

In [30]:
# pickle the clean data as well
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)

### Additional questions
How can we adjust CountVectorizer's parameters? What is ngram_range? What are min_df and max_df?
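As a partial answer: `ngram_range=(min_n, max_n)` sets which n-gram sizes are counted, while `min_df` and `max_df` are document-frequency cutoffs, where an integer means a number of documents and a float means a proportion of documents. The sketch below shows all three on a made-up three-sentence corpus; the threshold values are arbitrary assumptions, not tuned for the speeches.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, not taken from the transcripts
docs = [
    "the union is strong",
    "the union is free",
    "the economy is strong",
]

cv_demo = CountVectorizer(
    ngram_range=(1, 2),  # count single words and two-word phrases
    min_df=2,            # drop terms found in fewer than 2 documents
    max_df=0.9,          # drop terms found in more than 90% of documents
)
cv_demo.fit(docs)

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(cv_demo.vocabulary_)
print(kept_terms)
```

Here `the` and `is` appear in every document and are removed by `max_df`, while one-off terms such as `free` and `the economy` are removed by `min_df`. Combining `ngram_range` with `min_df` is one way to keep only bigrams that recur across speeches.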