# Analyzing Presidential Debates

## Group Members - Purvi Thakor, Junior Ovince, Akshay Kamath
<br>
With the midterms around the corner & politics dominating the news, we decided to analyze presidential debates for our NLP project. We found the transcripts on
[UC Santa Barbara's The American Presidency Project](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/presidential-candidates-debates-1960-2016).

We decided to analyze the presidential debates from 2000 onwards, which meant we worked on 5 sets of debates:

* <font color=blue>Gore<font color=black> v <font color=red>Bush<font color=black> _(2000)_
  -  [Presidential Debate in St Louis, Missouri](https://www.presidency.ucsb.edu/ws/index.php?pid=29420)
  -  [Presidential Debate in Winston-Salem, North Carolina](https://www.presidency.ucsb.edu/ws/index.php?pid=29419)
  -  [Presidential Debate in Boston, Massachusetts](https://www.presidency.ucsb.edu/ws/index.php?pid=29418)
<br>
<br>
* <font color=blue>Kerry<font color=black> v <font color=red>Bush<font color=black> _(2004)_
  -  [Presidential Debate in Tempe, Arizona](https://www.presidency.ucsb.edu/ws/index.php?pid=63163)
  -  [Presidential Debate in St Louis, Missouri](https://www.presidency.ucsb.edu/ws/index.php?pid=72776)
  -  [Presidential Debate in Coral Gables, Florida](https://www.presidency.ucsb.edu/ws/index.php?pid=72770)
<br>
<br>
* <font color=blue>Obama<font color=black> v <font color=red>McCain<font color=black> _(2008)_
  -  [Presidential Debate in Hempstead, New York](https://www.presidency.ucsb.edu/ws/index.php?pid=84526)
  -  [Presidential Debate in Nashville, Tennessee](https://www.presidency.ucsb.edu/ws/index.php?pid=84482)
  -  [Presidential Debate in Oxford, Mississippi](https://www.presidency.ucsb.edu/ws/index.php?pid=78691)
<br>
<br>
* <font color=blue>Obama<font color=black> v <font color=red>Romney<font color=black> _(2012)_
  -  [Presidential Debate at Lynn University in Boca Raton, Florida](https://www.presidency.ucsb.edu/ws/index.php?pid=102344)
  -  [Presidential Debate at Hofstra University in Hempstead, New York](https://www.presidency.ucsb.edu/ws/index.php?pid=102343)
  -  [Presidential Debate at the University of Denver, Colorado](https://www.presidency.ucsb.edu/ws/index.php?pid=102317)
<br>
<br>
* <font color=blue>Clinton<font color=black> v <font color=red>Trump<font color=black> _(2016)_
  -  [Presidential Debate at the University of Nevada, Las Vegas](https://www.presidency.ucsb.edu/ws/index.php?pid=119039)
  -  [Presidential Debate at Washington University in St. Louis, Missouri](https://www.presidency.ucsb.edu/ws/index.php?pid=119038)
  -  [Presidential Debate at Hofstra University in Hempstead, New York](https://www.presidency.ucsb.edu/ws/index.php?pid=119012)
<br>
<br>
    
Furthermore, we decided to analyze both the victory & concession speeches for all the candidates.
    


## Preprocessing

The data was manually copied from the above-mentioned site & stored in text files. Thankfully, the format was more or less the same across the documents, which saved us a lot of time in the initial few steps.
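
Loading the saved transcripts could be sketched as below. This is only a minimal sketch: it assumes the files are plain `.txt` files encoded as UTF-8 and that the debate year appears in each file name (as the later `'2000' in i` checks suggest). `load_transcripts` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def load_transcripts(folder, year=None, encoding="utf-8"):
    """Return {file name: text} for the .txt files in `folder` (hypothetical helper)."""
    transcripts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Optionally keep only the files whose name mentions the given year
        if year is not None and str(year) not in path.name:
            continue
        # read_text opens and closes the file for us
        transcripts[path.name] = path.read_text(encoding=encoding)
    return transcripts
```

Filtering on the `.txt` extension keeps stray files (notebooks, checkpoints) in the working directory out of the corpus.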

In [None]:
from IPython.display import display, HTML  # IPython.core.display is the deprecated import path
display(HTML("<style>.container { width:95% !important; }</style>"))

In [None]:
import pandas as pd
import os
import re
import tabulate
import string
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import spacy

from textstat.textstat import textstatistics, easy_word_set, legacy_round
from nltk.stem import *
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

In [None]:
print(os.getcwd())

In [None]:
wrkdir = "D:/Project/NLP/presidential_debates/"
os.chdir(wrkdir)  # os.chdir() returns None, so the path is kept in its own variable

def read_file(filename):
    # Use a context manager so the file handle is always closed
    with open(filename) as input_file:
        return input_file.read()

file_list=[]
for file in os.listdir(wrkdir):
    file_list.append(file) 

file_list

In [None]:
for i in file_list:
    file = open(i,'rt')
    file_read = file.read()
    print("\n")
    print(125*"+")
    print(file.name)
    print(125*"+")
    print("\n")
    file.close()
    print(file_read)

As we can see, the format of the files differs slightly. In some transcripts the speaker's name first appears in title case & later switches to uppercase. So we realized that each set of documents needs to be cleaned separately.
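
The per-year cells that follow all repeat the same idea: normalize the speaker labels, then capture the text between one speaker's label and the next label. A minimal sketch of that idea as a single function is shown here. `extract_turns` and the sample text are hypothetical, and the sketch assumes the labels have already been normalized to the uppercase `NAME:` form.

```python
import re

def extract_turns(text, speaker, other_labels):
    """Return the turns spoken by `speaker` as a list of strings (hypothetical helper)."""
    # A turn ends at the next speaker label or at the end of the text
    labels = "|".join(re.escape(label) for label in [speaker] + list(other_labels))
    pattern = re.escape(speaker) + r":(.*?)(?=(?:" + labels + r"):|$)"
    return [turn.strip() for turn in re.findall(pattern, text, flags=re.DOTALL)]

sample = "LEHRER: Welcome. BUSH: Thank you, Jim. GORE: Thank you. BUSH: One more point."
print(extract_turns(sample, "BUSH", ["GORE", "LEHRER"]))
# ['Thank you, Jim.', 'One more point.']
```

Unlike a pattern that requires a following label, the lookahead with `$` also keeps a speaker's final turn when nothing follows it.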

### # 2000 Presidential Debate Data Cleaning

In [None]:
all_files = []

for i in file_list:
    if '2000' in i:
        file = open(i,'rt')
        file_read = file.read()
        file_read = file_read.replace('\n', '')
#        file_read = file_read.replace(".",". ")
        all_files.append(file_read)
        print("\n")
        print(125*"+")
        print(file.name)
        print(125*"+")
        print("\n")
        file.close()
        print(file_read)
        
string_files = ' '.join(all_files) #converting list to string
cleaned_files = string_files.replace('\n\n','') #replacing new line tags
file_read1 = cleaned_files.replace('Senator Kerry.','KERRY:') #likely a copy-paste leftover: Kerry did not take part in the 2000 debates
file_read2 = file_read1.replace('President Bush.','BUSH:') #replacing the title-case speaker label with uppercase
GB_2000 = re.findall(r'BUSH:(.*?)(?:GORE:|LEHRER:|SCHIEFFER:|MODERATOR:|MEMBER OF AUDIENCE:)', file_read2) #extracting comments made by Bush only
AG_2000 = re.findall(r'GORE:(.*?)(?:BUSH:|LEHRER:|SCHIEFFER:|MODERATOR:|MEMBER OF AUDIENCE:)', file_read2) #extracting comments made by Gore only

GB_2000_wc = ''.join(GB_2000)
AG_2000_wc = ''.join(AG_2000)

In [None]:
# len() on the joined string counts characters, not words
print("George Bush's character count in the 2000 presidential debates is " + "{:,}".format(len(GB_2000_wc)) + " characters.")
print("Al Gore's character count in the 2000 presidential debates is " + "{:,}".format(len(AG_2000_wc)) + " characters.")

### # 2004 Presidential Debate Data Cleaning

In [None]:
all_files = []

for i in file_list:
    if '2004' in i:
        file = open(i,'rt')
        file_read = file.read()
        file_read = file_read.replace('\n', '')
        all_files.append(file_read)
        print("\n")
        print(125*"+")
        print(file.name)
        print(125*"+")
        print("\n")
        file.close()
        print(file_read)
        
string_files = ' '.join(all_files) #converting list to string
cleaned_files = string_files.replace('\n\n','') #replacing new line tags
file_read1 = cleaned_files.replace('Kerry:','KERRY: ') #replacing one instance of proper string with upper
file_read2 = file_read1.replace('Bush: ','BUSH: ') #replacing one instance of proper string with upper
GB_2004 = re.findall(r'BUSH:(.*?)(?:KERRY:|LEHRER:|SCHIEFFER:|MODERATOR:|MEMBER OF AUDIENCE:)', file_read2) #extracting comments made by Bush only
JK_2004 = re.findall(r'KERRY:(.*?)(?:BUSH:|LEHRER:|SCHIEFFER:|MODERATOR:|MEMBER OF AUDIENCE:)', file_read2) #extracting comments made by Kerry only

GB_2004_wc = ''.join(GB_2004)
JK_2004_wc = ''.join(JK_2004)

In [None]:
# len() on the joined string counts characters, not words
print("George Bush's character count in the 2004 presidential debates is " + "{:,}".format(len(GB_2004_wc)) + " characters.")
print("John Kerry's character count in the 2004 presidential debates is " + "{:,}".format(len(JK_2004_wc)) + " characters.")

### # 2008 Presidential Debate Data Cleaning

In [None]:
all_files = []

for i in file_list:
    if '2008' in i:
        file = open(i,'rt')
        file_read = file.read()
        file_read = file_read.replace('\n', '')
        all_files.append(file_read)
        print("\n")
        print(125*"+")
        print(file.name)
        print(125*"+")
        print("\n")
        file.close()
        print(file_read)
        
string_files = ' '.join(all_files) #converting list to string
cleaned_files = string_files.replace('\n\n','') #replacing new line tags
file_read1 = cleaned_files.replace('McCain:','MCCAIN:') #replacing one instance of proper string with upper
file_read2 = file_read1.replace('Obama:','OBAMA:') #replacing one instance of proper string with upper
JM_2008 = re.findall(r'MCCAIN:(.*?)(?:OBAMA:|LEHRER:|SCHIEFFER:|BROKAW:)', file_read2) #extracting comments made by McCain only
BO_2008 = re.findall(r'OBAMA:(.*?)(?:MCCAIN:|LEHRER:|SCHIEFFER:|BROKAW:)', file_read2) #extracting comments made by Obama only

JM_2008_wc = ''.join(JM_2008)
BO_2008_wc = ''.join(BO_2008)

In [None]:
# len() on the joined string counts characters, not words
print("Barack Obama's character count in the 2008 presidential debates is " + "{:,}".format(len(BO_2008_wc)) + " characters.")
print("John McCain's character count in the 2008 presidential debates is " + "{:,}".format(len(JM_2008_wc)) + " characters.")

### # 2012 Presidential Debate Data Cleaning

In [None]:
all_files = []

for i in file_list:
    if '2012' in i:
        file = open(i,'rt')
        file_read = file.read()
        file_read = file_read.replace('\n', '')
        all_files.append(file_read)
        print("\n")
        print(125*"+")
        print(file.name)
        print(125*"+")
        print("\n")
        file.close()
        print(file_read)

string_files = ' '.join(all_files) #converting list to string
cleaned_files = string_files.replace('\n\n','') #replacing new line tags
file_read1 = cleaned_files.replace('Gov. Romney.',' ROMNEY:') #replacing all instances of proper string with upper
file_read2 = file_read1.replace('The President.',' OBAMA:') #replacing all instances of proper string with upper
file_read3 = file_read2.replace('Mr. Lehrer.',' LEHRER:') #replacing all instances of proper string with upper
file_read4 = file_read3.replace('Mr. Schieffer.',' SCHIEFFER:') #replacing all instances of proper string with upper
file_read5 = file_read4.replace('Ms. Crowley.',' CROWLEY:') #replacing all instances of proper string with upper

MR_2012 = re.findall(r' ROMNEY:(.*?)(?:OBAMA:| LEHRER:| SCHIEFFER:| CROWLEY:)', file_read5) #extracting comments made by Romney only
BO_2012 = re.findall(r' OBAMA:(.*?)(?:ROMNEY:| LEHRER:| SCHIEFFER:| CROWLEY:)', file_read5) #extracting comments made by Obama only

MR_2012_wc = ''.join(MR_2012)
BO_2012_wc = ''.join(BO_2012)

In [None]:
# len() on the joined string counts characters, not words
print("Barack Obama's character count in the 2012 presidential debates is " + "{:,}".format(len(BO_2012_wc)) + " characters.")
print("Mitt Romney's character count in the 2012 presidential debates is " + "{:,}".format(len(MR_2012_wc)) + " characters.")

### # 2016 Presidential Debate Data Cleaning

In [None]:
all_files = []

for i in file_list:
    if '2016' in i:
        file = open(i,'rt')
        file_read = file.read()
        file_read = file_read.replace('\n', '')
        all_files.append(file_read)
        print("\n")
        print(125*"+")
        print(file.name)
        print(125*"+")
        print("\n")
        file.close()
        print(file_read)

string_files = ' '.join(all_files) #converting list to string
cleaned_files = string_files.replace('\n\n','') #replacing new line tags
file_read1 = cleaned_files.replace('Trump:',' TRUMP:') #replacing all instances of proper string with upper
file_read2 = file_read1.replace('Clinton:',' CLINTON:') #replacing all instances of proper string with upper
file_read3 = file_read2.replace('Holt:',' HOLT:') #replacing all instances of proper string with upper
file_read4 = file_read3.replace('Wallace:',' WALLACE:') #replacing all instances of proper string with upper
file_read5 = file_read4.replace('Raddatz:',' RADDATZ:') #replacing all instances of proper string with upper
file_read6 = file_read5.replace('Cooper:',' COOPER:') #replacing all instances of proper string with upper

DT_2016 = re.findall(r'TRUMP:(.*?)(?:CLINTON:|HOLT:|WALLACE:|RADDATZ:|COOPER:)', file_read6) #extracting comments made by Trump only
HC_2016 = re.findall(r'CLINTON:(.*?)(?:TRUMP:|HOLT:|WALLACE:|RADDATZ:|COOPER:)', file_read6) #extracting comments made by Clinton only

DT_2016_wc = ''.join(DT_2016)
HC_2016_wc = ''.join(HC_2016)

In [None]:
# len() on the joined string counts characters, not words
print("Donald Trump's character count in the 2016 presidential debates is " + "{:,}".format(len(DT_2016_wc)) + " characters.")
print("Hillary Clinton's character count in the 2016 presidential debates is " + "{:,}".format(len(HC_2016_wc)) + " characters.")

### # Word Count (Without Cleaning)

<br>
Without discarding any information, on a very superficial level, we can see that Obama had the highest count of any nominee (124,529 in 2012), followed closely by Romney in the same year (124,418). Note that these totals come from `len()` applied to the joined transcript strings, so they measure characters rather than words.
<br>

| Year | Nominee | Word Count |
| --- | --- | --- |
| 2000 | <font color=blue>Gore<font color=black> | 108,651 |
| **2000** | <font color=red>**Bush**<font color=black> | **116,400** |
| 2004 | <font color=blue>Kerry<font color=black> | 79,328 |
| **2004** | <font color=red>**Bush**<font color=black> | **105,301** |    
| **2008** | <font color=blue>**Obama**<font color=black> | **121,759** |
| 2008 | <font color=red>McCain<font color=black> | 112,023 |    
| **2012** | <font color=blue>**Obama**<font color=black> | **124,529** |
| 2012 | <font color=red>Romney<font color=black> | 124,418 |    
| 2016 | <font color=blue>Clinton<font color=black> | 106,863 |
| **2016** | <font color=red>**Trump**<font color=black> | **121,193** |
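
Since `len()` on a Python string returns the number of characters, a word-level total needs the text to be split first. The snippet below is a minimal sketch of the difference on a made-up sentence; `count_characters_and_words` is a hypothetical helper, and whitespace splitting is only one possible definition of a word.

```python
def count_characters_and_words(text):
    """Return (number of characters, number of whitespace-separated words)."""
    return len(text), len(text.split())

sample = "My fellow Americans, thank you."
print(count_characters_and_words(sample))  # (31, 5)
```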
 

### # Word Count (With Cleaning)

In [None]:
#wordnet_lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [None]:
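def cleaned_data(data):
    data = " ".join(data)  # join turns with a space so words at turn boundaries do not merge
    data = word_tokenize(data)
    # the NLTK stopword list is lowercase, so compare case-insensitively
    data = [w for w in data if w.lower() not in stop_words]
    wc_lst = len(data)  # note: punctuation tokens are still counted
    return wc_lst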

In [None]:
wc_files = [AG_2000,GB_2000,JK_2004,GB_2004,BO_2008,JM_2008,BO_2012,MR_2012,HC_2016,DT_2016]
lst = ['Gore (2000)','Bush (2000)','Kerry (2004)','Bush (2004)','Obama (2008)','McCain (2008)','Obama (2012)','Romney (2012)','Clinton (2016)','Trump (2016)']

wc = []
for i in wc_files:
    wc_vals = cleaned_data(i)
    wc.append(wc_vals)

word_count = pd.DataFrame({'Nominee':lst,'Words':wc})
word_count

In [None]:
ax = word_count.plot(kind='bar',x='Nominee',y='Words',figsize=(15,5),legend = True, fontsize = 12,color=['blue', 'red', 'blue', 'red', 'blue', 'red','blue', 'red', 'blue','red'])
ax.set_xlabel("Nominee", fontsize=12)
ax.set_ylabel("Words", fontsize=12)
ax.get_legend().remove()
ax.set_title('Count of words used by Nominees in Debates (after removal of stopwords)')
plt.show()

### # Debate Similarity

#### 1) Jaccard Similarity
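
The Jaccard similarity of two documents is the size of the intersection of their vocabularies divided by the size of the union. The toy example below is a minimal sketch of that definition on two made-up token lists; `jaccard_similarity` and the sample tokens are illustrative only, and returning 0.0 for two empty documents is a convention chosen here.

```python
def jaccard_similarity(words_a, words_b):
    """Shared unique words divided by all unique words of two token lists."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = ["tax", "cuts", "jobs", "tax"]
doc_b = ["tax", "jobs", "health", "care"]
print(jaccard_similarity(doc_a, doc_b))  # 2 shared words / 5 unique words = 0.4
```

Because only unique words matter, repeating "tax" in `doc_a` does not change the score.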

In [None]:
punc = str.maketrans('', '', string.punctuation)
regex = r"(?<!\d)[....'--'``](?!\d)"

def cleaner(data):
    data = "".join(data)
    data = data.replace("..."," ")
    data = data.replace("."," ")
    data = re.sub(regex, "", data)
    data = data.translate(punc)
    data = word_tokenize(data)
    data = [w for w in data if w.lower() not in stop_words]  # stopword list is lowercase
    
    return data

def jacardian_distance(file1, file2):
    # Despite the name, this returns the Jaccard similarity (1.0 = identical vocabularies)
    words1 = set(cleaner(file1))
    words2 = set(cleaner(file2))
    intersection = len(words1 & words2)
    union = len(words1 | words2)
    jacardian = intersection / union
    
    return jacardian


In [None]:
file_list = [AG_2000,GB_2000,JK_2004,GB_2004,BO_2008,JM_2008,BO_2012,MR_2012,HC_2016,DT_2016]

jac_mat = np.zeros(shape = (len(file_list), len(file_list)))
jac_mat = pd.DataFrame(jac_mat)#, index=6, columns=6)

for a in range(len(file_list)):
    for b in range(len(file_list)):
        # .iloc avoids chained assignment (jac_mat[a][b]), which pandas may apply to a temporary copy
        jac_mat.iloc[b, a] = jacardian_distance(file_list[a], file_list[b])

jac_mat = jac_mat.rename(columns={0 : "Gore'00",
                                  1 : "Bush'00",
                                  2 : "Kerry'04",
                                  3 : "Bush'04",
                                  4 : "Obama'08",
                                  5 : "McCain'08",
                                  6 : "Obama'12",
                                  7 : "Romney'12",
                                  8 : "Clinton'16",
                                  9 : "Trump'16"
                                 },
                        index =  {0 : "Gore'00",
                                  1 : "Bush'00",
                                  2 : "Kerry'04",
                                  3 : "Bush'04",
                                  4 : "Obama'08",
                                  5 : "McCain'08",
                                  6 : "Obama'12",
                                  7 : "Romney'12",
                                  8 : "Clinton'16",
                                  9 : "Trump'16"
                                 })


In [None]:
plt.figure(figsize=(8,8))
mask =  np.tri(jac_mat.shape[1],k=-1)
hm_j = sns.heatmap(jac_mat,
                cbar=True,
                square=True,
#                fmt='d',
                annot = True,
                annot_kws={'size': 15},
                cmap=sns.cubehelix_palette(rot=-.57),
                mask = mask.T,
                vmin = 0.2, vmax = .5
                )

# Show heat map
plt.tight_layout()
plt.show()

#### 2) Cosine Similarity
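
Cosine similarity compares word frequencies rather than just vocabularies: each document becomes a vector of word counts, and the score is the dot product of the two vectors divided by the product of their lengths. The toy example below is a minimal sketch on made-up token lists; `cosine_from_tokens` is a hypothetical name used only for illustration.

```python
import math
from collections import Counter

def cosine_from_tokens(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty document
    return dot / (norm_a * norm_b)

doc_a = ["tax", "tax", "jobs"]
doc_b = ["tax", "jobs", "jobs"]
print(round(cosine_from_tokens(doc_a, doc_b), 3))  # 0.8
```

Both toy documents share the same vocabulary, so their Jaccard similarity would be 1.0, while the cosine score of 0.8 reflects the different word frequencies.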

In [None]:
from collections import Counter

def cosine_similarity(file1, file2):
    # Term-frequency counts for each cleaned transcript
    counts_1 = Counter(cleaner(file1))
    counts_2 = Counter(cleaner(file2))

    # Shared vocabulary, so both vectors have the same length and ordering
    combined = sorted(set(counts_1) | set(counts_2), key=str.lower)

    document_1_vector = np.array([counts_1[w] for w in combined])
    document_2_vector = np.array([counts_2[w] for w in combined])

    dot_product_of_two_document_vectors = np.dot(document_1_vector, document_2_vector)

    norm_1 = np.linalg.norm(document_1_vector)
    norm_2 = np.linalg.norm(document_2_vector)

    return dot_product_of_two_document_vectors / (norm_1 * norm_2)

In [None]:
#file_list = [BO_2008,JM_2008,BO_2012,MR_2012,HC_2016,DT_2016]

cos_mat = np.zeros(shape = (len(file_list), len(file_list)))
cos_mat = pd.DataFrame(cos_mat)#, index=6, columns=6)

for a in range(len(file_list)):
    for b in range(len(file_list)):
        # .iloc avoids chained assignment (cos_mat[a][b]), which pandas may apply to a temporary copy
        cos_mat.iloc[b, a] = cosine_similarity(file_list[a], file_list[b])

cos_mat = cos_mat.rename(columns={0 : "Gore'00",
                                  1 : "Bush'00",
                                  2 : "Kerry'04",
                                  3 : "Bush'04",
                                  4 : "Obama'08",
                                  5 : "McCain'08",
                                  6 : "Obama'12",
                                  7 : "Romney'12",
                                  8 : "Clinton'16",
                                  9 : "Trump'16"
                                 },
                        index =  {0 : "Gore'00",
                                  1 : "Bush'00",
                                  2 : "Kerry'04",
                                  3 : "Bush'04",
                                  4 : "Obama'08",
                                  5 : "McCain'08",
                                  6 : "Obama'12",
                                  7 : "Romney'12",
                                  8 : "Clinton'16",
                                  9 : "Trump'16"
                                 })

In [None]:
plt.figure(figsize=(8,8))
# mask = np.zeros_like(corr)
# mask[np.triu_indices_from(mask)] = True

mask =  np.tri(cos_mat.shape[1],k=-1)

hm_c = sns.heatmap(cos_mat,
                cbar=True,
                square=True,
                annot = True,
                cmap=sns.cubehelix_palette(rot=-.7),
                annot_kws={'size': 15},
                mask=mask.T,
                vmin=.7,vmax=1)

# Show heat map
plt.tight_layout()
plt.show()

In [None]:
# ADJACENT SIMILARITY PLOTS

# fig, ax = plt.subplots(1,2)#,figsize=(10,10)
# fig.set_size_inches(10,10)
# hm_j = sns.heatmap(jac_mat,
#                 cbar=True,
#                 square=True,
#                 annot = True,
#                 cmap=sns.cubehelix_palette(rot=-.57),
#                 annot_kws={'size': 15},
#                 mask=mask.T,
#                 ax=ax[0])

# hm_c = sns.heatmap(cos_mat,
#                 cbar=True,
#                 square=True,
#                 annot = True,
#                 cmap=sns.cubehelix_palette(rot=-.57),
#                 annot_kws={'size': 15},
#                 mask=mask.T,
#                 ax=ax[1])

# fig.show()


### Readability Index

#### The Flesch Reading Ease Readability Formula 
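
For reference, the Flesch Reading Ease score is computed as 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word), with higher scores indicating easier text. The snippet below is a minimal sketch of the formula only. `flesch_reading_ease` is a hypothetical helper that assumes the three totals have already been counted, and the numbers in the example are made up.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease score from pre-computed totals."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# 100 words in 5 sentences with 150 syllables:
# 206.835 - 1.015 * 20 - 84.6 * 1.5
print(round(flesch_reading_ease(100, 5, 150), 3))  # 59.635
```

Longer sentences and longer words both lower the score, which is why the word and sentence counts are needed first.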

In [None]:
nlp = spacy.load('en_core_web_sm')

def break_sentences(text):
    # Return the sentence spans; iterating over the Doc itself would yield tokens
    document = nlp(text)
    return list(document.sents)

def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        # each sentence is a Span of tokens (punctuation tokens included)
        words += len([token for token in sentence])
    return words

def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)

In [None]:
print(str(word_count(BO_2008_wc)) + str(" word count"))
print(str(sentence_count(BO_2008_wc)) + str(" sentence count"))

#nlp(BO_2008_wc)

In [None]:
sentences = break_sentences(BO_2008_wc)

In [None]:
sentences