<h1>Objective</h1>
<ul>
    <li>The purpose of this analysis was to determine whether any debate factors (use of certain words, references to certain issues, personal attacks, etc.) correlated with the outcome of the subsequent election.</li>
    <li>Additionally, we wanted to examine these debates over time to find any interesting patterns.</li>
    <li><strong>The following are the questions we set out to answer.</strong></li>
</ul>
<ol>
    <li>How did the winning candidates’ word choice differ from the losing candidates’?</li>
    <li>Did candidates who won the election talk about particular issues more than their opponents?</li>
    <li>Do Ad Hominem/Personal Attacks increase a candidate’s chances of winning?</li>
    <li>Under what circumstances are politicians more likely to use positive, neutral, and negative language?</li>
    <li>Do candidates who use longer or shorter words on average in debates perform better or worse?</li>
</ol>

<h1>Data Sources</h1>
<ul>
    <li>We retrieved all of the presidential debate transcripts from https://debates.org/voter-education/debate-transcripts/.</li>
    <li>We took election results from https://www.archives.gov/electoral-college/1960.</li>
    <li>We also used https://en.wikipedia.org/wiki/List_of_United_States_presidential_election_results_by_state to find the results of each election and data on the electoral college outcome.</li>
</ul>

<h1>Models and Algorithms</h1>
<h2><ul>
    <li>We used an expanded version of the code from Mini project 1 to extract the text from the web pages. </li>
</ul></h2>

In [None]:
import requests, re, nltk
from bs4 import BeautifulSoup
from collections import Counter
import operator

# we may not care about the usage of stop words
stop_words = nltk.corpus.stopwords.words('english') + [
 'ut', '\'re','.', ',', '--', '\'s', '?', ')', ':', '(', '\'',
 '\"', '-', '}', '{', '&', '|', u'\u2014', '', '–', 'still', 'good', 'well',
'said', 'â\x80\x9ci', 'gutenberg-tm', 'mr', 'project', 'one', 'uh', 'don’t',
 'would', 'made']


# We most likely would like to remove html markup
def cleanHtml (html):
    # BeautifulSoup is already imported at the top of the cell
    soup = BeautifulSoup(html, 'html.parser')
    return soup .get_text()

# We also want to remove special characters, quotes, etc. from each word
def cleanWord (w):
    # the r prefix (as in r'[.,"\']') tells Python to treat \ as a regular character,
    # but we still need to escape ' with \'
    # any character between the brackets [] is removed
    wn = re.sub(r'[,".\'&|@>*;/=]', "", w)
    # get rid of tokens that are purely numeric
    return re.sub(r'^[0-9.]*$', "", wn)
       
# define a function to get text/clean/calculate frequency
def debate_word_dictionary_generator (URL, name1, name2, modList):
    # first get the web page
    r = requests .get(URL)
    
    # Now clean
    # remove html markup
    t = cleanHtml (r .text) .lower()
    
    # split string into an array of words using any sequence of spaces "\s+" 
    wds = re .split(r'\s+', t)
    
    
    
    # remove periods, commas, etc stuck to the edges of words
    for i in range(len(wds)):
        wds[i] = cleanWord (wds [i])
        
    name1Arr = []
    name2Arr = []
    switcher = 3
            
    for i in range(len(wds)):
        if wds[i] == name1:
            switcher = 1
            
        elif wds[i] == name2:
            switcher = 2
            
        elif wds[i] in modList:
            switcher = 3
            
        else:
            if switcher == 1:
                name1Arr.append(wds[i])
                
            elif switcher == 2:
                name2Arr.append(wds[i])
    
    # If satisfied with results, lets go to the next step: calculate frequencies
    # We can write a loop to create a dictionary, but 
    # there is a special function for everything in python
    # in particular for counting frequencies (like function table() in R)
    wf1 = Counter (name1Arr)
    wf2 = Counter (name2Arr)
    
    # Remove stop words from the dictionary wf
    for k in stop_words:
        wf1. pop(k, None)
        wf2. pop(k, None)
           
        
    # how many regular (non-stop) words did each candidate use?
    tw1 = sum(wf1.values())
    tw2 = sum(wf2.values())
    # Get ordered list
    wfs1 = sorted (wf1 .items(), key = operator.itemgetter(1), reverse=True)
    ml1 = min(len(wfs1),30)
    
    wfs2 = sorted (wf2 .items(), key = operator.itemgetter(1), reverse=True)
    ml2 = min(len(wfs2),30)
    

    #Reverse the list because barh plots items from the bottom
    return [(wfs1 [ 0:ml1 ] [::-1], tw1), (wfs2 [ 0:ml2 ] [::-1], tw2)]
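
<ul><li>The speaker-attribution step above can be tried without any network access. The following is a minimal sketch, not the project's code: the function name <code>split_by_speaker</code>, the toy transcript, and the speaker labels are all hypothetical, and the tokens are assumed to be lower-cased and cleaned already.</li></ul>

```python
def split_by_speaker(tokens, name1, name2, moderators):
    """Assign each token to the most recent speaker label seen before it."""
    spoken = {name1: [], name2: []}
    current = None  # None means a moderator (or nobody yet) is speaking
    for tok in tokens:
        if tok in (name1, name2):
            current = tok          # a candidate label starts a new turn
        elif tok in moderators:
            current = None         # moderator turns are discarded
        elif current is not None:
            spoken[current].append(tok)
    return spoken[name1], spoken[name2]

# Toy transcript; the labels "bush:", "kerry:" and "lehrer:" are made up for illustration
tokens = "lehrer: welcome bush: taxes matter kerry: jobs matter lehrer: thanks".split()
first, second = split_by_speaker(tokens, "bush:", "kerry:", ["lehrer:"])
print(first, second)
```

<ul><li>One caveat with this approach: if a label token also appears inside someone's answer, the attribution switches at that point, so the labels passed in should be as distinctive as possible.</li></ul>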
        

<ul><li>We then used the following code to generate the bar charts.</li></ul>

In [None]:
import numpy as np
import pylab
import matplotlib.pyplot as plt

%matplotlib inline
def plotTwoLists (wf_ee, wf_bu, title):
    f = plt.figure (figsize=(10, 6))
    # this is painfully tedious....
    f .suptitle (title, fontsize=20)
    ax = f.add_subplot(111)
    ax .spines ['top'] .set_color ('none')
    ax .spines ['bottom'] .set_color ('none')
    ax .spines ['left'] .set_color ('none')
    ax .spines ['right'] .set_color ('none')
    ax .tick_params (labelcolor='w', top=False, bottom=False, left=False, right=False, labelsize=20)

    # Create two subplots, this is the first one
    ax1 = f .add_subplot (121)
    plt .subplots_adjust (wspace=.5)

    pos = np .arange (len(wf_ee)) 
    ax1 .tick_params (axis='both', which='major', labelsize=14)
    pylab .yticks (pos, [ x [0] for x in wf_ee ])
    ax1 .barh (range(len(wf_ee)), [ x [1] for x in wf_ee ], align='center')

    ax2 = f .add_subplot (122)
    ax2 .tick_params (axis='both', which='major', labelsize=14)
    pos = np .arange (len(wf_bu)) 
    pylab .yticks (pos, [ x [0] for x in wf_bu ])
    ax2 .barh (range (len(wf_bu)), [ x [1] for x in wf_bu ], align='center')
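
<ul><li>Because the two panels plot raw counts, a candidate who simply spoke more will show longer bars. One possible adjustment, sketched below under the assumption that the input has the (word list, total) shape returned by <code>debate_word_dictionary_generator</code>, is to convert counts to a rate per 1,000 words before plotting. The helper name <code>per_thousand</code> and the sample numbers are hypothetical.</li></ul>

```python
def per_thousand(word_counts, total_words):
    """Convert (word, count) pairs to (word, rate per 1,000 words) pairs."""
    if total_words <= 0:
        return []
    return [(w, 1000.0 * c / total_words) for w, c in word_counts]

# Made-up values in the shape (top_words, total_word_count)
cand1 = ([("jobs", 5), ("tax", 10)], 500)
cand2 = ([("jobs", 8), ("tax", 8)], 1000)

rates1 = per_thousand(*cand1)
rates2 = per_thousand(*cand2)
print(rates1)
print(rates2)
```

<ul><li>The resulting lists keep the same (label, value) layout, so they could be passed to <code>plotTwoLists</code> in place of the raw counts.</li></ul>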

<h2><ul><li>We used NLTK's VADER sentiment analyzer to measure how much of each speech was positive, neutral, or negative.</li></ul></h2>

In [1]:
import nltk 
nltk.download('vader_lexicon') # one time only
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer() # or whatever you want to call it

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/mwermert/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


<ul><li>We then used an edited version of the code above to get the text from each candidate.</li></ul>

In [None]:
import requests, re, nltk
from bs4 import BeautifulSoup
from collections import Counter
import operator

# We most likely would like to remove html markup
def cleanHtml (html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup .get_text()

# We also want to remove special characters, quotes, etc. from each word
def cleanWord (w):
    # the r prefix (as in r'[.,"\']') tells Python to treat \ as a regular character,
    # but we still need to escape ' with \'
    # any character between the brackets [] is removed
    wn = re.sub(r'[,".\'&|@>*;/=]', "", w)
    # get rid of tokens that are purely numeric
    return re.sub(r'^[0-9.]*$', "", wn)

def debate_word_list (URL, name1, name2, modList):
    # first get the web page
    r = requests .get(URL)
    
    # Now clean
    # remove html markup
    t = cleanHtml (r .text) .lower()
    
    # split string into an array of words using any sequence of spaces "\s+" 
    wds = re .split(r'\s+', t)
    
    
    
    # remove periods, commas, etc stuck to the edges of words
    for i in range(len(wds)):
        wds[i] = cleanWord (wds [i])
        
    name1Arr = []
    name2Arr = []
    switcher = 3
            
    for i in range(len(wds)):
        if wds[i] == name1:
            switcher = 1
            
        elif wds[i] == name2:
            switcher = 2
            
        elif wds[i] in modList:
            switcher = 3
            
        else:
            if switcher == 1:
                name1Arr.append(wds[i])
                
            elif switcher == 2:
                name2Arr.append(wds[i])
                
    return [name1Arr, name2Arr]
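
<ul><li>The word lists returned above can then be handed to a sentiment scorer. VADER's compound score is a normalized sum that approaches 1 or -1 as the input grows, so scoring a whole debate in one call may saturate it. The sketch below shows one possible alternative, scoring fixed-size chunks and averaging them. The function <code>average_sentiment</code>, the chunk size, and the stand-in scorer <code>toy_score</code> are assumptions for illustration; <code>toy_score</code> is not VADER and only exists so the example runs without NLTK.</li></ul>

```python
def average_sentiment(words, score_fn, chunk_size=200):
    """Score consecutive chunks of words and average each score key."""
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    if not chunks:
        return {}
    scores = [score_fn(chunk) for chunk in chunks]
    # average every key the scorer returns (e.g. neg / neu / pos)
    return {key: sum(s[key] for s in scores) / len(scores) for key in scores[0]}

def toy_score(text):
    """Stand-in scorer for demonstration only; it is NOT VADER."""
    toks = text.split()
    pos = sum(t == "good" for t in toks) / len(toks)
    neg = sum(t == "bad" for t in toks) / len(toks)
    return {"neg": neg, "neu": 1 - pos - neg, "pos": pos}

sample_words = ["good", "plan", "bad", "deal"]
averaged = average_sentiment(sample_words, toy_score, chunk_size=2)
print(averaged)

# With NLTK and the vader_lexicon available, the real analyzer could be passed instead:
# average_sentiment(name1Arr, vader.polarity_scores)
```

<ul><li>Note that the cleaned word lists are lower-cased and stripped of most punctuation, both of which VADER can use as intensity cues, so scores on this text may differ somewhat from scores on the raw transcript.</li></ul>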

<h2><ul><li>The following code was used to calculate the average word length.</li></ul></h2>

In [None]:
# total_arr1 and total_arr2 hold (word, count) pairs for each group of candidates
count = 0
total = 0
for key, value in total_arr1:
    # a word used `value` times contributes its length `value` times
    total += len(key) * value
    count += value

print('Cand1: ' + str(total / count))

count = 0
total = 0
for key, value in total_arr2:
    total += len(key) * value
    count += value

print('Cand2: ' + str(total / count))
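
<ul><li>The same calculation can be wrapped in a small reusable function. This is a minimal sketch assuming the input is a list of (word, count) pairs; the name <code>average_word_length</code> and the sample data are hypothetical.</li></ul>

```python
def average_word_length(word_counts):
    """Frequency-weighted mean word length for a list of (word, count) pairs."""
    total_chars = sum(len(word) * count for word, count in word_counts)
    total_words = sum(count for _, count in word_counts)
    # guard against an empty list so we never divide by zero
    return total_chars / total_words if total_words else 0.0

sample = [("tax", 3), ("economy", 1)]
avg = average_word_length(sample)  # (3*3 + 7*1) / 4
print(avg)
```

<ul><li>Weighting by count matters here: a short word that is used very often pulls the average down more than a long word used once.</li></ul>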

<h1>Results</h1>

<br/>
<h2>Word Count</h2>
<img src="./445pic1.PNG" />
<ul><li>The results of our word count analysis showed that winning and losing candidates used very similar words throughout the debates, and word choice seems to have very little correlation with who wins the election.</li> </ul>
<img src="./401pic2.PNG" />
<ul><li>This figure breaks down the top words that the losing candidates had in common with the winning candidates and those that were distinct for each side. It makes clearer that the difference in word choice between winners and losers is minuscule and inconclusive. The distinct words show no recognizable pattern.
</li> </ul>
<h3>Results by Party</h3>
<img src = "./445pic3.PNG"/>
<ul><li>Overall, the two parties have used very similar words over time as well.</li></ul>
<strong>Words Exclusive to Democrats</strong>
<ul>
    <li>american</li>
    <li>i've</li>
    <li>need</li>
    <li>right</li>
    <li>also</li>
    <li>plan</li>
</ul>
<ul><li>Words like 'right', 'need', 'american', and 'plan' may reflect the Democrats' greater focus on working-class citizens.</li></ul>
<strong>Words Exclusive To Republicans</strong>
<ul>
    <li>say</li>
    <li>way</li>
    <li>like</li>
    <li>look</li>
    <li>government</li>
    <li>senator</li>
</ul>
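
<ul><li>The split into shared and exclusive words can be reproduced with set operations on each group's top-word list. The following is a minimal sketch; the function name <code>compare_top_words</code> and the short word lists are hypothetical and are not the project's actual top-30 lists.</li></ul>

```python
def compare_top_words(top_a, top_b):
    """Return the words shared by both lists and those exclusive to each."""
    set_a, set_b = set(top_a), set(top_b)
    return {
        "common": sorted(set_a & set_b),   # in both lists
        "only_a": sorted(set_a - set_b),   # exclusive to the first list
        "only_b": sorted(set_b - set_a),   # exclusive to the second list
    }

# Made-up top-word lists for illustration
dem_top = ["people", "plan", "need", "president"]
rep_top = ["people", "government", "senator", "president"]
comparison = compare_top_words(dem_top, rep_top)
print(comparison)
```

<ul><li>Because the result depends on where the top-N cutoff falls, a word listed as exclusive may simply rank just below the cutoff for the other group.</li></ul>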
<h2>Sentiment Analysis</h2>
<ul>
    <li><strong>Winning Candidates: {'neg': 0.086, 'neu': 0.773, 'pos': 0.142, 'compound': 1.0}<br/>
        Losing Candidates: {'neg': 0.083, 'neu': 0.773, 'pos': 0.144, 'compound': 1.0}</strong></li>
    <li>The average positive, neutral, and negative proportions are roughly the same for winners and losers.</li>
    <li>Because the debates are mainly about issues such as trade and taxation, most of the language is classified as neutral.</li>
    <li>Generally, a candidate running for a second term speaks slightly more positively, while a challenger is more likely to use negative language.</li>
    <br/>
    <li><strong>Democrats: {'neg': 0.086, 'neu': 0.774, 'pos': 0.14, 'compound': 1.0}<br/>
        Republicans: {'neg': 0.082, 'neu': 0.772, 'pos': 0.147, 'compound': 1.0}</strong></li>
    <li>Generally speaking, the two parties use language with similar sentiment.</li>
</ul>
<h2>Word Length</h2>
<ul>
    <li><strong>Winning Candidates: 4.347268518729053<br/>Losing Candidates: 4.354639316239316</strong></li>
    <li>The average word length for a candidate is about 4.35 characters, and winners and losers are nearly identical.</li>
    <li><strong>Democrats: 4.3845054900200005<br/>Republicans: 4.317131587006303</strong></li>
    <li>The average word length of candidates from the two parties is very similar, although the Democrats tend to use slightly longer words than the Republicans.</li>
    <li>An interesting pattern is that the average word length in presidential debates has decreased over time.</li>
    <img src="./avg_word_length.png" />
</ul>
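
<ul><li>A decline over time like this can be summarized with a least-squares slope of average word length against election year. The sketch below is one way to compute it with the standard library only. The function name <code>least_squares_slope</code> is hypothetical, and the numbers are placeholders for illustration only; they are NOT the project's measurements.</li></ul>

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Placeholder values for illustration only; NOT the project's measurements
years = [1960, 1980, 2000, 2020]
avg_lengths = [4.6, 4.5, 4.4, 4.3]

slope = least_squares_slope(years, avg_lengths)
print(slope)  # a negative slope means word length falls as the year increases
```

<ul><li>The slope is expressed in characters per year, so multiplying it by the number of years covered gives a rough estimate of the total change over the period.</li></ul>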

<h1>Issues Encountered</h1>
<ul><li>Given the nature of our project, it was difficult to find suitable data to train our models on. Our debates span 1960 to 2020, and commonly used language has changed greatly over that period, so it was hard to find a model that could handle all of the different words used across the debates.</li></ul>
<ul><li>Opinion polling data was lacking for the older elections. For the newer elections, we were able to find a decent number of opinion polls from both before and after the debates, but for the older elections this data was much less comprehensive.
</li></ul>


<h1>Future Areas of Research</h1>
<ul>
    <li>In the future, it would likely be helpful to perform this analysis on other sources of speech that a candidate gives, such as speeches at campaign rallies, media appearances, and other long form speeches. </li>
    <li>Another interesting analysis of presidential debates could examine posture, facial expressions, and tone of voice during the debate. Libraries exist that provide analysis of these factors.</li>
    <li>While we only performed this analysis on presidential debates, we could also repeat this process for vice presidential debates and presidential primary debates.</li>
</ul>


<h1>In Conclusion</h1>
<ul><li><strong>While we did find some interesting patterns in the text from the debates, we ultimately did not find any strong predictors that correlate with election success.</strong></li>
 </ul>
 <ol>
    <li><strong>How did the winning candidates’ word choice differ from the losing candidates’?</strong><br/><br/> Generally, there were few differences between the most commonly used words of candidates who won and candidates who lost the election. <br/><br/> </li>
    <li><strong>Did candidates who won the election talk about particular issues more than their opponents?</strong><br/><br/> Candidates typically responded to the questions they were asked, so they discussed the same issues to roughly the same extent.<br/><br/></li>
    <li><strong>Do Ad Hominem/Personal Attacks increase a candidate’s chances of winning?</strong> <br/><br/>Using negative sentiment as a rough proxy for personal attacks, the winning candidates tend to use slightly more negative language than losing candidates, but the difference is small. Thus, it is likely that personal attacks do not have a large impact on the outcome. <br/><br/></li>
    <li><strong>Under what circumstances are politicians more likely to use positive, neutral, and negative language?</strong><br/><br/> Generally, a candidate running for a second term speaks slightly more positively, while a challenger is more likely to use negative language.<br/><br/> </li>
    <li><strong>Do candidates who use longer or shorter words on average in debates perform better or worse?</strong><br/><br/> The average word length used in debates seems to have little to no effect on election results.</li>
</ol>
 