# Predicting an Election from Tweets     

[Michaël Juillard](mailto:michael.juillard@epfl.ch),
[Mikhail Vorobiev](mailto:mikhail.vorobiev@epfl.ch),
[Chiara Ercolani](mailto:chiara.ercolani@epfl.ch)

# 1 Problem Definition

Nowadays social media like Twitter and Facebook are means by which people continuously express their opinions on any matter. Thus, data mining combined with data analysis could be a great way to understand the general feeling of users about a certain matter.
With this in mind, we decided to try to predict the result of an election using data from Twitter.

We focused on the US Senate elections of 2016, since they provide enough meaningful data for our analysis: there are many candidates and many people tweet about them, which gives us a large dataset to train our algorithm with. We started by mining data from Twitter, collecting every tweet regarding every Republican and Democratic Senate candidate during the two months preceding the elections. We collected tweets aimed at the candidate (@candidateprofile) or containing a hashtag referring to the candidate (#candidateprofile).
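As a minimal sketch of this selection rule, the function below checks whether a tweet addresses a candidate by mention or by hashtag. The name `mentions_candidate` and the sample tweets are illustrative assumptions; they are not part of the mining tool we used.

```python
import re

def mentions_candidate(tweet, handle):
    """Return True if the tweet contains @handle or #handle (case-insensitive)."""
    # The trailing \b keeps "@marcorubio" from matching "@marcorubio2"
    pattern = r'[@#]' + re.escape(handle) + r'\b'
    return re.search(pattern, tweet, flags=re.IGNORECASE) is not None

# Made-up tweets, for illustration only
print(mentions_candidate("Great speech by @marcorubio tonight", "marcorubio"))
print(mentions_candidate("Voting today #MarcoRubio", "marcorubio"))
print(mentions_candidate("Nothing about the Senate here", "marcorubio"))
```

The first two calls match and the third does not, since the last tweet contains neither the mention nor the hashtag.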

The second step was to perform a sentiment analysis on these tweets to understand whether the user was expressing a positive or negative feeling towards the candidate. Sentiment analysis combines natural language processing, text analysis and computational linguistics to assess the attitude of a writer towards a topic. There are various tools for it; we picked one called Pattern, developed by the University of Antwerp in Belgium.

Afterwards we trained a machine learning algorithm, a linear regression with multiple outputs, on the sentiment analysis data.

Finally we used other sentiment analysis data from the same election and analyzed whether our algorithm was able to predict the election result or not. 

# 2 Resources

For the project we used a variety of tools. Here are the links to their websites.

Links used for Twitter data mining
* [US Senate Elections 2016 Wikipedia page](https://en.wikipedia.org/wiki/United_States_Senate_elections,_2016)
* [Twitter REST API](https://dev.twitter.com/rest/public)
* [Tool to get older Tweets](https://github.com/Jefferson-Henrique/GetOldTweets-python)


Tool used for the sentiment analysis
* [Sentiment Analysis Tool](http://www.clips.ua.ac.be/pattern)

Papers about election prediction with tweets
* [Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment](http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/download/1441/1852)
* [Limits of Electoral Predictions Using Twitter](http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/viewFile/2862/3254)
* [On Using Twitter to Monitor Political Sentiment and Predict Election Results](https://www.aclweb.org/anthology/W/W11/W11-3702.pdf)
* [Predicting US Primary Elections with Twitter](http://snap.stanford.edu/social2012/papers/shi.pdf)

# 3 Web scraping

We started by finding the list of candidates for the US Senate elections in 2016 and looking up their Twitter accounts. We decided to only use Democratic and Republican candidates, as the other parties are not very relevant in the US.

Twitter APIs only allow mining data from the past week, so we used a tool (GetOldTweets-python, listed in the resources above) to bypass this limitation and collect data from the 8th of September 2016 to the 8th of November 2016, the day of the elections.
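The collection window can be enumerated with the standard library, as in the short sketch below. The variable names are assumptions made for illustration; the point is that the window, counted inclusively, spans 62 calendar days, which is the number of daily bins used later for the mean.

```python
from datetime import date, timedelta

start = date(2016, 9, 8)    # first day of collection
end = date(2016, 11, 8)     # election day, included in the window

# One entry per calendar day in the collection window
days = [start + timedelta(days=n) for n in range((end - start).days + 1)]

print(len(days))  # 62
```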

The output of this step is kept in the *data* folder of our project.

# 4 Data Analysis

## 4.1 Sentiment Analysis

Sentiment analysis is used to measure the polarity of a text. Polarity indicates whether the text leans towards a negative or a positive attitude to the topic it discusses.

Pattern, used to perform sentiment analysis for this project, is a natural language processing toolkit. In particular, we used pattern.en, which was made for the English language.

The **sentiment( )** function returns a **(polarity, subjectivity)**-tuple for the given sentence, based on the adjectives it contains, where polarity is a value between -1.0 and +1.0 and subjectivity between 0.0 and 1.0. 

**Important note** : this sentiment analysis tool requires the usage of Python 2 

In [53]:
from pattern.en import sentiment
import numpy as np 
import os

for file in os.listdir("./data"):
    if file.endswith(".csv"):
        path = r'data/'+file
        print(file)
        # "output_<name>.csv" -> "<name>", reused in the name of the output file
        var = os.path.basename(path)
        var= str.split(var,'_')
        var=str.split(var[1],'.')
        sentiment_path = 'sentiment_analysis_v1/sentiment_'+var[0]+'.csv'

        # input columns: 1 = date, 2 = retweets, 3 = favourites, 4 = tweet text
        # comments='++++' replaces the default '#' marker, so hashtags do not cut the line
        text=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(4,))
        date=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(1,))
        retweet=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(2,))
        favourites=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(3,))

        ## replace '#' and '@' with spaces so that hashtags and mentions are read as plain words
        text=np.core.defchararray.replace(text,'#',' ')
        text=np.core.defchararray.replace(text,'@',' ')

        # one row per tweet: (polarity, subjectivity)
        array_sentiment=np.zeros((len(text),2))

        # perform sentiment analysis 
        for i in range(len(text)):
            array_sentiment[i] = sentiment(text[i])
            # a polarity that is non-zero only through rounding error is reset to zero
            # (this resets the whole row, polarity and subjectivity)
            if (array_sentiment[i][0]!=0.0 and abs(array_sentiment[i][0])<1e-16):
                array_sentiment[i]=0

        # save sentiment analysis to output file
        np.savetxt(sentiment_path,np.transpose([date,array_sentiment[:,0],array_sentiment[:,1],retweet,favourites]),fmt="%s;%s;%s;%s;%s",delimiter=';',header="date;sentiment;subjectivity;retweets;favourites")


output_got#barackobama @barackobama.csv
output_gotCampbellforLa.csv
[ 0.  0.]
26
output_gotCatherineForNV.csv
[ 0.  0.]
4107
[ 0.  0.]
5155
[ 0.  0.]
5295
[ 0.  0.]
6353
output_gotChrisvance123.csv
output_gotChrisVanHollen.csv
output_gotChuckGrassley.csv
[ 0.  0.]
64
[ 0.  0.]
2787
[ 0.  0.]
3762
[ 0.  0.]
5977
output_gotDanCarterCT.csv
output_gotGovernorHassan.csv
output_gotJasonKander.csv
[ 0.  0.]
1055
output_gotJerryMoran.csv
output_gotjimbarksdale.csv
[ 0.  0.]
103
output_gotJimGrayLexKY.csv
[ 0.  0.]
253
output_gotJohnKennedyLA.csv
[ 0.  0.]
333
[ 0.  0.]
334
output_gotKamalaHarris.csv
[ 0.  0.]
4977
output_gotKathyforMD.csv
[ 0.  0.]
848
output_gotKellyAyotte.csv
[ 0.  0.]
1009
[ 0.  0.]
1718
[ 0.  0.]
1967
[ 0.  0.]
3091
[ 0.  0.]
3593
[ 0.  0.]
5837
[ 0.  0.]
8457
[ 0.  0.]
8476
[ 0.  0.]
8626
[ 0.  0.]
12289
[ 0.  0.]
17046
[ 0.  0.]
17685
[ 0.  0.]
18549
[ 0.  0.]
19804
[ 0.  0.]
20081
[ 0.  0.]
20086
[ 0.  0.]
22323
[ 0.  0.]
23604
[ 0.  0.]
23817
[ 0.  0.]
24799
output_got

## 4.2 Unique User Identification

We decided to identify the number of unique users who tweeted about a certain candidate and use this number as a feature for our machine learning algorithm.

In [None]:
import numpy as np
import codecs
import os
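for file in os.listdir("./data"):
    if file.endswith(".csv"):
        cand = r'data/'+file
        # column 0 identifies the author of each tweet;
        # the file is closed automatically at the end of the with block
        with codecs.open(cand, encoding = 'latin-1') as filecp:
            mydata = np.loadtxt(filecp, dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(0,))
        # file name, number of tweets, number of unique users
        print(file + " " + str(len(mydata)) + " " + str(len(np.unique(mydata))))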

## 4.3 Mean 

We decided to compute a daily mean of the polarity value given by the sentiment analysis of the tweets. The mean is weighted by the number of likes and retweets that each tweet received: we assumed that likes and retweets mean that people agree with the content of the tweet, so such tweets deserve a higher weight. A tweet with no likes and no retweets gets a weight of 1, and tweets with a polarity of exactly 0 are left out of the mean.
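The weighting rule can be sketched in a few lines of plain Python. This is a compact illustration and not the notebook cell itself: the function name `weighted_daily_mean`, the tuple layout and the sample values are assumptions made for the example.

```python
from collections import defaultdict

def weighted_daily_mean(tweets):
    """tweets: iterable of (day, polarity, retweets, favourites) tuples.

    Returns {day: weighted mean polarity}. Tweets with polarity 0.0 are skipped,
    and a tweet with no retweets and no favourites gets weight 1.
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for day, polarity, retweets, favourites in tweets:
        if polarity == 0.0:
            continue  # no usable sentiment score for this tweet
        weight = (retweets + favourites) or 1
        weighted_sum[day] += weight * polarity
        total_weight[day] += weight
    return {day: weighted_sum[day] / total_weight[day] for day in weighted_sum}

# Made-up sample values, for illustration only
sample = [
    ("2016-11-08", 0.5, 3, 1),   # weight 4
    ("2016-11-08", -0.5, 0, 0),  # weight 1
    ("2016-11-07", 0.2, 0, 0),   # weight 1
    ("2016-11-07", 0.0, 9, 9),   # skipped: polarity is 0.0
]
print(weighted_daily_mean(sample))
```

For the 8th of November the result is (4 × 0.5 + 1 × (−0.5)) / 5 = 0.3: the popular positive tweet outweighs the negative one that nobody reacted to.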


In [54]:
import numpy as np 
import os

for file in os.listdir("./sentiment_analysis_v1"):
    if file.endswith(".csv"):
        path = r'sentiment_analysis_v1/'+file
        print(file)
        var = os.path.basename(path)
        var= str.split(var,'_')
        var=str.split(var[1],'.')
        mean_path = 'mean_v1/mean_'+var[0]+'.csv'
        sentim=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(1,))
        date=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(0,))
        retweet=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(3,))
        favourites=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(4,))

        array_length=62
        mean=[0]*array_length


        #create day and month arrays
        array_day=np.append(np.arange(8,0,-1),(np.append(np.arange(31,0,-1),np.arange(30,7,-1))))
        array_month=np.append(np.ones(8)*11,(np.append(np.ones(31)*10,np.ones(23)*9)))

        # parse the date and the month and create arrays for them
        day =np.zeros(len(date))
        month=np.zeros(len(date))
        for i in range(len(date)):
            day[i]=np.datetime64(date[i]).astype(object).day
            month[i]=np.datetime64(date[i]).astype(object).month


        #create array of the weights (based on retweets and favourites)
        array_weight = np.zeros(len(sentim))
        for i in range(len(retweet)):
            if(retweet[i]!=0.0 or favourites[i]!=0.0):
                array_weight[i]=retweet[i]+favourites[i]
            else:
                # a tweet without retweets or favourites still counts once
                array_weight[i]= 1

        #compute the weighted sum and the total weight for each day
        cnt_date=0
        cnt_mean=[0]*array_length

        for i in range(len(sentim)):
            # tweets with polarity 0.0 carry no sentiment information and are skipped
            if (sentim[i]!=0.0):
                # tweets are assumed to be sorted from newest to oldest, like array_day:
                # advance to the bin of this tweet's date (stop at the last bin)
                while ((array_day[cnt_date]!=day[i] or array_month[cnt_date]!= month[i])
                       and cnt_date <len(array_day)-1):
                    cnt_date=cnt_date + 1

                mean[cnt_date]=mean[cnt_date]+array_weight[i]*sentim[i]
                cnt_mean[cnt_date]=cnt_mean[cnt_date]+array_weight[i]



        weighted_mean=[None]*array_length

        for i in range(len(mean)):
            # divide only when the day has at least one weighted tweet
            if (cnt_mean[i]!=0):
                weighted_mean[i]=mean[i]/cnt_mean[i]
            else:
                weighted_mean[i]=0
                
        # save output on file
        np.savetxt(mean_path,np.transpose([array_day,array_month,mean,cnt_mean,weighted_mean]),fmt="%d;%d;%s;%s;%s",delimiter=';',header="day;month;sum;weight;mean")



sentiment_got#barackobama @barackobama.csv
sentiment_gotCampbellforLa.csv
sentiment_gotCatherineForNV.csv
sentiment_gotChrisvance123.csv
sentiment_gotChrisVanHollen.csv
sentiment_gotChuckGrassley.csv
sentiment_gotDanCarterCT.csv
sentiment_gotGovernorHassan.csv
sentiment_gotJasonKander.csv
sentiment_gotJerryMoran.csv
sentiment_gotjimbarksdale.csv
sentiment_gotJimGrayLexKY.csv
sentiment_gotJohnKennedyLA.csv
sentiment_gotKamalaHarris.csv
sentiment_gotKathyforMD.csv
sentiment_gotKellyAyotte.csv
sentiment_gotLorettaSanchez.csv
sentiment_gotmarcorubio.csv
sentiment_gotMikeCrapo.csv
sentiment_gotPatrickMurphyFL.csv
sentiment_gotpattyforiowa.csv
sentiment_gotPattyMurray.csv
sentiment_gotRandPaul.csv
sentiment_gotRepJoeHeck.csv
sentiment_gotRepKirkpatrick.csv
sentiment_gotRickSantorum.csv
sentiment_gotRonWyden.csv
sentiment_gotRoyBlunt.csv
sentiment_gotSenatorIsakson.csv
sentiment_gotSenatorKirk.csv
sentiment_gotSenBlumenthal.csv
sentiment_gotSenEvanBayh.csv
sentiment_gotSenJohnMcCain.csv
senti

# 5 Machine Learning Algorithm

We used a linear regression algorithm with multiple outputs.
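To illustrate what a linear regression with multiple outputs looks like, here is a minimal least-squares sketch with NumPy. It is not our training code: the feature columns, the target columns and every number below are made-up assumptions chosen only to show the shape of the problem.

```python
import numpy as np

# Hypothetical training data: one row per candidate.
# Assumed features: mean weighted polarity, number of unique users.
X = np.array([[0.10, 1200.0],
              [0.25, 5400.0],
              [-0.05, 800.0],
              [0.15, 3000.0]])
# Assumed targets: vote share in percent, 1.0 if the candidate won else 0.0.
Y = np.array([[44.0, 0.0],
              [57.0, 1.0],
              [38.0, 0.0],
              [52.0, 1.0]])

# Append a column of ones so the model learns an intercept for each output
X_design = np.hstack([X, np.ones((X.shape[0], 1))])

# Ordinary least squares; W has one column of coefficients per output
W, _, _, _ = np.linalg.lstsq(X_design, Y, rcond=None)

# Predict both outputs for a new, made-up candidate
x_new = np.array([[0.20, 4000.0, 1.0]])
print(x_new.dot(W))
```

With several outputs, ordinary least squares fits each output column independently against the same features, so a single call yields one coefficient vector per output.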

## 5.1 Training Phase

## 5.2 Prediction Phase

# 6 Conclusions

Sentiment analysis, like any other natural language processing task, is hard to perform and does not give extremely accurate results. Performing it on tweets is tricky because many slang words are used and the analysis often returns no score at all. We were aware of this, but saw that with a large dataset a reasonable number of tweets could still be processed by the sentiment analysis and give meaningful results.

Another issue that we encountered was that the number of tweets differed a lot depending on the popularity of the candidate. However, since we looked at a major election in a large country, we were still able to obtain a fairly large dataset.

We used a single election so that all candidates were analyzed under the same conditions.


In [51]:
from pattern.en import sentiment
import numpy as np 
import os

path = r'data/output_gotCatherineForNV.csv'
# print the file being processed ("file" is not defined in this cell)
print(os.path.basename(path))
var = os.path.basename(path)
var= str.split(var,'_')
var=str.split(var[1],'.')
sentiment_path = 'sentiment_analysis_v1/sentiment_'+var[0]+'.csv'

text=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(4,))
date=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(1,))
retweet=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(2,))
favourites=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(3,))

## remove hashtags and @ from the tweets 
text=np.core.defchararray.replace(text,'#',' ')
text=np.core.defchararray.replace(text,'@',' ')

array_sentiment=np.zeros((len(text),2))


# perform sentiment analysis 
for i in range(len(text)):
    array_sentiment[i] = sentiment(text[i])
    if (array_sentiment[i][0]<0.0000000000000001 and array_sentiment[i][0]> -0.0000000000000001 and array_sentiment[i][0]!=0.0):
        array_sentiment[i]=0
        print(array_sentiment[i])
        print(i)

        
        
# save sentiment analysis to output file
np.savetxt(sentiment_path,np.transpose([date,array_sentiment[:,0],array_sentiment[:,1],retweet,favourites]),fmt="%s;%s;%s;%s;%s",delimiter=';',header="date;sentiment;subjectivity;retweets;favourites")


sentiment_gotCatherineForNV.csv
[ 0.  0.]
4107
[ 0.  0.]
5155
[ 0.  0.]
5295
[ 0.  0.]
6353
