# Predicting Feelings: Tracking Twitter's Response to Donald Trump

Donald Trump was the president-elect of the United States. Some have argued that his victory came as a surprise because of political echo chambers on social media platforms like Facebook and Twitter. As defined on Wikipedia, an echo chamber is “a metaphorical description of a situation in which information, ideas, or beliefs are amplified or reinforced by communication and repetition inside a defined system.” Because social media algorithms recommend personalized content, a user may see only information that aligns with their point of view and miss perspectives that do not.

For now, the goal is to tackle an intermediate step toward investigating echo chambers on Twitter: analyzing and exploring the sentiment of Twitter users with regard to the recent election. We explored sentiment in individual tweets over time and developed generalized methods to identify tweets belonging to a particular sentiment, in order to move closer to identifying echo chambers.

## Current Research

Data is readily available at no cost through Twitter’s Streaming and REST APIs. The company also provides a convenient method for sharing information through the retweet (RT). A retweet is analogous to an echo in an echo chamber. Furthermore, tweets are limited in the number of characters that can be sent. This short length makes large-scale analysis much easier.

One study used machine learning to predict political orientation and measure political homophily on Twitter. The results provide evidence of certain attributes being linked to increased political homophily (see Colleoni, Rozza, and Arvidsson, 2014). This favors the existence of echo chambers on Twitter, assuming echo chambers are defined by heightened levels of homophily.

In light of the recent election, there has been little if any quantitative analysis of the role that echo chambers played in the election of Donald Trump. Quantifying the impact of echo chambers on the election results is beyond the scope of this project. However, identifying tweets belonging to a political echo chamber will further our understanding of the way information spreads around the internet. Future research can reveal the virtues and follies of echo chambers.

Data collection commenced using Twitter’s Streaming API on 19 October 2016. Bad JSON data, computer processing errors, or internet deficiencies made continuous data collection unreliable. Instead, data was collected semi-frequently until a few days before election day. From 5 November 2016, I collected every day until election day. On election day, I collected data continuously beginning at 6:00 pm until almost 2:00 am on 9 November 2016. More data was periodically collected during smaller time intervals.

Identifying echo chambers using Twitter data is a challenge because echo chambers are time dependent. Sentiment, on the other hand, can be tracked as it develops over time. The ability to classify a tweet as positive or negative at a specific time will allow us to follow how disgruntled or gruntled Twitter is as the election progresses. The data that this project investigates necessarily involves past, present, and future data associated with the keywords used in the API.

Twitter's Streaming API delivers a randomized sample of incoming tweets that contain a keyword chosen by the developer. The tweet data is delivered in JSON format and contains several variables related to the tweet as well as the person who sent the tweet. For convenience, I'll refer to the person who sent the tweet as the user. Two sets of tweets were collected, one with the keyword "Trump" and the other with the keyword "Clinton". The tweet data was appended to a file called 'trump.txt' or 'clinton.txt'. Since Donald Trump won the election, the Trump keyword will be the primary keyword used for analysis.
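The parsing step described above can be sketched as follows. This is a minimal illustration, not the project's `utils.py` code: it assumes one JSON object per line in the collected file and uses the standard field names from Twitter's tweet JSON (`timestamp_ms`, `text`, and the `user` counts). The function names `parse_tweet_line` and `load_tweets` and the sample lines are made up for the example.

```python
import json


def parse_tweet_line(line):
    """Return the fields of interest from one JSON line, or None if unusable."""
    try:
        tweet = json.loads(line)
    except ValueError:  # malformed or truncated JSON
        return None
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    if not isinstance(user, dict) or "text" not in tweet or "timestamp_ms" not in tweet:
        return None  # not a tweet, e.g. a stream notice without text or user
    return {
        "ts": int(tweet["timestamp_ms"]),        # Unix timestamp in milliseconds
        "usr_fol": user.get("followers_count"),  # followers of the user
        "usr_fri": user.get("friends_count"),    # accounts the user follows
        "usr_n_stat": user.get("statuses_count"),
        "text": tweet["text"],
    }


def load_tweets(lines):
    """Parse an iterable of JSON lines; return (records, number_of_bad_lines)."""
    records, errors = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = parse_tweet_line(line)
        if record is None:
            errors += 1
        else:
            records.append(record)
    return records, errors


# Made-up sample: one usable tweet, one non-tweet message, one truncated line.
sample = [
    '{"timestamp_ms": "1476885613849", "text": "Trump rally tonight", '
    '"user": {"followers_count": 684, "friends_count": 1221, "statuses_count": 4048}}',
    '{"limit": {"track": 12}}',
    '{"timestamp_ms": "14768856',
]
records, errors = load_tweets(sample)
print(len(records), errors)
```

Counting unusable lines instead of raising keeps a long-running collection file readable even when some lines are corrupted.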

The variables collected were the time the tweet was sent (Unix timestamp), the number of user followers, the number of user friends, the number of user tweets to date, and the text of the tweet. To work more efficiently with the data, we store it in an object called TwitterCorpus. This class lives in a module called `utils.py`, which also contains auxiliary functions that together create a data processing pipeline. The data can be loaded as demonstrated below.

## Data Cleaning

The TwitterCorpus object has a variety of methods for data cleaning and feature extraction. The principal method is the `clean_text` method. It uses regular expressions to go through each tweet, identify certain attributes of the tweet, and store them as attributes of the TwitterCorpus object. The method accepts a boolean keyword, `remove_vars_from_tweet`, which determines whether the extracted variables are removed from the tweet after they have been stored. For the purposes of this exposition, the variables will remain in the tweets.

Each observation corresponds to a tweet. Below are the variable names and their corresponding descriptions:

+ **time**: the date and time the tweet was collected. Time is in standard 24 hour format.

+ **usr_fol**: the number of people following the user

+ **usr_n_stat**: the number of statuses (tweets) to date of the user

+ **usr_fri**: the number of people that the user is following

+ **n_weblinks**: the number of URLs in the tweet

+ **n_mentions**: the number of people mentioned in the tweet

+ **n_hashtags**: the number of hashtags in the tweet

+ **RT**: whether or not the tweet was a retweet
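As a rough illustration of how the text-derived variables above could be computed with regular expressions, here is a minimal sketch. It is not the actual `clean_text` implementation: the patterns, the function name `extract_features`, and the sample tweet are assumptions made for the example, and only the `remove_vars_from_tweet` keyword is taken from the description above.

```python
import re

# Simple illustrative patterns; real tweets have edge cases these do not cover.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_PREFIX_RE = re.compile(r"^RT\s+@\w+:?\s*")


def extract_features(text, remove_vars_from_tweet=False):
    """Count URLs, mentions, and hashtags, and flag retweets, in one tweet."""
    urls = URL_RE.findall(text)
    # Drop URLs first so that fragments such as "#section" are not counted.
    no_urls = URL_RE.sub(" ", text)
    mentions = MENTION_RE.findall(no_urls)
    hashtags = HASHTAG_RE.findall(no_urls)
    features = {
        "n_weblinks": len(urls),
        "n_mentions": len(mentions),
        "n_hashtags": len(hashtags),
        "RT": int(RT_PREFIX_RE.match(text) is not None),
        "mentions": mentions,
        "hashtags": hashtags,
    }
    if remove_vars_from_tweet:
        cleaned = RT_PREFIX_RE.sub("", no_urls)
        cleaned = MENTION_RE.sub(" ", cleaned)
        cleaned = HASHTAG_RE.sub(" ", cleaned)
        text = " ".join(cleaned.split())  # collapse leftover whitespace
    features["text"] = text
    return features


# Made-up tweet for demonstration.
tweet = "RT @example_user: After decades of waiting https://t.co/abc123 #Trump"
kept = extract_features(tweet)
stripped = extract_features(tweet, remove_vars_from_tweet=True)
print(kept["n_weblinks"], kept["n_mentions"], kept["n_hashtags"], kept["RT"])
print(stripped["text"])
```

In this sketch the handle being retweeted counts as a mention.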

Another method is the `convert_time` method. It takes the Unix timestamp of the tweet and converts it into a `datetime` object, a type from the Python standard library. This is helpful for time-series analyses performed on the data. These methods are demonstrated below.
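The timestamp conversion can be sketched with the standard library alone. This is an assumption-laden illustration rather than the body of `convert_time`: it treats the timestamp as milliseconds since the Unix epoch (which matches the 13-digit `ts` values in this data set) and returns a timezone-aware UTC value. The helper name `ms_to_datetime` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(timestamp_ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    # Integer milliseconds are added as a timedelta to avoid float rounding.
    return EPOCH + timedelta(milliseconds=int(timestamp_ms))


dt = ms_to_datetime(1476885613849)
print(dt.isoformat())  # 2016-10-19T14:00:13.849000+00:00
```

The sample rows in this notebook show 07:00:13 for the same timestamp, which suggests the original conversion used a local time zone seven hours behind UTC. That reading is an inference from the output, not something the text states.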

The TwitterCorpus object can also generate pandas `DataFrame` objects of the tweet variable statistics. There were 2,666,819 tweets collected for the Trump data set and 2,124,664 tweets collected for the Clinton data set on or before 9 November 2016. As mentioned before, because of the outcome of the election we use only the keyword "Trump" and analyze only the sentiment of tweets regarding Trump.

The cells below read the data into a TwitterCorpus object and call the member functions that convert it into the DataFrame used in the rest of the analysis.

Along with the previously mentioned variables, the DataFrame also holds quantitative values for the sentiment of each tweet. These scores are as follows:

+ **neg**: Negative score.

+ **neu**: Neutral score.

+ **pos**: Positive score.

+ **comp**: Composite score calculated as a normalized aggregate of the negative, neutral, and positive scores.
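The text does not say which analyzer produced these scores, but the column names match the output of NLTK's VADER analyzer, whose `polarity_scores` method returns `neg`, `neu`, `pos`, and `compound`. Assuming VADER is the source, the composite score is the summed word valence of the tweet squashed into the range [-1, 1]. The sketch below shows only that normalization step, with VADER's default constant of 15; the function name `normalize_valence` is made up for the example.

```python
import math


def normalize_valence(total_valence, alpha=15.0):
    """Squash a summed valence into [-1, 1] (VADER-style normalization)."""
    norm = total_valence / math.sqrt(total_valence * total_valence + alpha)
    return max(-1.0, min(1.0, norm))


# A neutral tweet (summed valence 0) maps to 0.0.
print(normalize_valence(0.0))
# Larger magnitudes approach, but never reach, +1 or -1.
print(round(normalize_valence(1.8), 4))
```

Under this formula a summed valence of 1.8 maps to roughly 0.4215, the same value as `comp` in the first sample row. That is consistent with VADER being the source but does not prove it.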

These are the features we will use to train and test various models that classify the sentiment of Trump tweets. They will also allow us to look later at how the sentiment changes with time and whether we can predict the sentiment of future tweets in response to Trump's actions.

In [8]:
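# Project module that defines TwitterCorpus and the helper functions
import utils as ut
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression

# Path to the collected "Trump" keyword tweets on the local machine
filename = "/home/byu.local/fishekoa/myacmeshare/lovetrumpshate/capstone/data/before_inaug/trump.txt"

# Load the tweets into a TwitterCorpus object
Trump = ut.TwitterCorpus(filename, n=None, m=None)
# Extract the URL, mention, hashtag, and retweet variables from each tweet
Trump.clean_text()
# Convert the Unix timestamps to datetime objects
Trump.convert_time()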

Loading file...

Errors: 0
Time: 9.88429689407
Cleaning text...
Time: 24.7653501034
Converting time to datetime object...
Time: 0.191492795944


In [3]:
df = Trump.make_df()
df.head()

Creating DataFrame...
Time: 607.17642808


| Unnamed: 0 | ts | usr_fol | usr_n_stat | usr_fri | n_weblinks | n_mentions | n_hashtags | RT | neg | neu | pos | comp | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-10-19 07:00:13.849 | 1476885613849 | 684.0 | 4048.0 | 1221.0 | 1 | 2 | 0 | 0 | 0.0 | 0.811 | 0.189 | 0.4215 | I liked a @YouTube video from @thefader https:... |
| 2016-10-19 07:00:14.680 | 1476885614680 | 90.0 | 772.0 | 107.0 | 1 | 0 | 0 | 0 | 0.169 | 0.608 | 0.223 | 0.2023 | Trump Calls US Elections Rigged, Blockchain Co... |
| 2016-10-19 07:00:14.523 | 1476885614523 | 242.0 | 11124.0 | 273.0 | 0 | 0 | 1 | 0 | 0.101 | 0.776 | 0.123 | 0.144 | Because most men are good guys and realize and... |
| 2016-10-19 07:00:14.734 | 1476885614734 | 10.0 | 271.0 | 10.0 | 0 | 3 | 0 | 0 | 0.164 | 0.643 | 0.193 | 0.1027 | @hmrstrm45 @mitchellvii @nypost I'm smart enou... |
| 2016-10-19 07:00:14.711 | 1476885614711 | 1730.0 | 27397.0 | 2370.0 | 1 | 1 | 0 | 1 | 0.0 | 1.0 | 0.0 | 0.0 | RT @kurteichenwald: After decades of wrecking ... |


Along with the dataframe, data on the actual tweets is also available. The TwitterCorpus object extracts the hashtags and mentions from the text of the tweets. A mention is when the user includes in their tweet the Twitter handle of a different user.

If echo chambers exist in the Trump data set, there may be a difference between tweets and retweets. Below are the tables comparing a selection of summary statistics of tweets and retweets from the Trump data set. Notice that the mean and standard deviation of the `usr_fol` variable are much smaller for the retweet data set. The retweet data set also has more mentions on average than the tweet data set.
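A comparison of this kind can be produced by grouping the DataFrame on the `RT` flag. The sketch below uses a tiny made-up DataFrame with the same column names. The numbers are purely illustrative and are not results from the Trump data set.

```python
import pandas as pd

# Toy data only: six invented rows, three tweets (RT=0) and three retweets (RT=1).
toy = pd.DataFrame({
    "RT": [0, 0, 0, 1, 1, 1],
    "usr_fol": [684, 90, 15000, 242, 10, 1730],
    "n_mentions": [2, 0, 0, 1, 3, 1],
})

# One row per group, one column per (variable, statistic) pair.
summary = toy.groupby("RT")[["usr_fol", "n_mentions"]].agg(["mean", "std", "median"])
print(summary)
```

Reporting the median next to the mean is a deliberate choice here: follower counts tend to be heavily skewed, so a few very large accounts can dominate the mean and standard deviation.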

## Data Visualization

Derek

## Machine Learning

Many of the algorithms we learned in class were either poorly suited to our problem or too complicated to be worth implementing at this stage. The following algorithms were not appropriate for our use:

+ Nearest Neighbors
+ Linear Regression
+ Ridge Regression
+ Mixture Models with Latent Variables
+ Decision Trees
+ Random Forests
+ Kalman Filter
+ ARMA
+ Neural Networks

Ranked by test score, the vanilla (out-of-the-box) implementations of our algorithms gave these results:

+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ XGBoost (.6157)
+ Naive Bayes (.6118)
+ Quad Discriminant Analysis (.5828)
+ Linear Discriminant Analysis (.5577)
+ Logistic Regression (.3830)
+ Linear SVM (.3830)

Many of these did not improve with modification or regularization. In particular, Linear SVM and Logistic Regression were very poor performers and did not improve when the parameters were changed. Interestingly, Quadratic Discriminant Analysis (QDA) performed poorly when `reg_param` > 1, getting the same score as Logistic Regression and Linear SVM. Most of the other algorithms improved when parameters were tweaked. After some experimentation, the algorithms ranked as follows by the highest score achieved:

+ XGBoost (.6186)
+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ Quad Discriminant Analysis (.6170)
+ Naive Bayes (.6118)
+ Linear Discriminant Analysis (.5879)
+ Logistic Regression (.3830)
+ Linear SVM (.3830)

It's interesting that Gradient Boosting performed better out of the box than XGBoost, but XGBoost improved a fair amount with tuned regularization. The largest gap was within the SVM family, where Polynomial SVM was much better than Linear SVM. There does appear to be something funky going on, since the scores of these two models sum to one. With binary labels, complementary scores like these are what one would see if each model predicted a single class for every test tweet, one the majority class and the other the minority class. We have not confirmed that this is the cause. Quad Discriminant Analysis also gained quite a bit, from .5828 to .6170, moving it from 5th place into a tie with Polynomial SVM for 3rd.
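One way to probe scores that sum to one is to compare them with constant-prediction baselines. The sketch below uses toy binary labels chosen so that 61.7% are positive. It does not use the project's data, and it only illustrates the arithmetic: a model that always predicts the majority class and one that always predicts the minority class have accuracies that add up to one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def is_constant(predictions):
    """True if a model predicted the same class for every observation."""
    return len(set(predictions)) == 1


# Toy labels: 617 positive and 383 negative test tweets (invented counts).
y_test = [1] * 617 + [0] * 383

always_pos = [1] * len(y_test)
always_neg = [0] * len(y_test)

acc_pos = accuracy(y_test, always_pos)
acc_neg = accuracy(y_test, always_neg)
print(acc_pos, acc_neg)
print(is_constant(always_pos), is_constant(always_neg))
```

Running `is_constant` on each fitted model's test predictions would show directly whether any model has collapsed to a single class.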

Of course, these scores don't tell us we have a good model. It remains to be seen how the models classify new tweets and whether those tweets are truly positive or negative toward Trump. However, tree-based models do appear to have an advantage for our problem (at least without using other features from the text data). Polynomial SVM and Quad Discriminant Analysis also warrant more investigation and experimentation. Since Linear SVM and Logistic Regression were such poor performers, they are likely not the best algorithms to use. As noted before, there is something suspicious about their scores, so it would be unwise to reject these algorithms outright. Manipulating the data or using other parameters might improve these models, though this seems unlikely.

The key to using tree-based methods for our problem is knowing how to adjust the regularization, since trees are well known to overfit the data. Next steps include cross validation with tree-based models, Polynomial SVM, and QDA. Feature engineering on the text data may be useful as well. However, textual data may best be modeled by something more complex, like a neural network. It is unfortunate that the individual models don't get much better than about 62% accuracy. An ensemble model might improve the score drastically, as each model may capture different information about Trump sentiment. What this tells us is that, despite trying to make the outcome variable either very pro-Trump or very anti-Trump, the problem of sentiment classification is very hard, especially when relying on models and data built into the NLTK library.

## Preface

All the algorithms we run eventually reach a maximum score of .61704, which is very strange given the variety of the algorithms and tuning parameters. We are still not sure what is going on in the data. We will continue to investigate. One possibility is that our predictions break because the training set is very different from the testing set. The former is all the data before election day and the latter is election day and into the day after the election. People were very surprised on election day and so the data may be fundamentally different and only able to predict so much. But even this is unlikely because we ran the algorithms using only two variables, the timestamp for the independent (X) variable and the sentiment for the dependent (Y) variable. Most algorithms didn't budge and some, like Naive Bayes, only budged a little. Most still gave us a score of .61704.
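The train/test design described above can be sketched as a split on a time cutoff, followed by a comparison of the class balance on each side. The record layout (`time`, `label`), the helper names, the midnight cutoff on 8 November 2016, and the toy records are all assumptions made for illustration.

```python
from datetime import datetime


def split_by_time(records, cutoff):
    """Records before the cutoff go to training, the rest to testing."""
    train = [r for r in records if r["time"] < cutoff]
    test = [r for r in records if r["time"] >= cutoff]
    return train, test


def positive_rate(records):
    """Share of records labelled positive (1); NaN for an empty list."""
    if not records:
        return float("nan")
    return sum(r["label"] for r in records) / len(records)


ELECTION_DAY = datetime(2016, 11, 8)  # assumed cutoff: start of election day

# Invented records, not real tweets.
toy_records = [
    {"time": datetime(2016, 10, 19, 7, 0), "label": 1},
    {"time": datetime(2016, 11, 5, 12, 0), "label": 0},
    {"time": datetime(2016, 11, 7, 23, 59), "label": 1},
    {"time": datetime(2016, 11, 8, 18, 0), "label": 0},
    {"time": datetime(2016, 11, 9, 1, 30), "label": 0},
]

train, test = split_by_time(toy_records, ELECTION_DAY)
print(len(train), len(test))
print(positive_rate(train), positive_rate(test))
```

If the positive rate differs sharply between the two sides, a shift in the label distribution is one candidate explanation for models that fail to generalize. If the test-side majority share is close to the score that keeps recurring, that would point toward constant predictions instead.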

We ran three new algorithms. The first is a nearest centroid classifier, which gave us the same score of .61704. We also built a voting ensemble classifier, which takes the best algorithms we had up to now and uses them to vote on the outcome of the tweets. It got very close to the .61704 score. Finally, we ran an ARMA model to try to predict the movement of tweet sentiment over time. We are still working on understanding how the `statsmodels` implementation works, but a naive implementation seemed to fit the data pretty well for the most part.
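The voting idea can be illustrated with a plain majority vote over the class predictions of several models. This is a sketch of hard voting in general, not the ensemble built for this project. The three prediction lists are invented, and scikit-learn's `VotingClassifier` offers the same idea for fitted estimators.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model class predictions by taking the most common vote."""
    voted = []
    for votes in zip(*predictions_per_model):
        # With an odd number of binary voters there are no ties to break.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Invented predictions from three hypothetical models on four tweets.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # [1, 0, 1, 1]
```

A vote can only differ from its members where they disagree, so if most member models predict nearly the same class for every tweet, the ensemble will land on nearly the same score as they do.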

Ranked by test score, the vanilla (out-of-the-box) implementations of our algorithms gave these results:

+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ XGBoost (.6157)
+ Naive Bayes (.6118)
+ Quad Discriminant Analysis (.5828)

After tuning, the best scores achieved were:

+ XGBoost (.6186)
+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ Quad Discriminant Analysis (.6170)
+ Naive Bayes (.6118)