# Predicting Feelings: Tracking Twitter's Response to Donald Trump

Donald Trump is the president-elect of the United States. Some have argued that his victory came as a surprise because of the existence of political echo chambers on social media platforms like Facebook and Twitter. As defined on Wikipedia, an echo chamber is “a metaphorical description of a situation in which information, ideas, or beliefs are amplified or reinforced by communication and repetition inside a defined system.” Because social media algorithms recommend personalized information to the user, a user may receive information that excludes perspectives that do not align with their point of view.

The goal of this data project is to investigate echo chambers on Twitter with regard to the recent political election. As the project develops, I will formalize the definition of a Twitter echo chamber, explore its attributes, and develop a generalized method to identify tweets belonging to a political echo chamber.

## Current Research

Most of the research about echo chambers is qualitative. The studies that attempt a quantitative analysis use Twitter data. The attributes of the Twitter platform are convenient for a quantitative analysis of echo chambers. Data is readily available at no cost through Twitter’s Streaming and REST APIs. The company also provides a convenient method for sharing information through the retweet (RT). This is analogous to an echo in an echo chamber. Furthermore, tweets are limited in the number of characters that can be sent. This makes large-scale analysis much easier.

One study used machine learning to predict political orientation and measure political homophily on Twitter. The results link certain attributes to increased political homophily (see Colleoni, Rozza, and Arvidsson, 2014). This is evidence in favor of the existence of echo chambers on Twitter, assuming echo chambers are defined by heightened levels of homophily.

In light of the recent election, there has been little if any quantitative analysis of the role that echo chambers played in the election of Donald Trump. Quantifying the impact of echo chambers on the election results is beyond the scope of this project. However, identifying tweets belonging to a political echo chamber will further our understanding of the way information spreads around the internet. Future research can reveal the virtues and follies of echo chambers.

I began collecting data using Twitter’s Streaming API on 19 October 2016. Bad JSON data, computer processing errors, or internet deficiencies made continuous data collection unreliable. Instead, data was collected semi-frequently until a few days before election day. From 5 November 2016, I collected every day until election day. On election day, I collected data continuously beginning at 6:00 pm until almost 2:00 am on 9 November 2016. Presently, I periodically collect more data during smaller time intervals. Identifying echo chambers using Twitter data is a challenge because echo chambers are time dependent. An echo chamber that exists in a past data set might not exist in a similar data set in the future. The data that this project investigates necessarily involves past, present, and future data associated with keywords used in the API.

Twitter's Streaming API delivers a randomized sample of incoming tweets that contain a keyword chosen by the developer. The tweet data is delivered in JSON format and contains several variables related to the tweet as well as the person who sent the tweet. For convenience, I'll refer to the person who sent the tweet as the user. I chose to collect two sets of tweets, one with the keyword "Trump" and the other with the keyword "Clinton". The tweet data were appended to a file called 'trump.txt' or 'clinton.txt'. Since Donald Trump won the election, the Trump keyword will be the primary keyword used for analysis.

Initially, memory storage was an issue, so I collected a subset of the JSON data. The variables I collected were the time the tweet was sent (Unix timestamp), the number of user followers, the number of user friends, the number of user tweets to date, and the text of the tweet. In order to work more efficiently with the data, I wrote a module to store the data in an object called TwitterCorpus. The module is called `utils.py` and contains other auxiliary functions that create a data processing pipeline. The data can be loaded as demonstrated below.
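
As a rough illustration of the subsetting step, the sketch below pulls the five fields out of one line of Streaming API output and skips lines that fail to parse. The input field names (`timestamp_ms`, `followers_count`, `friends_count`, `statuses_count`, `text`) follow Twitter's v1.1 tweet JSON. The function name `extract_fields` and the layout of the returned dictionary are assumptions made for illustration; they are not taken from `utils.py`.

```python
import json


def extract_fields(raw_line):
    """Return the subset of tweet fields used in this project, or None.

    Hypothetical helper for illustration; the real collection code is in utils.py.
    """
    try:
        tweet = json.loads(raw_line)
    except ValueError:
        # Bad JSON (for example a truncated line) is skipped instead of crashing.
        return None
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    if "text" not in tweet or "timestamp_ms" not in tweet or not isinstance(user, dict):
        # Stream messages that are not tweets (such as limit notices) are skipped.
        return None
    return {
        "ts": int(tweet["timestamp_ms"]),             # Unix time in milliseconds
        "usr_fol": user.get("followers_count", 0),    # number of followers
        "usr_fri": user.get("friends_count", 0),      # number of accounts followed
        "usr_n_stat": user.get("statuses_count", 0),  # number of tweets to date
        "text": tweet["text"],
    }
```

Returning `None` for unusable lines lets a collection loop filter them out without interrupting the stream.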

## Data Cleaning

The TwitterCorpus object has a variety of methods for data cleaning and feature extraction. The principal method is the `clean_text` method. It uses regular expressions to go through each tweet and identify and store certain attributes of the tweet as an attribute of the TwitterCorpus object. The method accepts a keyword `remove_vars_from_tweet`, a boolean that determines whether to remove the extracted variables from the tweet after they have been stored as a TwitterCorpus attribute. For the purposes of this exposition, the variables will remain in the tweets.

Each observation corresponds to a tweet. Below are the variable names and their corresponding descriptions:

+ **time**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;the date and time the tweet was collected. Time is in standard 24 hour format.

+ **usr_fol**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;the number of people following the user

+ **usr_n_stat**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;the number of statuses (tweets) to date of the user

+ **usr_fri**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;the number of people that the user is following

+ **n_weblinks**&nbsp;&nbsp;&nbsp;&nbsp;the number of URLs in the tweet

+ **n_mentions**&nbsp;&nbsp;&nbsp;&nbsp;the number of people mentioned in the tweet

+ **n_hashtags**&nbsp;&nbsp;&nbsp;&nbsp;the number of hashtags in the tweet

+ **RT**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;whether or not the tweet was a retweet
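
The text-derived variables above can be sketched with a few regular expressions. This is a minimal stand-in, not the `clean_text` method itself: the patterns, the function name `extract_tweet_features`, and the rule that a retweet starts with `RT @handle` are all assumptions. Only the `remove_vars_from_tweet` keyword is borrowed from the description above.

```python
import re

# Illustrative patterns; the real clean_text method may use different ones.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
RT_RE = re.compile(r"^RT\s+@\w+")


def extract_tweet_features(text, remove_vars_from_tweet=False):
    """Count URLs, mentions, and hashtags in a tweet and flag retweets."""
    # Strip URLs first so a '#' or '@' inside a link is not counted.
    without_urls = URL_RE.sub(" ", text)
    mentions = MENTION_RE.findall(without_urls)
    hashtags = HASHTAG_RE.findall(without_urls)
    features = {
        "n_weblinks": len(URL_RE.findall(text)),
        "n_mentions": len(mentions),
        "n_hashtags": len(hashtags),
        "RT": bool(RT_RE.match(text)),
        "mentions": mentions,
        "hashtags": hashtags,
    }
    if remove_vars_from_tweet:
        cleaned = HASHTAG_RE.sub(" ", MENTION_RE.sub(" ", without_urls))
        cleaned = re.sub(r"^RT\b\s*:?", " ", cleaned)  # drop the leading RT marker
        text = " ".join(cleaned.split())               # collapse leftover whitespace
    features["text"] = text
    return features
```

With the default `remove_vars_from_tweet=False`, the tweet text is returned unchanged, matching the choice made for this exposition.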

Another method is the `convert_time` method. This takes the Unix timestamp of the tweet and converts it into a `datetime` object, which is part of the Python standard library. This is helpful for time-series analyses performed on the data. These methods are demonstrated below.
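
For readers unfamiliar with the conversion, here is a minimal sketch. It assumes the timestamp is in milliseconds, as in Twitter's `timestamp_ms` field, and returns a timezone-aware UTC value. The function name `to_datetime_utc` is hypothetical and is not the `convert_time` method.

```python
from datetime import datetime, timezone


def to_datetime_utc(ts_ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)


# Midnight UTC at the start of 9 November 2016
print(to_datetime_utc(1478649600000))
```

Keeping the result in UTC and shifting to local time only for display avoids ambiguity around daylight saving changes.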

The TwitterCorpus object can also generate pandas `DataFrame` objects of the tweet variable statistics. There were 2,666,819 tweets collected for the Trump data set and 2,124,664 tweets collected for the Clinton data set on or before 9 November 2016. This is demonstrated below along with the count, mean, and standard deviation for each variable.

It is worth noting that the two data sets above have similar summary statistics. Along with these dataframes, data on the actual tweets is also available. The TwitterCorpus object extracts the hashtags and mentions from the text of the tweets. A mention is when the user includes in their tweet the Twitter handle of a different user.

If echo chambers exist in the Trump and Clinton data sets, there may be a difference between tweets and retweets. Below are the tables comparing a selection of summary statistics of tweets and retweets from the Trump data set. Notice that the mean and standard deviation of the usr_fol variable are much smaller for the retweet data set. The retweet data set also has more mentions on average than the tweet data set.
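
A comparison of this kind can be produced with a pandas `groupby`. The sketch below assumes a DataFrame with the `usr_fol`, `n_mentions`, and `RT` columns described earlier. The six rows are toy values invented for illustration and do not come from the collected data.

```python
import pandas as pd

# Toy rows invented for illustration only; not taken from the Trump data set.
df = pd.DataFrame({
    "usr_fol": [120, 15000, 80, 300, 45, 950],
    "n_mentions": [0, 1, 1, 2, 0, 1],
    "RT": [False, False, True, True, False, True],
})

# Label each row so the grouped table has readable index values.
df["kind"] = df["RT"].map({True: "retweet", False: "tweet"})

# Count, mean, and standard deviation of each variable, split by tweet type.
summary = df.groupby("kind")[["usr_fol", "n_mentions"]].agg(["count", "mean", "std"])
print(summary)
```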

## Data Viz

Most of the Twitter data follows a power law. Histograms of the variables show a very severe curve as demonstrated below.
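
On a linear axis, heavy-tailed counts pile up in the first bar. One common workaround is to bin by order of magnitude instead. The sketch below does this for non-negative integer counts such as follower totals. The function name is hypothetical, and dropping zeros is an assumption made because zero has no position on a logarithmic axis.

```python
from collections import Counter


def log10_bin_counts(values):
    """Count positive integers by order of magnitude.

    Bin k holds values v with 10**k <= v < 10**(k + 1). Zeros are skipped.
    """
    counts = Counter()
    for v in values:
        v = int(v)
        if v > 0:
            # For a positive integer, the order of magnitude is (number of digits - 1).
            counts[len(str(v)) - 1] += 1
    return dict(sorted(counts.items()))


print(log10_bin_counts([0, 3, 7, 12, 99, 100, 4500, 1000000]))
```

If the counts per bin fall roughly along a straight line when plotted on log-log axes, that is consistent with power-law behavior, though it is not a formal test.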

Due to the size of the data, rendering visualizations can be computationally expensive. Visualizing a subset of the data will still provide an accurate visualization. In general, the distribution of the variables does not change over time. This means that we can plot a slice of our data set and still get a reliable picture of what the data look like.

Below are scatterplots representing the number of friends and the number of followers of the Twitter users. A parity line is drawn to help the viewer know where the number of followers equals the number of friends.

This first scatterplot gives good intuition but leaves out some data points. I manually adjusted the axes and excluded some data points. The second scatterplot below includes all of the data points. Again, the grey parity line shows where the number of followers equals the number of friends.

The second scatterplot shows that users either have about as many friends as followers, or almost no friends and many followers. Note the scale on the x-axis. Possible reasons for this appearance could be the presence of companies, Twitter bots, or information sources which are likely to have many followers but few friends. The scatterplot below plots the same data on a log-log scale with the same parity line. There is a clear ceiling on the number of friends just under the 1000 mark on the y-axis for users who have fewer than 1000 followers. As seen above, there is more variance in the number of followers than in the number of friends.

Visualizing text is a challenge. Fortunately, the analysis is not limited to the content of the tweet. Connections between Twitter users form a network. I created two network representations of Twitter data with keywords Trump and #MAGA. The hashtag is an acronym for Trump’s slogan “Make America Great Again”. The visualizations were created in D3.js, a widely used JavaScript library for data visualization. The interactive networks are online at [www.derekmiller.info](http://www.derekmiller.info). The visualization simulates a force-directed graph structure. In this case, the nodes represent Twitter handles such as @realDonaldTrump. If @realDonaldTrump mentioned @dgmllr in a tweet, then they would share a connection represented by a grey line connecting the nodes. The graph uses an algorithm to place similar nodes near each other and unrelated nodes farther apart.

![jpg](trumpmaga.jpg)

Notice the differences between the two visualizations. For the visualization with the keyword Trump, the network seems mostly disconnected. Some clusters exist but they are small and spread out. Contrast this with the visualization of the network with keyword #MAGA. A tight, relatively dense cluster makes up the majority of the entire network. One possible explanation for this difference is that the keyword Trump is a neutral term on average. There are many people who dislike Donald Trump and many others who like him. Both are equally likely (or unlikely) to mention Trump in a tweet. This leads to a disconnected network. The keyword #MAGA is much less likely to be as neutral on average as the Trump keyword. A Trump opponent is much less likely to use the Trump-branded hashtag.
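
The visual contrast between a fragmented network and one dominated by a single cluster can be given a rough number: the share of nodes that fall in the largest connected component. The sketch below builds an undirected mention graph and computes that share. It assumes each tweet is available as an `(author_handle, text)` pair, which goes beyond the variables listed earlier, and both function names are hypothetical.

```python
import re
from collections import defaultdict

MENTION_RE = re.compile(r"@(\w+)")


def build_mention_graph(tweets):
    """Build undirected adjacency sets from (author_handle, text) pairs."""
    graph = defaultdict(set)
    for author, text in tweets:
        author = author.lower()
        graph[author]  # register the author even if nobody is mentioned
        for handle in MENTION_RE.findall(text):
            handle = handle.lower()
            if handle != author:
                graph[author].add(handle)
                graph[handle].add(author)
    return dict(graph)


def largest_component_share(graph):
    """Return the fraction of nodes in the largest connected component."""
    seen, best = set(), 0
    for start in graph:
        if start in seen:
            continue
        # Depth-first search from each unvisited node.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best / len(graph) if graph else 0.0
```

A share near one would correspond to the dense cluster seen for #MAGA, while a small share would correspond to the scattered picture seen for the Trump keyword.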

All of the data described above comes from Twitter. This means that the data is possibly biased. Other social media sites such as Facebook or LinkedIn may describe different users. The results of this project cannot be extrapolated beyond the Twitter platform. However, for the purpose of this project, Twitter data will provide a solution to identifying political echo chambers. I have addressed only a few of many noteworthy observations that still need exploring. For example, what are the most common topics associated with various hashtags or users? How do keyword networks evolve over time? Is there a difference between users who retweet a lot versus those who don't retweet very often?

I hope to answer these questions and more as I learn about machine learning algorithms. Feature engineering and unsupervised algorithms will continue to bring structure to the data set. Finding consistencies over time will help identify how to define an echo chamber and generalize this problem to other domains.

Collection of Twitter data through Twitter’s Streaming API began on 19 October 2016. Data was collected semi-frequently until a few days before election day. Tweets and metadata were also collected on Election Day until 2:00 am on 9 November 2016 and on Inauguration Day 2017.

Twitter's Streaming API delivered a randomized sample of incoming tweets that contain keywords chosen by the developer.
Last semester, Derek wrote a module called `utils.py` to store the data in an object called `TwitterCorpus`. This file was modified to accommodate new objectives for this project. The `TwitterCorpus` object has a variety of methods for data cleaning and feature extraction. The `clean_text` method uses regular expressions to go through each tweet and identify and store certain attributes of the tweet. Another method is the `convert_time` method. This takes the Unix timestamp of the tweet and converts it into a datetime object, which is part of the Python standard library. This will be helpful for time-series analyses performed on the data. The `TwitterCorpus` object generates a pandas `DataFrame` using `make_df`. There were 2,666,819 tweets collected for the Trump keyword data set on election day. For this project, the models will be trained and tested on the election day data set.

Each observation corresponds to a tweet. Below are the variable names and their corresponding descriptions:
+ time: the date and time the tweet was collected. Time is in standard 24 hour format.
+ usr_fol: the number of people following the user
+ usr_n_stat: the number of statuses (tweets) to date of the user
+ usr_fri: the number of people that the user is following
+ n_weblinks: the number of URLs in the tweet
+ n_mentions: the number of people mentioned in the tweet
+ n_hashtags: the number of hashtags in the tweet
+ RT: whether or not the tweet was a retweet
+ neg: the negative valence (sentiment) score of the tweet text
+ neu: the neutral valence score
+ pos: the positive valence score
+ comp: the compound valence score, a normalized sum of the word-level valence scores that ranges from -1 (most negative) to 1 (most positive)
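
For reference, in NLTK these four scores are returned by `SentimentIntensityAnalyzer().polarity_scores(text)` under the keys `neg`, `neu`, `pos`, and `compound`. VADER derives the compound score by summing word-level valence scores and squashing the sum into [-1, 1]. The sketch below reproduces only that final squashing step, with the constant `alpha=15` used by VADER. The example sums passed in are made up, and the function name is chosen for illustration.

```python
import math


def normalize_valence(score_sum, alpha=15):
    """Squash an unbounded sum of valence scores into the range [-1, 1]."""
    normalized = score_sum / math.sqrt(score_sum * score_sum + alpha)
    # Clamp to guard against rounding just outside the range.
    return max(-1.0, min(1.0, normalized))


# Made-up sums: larger magnitudes approach -1 or 1 but never exceed them.
for s in (-8.0, -1.5, 0.0, 1.5, 8.0):
    print(s, round(normalize_valence(s), 4))
```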


Direct answers to questions:

1. All these conditions are handled by `utils.py`
2. The Twitter data are as reliable as the Streaming API's sample. The valence scores are mostly reliable. The library used (`nltk.sentiment.vader`) was designed for use on social media text such as tweets. However, there are unsolved problems in natural language processing, and these scores do not account for them. Despite this, the large sample size should give us enough data to produce a good result. There is no missing data.
3. Our problem is to predict whether a tweet is pro-Trump or anti-Trump. This data set is closely tied to Donald Trump since it was collected during the election cycle using the Trump keyword. The data are sufficient to solve this problem and directly relate to our question: how mad is Twitter at Trump? Some aspects may need further engineering, however, since the valence scores do not correlate perfectly with tweet sentiment. We will not know what to modify until we start exploring algorithms.
4. See proposal

## Machine Learning

This notebook shows the score of several algorithms we have learned about in class. First, you will see the import statements and loading the data into a dataframe. Neutral tweets are removed from the outcome classification. The feature matrix X is defined and the data are ready to train the models. The models we chose to use are Naive Bayes, Gradient Boosting, XGBoost, Logistic Regression, Support Vector Machines, and Discriminant Analysis. After the code and test results are displayed, a discussion about the methods and algorithms will follow. In that discussion, we will review the model assumptions, strengths, and pitfalls. We will also address which algorithms we did not use and why, followed by a brief summary of how these results inform our research question.
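
The preparation steps described above can be sketched in plain Python. Two assumptions are made here and are not confirmed by the notebook: a tweet counts as neutral when its `pos` and `neg` scores are equal, and the feature matrix uses the seven non-sentiment variables. The names `FEATURES` and `make_xy` are hypothetical.

```python
# Assumed feature columns; the notebook's actual X may differ.
FEATURES = ["usr_fol", "usr_n_stat", "usr_fri",
            "n_weblinks", "n_mentions", "n_hashtags", "RT"]


def make_xy(rows):
    """Drop neutral tweets and return (X, y) as plain lists.

    rows: iterable of dicts holding the FEATURES plus 'pos' and 'neg'.
    y is 1 when pos > neg (positive valence) and 0 when pos < neg.
    """
    X, y = [], []
    for row in rows:
        diff = row["pos"] - row["neg"]
        if diff == 0:
            continue  # neutral tweet (assumed definition): excluded from the outcome
        X.append([float(row[name]) for name in FEATURES])
        y.append(1 if diff > 0 else 0)
    return X, y
```

The resulting lists can be passed to any classifier that accepts array-like inputs.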

Many of the algorithms we learned from class were either not well suited for our problem or were more complicated to implement and not worth implementing at this stage. Here are those algorithms and a brief explanation.

+ Nearest Neighbors: the concept of distance does not suit our problem well. It is still possible to use this as a classifier, but the model is not intuitive for a time-sensitive sentiment classification.
+ Linear Regression: we could try to predict the sentiment score calculated from the nltk package, but we decided to treat the problem as classification, and linear regression is not a good binary classifier.
+ Ridge Regression: this is still regression and not well suited to our problem for the same reasons as above.
+ Mixture Models with Latent Variables: this is good for topic extraction and may be a good way to improve our results through feature engineering. However, modeling the time-sensitive tweet data as a network is also not intuitive, though some aspects of Twitter are certainly networks. Ultimately, this is not ideal for sentiment classification.
+ Decision Trees: these work well but only when multiple trees are trained.
+ Random Forests: better than a decision tree but inferior to gradient boosting.
+ Kalman Filter: sentiment classification does not have a clear state space model that it relies on, though this may be useful in the future.
+ ARMA: similar to the Kalman Filter above. Need more information to set up the model.
+ Neural Networks: this will be very good for classification based on the text. However, the model is very complicated and simpler algorithms are likely to perform well enough---at least for a benchmark.

In order, the vanilla (out-of-the-box) implementations of our algorithms gave us these results for their rank according to test score:

+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ XGBoost (.6157)
+ Naive Bayes (.6118)
+ Quad Discriminant Analysis (.5828)
+ Linear Discriminant Analysis (.5577)
+ Logistic Regression (.3830)
+ Linear SVM (.3830)

Many of these did not improve with modification or regularization. In particular, Linear SVM and Logistic Regression were very poor performers and did not improve when the parameters were changed. Interestingly, Quad Discriminant Analysis performed poorly when reg_param > 1, getting the same score as logistic regression and linear SVM. Most of the other algorithms improved when parameters were changed or tweaked. After some experimentation, the following algorithms performed best based on the highest score achieved.

+ XGBoost (.6186)
+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ Quad Discriminant Analysis (.6170)
+ Naive Bayes (.6118)
+ Linear Discriminant Analysis (.5879)
+ Logistic Regression (.3830)
+ Linear SVM (.3830)

It's interesting that Gradient Boosting performed better out of the box than XGBoost, but XGBoost improved a fair amount with tuned regularization. The largest gain in improvement came from the SVM family, where Polynomial SVM was much better than Linear SVM. There does appear to be something funky going on, since the scores of these two models sum to one. Quad Discriminant Analysis also gained quite a bit, rising from .5828 to .6170 and moving from 5th to a tie with Polynomial SVM for 3rd best.

Of course, these scores don't tell us we have a good model. It remains to be seen how the models interpret new tweets and whether those tweets are truly positive or negative toward Trump. However, it does appear that tree-based models are clearly advantageous for our problem (at least, without using other features from the text data). Polynomial SVM and Quad Discriminant Analysis also warrant more investigation and experimentation. Since Linear SVM and Logistic Regression were such poor performers, they are likely not the best algorithms to use. As noted before, there is something suspicious about their scores. It would be unwise to reject these algorithms outright. Manipulation of the data or using some other parameters might improve these models. This seems unlikely though.

The key to using tree-based methods for our problem is knowing how to adjust the regularization, since trees are well-known to overfit the data. Next steps include cross validation with tree-based models and Poly SVM and QDA. Feature engineering on the text data may be useful as well. However, textual data may best be modeled by something more complex, like a Neural Network. It is unfortunate that the individual models don't get much better than about 62% accurate. An ensemble model might improve the score drastically as each model may capture different information about Trump sentiment. What this tells us is that despite trying to make the outcome variable either very pro or very anti-Trump, the problem of sentiment classification is very hard---especially when relying on models and data built into the NLTK library.

## Preface

All the algorithms we run eventually reach a maximum score of .61704, which is very strange given the variety of the algorithms and tuning parameters. We are still not sure what is going on in the data. We will continue to investigate. One possibility is that our predictions break because the training set is very different from the testing set. The former is all the data before election day and the latter is election day and into the day after the election. People were very surprised on election day and so the data may be fundamentally different and only able to predict so much. But even this is unlikely because we ran the algorithms using only two variables, the timestamp for the independent (X) variable and the sentiment for the dependent (Y) variable. Most algorithms didn't budge and some, like Naive Bayes, only budged a little. Most still gave us a score of .61704.
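
One diagnostic that may help here: a model that always predicts the same class scores exactly that class's share of the test labels, and the scores of the two constant predictions sum to one. Comparing the scores above against these baselines would show whether some models have collapsed to a single class. The sketch below computes the baselines. The toy labels are invented to mirror the reported scores and are not taken from the data.

```python
from collections import Counter


def constant_prediction_scores(y_true):
    """Return the accuracy obtained by always predicting each class."""
    counts = Counter(y_true)
    total = len(y_true)
    return {label: count / total for label, count in counts.items()}


# Invented labels: 617 of one class and 383 of the other.
toy_labels = [1] * 617 + [0] * 383
print(constant_prediction_scores(toy_labels))
```

If a model's test score matches one of these values, inspecting the distribution of its predictions is a quick way to confirm or rule out this explanation.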

We ran three new algorithms. The first is a nearest neighbors centroid classifier, which gave us the same score of .61704. We also built a voting ensemble classifier, which takes the best algorithms we had up to now and uses them to vote on the outcome of the tweets. It got very close to the .61704 score. We also ran an ARMA model to try to predict the movement of tweet sentiment over time. We are still working on understanding how the `statsmodels` implementation works, but a naive implementation seemed to fit the data pretty well for the most part.
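
The idea behind hard voting can be shown in a few lines. This is a plain-Python sketch of the concept, not the ensemble used in the notebook (scikit-learn offers `VotingClassifier` for this purpose). The function name and the example predictions are made up.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model predictions by taking the most common label per tweet.

    predictions_per_model: list of equal-length lists, one list per model.
    With an odd number of models and two classes, ties cannot occur.
    """
    voted = []
    for votes in zip(*predictions_per_model):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Three hypothetical models voting on four tweets
print(majority_vote([[1, 0, 1, 1],
                     [1, 1, 0, 1],
                     [0, 0, 1, 1]]))
```

An ensemble of models that all make the same predictions cannot beat its members, which may be one reason a voting classifier lands near the same score as its inputs.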

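```python
# import statements
import utils as ut
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Reload utils.py so edits are picked up without restarting the kernel.
# reload() is a builtin on Python 2; on Python 3 it lives in importlib.
reload(ut)
plt.style.use('acme')  # custom local style sheet (acme.mplstyle)
```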


```python
#c,df,T = ut.make_train_test()
# get_file() prompts for a data source; option 3 (clean trump from lab computer) was entered for this run
fname = ut.get_file()
T = pd.read_csv(fname)
# 'ts' holds Unix timestamps in milliseconds; subtract 7 hours to shift the index away from UTC
T.index = pd.to_datetime(T['ts'],unit='ms') - pd.DateOffset(hours=7)
T.tail()
df = pd.read_csv("../../../trumpdf.csv")
df.index = pd.to_datetime(df['ts'],unit='ms') - pd.DateOffset(hours=7)
```

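```python
# Histogram of net valence (positive minus negative score) for df (trumpdf.csv)
(df['pos']-df['neg']).plot(kind='hist',bins=150,linewidth=0)
plt.title("Tweet Valence (sentiment)")
plt.show()
```

```python
# Same histogram for T, using its 'pos-neg' column
T['pos-neg'].plot(kind='hist',bins=150,linewidth=0)
plt.title("Tweet Valence (sentiment)")
plt.show()
```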