# Donaldometer
### measuring presidential sentiment on Twitter

We began collecting Twitter data with Twitter's Streaming API on 19 October 2016. Data were collected periodically until a few days before Election Day. Tweets and metadata were also collected on Election Day, until 2:00 am on 9 November 2016, and on Inauguration Day 2017.

Twitter's Streaming API delivered a random sample of incoming tweets that contain keywords chosen by the developer.

Last semester, Derek wrote a module called `utils.py` that stores the data in an object called `TwitterCorpus`. We modified this file to accommodate the new objectives of this project. The `TwitterCorpus` object has a variety of methods for data cleaning and feature extraction. The `clean_text` method uses regular expressions to go through each tweet and identify and store certain attributes of the tweet. The `convert_time` method takes the Unix timestamp of each tweet and converts it into a `datetime` object from the Python standard library, which will be helpful for time-series analyses of the data. Finally, `make_df` generates a `pandas` `DataFrame` from the `TwitterCorpus` object.

The Trump keyword data set contains 2,666,820 tweets collected on Election Day. For this project, the model will be trained and tested on this Election Day data set.
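As a rough illustration of the kind of work `clean_text` and `convert_time` do, the sketch below counts URLs, mentions, and hashtags with regular expressions and converts a Unix timestamp to a `datetime`. The patterns, the function names `extract_features` and `to_datetime`, and the example tweet are assumptions made for illustration; the actual code in `utils.py` may differ. The sketch also assumes the timestamp is given in seconds and interprets it as UTC.

```python
import re
from datetime import datetime, timedelta

# Hypothetical patterns; the ones used in utils.py may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RETWEET_RE = re.compile(r"^RT @\w+")


def extract_features(text):
    """Count URLs, mentions, and hashtags, and flag retweets (1 or 0)."""
    return {
        "n_weblinks": len(URL_RE.findall(text)),
        "n_mentions": len(MENTION_RE.findall(text)),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "RT": int(RETWEET_RE.match(text) is not None),
    }


def to_datetime(unix_seconds):
    """Convert a Unix timestamp in seconds to a naive UTC datetime."""
    return datetime(1970, 1, 1) + timedelta(seconds=unix_seconds)


# Made-up tweet text and an illustrative timestamp, not taken from our data.
features = extract_features(
    "RT @some_user: Example tweet #Election2016 https://t.co/abc123"
)
stamp = to_datetime(1478655990)

print(features)
print(stamp)
```

Treating a leading `RT @user` as the retweet marker is only a heuristic on the text; the tweet metadata may offer a more reliable signal.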

Each observation corresponds to a tweet. Below are the variable names and their corresponding descriptions:
+ time: the date and time the tweet was collected, in 24-hour format
+ usr_fol: the number of people following the user
+ usr_n_stat: the number of statuses (tweets) to date of the user
+ usr_fri: the number of people that the user is following
+ n_weblinks: the number of URLs in the tweet
+ n_mentions: the number of people mentioned in the tweet
+ n_hashtags: the number of hashtags in the tweet
+ RT: whether or not the tweet was a retweet (1 if it was, 0 otherwise)
+ neg: the negative valence (sentiment) score of the tweet text
+ neu: the neutral valence score
+ pos: the positive valence score
+ comp: the compound valence score, the sum of the word-level valence scores normalized to lie between -1 (most negative) and +1 (most positive)
+ text: the text of the tweet
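To make the compound score more concrete: to our understanding, VADER sums the valence of the words in a text and then squashes that sum into the range -1 to +1 using x / sqrt(x^2 + alpha), with alpha = 15. The sketch below reimplements that normalization in plain Python. The name `normalize_valence` is our own, and the input sums are made-up values rather than scores from our data.

```python
import math


def normalize_valence(total, alpha=15.0):
    """Squash an unbounded valence sum into the interval [-1, 1].

    Assumption: this mirrors the normalization VADER applies to obtain
    the compound score; alpha = 15 is the default we believe it uses.
    """
    score = total / math.sqrt(total * total + alpha)
    # Guard against floating-point drift outside the interval.
    return max(-1.0, min(1.0, score))


# Made-up valence sums, for illustration only.
for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalize_valence(total), 4))
```

A sum of zero stays at zero, which is why tweets with no recognized sentiment words receive a compound score of 0.0.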


Direct answers to questions:

1. All these conditions are handled by `utils.py`.
2. The Twitter data are as reliable as the Streaming API's sample. The valence scores are mostly reliable: the library used (`nltk.sentiment.vader`) was designed for use on social media text such as tweets. However, natural language processing still has unsolved problems, such as sarcasm, and the scores will not account for them. Despite this, the large sample size should give us enough data to produce a good result. There are no missing data.
3. Our problem is to predict whether a tweet is pro-Trump or anti-Trump. The data set is closely tied to Donald Trump, since the tweets were selected by the Trump keyword and collected during the election cycle. The data are sufficient to solve this problem and relate directly to our question: how mad is Twitter at Trump? Some features may need further engineering, however, since the valence scores do not correlate perfectly with tweet sentiment. We will not know what to modify until we start exploring algorithms.
4. See the proposal.

In [9]:
import utils as ut
import numpy as np
from sklearn.linear_model import LogisticRegressionCV,LogisticRegression
reload(ut)

<module 'utils' from 'utils.pyc'>

In [10]:
filename = ut.get_file()


	Options

            1: trump from lab computer

            2: trump from linux mint

            3: clinton from lab computer

            4: clinton from linux mint


Enter number >> 1


In [11]:
reload(ut)
Trump = ut.TwitterCorpus(filename,n=None,m=None)
Trump.clean_text()
Trump.convert_time()

Loading file...

Errors: 0
Time: 12.1520459652
Cleaning text...
Time: 25.2041931152
Converting time to datetime object...
Time: 7.17876005173


In [12]:
print len(Trump.tweets),sum(Trump.retweets)

2666820 1887599


In [13]:
df = Trump.make_df()

Time: 602.116089106


In [14]:
df.tail()

| | time | usr_fol | usr_n_stat | usr_fri | n_weblinks | n_mentions | n_hashtags | RT | neg | neu | pos | comp | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2666815 | 2016-11-09 01:46:30 | 450.0 | 34016.0 | 375.0 | 1 | 1 | 0 | 1 | 0.0 | 1.0 | 0.0 | 0.0 | RT @dave_izeidi: Mdr qui ne tente rien n'a rie... |
| 2666816 | 2016-11-09 01:46:30 | 3707.0 | 1030.0 | 265.0 | 0 | 1 | 0 | 1 | 0.0 | 0.786 | 0.214 | 0.5994 | RT @SJTEMI: Yoooo. Well if Donald Trump can be... |
| 2666817 | 2016-11-09 01:46:30 | 187.0 | 3072.0 | 328.0 | 0 | 1 | 1 | 1 | 0.651 | 0.247 | 0.102 | -0.9499 | RT @OTRADaily: Trump didn't win. Racism won. S... |
| 2666818 | 2016-11-09 01:46:30 | 71.0 | 7743.0 | 6.0 | 1 | 0 | 1 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | #SIGUEMEYTESIGO Presidencia de Trump inquieta ... |
| 2666819 | 2016-11-09 01:46:31 | 537.0 | 5730.0 | 752.0 | 0 | 1 | 0 | 1 | 0.256 | 0.744 | 0.0 | -0.7318 | RT @starboysivan: I CANT BELIEVE FLORIDA OU5 O... |


In [15]:
df = df[df['RT'] == 0]
df.tail()

| | time | usr_fol | usr_n_stat | usr_fri | n_weblinks | n_mentions | n_hashtags | RT | neg | neu | pos | comp | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2666799 | 2016-11-09 01:46:30 | 106.0 | 3257.0 | 89.0 | 0 | 0 | 0 | 0 | 0.179 | 0.695 | 0.127 | -0.2607 | trump voters, y'all asks for this. if y'all ar... |
| 2666801 | 2016-11-09 01:46:30 | 1520.0 | 56622.0 | 2047.0 | 0 | 1 | 0 | 0 | 0.307 | 0.394 | 0.299 | -0.0258 | @M1Jarvis trump win. The world gonna die\n |
| 2666806 | 2016-11-09 01:46:30 | 323.0 | 1646.0 | 510.0 | 1 | 0 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | Found the silver lining https://t.co/XW0WRhjlyI\n |
| 2666809 | 2016-11-09 01:46:30 | 5139.0 | 18214.0 | 479.0 | 1 | 0 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | If we look at all the blue. Hilary has 21% of ... |
| 2666818 | 2016-11-09 01:46:30 | 71.0 | 7743.0 | 6.0 | 1 | 0 | 1 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | #SIGUEMEYTESIGO Presidencia de Trump inquieta ... |


In [16]:
N = len(df)
df.n_weblinks.sum()/float(N)

0.5919283489536344

In [17]:
df.n_mentions.sum()/float(N)

0.40259053593268146

In [18]:
df.n_hashtags.sum()/float(N)

0.39475707148549644

In [19]:
print N

779221
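The counts printed in the cells above are consistent with one another: subtracting the number of retweets from the total number of tweets gives exactly the number of rows left after filtering on `RT == 0`. The short check below works through that arithmetic using only the values printed above; the variable names are our own.

```python
# Values copied from the cell outputs above.
n_total = 2666820     # len(Trump.tweets)
n_retweets = 1887599  # sum(Trump.retweets)
n_filtered = 779221   # N, the number of rows left after dropping retweets

# Original (non-retweet) tweets are whatever remains after removing retweets.
n_original = n_total - n_retweets

# float() keeps the division exact under Python 2 as well as Python 3.
retweet_share = n_retweets / float(n_total)

print(n_original == n_filtered)
print(round(retweet_share, 3))
```

Roughly 71% of the Election Day tweets were retweets, so the retweet filter removes most of the data set.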
