# Machine Learnin'

This notebook reports the test scores of several algorithms we have learned about in class. It begins with the import statements and loads the data into a dataframe. Neutral tweets are then removed so that the outcome is a binary classification, the feature matrix X is defined, and the data are ready for training the models. The models we chose are Naive Bayes, Gradient Boosting, XGBoost, Logistic Regression, Support Vector Machines, and Discriminant Analysis. A discussion of the methods and algorithms follows the code and test results. In that discussion, we review the model assumptions, strengths, and pitfalls. We also address which algorithms we did not use and why, followed by a brief summary of how these results inform our research question.

In [1]:
# import statements
import utils as ut
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import xgboost as xgb
import matplotlib.pyplot as plt
%matplotlib inline
reload(ut)



<module 'utils' from 'utils.pyc'>

In [2]:
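#c,df,T = ut.make_train_test()
fname = ut.get_file()  # prompts for which data file to load (options shown below)
T = pd.read_csv(fname)
# index by tweet time: 'ts' is in epoch milliseconds, shifted back 7 hours
T.index = pd.to_datetime(T['ts'],unit='ms') - pd.DateOffset(hours=7)
# round the 'comp' sentiment score to the nearest of -1, 0, 1
T['outcome'] = np.around(T['comp'].values)
T.tail(3)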


	Options

            1: trump from lab computer

            2: trump from linux mint

            3: clean trump from lab computer

            4: clean trump from linux mint


Enter number >> 4


| ts (index) | ts | usr_fol | usr_n_stat | usr_fri | n_weblinks | n_mentions | n_hashtags | neu | comp | pos-neg | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-11-09 01:46:29.622 | 1478681189622 | 1587.0 | 17840.0 | 253.0 | 0 | 1 | 0 | 0.795 | -0.5574 | -0.205 | -1.0 |
| 2016-11-09 01:46:29.638 | 1478681189638 | 686.0 | 1897.0 | 1805.0 | 0 | 0 | 0 | 0.663 | -0.8235 | -0.337 | -1.0 |
| 2016-11-09 01:46:30.684 | 1478681190684 | 232.0 | 51.0 | 188.0 | 0 | 0 | 1 | 0.652 | 0.8402 | 0.348 | 1.0 |
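
The datetime index and the `outcome` column in the table above can be reproduced by hand for a single row. The following is a minimal sketch in plain Python 3 (no pandas), assuming that rounding `comp` to the nearest integer is all that `np.around` does here, so a tweet is labeled -1 or 1 only when its `comp` score lies beyond ±0.5. The helper names are hypothetical and are not part of `utils`.

```python
from datetime import datetime, timedelta

def local_time_from_ms(ts_ms, offset_hours=-7):
    """Convert epoch milliseconds to a naive datetime shifted by a fixed number of hours."""
    utc = datetime(1970, 1, 1) + timedelta(milliseconds=ts_ms)
    return utc + timedelta(hours=offset_hours)

def outcome_from_comp(comp):
    """Round a score in [-1, 1] to the nearest of -1, 0, 1."""
    return int(round(comp))

# Values copied from the first row of the table above
first_ts, first_comp = 1478681189622, -0.5574
print(local_time_from_ms(first_ts))   # 2016-11-09 01:46:29.622000
print(outcome_from_comp(first_comp))  # -1
```

Under this reading, any tweet with a `comp` score between -0.5 and 0.5 is labeled neutral (0).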


In [3]:
print "Proportion of tweets with neutral sentiment:",sum(T.outcome == 0)*1./len(T)
print "Proportion of tweets with negative sentiment:",sum(T.outcome == -1)*1./len(T)
print "Proportion of tweets with positive sentiment:",sum(T.outcome == 1)*1./len(T)

Proportion of tweets with neutral sentiment: 0.451176830223
Proportion of tweets with negative sentiment: 0.247292360129
Proportion of tweets with positive sentiment: 0.301530809649


In [4]:
print len(T)
T = T[T['outcome'] != 0]
print len(T)
X = T.drop(['outcome','comp','pos-neg'],axis=1)
print X.columns

338579
185820
Index([u'ts', u'usr_fol', u'usr_n_stat', u'usr_fri', u'n_weblinks',
       u'n_mentions', u'n_hashtags', u'neu'],
      dtype='object')
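
As a consistency check, the proportions printed earlier should account for the drop from 338579 to 185820 rows, and they also give the class balance of the remaining data. A short sketch using only the numbers printed above (the variable names are illustrative):

```python
n_total = 338579
p_neg = 0.247292360129
p_pos = 0.301530809649

# Rows kept after dropping neutral tweets
n_kept = int(round(n_total * (p_neg + p_pos)))
print(n_kept)  # 185820

# Class balance among the remaining tweets
share_pos = p_pos / (p_neg + p_pos)
share_neg = p_neg / (p_neg + p_pos)
print(round(share_pos, 4), round(share_neg, 4))  # 0.5494 0.4506
```

A classifier that always predicted "positive" would therefore score about 0.549 on the full filtered data. The balance within the held-out period may differ.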


# Model Fitting!

### Naive Bayes
Ha! Typical. So naive. Noob!

In [5]:
nb = GaussianNB()
nb.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
nb.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61184274284729778
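
Every model in this notebook is trained on tweets up to and including November 7 and scored on tweets from November 8 onward, using pandas date-string slicing on the datetime index. The same idea in plain Python 3, as a sketch with made-up rows (the `split_by_date` helper is hypothetical):

```python
from datetime import datetime

def split_by_date(rows, first_test_day):
    """Split (timestamp, label) rows into train (before the day) and test (on or after it)."""
    train = [r for r in rows if r[0] < first_test_day]
    test = [r for r in rows if r[0] >= first_test_day]
    return train, test

rows = [
    (datetime(2016, 11, 6, 14, 0), 1),
    (datetime(2016, 11, 7, 23, 59), -1),  # last minute of Nov 7: still training data
    (datetime(2016, 11, 8, 0, 0), 1),     # start of Nov 8: test data
    (datetime(2016, 11, 9, 1, 46), -1),
]
train, test = split_by_date(rows, datetime(2016, 11, 8))
print(len(train), len(test))  # 2 2
```

Splitting by time rather than at random means each score measures how well a model trained on earlier tweets generalizes to later ones.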

### Gradient Boosted Regression Trees
GBRT = Great Britain ReTweets

In [6]:
from sklearn.ensemble import GradientBoostingClassifier

In [7]:
gbc = GradientBoostingClassifier()
gbc.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
gbc.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61834666781526271

### XGBoost
Great Britain's more attractive cousin.

In [8]:
xgbc = xgb.XGBClassifier()
xgbc.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
xgbc.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61574079059299858

In [32]:
xgbc = xgb.XGBClassifier(gamma=2,reg_lambda=.5)
xgbc.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
xgbc.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61863740618303598
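
For context on the two parameters changed in this cell: in XGBoost's standard formulation, each tree $f$ with $T$ leaves and leaf weights $w_j$ carries the penalty below, where $\gamma$ corresponds to `gamma` and $\lambda$ to `reg_lambda`. This summarizes the library's published objective and is not something derived in this notebook.

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

A candidate split is worthwhile only when its gain is positive:

$$\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

Here $G$ and $H$ are the sums of first and second derivatives of the loss over the samples in the left ($L$) and right ($R$) child. Relative to the defaults (`gamma=0`, `reg_lambda=1`), `gamma=2` demands a larger loss reduction before a split is made, while `reg_lambda=.5` shrinks the leaf weights less.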

### Logistic Regression
This one goes to eleven.

In [10]:
#c,df,T = ut.make_train_test()

## Vanilla implementation
lrc = LogisticRegression()
lrc.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
lrc.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.38295627079586936

In [11]:
## Chocolate Regression
lrc = LogisticRegression(C=.1,solver='lbfgs',multi_class='multinomial')
lrc.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
lrc.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.38295627079586936

### Support Vector Machines
Rise of the Machines.

In [12]:
from sklearn.svm import LinearSVC,SVC
lsvc = LinearSVC()
lsvc.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
lsvc.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.38295627079586936

In [13]:
svc = SVC(kernel='poly')
svc.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
svc.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61704372920413064

### Discriminant Analysis
A nice way of saying 'racial profiling'.

In [14]:
qda = QuadraticDiscriminantAnalysis()
qda.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
qda.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.5827904422453617

In [15]:
qda = QuadraticDiscriminantAnalysis(reg_param=1)
qda.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
qda.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61704372920413064

In [21]:
qda = QuadraticDiscriminantAnalysis(reg_param=.9)
qda.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
qda.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61148739595335266

In [17]:
qda = QuadraticDiscriminantAnalysis(reg_param=.7)
qda.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
qda.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.61035674674534546
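
The `reg_param` values tried above control how strongly each class's covariance estimate is shrunk toward the identity matrix. scikit-learn documents the transformation as `(1 - reg_param) * S + reg_param * I`. Below is a minimal sketch of that formula on a hypothetical 2x2 covariance matrix. The numbers are invented for illustration, and this is not scikit-learn's actual code path.

```python
def shrink_toward_identity(cov, reg_param):
    """Return (1 - reg_param) * cov + reg_param * I for a square matrix given as nested lists."""
    n = len(cov)
    return [
        [(1 - reg_param) * cov[i][j] + (reg_param if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]

cov = [[4.0, 1.5],
       [1.5, 9.0]]  # hypothetical class covariance

for reg_param in (0.0, 0.7, 1.0):
    print(reg_param, shrink_toward_identity(cov, reg_param))
```

At `reg_param=1` the class covariances are replaced entirely by the identity, so the fitted covariance structure no longer plays any role and only the class means and priors remain.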

In [19]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
lda.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.55770079791529825

In [20]:
lda = LinearDiscriminantAnalysis(solver='lsqr',shrinkage='auto')
lda.fit(X[:"2016-11-07"],T[:"2016-11-07"]['outcome'])
lda.score(X["2016-11-08":],T["2016-11-08":]['outcome'])

0.58794835625141328

# Discussion

Many of the algorithms we learned in class were either poorly suited to our problem or too complicated to be worth implementing at this stage. Those algorithms are listed here, each with a brief explanation.

+ Nearest Neighbors: the concept of distance does not suit our problem well. It is still possible to use this as a classifier, but the model is not intuitive for a time-sensitive sentiment classification.
+ Linear Regression: we could have tried to predict the continuous sentiment score calculated by the nltk package, but we chose to treat the problem as binary classification, and linear regression is not a good binary classifier.
+ Ridge Regression: this is still regression and not well suited to our problem for the same reasons as above.
+ Mixture Models with Latent Variables: this is good for topic extraction and may be a good way to improve our results through feature engineering. However, modeling the time-sensitive tweet data as a network is also not intuitive, though some aspects of Twitter are certainly networks. Ultimately, this is not ideal for sentiment classification.
+ Decision Trees: these work well but only when multiple trees are trained.
+ Random Forests: better than a decision tree but inferior to gradient boosting.
+ Kalman Filter: sentiment classification does not have a clear state space model that it relies on, though this may be useful in the future.
+ ARMA: similar to the Kalman Filter above. We would need more information to set up the model.
+ Neural Networks: this will be very good for classification based on the text. However, the model is very complicated and simpler algorithms are likely to perform well enough---at least for a benchmark.

Ranked by test score, the vanilla (out-of-the-box) implementations of our algorithms performed as follows:

+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ XGBoost (.6157)
+ Naive Bayes (.6118)
+ Quad Discriminant Analysis (.5828)
+ Linear Discriminant Analysis (.5577)
+ Logistic Regression (.3830)
+ Linear SVM (.3830)

Some of these did not improve with modification or regularization. In particular, Linear SVM and Logistic Regression were very poor performers and did not improve when their parameters were changed. Interestingly, Quad Discriminant Analysis performed poorly when reg_param > 1, getting the same score as Logistic Regression and Linear SVM. Several of the other algorithms (XGBoost, Quad Discriminant Analysis, and Linear Discriminant Analysis) improved when their parameters were tuned. After some experimentation, the algorithms ranked as follows by the highest score each achieved:

+ XGBoost (.6186)
+ Gradient Boosting (.6183)
+ Polynomial SVM (.6170)
+ Quad Discriminant Analysis (.6170)
+ Naive Bayes (.6118)
+ Linear Discriminant Analysis (.5879)
+ Logistic Regression (.3830)
+ Linear SVM (.3830)

It's interesting that Gradient Boosting performed better out of the box than XGBoost, but XGBoost improved with tuned regularization (from .6157 to .6186). The largest gap was within the SVM family, where Polynomial SVM (.6170) scored far higher than Linear SVM (.3830). Something funky does appear to be going on, since the scores of these two models sum to exactly one. Quad Discriminant Analysis also gained quite a bit, from .5828 to .6170, moving it from 5th place into a tie with Polynomial SVM for 3rd.
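
One possible explanation for scores that sum to one, offered here as a hypothesis rather than something verified in this notebook: in a two-class problem, a model that always predicts one class and a model that always predicts the other have accuracies that sum to exactly one. The sketch below uses a hypothetical set of 1000 test labels with roughly the observed 61.7% / 38.3% balance.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))

# Hypothetical test labels: 617 of one class, 383 of the other
y_test = [1] * 617 + [-1] * 383

always_pos = [1] * len(y_test)
always_neg = [-1] * len(y_test)

acc_pos = accuracy(y_test, always_pos)
acc_neg = accuracy(y_test, always_neg)
print(acc_pos, acc_neg, acc_pos + acc_neg)  # 0.617 0.383 1.0
```

The observed scores fit this pattern: Logistic Regression and Linear SVM both score 0.38295627079586936, while Polynomial SVM and Quad Discriminant Analysis with reg_param=1 both score 0.61704372920413064. Checking the distinct values returned by each model's `predict` on the test set would confirm or rule this out.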

Of course, these scores alone don't tell us we have a good model. It remains to be seen how the models classify new tweets and whether those tweets are truly positive or negative toward Trump. However, tree-based models do appear to have an edge for our problem (at least without using other features from the text data). Polynomial SVM and Quad Discriminant Analysis also warrant more investigation and experimentation. Since Linear SVM and Logistic Regression were such poor performers, they are likely not the best algorithms to use. As noted before, though, there is something suspicious about their scores, so it would be unwise to reject these algorithms outright. Manipulation of the data or different parameters might improve these models, although this seems unlikely.
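
One data manipulation worth trying, stated as an assumption rather than a tested fix: the `ts` feature is on the order of 10^12 while `neu` lies between 0 and 1, and linear models can be sensitive to such differences in scale. Standardizing each column to zero mean and unit variance puts the features on a comparable footing. A sketch in plain Python 3, using the three rows displayed by `T.tail(3)` (in practice the mean and standard deviation would be estimated on the training period only):

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

# Three values per feature, copied from the rows printed by T.tail(3)
ts = [1478681189622, 1478681189638, 1478681190684]
neu = [0.795, 0.663, 0.652]

print(max(ts) / max(neu))  # raw scales differ by about 12 orders of magnitude
print(standardize(ts))
print(standardize(neu))
```

After standardizing, both columns take values of similar magnitude, so neither dominates a linear decision function purely because of its units.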

The key to using tree-based methods for our problem is knowing how to adjust the regularization, since trees are well known to overfit. Next steps include cross validation with the tree-based models, Polynomial SVM, and Quad Discriminant Analysis. Feature engineering on the text data may be useful as well, although textual data may be best modeled by something more complex, like a neural network. It is unfortunate that the individual models don't get much better than about 62% accuracy. An ensemble model might improve the score considerably, as each model may capture different information about Trump sentiment. What this tells us is that, despite our attempt to make the outcome variable either clearly pro-Trump or clearly anti-Trump, sentiment classification is a very hard problem, especially when relying on the models and data built into the NLTK library.
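
As one simple form of the ensemble idea, predictions from several fitted classifiers can be combined by majority vote. The sketch below is illustrative only: the three prediction lists are invented, not output from the models in this notebook, and `majority_vote` is a hypothetical helper.

```python
def majority_vote(*model_predictions):
    """Combine per-model lists of -1/+1 predictions by majority vote (odd number of models)."""
    if len(model_predictions) % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    return [1 if sum(votes) > 0 else -1 for votes in zip(*model_predictions)]

# Invented predictions from three hypothetical models on five tweets
preds_a = [1, 1, -1, -1, 1]
preds_b = [1, -1, -1, 1, 1]
preds_c = [-1, 1, -1, 1, 1]

print(majority_vote(preds_a, preds_b, preds_c))  # [1, 1, -1, 1, 1]
```

A vote only helps when the models make different mistakes. If several of them predict a single class for every tweet, the ensemble inherits that behavior.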