# Performance comparison

In this notebook, we train logistic regression, naive Bayes, and decision tree classifiers on the same data. We compare them on their final scores and on the time it takes to fit each model.

In [1]:
import sys
sys.path.append('../scripts')

import numpy as np
import pandas as pd
import helpers_models as hm
from transforms import *
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

Let's prepare the data sample we will train and test on. We want to use the same data for all models.

In [2]:
data = pd.read_hdf('../data/pivot_numbers_only.h5', 'data')

In [3]:
target_week = data['tweets'].columns.max()
target_week

36

In [4]:
first_week = data['tweets'].columns.min()
first_week

23

In [5]:
data = hm.make_target(data, target_week)

Before training, let's balance the dataset and use a small sample of it.

In [6]:
data = hm.balance_data(data)
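
The implementation of `hm.balance_data` is not shown in this notebook. One common way to balance a binary target is to undersample the majority class; the sketch below illustrates that idea on toy rows in plain Python. The function name `undersample` and the row format are assumptions made for illustration only, not the project's actual helper.

```python
import random
from collections import Counter

def undersample(rows, label_of, seed=0):
    """Return a shuffled subset of rows with equally many rows per label."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    # Every class is cut down to the size of the smallest class.
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): (user_id, target) pairs with 8 negatives and 2 positives.
rows = [(i, i >= 8) for i in range(10)]
balanced = undersample(rows, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced)
print(counts[True], counts[False])
```

After balancing, both classes have the same number of rows, which is what makes plain precision and recall comparable between the two classes later on.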

In [7]:
# With the default settings, 75% of the rows go to train and 25% to test
train, test = train_test_split(data)

In [8]:
train.shape

(1434270, 53)

Here is a pipeline to process the data. All models will use the data transformed by this pipeline.

In [9]:
data_pipe = Pipeline([
    ('limiter', WeeksLimiter(first_week, target_week)),
    ('normal', Normalizer()),
    ('decay', TimeDecayApplier(target_week))
])
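
The transformers come from the project's `transforms` module and are not shown here. As a rough illustration of how chained steps behave, the sketch below mimics the fit/transform protocol in plain Python so that it runs without scikit-learn. `MaxScaler`, `Decay`, and `ToyPipeline` are hypothetical stand-ins and do not reproduce `Normalizer`, `TimeDecayApplier`, or `Pipeline`.

```python
class MaxScaler:
    """Hypothetical step: divides every value by the column maximum seen in fit."""
    def fit(self, rows):
        # Fall back to 1.0 for an all-zero column to avoid dividing by zero.
        self.max_ = [max(col) or 1.0 for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [[v / m for v, m in zip(row, self.max_)] for row in rows]

class Decay:
    """Hypothetical stateless step: halves the weight for every column further back."""
    def __init__(self, rate=0.5):
        self.rate = rate

    def fit(self, rows):
        return self

    def transform(self, rows):
        n = len(rows[0])
        weights = [self.rate ** (n - 1 - i) for i in range(n)]
        return [[v * w for v, w in zip(row, weights)] for row in rows]

class ToyPipeline:
    """Runs the steps in order; each step sees the output of the previous one."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, rows):
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return rows

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows

pipe = ToyPipeline([MaxScaler(), Decay(rate=0.5)])
train_out = pipe.fit_transform([[2.0, 4.0], [4.0, 8.0]])  # learns the column maxima
test_out = pipe.transform([[1.0, 2.0]])                   # reuses the learned maxima
print(train_out)
print(test_out)
```

The point of the design is that any state a step learns (here the column maxima) comes from the data passed to `fit_transform`, while `transform` only applies it.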

Now transform train and test.

In [10]:
train_target = train.target
train = data_pipe.fit_transform(train.drop('target', axis=1))

In [11]:
test_target = test.target
# Reuse the pipeline fitted on the training data instead of refitting it on the test set
test = data_pipe.transform(test.drop('target', axis=1))

All models use normalization, time decay, and 13 weeks of data (weeks 23 to 35) to predict week 36. We ran a grid search for each model beforehand, so we already know which settings to choose.

In [12]:
logistic = LogisticRegression()

In [13]:
bayes = GaussianNB()

In [14]:
tree = DecisionTreeClassifier()

Now train all models.

In [15]:
%%time
logistic.fit(train, train_target)

CPU times: user 3.91 s, sys: 429 ms, total: 4.34 s
Wall time: 4.34 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [16]:
%%time
bayes.fit(train, train_target)

CPU times: user 1.19 s, sys: 602 ms, total: 1.79 s
Wall time: 1.79 s


GaussianNB(priors=None)

In [17]:
%%time
tree.fit(train, train_target)

CPU times: user 31.7 s, sys: 157 ms, total: 31.9 s
Wall time: 31.9 s


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
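
To compare the fit times at a glance, the wall times reported above can be expressed relative to logistic regression. This is a small arithmetic check in plain Python; the numbers are copied from the `%%time` outputs, and the dictionary keys are just labels chosen for this sketch.

```python
# Wall times in seconds, copied from the %%time outputs above.
wall_times = {
    "logistic regression": 4.34,
    "naive Bayes": 1.79,
    "decision tree": 31.9,
}

baseline = wall_times["logistic regression"]
ratios = {name: round(t / baseline, 2) for name, t in wall_times.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the logistic regression fit time")
```

These are single runs, so the ratios are only indicative; repeated timings would be needed for a precise comparison.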

In [18]:
predicted = logistic.predict(test)
print(classification_report(test_target, predicted, digits=5))

             precision    recall  f1-score   support

      False    0.50234   0.99972   0.66868    238966
       True    0.97335   0.01023   0.02025    239124

avg / total    0.73792   0.50481   0.34436    478090



In [19]:
predicted = bayes.predict(test)
print(classification_report(test_target, predicted, digits=5))

             precision    recall  f1-score   support

      False    0.59833   0.94372   0.73234    238966
       True    0.86707   0.36687   0.51559    239124

avg / total    0.73274   0.65520   0.62393    478090



In [20]:
predicted = tree.predict(test)
print(classification_report(test_target, predicted, digits=5))

             precision    recall  f1-score   support

      False    0.69523   0.76875   0.73014    238966
       True    0.74159   0.66322   0.70022    239124

avg / total    0.71842   0.71597   0.71517    478090
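
The F1 scores in the reports above are the harmonic mean of precision and recall, which is why logistic regression's near-zero recall drags its F1 score down despite the high precision. A minimal check in plain Python, using the rounded True-class values copied from the reports (so the results can differ from the reported ones in the last digit):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for the True class, copied from the reports above.
true_class = {
    "logistic regression": (0.97335, 0.01023),
    "naive Bayes": (0.86707, 0.36687),
    "decision tree": (0.74159, 0.66322),
}

f1_scores = {name: f1(p, r) for name, (p, r) in true_class.items()}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.5f}")
```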



Which features were chosen as the main predictors? For logistic regression, we select the features whose absolute coefficients are larger than the mean absolute coefficient.

In [21]:
coefs = np.abs(logistic.coef_[0])
coef_mean = coefs.mean()
coef_mean

0.17718191220339319

In [22]:
# Keep the columns whose absolute coefficient is above the mean; list them as (week, feature)
columns = test.columns[coefs > coef_mean]
sorted((week, feature) for feature, week in columns)

[(31, 'mentions'),
 (32, 'hashtags'),
 (32, 'mentions'),
 (32, 'tweets'),
 (32, 'urls'),
 (33, 'hashtags'),
 (33, 'mentions'),
 (33, 'tweets'),
 (33, 'urls'),
 (34, 'hashtags'),
 (34, 'mentions'),
 (34, 'tweets'),
 (34, 'urls'),
 (35, 'hashtags'),
 (35, 'mentions'),
 (35, 'tweets'),
 (35, 'urls')]
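
The selection rule used above can be sketched on toy numbers in plain Python. The coefficients below are hypothetical and only mirror the `(feature, week)` layout of the column labels; note that a large negative coefficient is kept as well, because the comparison uses absolute values.

```python
# Hypothetical coefficients keyed by (feature, week) column labels.
coefs = {
    ("tweets", 34): 0.50,
    ("tweets", 35): -0.90,
    ("urls", 34): 0.05,
    ("urls", 35): 0.15,
}

abs_coefs = {col: abs(value) for col, value in coefs.items()}
threshold = sum(abs_coefs.values()) / len(abs_coefs)  # mean absolute coefficient

# Keep the columns above the mean and list them as (week, feature), sorted by week.
selected = sorted(
    (week, feature)
    for (feature, week), magnitude in abs_coefs.items()
    if magnitude > threshold
)
print(selected)
```

scikit-learn's `SelectFromModel`, which this notebook imports, accepts `threshold="mean"` and applies a similar rule to a fitted estimator.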

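Gaussian naive Bayes has no feature weights or importances to inspect: it models every feature independently with a per-class mean and variance and combines those likelihoods with the class priors, so it does not single out specific features. The fitted class priors confirm that the training data is balanced.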

In [23]:
bayes.class_prior_

array([ 0.50005508,  0.49994492])
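
To make the role of the priors and the per-feature likelihoods concrete, here is a minimal sketch of how a Gaussian naive Bayes classifier scores a sample. It is written in plain Python with hypothetical parameters for two features; it illustrates the general technique and is not scikit-learn's implementation.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(sample, priors, means, variances):
    """Pick the class with the highest log prior plus summed per-feature log likelihood."""
    scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for x, m, v in zip(sample, means[cls], variances[cls]):
            score += log_gaussian(x, m, v)
        scores[cls] = score
    return max(scores, key=scores.get), scores

# Hypothetical parameters; the equal priors mimic a balanced dataset.
priors = {False: 0.5, True: 0.5}
means = {False: [0.1, 0.2], True: [0.6, 0.7]}
variances = {False: [0.04, 0.04], True: [0.04, 0.04]}

label, scores = predict([0.5, 0.8], priors, means, variances)
print(label)
```

With equal priors, the decision depends only on how well each class's per-feature Gaussians explain the sample; every feature contributes one term to the sum.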

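For the decision tree, we select the features whose importance is above the mean. Because the importances sum to 1, the mean over the 52 features is simply 1/52.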

In [24]:
imp = tree.feature_importances_
imp_mean = imp.mean()
imp_mean

0.019230769230769232

In [25]:
# Keep the columns whose importance is above the mean; list them as (week, feature)
columns = test.columns[imp > imp_mean]
sorted((week, feature) for feature, week in columns)

[(30, 'tweets'),
 (31, 'tweets'),
 (31, 'urls'),
 (32, 'urls'),
 (33, 'mentions'),
 (33, 'tweets'),
 (33, 'urls'),
 (34, 'mentions'),
 (34, 'tweets'),
 (34, 'urls'),
 (35, 'mentions'),
 (35, 'tweets'),
 (35, 'urls')]

## Conclusion

Logistic regression performs poorly. It labels almost every sample as False, so although it has the best precision for the True class (0.97), its recall (0.01) and F1 score (0.02) for that class are by far the worst. The model fits the data quickly (4.3 s).

Naive Bayes has slightly worse average precision (0.733 vs. 0.738), better recall (0.655 vs. 0.505), and an average F1 score almost twice as high (0.624 vs. 0.344). It takes less than half the time to fit (1.8 s) and performs reasonably well.

Decision tree has the best overall scores. It takes the longest to fit (31.9 s), about seven times as long as logistic regression. Its average precision (0.718) is lower than that of the other two models, but its recall (0.716) and F1 score (0.715) are the best by far.

Both logistic regression and decision tree prefer features from the most recent weeks: weeks 31 to 35 for logistic regression and weeks 30 to 35 for the decision tree.