# Performance comparison

In this notebook, we train logistic regression, naive Bayes, and decision tree classifiers on the same data. We compare them on their final F1 score and on the time it takes to train each model. We vary the number of weeks of history and the preprocessing steps, and observe how the scores change.

In [1]:
import sys
from timeit import default_timer as timer
sys.path.append('../scripts')

import numpy as np
import pandas as pd
import helpers_models as hm
from transforms import *
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

## Data

Let's prepare the data sample we will train and test on. We want to use the same data for all models.

In [2]:
data = pd.read_hdf('../data/pivot_numbers_only.h5', 'data')

In [3]:
target_week = data['tweets'].columns.max()
target_week

36

In [4]:
data = hm.make_target(data, target_week)

Before training, let's balance the dataset.

In [5]:
data = hm.balance_data(data)
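The implementation of `hm.balance_data` is not shown in this notebook. As a rough illustration only, one common way to balance a binary target is to downsample every class to the size of the rarest one. The sketch below assumes that approach; the function name `downsample_to_balance` and the toy frame are hypothetical and are not taken from `helpers_models`.

```python
import pandas as pd


def downsample_to_balance(df, target_col="target", random_state=0):
    """Return a copy of df in which every class has as many rows as the rarest class."""
    smallest = df[target_col].value_counts().min()
    # Sample `smallest` rows from each class without replacement.
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(target_col)
    ]
    # Concatenate and shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Toy data (hypothetical): 2 positive rows and 8 negative rows.
toy = pd.DataFrame({"tweets": range(10), "target": [1] * 2 + [0] * 8})
balanced = downsample_to_balance(toy)
print(balanced["target"].value_counts().to_dict())
```

Downsampling discards rows from the larger class. Whether the real helper does this or something else (for example oversampling) cannot be told from the notebook alone.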

In [6]:
data.shape

(1912360, 53)

## Evaluation function

Now we want to define a function that lets us configure the parameters easily and train all 3 classifiers. The function should accept a dataset and a set of parameters, and return the F1 score of each classifier. It should also report the time it takes to train each classifier.
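For reference, the F1 score is the harmonic mean of precision and recall. With its default settings, scikit-learn's `f1_score` reports this value for the positive class of a binary target. The helper below is a minimal sketch of the arithmetic from raw counts; the name `f1_from_counts` and the toy counts are illustrative assumptions, not part of the notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    if tp == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy counts (hypothetical): precision = 6/8, recall = 6/10.
score = f1_from_counts(tp=6, fp=2, fn=4)
print(round(score, 4))
```

Because the score depends only on the positive class, a classifier that almost never predicts that class correctly gets an F1 score close to 0, however well it does on the negative class.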

In [7]:
def evaluate(data, first_week, normalize=False, decay=False):
    # split to train and test
    train, test = train_test_split(data)
    
    # pipeline
    pipe = Pipeline([
        ('weeks', WeeksLimiter(first_week, target_week)),
        ('normal', Normalizer(skip=not normalize)),
        ('decay', TimeDecayApplier(target_week, skip=not decay))
    ])
    
    # split data
    train_target = train.target
    test_target = test.target
    train = train.drop('target', axis=1)
    test = test.drop('target', axis=1)
    
    print('Training on', train.shape[0], 'samples')
    
    # apply pipeline
    pipe.fit(train, train_target)
    train = pipe.transform(train)
    test = pipe.transform(test)
    
    # classifiers
    logistic = LogisticRegression()
    bayes = GaussianNB()
    tree = DecisionTreeClassifier()
    
    # train
    print('Training logistic regression')
    start = timer()
    logistic.fit(train, train_target)
    end = timer()
    logistic_time = end - start
    
    print('Training naive bayes')
    start = timer()
    bayes.fit(train, train_target)
    end = timer()
    bayes_time = end - start
    
    print('Training decision tree')
    start = timer()
    tree.fit(train, train_target)
    end = timer()
    tree_time = end - start
    
    # predict
    logistic_score = f1_score(test_target, logistic.predict(test))
    bayes_score = f1_score(test_target, bayes.predict(test))
    tree_score = f1_score(test_target, tree.predict(test))
    
    # return
    return logistic_time, logistic_score, bayes_time, bayes_score, tree_time, tree_score
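The three start/fit/end blocks in `evaluate` follow the same pattern, so they could be factored into a small helper. The following is one possible sketch rather than part of the original notebook; the name `timed` is hypothetical, and the workload is a stand-in for a call to `fit`.

```python
from timeit import default_timer as timer


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    return result, timer() - start


# Stand-in workload; inside evaluate() this would be a classifier's fit method.
total, seconds = timed(sum, range(1_000_000))
print(f"sum computed in {seconds:.4f} s")
```

Inside `evaluate`, a call such as `_, logistic_time = timed(logistic.fit, train, train_target)` would then replace each hand-written timing block.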

## Previous conclusion

This was the conclusion before writing the `evaluate` function.

With 8 weeks' worth of untransformed data, logistic regression performs the best. Naive Bayes is much quicker to fit but performs poorly. The decision tree scores slightly worse than logistic regression and takes considerably longer to fit.

The decision tree performs the best on the transformed data. It takes very long to train but gives the best scores. Naive Bayes performs better than logistic regression and fits the data very quickly. Logistic regression suffers when predicting the active users. It takes a very short time to fit compared to the decision tree but performs very poorly.

Transforming the data helped naive Bayes and the decision tree, but it badly hurt logistic regression.

Logistic regression performs the best on untransformed data. Decision tree performs the best on transformed data, even better than the previous logistic regression. However, it takes very long to train.

## Experiments

Now we can easily evaluate multiple configurations. Let's play!

In [8]:
def evaluate_print(weeks, normalize, decay):
    logistic_time, logistic_score, bayes_time, bayes_score, tree_time, tree_score = evaluate(data, target_week - weeks, normalize, decay)
    
    results = pd.DataFrame()
    results['time to train in seconds'] = [logistic_time, bayes_time, tree_time]
    results['f1 score'] = [logistic_score, bayes_score, tree_score]
    results.index = ['logistic regression', 'naive bayes', 'decision tree']
    
    return results
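The cells below call `evaluate_print` once per configuration by hand. As an alternative, the configurations could be generated and collected in a loop. The sketch below shows the idea under stated assumptions: `sweep` and `fake_run` are hypothetical names, and `fake_run` is a stand-in that returns a placeholder rather than training anything.

```python
from itertools import product


def sweep(run, weeks_options, flag_options=(False, True)):
    """Call run(weeks, normalize, decay) for every combination and collect records."""
    records = []
    for weeks, normalize, decay in product(weeks_options, flag_options, flag_options):
        record = {"weeks": weeks, "normalize": normalize, "decay": decay}
        record.update(run(weeks, normalize, decay))
        records.append(record)
    return records


def fake_run(weeks, normalize, decay):
    # Hypothetical stand-in for a real evaluation; the value is a placeholder,
    # not a measurement.
    return {"score": None}


records = sweep(fake_run, weeks_options=(3, 8, 12))
print(len(records))
```

Passing the resulting list of dictionaries to `pd.DataFrame` would give one table covering all runs. Note that the full product also includes the decay-without-normalization combination, which the cells below do not run.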

### 3 weeks

In [9]:
evaluate_print(3, normalize=False, decay=False)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 5.507904 | 0.716468 |
| naive bayes | 0.58438 | 0.315991 |
| decision tree | 8.720838 | 0.719166 |


In [10]:
evaluate_print(3, normalize=True, decay=False)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 2.272803 | 0.66647 |
| naive bayes | 0.521504 | 0.585638 |
| decision tree | 7.365648 | 0.713384 |


In [11]:
evaluate_print(3, normalize=True, decay=True)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 2.416988 | 0.666429 |
| naive bayes | 0.5183 | 0.587867 |
| decision tree | 7.481723 | 0.713577 |


### 8 weeks

In [12]:
evaluate_print(8, normalize=False, decay=False)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 12.910996 | 0.717402 |
| naive bayes | 1.268664 | 0.281066 |
| decision tree | 26.483935 | 0.713895 |


In [13]:
evaluate_print(8, normalize=True, decay=False)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 3.62247 | 0.005953 |
| naive bayes | 1.215836 | 0.559784 |
| decision tree | 22.27663 | 0.70832 |


In [14]:
evaluate_print(8, normalize=True, decay=True)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 3.905712 | 0.666553 |
| naive bayes | 1.17369 | 0.552587 |
| decision tree | 24.014808 | 0.70327 |


### 12 weeks

In [15]:
evaluate_print(12, normalize=False, decay=False)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 14.289392 | 0.71882 |
| naive bayes | 1.797675 | 0.273473 |
| decision tree | 37.939056 | 0.764517 |


In [16]:
evaluate_print(12, normalize=False, decay=False)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 14.583505 | 0.718946 |
| naive bayes | 1.768697 | 0.273817 |
| decision tree | 42.113433 | 0.76542 |


In [17]:
evaluate_print(12, normalize=True, decay=False)

Training on 1434270 samples
Training logistic regression
Training naive bayes
Training decision tree


| classifier | time to train in seconds | f1 score |
| --- | --- | --- |
| logistic regression | 4.47429 | 0.044247 |
| naive bayes | 1.742707 | 0.55022 |
| decision tree | 33.730065 | 0.705697 |
