# Naive Bayes classifier

In this notebook, we train a Gaussian Naive Bayes classifier on a 2,000,000-row sample of the data and use grid search to tune the parameters of the preprocessing pipeline.

In [1]:
import sys
sys.path.append('../scripts')

import numpy as np
import pandas as pd
import helpers_models as hm
from transforms import WeeksLimiter, Normalizer, TimeDecayApplier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

First, load a sample of the data.

In [2]:
data = pd.read_hdf('../data/pivot_numbers_only.h5', 'data').sample(2000000)

In [3]:
first_week, last_week = data['tweets'].columns.min(), data['tweets'].columns.max()

In [4]:
data = hm.make_target(data, last_week)

In [5]:
data.shape

(2000000, 53)

In [6]:
first_week

23

In [7]:
last_week

36

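Split the data into training and test sets. With its default settings, `train_test_split` shuffles the rows and holds out 25% of them for testing.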

In [8]:
train, test = train_test_split(data)
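
Because `train_test_split` is called with no extra arguments here, it uses its defaults. The sketch below uses a small synthetic frame (the column names and sizes are made up for illustration) to show the default split sizes, and how the optional `stratify` and `random_state` arguments could be used to preserve the class balance and make the split repeatable. It is an illustration only, not what this notebook ran.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, exactly 10% positives.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'feature': rng.rand(1000),
    'target': np.arange(1000) % 10 == 0,
})

# Default behaviour: shuffle, then hold out 25% of the rows.
toy_train, toy_test = train_test_split(toy)

# Optional: stratify on the target and fix the seed for repeatability.
strat_train, strat_test = train_test_split(
    toy, stratify=toy['target'], random_state=0)

print(len(toy_train), len(toy_test))
print(strat_test['target'].mean())
```

Stratifying can matter when one class is much rarer than the other, since an unstratified split may leave the two sets with slightly different class ratios.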

Next, build a pipeline that chains the three preprocessing steps with the classifier.

In [9]:
pipeline = Pipeline([
    ('limiter', WeeksLimiter(first_week=25, target_week=last_week)),
    ('normal', Normalizer()),
    ('decay', TimeDecayApplier(target_week=last_week)),
    ('bayes', GaussianNB())
])
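
The `transforms` module is not shown in this notebook, so the following is only a guess at the general shape of a pipeline step with a `skip` switch, such as `Normalizer(skip=False)`. The class name `SkippableScaler` and its scaling rule (divide each row by its sum) are hypothetical; the real `Normalizer` may behave differently.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SkippableScaler(BaseEstimator, TransformerMixin):
    """Hypothetical step: scales each row to sum to 1 unless skip=True."""

    def __init__(self, skip=False):
        self.skip = skip

    def fit(self, X, y=None):
        # Nothing to learn; present so the step works inside a Pipeline.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.skip:
            # Pass the data through unchanged.
            return X
        row_sums = X.sum(axis=1, keepdims=True)
        # Avoid division by zero for all-zero rows.
        row_sums[row_sums == 0] = 1.0
        return X / row_sums


X = np.array([[1.0, 3.0], [0.0, 0.0]])
scaled = SkippableScaler().fit_transform(X)
skipped = SkippableScaler(skip=True).fit_transform(X)
print(scaled)
print(skipped)
```

Because `skip` is a constructor argument stored under the same name, `BaseEstimator.get_params` exposes it, which is what lets a grid search address it through the pipeline as `<step name>__skip`.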

Now list the pipeline's parameters to see which ones grid search can tune.

In [10]:
pipeline.get_params()

{'bayes': GaussianNB(priors=None),
 'bayes__priors': None,
 'decay': TimeDecayApplier(skip=False, target_week=36),
 'decay__skip': False,
 'decay__target_week': 36,
 'limiter': WeeksLimiter(first_week=25, target_week=36),
 'limiter__first_week': 25,
 'limiter__target_week': 36,
 'normal': Normalizer(skip=False),
 'normal__skip': False,
 'steps': [('limiter', WeeksLimiter(first_week=25, target_week=36)),
  ('normal', Normalizer(skip=False)),
  ('decay', TimeDecayApplier(skip=False, target_week=36)),
  ('bayes', GaussianNB(priors=None))]}

In [11]:
params = {
    'limiter__first_week': [23, 25, 30, 32],
    'normal__skip': [False, True],
    'decay__skip': [False, True]
}
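
The size of this grid can be checked before launching the search. The sketch below uses scikit-learn's `ParameterGrid` to count the candidate combinations; the fold count of 3 is taken from the fit log of this run (newer scikit-learn releases default to 5 folds, so the total would differ there).

```python
from sklearn.model_selection import ParameterGrid

params = {
    'limiter__first_week': [23, 25, 30, 32],
    'normal__skip': [False, True],
    'decay__skip': [False, True],
}

grid = ParameterGrid(params)
n_candidates = len(grid)   # 4 * 2 * 2 combinations
n_folds = 3                # fold count reported in this run's fit log
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

Every added value multiplies the number of fits, so it helps to estimate the total before adding more options to a grid on a dataset of this size.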

In [12]:
model = GridSearchCV(pipeline, params, n_jobs=-1, pre_dispatch='2*n_jobs', verbose=2, error_score=0)

The next step is to fit the model. This runs the grid search and, because `refit=True`, retrains the best parameter combination on the full training set.

In [13]:
%%time
model.fit(train.drop('target', axis=1, level=0), train['target'])

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] decay__skip=False, limiter__first_week=23, normal__skip=False ...
[CV] decay__skip=False, limiter__first_week=23, normal__skip=False ...
[CV] decay__skip=False, limiter__first_week=23, normal__skip=False ...
[CV] decay__skip=False, limiter__first_week=23, normal__skip=True ....
[CV] decay__skip=False, limiter__first_week=23, normal__skip=True ....
[CV] decay__skip=False, limiter__first_week=23, normal__skip=True ....
[CV]  decay__skip=False, limiter__first_week=23, normal__skip=False, total=  10.8s
[CV] decay__skip=False, limiter__first_week=25, normal__skip=False ...
[CV]  decay__skip=False, limiter__first_week=23, normal__skip=False, total=  14.5s
[CV] decay__skip=False, limiter__first_week=25, normal__skip=False ...
[CV]  decay__skip=False, limiter__first_week=23, normal__skip=False, total=  19.7s
[CV] decay__skip=False, limiter__first_week=25, normal__skip=False ...
[CV] decay__skip=False, limiter__first_week=25, nor

[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  5.0min


[CV] decay__skip=True, limiter__first_week=25, normal__skip=False ....
[CV]  decay__skip=True, limiter__first_week=23, normal__skip=True, total=  21.7s
[CV]  decay__skip=True, limiter__first_week=23, normal__skip=False, total=  26.8s
[CV]  decay__skip=True, limiter__first_week=23, normal__skip=True, total=  18.6s
[CV] decay__skip=True, limiter__first_week=25, normal__skip=False ....
[CV]  decay__skip=True, limiter__first_week=23, normal__skip=True, total=  14.6s
[CV] decay__skip=True, limiter__first_week=25, normal__skip=True .....
[CV] decay__skip=True, limiter__first_week=25, normal__skip=True .....
[CV]  decay__skip=True, limiter__first_week=25, normal__skip=False, total=  14.3s
[CV] decay__skip=True, limiter__first_week=25, normal__skip=True .....
[CV]  decay__skip=True, limiter__first_week=25, normal__skip=False, total=  12.7s
[CV] decay__skip=True, limiter__first_week=30, normal__skip=False ....
[CV]  decay__skip=True, limiter__first_week=25, normal__skip=False, total=  12.5s
[CV

[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  6.4min finished


CPU times: user 2min 42s, sys: 31.1 s, total: 3min 13s
Wall time: 6min 28s


GridSearchCV(cv=None, error_score=0,
       estimator=Pipeline(steps=[('limiter', WeeksLimiter(first_week=25, target_week=36)), ('normal', Normalizer(skip=False)), ('decay', TimeDecayApplier(skip=False, target_week=36)), ('bayes', GaussianNB(priors=None))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'limiter__first_week': [23, 25, 30, 32], 'normal__skip': [False, True], 'decay__skip': [False, True]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=2)

Finally, use the fitted model to predict labels for the held-out test data and build a classification report from the predictions.

In [14]:
%%time
predicted = model.predict(test.drop('target', axis=1, level=0))
report = classification_report(test['target'], predicted, digits=5)

CPU times: user 1.03 s, sys: 742 ms, total: 1.77 s
Wall time: 1.7 s


In [15]:
print(report)

             precision    recall  f1-score   support

      False    0.92574   0.95429   0.93980    441936
       True    0.54539   0.41737   0.47287     58064

avg / total    0.88157   0.89194   0.88558    500000
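
To make the report easier to read, the sketch below recomputes a few of its figures from the per-class rows. It uses only the numbers printed above; the helper names `f1` and `weighted` are introduced here for illustration. It also compares the support-weighted recall, which equals overall accuracy, with the accuracy of always predicting the majority class `False`.

```python
# Values copied from the classification report above.
precision = {'False': 0.92574, 'True': 0.54539}
recall = {'False': 0.95429, 'True': 0.41737}
support = {'False': 441936, 'True': 58064}
total = sum(support.values())


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def weighted(metric):
    """Support-weighted average, as in the 'avg / total' row."""
    return sum(metric[k] * support[k] for k in support) / total


f1_true = f1(precision['True'], recall['True'])
weighted_precision = weighted(precision)
weighted_recall = weighted(recall)            # equals overall accuracy
majority_baseline = support['False'] / total  # always predict False

print(round(f1_true, 5), round(weighted_precision, 5))
print(round(weighted_recall, 5), round(majority_baseline, 5))
```

The overall accuracy of roughly 0.892 is only slightly above the majority-class baseline of roughly 0.884, so the `True` row is the more informative part of the report for this imbalanced target.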



Also check the best parameters that grid search found.

In [16]:
model.best_params_

{'decay__skip': False, 'limiter__first_week': 32, 'normal__skip': False}
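
Beyond `best_params_`, a fitted `GridSearchCV` also keeps the score of every candidate in `cv_results_`, which shows how close the runners-up were. The sketch below demonstrates this on a small synthetic problem; the data, the `StandardScaler` step, and the `toy_` names are stand-ins for illustration and not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic classification problem.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

toy_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('bayes', GaussianNB()),
])
toy_params = {'scale__with_mean': [True, False]}

search = GridSearchCV(toy_pipeline, toy_params, cv=3)
search.fit(X, y)

# One row per candidate, with its mean cross-validated score and rank.
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])
print(search.best_params_)
```

Applied to the `model` fitted above, the same `pd.DataFrame(model.cv_results_)` call would list all 16 candidates with their scores.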