# Model using binarized data

Finally, we train a logistic regression model on the binarized data.

In [1]:
import sys
sys.path.append('../scripts')

import numpy as np
import pandas as pd
import helpers_models as hm
from binarized_transforms import *
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

First, load the data. It has different columns from the data we worked with before, so it requires different transforms.

In [2]:
data = pd.read_pickle('../data/binarized_data.pkl').astype(int)

In [3]:
data.head()

|  | 23 | 23 | 23 | 23 | 24 | 24 | 24 | 24 | 24 | 24 | ... | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | other_hashtags | other_mentions | other_urls | tweets | hashtag_1: | hashtag_Aerotek | hashtag_BSB | mention_"" | other_hashtags | other_mentions | ... | hashtag_jobs | hashtag_shjobs | mention_"" | mention_justinbieber | other_hashtags | other_mentions | other_urls | tweets | url_http://eepurl.com/dgVR | url_http://www.accuweather.com/twtr |
| user |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| bdogg | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 000000000000111 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 000000000101010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Now set the start week and the target week. Based on the chart we made earlier of accuracy against the number of weeks used, we will use the 11 weeks before the target week.

In [4]:
target_week = data.columns.levels[0].max()

In [5]:
start_week = target_week - 11

In [6]:
print('We will be using weeks', start_week, 'to', target_week - 1, 'to train the model for the target week', target_week)

We will be using weeks 25 to 35 to train the model for the target week 36
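The week arithmetic above relies on the first level of the column MultiIndex holding the week numbers. As a minimal sketch of that idea, the toy frame below (its users, weeks and values are made up for illustration) picks the target week and selects the training weeks that come before it:

```python
import pandas as pd

# Toy frame with (week, feature) columns, mimicking the layout of the real data.
columns = pd.MultiIndex.from_product([[34, 35, 36], ['tweets', 'other_urls']])
toy = pd.DataFrame([[1, 0, 0, 1, 1, 0],
                    [0, 0, 1, 0, 0, 0]],
                   index=['user_a', 'user_b'], columns=columns)

toy_target_week = toy.columns.levels[0].max()  # latest week in the frame
toy_start_week = toy_target_week - 2           # use the 2 weeks before the target

# Keep only the feature columns with start_week <= week < target_week.
weeks = toy.columns.get_level_values(0)
features = toy.loc[:, (weeks >= toy_start_week) & (weeks < toy_target_week)]
training_weeks = sorted(set(features.columns.get_level_values(0)))
print(training_weeks)
```

The upper bound is exclusive, which matches the printed message: the target week itself is predicted, not used as a feature.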


Before building the pipeline, let's transform and pre-process the data. These transforms also need to modify the target column, which pipeline steps cannot do because they only transform the features, so we apply them outside the pipeline.

In [7]:
data = TargetMaker(target_week=target_week).transform(data)
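The source of `TargetMaker` is not shown here, so the following is only a guess at what such a step might do. It assumes the target is "did the user tweet at all in the target week" and that the target week's columns are removed from the features. The function `make_target` and the toy data are hypothetical, and unlike `TargetMaker` this sketch returns the features and the target separately instead of adding a `target` column.

```python
import pandas as pd

def make_target(frame, target_week):
    """Hypothetical stand-in for TargetMaker; the real class may differ."""
    # Assumption: a user counts as active if they tweeted in the target week.
    target = frame[(target_week, 'tweets')] > 0
    # Drop the target week's columns so they cannot leak into the features.
    weeks = frame.columns.get_level_values(0)
    features = frame.loc[:, weeks != target_week]
    return features, target

# Toy frame with (week, feature) columns.
columns = pd.MultiIndex.from_product([[35, 36], ['tweets', 'other_urls']])
toy = pd.DataFrame([[0, 1, 2, 0],
                    [1, 0, 0, 0]],
                   index=['user_a', 'user_b'], columns=columns)

toy_features, toy_target = make_target(toy, target_week=36)
print(toy_target)
```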

Split the data into train and test.

In [8]:
train, test = train_test_split(data)
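With its defaults, `train_test_split` holds out 25% of the rows and shuffles them differently on every run. If a repeatable split with the same class ratio in both parts is wanted, `random_state` and `stratify` can be passed. The sketch below shows this on a made-up imbalanced frame, not on the notebook's data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced frame: 90 negatives and 10 positives.
toy = pd.DataFrame({'feature': np.arange(100),
                    'target': [False] * 90 + [True] * 10})

# stratify keeps the positive rate (almost) equal in both parts;
# random_state makes the split repeatable between runs.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy['target'], random_state=0)

print(toy_train['target'].mean(), toy_test['target'].mean())
```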

Then balance the training data.

In [9]:
train = ClassBalancer().fit_transform(train, train[['target']].values.ravel())
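How `ClassBalancer` balances the classes is not shown in this notebook. One common approach is random undersampling of the majority class, sketched below. The function `undersample` and the toy frame are hypothetical and may not match what `ClassBalancer` does.

```python
import pandas as pd

def undersample(frame, target_col='target', random_state=0):
    """Hypothetical balancer: downsample every class to the rarest class size."""
    smallest = frame[target_col].value_counts().min()
    parts = [group.sample(n=smallest, random_state=random_state)
             for _, group in frame.groupby(target_col)]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy imbalanced frame: 90 negatives and 10 positives.
toy = pd.DataFrame({'feature': range(100),
                    'target': [False] * 90 + [True] * 10})

balanced = undersample(toy)
print(balanced['target'].value_counts())
```

Only the training part is balanced. The test part keeps the original class ratio, so the evaluation reflects the distribution the model would see in practice.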

In [10]:
pipeline = Pipeline([
    ('limiter', WeeksLimiter(start_week, target_week)),
    ('normal', Normalizer()),
    ('decay', TimeDecayApplier(target_week)),
    ('logreg', LogisticRegressionCV(max_iter=200, n_jobs=-1, verbose=2))
])
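The first three pipeline steps are custom transformers from `binarized_transforms`, whose code is not shown here. As a rough sketch of how such steps can be written to fit into a scikit-learn `Pipeline`, the two classes below are hypothetical stand-ins: `ToyWeeksLimiter` keeps the columns of the training weeks, and `ToyTimeDecay` assumes a simple `1 / (target_week - week)` weighting. The real decay function may well be different.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ToyWeeksLimiter(BaseEstimator, TransformerMixin):
    """Hypothetical: keep columns with start_week <= week < target_week."""

    def __init__(self, start_week, target_week):
        self.start_week = start_week
        self.target_week = target_week

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        weeks = X.columns.get_level_values(0)
        return X.loc[:, (weeks >= self.start_week) & (weeks < self.target_week)]


class ToyTimeDecay(BaseEstimator, TransformerMixin):
    """Hypothetical: weight each column by 1 / (target_week - week)."""

    def __init__(self, target_week):
        self.target_week = target_week

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        weeks = X.columns.get_level_values(0).to_numpy()
        # Recent weeks have a small distance to the target, so a large weight.
        return X.to_numpy() / (self.target_week - weeks)


# Toy (week, feature) frame; the last two users are the active ones.
cols = pd.MultiIndex.from_product([[33, 34, 35], ['tweets']])
toy_X = pd.DataFrame([[1, 0, 0], [0, 0, 0], [0, 1, 2], [1, 2, 4]], columns=cols)
toy_y = [False, False, True, True]

toy_pipeline = Pipeline([
    ('limiter', ToyWeeksLimiter(start_week=34, target_week=36)),
    ('decay', ToyTimeDecay(target_week=36)),
    ('logreg', LogisticRegression()),
])
toy_pipeline.fit(toy_X, toy_y)
```

Storing the constructor arguments unchanged under the same attribute names is what lets `BaseEstimator` expose them through `get_params()`, in the form `limiter__start_week`.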

In [11]:
pipeline.get_params()

{'decay': TimeDecayApplier(target_week=36),
 'decay__target_week': 36,
 'limiter': WeeksLimiter(start_week=25, target_week=36),
 'limiter__start_week': 25,
 'limiter__target_week': 36,
 'logreg': LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
            fit_intercept=True, intercept_scaling=1.0, max_iter=200,
            multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None,
            refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=2),
 'logreg__Cs': 10,
 'logreg__class_weight': None,
 'logreg__cv': None,
 'logreg__dual': False,
 'logreg__fit_intercept': True,
 'logreg__intercept_scaling': 1.0,
 'logreg__max_iter': 200,
 'logreg__multi_class': 'ovr',
 'logreg__n_jobs': -1,
 'logreg__penalty': 'l2',
 'logreg__random_state': None,
 'logreg__refit': True,
 'logreg__scoring': None,
 'logreg__solver': 'lbfgs',
 'logreg__tol': 0.0001,
 'logreg__verbose': 2,
 'normal': Normalizer(),
 'steps': [('limiter', WeeksLimiter(start_week=25, target_week=36

In [12]:
%%time
pipeline.fit(train.drop('target', axis=1), train[['target']].values.ravel())

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  5.2min finished


CPU times: user 2min 4s, sys: 17.3 s, total: 2min 22s
Wall time: 6min 33s


Pipeline(steps=[('limiter', WeeksLimiter(start_week=25, target_week=36)), ('normal', Normalizer()), ('decay', TimeDecayApplier(target_week=36)), ('logreg', LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=200,
           multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=2))])

Now evaluate the trained model on the held-out test set.

In [13]:
%%time
predicted = pipeline.predict(test.drop('target', axis=1))
report = classification_report(test[['target']].values.ravel(), predicted)
print(report)

             precision    recall  f1-score   support

      False       0.91      0.95      0.92    249523
       True       0.36      0.24      0.29     32477

avg / total       0.84      0.86      0.85    282000

CPU times: user 26.3 s, sys: 750 ms, total: 27 s
Wall time: 26.8 s
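Because the test set is imbalanced, it helps to compare the report against a trivial baseline that always predicts `False`. The sketch below uses only the support and recall values printed above. Since those recalls are rounded to two decimals, the reconstructed accuracy is approximate.

```python
# Support and per-class recall as printed in the report above.
support = {'False': 249523, 'True': 32477}
recall = {'False': 0.95, 'True': 0.24}

total = sum(support.values())

# Accuracy of always predicting the majority class (False).
baseline_accuracy = support['False'] / total

# Overall accuracy equals the support-weighted recall; the inputs are
# rounded, so this is only an approximation.
approx_accuracy = sum(recall[c] * support[c] for c in support) / total

print(f'majority baseline: {baseline_accuracy:.3f}')
print(f'model accuracy (approx.): {approx_accuracy:.3f}')
```

The baseline comes out slightly above the model's overall accuracy, so the model's value lies in the `True` class it does recover (precision 0.36, recall 0.24), not in overall accuracy.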
