# Grid search

In this notebook we use a grid search to find the best parameters for the data preprocessing steps and for the logistic regression classifier.

In [1]:
import sys
sys.path.append('../scripts')

import numpy as np
import pandas as pd
import helpers_models as hm
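# explicit imports instead of a wildcard, so it is clear where each transformer comes from
from binarized_transforms import (TargetMaker, ClassBalancer, WeeksLimiter,
                                  Normalizer, TimeDecayApplier)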
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

In [2]:
data = pd.read_pickle('../data/binarized_data.pkl')

The grid search takes a long time to run, so we only use a small random subset of the dataset.

In [3]:
data = data.sample(1000)
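Because `sample` draws a different subset on every run, the results of this notebook are not reproducible as written. A minimal sketch of a seeded sample follows; the toy frame and the seed value 42 are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Toy frame standing in for the real binarized data (hypothetical values).
toy = pd.DataFrame({"value": range(10_000)})

# A fixed random_state makes the same 1000 rows come back on every run.
subset_a = toy.sample(1000, random_state=42)
subset_b = toy.sample(1000, random_state=42)

print(subset_a.index.equals(subset_b.index))  # True
```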

Set the target week and the starting week.

In [4]:
target_week = data.columns.levels[0].max()
target_week

36

In [5]:
start_week = 25

Create the target column and split the data into train and test sets.

In [6]:
data = TargetMaker(target_week=target_week).transform(data)

In [7]:
train, test = train_test_split(data, test_size=0.2)
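If the target classes are imbalanced, a plain random split of only 1000 rows can leave very few positives in the test set. The sketch below shows the `stratify` and `random_state` arguments of `train_test_split` on hypothetical labels (90 negatives, 10 positives); whether stratification is appropriate for the real data is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions equal in both parts,
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_train.mean(), y_test.mean())
```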

Balance the classes in the training data. The test set is left untouched so that it keeps the original class distribution.

In [8]:
train = ClassBalancer().fit_transform(train, train[['target']].values.ravel())
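The implementation of `ClassBalancer` is not shown in this notebook. One common way to balance classes is to undersample the majority class; the sketch below illustrates that idea with a hypothetical `undersample` helper on toy data and is not necessarily what `ClassBalancer` does.

```python
import pandas as pd


def undersample(frame, target_col="target", random_state=0):
    """Hypothetical helper: downsample every class to the size of the rarest class."""
    smallest = frame[target_col].value_counts().min()
    parts = [
        group.sample(smallest, random_state=random_state)
        for _, group in frame.groupby(target_col)
    ]
    # Shuffle so that the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


toy = pd.DataFrame({"x": range(100), "target": [0] * 90 + [1] * 10})
balanced = undersample(toy)
print(balanced["target"].value_counts().to_dict())
```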

Now make the pipeline. The parameters of this pipeline will be optimized.

In [9]:
pipeline = Pipeline([
    ('limiter', WeeksLimiter(start_week, target_week)),
    ('normal', Normalizer()),
    ('decay', TimeDecayApplier(target_week)),
    ('pca', PCA(0.95)),
    ('logreg', LogisticRegression(verbose=2, solver='sag'))
])
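For a pipeline step to be tunable by grid search, its constructor arguments have to be visible through `get_params` and settable through `set_params`. The custom transformers used above live in `binarized_transforms` and are not shown here, so the sketch below uses a hypothetical `SkippableCenterer` to illustrate how a `skip` flag can be exposed by inheriting from `BaseEstimator`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class SkippableCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts column means unless skip=True."""

    def __init__(self, skip=False):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.skip = skip

    def fit(self, X, y=None):
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X if self.skip else X - self.means_


toy_pipeline = Pipeline([
    ("center", SkippableCenterer()),
    ("logreg", LogisticRegression()),
])

# Nested parameters are addressed as <step name>__<parameter name>.
toy_pipeline.set_params(center__skip=True)
print(toy_pipeline.get_params()["center__skip"])  # True
```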

In [10]:
pipeline.get_params()

{'decay': TimeDecayApplier(ignore_binarized_columns=True, skip=False, target_week=36,
          verbose=False),
 'decay__ignore_binarized_columns': True,
 'decay__skip': False,
 'decay__target_week': 36,
 'decay__verbose': False,
 'limiter': WeeksLimiter(start_week=25, target_week=36),
 'limiter__start_week': 25,
 'limiter__target_week': 36,
 'logreg': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=None, solver='sag', tol=0.0001,
           verbose=2, warm_start=False),
 'logreg__C': 1.0,
 'logreg__class_weight': None,
 'logreg__dual': False,
 'logreg__fit_intercept': True,
 'logreg__intercept_scaling': 1,
 'logreg__max_iter': 100,
 'logreg__multi_class': 'ovr',
 'logreg__n_jobs': 1,
 'logreg__penalty': 'l2',
 'logreg__random_state': None,
 'logreg__solver': 'sag',
 'logreg__tol': 0.0001,
 'logreg__verbose': 2,
 'logreg__warm_start': False,
 'no

As we can see, there are many parameters that can be optimized using grid search. Each one is addressed as `<step name>__<parameter name>`. Let's define the values that the grid search should try.

In [11]:
params = {
    # weeks limiter params
    'limiter__start_week': [23, 25, 30, 32],
    
    # normalizer params
    'normal__skip': [False, True],
    'normal__ignore_binarized_columns': [False, True],
    
    # time decay applier params
    'decay__skip': [False, True],
    'decay__ignore_binarized_columns': [False, True],
    
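    # PCA params: a float in (0, 1) is the fraction of variance to keep.
    # Note: recent scikit-learn versions reject the float 1.0 (None keeps all
    # components); with error_score=0 such failed fits are scored 0 instead of raising.
    'pca__n_components': [0.95, 1.0],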
    
    # logistic regression params
    'logreg__C': [0.2, 1.0]
}
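The cost of a grid search grows multiplicatively with every added option. As a quick check, the number of candidates for the grid above can be counted with the standard library alone; the fold count of 3 is the one reported in the fit output of this notebook.

```python
from itertools import product

# Same grid as in the notebook.
params = {
    'limiter__start_week': [23, 25, 30, 32],
    'normal__skip': [False, True],
    'normal__ignore_binarized_columns': [False, True],
    'decay__skip': [False, True],
    'decay__ignore_binarized_columns': [False, True],
    'pca__n_components': [0.95, 1.0],
    'logreg__C': [0.2, 1.0],
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*params.values()))
n_folds = 3
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 256 768
```

Dropping a single two-valued option halves the number of fits, which is often the cheapest way to shorten the search.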

Now use the grid search to optimize the params and train the best model.

In [None]:
model = GridSearchCV(pipeline, params, n_jobs=-1, pre_dispatch='2*n_jobs', verbose=2, error_score=0)

In [None]:
model.fit(train.drop('target', axis=1), train[['target']].values.ravel())

Fitting 3 folds for each of 256 candidates, totalling 768 fits
[CV] decay__ignore_binarized_columns=False, decay__skip=False, limiter__start_week=23, logreg__C=0.2, normal__ignore_binarized_columns=False, normal__skip=False, pca__n_components=0.95 
[CV] decay__ignore_binarized_columns=False, decay__skip=False, limiter__start_week=23, logreg__C=0.2, normal__ignore_binarized_columns=False, normal__skip=False, pca__n_components=0.95 
[CV] decay__ignore_binarized_columns=False, decay__skip=False, limiter__start_week=23, logreg__C=0.2, normal__ignore_binarized_columns=False, normal__skip=False, pca__n_components=0.95 
[CV] decay__ignore_binarized_columns=False, decay__skip=False, limiter__start_week=23, logreg__C=0.2, normal__ignore_binarized_columns=False, normal__skip=False, pca__n_components=1.0 
[CV] decay__ignore_binarized_columns=False, decay__skip=False, limiter__start_week=23, logreg__C=0.2, normal__ignore_binarized_columns=False, normal__skip=False, pca__n_components=1.0 
[CV] deca

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  data[(week, column)] = data[(week, column)] / time_decay


[CV] decay__ignore_binarized_columns=False, decay__skip=False, limiter__start_week=23, logreg__C=0.2, normal__ignore_binarized_columns=False, normal__skip=True, pca__n_components=0.95 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  data[(week, column)] = data[(week, column)].div(self.column_sums[(week, column)]).fillna(0)
A value is tryi
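The repeated pandas warnings above are raised by the lines shown in the messages, which assign columns into a DataFrame that pandas considers a slice of another one, presumably the cross-validation subsets handed to the transformers. A minimal sketch of one way to avoid this is to work on an explicit copy inside `transform`; the `apply_decay` function below is a hypothetical stand-in, not the code of `TimeDecayApplier`.

```python
import pandas as pd


def apply_decay(data, time_decay):
    """Divide every column by time_decay without mutating the caller's frame."""
    data = data.copy()  # explicit copy: assignments below touch only this frame
    for column in data.columns:
        data[column] = data[column] / time_decay
    return data


toy = pd.DataFrame({"a": [2.0, 4.0], "b": [6.0, 8.0]})
subset = toy[toy["a"] > 0]  # a filtered slice of the original frame
decayed = apply_decay(subset, 2.0)

print(decayed["a"].tolist())  # [1.0, 2.0]
print(toy["a"].tolist())      # [2.0, 4.0], the original is unchanged
```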

Now that the grid search has finished, let's look at the best parameters it found.

In [None]:
model.best_params_

The `cv_results_` attribute holds every parameter combination that was tried, together with its cross-validation scores.

In [None]:
cv_results = pd.DataFrame(model.cv_results_)

In [None]:
cv_results
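With 256 candidates the full results table is hard to read. A small sketch of how it can be narrowed down, using a toy search on synthetic data (the dataset and the one-parameter grid are assumptions for illustration): sort by `rank_test_score` and keep only the most informative columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
search = GridSearchCV(LogisticRegression(), {"C": [0.2, 1.0]}, cv=3)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]

# Best candidates first; rank 1 is the combination with the highest mean score.
ranked = results.sort_values("rank_test_score")[columns]
print(ranked)
```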

Finally, evaluate the best model on the held-out test set.

In [None]:
%%time
predicted = model.predict(test.drop('target', axis=1))
report = classification_report(test[['target']].values.ravel(), predicted)

In [None]:
print(report)
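Calling `predict` on a fitted `GridSearchCV` works because, with the default `refit=True`, the best parameter combination is refit on the whole training set and stored as `best_estimator_`. The sketch below checks this on synthetic data; the dataset and the small grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# refit=True is the default, so the best candidate is retrained on all of X_train.
search = GridSearchCV(LogisticRegression(), {"C": [0.2, 1.0]}, cv=3)
search.fit(X_train, y_train)

# The search object delegates predict to the refitted best estimator.
same = np.array_equal(search.predict(X_test), search.best_estimator_.predict(X_test))
print(same)  # True
```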