# Ensemble learning

In this chapter, we will look into ensemble learning: we will train multiple models and combine them, hopefully improving our score. To keep things quick and simple, we'll use only minimal preprocessing. This will reduce the overall performance of the models, but it will also cut the time it takes to train them.
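
To make the idea concrete, here is a minimal sketch of hard (majority) voting, one common way of combining several classifiers. The prediction lists are made-up toy data and `majority_vote` is a hypothetical helper written for this illustration; it is not part of the chapter's scripts.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` is a list of equally long label lists, one per model.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) yields the (label, count) pair with the highest count
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Toy labels: each imaginary model is wrong on exactly one (different) sample.
truth = [0, 1, 1, 0, 1]
model_a = [1, 1, 1, 0, 1]  # wrong on sample 0
model_b = [0, 0, 1, 0, 1]  # wrong on sample 1
model_c = [0, 1, 1, 0, 0]  # wrong on sample 4

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
print(ensemble == truth)
```

Because the three toy models make their mistakes on different samples, the other two outvote each error. Real models tend to make correlated errors, so the gain is usually smaller.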

In [1]:
import sys
sys.path.append('scripts')

%matplotlib inline

import random as rnd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import process
import transforms
from multiprocessing import Pool, Queue
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import f1_score
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

sb.set_style('dark')

## Data

This should come as no surprise by now. We'll load the training and validation datasets.

In [2]:
%%time
data_train, target_train = process.load_dataset_target('data/train.csv', 'data/train_target.csv')
data_val, target_val = process.load_dataset_target('data/valid.csv', 'data/valid_target.csv')

CPU times: user 7.98 s, sys: 810 ms, total: 8.79 s
Wall time: 8.82 s


In [3]:
target_train = target_train[0]
target_val = target_val[0]

## Training a model

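It's time to build the pipelines and train a proper model. Each pipeline drops zero-variance features, applies a Box-Cox transform, standardizes the result, and feeds it to a bagged classifier: Gaussian naive Bayes in one, decision stumps (trees of depth 1) in the other. Onward, brothers!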

In [4]:
pipeline_bayes = make_pipeline(
    VarianceThreshold(),
    transforms.BoxcoxTransform(),
    StandardScaler(),
    BaggingClassifier(GaussianNB(), n_jobs=-1, verbose=2)
)
pipeline_tree = make_pipeline(
    VarianceThreshold(),
    transforms.BoxcoxTransform(),
    StandardScaler(),
    BaggingClassifier(DecisionTreeClassifier(max_depth=1, class_weight='balanced'), n_jobs=-1, verbose=2)
)
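
The `transforms.BoxcoxTransform` step comes from this project's own `transforms` module, whose source is not shown here. As a rough sketch of what such a step might look like, the hypothetical `BoxCoxSketch` class below fits one Box-Cox lambda per column with `scipy.stats.boxcox`. The shift that makes each column strictly positive and the clipping of unseen values are assumptions made for this illustration, not the project's actual behaviour.

```python
import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin


class BoxCoxSketch(BaseEstimator, TransformerMixin):
    """Hypothetical per-column Box-Cox transformer (illustration only)."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Box-Cox needs strictly positive input, so shift each column
        # until its training minimum equals 1 (an assumption of this sketch).
        self.shift_ = 1.0 - X.min(axis=0)
        # With lmbda=None, boxcox returns (transformed, fitted_lambda).
        self.lambdas_ = np.array([
            stats.boxcox(X[:, j] + self.shift_[j])[1]
            for j in range(X.shape[1])
        ])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Clip so that unseen values below the training minimum stay positive.
        shifted = np.maximum(X + self.shift_, 1e-6)
        return np.column_stack([
            stats.boxcox(shifted[:, j], lmbda=self.lambdas_[j])
            for j in range(X.shape[1])
        ])


# Skewed toy data: log-normal columns have a long right tail.
rng = np.random.default_rng(0)
X_toy = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 3))
X_toy_t = BoxCoxSketch().fit_transform(X_toy)

print('skew before:', stats.skew(X_toy, axis=0))
print('skew after: ', stats.skew(X_toy_t, axis=0))
```

Constant columns would make the lambda fit fail, which is one reason to run `VarianceThreshold` first, as the pipelines above do.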

In [5]:
%%time
model = VotingClassifier([
        ('bayes', pipeline_bayes),
        ('tree', pipeline_tree),
    ], n_jobs=-1).fit(data_train, target_train)

Building estimator 1 of 2 for this parallel run (total 10)...


Building estimator 1 of 2 for this parallel run (total 10)...
Building estimator 2 of 2 for this parallel run (total 10)...
Building estimator 2 of 2 for this parallel run (total 10)...


[Parallel(n_jobs=8)]: Done   1 out of   1 | elapsed:    3.7s remaining:    0.0s


Building estimator 1 of 2 for this parallel run (total 10)...


[Parallel(n_jobs=8)]: Done   1 out of   1 | elapsed:    3.5s remaining:    0.0s


Building estimator 1 of 2 for this parallel run (total 10)...
Building estimator 2 of 2 for this parallel run (total 10)...
Building estimator 2 of 2 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...


[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:   17.9s finished
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:   18.7s finished


CPU times: user 1.1 s, sys: 244 ms, total: 1.35 s
Wall time: 3min 24s
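
Since the real fit takes several minutes, the same structure can be tried on synthetic data first. The sketch below is a scaled-down, assumed setup: `make_classification` data, no Box-Cox step, and `voting='soft'` instead of the default hard voting used above. With only two members a hard vote can tie, whereas soft voting averages the predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the chapter's dataset (an assumption of this sketch).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

toy_bayes = make_pipeline(
    StandardScaler(),
    BaggingClassifier(GaussianNB(), n_estimators=10, random_state=0),
)
toy_tree = make_pipeline(
    StandardScaler(),
    BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                      n_estimators=10, random_state=0),
)

toy_model = VotingClassifier(
    [('bayes', toy_bayes), ('tree', toy_tree)],
    voting='soft',  # average predicted probabilities instead of counting votes
).fit(X_train, y_train)

# The labels are already 0 and 1, so the fitted members can be scored directly.
for name, member in toy_model.named_estimators_.items():
    print(name, member.score(X_test, y_test))
print('ensemble', toy_model.score(X_test, y_test))
```

Comparing each member's score with the ensemble's shows whether the combination actually helps on a given dataset.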


## Scoring on the validation dataset

Now let's see how well the model performs.

In [6]:
%%time
score = model.score(data_val, target_val)

[Parallel(n_jobs=8)]: Done   3 out of   8 | elapsed:  1.0min remaining:  1.7min
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  2.2min remaining:    0.0s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  2.2min finished
[Parallel(n_jobs=8)]: Done   3 out of   8 | elapsed:   30.7s remaining:   51.1s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  1.7min remaining:    0.0s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  1.7min finished


CPU times: user 4min 37s, sys: 3min 41s, total: 8min 19s
Wall time: 8min 43s


In [7]:
print('Accuracy: %f' % score)

Accuracy: 0.060242
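
`model.score` reports plain accuracy. `f1_score` is imported at the top of the notebook but never used; the sketch below shows, on made-up labels, how the two metrics can disagree when classes are imbalanced. The toy arrays and the choice of `average='macro'` are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print('Accuracy: %f' % acc)
print('Macro F1: %f' % macro_f1)

# On the chapter's data the analogous call would be something like:
# f1_score(target_val, model.predict(data_val), average='macro')
```

Accuracy rewards the majority-class guess here, while macro F1 penalises the class that is never predicted.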


And that's the score of the model. Finally, let's save this model.

In [8]:
joblib.dump(model, 'models/ensemble.pkl')

['models/ensemble.pkl']
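
A saved model can be restored later with `joblib.load`. The round trip below is a small sketch on a toy classifier written to a temporary directory, so it does not touch `models/ensemble.pkl`; the toy data and file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data and model, standing in for the real ensemble.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(100, 4))
y_small = (X_small[:, 0] > 0).astype(int)
clf = GaussianNB().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(clf, path)          # returns the list of written file names
    restored = joblib.load(path)    # rebuilds the fitted estimator

# The restored model should reproduce the original predictions.
same = np.array_equal(clf.predict(X_small), restored.predict(X_small))
print(same)
```

Pickled models are tied to the library versions that produced them, so it is safest to reload them with the same scikit-learn version used for training.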