# Balance the dataset

I forgot that the dataset is so inbalanced. That's the reason I've got so bad models. I could not help myself but try to balance the dataset and train one more model. Let's go.

In [1]:
import sys
sys.path.append('scripts')

%matplotlib inline

import random as rnd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import process
from transforms import LowVarianceRemover, BoxcoxTransform
from multiprocessing import Pool, Queue
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import ExtraTreesClassifier, BaggingClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.externals import joblib

sb.set_style('dark')

In [2]:
data = process.load_dataset('data/train.csv', 'data/train_target.csv')

In [3]:
non_zero = data[data.target != 0]
non_zero.shape

(184788, 50)

In [4]:
avg_per_class = non_zero.groupby(by='target').count()[[0]].mean()[0]
avg_per_class

9725.6842105263149

Now we know the following things:

- after removing 0 targets, we are left with about 185000 data points
- there are 9725 data points for each class on average

In [5]:
zero = data[data.target == 0].sample(int(avg_per_class / 3 * 2))

In [6]:
balanced_data = pd.concat([non_zero, zero])
balanced_data.head()

Unnamed: 0,cli_pl_header,cli_pl_body,cli_cont_len,srv_pl_header,srv_pl_body,srv_cont_len,aggregated_sessions,bytes,net_samples,tcp_frag,...,cli_tx_time,load_time,server_latency,proxy,sp_healthscore,sp_req_duration,sp_is_lat,sp_error,throughput,target
0,667,0,0,336,143151,143151,1,155494,1,0,...,0.0,766.522,1.097,0,0,0,0,0,202.856539,8
1,667,0,0,336,143151,143151,1,310988,1,0,...,0.0,766.496,1.494,0,0,0,0,0,405.72684,8
2,667,0,0,336,143151,143151,1,310988,1,0,...,0.0,766.544,1.104,0,0,0,0,0,405.701434,8
3,592,0,0,238,1269,1269,1,2401,1,0,...,0.0,293.212,1.639,0,0,0,0,0,8.188614,5
4,592,0,0,238,1269,1269,1,4802,1,0,...,0.0,293.205,1.646,0,0,0,0,0,16.37762,5


In [7]:
balanced_data.shape

(191271, 50)

Now we have a balanced dataset. Let's train a model.

## Extra trees

In [8]:
pipeline = make_pipeline(
    LowVarianceRemover(),
    PCA(0.95),
    BoxcoxTransform(),
    StandardScaler(),
    ExtraTreesClassifier(n_jobs=-1, verbose=2)
)

In [9]:
%%time
model = GridSearchCV(pipeline, dict(extratreesclassifier__max_depth=[1, 2, 3]), n_jobs=-1, verbose=2, error_score=0)
model.fit(balanced_data.drop('target', axis=1), balanced_data.target)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] extratreesclassifier__max_depth=1 ...............................
[CV] extratreesclassifier__max_depth=1 ...............................
[CV] extratreesclassifier__max_depth=1 ...............................
[CV] extratreesclassifier__max_depth=2 ...............................
[CV] extratreesclassifier__max_depth=2 ...............................
[CV] extratreesclassifier__max_depth=2 ...............................
[CV] extratreesclassifier__max_depth=3 ...............................
[CV] extratreesclassifier__max_depth=3 ...............................
building tree 5 of 10
building tree 1 of 10
building tree 4 of 10
building tree 2 of 10
building tree 7 of 10
building tree 8 of 10
building tree 3 of 10
building tree 6 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.2s finished


building tree 3 of 10
building tree 8 of 10
building tree 1 of 10
building tree 7 of 10
building tree 4 of 10
building tree 2 of 10
building tree 5 of 10
building tree 6 of 10
building tree 10 of 10
building tree 9 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished


building tree 1 of 10
building tree 6 of 10
building tree 3 of 10
building tree 5 of 10
building tree 8 of 10
building tree 4 of 10
building tree 2 of 10
building tree 7 of 10
building tree 10 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s


building tree 9 of 10


[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished


building tree 8 of 10
building tree 5 of 10
building tree 1 of 10
building tree 3 of 10
building tree 2 of 10
building tree 6 of 10
building tree 4 of 10
building tree 7 of 10
building tree 10 of 10
building tree 9 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s


building tree 6 of 10
building tree 1 of 10
building tree 4 of 10
building tree 3 of 10
building tree 7 of 10


[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished


building tree 8 of 10
building tree 2 of 10
building tree 5 of 10
building tree 10 of 10
building tree 9 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished


building tree 4 of 10
building tree 3 of 10
building tree 2 of 10
building tree 1 of 10
building tree 8 of 10
building tree 6 of 10
building tree 7 of 10
building tree 5 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished


building tree 6 of 10
building tree 3 of 10
building tree 4 of 10
building tree 2 of 10
building tree 5 of 10
building tree 7 of 10
building tree 8 of 10
building tree 1 of 10
building tree 10 of 10
building tree 9 of 10


[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished


building tree 6 of 10
building tree 4 of 10
building tree 8 of 10
building tree 7 of 10
building tree 3 of 10
building tree 1 of 10
building tree 2 of 10
building tree 5 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8

[CV] ................ extratreesclassifier__max_depth=1, total=   7.3s
[CV] extratreesclassifier__max_depth=3 ...............................


[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


[CV] ................ extratreesclassifier__max_depth=2, total=   7.7s


[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


[CV] ................ extratreesclassifier__max_depth=1, total=   8.0s
[CV] ................ extratreesclassifier__max_depth=3, total=   7.7s


[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:   12.9s remaining:   16.1s
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


[CV] ................ extratreesclassifier__max_depth=2, total=   8.3s


[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


[CV] ................ extratreesclassifier__max_depth=1, total=   9.4s


[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


[CV] ................ extratreesclassifier__max_depth=2, total=   9.8s


[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


[CV] ................ extratreesclassifier__max_depth=3, total=  10.0s
building tree 2 of 10
building tree 3 of 10
building tree 8 of 10
building tree 5 of 10
building tree 7 of 10
building tree 1 of 10
building tree 6 of 10
building tree 4 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


[CV] ................ extratreesclassifier__max_depth=3, total=   5.1s


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   18.9s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   18.9s finished


building tree 1 of 10building tree 4 of 10building tree 2 of 10building tree 5 of 10building tree 3 of 10building tree 6 of 10building tree 7 of 10building tree 8 of 10







building tree 9 of 10
building tree 10 of 10
CPU times: user 5.64 s, sys: 504 ms, total: 6.15 s
Wall time: 23.2 s


[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished


In [10]:
model.best_params_

{'extratreesclassifier__max_depth': 1}

Now let's evaluate the model.

In [11]:
val_data, _ = process.load_dataset_target('data/valid.csv', 'data/valid_target.csv')

In [12]:
predicted = pd.DataFrame(model.predict(val_data))
predicted.to_csv('predictions/extratrees.csv', header=False, index=False)

[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.4s finished


In [13]:
%%bash
python3 scripts/eval.py data/valid_target.csv predictions/extratrees.csv


Evaluating predictions/extratrees.csv using as gold labels data/valid_target.csv

Detailed Report:
  category  ||   precision  ||     recall   ||       F1     ||    examples  ||
1           ||        0.0000||        0.0000||        0.0000||           644||
4           ||        0.0000||        0.0000||        0.0000||          6736||
5           ||        0.0000||        0.0000||        0.0000||          8104||
6           ||        0.0000||        0.0000||        0.0000||         13764||
8           ||        0.0000||        0.0000||        0.0000||         40961||
12          ||        0.0420||        1.0000||        0.0807||         31985||
13          ||        0.0000||        0.0000||        0.0000||          2807||
14          ||        0.0000||        0.0000||        0.0000||         22590||
15          ||        0.0000||        0.0000||        0.0000||           781||
19          ||        0.0000||        0.0000||        0.0000||          5482||
20          ||        0.0000|| 

## Bagging

In [14]:
pipeline_ens_bagging = make_pipeline(
    VarianceThreshold(),
    BoxcoxTransform(),
    StandardScaler(),
    BaggingClassifier(n_jobs=-1, verbose=2)
)
pipeline_ens_bayes = make_pipeline(
    LowVarianceRemover(),
    PCA(0.95),
    BoxcoxTransform(),
    StandardScaler(),
    GaussianNB()
)

In [15]:
%%time
model_ens = VotingClassifier([
    ('bagging', pipeline_ens_bagging),
    ('bayes', pipeline_ens_bayes)
])
model_ens.fit(balanced_data.drop('target', axis=1), balanced_data.target)

  llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))
  tmp2 = (x - v) * (fx - fw)


Building estimator 1 of 2 for this parallel run (total 10)...
Building estimator 1 of 2 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 1 of 1 for this parallel run (total 10)...
Building estimator 2 of 2 for this parallel run (total 10)...
Building estimator 2 of 2 for this parallel run (total 10)...


[Parallel(n_jobs=8)]: Done   3 out of   8 | elapsed:    6.1s remaining:   10.2s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    9.5s remaining:    0.0s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    9.5s finished


CPU times: user 44 s, sys: 6.47 s, total: 50.4 s
Wall time: 58.9 s


Now let's evaluate this model, too.

In [16]:
predicted_ens = pd.DataFrame(model.predict(val_data))
predicted_ens.to_csv('predictions/ensemble.csv', header=False, index=False)

[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.4s finished


In [17]:
%%bash
python3 scripts/eval.py data/valid_target.csv predictions/ensemble.csv


Evaluating predictions/ensemble.csv using as gold labels data/valid_target.csv

Detailed Report:
  category  ||   precision  ||     recall   ||       F1     ||    examples  ||
1           ||        0.0000||        0.0000||        0.0000||           644||
4           ||        0.0000||        0.0000||        0.0000||          6736||
5           ||        0.0000||        0.0000||        0.0000||          8104||
6           ||        0.0000||        0.0000||        0.0000||         13764||
8           ||        0.0000||        0.0000||        0.0000||         40961||
12          ||        0.0420||        1.0000||        0.0807||         31985||
13          ||        0.0000||        0.0000||        0.0000||          2807||
14          ||        0.0000||        0.0000||        0.0000||         22590||
15          ||        0.0000||        0.0000||        0.0000||           781||
19          ||        0.0000||        0.0000||        0.0000||          5482||
20          ||        0.0000||   

To be honest, I expected a better improvement. But I don't know what else to improve, so that's it.