# Conclusion

In this data science series, we explored my playground, where we loaded and analyzed the [Netcla dataset](http://www.neteye-blog.com/netcla-the-ecml-pkdd-network-classification-challenge/). We trained a couple of basic prediction models, followed by a few ensemble models.

To save time, we always trained on only a sample of the whole dataset, which is probably why most models scored poorly. We also lacked the domain knowledge to fully understand the data, so preprocessing stayed basic, which likely lowered the scores further.

The last step is to load our saved models, predict labels for the validation data, and run the `eval.py` script to see how they perform.

In [1]:
import sys
sys.path.append('scripts')

import pandas as pd
import process
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [2]:
data, _ = process.load_dataset_target('data/valid.csv', 'data/valid_target.csv')

## Notes

These are my personal notes for this task. I must say, it was a lot of fun! I enjoyed writing Python code, transforming the data, trying various predictors, and seeing how they perform. I lacked domain knowledge and therefore didn't bother much with feature engineering or with examining the data properly. For instance, I ignored outliers.

When I started, I wrote two transforms, one of which is `BoxcoxTransform` (in `scripts/transforms.py`), which applies the Box-Cox transform to the dataset. Unfortunately, I wrote it badly: it performed the Box-Cox transform on each row instead of on each column. This also carried a huge performance penalty.

I didn't know about the bug at first, so I always used only a small subset of the whole dataset for training models. After I fixed `BoxcoxTransform`, things suddenly moved very quickly. For example, the extra trees model originally took 25 minutes to train on a small subset of the dataset. After the fix, it took only 16 seconds!
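To illustrate the difference, here is a minimal sketch of a Box-Cox transform that works column by column. It is not the actual code from `scripts/transforms.py`; the class name `ColumnwiseBoxCox`, the attributes `shift_` and `lambdas_`, and the demo data are all made up for this example. It assumes numeric, non-constant columns and shifts each one so its minimum becomes 1, because Box-Cox only accepts strictly positive values.

```python
import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnwiseBoxCox(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for BoxcoxTransform: one lambda per column."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Box-Cox needs strictly positive input, so shift each column
        # until its smallest training value equals 1.
        self.shift_ = 1.0 - X.min(axis=0)
        # Estimate one lambda per column (feature), never per row.
        self.lambdas_ = np.array([
            stats.boxcox(X[:, j] + self.shift_[j])[1]
            for j in range(X.shape[1])
        ])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Clip so values below the training minimum stay positive.
        shifted = np.clip(X + self.shift_, 1e-6, None)
        columns = [
            stats.boxcox(shifted[:, j], lmbda=self.lambdas_[j])
            for j in range(X.shape[1])
        ]
        return np.column_stack(columns)


# Toy data: 200 rows, 3 skewed features.
rng = np.random.default_rng(0)
X_demo = rng.lognormal(size=(200, 3))
boxcox_demo = ColumnwiseBoxCox().fit(X_demo)
X_out = boxcox_demo.transform(X_demo)
```

With this layout the expensive lambda search runs once per feature during `fit`, rather than once per row, which is in line with the speed-up I saw after the fix.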

Amazed by this newfound POWER, I trained on the whole dataset using more parameters for grid search and more complex models. Even more fun!

Finally, thank you for this class. It was AMAZING!!!

## Extra trees with grid search

First, predict and store the predictions.

In [3]:
trees_model = joblib.load('models/extratrees.pkl')
trees_result = pd.DataFrame(trees_model.predict(data))

[Parallel(n_jobs=8)]: Done   7 out of  10 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.4s finished


In [4]:
trees_result.to_csv('predictions/extratrees.csv', header=False, index=False)

Now evaluate the model.

In [5]:
%%bash
python3 scripts/eval.py data/valid_target.csv predictions/extratrees.csv


Evaluating predictions/extratrees.csv using as gold labels data/valid_target.csv

Detailed Report:
  category  ||   precision  ||     recall   ||       F1     ||    examples  ||
1           ||        0.0000||        0.0000||        0.0000||           644||
4           ||        0.0000||        0.0000||        0.0000||          6736||
5           ||        0.0000||        0.0000||        0.0000||          8104||
6           ||        0.0000||        0.0000||        0.0000||         13764||
8           ||        0.0000||        0.0000||        0.0000||         40961||
12          ||        0.0000||        0.0000||        0.0000||         31985||
13          ||        0.0000||        0.0000||        0.0000||          2807||
14          ||        0.0000||        0.0000||        0.0000||         22590||
15          ||        0.0000||        0.0000||        0.0000||           781||
19          ||        0.0000||        0.0000||        0.0000||          5482||
20          ||        0.0000|| 

## Ensemble

First, predict and store the predictions.

In [6]:
ensemble_model = joblib.load('models/ensemble.pkl')
ensemble_result = pd.DataFrame(ensemble_model.predict(data))

  llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))
  tmp2 = (x - v) * (fx - fw)
[Parallel(n_jobs=8)]: Done   3 out of   8 | elapsed:  1.1min remaining:  1.8min
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  2.3min remaining:    0.0s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  2.3min finished
[Parallel(n_jobs=8)]: Done   3 out of   8 | elapsed:   33.7s remaining:   56.2s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  1.8min remaining:    0.0s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:  1.8min finished


In [7]:
ensemble_result.to_csv('predictions/ensemble.csv', header=False, index=False)

Now evaluate the model.

In [8]:
%%bash
python3 scripts/eval.py data/valid_target.csv predictions/ensemble.csv


Evaluating predictions/ensemble.csv using as gold labels data/valid_target.csv

Detailed Report:
  category  ||   precision  ||     recall   ||       F1     ||    examples  ||
1           ||        0.0023||        0.6925||        0.0045||           644||
4           ||        0.0263||        0.8481||        0.0510||          6736||
5           ||        0.0000||        0.0000||        0.0000||          8104||
6           ||        0.3676||        0.4804||        0.4165||         13764||
8           ||        0.4015||        0.0039||        0.0077||         40961||
12          ||        0.9990||        0.2232||        0.3649||         31985||
13          ||        0.0000||        0.0000||        0.0000||          2807||
14          ||        0.0000||        0.0000||        0.0000||         22590||
15          ||        0.0022||        0.5595||        0.0045||           781||
19          ||        0.0289||        0.5252||        0.0548||          5482||
20          ||        0.0000||   

## Evaluation

I don't really know why both models perform so poorly. The extra trees model does not work at all: it always predicts 0.
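A quick way to confirm this kind of failure is to compare how often each label appears in the gold file and in the predictions. The sketch below uses toy lists instead of the real files, and `label_summary` is a helper name I made up for this example.

```python
import pandas as pd


def label_summary(gold, predicted):
    """Count how often each label occurs in the gold and predicted labels."""
    summary = pd.DataFrame({
        'gold': pd.Series(gold).value_counts(),
        'predicted': pd.Series(predicted).value_counts(),
    })
    # Labels missing on one side show up as NaN; treat them as zero counts.
    return summary.fillna(0).astype(int).sort_index()


# Toy data standing in for the gold labels and the stored predictions.
gold_demo = [1, 4, 4, 5, 5, 5]
pred_demo = [0, 0, 0, 0, 0, 0]
print(label_summary(gold_demo, pred_demo))
```

If every prediction falls under a label that never occurs in the gold column, as in the toy data, the problem probably lies in the inputs or the label encoding rather than in the model's accuracy. For the real run, the two lists would come from reading the prediction CSV and the target file, assuming both hold one label per row.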

One possible reason for this is that the input data is not being transformed by the model before predicting.
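One way to rule this out is to store the preprocessing and the classifier together in a single scikit-learn `Pipeline`, so that `predict` always applies the fitted transforms first. The sketch below is only an illustration under my own assumptions: `StandardScaler` is a placeholder for the real transforms, the data is random, and the file name `pipeline_demo.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random toy data: 100 rows, 4 features, binary target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] > 0).astype(int)

# The preprocessing step and the classifier travel together in one object.
pipeline = Pipeline([
    ('scale', StandardScaler()),  # placeholder for the real transforms
    ('trees', ExtraTreesClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X_train, y_train)

# Save and reload the whole pipeline, not just the final estimator.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pipeline_demo.pkl')  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# predict() on the restored pipeline re-applies the fitted scaler first.
X_new = rng.normal(size=(5, 4))
predictions = restored.predict(X_new)
```

If the saved `.pkl` files contain only the bare estimator, the validation data would have to pass through the same fitted transforms by hand before calling `predict`.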

Anyway, I'm signing off. Have a good day!