In [12]:
%load_ext autoreload
%autoreload

import sys
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import Normalizer

sys.path.append('../../utils')
from utils import time_comparison



### Reliability over time

Let's see how the classifier performs over time when it is trained on a few days of data and then tested on completely new samples. We first train on 6000 malware samples collected from the 15th of June onward, which should cover approximately one week of analysis.

In [9]:
# Training set: 6000 samples starting on the 15th of June
gt = pd.read_csv("../../../dumps/time_analysis/threshold_3/3_20190615_6000.csv")
cols = [col for col in gt.columns if col != 'label']
data_train = gt[cols]
target_train = gt['label']

# Shallow tree with a fixed random_state so the run is reproducible
tree = DecisionTreeClassifier(max_depth=4, min_samples_split=0.1, min_samples_leaf=10, random_state=0)
tree.fit(data_train, target_train)
dump(tree, "snapshots/tree_3_20190615_6000.joblib")

['snapshots/tree_3_20190615_6000.joblib']
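
Since the snapshot written above is reloaded in a later cell, it can be worth checking that a model survives the `dump`/`load` round trip unchanged. The following is only a minimal sketch on synthetic data: the features, the temporary directory and the file name `tree_demo.joblib` are illustrative assumptions, not part of this notebook's datasets.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))      # synthetic features, for illustration only
y = (X[:, 0] > 0).astype(int)      # synthetic labels

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tree_demo.joblib")  # hypothetical file name
    written = dump(tree, path)                      # dump returns the list of files written
    restored = load(path)                           # reload while the file still exists

# The restored model should predict exactly like the original one
same_predictions = np.array_equal(tree.predict(X), restored.predict(X))
print(written == [path], same_predictions)
```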

For the analysis to be relevant, the test samples should not come from a period close to the one used to build the training set. We therefore evaluate on a dataset generated well after the oldest results in the reference set: the test file below is dated the 8th of August, just under two months after the 15th of June.
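
The idea of training on an early period and testing on a clearly later one can be sketched as a date-based split with a gap in between. This is a hedged illustration on synthetic data: the `date`, `f1` and `f2` columns and the cut-off dates are assumptions made for the example (the notebook's CSV files are already split by date).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "date": pd.date_range("2019-06-15", periods=n, freq="h"),  # hypothetical timestamp column
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
})
df["label"] = (df["f1"] + 0.5 * df["f2"] > 0).astype(int)

# Train on the first week, skip a gap, test on what comes later
train_end = pd.Timestamp("2019-06-22")
test_start = pd.Timestamp("2019-08-08")
train = df[df["date"] < train_end]
test = df[df["date"] >= test_start]

features = [c for c in df.columns if c not in ("date", "label")]
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
tree.fit(train[features], train["label"])

train_acc = tree.score(train[features], train["label"])
test_acc = tree.score(test[features], test["label"])
print("train: {:.3f}  test: {:.3f}".format(train_acc, test_acc))
```

Keeping the gap explicit makes it harder to accidentally evaluate on samples that overlap with the training period.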

In [10]:
# Reload the snapshot trained in the previous cell
tree = load("snapshots/tree_3_20190615_6000.joblib")

# Test set: 1000 samples from a later period
gt = pd.read_csv("../../../dumps/time_analysis/threshold_3/3_20190808_1000.csv")
cols = [col for col in gt.columns if col != 'label']
data_test = gt[cols]
target_test = gt['label']

print("Accuracy on training set: {:.3f}".format(tree.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(data_test, target_test)))

Accuracy on training set: 0.946
Accuracy on test set: 0.640


Conclusion: the decision tree performs very well when the training and testing periods are close, but accuracy drops sharply on later data (0.946 on the training set versus 0.640 on the test set). The tree seems to overfit to the training period.
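
An accuracy figure is easier to interpret next to a majority-class baseline, since a low score on an imbalanced test set may be no better than always predicting the most frequent label. Here is a minimal sketch on synthetic, deliberately imbalanced data; nothing below comes from the notebook's datasets.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] > 0.5).astype(int)   # imbalanced synthetic labels
X_test = rng.normal(size=(300, 5))
y_test = (X_test[:, 0] > 0.5).astype(int)

# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
tree = DecisionTreeClassifier(
    max_depth=4, min_samples_split=0.1, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
tree_acc = tree.score(X_test, y_test)
print("baseline: {:.3f}  tree: {:.3f}".format(baseline_acc, tree_acc))
```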

#### What if we normalize too?

In [5]:
gt = pd.read_csv("../../../dumps/time_analysis/threshold_3/3_20190615_6000.csv")
cols = [col for col in gt.columns if col != 'label']
data_train = gt[cols]
target_train = gt['label']

# Rescale each sample (row) to unit norm before training
scaler = Normalizer()
scaler.fit(data_train)
data_train = scaler.transform(data_train)

tree = DecisionTreeClassifier(max_depth=4, min_samples_split=0.1, min_samples_leaf=10, random_state=0)
tree.fit(data_train, target_train)
dump(tree, "snapshots/tree_3_20190615_6000n.joblib")

['snapshots/tree_3_20190615_6000n.joblib']

In [13]:
tree = load("snapshots/tree_3_20190615_6000n.joblib")

gt = pd.read_csv("../../../dumps/time_analysis/threshold_3/3_20190808_1000.csv")
cols = [col for col in gt.columns if col != 'label']
data_test = gt[cols]
target_test = gt['label']

# Normalizer rescales each row to unit norm and is stateless, so fit() learns nothing from the test set
scaler = Normalizer()
scaler.fit(data_test)
data_test = scaler.transform(data_test)

# data_train must hold the normalized training features here, otherwise the first score is meaningless
print("Accuracy on training set: {:.3f}".format(tree.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(data_test, target_test)))

Accuracy on training set: 0.100
Accuracy on test set: 0.923


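Normalization appears to bring a large improvement on the test set (0.923 versus 0.640 without it). The training accuracy of 0.100 is suspicious, though: the cell numbers suggest that `data_train` had been overwritten with the un-normalized features (cell 9 ran after cell 5), in which case that figure does not reflect the model's real training accuracy. The comparison deserves a clean re-run before drawing conclusions.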
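
When a model is trained on normalized features, every later call to `score` or `predict` has to receive features normalized the same way. One way to guarantee this is to bundle the normalizer and the tree in a single scikit-learn pipeline, so that raw features can always be passed in directly. The sketch below uses synthetic data and is an assumption about how this could be organized, not what the notebook does.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic positive features, for illustration only
X = rng.integers(1, 50, size=(500, 8)).astype(float)
y = (X[:, 0] > X[:, 1]).astype(int)

# The pipeline applies Normalizer before the tree, both in fit and in score
model = make_pipeline(
    Normalizer(),
    DecisionTreeClassifier(
        max_depth=4, min_samples_split=0.1, min_samples_leaf=10, random_state=0
    ),
)
model.fit(X[:400], y[:400])

train_acc = model.score(X[:400], y[:400])   # raw features in, normalized internally
test_acc = model.score(X[400:], y[400:])

# Each normalized row has unit L2 norm
row_norms = np.linalg.norm(model.named_steps["normalizer"].transform(X), axis=1)
print("train: {:.3f}  test: {:.3f}".format(train_acc, test_acc))
```

The whole pipeline can be saved with `dump` in place of the bare tree, which removes the need to remember whether a given snapshot expects normalized input.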

#### Long run

Let's repeat the process and see how performance changes as the size of the training set increases (the test set is the same as in the previous experiment).

In [14]:
time_comparison('tree','../../')

Acceptation threshold : 1/5 

  # malwares in training set    Approx. period in weeks    Training acc    Test acc
----------------------------  -------------------------  --------------  ----------
                        6000                   0.857143        0.9405         0.868
                       14000                   2               0.936857       0.803
                       21000                   3               0.923333       0.669
                       31000                   4.42857         0.868774       0.609
Acceptation threshold : 2/5 

  # malwares in training set    Approx. period in weeks    Training acc    Test acc
----------------------------  -------------------------  --------------  ----------
                        6000                   0.857143        0.924          0.716
                       14000                   2               0.954714       0.716
                       21000                   3               0.96619        0.776
                

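For the 2/5 threshold this makes sense: the larger the training set, the higher the test accuracy (0.716 with 6000 samples, 0.776 with 21000). I guess this is also because the most recent training samples are "closer" in time to the ones in the test set. For the 1/5 threshold, however, the table shows the opposite trend: test accuracy falls from 0.868 to 0.609 as the training set grows, so the effect seems to depend on the threshold.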
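
The internals of `time_comparison` are not shown in this notebook. As a rough, hypothetical approximation of the experiment (not its actual implementation), the loop below grows the training set one period at a time and scores each model on a fixed later test set. The data is synthetic, and `make_period` with its `shift` parameter is an invented helper that simulates drift over time.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def make_period(n, shift, rng):
    """Synthetic samples whose decision boundary drifts with `shift` (illustrative only)."""
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + shift * X[:, 1] > 0).astype(int)
    return X, y

rng = np.random.default_rng(0)
# Four consecutive "weeks" with increasing drift, then a later test period
weeks = [make_period(1000, s, rng) for s in (0.0, 0.3, 0.6, 0.9)]
X_test, y_test = make_period(500, 1.5, rng)

rows = []
for k in range(1, len(weeks) + 1):
    # Training set made of the first k weeks
    X_train = np.vstack([X for X, _ in weeks[:k]])
    y_train = np.concatenate([y for _, y in weeks[:k]])
    tree = DecisionTreeClassifier(
        max_depth=4, min_samples_split=0.1, min_samples_leaf=10, random_state=0
    )
    tree.fit(X_train, y_train)
    rows.append({
        "n_train": len(y_train),
        "weeks": k,
        "train_acc": tree.score(X_train, y_train),
        "test_acc": tree.score(X_test, y_test),
    })

results = pd.DataFrame(rows)
print(results.to_string(index=False))
```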