# 6 - Model and hyper parameters choice

The dataset used in this notebook is the same one used in the last two practice sessions: you can copy it into the current directory or download it again from [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html).
Remember:
- you need the *kddcup.data.gz* file
- it is a compressed archive, so you have to extract it before loading it
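
If you prefer to extract the archive from Python rather than with a system tool, the following is a minimal sketch using only the standard library. The helper name `extract_gz` and the paths in the usage comment are assumptions made for illustration; adapt them to where you stored the download.

```python
import gzip
import shutil
from pathlib import Path


def extract_gz(src, dst):
    """Decompress the gzip archive `src` into the plain file `dst`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # copyfileobj streams the data in chunks, so memory use stays low
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)


# Hypothetical usage (both paths are assumptions):
# extract_gz("datasets/kddcup.data.gz", "datasets/kddcup.data.corrected")
```

Streaming the copy matters here because the uncompressed file is large; reading it all into memory first would be wasteful.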

---

This session will focus on:
- model choice
- hyper parameters selection

---

In [1]:
import pandas as pd
import numpy as np
import time

## Load the dataset

In [2]:
FILENAME = 'datasets/kddcup.data.corrected'

In [3]:
# feature names obtained from: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
header_names = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate', 'attack_type'
]

In [4]:
df = pd.read_csv(FILENAME, header=None, names=header_names, sep=',')

<div class="alert alert-block alert-info">
<b>
IMPORTANT:
    
The cell below reduces the size of the dataframe by sampling from it. 
This is done to work with a smaller amount of data, in case your machine cannot deal with the whole dataset. 
Try to run the notebook without running this cell first; if you run into any problems, come back here, run the cell and rerun the rest of the notebook with less data.
</b>

If you still have trouble, there is a smaller version available on the same website.
The file name is *kddcup.data_10_percent.gz*.
</div>

In [5]:
df = df.sample(frac=0.60)

## Info about the dataset (pt. 1)

This is the same dataset we used in the past sessions, but it is useful to recap here some information about it.

<div class="alert alert-block alert-danger">
<b>Q: Complete the cells below.</b>
</div>

In [6]:
print("Number of rows =", df.shape[0])

Number of rows = 2939059


In [8]:
print("Number of features =", df.drop('attack_type', axis=1).shape[1])

Number of features = 41


In [10]:
print("Type of the features:")
df.info()

Type of the features:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2939059 entries, 578448 to 4164086
Data columns (total 42 columns):
duration                       int64
protocol_type                  object
service                        object
flag                           object
src_bytes                      int64
dst_bytes                      int64
land                           int64
wrong_fragment                 int64
urgent                         int64
hot                            int64
num_failed_logins              int64
logged_in                      int64
num_compromised                int64
root_shell                     int64
su_attempted                   int64
num_root                       int64
num_file_creations             int64
num_shells                     int64
num_access_files               int64
num_outbound_cmds              int64
is_host_login                  int64
is_guest_login                 int64
count                          int64
srv_co

In [11]:
set_cat_features = (
    set(df.drop('attack_type', axis=1).columns) - set(df.drop('attack_type', axis=1).describe().columns)
)
print("There are %d categorical features, and they are:" % len(set_cat_features))
print(set_cat_features)

There are 3 categorical features, and they are:
{'protocol_type', 'service', 'flag'}
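
The cell above finds the categorical features by subtracting the columns that `describe()` reports (the numeric ones) from all feature columns. A more direct way to express the same idea, sketched here on a made-up toy frame, is to ask pandas for the non-numeric columns with `select_dtypes`:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "duration": [0, 2, 5],
    "protocol_type": ["tcp", "udp", "icmp"],
    "flag": ["SF", "S0", "REJ"],
    "attack_type": ["normal.", "smurf.", "neptune."],
})

features = toy.drop("attack_type", axis=1)
# Keep every column that is not numeric: here, the string-valued ones
cat_features = set(features.select_dtypes(exclude="number").columns)
print(cat_features)
```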


## Creating variables containing indices of each feature type

In [12]:
col_names = np.array(header_names)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
# every remaining feature index is numeric; sorted() makes the column order explicit
numeric_idx = sorted(set(range(41)) - set(nominal_idx) - set(binary_idx))

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()

## Info about the dataset (pt. 2)

In [13]:
print("Number of different 'protocol_type's =", len(df['protocol_type'].unique()))
print("Number of occurrences of each protocol_type")
df.groupby('protocol_type').size().reset_index().sort_values(0, ascending=False).rename(columns={0:'cnt'})

Number of different 'protocol_type's = 3
Number of occurrences of each protocol_type


| | protocol_type | cnt |
|---|---|---|
| 0 | icmp | 1699093 |
| 1 | tcp | 1123311 |
| 2 | udp | 116655 |


<div class="alert alert-block alert-danger">
<b>Q: What is the effect of <code>.rename(columns={0:'cnt'})</code>, in the cell above?</b>
</div>

<div class="alert alert-block alert-success">
Renames the column: from 0 to 'cnt'
</div>
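
To see where the column named `0` comes from, here is a small sketch on made-up data. `groupby(...).size()` returns an unnamed Series, so `reset_index()` labels the count column with the integer `0`; `rename` then gives it a readable name. The last line shows a shortcut, `reset_index(name=...)`, that names the column in one step.

```python
import pandas as pd

toy = pd.DataFrame({"protocol_type": ["tcp", "udp", "tcp", "icmp", "tcp"]})

counts = toy.groupby("protocol_type").size().reset_index()
print(list(counts.columns))    # the count column is labelled with the integer 0

renamed = counts.sort_values(0, ascending=False).rename(columns={0: "cnt"})
print(list(renamed.columns))

# Shortcut: name the count column while resetting the index
short = toy.groupby("protocol_type").size().reset_index(name="cnt")
print(list(short.columns))
```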

In [14]:
print("Number of different 'flag's =", len(df['flag'].unique()))
print("Number of occurrences of each flag")
df.groupby('flag').size().reset_index().sort_values(0, ascending=False).rename(columns={0:'cnt'})

Number of different 'flag's = 11
Number of occurrences of each flag


| | flag | cnt |
|---|---|---|
| 9 | SF | 2246181 |
| 5 | S0 | 522487 |
| 1 | REJ | 161257 |
| 4 | RSTR | 4802 |
| 2 | RSTO | 3132 |
| 10 | SH | 632 |
| 6 | S1 | 336 |
| 7 | S2 | 95 |
| 3 | RSTOS0 | 77 |
| 0 | OTH | 32 |


In [15]:
print("Number of different 'service's =", len(df['service'].unique()))
print("Number of occurrences of each service")
df.groupby('service').size().reset_index().sort_values(0, ascending=False).rename(columns={0:'cnt'})

Number of different 'service's = 69
Number of occurrences of each service


| | service | cnt |
|---|---|---|
| 14 | ecr_i | 1685908 |
| 48 | private | 660934 |
| 23 | http | 374306 |
| 53 | smtp | 57940 |
| 43 | other | 43555 |
| 11 | domain_u | 34662 |
| 19 | ftp_data | 24537 |
| 13 | eco_i | 9801 |
| 17 | finger | 4149 |
| 64 | urp_i | 3277 |


## Mapping each attack type to one category

In [16]:
# strip the trailing '.' from each label (e.g. 'normal.' -> 'normal')
# run this cell only once: a second run would cut off a real character and break the mapping below
df['attack_type'] = df.apply(lambda r: r['attack_type'][:-1], axis=1)

The file *training_attack_types.txt* maps each attack in the original dataset to exactly one category.
The file can be found [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), or you can reuse the one from last time.

#### IMPORTANT
Check that there are no empty lines at the end of the file, otherwise the code below might not work correctly.

In [17]:
from collections import defaultdict

In [18]:
category = defaultdict(list)
category['benign'].append('normal')

In [19]:
TRAINING_ATTACK_TYPES_FILENAME = 'datasets/training_attack_types.txt'

In [20]:
with open(TRAINING_ATTACK_TYPES_FILENAME, 'r') as f:
    for line in f.readlines():
        attack, cat = line.strip().split(' ')
        category[cat].append(attack)

attack_mapping = {v: k for k in category for v in category[k]}

`attack_mapping`, as created above, is a dictionary that maps each attack_type to one category.
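
The inversion step can be easier to follow on a handful of entries. The sketch below uses a short, illustrative list of `attack category` pairs in place of the real file; the logic is the same as in the cells above.

```python
from collections import defaultdict

# Illustrative excerpt of "attack category" pairs (stand-in for the file contents)
lines = ["smurf dos", "neptune dos", "nmap probe", "rootkit u2r"]

category = defaultdict(list)
category["benign"].append("normal")
for line in lines:
    attack, cat = line.strip().split(" ")
    category[cat].append(attack)

# Invert {category: [attacks]} into {attack: category}
attack_mapping = {
    attack: cat for cat, attacks in category.items() for attack in attacks
}
print(attack_mapping["smurf"], attack_mapping["normal"])
```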

In [21]:
print("There are %d categories" % len(set(attack_mapping.values())))
print("Their names are:", set(attack_mapping.values()))

There are 5 categories
Their names are: {'dos', 'benign', 'u2r', 'probe', 'r2l'}


In [22]:
# this is the cell that performs the actual mapping
df['attack_category'] = df.apply(lambda r: attack_mapping[r['attack_type']], axis=1)

<div class="alert alert-block alert-info">
This is similar to what was done above. The difference is that here a new column is created (named "attack_category"), which contains, for each row, the category of the attack type (<code>r['attack_type']</code> is used as a key to access the dictionary <code>attack_mapping</code>) 
</div>
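
Row-wise `apply` works, but it calls the lambda once per row, which is slow on millions of rows. As a possible alternative (a sketch on toy data, not what the notebook uses), `Series.map` performs the same dictionary lookup column-wise. One difference to keep in mind: `map` yields `NaN` for a key missing from the dictionary, whereas the lambda raises a `KeyError`.

```python
import pandas as pd

attack_mapping = {"normal": "benign", "smurf": "dos", "nmap": "probe"}  # toy subset
toy = pd.DataFrame({"attack_type": ["smurf", "normal", "nmap", "smurf"]})

# Row-wise version, as in the cell above
by_apply = toy.apply(lambda r: attack_mapping[r["attack_type"]], axis=1)
# Column-wise version: every value is looked up in the dictionary in one call
by_map = toy["attack_type"].map(attack_mapping)

print(by_map.tolist())
```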

In [23]:
print("Number of occurrences of each category:")
df.groupby('attack_category').size().reset_index().sort_values(0, ascending=False)

Number of occurrences of each category:


| | attack_category | 0 |
|---|---|---|
| 1 | dos | 2329516 |
| 0 | benign | 584332 |
| 2 | probe | 24501 |
| 3 | r2l | 679 |
| 4 | u2r | 31 |


<div class="alert alert-block alert-danger">
<b>Q: Looking at the results of the previous cell you should realize that you must be very careful with training a 5-class classifier... Why?</b>
</div>

<div class="alert alert-block alert-success">
The classes are extremely unbalanced: dos and benign together cover more than 99% of the rows, while r2l and u2r have only 679 and 31 samples. The model is therefore likely to learn the most common categories well but *not* the rarest ones.

In case of unbalanced binary classification, a model may learn nothing: in some cases, it always predicts the most common class. It is up to you to detect whether that's the case by looking at the evaluation metrics. 
</div>
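
The "always predicts the most common class" failure mode is easy to reproduce. The sketch below uses synthetic labels and scikit-learn's `DummyClassifier` as a stand-in for a model that learned nothing: accuracy looks excellent, while recall on the rare class is zero.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, heavily unbalanced labels: 990 negatives and 10 positives
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # the features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred)
print("accuracy:", acc)
print("recall:", rec)
```

Comparing your trained model against such a baseline is a quick sanity check before trusting a high accuracy value.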

## Data preparation: dummy variables

We have some categorical variables, so we have to convert them to one-hot encoded variables.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the categorical attributes with one hot encoding.</b>
</div>

In [24]:
# Convert categorical feature into dummy variables with one-hot encoding
df_one_hot = pd.get_dummies(df, columns=nominal_cols)
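
To make the effect of `pd.get_dummies` concrete, here is a sketch on a made-up frame: each listed categorical column is replaced by one indicator column per distinct value, named `<column>_<value>`, while the other columns are left untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "duration": [0, 3, 1],
    "protocol_type": ["tcp", "udp", "tcp"],
})

encoded = pd.get_dummies(toy, columns=["protocol_type"])
print(list(encoded.columns))
```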

## Data preparation: Train-test split

In [26]:
from sklearn.model_selection import train_test_split

# Split dataset up into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_one_hot.drop(['attack_category', 'attack_type'], axis=1), 
    df_one_hot['attack_category'], 
    test_size=0.2
)
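
With classes as rare as u2r, a purely random split can leave very few (or no) samples of a class in one of the two sets. One option worth trying, sketched here on synthetic labels, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. The `random_state` value is an arbitrary choice made for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 95 samples of a common class, 5 of a rare one
y = np.array(["dos"] * 95 + ["u2r"] * 5)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
n_rare_test = int((y_te == "u2r").sum())
print(n_rare_test, "rare samples in a test set of", len(y_te))
```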

## Data preparation: scaling

The cell that fits the scaler might take a while to run.

In [27]:
from sklearn.preprocessing import StandardScaler

In [28]:
# if it crashes, you might want to try rerunning the notebook on less data (see df.sample at the beginning)
standard_scaler = StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols] = standard_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = standard_scaler.transform(X_test[numeric_cols])



<div class="alert alert-block alert-danger">
<b>Q: Why is the scaling performed using the X_train variable only?</b>
</div>

<div class="alert alert-block alert-success">
Because otherwise you'd be using information about the test set during the training phase: it is a methodological error.
</div>
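
A tiny numeric sketch (made-up values) shows what "fit on the training set only" means in practice: the mean and standard deviation are estimated from `X_train`, and the very same statistics are then reused to transform `X_test`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler().fit(X_train)   # statistics come from the training data only
X_test_scaled = scaler.transform(X_test)  # the test data reuses those statistics

print(scaler.mean_)      # mean of the training column
print(X_test_scaled)     # (10 - train mean) / train std
```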

## Data preparation: converting label to integers

In [29]:
# compare with ==, not `is`: `is` tests object identity and is unreliable for strings
y_train_bin = y_train.apply(lambda x: 0 if x == 'benign' else 1)
y_test_bin = y_test.apply(lambda x: 0 if x == 'benign' else 1)

REMEMBER: `.apply` applies the `lambda` function within the `()` to each element of the sequence (in this case a Series).
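
The same labels can also be produced without a lambda. The sketch below, on a toy Series, compares the element-by-element `apply` with a vectorized comparison followed by a cast to integers. Note that the comparison uses `==` (value equality); `is` tests object identity and should not be used to compare strings.

```python
import pandas as pd

y = pd.Series(["benign", "dos", "probe", "benign"])

by_apply = y.apply(lambda x: 0 if x == "benign" else 1)
# Elementwise comparison gives booleans; astype(int) turns them into 0/1
by_compare = (y != "benign").astype(int)

print(by_compare.tolist())
```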

---

# CLASSIFICATION

In [30]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import (
    accuracy_score, 
    recall_score,
    precision_score,
    confusion_matrix
)

---

# 6.1 - BINARY CLASSIFICATION

Now you have to train a model for performing binary classification (i.e. detecting whether one entry is malicious or not, without caring about the attack_type).

You have to perform cross validation and hyperparameter selection in order to find the best performing model.

## First example of Cross Validation with GridSearch

The following cell performs hyper parameter selection with 5-fold cross validation.
The comments inside this cell should provide enough explanation to let you write the code for performing cross validation on different models and different hyper parameters.

<div class="alert alert-block alert-info">
<b>IMPORTANT</b>: you'll probably notice that some combinations of parameters take a very long time to run. This is normal: for each possible combination of parameters, the model is trained once per fold (the number of folds is the <code>cv</code> value set in GridSearchCV). Thus, the elapsed time that you saw in the previous practice session will get much longer.
    
You might also have some problems due to memory limitations.
As mentioned at the beginning of the notebook, try to sample the original DF or, alternatively, use the 10 percent dataset (<i>kddcup.data_10_percent.gz</i>, which you can download from the website).
    
For this session in the classroom, use small lists of parameters and 2-fold or 3-fold cross validation in order to run the code faster.
However, while working at home you have to explore larger choices of hyperparameters and use 5-fold cross validation in order to find better performing models.
</div>
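
Before launching a search, it helps to estimate how many models will be trained. A rough sketch: the number of fits is the number of parameter combinations times the number of folds (plus one final refit on the whole training set). The helper `n_fits` is a hypothetical name introduced here, and the two grids are only illustrations of a small and a large search.

```python
from sklearn.model_selection import ParameterGrid


def n_fits(param_grid, cv):
    """Number of cross-validation fits for a grid (excluding the final refit)."""
    return len(ParameterGrid(param_grid)) * cv


small = {"clf__n_estimators": [10, 25, 50]}
large = {
    "clf__n_estimators": [25, 50, 75, 100, 125, 150],
    "clf__max_depth": [2, 5, 7, 10, 15, 20],
    "clf__max_features": ["log2", "sqrt", None],
}

fits_small = n_fits(small, cv=3)
fits_large = n_fits(large, cv=5)
print(fits_small, fits_large)
```

Multiplying the result by the time of a single fit gives a first estimate of how long the whole search will take.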

- This line defines the model, in this case a RandomForestClassifier but it could be any classifier (DecisionTreeClassifier, SVC, etc.)

In [31]:
classifier = RandomForestClassifier()

- this line defines a Pipeline, in this case it is made of only one element and this is what you will need to use for now

In [32]:
pipe = Pipeline(steps=[('clf', classifier)])

- parameters of the model in the pipeline can be set using `'__'` separated parameter names. In this case, `'clf__n_estimators'` means that the following list is a list of values for the `n_estimators` attribute of the `clf` model. Here, it is the number of trees of the RandomForest

In [33]:
param_grid = {
    'clf__n_estimators': [10, 25, 50],
}
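
If you are unsure whether a parameter name is spelled correctly, you can check it against the pipeline itself. In this sketch, `get_params()` lists every name the pipeline accepts, and `set_params` uses the same `<step name>__<parameter name>` convention as the grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("clf", RandomForestClassifier())])

# Every tunable name has the form <step name>__<parameter name>
has_param = "clf__n_estimators" in pipe.get_params()
has_typo = "clf_n_estimators" in pipe.get_params()  # single underscore: not a valid name
print(has_param, has_typo)

pipe.set_params(clf__n_estimators=25)
n_trees = pipe.named_steps["clf"].n_estimators
print(n_trees)
```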

- performs cross validation for each of the parameters set above. `cv=3` means that I perform 3-fold cross validation


In [34]:
# NOTE: `iid` was deprecated in scikit-learn 0.22 and removed in 0.24; drop it on newer versions
searchRF = GridSearchCV(pipe, param_grid, iid=False, cv=3, verbose=True)
searchRF.fit(X_train, y_train_bin)

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  6.3min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=False, n_jobs=None,
       param_grid={'clf__n_estimators': [10, 25, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=True)

- once the training is done, you can show the parameters of the best performing model

In [35]:
print(searchRF.best_params_)

{'clf__n_estimators': 25}


- and you can also retrieve it in order to perform the final test on the test set

In [36]:
trainedRF = searchRF.best_estimator_.get_params()['clf']
trainedRF

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [37]:
y_pred_RF = trainedRF.predict(X_test)
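
As a side note, with the default `refit=True` the search object retrains the best configuration on the whole training set, so the fitted model can be reached in more than one way. The sketch below, on a small synthetic dataset (an assumption made to keep it fast), checks that predicting through the search object and through the extracted classifier gives the same result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic problem, only to keep the example quick to run
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline(steps=[("clf", RandomForestClassifier(random_state=0))])
search = GridSearchCV(pipe, {"clf__n_estimators": [5, 10]}, cv=3)
search.fit(X, y)

# named_steps gives access to the refitted classifier inside the best pipeline
best_rf = search.best_estimator_.named_steps["clf"]
same = bool((search.predict(X) == best_rf.predict(X)).all())
print(same)
```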

<div class="alert alert-block alert-danger">
<b>Q: Complete the following cell in order to perform the final evaluation of the model on the test set.</b>
</div>

In [38]:
print("accuracy:", accuracy_score(y_test_bin, y_pred_RF)) 
print("recall:", recall_score(y_test_bin, y_pred_RF))
print("precision:", precision_score(y_test_bin, y_pred_RF))

accuracy: 0.999948963273972
recall: 0.9999447844682957
precision: 0.9999915049059168
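
`confusion_matrix` is imported above as well, and it is a useful way to see where precision and recall come from. A minimal sketch on made-up labels, where class 1 plays the role of "attack":

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

# For labels {0, 1}, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

manual_precision = tp / (tp + fp)   # of the predicted attacks, how many are real
manual_recall = tp / (tp + fn)      # of the real attacks, how many were caught
print(manual_precision, precision_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
```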


## Another example, this time with a DecisionTreeClassifier

It is more convenient to write everything in a single cell, in order to keep everything under control.

Here I perform 3-fold cross validation on a DecisionTreeClassifier and explore 3 possible values of `max_depth` and 2 possible values of `max_leaf_nodes`.

In [40]:
classifier = DecisionTreeClassifier()

pipe = Pipeline(steps=[('clf', classifier)])

# as you can see, you can set lists of values for all the attributes of the model
param_grid = {
    'clf__max_leaf_nodes': [10, None],
    'clf__max_depth': [2, 5, 10],
}

searchDT = GridSearchCV(pipe, param_grid, iid=False, cv=3, verbose=True)
searchDT.fit(X_train, y_train_bin)

print(searchDT.best_params_)

trainedDT = searchDT.best_estimator_.get_params()['clf']
trainedDT

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:  2.7min finished


{'clf__max_depth': 10, 'clf__max_leaf_nodes': None}


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [41]:
y_pred_DT = trainedDT.predict(X_test)

<div class="alert alert-block alert-danger">
<b>Q: Complete the following cell in order to perform the final evaluation of the model on the test set.</b>
</div>

In [42]:
print("accuracy:", accuracy_score(y_test_bin, y_pred_DT)) 
print("recall:", recall_score(y_test_bin, y_pred_DT))
print("precision:", precision_score(y_test_bin, y_pred_DT))

accuracy: 0.999215735643369
recall: 0.9990698306582116
precision: 0.9999511124053334


In [43]:
# print("accuracy:", accuracy_score()) 
# print("recall:", recall_score())
# print("precision:", precision_score())

## Now your turn

<div class="alert alert-block alert-danger">
<b>Q: Use the scikit-learn Pipeline and GridSearchCV in order to find the best model and hyperparameter configuration. Fill the template below.</b>
</div>
As mentioned above, keep things simple for this practice session, otherwise you won't have time to see the results of the training.

Possible models to explore:
- GaussianNB
- DecisionTreeClassifier
- RandomForestClassifier
- LogisticRegression
- ... but **do** feel free to experiment with other ones as well! 

In [57]:
classifier = RandomForestClassifier()

pipe = Pipeline(steps=[('clf', classifier)])

param_grid = {
    'clf__n_estimators': [25, 50, 75, 100, 125, 150],
    'clf__max_depth': [2, 5, 7, 10, 15, 20],
    'clf__max_features': ['log2', 'sqrt', None]  # double underscore between step name and parameter
}

search = GridSearchCV(pipe, param_grid, iid=False, cv=5, verbose=True)
search.fit(X_train, y_train_bin)

print(search.best_params_)

trained_clf = search.best_estimator_.get_params()['clf']
trained_clf

In [58]:
y_pred = trained_clf.predict(X_test)

print("accuracy:", accuracy_score(y_test_bin, y_pred)) 
print("recall:", recall_score(y_test_bin, y_pred))
print("precision:", precision_score(y_test_bin, y_pred))

---

# 6.2 - 5-CLASS CLASSIFICATION

This last section focuses on performing multiclass classification.
Instead of trying to classify each entry as *attack* or *not attack*, we now try to classify it as belonging to one of the following classes:
- dos
- benign
- probe
- r2l
- u2r

Here is an example of model and hyperparameters selection:

In [47]:
pipe = Pipeline(steps=[('clf', RandomForestClassifier())])

param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [2, 5, 10],
}

search = GridSearchCV(pipe, param_grid, iid=False, cv=3, verbose=True)
search.fit(X_train, y_train)

print(search.best_params_)

trained_clf = search.best_estimator_.get_params()['clf']
trained_clf

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 19.8min finished


{'clf__max_depth': 10, 'clf__n_estimators': 100}


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [48]:
y_pred = trained_clf.predict(X_test)

In [49]:
print("accuracy:", accuracy_score(y_test, y_pred)) 

accuracy: 0.9996665600566167


In [50]:
from sklearn.metrics import classification_report

In [51]:
clf_report = classification_report(y_test, y_pred, output_dict=True)



In [52]:
for attack_type in set(attack_mapping.values()):
    print(attack_type)
    print(clf_report[attack_type])
    print()

dos
{'precision': 0.9999978532618613, 'recall': 0.9999892664014752, 'f1-score': 0.9999935598132346, 'support': 465827}

benign
{'precision': 0.9983351120597652, 'recall': 1.0, 'f1-score': 0.9991668624895857, 'support': 116930}

u2r
{'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 10}

probe
{'precision': 1.0, 'recall': 0.9842985318107668, 'f1-score': 0.9920871441783989, 'support': 4904}

r2l
{'precision': 1.0, 'recall': 0.2624113475177305, 'f1-score': 0.41573033707865165, 'support': 141}



<div class="alert alert-block alert-danger">
<b>Q: Now it's your turn. Experiment with different models, hyperparameters and try to find the best performing model using cross validation with Pipeline and GridSearchCV. Do not forget to tinker with different features as well: as introduced in the last session, try to remove some features, focus only on some features, etc. in order to improve the classification accuracy.
    
Remember: *one perfect model* to solve this problem does not exist, but by approaching the problem in the correct way, you can get better and better models.
</b>
</div>

Remember, when performing the final evaluation, to look at all the 5 classes and not only at the overall accuracy!
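
One way to summarise all classes in a single number without letting the large classes dominate is a macro average, which weights every class equally. The sketch below uses synthetic predictions in which the rare class is never predicted: the accuracy stays high, while the macro-averaged recall drops sharply.

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic predictions: the single "u2r" sample is misclassified as "benign"
y_true = ["dos"] * 90 + ["benign"] * 9 + ["u2r"] * 1
y_pred = ["dos"] * 90 + ["benign"] * 9 + ["benign"] * 1

acc = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")
print("accuracy:", acc)
print("macro recall:", macro_recall)
```

The macro recall is the mean of the per-class recalls (1, 1 and 0 here), so a class the model never detects is immediately visible.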

<div class="alert alert-block alert-success">
Several things you might want to try:
    
- work on the features: are all of them important? How are they distributed? Does the model get better if you remove some of the features? etc...

- work on the data: is the StandardScaler the best scaler to use? Are the chosen parameters the ones leading to the best result? Should you remove some of the entries of the most common classes in order to focus on the minority classes?

- try different models and different hyper parameters for each model (you cannot try "every" possible combination, so try to pick wisely). For instance, you can first do a coarse-grained search and, after that, a fine-grained search "around" the best parameters. The GridSearchCV might even take hours to run if you set too many parameters; for the sake of this practice session you might want to play with the smaller dataset (10 percent)
</div>
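
As one possible starting point for "removing some of the entries of the most common classes", here is a sketch of random undersampling on a synthetic frame. The cap `MAX_PER_CLASS` is a hypothetical value to be tuned, and the fixed `random_state` is only there for reproducibility. If you try this, apply it to the training set only, so that the test set keeps the real class distribution.

```python
import pandas as pd

# Synthetic frame with one dominant class
toy = pd.DataFrame({
    "feature": range(110),
    "attack_category": ["dos"] * 100 + ["probe"] * 8 + ["u2r"] * 2,
})

MAX_PER_CLASS = 10  # hypothetical cap on the number of rows kept per class

# Keep at most MAX_PER_CLASS randomly chosen rows of each class
balanced = pd.concat([
    group.sample(n=min(len(group), MAX_PER_CLASS), random_state=0)
    for _, group in toy.groupby("attack_category")
])
class_counts = balanced["attack_category"].value_counts().to_dict()
print(class_counts)
```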

In [53]:
# classifier = ... # feel free to use any models you want

# pipe = Pipeline(steps=[('clf', classifier)])

# param_grid = {
#     ...
# }

# search = GridSearchCV(pipe, param_grid, iid=False, cv=..., verbose=True)
# search.fit(X_train, y_train)

# print(search.best_params_)

# trained_clf = search...
# trained_clf

In [54]:
# y_pred = trained_clf...

In [55]:
# clf_report = 

---

<div class="alert alert-block alert-info">
After you finish the notebook, take a look at this link: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

In the second Section ("Face recognition with eigenfaces"), it presents an example of how to use the techniques that you have seen so far in a more challenging problem (face recognition), which is also very relevant from a security perspective, since face recognition is often used for authentication on hand-held devices.
</div>