# Hyperparameter tuning for a Random Forest classifier
In this notebook, we will use grid search and random search to tune the hyperparameters of a Random Forest network traffic classifier.

We will use benign traffic and various DDoS attacks from the CIC-DDoS2019 dataset (https://www.unb.ca/cic/datasets/ddos-2019.html).
The network traffic has been pre-processed so that packets are grouped into bi-directional traffic flows using the 5-tuple (source IP, destination IP, source port, destination port, protocol). Each flow is represented by 21 packet-header features computed from at most 10 packets:

| Feature nr.         | Feature Name |
|---------------------|---------------------|
| 00 | timestamp (mean IAT) | 
| 01 | packet_length (mean)| 
| 02 | IP_flags_df (sum) |
| 03 | IP_flags_mf (sum) |
| 04 | IP_flags_rb (sum) | 
| 05 | IP_frag_off (sum) |
| 06 | protocols (mean) |
| 07 | TCP_length (mean) |
| 08 | TCP_flags_ack (sum) |
| 09 | TCP_flags_cwr (sum) |
| 10 | TCP_flags_ece (sum) |
| 11 | TCP_flags_fin (sum) |
| 12 | TCP_flags_push (sum) |
| 13 | TCP_flags_res (sum) |
| 14 | TCP_flags_reset (sum) |
| 15 | TCP_flags_syn (sum) |
| 16 | TCP_flags_urg (sum) |
| 17 | TCP_window_size (mean) |
| 18 | UDP_length (mean) |
| 19 | ICMP_type (mean) |
| 20 | Packets (counter)|
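
The pre-processing code is not part of this notebook, so the following is only a minimal sketch of how packets could be grouped into bi-directional flows by their 5-tuple. The packet layout (plain dictionaries), the helper names `flow_key` and `group_packets`, and the choice to simply ignore packets beyond the tenth are assumptions made for illustration.

```python
from collections import defaultdict

MAX_PACKETS_PER_FLOW = 10  # each flow is described by at most 10 packets


def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Return a direction-independent key for a packet's 5-tuple."""
    endpoint_a = (src_ip, src_port)
    endpoint_b = (dst_ip, dst_port)
    # Sort the two endpoints so that A->B and B->A map to the same key.
    first, second = sorted([endpoint_a, endpoint_b])
    return (first, second, protocol)


def group_packets(packets):
    """Group packets (hypothetical dict layout) into bi-directional flows."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["dst_ip"],
                       pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        # Assumption: packets beyond the limit are ignored.
        if len(flows[key]) < MAX_PACKETS_PER_FLOW:
            flows[key].append(pkt)
    return dict(flows)


packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 1234,
     "dst_port": 80, "protocol": 6, "length": 60},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 80,
     "dst_port": 1234, "protocol": 6, "length": 1500},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 5353,
     "dst_port": 53, "protocol": 17, "length": 80},
]

flows = group_packets(packets)
for key, pkts in flows.items():
    mean_length = sum(p["length"] for p in pkts) / len(pkts)
    print(key, "packets:", len(pkts), "mean length:", mean_length)
```

The first two packets travel in opposite directions between the same endpoints and therefore end up in the same flow, from which per-flow statistics such as the mean packet length can be computed.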

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import numpy as np
from sklearn.metrics import classification_report, f1_score, accuracy_score, get_scorer_names
from util_functions import *
from IPython.display import Image, display
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

OUTPUT_FILE = "./rf_tree"
DATASET_FOLDER = "./DOS2019"

from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})

SEED=1
feature_names = get_feature_names()
target_names = ['benign', 'dns',  'syn', 'udplag', 'webddos'] #IMPORTANT: when adding new classes, maintain the alphabetical order
X_train, Y_train = load_dataset(DATASET_FOLDER + "/*" + '-train.hdf5')

In [None]:
def show_tree(tree_clf, feature_names):
    export_graphviz(
        tree_clf,
        out_file=OUTPUT_FILE + ".dot",
        feature_names=feature_names,
        class_names=target_names,
        rounded=True,
        filled=True
    )

    # convert the "dot" file into a png image
    os.system("dot -Tpng " + OUTPUT_FILE + ".dot -o " + OUTPUT_FILE + ".png")
    display(Image(filename=OUTPUT_FILE + ".png"))

# Grid search
Train a Random Forest classifier on every combination in a grid of hyperparameters, evaluating each combination with ```k```-fold cross-validation. The RF is trained ```k``` times on each combination of hyperparameters, each time holding out a different fold for validation.
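
To get a feel for the cost of an exhaustive search, the sketch below enumerates the grid with `itertools.product` and counts the cross-validation fits. It repeats the grid values used in the next cell so that it runs on its own; `K_FOLDS` is a name introduced here for illustration and mirrors the `cv=3` setting.

```python
from itertools import product

# Same values as the grid used in this notebook, repeated so the sketch is standalone.
param_grid = {
    "n_estimators": [10, 20, 50],
    "max_depth": [2, 5, 10, 20],
    "min_samples_split": [0.1, 0.2, 0.5],
    "min_samples_leaf": [0.1, 0.2, 0.5],
}
K_FOLDS = 3  # illustrative name, corresponds to cv=3

# Cartesian product of all value lists: one dict per candidate configuration.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Every candidate is trained once per fold.
cv_fits = len(combinations) * K_FOLDS

print("Combinations:", len(combinations))
print("Cross-validation fits:", cv_fits)
print("First candidate:", combinations[0])
```

With this grid there are 108 combinations and 324 cross-validation fits. With the default `refit=True`, `GridSearchCV` additionally retrains the best configuration once on the whole training set.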

In [None]:
# Define the parameter grid
param_grid = {
    "n_estimators": [10, 20, 50],
    "max_depth": [2, 5, 10, 20],
    # Floats are interpreted as fractions of the training samples
    "min_samples_split": [0.1, 0.2, 0.5],
    "min_samples_leaf": [0.1, 0.2, 0.5]
}
n_combinations = np.prod([len(values) for values in param_grid.values()])
print("Total combinations: ", n_combinations)

# Create the random forest classifier (seeded for reproducibility)
rf = RandomForestClassifier(random_state=SEED)

# Create the grid search object
gs = GridSearchCV(rf, param_grid=param_grid, cv=3, scoring='f1_weighted', verbose=2)

# Fit the grid search object to the data
gs.fit(X_train, Y_train)

# Inference using the RF model
Use the trained RF to make predictions on the test set.
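
The grid search in this notebook ranks candidates with `scoring='f1_weighted'`, and `classification_report` shows the same kind of quantity in its `weighted avg` row. As a rough illustration of what that number means, the sketch below computes a support-weighted F1 score in plain Python. The function name `weighted_f1` and the toy labels are made up for this example; in practice `f1_score(..., average='weighted')` from scikit-learn does this job.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by the class support in y_true."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1 = 2 * tp / denom if denom else 0.0
        score += support[label] / total * f1
    return score


# Toy labels, invented for illustration only.
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 2, 0]
print(round(weighted_f1(toy_true, toy_pred), 3))
```

Because each class contributes in proportion to its number of samples, frequent classes dominate the score, which is worth keeping in mind when the traffic classes are imbalanced.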

In [None]:
X_test, y_test = load_dataset(DATASET_FOLDER + "/*" + '-test.hdf5')
# Retrieve the best model, refit on the full training set with the best hyperparameters
best_model = gs.best_estimator_ 
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))

print("\nBest parameters: ", gs.best_params_) 

# Random search
Train a Random Forest classifier using random combinations of hyperparameters sampled from specified distributions. With randomized search, the total training time is controlled by two settings: the maximum number of combinations to test and the number of cross-validation folds ```k```.
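
The sampling idea can be sketched with the standard library alone. In the example below, Python's `random` module stands in for the `scipy.stats` distributions, and the ranges, the helper `sample_candidate`, and the constants `N_ITER` and `K_FOLDS` are illustrative assumptions rather than the exact settings of the search.

```python
import random

N_ITER = 100   # number of random combinations to test (illustrative)
K_FOLDS = 3    # number of cross-validation folds (illustrative)

rng = random.Random(1)  # fixed seed so the sampled candidates are reproducible


def sample_candidate(rng):
    """Draw one hyperparameter combination from the (assumed) distributions."""
    return {
        "n_estimators": rng.randrange(10, 50),       # integer in [10, 49]
        "max_depth": rng.randrange(2, 20),           # integer in [2, 19]
        "min_samples_split": rng.uniform(0.1, 0.5),  # fraction of the samples
        "min_samples_leaf": rng.uniform(0.1, 0.5),   # fraction of the samples
    }


candidates = [sample_candidate(rng) for _ in range(N_ITER)]

# The budget is fixed in advance, regardless of how large the ranges are.
print("Candidates:", len(candidates))
print("Cross-validation fits:", len(candidates) * K_FOLDS)
print("Example candidate:", candidates[0])
```

Unlike the exhaustive grid, the number of fits depends only on the chosen budget (here 100 candidates and 3 folds, so 300 fits), and continuous ranges can be explored without fixing a handful of values beforehand.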

In [None]:
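# Define the parameter distributions
from scipy.stats import uniform, randint

param_dist = {
    # randint(low, high) excludes high: samples integers in [10, 49] and [2, 19]
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 20),
    # uniform(loc, scale) samples from [loc, loc + scale], i.e. [0.1, 0.5] here
    "min_samples_split": uniform(0.1, 0.4),
    "min_samples_leaf": uniform(0.1, 0.4)
}

# Create the random forest classifier (seeded for reproducibility)
rf = RandomForestClassifier(random_state=SEED)

# Create the randomized search object (same scoring as the grid search, for a fair comparison)
gs = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=100, cv=3,
                        scoring='f1_weighted', verbose=2, random_state=SEED)

# Fit the randomized search object to the data
gs.fit(X_train, Y_train)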

# Inference using the RF model
Use the trained RF to make predictions on the test set.

In [None]:
X_test, y_test = load_dataset(DATASET_FOLDER + "/*" + '-test.hdf5')
# Retrieve the best model, refit on the full training set with the best hyperparameters
best_model = gs.best_estimator_ 
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))

print("\nBest parameters: ", gs.best_params_) 