# Feature importance analysis
In this laboratory, you will use a Random Forest to estimate the relative importance of the features in the training set. This technique is often used to discard irrelevant features before training a model.

You will use a dataset of benign traffic and various DDoS attacks taken from CIC-DDoS2019 (https://www.unb.ca/cic/datasets/ddos-2019.html).
The network traffic has been pre-processed so that packets are grouped into bi-directional traffic flows using the 5-tuple (source IP, destination IP, source port, destination port, protocol). Each flow is represented by 21 packet-header features computed from at most 1000 packets:

| Feature nr.         | Feature Name |
|---------------------|---------------------|
| 00 | timestamp (mean IAT) | 
| 01 | packet_length (mean)| 
| 02 | IP_flags_df (sum) |
| 03 | IP_flags_mf (sum) |
| 04 | IP_flags_rb (sum) | 
| 05 | IP_frag_off (sum) |
| 06 | protocols (mean) |
| 07 | TCP_length (mean) |
| 08 | TCP_flags_ack (sum) |
| 09 | TCP_flags_cwr (sum) |
| 10 | TCP_flags_ece (sum) |
| 11 | TCP_flags_fin (sum) |
| 12 | TCP_flags_push (sum) |
| 13 | TCP_flags_res (sum) |
| 14 | TCP_flags_reset (sum) |
| 15 | TCP_flags_syn (sum) |
| 16 | TCP_flags_urg (sum) |
| 17 | TCP_window_size (mean) |
| 18 | UDP_length (mean) |
| 19 | ICMP_type (mean) |
| 20 | Packets (counter)|
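
The flow grouping described above can be illustrated with a minimal sketch. It is not the pre-processing code used to build this dataset: the packet dictionaries, the `flow_key` and `group_packets` helpers and the sample values are all assumptions made for illustration. The idea is to order the two endpoints so that both directions of a conversation map to the same key. Ignoring packets beyond the 1000-packet cap is also an assumption about how the cap is enforced.

```python
from collections import defaultdict

MAX_PACKETS_PER_FLOW = 1000  # cap mentioned in the text


def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Return a direction-independent key for a packet's 5-tuple (hypothetical helper)."""
    endpoint_a = (src_ip, src_port)
    endpoint_b = (dst_ip, dst_port)
    # Sorting the endpoints makes A->B and B->A share the same key.
    first, second = sorted([endpoint_a, endpoint_b])
    return (first, second, protocol)


def group_packets(packets):
    """Group packets into bi-directional flows (hypothetical helper)."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["dst_ip"],
                       pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        # Assumption: packets beyond the cap are simply ignored.
        if len(flows[key]) < MAX_PACKETS_PER_FLOW:
            flows[key].append(pkt)
    return dict(flows)


# Made-up packets: a TCP request, its reply, and an unrelated UDP packet.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "length": 60},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 80,
     "dst_port": 40000, "protocol": 6, "length": 1500},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 5353,
     "dst_port": 53, "protocol": 17, "length": 90},
]

flows = group_packets(packets)

# One per-flow feature, in the spirit of "packet_length (mean)".
mean_lengths = {key: sum(p["length"] for p in pkts) / len(pkts)
                for key, pkts in flows.items()}
print(len(flows), "flows")
```

The request and its reply end up in the same flow, so per-flow statistics such as the mean packet length are computed over both directions.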

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import numpy as np
import glob
import h5py
import sys
import copy
import argparse
from sklearn.metrics import classification_report, accuracy_score
import logging
from util_functions import *
from IPython.display import Image, display
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

OUTPUT_FILE = "./rf_tree"
DATASET_FOLDER = "./DOS2019"

from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})

SEED=1
feature_names = get_feature_names()
target_names = ['benign', 'dns',  'syn', 'udplag', 'webddos'] #IMPORTANT: when adding new classes, maintain the alphabetical order
target_names_full = ['benign', 'dns', 'ldap', 'mssql', 'netbios', 'ntp', 'portmap', 'snmp', 'ssdp', 'syn', 'tftp', 'udp', 'udplag', 'webddos'] # we use this to match class names with the class numbers returned by the RF
X_train, y_train = load_dataset(DATASET_FOLDER + "/*" + '-train.hdf5')
y_train = np.where(y_train==1)[1] # from one-hot encoding to class indices

In [None]:
def show_tree(tree_clf, feature_names):
    export_graphviz(
        tree_clf,
        out_file=OUTPUT_FILE + ".dot",
        feature_names=feature_names,
        class_names=target_names,
        rounded=True,
        filled=True
    )

    # convert the "dot" file into a PNG image
    os.system("dot -Tpng " + OUTPUT_FILE + ".dot -o " + OUTPUT_FILE + ".png")
    display(Image(filename=OUTPUT_FILE + ".png"))

# Classification with Random Forests
Implement a Random Forest classifier with 100 trees (estimators) and experiment with the regularisation hyper-parameters, such as max_depth, min_samples_split and min_samples_leaf.
Then replace the Random Forest classifier with an ExtraTreesClassifier and test the same regularisation hyper-parameters.
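
As a rough illustration of how these hyper-parameters are passed, the sketch below trains both ensemble types on a synthetic dataset. The data generated with `make_classification`, the `_demo` names and the specific hyper-parameter values are assumptions chosen for illustration, not the recommended settings for this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic stand-in for the lab data (assumption): 21 features, 5 classes.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=21, n_informative=8, n_redundant=2,
    n_classes=5, random_state=1,
)

# Illustrative regularisation settings shared by both models.
common = dict(
    n_estimators=100,      # number of trees
    max_depth=8,           # stop growing a tree beyond this depth
    min_samples_split=4,   # a node needs at least 4 samples to be split
    min_samples_leaf=2,    # every leaf keeps at least 2 samples
    random_state=1,
)

rf_demo = RandomForestClassifier(**common).fit(X_demo, y_demo)
et_demo = ExtraTreesClassifier(**common).fit(X_demo, y_demo)

# Depth actually reached by each tree, to check the effect of max_depth.
rf_depths = [tree.get_depth() for tree in rf_demo.estimators_]
et_depths = [tree.get_depth() for tree in et_demo.estimators_]

print("RF deepest tree:", max(rf_depths))
print("ET deepest tree:", max(et_depths))
print("RF training accuracy:", rf_demo.score(X_demo, y_demo))
print("ET training accuracy:", et_demo.score(X_demo, y_demo))
```

Both classes accept the same regularisation arguments, so swapping one for the other only requires changing the class name. Training accuracy is shown only as a sanity check and says little about generalization.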

In [None]:
### ADD YOUR CODE HERE ### 
### Define the RF model and train it using the dataset loaded above
### Set the number of estimators (n_estimators), the stopping strategy (e.g., max_depth) and enable the OOB score (oob_score=True)
rf = 

##########################

# Validation using the OOB score
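OOB stands for "Out-of-Bag". Each tree of a random forest is trained on a bootstrap sample of the training set, so the samples left out of that bootstrap sample (the out-of-bag samples) can be used to evaluate the tree. The OOB score aggregates these evaluations to estimate the model's performance on unseen data **without the need for a separate validation set**, which makes it a convenient way to assess the generalization capability of a random forest classifier or regressor.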
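
The following minimal sketch shows where scikit-learn exposes the OOB estimate and compares it with the accuracy on a held-out split. The synthetic data and the `_demo` names are assumptions made for illustration, so the numbers say nothing about the lab dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption): 21 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=21, n_informative=8, n_classes=3,
    class_sep=2.0, random_state=1,
)
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1,
)

# oob_score=True requires bootstrap sampling (the default for random forests).
rf_demo = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=1,
)
rf_demo.fit(X_fit, y_fit)

# For a classifier, oob_score_ is the accuracy measured on out-of-bag samples.
oob_demo = rf_demo.oob_score_
held_out_demo = rf_demo.score(X_held, y_held)

print(f"OOB accuracy:      {oob_demo:.3f}")
print(f"Held-out accuracy: {held_out_demo:.3f}")
```

The held-out split is used here only to check that the OOB estimate is in the same range. In practice the OOB score lets you keep all the training samples for fitting.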

In [None]:
### ADD YOUR CODE HERE ### 
### Read the OOB score of the trained model, then play with the hyperparameters of the RF model to see which configuration works best for this problem
oob_score = 
##########################

print(rf.get_params())
print("OOB accuracy score: ", oob_score)

In [None]:
# Let's visualise one of the decision trees of the Random Forest
tree_clf = rf.estimators_[0]
show_tree(tree_clf, feature_names)

# Feature importance
Let's now plot the importance of each feature, computed as the mean decrease in Gini impurity across the trees of the forest.
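
To make the averaging concrete, here is a minimal sketch on synthetic data. The dataset, the `f00`-style feature names and the `_demo` variables are assumptions for illustration. With `shuffle=False`, `make_classification` places the informative features in the first columns, so they are expected to rank highest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

N_INFORMATIVE = 3

# Synthetic data (assumption): columns 0-2 are informative, the rest is noise.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=N_INFORMATIVE,
    n_redundant=0, n_repeated=0, n_classes=2, shuffle=False, random_state=1,
)
names_demo = [f"f{i:02d}" for i in range(X_demo.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=100, random_state=1)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances of the forest: non-negative and summing to 1.
fi_demo = rf_demo.feature_importances_

# The same values obtained by averaging the importances of the single trees.
per_tree = np.array([tree.feature_importances_ for tree in rf_demo.estimators_])
fi_manual = per_tree.mean(axis=0)
fi_manual = fi_manual / fi_manual.sum()

# Feature indices sorted from most to least important.
ranking = np.argsort(fi_demo)[::-1]
for idx in ranking[:5]:
    print(f"{names_demo[idx]}: {fi_demo[idx]:.3f}")
```

Impurity-based importances are computed on the training data and can favour features with many distinct values, so treat the ranking as a first indication rather than a definitive answer.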

In [None]:
### ADD YOUR CODE HERE ### 
### Assign the feature importances to "fi"
fi = 
##########################

plt.barh(feature_names, fi)
plt.show()

# Inference using the RF model
Use the trained RF to make predictions on the test set.
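
The sketch below walks through the same steps on synthetic data: converting one-hot labels to class indices, predicting on held-out samples and printing a per-class report. The dataset, the class names in `names_demo` and the other `_demo` variables are assumptions for illustration and do not correspond to the lab files.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

names_demo = ["class_a", "class_b", "class_c"]  # hypothetical class names

# Synthetic stand-in for the lab data (assumption): 21 features, 3 classes.
X_demo, y_int = make_classification(
    n_samples=600, n_features=21, n_informative=8, n_classes=3, random_state=1,
)

# Mimic labels stored with one-hot encoding, then recover the class indices:
# np.where returns (row_indices, column_indices) and the columns are the classes.
y_onehot = np.eye(len(names_demo))[y_int]
y_idx = np.where(y_onehot == 1)[1]

X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_idx, test_size=0.25, stratify=y_idx, random_state=1,
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=1)
rf_demo.fit(X_fit, y_fit)

# predict() returns one class index per held-out sample.
y_pred_demo = rf_demo.predict(X_held)

report_demo = classification_report(y_held, y_pred_demo, target_names=names_demo)
print(report_demo)
```

The entries of `target_names` are matched to the sorted class indices, which is why the order of the names must follow the numeric order of the labels.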

In [None]:
X_test, y_test = load_dataset(DATASET_FOLDER + "/*" + '-test.hdf5')
y_test = np.where(y_test==1)[1] # from one-hot encoding to class indices

### ADD YOUR CODE HERE ### 
### Replace the three dots with your code
y_pred = ...
##########################

print(classification_report(y_test, y_pred, target_names=target_names))