# Feature importance analysis
In this laboratory you will use two techniques to analyse the relative importance of the dataset features. In the first part, you will train a Random Forest to estimate the relative importance of the features in the training set; this technique is often used to discard irrelevant features before training. In the second part, you will use feature elimination to understand which features contribute most to the classification task.

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Lecture on Intrusion Detection with Deep Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import numpy as np
import glob
import h5py
import sys
import copy
import argparse
from sklearn.metrics import f1_score
from tensorflow.keras.models import load_model
import tensorflow as tf
import logging

from sklearn.ensemble import RandomForestClassifier

config = tf.compat.v1.ConfigProto(inter_op_parallelism_threads=1)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.get_logger().setLevel(logging.ERROR)

from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})

SEED=1
feature_names = ['time','packet_len','highest_proto','IP flags','protocols','TCP len','TCP ack','TCP flags','TCP win_size',
                         'UDP len','ICMP type','flow_len']

In [None]:
def load_dataset(path):
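    filename = glob.glob(path)[0]  # first file matching the pattern
    # open read-only and close the file once the arrays are in memory
    with h5py.File(filename, "r") as dataset:
        set_x_orig = np.array(dataset["set_x"][:])  # features
        set_y_orig = np.array(dataset["set_y"][:])  # labels

    # add a trailing channel axis: (samples, rows, features, 1)
    X = np.reshape(set_x_orig, (set_x_orig.shape[0], set_x_orig.shape[1], set_x_orig.shape[2], 1))
    Y = set_y_orig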

    return X, Y

In [None]:
# trivial flatten method that takes only the first row
def flatten_samples(X_train):
    X_new = []
    for sample in X_train:
        sample = np.squeeze(sample)
        new_sample = sample[0]  # keeps only the first row; replace with something smarter
        new_sample = np.append(new_sample, [0], axis=0)  # extra feature: replace [0] with the number of packets/flow
        X_new.append(new_sample)
    return np.array(X_new)

In [None]:
X_train, Y_train = load_dataset("../Datasets/IDS2017/*" + '-train.hdf5')
X_train = flatten_samples(X_train)

## Feature analysis with Random Forest
Replace the code in the cell below with a Random Forest that estimates the relative importance of each feature in the training set. In this part of the laboratory, you will use a 1D representation of the flows to understand which features are more important for a pre-trained model.
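
As a rough illustration of the idea, the sketch below fits a `RandomForestClassifier` on synthetic data and reads its `feature_importances_` attribute. The arrays `X_demo` and `y_demo`, the 12-column shape, and the hyperparameters are assumptions made for this example, not values from the laboratory; in the lab you would fit on `X_train` and `Y_train` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the flattened training set (assumption: 12 features).
rng = np.random.default_rng(1)
X_demo = rng.random((500, 12))
# The label depends only on column 3, so that column should rank first.
y_demo = (X_demo[:, 3] > 0.5).astype(int)

# Hyperparameters are illustrative, not tuned.
forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X_demo, y_demo)

# One non-negative score per feature.
demo_importances = forest.feature_importances_
print(np.argmax(demo_importances))
```

These scores are impurity-based and normalised to sum to 1, so they can be passed to `plt.barh` together with `feature_names`.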

In [None]:
feature_importances_ = np.random.rand(X_train.shape[1]) # replace with a RandomForestClassifier

In [None]:
plt.barh(feature_names, feature_importances_)
plt.show()

In [None]:
X_test, Y_test = load_dataset("../Datasets/IDS2017/*" + '-test.hdf5')
X_test = flatten_samples(X_test)
model = load_model("../Models/10t-1n-mlp-IDS2017.h5")

## Analysis of feature importance with feature elimination
In the cell below, add the code necessary to remove a different feature at each iteration to estimate its importance in the classification task. 
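
A minimal sketch of the loop, under stated assumptions: a pre-trained model expects a fixed input width, so "removing" a feature is interpreted here as overwriting its column with zeros rather than deleting it. The `LogisticRegression` classifier and the synthetic `X_demo`/`y_demo` arrays are hypothetical stand-ins for the Keras model and the test set, and the importance is measured as the drop in F1 score relative to the unmasked baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in data (assumption: 4 features, label driven by column 2).
rng = np.random.default_rng(1)
X_demo = rng.random((1000, 4))
y_demo = (X_demo[:, 2] > 0.5).astype(int)

# Stand-in for the pre-trained model.
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
baseline_f1 = f1_score(y_demo, clf.predict(X_demo))

f1_drop = []
for idx in range(X_demo.shape[1]):
    X_masked = X_demo.copy()   # leave the original data untouched
    X_masked[:, idx] = 0.0     # "remove" one feature by zeroing its column
    f1 = f1_score(y_demo, clf.predict(X_masked))
    f1_drop.append(baseline_f1 - f1)  # larger drop = more important feature
```

Copying the array inside the loop matters: masking in place would accumulate zeroed columns across iterations and no longer isolate the effect of a single feature.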

In [None]:
results = []
for feature in feature_names:
    feature_index = feature_names.index(feature)
    # here remove one feature and then classify the traffic without it
    Y_pred = np.squeeze(model.predict(X_test, batch_size=2048) > 0.5) 
    f1 = f1_score(Y_test, Y_pred)
    results.append(1-f1)

In [None]:
plt.barh(np.array(feature_names), np.array(results))
plt.xlabel("Feature Importance")
plt.show()