# NIDS on CIDDS-001 OpenStack data

The steps listed below are executed for the first two parts, `Use KNN and RFC on CIDDS-001 OpenStack week 1` and `Use KNN and RFC on CIDDS-001 OpenStack week 2`. Once we have a k-Nearest Neighbors (KNN) and a Random Forest Classification (RFC) model for both week 1 and week 2, we cross the models and the test data in the final two sections, `Use knn_week1 and rfc_week1 on test data of week 2` and `Use knn_week2 and rfc_week2 on test data of week 1`. The KNN and RFC models trained on data of week 1 are scored with test data of week 2. To be able to do this, the test data of week 2 must be normalized with the `mean` and `std` parameters that were computed on the training data of the week 1 models. Analogously, the models of week 2 are scored with the test data of week 1.

## 1. Preprocessing
First, the data is preprocessed into a pandas DataFrame. The CIDDS-001 data set contains 14 columns:
* Src IP
* Src Port
* Dest IP
* Dest Port
* Proto
* Date first seen
* Duration
* Bytes
* Packets
* Flags
* Class
* AttackType
* AttackID
* AttackDescription

A few columns are not used for classification because we do not want our model to depend on them. The following columns are dropped in the preprocessing step:
* Src IP
* Src Port
* Dest IP
* Date first seen
* AttackType
* AttackID
* AttackDescription

Note: the dataset read from the files contains an extra column `Flows`, which always has the value `1` and is removed as well.

## 2. Split the preprocessed data into a training set and a test set
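After the preprocessing step, the preprocessed data is split into 80% training data and 20% test data. Note that the week 1 cell further down is run with `test_size=0.6`, which gives a 40%/60% split as its printed set sizes show, while the week 2 cell uses `test_size=0.2`.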
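
The idea behind the split can be sketched without scikit-learn as a seeded shuffle followed by a cut at the 80% mark. The helper name `split_train_test` and the toy rows are hypothetical; the notebook itself relies on `train_test_split`, whose internals may differ from this sketch.

```python
import random


def split_train_test(rows, labels, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and use the last `test_size` share as test data."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    n_train = len(rows) - n_test
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: ten flows with a single feature each (hypothetical values).
rows = [[float(i)] for i in range(10)]
labels = ["normal" if i % 2 == 0 else "attacker" for i in range(10)]

x_train, x_test, y_train, y_test = split_train_test(rows, labels, test_size=0.2, seed=0)
print(len(x_train), len(x_test))
```

Fixing the seed keeps the split reproducible between runs, which is the role `random_state` plays in the notebook cells.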

## 3. Normalize the data
Each column (except for the `Flags` columns) is z-score normalized. The `mean` and `std` of each column are determined on the training set. The same z-score calculation, with unchanged `mean` and `std`, is then applied to the corresponding column of the test set.

Note that I deliberately split the data into a training and test set _before_ normalizing. When the model sees new data, that data must be normalized with the same `mean` and `std` that the training set was normalized with. To get a more representative score for the model, the same approach is applied when normalizing the test data.
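
A minimal sketch of this fit-on-train, apply-to-test pattern is given below, in plain Python so that it runs without the notebook's `utils` module. The function names `fit_z_score` and `apply_z_score` are hypothetical and are not the helpers used later in the notebook. The sketch also assumes the population standard deviation and returns copies, whereas the notebook's helper appears to modify the DataFrames in place.

```python
from statistics import mean, pstdev


def fit_z_score(train_rows, columns):
    """Compute mean and standard deviation per column on the training rows only."""
    params = {}
    for col in columns:
        values = [row[col] for row in train_rows]
        std = pstdev(values)
        # Assumption: a constant column is only centered, not scaled.
        params[col] = (mean(values), std if std > 0 else 1.0)
    return params


def apply_z_score(rows, params):
    """Return copies of the rows with each fitted column mapped to (x - mean) / std."""
    normalized = []
    for row in rows:
        new_row = dict(row)
        for col, (mu, sigma) in params.items():
            new_row[col] = (row[col] - mu) / sigma
        normalized.append(new_row)
    return normalized


# Toy flows (hypothetical values).
train = [{"duration": 1.0, "bytes": 100.0}, {"duration": 3.0, "bytes": 300.0}]
test = [{"duration": 5.0, "bytes": 200.0}]

params = fit_z_score(train, ["duration", "bytes"])  # parameters come from the training set only
train_norm = apply_z_score(train, params)
test_norm = apply_z_score(test, params)             # test set reuses the training parameters
print(params)
print(test_norm)
```

Because the parameters are kept in `params`, they can later be reused on any other data set, which is what the cross-week sections of this notebook do with the parameters of the other week.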

## 4. Train and score the models
Finally, the normalized data is used to train and test the models. This is done using the scikit-learn implementations of k-Nearest Neighbors and Random Forest Classification.

Questions: 
* What is the `flows` column in the cleaned data set?
* If you normalize the dataset, how is new data handled? Some operation must map the new data into the same feature space the model was trained on. Should we use the same z-score calculation, with the `mean` and `std` used for training?
* Cf. question 2: we split the dataset into a training set and a test set. Do we calculate `mean` and `std` only on the training set and use the same `mean` and `std` to normalize the test set?

## Imports

In [1]:
import pandas as pd
from timeit import default_timer as timer

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from utils import cidds_001 as utils

## Global timing

In [2]:
global_start = timer()

# Use KNN and RFC on CIDDS-001 OpenStack week 1

## Preprocessing

In [3]:
dataset_week1 = pd.read_feather('saved_dfs/cidds-001/traffic/OpenStack/CIDDS-001-internal-week1-cleaned.feather')
dataset_week1 = dataset_week1.sample(frac=1, random_state=13) # randomize dataset
dataset_week1 = utils.get_balanced_cidds(dataset_week1, classification_target='class')


# For this first version, the classification target is 'class' (e.g. normal, victim, attacker) instead of the better target 'attack_type' (cf. v2)
columns_to_drop = utils.columns_to_drop + ['attack_type']
columns_to_drop.remove('class')
dataset_week1.drop(columns=columns_to_drop, inplace=True)

dataset_week1.head()

Unnamed: 0,duration,icmp,igmp,tcp,udp,src_port,dst_port,packets,bytes,flows,tcp_urg,tcp_ack,tcp_psh,tcp_rst,tcp_syn,tcp_fin,tos,class
0,0.005,0.0,0.0,1.0,0.0,60543.0,80.0,5.0,479.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,attacker
1,0.007,0.0,0.0,1.0,0.0,54182.0,80.0,5.0,479.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,attacker
2,0.006,0.0,0.0,1.0,0.0,44446.0,80.0,5.0,479.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,attacker
3,0.008,0.0,0.0,1.0,0.0,58773.0,80.0,5.0,479.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,attacker
4,0.003,0.0,0.0,1.0,0.0,42573.0,80.0,6.0,545.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,attacker


## Split the data in a training and test set

In [4]:
y_week1 = dataset_week1['class'].values
x_week1 = dataset_week1.drop(['class'], axis=1)
x_train_week1, x_test_week1, y_train_week1, y_test_week1 = train_test_split(x_week1, y_week1, test_size=0.6, random_state=0)

print(len(x_train_week1))
print(len(x_test_week1))
print(len(y_train_week1))
print(len(y_test_week1))

833271
1249908
833271
1249908


## Normalize the training data and use same `mean` and `std` for test data

In [5]:
norm_params_week1 = utils.z_score_normalization(x_train_week1, utils.columns_to_normalize, cidds_df_test=x_test_week1)

## Fit KNN

In [6]:
knn_week1 = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')

start = timer()
knn_week1.fit(x_train_week1, y_train_week1)
end = timer()
print('Time to fit KNN on week1 of OpenStack: {} seconds'.format(end - start))

## Score KNN

In [None]:
start = timer()
score = knn_week1.score(x_test_week1, y_test_week1)
end = timer()
print('Scoring KNN took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring KNN took 121.79406480000034 seconds, with a score of 0.9976718323719308


## Confusion matrix of KNN

In [None]:
# Predict
predicted_y = knn_week1.predict(x_test_week1)

# Calculate the confusion matrix
confusion_matrix(y_test_week1, predicted_y)
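
How the accuracy reported by `score` relates to the confusion matrix can be illustrated with a small, dependency-free sketch. It follows the scikit-learn convention that rows are true classes and columns are predicted classes, with labels in sorted order. The helper names and the toy labels are hypothetical and do not reproduce the notebook's results.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels (hypothetical), listed in sorted order.
labels = ["attacker", "normal", "victim"]
y_true = ["normal", "normal", "attacker", "victim", "attacker", "normal"]
y_pred = ["normal", "attacker", "attacker", "victim", "attacker", "normal"]

matrix = build_confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(accuracy_from_matrix(matrix))
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy value hides. In the toy data, one `normal` flow is predicted as `attacker`, so it lands in the `normal` row and the `attacker` column.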

## Fit RFC

In [None]:
rfc_week1 = RandomForestClassifier(n_estimators=200)

start = timer()
rfc_week1.fit(x_train_week1, y_train_week1)
end = timer()
print('Time to fit RFC on week1 of OpenStack: {} seconds'.format(end - start))

Time to fit RFC on week1 of OpenStack: 61.51200289999997 seconds


## Score RFC

In [None]:
start = timer()
score = rfc_week1.score(x_test_week1, y_test_week1)
end = timer()
print('Scoring RFC took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring RFC took 2.5530915999997887 seconds, with a score of 0.9984398876719124


## Confusion matrix of RFC

In [None]:
# Predict
predicted_y = rfc_week1.predict(x_test_week1)

# Calculate the confusion matrix
confusion_matrix(y_test_week1, predicted_y)

# Use KNN and RFC on CIDDS-001 OpenStack week 2

## Preprocessing

In [None]:
dataset_week2 = pd.read_feather('saved_dfs/cidds-001/traffic/OpenStack/CIDDS-001-internal-week2-cleaned.feather')
dataset_week2 = dataset_week2.sample(frac=1, random_state=13) # randomize dataset
dataset_week2 = utils.get_balanced_cidds(dataset_week2, classification_target='class')

dataset_week2.drop(columns=columns_to_drop, inplace=True)

dataset_week2.head()

Unnamed: 0,duration,icmp,igmp,tcp,udp,dst_port,packets,bytes,flag1,flag2,flag3,flag4,flag5,flag6,tos,class
0,0.176,0.0,0.0,1.0,0.0,443.0,9.0,950.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,normal
1,0.0,0.0,0.0,1.0,0.0,80.0,1.0,55.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal
2,0.0,0.0,0.0,1.0,0.0,80.0,1.0,66.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal
3,0.999,0.0,0.0,1.0,0.0,58848.0,4.0,216.0,0.0,1.0,0.0,0.0,0.0,1.0,32.0,normal
4,0.0,0.0,0.0,1.0,0.0,80.0,1.0,66.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal


## Split the data in a training and test set

In [None]:
y_week2 = dataset_week2['class'].values
x_week2 = dataset_week2.drop(['class'], axis=1)
x_train_week2, x_test_week2, y_train_week2, y_test_week2 = train_test_split(x_week2, y_week2, test_size=0.2, random_state=0)

print(len(x_train_week2))
print(len(x_test_week2))
print(len(y_train_week2))
print(len(y_test_week2))

637219
159305
637219
159305


## Normalize the training data and use same `mean` and `std` for test data

In [None]:
norm_params_week2 = utils.z_score_normalization(x_train_week2, utils.columns_to_normalize, cidds_df_test=x_test_week2)

## Fit KNN

In [None]:
knn_week2 = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')

start = timer()
knn_week2.fit(x_train_week2, y_train_week2)
end = timer()
print('Time to fit KNN on week2 of OpenStack: {} seconds'.format(end - start))

Time to fit KNN on week2 of OpenStack: 678.3605799999996 seconds


## Score KNN

In [None]:
start = timer()
score = knn_week2.score(x_test_week2, y_test_week2)
end = timer()
print('Scoring KNN took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring KNN took 193.8710618 seconds, with a score of 0.9976711339882615


## Confusion matrix of KNN

In [None]:
# Predict
predicted_y = knn_week2.predict(x_test_week2)

# Calculate the confusion matrix
confusion_matrix(y_test_week2, predicted_y)

## Fit RFC

In [None]:
rfc_week2 = RandomForestClassifier(n_estimators=200)

start = timer()
rfc_week2.fit(x_train_week2, y_train_week2)
end = timer()
print('Time to fit RFC on week2 of OpenStack: {} seconds'.format(end - start))

Time to fit RFC on week2 of OpenStack: 85.87605560000065 seconds


## Score RFC

In [None]:
start = timer()
score = rfc_week2.score(x_test_week2, y_test_week2)
end = timer()
print('Scoring RFC took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring RFC took 3.253232200000639 seconds, with a score of 0.9984181287467436


## Confusion matrix of RFC

In [None]:
# Predict
predicted_y = rfc_week2.predict(x_test_week2)

# Calculate the confusion matrix
confusion_matrix(y_test_week2, predicted_y)

# Use `knn_week1` and `rfc_week1` on test data of week 2

## Normalize data of week 2 with normalization parameters of week 1

In [None]:
# data of week 2 (the complete set x_week2, not only the test split), normalized with parameters of week 1
x_test_week2_all = pd.DataFrame(data=x_week2, copy=True)
utils.z_score_normalizations_with_given_params(x_test_week2_all, utils.columns_to_normalize, norm_params_week1)

## K-Nearest Neighbors

In [None]:
start = timer()
score = knn_week1.score(x_test_week2_all, y_week2)
end = timer()
print('Scoring KNN took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring KNN took 164.78324620000058 seconds, with a score of 0.9918709393929883


## Confusion matrix

In [None]:
# Predict
predicted_y = knn_week1.predict(x_test_week2_all)

# Calculate the confusion matrix
confusion_matrix(y_week2, predicted_y)

## Random Forest Classification

In [None]:
start = timer()
score = rfc_week1.score(x_test_week2_all, y_week2)
end = timer()
print('Scoring RFC took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring RFC took 3.347298800000317 seconds, with a score of 0.9915131351809422


## Confusion matrix

In [None]:
# Predict
predicted_y = rfc_week1.predict(x_test_week2_all)

# Calculate the confusion matrix
confusion_matrix(y_week2, predicted_y)

# Use `knn_week2` and `rfc_week2` on test data of week 1

## Normalize data of week 1 with normalization parameters of week 2

In [None]:
# data of week 1 (the complete set x_week1, not only the test split), normalized with parameters of week 2
x_test_week1_all = pd.DataFrame(data=x_week1, copy=True)
utils.z_score_normalizations_with_given_params(x_test_week1_all, utils.columns_to_normalize, norm_params_week2)

## K-Nearest Neighbors

In [None]:
start = timer()
score = knn_week2.score(x_test_week1_all, y_week1)
end = timer()
print('Scoring KNN took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring KNN took 127.10119400000076 seconds, with a score of 0.97488619180581


## Confusion matrix

In [None]:
# Predict
predicted_y = knn_week2.predict(x_test_week1_all)

# Calculate the confusion matrix
confusion_matrix(y_week1, predicted_y)

## Random Forest Classification

In [None]:
start = timer()
score = rfc_week2.score(x_test_week1_all, y_week1)
end = timer()
print('Scoring RFC took {0} seconds, with a score of {1}'.format(end - start, score))

Scoring RFC took 2.5423682999999073 seconds, with a score of 0.9733100783256394


## Confusion matrix

In [None]:
# Predict
predicted_y = rfc_week2.predict(x_test_week1_all)

# Calculate the confusion matrix
confusion_matrix(y_week1, predicted_y)

## Global timing

In [None]:
global_end = timer()
# Split the elapsed whole seconds into minutes and remaining seconds
minutes, seconds = divmod(int(global_end - global_start), 60)
print('Running the complete notebook took {0} min, {1} sec.'.format(minutes, seconds))

Running the complete notebook took 28 min, 55 sec.


# Feature importance of Random Forest Classification models

In [None]:
rfc_week1.feature_importances_

array([6.44247733e-02, 1.94160945e-01, 1.43496944e-01, 1.57513757e-01,
       4.81383819e-02, 3.22431936e-03, 3.55890653e-06, 4.99891961e-03,
       9.05759644e-03, 0.00000000e+00, 2.68939116e-02, 8.29083331e-02,
       3.31036075e-02, 1.35006348e-01, 9.70686035e-02])

In [None]:
rfc_week2.feature_importances_

array([6.99647245e-02, 1.35643158e-01, 2.03805422e-01, 1.74121280e-01,
       2.42297710e-02, 1.41759086e-03, 2.84182600e-07, 4.55438387e-03,
       5.30377735e-03, 0.00000000e+00, 1.48364935e-02, 1.01041198e-01,
       1.55752075e-02, 1.47275489e-01, 1.02231221e-01])
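
The raw arrays are easier to read once each importance is paired with its feature name. The sketch below ranks features by importance. The names and values are made up for illustration and are not taken from the output above. With the fitted models, one would presumably pair the columns of the training DataFrame with `feature_importances_`, assuming the column order matches the data the model was fitted on.

```python
def rank_features(names, importances):
    """Pair each feature name with its importance and sort from most to least important."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical feature names and importances, not taken from the notebook output.
names = ["duration", "dst_port", "packets", "bytes"]
importances = [0.10, 0.45, 0.15, 0.30]

for name, value in rank_features(names, importances):
    print(f"{name:<10} {value:.2f}")
```

The length check guards against silently pairing names with the wrong values when the number of feature columns differs from the length of the importance array.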