# Work-in-progress notebook for rapidly prototyping a recreation of the anomaly detection method for IDS purposes proposed in [this paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8751972)

# Preprocessing and Feature Engineering

## Importing the data

First off, we need to import the data:

In [73]:
# import necessary packages
import numpy as np
import pandas as pd
import sklearn

In [17]:
# import data from CICIDS-2017 MachineLearningCVE
# going to use Tuesday-WorkingHours.pcap_ISCX.csv for testing purposes
input_file = "datasets/CICIDS-2017/MachineLearningCVE/Tuesday-WorkingHours.pcap_ISCX.csv"

# use pandas to read from .csv
df = pd.read_csv(input_file, header='infer')

# convert the DataFrame to a dict
#df.to_dict()
df

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,88,640,7,4,440,358,220,0,62.857143,107.349008,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
1,88,900,9,4,600,2944,300,0,66.666667,132.287566,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2,88,1205,7,4,2776,2830,1388,0,396.571429,677.274651,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,88,511,7,4,452,370,226,0,64.571429,110.276708,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,88,773,9,4,612,2944,306,0,68.000000,134.933317,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
445904,53,155,2,2,88,120,44,44,44.000000,0.000000,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
445905,59317,110,1,1,0,0,0,0,0.000000,0.000000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
445906,53,166,2,2,88,188,44,44,44.000000,0.000000,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
445907,54726,81,1,1,0,0,0,0,0.000000,0.000000,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


### Now it's time for some feature extraction (e.g. converting strings to mathematically viable numerical values)

In [6]:
# importing 'Counter' for class summarization
from collections import Counter

In [88]:
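# define X and y
num_cols = len(df.columns)
# keep every column except the last one (' Label'), which is added back later once it is encoded;
# .copy() makes X an independent DataFrame so that adding a column to it later does not touch df
X = df.iloc[:, 0:num_cols-1].copy()
y = df.loc[:, ' Label']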

Since the label column y is the only column in the DataFrame that holds categorical (string) values, we will use the `LabelEncoder` [class](https://scikit-learn.org/stable/modules/preprocessing_targets.html#label-encoding) from `sklearn` to encode each class label in that column as an integer, so that mathematical operations can be carried out on it.

In [80]:
# import LabelEncoder from scikit-learn
from sklearn.preprocessing import LabelEncoder

In [89]:
# now, label-encode the target column
enc = LabelEncoder()
y = enc.fit_transform(y)

# count the samples in each encoded class
counter = Counter(y)
print(counter)

Counter({0: 432074, 1: 7938, 2: 5897})
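To make the integer codes above easier to interpret, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` builds: it sorts the distinct labels and assigns each one its position in that sorted list. The label strings below are made-up placeholders, not the actual class names in this file; on the fitted encoder, the real sorted list is available as `enc.classes_`.

```python
# hypothetical labels, for illustration only (not the real class names in the CSV)
labels = ["BENIGN", "attack-b", "BENIGN", "attack-a"]

# LabelEncoder sorts the distinct classes and numbers them in that order
classes = sorted(set(labels))
to_int = {name: code for code, name in enumerate(classes)}

# encode: label string -> integer code
encoded = [to_int[name] for name in labels]

# decode: integer code -> label string (what enc.inverse_transform does)
decoded = [classes[code] for code in encoded]

print(classes)
print(encoded)
```

Keeping the fitted `enc` object around means the codes 0, 1 and 2 can always be translated back to their original label strings later on.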


In [93]:
# and finally, add y to X (now that the column is label encoded)
X[' Label'] = y

# get summarization of y
counter = Counter(X[' Label'])
print(f'{counter}\n')

Counter({0: 432074, 1: 7938, 2: 5897})



Sweet! Now we have our training X and y, and both are composed entirely of numerical data.

## Use the SMOTE technique to balance the dataset (which essentially gives the classifier more minority-class data by synthesizing new samples from the existing ones, thus enhancing its ability to classify such data):
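As a rough illustration of the core idea behind SMOTE, the sketch below creates one synthetic point by interpolating between a minority-class sample and a neighboring minority-class sample. It is a simplified stand-in, not the `imblearn` implementation: the real algorithm also searches for the k nearest minority neighbors, whereas here the neighbor is simply given. The `interpolate` helper and the sample values are hypothetical.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the line segment between two minority samples."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(0)   # seeded so the sketch is repeatable
sample = [1.0, 2.0]      # a minority-class sample (made-up values)
neighbor = [3.0, 6.0]    # one of its minority-class neighbors (made-up values)

synthetic = interpolate(sample, neighbor, rng)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, SMOTE fills in the region the minority class already occupies rather than duplicating rows outright.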

In [98]:
# import necessary packages for use in SMOTE implementation
import imblearn
import matplotlib.pyplot as pyplot
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

Now, we can use `Counter` to summarize the class distribution of y:

In [112]:
# summarize y
counter = Counter(y)
print(counter)

Counter({0: 432074, 1: 7938, 2: 5897})


According to the above summary, y contains 432074 '0's, 7938 '1's, and 5897 '2's, a ratio of 432074:7938\:5897 (or approximately 216:4\:3). To give the classifier more minority-class examples to learn from (for classification accuracy), we should rebalance the training set toward a less lopsided ratio.

Thus, with a beginning ratio of 216:4\:3 (or roughly 100:1.8\:1.4), we can utilize oversampling and undersampling from the `imblearn` package in order to balance out the data a bit more.

This will be carried out in anticipation of utilization in a `Pipeline` further on.
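As a quick sanity check on the class proportions, the sketch below expresses each class as a percentage of the majority class, using the counts printed by `Counter` above. The variable names `majority` and `share` are introduced here for illustration only.

```python
from collections import Counter

# class counts copied from the Counter output above
counter = Counter({0: 432074, 1: 7938, 2: 5897})

# size of the largest class
majority = max(counter.values())

# each class as a percentage of the majority class, rounded to two decimals
share = {label: round(100 * n / majority, 2) for label, n in counter.items()}
print(share)
```

Each minority class turns out to be between one and two percent of the majority class, which is the imbalance the resampling steps are meant to reduce.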

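In [163]:
# over sampling via SMOTE
# a float sampling_strategy is only accepted for binary targets; y has three classes,
# so we pass a dict of target counts instead: each minority class -> 5% of the majority class
n_minority = int(0.05 * counter[0])
over = SMOTE(sampling_strategy={1: n_minority, 2: n_minority})

We will oversample each minority class to only 5% of the majority class, since the minority classes start out at a slim 1.8% and 1.4% of it or so; keeping the synthetic share small should make the interpolated features a bit more trustworthy (I hope, haha)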

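In [164]:
# under sampling via RandomUnderSampler
# a float sampling_strategy is only accepted for binary targets, so we request
# an explicit count for class 0 that keeps half of the majority class
under = RandomUnderSampler(sampling_strategy={0: counter[0] // 2})

For undersampling, we will halve the majority class. Combined with oversampling each minority class to 5% of the original majority size, this takes the dataset from a ratio of roughly 100:1.8\:1.4 to a more modest 10:1\:1. Hopefully this won't negatively affect the data interpretation and classification too much.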
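The class counts we should expect after both resampling steps can be worked out by hand. The sketch below does that arithmetic, assuming each minority class is raised to 5% of the original majority count and the majority class is then cut in half; `minority_target`, `majority_target` and `expected` are illustrative names, not part of `imblearn`.

```python
from collections import Counter

# class counts copied from the Counter output earlier in this notebook
original = Counter({0: 432074, 1: 7938, 2: 5897})
majority = original[0]

# step 1 (oversampling): raise each minority class to 5% of the majority class
minority_target = int(0.05 * majority)

# step 2 (undersampling): keep half of the majority class
majority_target = majority // 2

# expected counts after both steps, and their ratio relative to a minority class
expected = Counter({0: majority_target, 1: minority_target, 2: minority_target})
ratio = [round(n / minority_target, 1) for n in expected.values()]

print(expected)
print(ratio)
```

Comparing these numbers against the `Counter` of the resampled labels is a cheap way to confirm that the pipeline did what was intended.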

In [165]:
# testing pipelining 'over' and 'under'
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

# transform dataset
X_test, y_test = pipeline.fit_resample(X, y)

# summarize new class distribution (of the resampled labels, not the original y)
counter = Counter(y_test)
print(counter)

ValueError: Input X contains NaN.
SMOTE does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Note to self: either add an imputer step to the pipeline, or clean up the NaN values in the dataset beforehand (e.g., by dropping the affected rows). Either way, make sure that whatever replaces the missing values is accounted for properly and doesn't skew the dataset weirdly.
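A minimal sketch of the two cleanup options, shown on a toy DataFrame rather than the real dataset. The column names and values are hypothetical. It also treats infinite values as missing, which is an assumption: the error above only mentions NaN.

```python
import numpy as np
import pandas as pd

# toy stand-in for the feature matrix; the column names are hypothetical
toy = pd.DataFrame({
    "rate": [1.0, np.nan, np.inf, 4.0],
    "count": [10, 20, 30, 40],
})

# treat +/-inf like missing values so both can be handled in one pass
cleaned = toy.replace([np.inf, -np.inf], np.nan)

# count the missing values per column before deciding what to do
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# option 1: drop every row that has a missing value
dropped = cleaned.dropna()

# option 2: fill each gap with the median of its column
imputed = cleaned.fillna(cleaned.median())
```

Dropping rows is the simpler choice when only a few rows are affected. Median filling keeps every row but plants identical values in the gaps, so it is worth checking how many rows of each class are affected before choosing. The same median strategy could presumably be placed at the front of the pipeline as an imputer step, as the error message suggests.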