# Classifier comparison

A comparison of several classifiers on the CICIDS2017 web attacks dataset.

Sources:

* CICIDS2017: https://www.unb.ca/cic/datasets/ids-2017.html
* Scikit-learn demo: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
* Overview of classification metrics: http://www.machinelearning.ru/wiki/images/d/de/Voron-ML-Quality-slides.pdf

## Reading and preparing data

Read the undersampled (balanced) and preprocessed data.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('web_attacks_balanced.csv')
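
The undersampling step itself is not shown in this notebook. As a rough sketch of how such a file might be produced, the snippet below randomly drops majority-class rows until they make up a chosen share of the data. The function name `undersample_majority`, the toy labels, and the default ratio of 0.7 are all assumptions for illustration (the class counts printed later, 5087 benign and 2180 attack, are close to a 70/30 split); this is not the preprocessing actually used for `web_attacks_balanced.csv`.

```python
import numpy as np
import pandas as pd

def undersample_majority(frame, label_col="Label", majority="BENIGN",
                         ratio=0.7, seed=42):
    """Randomly drop majority rows so they form `ratio` of the result (hypothetical helper)."""
    minority_rows = frame[frame[label_col] != majority]
    majority_rows = frame[frame[label_col] == majority]
    # Number of majority rows needed for the requested share of the final frame.
    n_keep = int(round(len(minority_rows) * ratio / (1 - ratio)))
    n_keep = min(n_keep, len(majority_rows))
    kept = majority_rows.sample(n=n_keep, random_state=seed)
    # Shuffle so that the two classes are interleaved.
    mixed = pd.concat([kept, minority_rows]).sample(frac=1, random_state=seed)
    return mixed.reset_index(drop=True)

# Toy data standing in for the full web attack records (labels are made up).
toy = pd.DataFrame({
    "Label": ["BENIGN"] * 1000 + ["Web Attack"] * 30,
    "Flow Bytes/s": np.arange(1030, dtype=float),
})
balanced = undersample_majority(toy)
print(balanced["Label"].value_counts())
```

Keeping every minority row and sampling only the majority class preserves all attack examples, at the cost of discarding most benign traffic.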

The "Label" column is encoded as follows: "BENIGN" = 0, attack = 1.

In [2]:
df['Label'] = df['Label'].apply(lambda x: 0 if x == 'BENIGN' else 1)
y = df['Label'].values
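
The same encoding can also be written as a vectorized comparison, which avoids calling a Python function per row. A minimal sketch on a toy series (the attack label strings here are placeholders, not necessarily the ones in the dataset):

```python
import pandas as pd

# Toy labels; anything other than "BENIGN" is treated as an attack.
labels = pd.Series(["BENIGN", "Web Attack", "BENIGN", "Other Attack"])

# Boolean comparison, then cast: False -> 0 (benign), True -> 1 (attack).
encoded = (labels != "BENIGN").astype(int)
print(encoded.tolist())
```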

Select the features.

In [3]:
webattack_features = ['Average Packet Size', 'Flow Bytes/s',
                       'Max Packet Length', 'Fwd Packet Length Mean',
                       'Fwd IAT Min', 'Total Length of Fwd Packets',
                       'Fwd IAT Std', 'Fwd Packet Length Max',
                       'Flow IAT Mean', 'Fwd Header Length']

In [4]:
X = df[webattack_features]
print(X.shape, y.shape)

(7267, 10) (7267,)


In [5]:
from sklearn.model_selection import train_test_split
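# test_size=0 keeps every row in the training set (see the class counts below);
# recent scikit-learn releases reject this value and require a positive test size.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0, random_state=42)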

unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))

{0: 5087, 1: 2180}
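
The counts above add up to all 7267 rows, so no test set is held out here. If a held-out set were wanted, a stratified split would keep the class ratio the same in both parts. The sketch below uses synthetic arrays (`X_demo`, `y_demo`) and a test share of 0.3, both chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a 70/30 class ratio.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 10))
y_demo = np.array([0] * 700 + [1] * 300)

# stratify=y_demo keeps the 70/30 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

for part_name, part in (("train", y_tr), ("test", y_te)):
    values, counts = np.unique(part, return_counts=True)
    print(part_name, dict(zip(values.tolist(), counts.tolist())))
```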

In [6]:
X.head()

| | Average Packet Size | Flow Bytes/s | Max Packet Length | Fwd Packet Length Mean | Fwd IAT Min | Total Length of Fwd Packets | Fwd IAT Std | Fwd Packet Length Max | Flow IAT Mean | Fwd Header Length |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80.75 | 3635.433 | 103.0 | 39.0 | 3.0 | 78.0 | 0.0 | 39.0 | 26040.0 | 64.0 |
| 1 | 50.666667 | 10.03516 | 48.0 | 48.0 | 1999848.0 | 432.0 | 20000000.0 | 48.0 | 5064547.0 | 204.0 |
| 2 | 48.0 | 909090.9 | 48.0 | 32.0 | 4.0 | 64.0 | 0.0 | 32.0 | 58.66667 | 64.0 |
| 3 | 94.25 | 2000000.0 | 112.0 | 51.0 | 3.0 | 102.0 | 0.0 | 51.0 | 54.33333 | 64.0 |
| 4 | 80.0 | 1792208.0 | 94.0 | 44.0 | 3.0 | 88.0 | 0.0 | 44.0 | 51.33333 | 64.0 |


## Classifier comparison

This step may take a while to run, roughly 3-5 minutes depending on the machine.
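
Several of the models compared below (SVM, KNN, MLP, LR) are sensitive to feature scale, and the sample rows above span many orders of magnitude. The notebook feeds the raw features to every model; as a hedged sketch of an alternative, a scaler can be chained in front of a model with a pipeline so that it is refit inside each cross-validation fold. The synthetic data and the column stretching are assumptions for illustration only, and no claim is made about how this would change the results on the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; columns are stretched to mimic features with very different units.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo = X_demo * np.logspace(0, 6, num=10)

# The scaler is fit on the training folds only, then applied to the validation fold.
scaled_svm = make_pipeline(StandardScaler(), SVC(gamma='auto'))
recall_scaled = cross_val_score(scaled_svm, X_demo, y_demo, cv=5, scoring='recall')
print('SVM with scaling, mean recall: {:.3f}'.format(recall_scaled.mean()))
```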

In [7]:
import time
import warnings
warnings.filterwarnings("ignore")

from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC(gamma='auto')))
models.append(('CART', DecisionTreeClassifier(max_depth=5)))
models.append(('RF', RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)))    
models.append(('ABoost', AdaBoostClassifier()))
models.append(('LR', LogisticRegression(solver='lbfgs', max_iter=200)))
models.append(('NB', GaussianNB()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('MLP', MLPClassifier()))

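# KFold without shuffling ignores random_state (and recent scikit-learn raises an
# error when it is set), so it is omitted; the folds stay the same contiguous blocks.
kfold = model_selection.KFold(n_splits=5)

for name, model in models:
    start_time = time.time()

    # Each call runs a separate 5-fold cross-validation for one metric.
    recall = cross_val_score(model, X_train, y_train, cv=kfold, scoring='recall').mean()
    precision = cross_val_score(model, X_train, y_train, cv=kfold, scoring='precision').mean()
    accuracy = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy').mean()
    # Note: F1 is class-weighted and computed on the full X, y rather than X_train, y_train.
    f1 = cross_val_score(model, X, y, cv=kfold, scoring='f1_weighted').mean()

    delta = time.time() - start_time
    # Columns: model, precision, accuracy, recall, weighted F1, elapsed time.
    print('{}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.2f} secs'.format(name, precision, accuracy, recall, f1, delta))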

Output columns: model, precision, accuracy, recall, weighted F1, elapsed time.

KNN	0.947	0.971	0.956	0.969	3.78 secs
SVM	0.612	0.705	0.044	0.607	181.34 secs
CART	0.979	0.976	0.941	0.970	0.76 secs
RF	0.995	0.969	0.906	0.966	1.27 secs
ABoost	0.960	0.976	0.961	0.972	15.20 secs
LR	0.883	0.927	0.873	0.933	10.24 secs
NB	0.502	0.702	0.956	0.737	0.36 secs
LDA	0.879	0.924	0.868	0.746	0.87 secs
QDA	0.811	0.926	0.984	0.937	0.69 secs
MLP	0.913	0.902	0.908	0.931	111.04 secs
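
The loop above runs one cross-validation per metric, so every model is fitted four times per fold. scikit-learn's `cross_validate` accepts a list of scorers and evaluates them all on the same fits. The sketch below shows the idea on synthetic data with a single model; the data, the shuffled `KFold`, and the plain `f1` scorer are assumptions that differ from the notebook's setup, so the numbers are not comparable to the table above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the web attack features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=24)
scoring = ['precision', 'accuracy', 'recall', 'f1']

# One fit per fold; all four metrics are computed from that fit.
scores = cross_validate(DecisionTreeClassifier(max_depth=5, random_state=0),
                        X_demo, y_demo, cv=cv, scoring=scoring)

for metric in scoring:
    print('{}\t{:.3f}'.format(metric, scores['test_' + metric].mean()))
print('fit time\t{:.2f} secs'.format(scores['fit_time'].sum()))
```

The returned dictionary also carries `fit_time` and `score_time` arrays, so the manual timing with `time.time()` is not needed in this variant.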
