In [9]:
import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 200)
pd.set_option("display.colheader_justify", "center")

# EDA Packages
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
sns.set(color_codes=True)

# ML Packages
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

In this notebook I implement and evaluate four classification algorithms on the UNSW-NB15 dataset.  The label is binary (1: attack, 0: normal).  I also try some hyperparameter tuning, though cross-validation is computationally expensive and I am limited by what my laptop can run.  The four algorithms are:

* Random Forest
* Naive Bayes
* SVM with a linear kernel
* Perceptron

I found a quick intro on some common ML algorithms, including a few we use below: https://towardsdatascience.com/an-introduction-to-nine-essential-machine-learning-algorithms-ee0efbb61e0

The link above does not include the perceptron algorithm, but here is a quick intro: https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53

There are many algorithms we can try, and as we better understand the data we will gain understanding of the pros/cons of different classification methods.  I also plan on trying a shallow neural network approach soon.

Loading UNSW training set below

In [6]:
unsw_train = pd.read_csv('https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/a%20part%20of%20training%20and%20testing%20set/UNSW_NB15_training-set.csv')

unsw_train[unsw_train.select_dtypes('object').columns] = unsw_train.select_dtypes('object').apply(lambda x: x.astype('category'))
unsw_train['label'] = unsw_train['label'].astype('category')

In [7]:
unsw_train.head(5)

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,0.121478,tcp,-,FIN,6,4,258,172,74.08749,...,1,1,0,0,0,1,1,0,Normal,0
1,2,0.649902,tcp,-,FIN,14,38,734,42014,78.473372,...,1,2,0,0,0,1,6,0,Normal,0
2,3,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,...,1,3,0,0,0,2,6,0,Normal,0
3,4,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,...,1,3,1,1,0,2,1,0,Normal,0
4,5,0.449454,tcp,-,FIN,10,6,534,268,33.373826,...,1,40,0,0,0,2,39,0,Normal,0


Loading the test data for model evaluation

In [8]:
unsw_test = pd.read_csv('https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/a%20part%20of%20training%20and%20testing%20set/UNSW_NB15_testing-set.csv')

unsw_test['label'] = unsw_test['label'].astype('category')
unsw_test[unsw_test.select_dtypes('object').columns] = unsw_test.select_dtypes('object').apply(lambda x: x.astype('category'))

## Training a binary classifier

First, get an idea of the frequency of each label:

In [16]:
unsw_train['label'].value_counts() / len(unsw_train)

1    0.680622
0    0.319378
Name: label, dtype: float64

In [7]:
round((sum(np.random.randint(0,2,len(unsw_train)) == unsw_train['label']) / len(unsw_train)), 3)

0.5

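A uniform random guess matches the label about 50% of the time.  Note that always predicting the majority class (1: attack) would already be right about 68% of the time on the training set, so that is the more demanding baseline.  Let's compare this to random forest.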
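To make the two baselines concrete, here is a minimal sketch on synthetic labels. It assumes a 68% attack rate (taken from the `value_counts` output above) and does not touch the real data; the variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels with roughly the training-set class balance (68% attack).
n = 100_000
y = (rng.random(n) < 0.68).astype(int)

# Baseline 1: uniform random guess between 0 and 1.
random_guess = rng.integers(0, 2, size=n)
random_acc = (random_guess == y).mean()

# Baseline 2: always predict the majority class (1 = attack).
majority_acc = (np.ones(n, dtype=int) == y).mean()

print(f"random guess accuracy:   {random_acc:.3f}")
print(f"majority class accuracy: {majority_acc:.3f}")
```

The random guess lands near 0.5 regardless of the class balance, while the majority-class rule tracks the share of the larger class.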

Most scikit-learn classifiers need categorical variables to be one-hot encoded.  This means a feature with three possible categories (e.g. small, medium, large) is transformed into three columns, with a 1 in the column of the corresponding value and 0's in the other two.  For example, a small row would be encoded as 100, a medium row as 010, and a large row as 001.

This is easily implemented in pandas:

In [5]:
pd.get_dummies(unsw_train).head()

Unnamed: 0,id,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,...,attack_cat_DoS,attack_cat_Exploits,attack_cat_Fuzzers,attack_cat_Generic,attack_cat_Normal,attack_cat_Reconnaissance,attack_cat_Shellcode,attack_cat_Worms,label_0,label_1
0,1,0.121478,6,4,258,172,74.08749,252,254,14158.94238,...,0,0,0,0,1,0,0,0,1,0
1,2,0.649902,14,38,734,42014,78.473372,62,252,8395.112305,...,0,0,0,0,1,0,0,0,1,0
2,3,1.623129,8,16,364,13186,14.170161,62,252,1572.271851,...,0,0,0,0,1,0,0,0,1,0
3,4,1.681642,12,12,628,770,13.677108,62,252,2740.178955,...,0,0,0,0,1,0,0,0,1,0
4,5,0.449454,10,6,534,268,33.373826,254,252,8561.499023,...,0,0,0,0,1,0,0,0,1,0


Given that this extends our feature dimensionality from fewer than 50 to over 200, we will stick with the float/int features only.
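As a small illustration of the encoding described above, the sketch below builds a hypothetical toy frame (the `size` and `dur` columns are made up for the example) and shows both the one-hot encoded version and the numeric-only selection this notebook falls back on.

```python
import pandas as pd

# Hypothetical toy frame: one categorical column and one numeric column.
toy = pd.DataFrame({
    "size": pd.Categorical(["small", "medium", "large", "small"],
                           categories=["small", "medium", "large"]),
    "dur": [0.12, 0.65, 1.62, 1.68],
})

# One-hot encode: each category becomes its own 0/1 column.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)

# Alternative used in this notebook: keep only the numeric columns.
numeric_only = toy.select_dtypes("number")
print(list(numeric_only.columns))
```

One categorical column with three levels turns into three columns, which is why the full dataset grows so quickly once every categorical feature is expanded.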

### Random Forest

We will use 5-fold cross-validation to tune the following hyperparameters: number of estimators, tree depth, and number of features considered at each split.  Instead of trying all 60 combinations of these hyperparameters, we will conduct a random search over 10 of them.  This is not best practice, and it's likely we will miss the optimal hyperparameters.  However, we are limited by the speed of our laptops (ASECC will be a huge help!).

In [28]:
random_forest = RandomForestClassifier()

In [29]:
param_grid = {
    'max_depth': [3, 5, 10],
    'max_features': [2, 4, 8, 10, 12],
    'n_estimators': [50, 100, 300, 500]
}

In [30]:
random_search = RandomizedSearchCV(estimator = random_forest, param_distributions = param_grid, 
                          cv = 5, verbose = 1, n_iter = 10)

We now run the random search.  The best-performing parameters are then used to train an RF classifier, whose performance we evaluate on the test set.

In [None]:
random_search.fit(unsw_train.select_dtypes(['float', 'int']).drop('id', axis=1), unsw_train['label'])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


In [None]:
random_search.best_params_
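The search workflow above can be tried quickly on a small problem. The following is a sketch only: it uses synthetic data from `make_classification` as a stand-in for the UNSW features, and a deliberately tiny, hypothetical grid so that it runs in a few seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the numeric UNSW features (assumption, not real data).
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Small hypothetical grid: 2 * 2 * 2 = 8 possible combinations.
param_grid = {
    "max_depth": [3, 5],
    "max_features": [2, 4],
    "n_estimators": [10, 25],
}

# Sample 4 of the 8 combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))

# With refit=True (the default), the best model is already retrained on all of X.
best_model = search.best_estimator_
```

Because `refit=True` is the default, `search.best_estimator_` can be used for prediction directly, without constructing a second classifier from `best_params_`.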

In [21]:
# Note: max_depth=11 and n_estimators=1000 are not in param_grid above.
random_forest_1 = RandomForestClassifier(max_depth=11, max_features=4, n_estimators=1000)

In [22]:
random_forest_1.fit(unsw_train.select_dtypes(['float', 'int']).drop('id', axis=1), unsw_train['label'])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=11, max_features=4,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Evaluating on test set

In [23]:
rf_predict = random_forest_1.predict(unsw_test.select_dtypes(['float', 'int']).drop('id', axis=1))

In [24]:
print(classification_report(unsw_test['label'], rf_predict))

              precision    recall  f1-score   support

           0       0.99      0.66      0.79     37000
           1       0.78      0.99      0.88     45332

    accuracy                           0.85     82332
   macro avg       0.89      0.83      0.84     82332
weighted avg       0.88      0.85      0.84     82332



### SVM Classifier

Because the features of this dataset are on different scales, it is important that we do the following to each feature: 1) center (subtract the mean) and 2) scale (divide by the standard deviation).  SVM classifiers are optimized using distances between points in feature space, so features on larger scales would dominate and hinder model performance.  For the same reason we use only the numeric features here; categorical features would first have to be encoded as numbers.

In [13]:
sc = StandardScaler()
# Fit the scaler on the training set only, then apply the same transform to the test set
scaled_train = sc.fit_transform(unsw_train.select_dtypes(['float', 'int']).drop('id', axis=1))
scaled_test = sc.transform(unsw_test.select_dtypes(['float', 'int']).drop('id', axis=1))
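The centering and scaling steps can be checked by hand. This sketch uses hypothetical synthetic features on very different scales and compares a manual computation against `StandardScaler`; note that in both versions the mean and standard deviation come from the training set only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (e.g. a duration and a byte count).
train = np.column_stack([rng.normal(1.0, 0.5, 500), rng.normal(5000.0, 2000.0, 500)])
test = np.column_stack([rng.normal(1.0, 0.5, 100), rng.normal(5000.0, 2000.0, 100)])

# Manual version: statistics come from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
manual_train = (train - mean) / std
manual_test = (test - mean) / std

# Same thing with scikit-learn.
sc = StandardScaler()
sk_train = sc.fit_transform(train)
sk_test = sc.transform(test)

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
```

Reusing the training statistics on the test set keeps information about the test distribution out of the model.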

In [None]:
# gamma is ignored by the linear kernel, so it is omitted here
svm = SVC(kernel='linear', C=1)
svm.fit(scaled_train, unsw_train['label'])

In [None]:
svm_predict = svm.predict(scaled_test)
print(classification_report(unsw_test['label'], svm_predict))

### Perceptron (single-layer neural network)

In [23]:
ppn = Perceptron(penalty='l2', alpha=0.01)
ppn.fit(scaled_train, unsw_train['label'])

Perceptron(alpha=0.01, class_weight=None, early_stopping=False, eta0=1.0,
           fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=None,
           penalty='l2', random_state=0, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

In [24]:
ppn_predict = ppn.predict(scaled_test)
print(classification_report(unsw_test['label'], ppn_predict))

              precision    recall  f1-score   support

           0       0.90      0.49      0.64     37000
           1       0.70      0.96      0.81     45332

    accuracy                           0.75     82332
   macro avg       0.80      0.72      0.72     82332
weighted avg       0.79      0.75      0.73     82332



### Naive Bayes

Naive Bayes is easy and quick to implement, but makes two (usually) unrealistic assumptions about features: they are independent within classes, and each feature has the same impact on the label assignment.  It is often used as a benchmark for other, more complex algorithms.
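To show what the independence assumption buys computationally, here is a hand-rolled sketch of Gaussian Naive Bayes on synthetic data, compared with scikit-learn's `GaussianNB`. The helper `gaussian_nb_predict` is hypothetical and written only for this illustration; its smoothing term imitates the `var_smoothing=1e-09` default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the numeric features (assumption, not real data).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

def gaussian_nb_predict(X_train, y_train, X_new):
    """Predict by summing per-feature Gaussian log-densities for each class."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9 * X_train.var(axis=0).max()  # small smoothing term
        log_prior = np.log(len(Xc) / len(X_train))
        # Independence assumption: the joint log-likelihood is a sum over features.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X_new - mu) ** 2 / var, axis=1)
        scores.append(log_prior + log_lik)
    return classes[np.argmax(np.column_stack(scores), axis=1)]

manual = gaussian_nb_predict(X, y, X)
sk = GaussianNB().fit(X, y).predict(X)
agreement = (manual == sk).mean()
print(f"agreement with scikit-learn: {agreement:.3f}")
```

Each class needs only a mean and a variance per feature, which is why the model trains so quickly even on large datasets.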

In [10]:
nb = GaussianNB()
nb.fit(unsw_train.select_dtypes(['float', 'int']).drop('id', axis=1), unsw_train['label'])

GaussianNB(priors=None, var_smoothing=1e-09)

In [12]:
nb_predict = nb.predict(unsw_test.select_dtypes(['float', 'int']).drop('id', axis=1))
print(classification_report(unsw_test['label'], nb_predict))

              precision    recall  f1-score   support

           0       0.75      0.53      0.62     37000
           1       0.69      0.85      0.76     45332

    accuracy                           0.71     82332
   macro avg       0.72      0.69      0.69     82332
weighted avg       0.71      0.71      0.70     82332



### Unsupervised learning: K-means 

Often we work with data that does not have a label (e.g. attack/normal).  Unsupervised learning can be used to reveal underlying structure in the dataset that informs how the observations separate in the feature space.

We will use k-means clustering here, which is a common unsupervised learning algorithm:

* https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
* https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-f2ad05ed5203

The most important hyperparameter in k-means is k, the number of clusters into which the data is partitioned.  Because we know our data has a binary label, we will use 2 clusters.  However, in future work we can try k=10: one cluster for each of the nine attack categories plus one for normal traffic.

In [None]:
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10)
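# Assumption: X was undefined here; we cluster the scaled numeric training features from the SVM section
pred_y = kmeans.fit_predict(scaled_train)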
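Since cluster ids carry no meaning by themselves, comparing them with the binary label needs a little care. The sketch below uses two well-separated synthetic blobs (an assumption made for clarity; real traffic is unlikely to separate this cleanly) and scores both possible cluster-to-label mappings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two well-separated synthetic groups standing in for normal/attack traffic.
X, y = make_blobs(n_samples=400, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=2, init="k-means++", max_iter=300, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Cluster ids are arbitrary, so score both possible mappings to the labels.
match = (clusters == y).mean()
agreement = max(match, 1 - match)
print(f"cluster/label agreement: {agreement:.3f}")
```

With k larger than 2 the same idea applies, but the mapping from clusters to labels has to be searched over more permutations or summarized with a contingency table.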

### Principal Component Analysis

Based on the correlation heat map, there appears to be substantial correlation among the features.  To improve classifier performance, we are going to use PCA to reduce both the feature dimensionality and the correlation.  PCA transforms the input data matrix into a matrix of linearly uncorrelated columns: the first component captures the maximum variance in the original data, the second component is orthogonal to the first (thus, 0 correlation) while capturing the maximum remaining variance, and so on.
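The two properties described above (components ordered by variance, and zero correlation between them) can be verified on a toy example. This sketch uses `sklearn.decomposition.PCA`, which is not imported elsewhere in this notebook, on three synthetic features, two of which are strongly correlated by construction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three synthetic features, two of them strongly correlated.
a = rng.normal(size=1000)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=1000), rng.normal(size=1000)])

pca = PCA(n_components=3)
Z = pca.fit_transform(X)

# Components are ordered by the share of variance they explain.
print(np.round(pca.explained_variance_ratio_, 3))

# The transformed columns are uncorrelated: off-diagonal entries are ~0.
corr = np.corrcoef(Z, rowvar=False)
print(np.round(corr, 3))
```

On the real data it would probably make sense to apply PCA to the standardized features, since otherwise the features with the largest raw scale dominate the leading components.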