# Anomaly Detection for the UNSW-NB15 Cyberattack Dataset
Identifying cyberattacks can be framed as either a classification problem or an anomaly detection problem. In this notebook, I treat it as an anomaly detection problem. Because [the training and test CSVs provided by the researchers](https://cloudstor.aarnet.edu.au/plus/index.php/s/2DhnLGDdEECo4ys?path=%2FUNSW-NB15%20-%20CSV%20Files%2Fa%20part%20of%20training%20and%20testing%20set) are balanced between the normal and anomaly classes, I'm going to down-sample to reduce the number of anomalies available.
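A minimal sketch of what down-sampling the anomaly class could look like. The helper `downsample_anomalies`, its `anomaly_frac` parameter, and the toy data are all hypothetical; this is not the logic inside the `data_prep` module, which is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def downsample_anomalies(X, y, anomaly_label=-1, anomaly_frac=0.05, random_state=42):
    """Keep every normal row and a random subset of the anomalies.

    anomaly_frac is the target share of anomalies in the returned data.
    (Hypothetical helper, for illustration only.)
    """
    y_arr = y.to_numpy()
    normal_pos = np.flatnonzero(y_arr != anomaly_label)
    anomaly_pos = np.flatnonzero(y_arr == anomaly_label)

    # Solve n_keep / (n_normal + n_keep) = anomaly_frac for n_keep
    n_keep = int(round(anomaly_frac * len(normal_pos) / (1 - anomaly_frac)))
    n_keep = min(n_keep, len(anomaly_pos))

    rng = np.random.default_rng(random_state)
    kept_anomalies = rng.choice(anomaly_pos, size=n_keep, replace=False)

    # Sort positions so the original row order is preserved
    keep_pos = np.sort(np.concatenate([normal_pos, kept_anomalies]))
    return X.iloc[keep_pos], y.iloc[keep_pos]


# Toy balanced data standing in for the real features and labels
X_toy = pd.DataFrame({"f1": np.arange(200), "f2": np.arange(200) % 7})
y_toy = pd.Series([1] * 100 + [-1] * 100, name="label")

X_small, y_small = downsample_anomalies(X_toy, y_toy, anomaly_frac=0.05)
print(y_small.value_counts())
```

Sampling by position and then sorting keeps the rows in their original order, which can matter if the rows are time-ordered.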

In [45]:
# Custom modules
from data_prep import load_csv_data, y_anomaly_format 
import model_abstraction as moda

# Data Structures
import pandas as pd
import numpy as np


# Preprocessing or data manipulation methods
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Modeling methods and selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Anomaly detection
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor

# Model assessment
from sklearn.metrics import confusion_matrix, roc_auc_score

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Modeling Using Only Numeric Features
The columns that remain after excluding string objects are all numeric, but they contain a mix of ordinal, categorical, integer, and float values. Because the provided dataset is balanced between the normal and attack classes, I've elected to try the classification approach in its own notebook. This notebook serves as an initial proof of concept for the anomaly detection approach to the problem.

In [40]:
X_train, y_train = load_csv_data('./data/UNSW_NB15_train_set.csv', strategy='anomaly')
X_train, X_hold, y_train, y_hold  = train_test_split(X_train,y_train, test_size = 0.25,
                                                     random_state = 42, stratify=y_train)
X_test, y_test = load_csv_data('./data/UNSW_NB15_test_set.csv', strategy='anomaly')

In [4]:
# Number of numeric features remaining
len(X_train.columns)

39

In [5]:
y_train.unique()

array([ 1, -1])

The anomaly detection methods used below start by training on normal data only, building a profile of what normal traffic looks like. The fitted model can then be used to predict whether unseen observations fit that profile.

In [6]:
# Boolean mask that is True for observations in the normal class (label 1).
# Its negation, ~train_normal, selects the attack class (label -1).
train_normal = y_train==1

In [11]:
ilf = IsolationForest()
ilf.fit(X_train[train_normal])




In [13]:
y_pred_nm = ilf.predict(X_train[train_normal])
y_pred_out = ilf.predict(X_train[~train_normal])
y_pred_train = ilf.predict(X_train)
y_pred_test = ilf.predict(X_test)




In [17]:
confusion_matrix(y_train[train_normal], y_pred_nm)

array([[    0,     0],
       [ 5600, 50400]])

In [16]:
confusion_matrix(y_train[~train_normal], y_pred_out)

array([[54217, 65124],
       [    0,     0]])

In [18]:
confusion_matrix(y_train, y_pred_train)

array([[54217, 65124],
       [ 5600, 50400]])
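To make the training matrix above easier to read, it helps to convert it to per-class rates. scikit-learn orders the labels, so the first row and column correspond to -1 (attack) and the second to 1 (normal). The sketch below only re-uses the numbers printed above; the variable names are my own.

```python
import numpy as np

# Training confusion matrix copied from the output above.
# Row/column 0 is label -1 (attack), row/column 1 is label 1 (normal).
cm_train = np.array([[54217, 65124],
                     [5600, 50400]])

attacks_caught, attacks_missed = cm_train[0]
normal_flagged, normal_passed = cm_train[1]

# Share of attacks that the model flags as outliers
detection_rate = attacks_caught / cm_train[0].sum()
# Share of normal observations that the model wrongly flags as outliers
false_alarm_rate = normal_flagged / cm_train[1].sum()

print(f"detection rate:   {detection_rate:.4f}")
print(f"false alarm rate: {false_alarm_rate:.4f}")
```

The false alarm rate on the training normals comes out at exactly 10%. I assume this reflects a `contamination` value of 0.1, which was the `IsolationForest` default in older scikit-learn releases, but the fitted model's parameters are not shown here.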

In [38]:
print(roc_auc_score(y_test, y_pred_test))
confusion_matrix(y_test, y_pred_test)

0.6193118243541345


array([[17214, 28118],
       [ 5221, 31779]])

In [22]:
y_test.value_counts()

-1    45332
 1    37000
Name: label, dtype: int64
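These counts are the row totals of the test confusion matrix above. Because `roc_auc_score` was given hard -1/1 predictions rather than continuous scores, the ROC curve has a single operating point, and the reported value equals the balanced accuracy, the mean of the two per-class recalls. A short check using only the printed numbers:

```python
import numpy as np

# Test confusion matrix copied from the output above.
# Row/column 0 is label -1 (attack), row/column 1 is label 1 (normal).
cm_test = np.array([[17214, 28118],
                    [5221, 31779]])

attack_recall = cm_test[0, 0] / cm_test[0].sum()   # attacks flagged as outliers
normal_recall = cm_test[1, 1] / cm_test[1].sum()   # normal traffic passed as inliers

balanced_accuracy = (attack_recall + normal_recall) / 2
print(attack_recall, normal_recall, balanced_accuracy)
```

The result reproduces the 0.6193 printed above. The model passes most normal traffic but catches well under half of the attacks in the test set.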

## Basic One-Class SVM

In [27]:
ocsvm = OneClassSVM(kernel='rbf')
ocsvm.fit(X_train[train_normal])
# The model is already fitted, so predict() is enough; fit_predict() would refit on the same data
y_pred_normal = ocsvm.predict(X_train[train_normal])
y_pred_outlier = ocsvm.predict(X_train[~train_normal])
y_pred = ocsvm.predict(X_train)



In [28]:
confusion_matrix(y_train[train_normal], y_pred_normal)

array([[    0,     0],
       [26719, 29281]])

In [29]:
confusion_matrix(y_train[~train_normal], y_pred_outlier)

array([[117839,   1502],
       [     0,      0]])

In [30]:
confusion_matrix(y_train, y_pred)

array([[117839,   1502],
       [ 26719,  29281]])

In [35]:
roc_auc_score(y_test, ocsvm.predict(X_test))

0.551775488229781
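In the training matrices above, 26,719 of the 56,000 normal observations are flagged as outliers, which is close to one half. That is consistent with the default `nu=0.5` of `OneClassSVM`, since `nu` is an upper bound on the fraction of training errors. A minimal sketch on synthetic data (not the UNSW-NB15 features) of how lowering `nu` changes the share of training points that get flagged:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the normal class: one Gaussian blob
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 2))

flagged_share = {}
for nu in (0.5, 0.05):
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_normal)
    # Share of the training data that the fitted model labels as outliers (-1)
    flagged_share[nu] = np.mean(model.predict(X_normal) == -1)

print(flagged_share)
```

On the real data, a smaller `nu` should reduce the false alarms on normal traffic, likely at the cost of missing more attacks, so it is worth tuning against the hold-out set.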

## Elliptic Envelope

In [25]:
eenv = EllipticEnvelope(contamination=0.2, random_state=0)
eenv.fit(X_train[train_normal])

EllipticEnvelope(assume_centered=False, contamination=0.2, random_state=0,
         store_precision=True, support_fraction=None)

In [36]:
roc_auc_score(y_test,eenv.predict(X_test))

0.7075123032235447
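The AUC values so far are computed from hard -1/1 predictions, which fixes a single threshold. The detectors also expose `decision_function`, where higher values mean "more normal", so the continuous scores can be passed to `roc_auc_score` directly. A sketch on synthetic data, assuming the same label convention as this notebook (1 for normal, -1 for attack):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: train on normal data only, evaluate on a mix
X_normal_train = rng.normal(0, 1, size=(400, 3))
X_eval = np.vstack([rng.normal(0, 1, size=(200, 3)),    # normal
                    rng.normal(3, 1, size=(200, 3))])   # shifted "attacks"
y_eval = np.array([1] * 200 + [-1] * 200)

model = EllipticEnvelope(contamination=0.2, random_state=0).fit(X_normal_train)

# AUC from hard labels: a single operating point
auc_labels = roc_auc_score(y_eval, model.predict(X_eval))
# AUC from continuous scores: uses the full ranking of observations
auc_scores = roc_auc_score(y_eval, model.decision_function(X_eval))

print(auc_labels, auc_scores)
```

The score-based AUC does not depend on `contamination`, which makes it a fairer way to compare detectors whose default thresholds differ.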

## Local Outlier Factor

In [59]:
lof = LocalOutlierFactor()
y_pred = lof.fit_predict(X_train)



In [43]:
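# negative_outlier_factor_ is close to -1 for inliers and far below -1 for outliers
neg_scores = lof.negative_outlier_factor_
# Min-max scale to [0, 1]: 0 is the most normal point, 1 the most anomalous
out_scores = (neg_scores.max()-neg_scores)/(neg_scores.max()-neg_scores.min())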

In [53]:
pd.Series(neg_scores).describe()

count    1.315050e+05
mean    -7.703878e+12
std      2.633116e+14
min     -2.320057e+16
25%     -1.078605e+00
50%     -1.004495e+00
75%     -1.000000e+00
max     -8.769621e-01
dtype: float64

In [54]:
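# Note: out_scores is min-max scaled to [0, 1], so any threshold above 1 flags nothing
# and the resulting constant predictions give an AUC of exactly 0.5
for lim in np.linspace(1e-5,1e1,10):
    print(roc_auc_score(y_train, list(map(y_anomaly_format, out_scores > lim))))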

0.5005816494689044
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
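The run of 0.5 values follows from the scaling: `out_scores` lies in [0, 1], so nine of the ten thresholds from `np.linspace(1e-5, 1e1, 10)` exceed 1 and flag nothing. The extreme minimum of about -2.3e16 also squeezes almost every scaled score toward 0. One way around both issues is to skip the scaling and pass the raw factors to `roc_auc_score`, or to pick thresholds from quantiles of the scores. A sketch on synthetic data, with the quantile levels chosen arbitrarily:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: a dense normal cluster plus scattered outliers
X_inliers = rng.normal(0, 1, size=(300, 2))
X_outliers = rng.uniform(-8, 8, size=(30, 2))
X = np.vstack([X_inliers, X_outliers])
y = np.array([1] * 300 + [-1] * 30)   # 1 = normal, -1 = anomaly

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
neg_scores = lof.negative_outlier_factor_

# Higher (closer to -1) means more normal, which matches 1 as the positive label,
# so the raw factors can be ranked directly without scaling or thresholding.
auc = roc_auc_score(y, neg_scores)
print("rank-based AUC:", auc)

# If hard labels are needed, take thresholds from quantiles of the scores
for q in (0.01, 0.05, 0.10):
    cutoff = np.quantile(neg_scores, q)
    y_hat = np.where(neg_scores <= cutoff, -1, 1)
    print(q, roc_auc_score(y, y_hat))
```

Quantile thresholds always fall inside the observed score range, so each one flags a known share of the data regardless of how extreme the largest outlier is.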


In [61]:
roc_auc_score(y_train, y_pred)

0.5393576376419513