# Initial Classification Modeling for the UNSW-NB15 cyberattack dataset
Identifying cyberattacks can be framed as either a classification problem or an anomaly detection problem. In this notebook, I treat it as a classification problem. The training and test CSVs provided by the researchers actually contain more attacks than normal observations, which makes them better suited to classification training. A companion notebook, `02b_improved classification`, runs classification examples with nearly 10x the observations, more thoughtful feature selection, and data pre-processing.

This notebook was run using the numeric columns from the data with no preprocessing of any kind. I intended this as a baseline proof of concept using the provided training and test data. **The point of this notebook was to ensure that it was possible to run any sort of model on the dataset.** I suspected that the models below were performing too well, and brought in the test set to check whether they were over-fitting. Cross-validated scores on the provided training set were similar to those from a holdout of the training set. When scores then dropped by more than 0.1 on the test set, I reasoned that the test set contained types of attacks not present in the training set.

The (relatively) poor results seen here were the catalyst for going back and spending more time understanding the features' real-world meaning and the distribution of their values. Having a "dumb" first pass to compare against later attempts lets me gauge the extent to which later efforts improve or reduce model performance.

In [1]:
# Custom modules
from data_prep import load_csv_data
import model_abstraction as moda

# Data Structures
import pandas as pd
import numpy as np

# Preprocessing or data manipulation methods
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Modeling methods and selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import OneClassSVM, LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Model assessment
from sklearn.metrics import confusion_matrix, roc_auc_score

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Modeling Using only Numeric Features
The columns that remain after excluding string objects are all numerical, but contain a mix of ordinal, categorical, integer, and float values. Because the provided dataset is roughly balanced between the normal and attack classes, I've elected to try the classification approach here and to explore anomaly detection in its own notebook, `03_anomaly_detection`. The following tests serve as an initial proof of concept for a classification approach to detecting attack instances in the training set.
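
The `load_csv_data` helper comes from the custom `data_prep` module and isn't reproduced in this notebook. As a rough sketch of what "excluding string objects" could look like, assuming the target lives in a column named `label` (an assumption about the CSV layout), something like the following keeps only the numeric columns. The function name `load_numeric_features` and the toy frame are hypothetical.

```python
import pandas as pd

def load_numeric_features(df, target_col="label"):
    """Split a frame into numeric features and a target (hypothetical helper)."""
    y = df[target_col]
    # Drop the target, then keep only numeric dtypes (ints and floats)
    X = df.drop(columns=[target_col]).select_dtypes(include="number")
    return X, y

# Toy frame standing in for the real CSV; column names and values are illustrative only
toy = pd.DataFrame({
    "dur": [0.12, 0.65, 1.62],
    "proto": ["tcp", "udp", "tcp"],  # string column, excluded from X
    "sbytes": [258, 734, 364],
    "label": [0, 1, 1],
})
X_toy, y_toy = load_numeric_features(toy)
print(list(X_toy.columns))
```

Note that this kind of filter keeps integer-coded categorical columns, which is consistent with the mix of value types described above.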

In [3]:
X_train, y_train = load_csv_data('./data/UNSW_NB15_train_set.csv')
X_train, X_hold, y_train, y_hold  = train_test_split(X_train,y_train, test_size = 0.25,
                                                     random_state = 42, stratify=y_train)
X_test, y_test = load_csv_data('./data/UNSW_NB15_test_set.csv')

In [4]:
# Number of numeric features remaining
len(X_train.columns)

39

## First-Pass Assessment
The blocks below apply several classifiers to the data with minimal tuning. The goal is to decide which models deserve extensive tuning, since parameter searches with cross-validation over this dataset are computationally expensive.

The function below takes a dictionary of models, a dictionary of model parameters, a dataset of features and targets, and performs a cross-validation assessment using (by default) ROC AUC scores. I created it for a previous classification project and like using it as a means to quickly test a variety of models before narrowing my focus.
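
The source of `moda.cross_val_models` isn't included in this notebook. A minimal sketch of a function with the same general shape, assuming it instantiates each model class with its parameter dictionary and averages `cross_val_score` results, might look like the following. The name `cross_val_models_sketch`, the `cv` argument, and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cross_val_models_sketch(models, X, y, params=None, scoring="roc_auc",
                            cv=5, verbose=False):
    """Cross-validate each model class and return its mean score (hypothetical)."""
    params = params or {}
    results = {}
    for name, model_cls in models.items():
        # Build the estimator from its class and any supplied keyword arguments
        model = model_cls(**params.get(name, {}))
        score = cross_val_score(model, X, y, scoring=scoring, cv=cv).mean()
        results[name] = score
        if verbose:
            print("Model:", name, "Metric:", scoring, score)
    return results

# Synthetic data, not the UNSW-NB15 data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_results = cross_val_models_sketch(
    {"lgr": LogisticRegression, "dtc": DecisionTreeClassifier},
    X_demo, y_demo,
    params={"lgr": {"solver": "lbfgs"}, "dtc": {"random_state": 0}},
)
```

Passing classes rather than instances keeps each cross-validation run starting from a fresh, unfitted estimator.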

In [5]:
# Provide classifiers to test for a "first pass" assessment using only vanilla models
classifiers = {
    'knn': KNeighborsClassifier,
    'lgr': LogisticRegression,
    'gnb': GaussianNB,
    'mnb': MultinomialNB,
    'dtc': DecisionTreeClassifier,
    'rfc': RandomForestClassifier,
    'gbc': GradientBoostingClassifier,
    'lsvc': LinearSVC
}

default_parameters = {
    'knn': {'n_neighbors':9},
    'lgr': {'solver':'lbfgs'},
    'gnb': {},
    'mnb': {},
    'dtc': {},
    'rfc': {'n_estimators':100},
    'gbc': {},
    'lsvc': {}
}

In [6]:
results = moda.cross_val_models(classifiers, X_train, y_train, params=default_parameters, verbose = True)

Model: knn Metric: roc_auc 0.942791683625017
Model: lgr Metric: roc_auc 0.8681552586846705
Model: gnb Metric: roc_auc 0.8680183569154156
Model: mnb Metric: roc_auc 0.7924676707340106
Model: dtc Metric: roc_auc 0.9403653413882174
Model: rfc Metric: roc_auc 0.9930983324953914
Model: gbc Metric: roc_auc 0.9896156619342893
Model: lsvc Metric: roc_auc 0.8146730589671766

## Slightly more in-depth performance assessments: looking for over-fitting
From the above model testing, I'm selecting a few models to focus on:
* Gradient Boosting: doesn't offer much in the way of interpretability, but it's a lower variance option that is producing good results in this case.
* Random Forest: trains much more quickly than gradient boosting while having slightly better cross-validated results.
* Logistic Regression: performance seems lower than average, but the coefficients may be useful in guiding later feature selection.
* KNN: performs well, and the literature I've read on anomaly detection suggests density-based approaches. I realize I'm currently treating this as a supervised classification problem, but I think it's worth pulling that thread.

I brought in the test set because I suspected the models were performing too well, given that this is a first-pass assessment and the dataset was created with the intent to make cyberattacks harder to detect. To detect overfitting, I trained the models on part of the training set and then tested on a holdout from the training set. I then compared both of those results with scores from the provided test set. The model parameters aren't changed too much here given that I'm more concerned about the difference in performance across training and test sets rather than model performance for its own sake.

In [7]:
## Determine how much the classifier is over-fitting by comparing test auc with
## training cross-validation from above
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train, y_train)
print('Train: ', roc_auc_score(y_train, gbc.predict(X_train)))
print('Holdout: ', roc_auc_score(y_hold, gbc.predict(X_hold)))
print('Test:', roc_auc_score(y_test, gbc.predict(X_test)))

Train:  0.9268779836721013
Holdout:  0.9258939344607664
Test: 0.8429430746373303


In [9]:
## Determine how much the classifier is over-fitting by comparing test auc with
## training cross-validation from above
rfc = RandomForestClassifier(n_estimators=100,random_state=42)
rfc.fit(X_train, y_train)
print('Train: ', roc_auc_score(y_train, rfc.predict(X_train)))
print('Holdout: ', roc_auc_score(y_hold, rfc.predict(X_hold)))
print('Test:', roc_auc_score(y_test, rfc.predict(X_test)))

Train:  0.997792385900229
Holdout:  0.9494054689445158
Test: 0.8611083930926426


I'm seeing slight dropoffs in AUC score from the training set (which the models were trained on) to the holdout set, followed by steeper reductions on the test set, from the low/mid .90's into the mid .80's.
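
One thing that may make these numbers hard to compare with the earlier cross-validated scores: scikit-learn's `'roc_auc'` scorer ranks observations using `decision_function` or `predict_proba`, whereas the cells above pass hard labels from `predict` to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the AUC reduces to balanced accuracy. The sketch below illustrates the difference on synthetic data (not the UNSW-NB15 data); all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# AUC computed from hard 0/1 predictions
auc_labels = roc_auc_score(y_ho, clf.predict(X_ho))
# AUC computed from the predicted probability of the positive class
auc_scores = roc_auc_score(y_ho, clf.predict_proba(X_ho)[:, 1])
# Balanced accuracy of the same hard predictions
bal_acc = balanced_accuracy_score(y_ho, clf.predict(X_ho))

print("AUC from labels:       ", auc_labels)
print("AUC from probabilities:", auc_scores)
print("Balanced accuracy:     ", bal_acc)
```

Either convention works for comparing train, holdout, and test scores against each other, as long as the same one is used for all three.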

Below, I check whether I can select a few features using the logistic regression coefficient values. The results indicate that the cutoff I'm using for eliminating features is too high or that the tradeoff in accuracy isn't worth it.

In [8]:
## Determine how much the classifier is over-fitting by comparing test auc with
## training cross-validation from above
lgr = LogisticRegression(solver = 'liblinear', penalty = 'l1', random_state=42)
lgr.fit(X_train, y_train)
print('Train: ', roc_auc_score(y_train, lgr.predict(X_train)))
print('Holdout: ', roc_auc_score(y_hold, lgr.predict(X_hold)))
print('Test:    ', roc_auc_score(y_test, lgr.predict(X_test)))

Train:  0.8929929872233793
Holdout:  0.8955387211997012
Test:     0.7533345944992023




In the example below, I eliminate the features whose coefficients have an absolute value of 1e-2 or less.

In [65]:
lgr_coefs = pd.Series(lgr.coef_.ravel())[
(pd.Series(lgr.coef_.ravel())>1e-2)\
| (pd.Series(lgr.coef_.ravel())<-1e-2)]
x_red = lgr_coefs

In [66]:
lgr = LogisticRegression(solver = 'liblinear', penalty = 'l1', random_state=42)
lgr.fit(X_train.iloc[:,x_red], y_train)
print('Train: ', roc_auc_score(y_train, lgr.predict(X_train.iloc[:,x_red])))
print('Holdout: ', roc_auc_score(y_hold, lgr.predict(X_hold.iloc[:,x_red])))
print('Test:    ', roc_auc_score(y_test, lgr.predict(X_test.iloc[:,x_red])))

Train:  0.5267994924465512
Holdout:  0.5269865359201731
Test:     0.5130821315889259
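
For reference, `DataFrame.iloc` selects columns by integer position, so a coefficient-based filter needs the positions of the retained coefficients rather than the coefficient values themselves. The sketch below is one way to express that on synthetic data. The `1e-2` threshold mirrors the cell above, while `keep_pos`, `X_demo`, and the other names are hypothetical. It is not the code that produced the scores above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=400, n_features=12,
                                    n_informative=4, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

lgr_demo = LogisticRegression(solver="liblinear", penalty="l1", random_state=42)
lgr_demo.fit(X_demo, y_demo)

coefs = lgr_demo.coef_.ravel()
# Integer positions of coefficients whose magnitude exceeds the threshold
keep_pos = np.flatnonzero(np.abs(coefs) > 1e-2)
X_reduced = X_demo.iloc[:, keep_pos]

print("Kept", X_reduced.shape[1], "of", X_demo.shape[1], "columns")
print(list(X_reduced.columns))
```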


I also wanted to take a quick look at using KNN. Some of the literature I've read on cyberattack datasets (mainly the KDD Cup 99 dataset) shows promising results using KNN ([for example](https://pdfs.semanticscholar.org/918a/e12c0cff311038147b2183af9830417361d8.pdf)). The function below cross-validates a training set for a range of `n_neighbors` values. I saw incremental increases in AUC score, with diminishing returns past 10 neighbors. I've tried an extremely large value for `n_neighbors` just to confirm that the performance gains are negligible beyond that point.
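
`moda.iterate_k_for_KNN` isn't shown in this notebook. A minimal sketch of a helper with a similar call shape, assuming the last two arguments are the start and (exclusive) stop of a range of `n_neighbors` values, could look like this. The name `iterate_k_sketch`, the `step` argument, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def iterate_k_sketch(X, y, k_start, k_stop, step=1, scoring="roc_auc", cv=5):
    """Cross-validate KNN over a range of k and return the best (k, score)."""
    best_k, best_score = None, float("-inf")
    for k in range(k_start, k_stop, step):
        knn = KNeighborsClassifier(n_neighbors=k)
        score = cross_val_score(knn, X, y, scoring=scoring, cv=cv).mean()
        print("n_neighbors:", k, scoring, score)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Synthetic data, not the UNSW-NB15 data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
best_k_demo, best_score_demo = iterate_k_sketch(X_demo, y_demo, 1, 12, step=2)
```

Under that reading, a call with `80, 81` evaluates only `n_neighbors=80`, which matches the single line of output below.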

In [15]:
best_k, best_score = moda.iterate_k_for_KNN(X_train, y_train, 80,81)

n_neighbors: 80 roc_auc 0.9454374383181573


## Grid-Searching Logistic Regression:
From the above, it was clearly necessary to go back to the drawing board with my dataset. The grid searches below are an exploration into how well I could tune a model given the limitations of my dataset. I still wanted some interpretability from the model coefficients, so I went with logistic regression.
These took a couple of hours to run locally.  

Despite improvements in the cross-validation score by roughly .1 AUC from above, these didn't translate into any better performance on the test set. This further reinforced my desire to build a more representative test set.

In [4]:
# Parameter search for a logistic regression using an l1 penalty
param_grid = {'solver':['liblinear','saga'], 
             'C':np.linspace(1e-3,1e3,12)
             }
grid = GridSearchCV(LogisticRegression(penalty='l1', random_state=42),param_grid,
                    scoring='roc_auc', cv=5).fit(X_train,y_train)

In [8]:
# Cross-validation AUC score improves from the initial multi-model exercise
grid.best_score_

0.9692200422429181

In [9]:
grid.best_params_

{'C': 545.455, 'solver': 'liblinear'}
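
The selected `C` of 545.455 is one of the twelve evenly spaced values produced by `np.linspace(1e-3, 1e3, 12)`. Because the grid is linear, only the first point lies below roughly 90, so small values of `C` (stronger regularization) are barely explored. The sketch below prints that grid next to a log-spaced alternative. `np.logspace` is offered only as a possible variation and is not used in the searches here.

```python
import numpy as np

# The grid used in the search above: 12 evenly spaced values
linear_grid = np.linspace(1e-3, 1e3, 12)

# A possible alternative: same endpoints, evenly spaced in log10(C)
log_grid = np.logspace(-3, 3, 12)

print(np.round(linear_grid, 3))
print(np.round(log_grid, 3))
```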

In [11]:
lgr_optimized = grid.best_estimator_

In [13]:
# Despite improvements in the tuning score, the performance gains on the test set are marginal
roc_auc_score(y_test, lgr_optimized.predict(X_test))

0.7535152377295676

In [10]:
grid.best_estimator_.coef_

array([[ 3.44108227e-02, -1.94397201e-02, -1.35032449e-02,
         2.04231191e-05, -4.66065391e-05,  1.18592348e-06,
         1.10675768e-02,  5.40989102e-02, -5.64964484e-10,
        -9.83319671e-06,  4.08277437e-02,  1.61653126e-01,
        -1.37131493e-04, -4.28209509e-05,  1.28308153e-06,
        -2.39781690e-06, -5.23263140e-02, -1.19771758e-11,
         4.27559195e-12, -2.73774349e-03, -1.56992644e+00,
        -9.65802305e+00, -3.12065222e-01, -1.00788712e-03,
         1.41033932e-02,  1.14097754e+00, -1.73934138e-06,
        -7.84067021e-02,  1.27175491e+00,  6.96032417e-03,
         1.60333948e-02,  2.55658056e-01,  2.35714547e-01,
         1.34310684e+00,  3.89252015e-01, -2.89861001e-02,
         4.20569386e-03, -2.59115619e-01, -1.21370379e+01]])
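
The raw array is hard to read without feature names. Below is a small sketch of pairing coefficients with column names and ranking them by magnitude, using synthetic data and hypothetical column names. Because the features in this notebook were not scaled, coefficient magnitudes also reflect each feature's units, so a ranking like this is only a rough guide.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

model = LogisticRegression(solver="liblinear", penalty="l1", random_state=42)
model.fit(X_demo, y_demo)

# Label each coefficient with its column name
coef_by_feature = pd.Series(model.coef_.ravel(), index=X_demo.columns)
# Reorder by absolute value, largest first, keeping the original signs
order = coef_by_feature.abs().sort_values(ascending=False).index
ranked = coef_by_feature.reindex(order)
print(ranked)
```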

In [5]:
# Parameter grid search using an l2 penalty. This version has more solvers available.
param_grid_2 = {'solver':['liblinear','saga', 'sag', 'lbfgs', 'newton-cg'], 
             'C':np.linspace(1e-3,1e3,12)
             }
grid_2 = GridSearchCV(LogisticRegression(penalty='l2', random_state=42),param_grid_2,
                    scoring='roc_auc', cv=5).fit(X_train,y_train)

In [7]:
grid_2.best_score_

0.8978438674615145