# Classification Modeling for UNSW NB-15 cyberattack dataset
Identifying cyberattacks can be considered both a classification problem and an anomaly detection problem. In this notebook, I treat it as a classification problem. This is appropriate for the data as is, given that the training and test csv's provided by the researchers are balanced between the normal and anomaly classes.

In [2]:
# Custom modules
from data_prep import load_csv_data
import model_abstraction as moda

# Data Structures
import pandas as pd
import numpy as np


# Preprocessing or data manipulation methods
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Modeling methods and selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import OneClassSVM, LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Model assessment
from sklearn.metrics import confusion_matrix, roc_auc_score

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Modeling Using only Numeric Features
The columns that remain after excluding string objects are all numerical, but contain a mix of ordinal, categorical, integer, and float values. Because the provided dataset is balanced between the normal data and attack classes, I've elected to try the classification approach to anomaly detection in it's own notebook. This will serve as an initial proof of concept for this type of approach to the problem. 

In [3]:
X_train, y_train = load_csv_data('./data/UNSW_NB15_train_set.csv')
X_train, X_hold, y_train, y_hold  = train_test_split(X_train,y_train, test_size = 0.25,
                                                     random_state = 42, stratify=y_train)
X_test, y_test = load_csv_data('./data/UNSW_NB15_test_set.csv')

In [15]:
# Number of numeric features remaining
len(X_train.columns)

39

## First-Pass Assessment
The blocks below go through the process of applying several classifiers to the data with minimal tuning. This is done as a means to guide the decision-making process on which models to focus on for extensive tuning since parameter searching with cross-validation over this large of a dataset will be non-trivial.

In [28]:
# Provide classifiers to test for a "first pass" assessment using only vanilla models
classifiers = {
    'knn': KNeighborsClassifier,
    'lgr': LogisticRegression,
    'gnb': GaussianNB,
    'mnb': MultinomialNB,
    'dtc': DecisionTreeClassifier,
    'rfc': RandomForestClassifier,
    'gbc': GradientBoostingClassifier,
    'lsvc': LinearSVC
}

default_parameters = {
    'knn': {},
    'lgr': {'solver':'lbfgs'},
    'gnb': {},
    'mnb': {},
    'dtc': {},
    'rfc': {'n_estimators':100},
    'gbc': {},
    'lsvc': {}
}

In [35]:
results = moda.cross_val_models(classifiers, X_train, y_train, params=default_parameters, verbose = True)

Model: knn Metric: roc_auc 0.9307693784066334
Model: lgr Metric: roc_auc 0.8681552586846705
Model: gnb Metric: roc_auc 0.8680183569154156
Model: mnb Metric: roc_auc 0.7924676707340106
Model: dtc Metric: roc_auc 0.939692209932406
Model: rfc Metric: roc_auc 0.9930360035752193
Model: gbc Metric: roc_auc 0.9896177142271914




Model: lsvc Metric: roc_auc 0.7495850404739294




In [38]:
## Determine how much the classifier is over-fitting by comparing test auc with
## training cross-validation from above
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train, y_train)
print('Train: ', roc_auc_score(y_train, gbc.predict(X_train)))
print('Holdout: ', roc_auc_score(y_hold, gbc.predict(X_hold)))
print('Test:', roc_auc_score(y_test, gbc.predict(X_test)))

Train:  0.9268779836721013
Holdout:  0.9258939344607664
Test: 0.8429430746373303


In [64]:
## Determine how much the classifier is over-fitting by comparing test auc with
## training cross-validation from above
lgr = LogisticRegression(solver = 'liblinear', penalty = 'l1', random_state=42)
lgr.fit(X_train, y_train)
print('Train: ', roc_auc_score(y_train, lgr.predict(X_train)))
print('Holdout: ', roc_auc_score(y_hold, lgr.predict(X_hold)))
print('Test:    ', roc_auc_score(y_test, lgr.predict(X_test)))



Train:  0.8929929872233793
Holdout:  0.8955387211997012
Test:     0.7533345944992023


In the example below, I eliminate the values

In [65]:
lgr_coefs = pd.Series(lgr.coef_.ravel())[
(pd.Series(lgr.coef_.ravel())>1e-2)\
| (pd.Series(lgr.coef_.ravel())<-1e-2)]
x_red = lgr_coefs

In [66]:
lgr = LogisticRegression(solver = 'liblinear', penalty = 'l1', random_state=42)
lgr.fit(X_train.iloc[:,x_red], y_train)
print('Train: ', roc_auc_score(y_train, lgr.predict(X_train.iloc[:,x_red])))
print('Holdout: ', roc_auc_score(y_hold, lgr.predict(X_hold.iloc[:,x_red])))
print('Test:    ', roc_auc_score(y_test, lgr.predict(X_test.iloc[:,x_red])))

Train:  0.5267994924465512
Holdout:  0.5269865359201731
Test:     0.5130821315889259


Tuning KNN:

In [15]:
best_k, best_score = moda.iterate_k_for_KNN(X_train, y_train, 80,81)

n_neighbors: 80 roc_auc 0.9454374383181573


## Grid-Searching Classifiers:
In order to provide some interpretation on coefficients, I'm going to try running some grid search model tuning while figuring out the rest of my dataset

In [4]:
param_grid = {'solver':['liblinear','saga'], 
             'C':np.linspace(1e-3,1e3,12)
             }
grid = GridSearchCV(LogisticRegression(penalty='l1', random_state=42),param_grid,
                    scoring='roc_auc', cv=5).fit(X_train,y_train)







In [5]:
param_grid_2 = {'solver':['liblinear','saga', 'sag', 'lbfgs', 'newton-cg'], 
             'C':np.linspace(1e-3,1e3,12)
             }
grid_2 = GridSearchCV(LogisticRegression(penalty='l2', random_state=42),param_grid_2,
                    scoring='roc_auc', cv=5).fit(X_train,y_train)

































In [7]:
grid_2.best_score_

0.8978438674615145

In [8]:
grid.best_score_

0.9692200422429181

In [9]:
grid.best_params_

{'C': 545.455, 'solver': 'liblinear'}

In [10]:
grid.best_estimator_.coef_

array([[ 3.44108227e-02, -1.94397201e-02, -1.35032449e-02,
         2.04231191e-05, -4.66065391e-05,  1.18592348e-06,
         1.10675768e-02,  5.40989102e-02, -5.64964484e-10,
        -9.83319671e-06,  4.08277437e-02,  1.61653126e-01,
        -1.37131493e-04, -4.28209509e-05,  1.28308153e-06,
        -2.39781690e-06, -5.23263140e-02, -1.19771758e-11,
         4.27559195e-12, -2.73774349e-03, -1.56992644e+00,
        -9.65802305e+00, -3.12065222e-01, -1.00788712e-03,
         1.41033932e-02,  1.14097754e+00, -1.73934138e-06,
        -7.84067021e-02,  1.27175491e+00,  6.96032417e-03,
         1.60333948e-02,  2.55658056e-01,  2.35714547e-01,
         1.34310684e+00,  3.89252015e-01, -2.89861001e-02,
         4.20569386e-03, -2.59115619e-01, -1.21370379e+01]])

In [11]:
lgr_optimized = grid.best_estimator_

In [12]:
type(lgr_optimized)

sklearn.linear_model.logistic.LogisticRegression

In [13]:
roc_auc_score(y_test, lgr_optimized.predict(X_test))

0.7535152377295676