# Intrusion Detection on the KDD Cup 99 Data Set
---

## Notebook Setup

This notebook has been set up to allow both Python and R to work together. For it to work, make sure you have the correct Python and R environments set up (see [./README.md](./README.md)).

By default, all code cells are in Python. To use R, start a code cell with the line `%%R`, which tells the notebook to interpret the whole cell as R code. The notebook is set up so that it is effectively two notebooks intertwined: one in R and one in Python. Special code cells can then be used to ferry data between the two notebooks:
  - `%Rpush <variable_name>` will copy a piece of data from the Python notebook to the R notebook
  - `%Rpull <variable_name>` will copy a piece of data from the R notebook to the Python notebook
  
More `<variable_name>`s can be appended to the above lines (separated by spaces), e.g. `%Rpull foo bar` will copy the `foo` and `bar` variables from R to Python. This should work in general provided we stick to data frames when communicating between the two notebooks (other types tend to require some massaging on either end).

Python imports:

In [None]:
import matplotlib.pyplot  # graphs and plotting for python (pyplot is used explicitly later on)
import numpy  # fast arrays for python (used by pandas)
import pandas  # provides data frames and similar structures for python
import seaborn  # provides pre-configured, pretty graphs for matplotlib

from sklearn import metrics
from sklearn import model_selection

import warnings
warnings.filterwarnings('ignore')

Setup Jupyter with `rpy2` to allow embedding R, and `matplotlib` to allow inline plots.

In [None]:
# For inline plots within the notebook
%matplotlib notebook
# Allows code cells to be interpreted as R (put %%R on the first line) [^1]
%load_ext rpy2.ipython
# Render R output as HTML
from rpy2.ipython import html
html.init_printing()

R libraries:

In [None]:
%%R
library(caret)
library(ggplot2)

In [None]:
random_state = numpy.random.RandomState(0)

## Data Source

In [None]:
columns=['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'connection_label']
connection_events = pandas.read_csv('http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz', names=columns)  # [^3]

In [None]:
connection_events.head(10).transpose()

The `connection_label` column describes the source of each connection; either the name of the red-team which caused the event, or the string `normal.` which indicates normal network behaviour. 

The task is to create a model which can separate red-team behaviour from normal network behaviour, so group the data into two labels: `normal` and `bad`.

In [None]:
def generate_label(connection_label):
    return 'normal' if connection_label == 'normal.' else 'bad'

connection_events['connection_label'] = connection_events['connection_label'].apply(func=generate_label)
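As a quick sanity check of the relabelling, the sketch below applies the same `generate_label` function to a short, made-up list of raw labels and counts the result. The sample values and the names `raw_labels`, `binary_labels` and `label_counts` are illustrative assumptions, not taken from the data set; the point is only that anything other than the exact string `normal.` (including the trailing full stop) ends up as `bad`.

```python
from collections import Counter

# Hypothetical raw labels, for illustration only.
raw_labels = ['normal.', 'smurf.', 'neptune.', 'normal.', 'smurf.']


def generate_label(connection_label):
    # Same rule as in the notebook: only the exact string 'normal.' is normal.
    return 'normal' if connection_label == 'normal.' else 'bad'


binary_labels = [generate_label(label) for label in raw_labels]

# Count how many samples fall into each of the two classes.
label_counts = Counter(binary_labels)
print(label_counts)
```

Running the equivalent count on the real column (for example with `value_counts()`) is a cheap way to see how imbalanced the two classes are before modelling.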

Next, separate out the labels from the data set. 

In [None]:
connection_labels = connection_events.filter(['connection_label'], axis='columns')
connection_events = connection_events.drop(['connection_label'], axis='columns')

## Training and Testing Data

It is now necessary to split the data into training and testing sets. The code below performs 10-fold cross-validation, providing the array `train_test_splits`. It does this using the [model selection](http://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold) part of the scikit-learn library [^5].

In [None]:
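# shuffle=True is needed for random_state to take effect; recent scikit-learn
# versions raise an error if random_state is set without it.
k_fold_splitter = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)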
train_test_splits = k_fold_splitter.split(
    connection_events,  # data to be split
    connection_labels,  # target/class to split by
)

# Force evaluation of the train_test_splits generator into a list. This needs to be
# done before it's sent to R.
train_test_splits = [
    [training_indexes, testing_indexes]
    for training_indexes, testing_indexes in train_test_splits
]

# Here train_test_splits is a list, where each item represents a single 90/10 split of 
# training and testing data respectively:
#
#   train_test_splits = [
#      (indexes_of_training_samples, indexes_of_test_samples),  # first split
#      (indexes_of_training_samples, indexes_of_test_samples),  # second split
#      ...
#      (indexes_of_training_samples, indexes_of_test_samples),  # kth split
#  ]
#
# These indexes can then be used to fetch the data samples and their labels,
# ready for training and testing.
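To make the idea of stratification concrete, here is a dependency-free sketch of what a stratified split guarantees: every test fold contains roughly the same proportion of each class as the full data set. This is not how scikit-learn implements `StratifiedKFold`; the function `stratified_folds` and the toy labels are assumptions made up for illustration.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Return (training_indexes, testing_indexes) pairs with class-balanced test folds."""
    # Group the sample indexes by their class label.
    indexes_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indexes_by_class[label].append(index)

    # Deal each class's samples round-robin so every fold gets a fair share.
    test_folds = [[] for _ in range(n_splits)]
    for class_indexes in indexes_by_class.values():
        for position, index in enumerate(class_indexes):
            test_folds[position % n_splits].append(index)

    # Everything that is not in a test fold is training data for that split.
    all_indexes = set(range(len(labels)))
    return [
        (sorted(all_indexes - set(test_indexes)), sorted(test_indexes))
        for test_indexes in test_folds
    ]


# Toy labels: 80% 'bad' and 20% 'normal' (made up for illustration).
toy_labels = ['bad'] * 40 + ['normal'] * 10
toy_splits = stratified_folds(toy_labels, n_splits=5)

for training_indexes, testing_indexes in toy_splits:
    n_normal = sum(toy_labels[i] == 'normal' for i in testing_indexes)
    print(len(training_indexes), len(testing_indexes), n_normal)
```

Each of the five toy test folds holds 10 samples, 2 of which are `normal`, matching the 20% share in the full list.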

The cell below imports the connection data, its labels and the train/test splits into the R session, ready for analysis and modelling.

In [None]:
%Rpush train_test_splits connection_events connection_labels

Below are helper functions which will extract the training (or test) data from a data frame, given an item of the `train_test_splits` array:

In [None]:
%%R

get_training_rows <- function(dataframe, train_test_split) {
    # Training indexes are the first item in a train_test_split
    indexes <- train_test_split[[1]]
    
    # Python indices start at 0, whilst R indices start at 1.
    # Correct for this by incrementing each index by 1:
    indexes <- indexes + 1
    
    dataframe[indexes,]
}

get_testing_rows <- function(dataframe, train_test_split) {
    # Testing indexes are the second item in a train_test_split
    indexes <- train_test_split[[2]]
    
    # Python indices start at 0, whilst R indices start at 1.
    # Correct for this by incrementing each index by 1:
    indexes <- indexes + 1  
    
    dataframe[indexes,]
}

Here is an example showing how to extract the first set of training and testing data:

In [None]:
%%R

# Use the indexes to get out the training data
training_data <- get_training_rows(connection_events, train_test_splits[[1]])
training_labels <- get_training_rows(connection_labels, train_test_splits[[1]])
# These data frames can then be used to train your model.

# Use the indexes to get out the testing data
testing_data <- get_testing_rows(connection_events, train_test_splits[[1]])
testing_labels <- get_testing_rows(connection_labels, train_test_splits[[1]])
# Run the model on the testing data, then compare its predicted classes with the true labels.

## Data Model

Here's a quick demo showing how to use the data within the R notebook.

In [None]:
%%R
summary(training_data)

In [None]:
%%R
qplot(x=src_bytes, y=dst_bytes, data=testing_data, geom='point')

In practice, we would want to do the following for each train/test split $i$:
  1. Train the model based on the $i$-th set of training data and labels.
  2. Apply this trained model on the $i$-th set of testing data.
  3. Save the set of predicted labels.
  
Then return a vector of each of these sets of predicted labels.
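The steps above can be sketched in Python as a simple loop over the splits. The "model" here is a deliberately trivial majority-class baseline, and the helpers `train_majority_model` and `predict_with_majority_model`, along with the toy data, are hypothetical names and values chosen for illustration; a real model would replace both helpers.

```python
from collections import Counter


def train_majority_model(training_labels):
    """'Train' a baseline: remember the most common training label."""
    return Counter(training_labels).most_common(1)[0][0]


def predict_with_majority_model(model, testing_data):
    """Predict the remembered label for every test sample."""
    return [model for _ in testing_data]


# Toy stand-ins for connection_events, connection_labels and train_test_splits.
toy_events = list(range(6))
toy_labels = ['bad', 'bad', 'normal', 'bad', 'bad', 'bad']
toy_splits = [([0, 1, 2, 3], [4, 5]), ([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3])]

toy_predicted_label_sets = []
for training_indexes, testing_indexes in toy_splits:
    # 1. Train on the i-th set of training labels.
    model = train_majority_model([toy_labels[i] for i in training_indexes])
    # 2. Apply the trained model to the i-th set of testing data.
    predictions = predict_with_majority_model(model, [toy_events[i] for i in testing_indexes])
    # 3. Save the predicted labels for this split.
    toy_predicted_label_sets.append(predictions)

print(toy_predicted_label_sets)
```

The result is one list of predicted labels per split, each the same length as that split's test set.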

For this example, come up with a silly set of predicted labels for each of our 10 training/testing splits. This is the format of predicted labels expected by the performance analysis in the next section.

In [None]:
%%R
# Draw a random 50/50 label for every test sample in each split.
# length() is used rather than nrow() so this also works when the
# test indexes arrive as a plain vector with no dim attribute.
predicted_label_sets <- lapply(train_test_splits, function(train_test_split) {
    sample(c('normal', 'bad'), length(train_test_split[[2]]), replace=TRUE, prob=c(0.5, 0.5))
})

## Model Performance and Analysis


Import the predictions from the R session into Python:

In [None]:
%Rpull predicted_label_sets

In [None]:
# rpy2 returns an r-type list of character vectors. Convert each set of predictions into a numpy 
# array for processing with numpy/pandas/sklearn etc.
predicted_label_sets = [numpy.array(predicted_labels) for predicted_labels in predicted_label_sets]

The performance analysis could be done in R or Python. Here is an example in Python [^4]:

In [None]:
true_label_sets = [connection_labels.iloc[testing_indexes] for training_indexes, testing_indexes in train_test_splits]

In [None]:
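confusion_matrixes = [
    # Fix the label order so rows and columns are [normal, bad]. By default
    # scikit-learn sorts the labels, which would put 'bad' first and no
    # longer match the row/column names used for the summary below.
    metrics.confusion_matrix(true_labels, predicted_labels, labels=['normal', 'bad'])
    for true_labels, predicted_labels
    in zip(true_label_sets, predicted_label_sets)
]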
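The row and column order of a confusion matrix depends entirely on the order in which the labels are listed, so it is worth being explicit about it before attaching names such as "True Normal". The dependency-free sketch below shows the same toy predictions counted under two label orders. The function `confusion_counts` and the toy label lists are assumptions for illustration, not part of scikit-learn.

```python
def confusion_counts(true_labels, predicted_labels, label_order):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    position = {label: i for i, label in enumerate(label_order)}
    matrix = [[0] * len(label_order) for _ in label_order]
    for true_label, predicted_label in zip(true_labels, predicted_labels):
        matrix[position[true_label]][position[predicted_label]] += 1
    return matrix


# Toy labels, made up for illustration.
toy_true = ['normal', 'normal', 'bad', 'bad', 'bad']
toy_predicted = ['normal', 'bad', 'bad', 'bad', 'normal']

# Rows/columns ordered [normal, bad].
normal_first = confusion_counts(toy_true, toy_predicted, ['normal', 'bad'])
print(normal_first)  # [[1, 1], [1, 2]]

# Rows/columns ordered [bad, normal]: same counts, different layout.
bad_first = confusion_counts(toy_true, toy_predicted, ['bad', 'normal'])
print(bad_first)  # [[2, 1], [1, 1]]
```

The two matrices hold the same four counts in different cells, which is why the names given to the rows and columns must follow the same order as the labels used to build the matrix.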

In [None]:
summary_confusion_matrix = sum(confusion_matrixes)
summary_confusion_matrix = pandas.DataFrame(
    data=summary_confusion_matrix, 
    index=['True Normal', 'True Bad'], 
    columns=['Predicted Normal', 'Predicted Bad'],
)

summary_confusion_figure, summary_confusion_axes = matplotlib.pyplot.subplots()
summary_confusion_axes.set_title(
    'Confusion matrix showing the predicted vs. true \n'
    'class of "normal" and "bad" network connections.'
)
seaborn.heatmap(
    summary_confusion_matrix,
    annot=True,
    fmt="d",
    cmap=seaborn.color_palette("Blues"),
    vmin=0,
    ax=summary_confusion_axes,
)

In [None]:
def sensitivity(confusion_matrix):
    true_positives = confusion_matrix['Predicted Normal']['True Normal']
    false_negatives = confusion_matrix['Predicted Bad']['True Normal']
    return true_positives/(true_positives+false_negatives)

print('Sensitivity: {:.2f}%'.format(
    sensitivity(summary_confusion_matrix)*100
))

In [None]:
def specificity(confusion_matrix):
    false_positives = confusion_matrix['Predicted Normal']['True Bad']
    true_negatives = confusion_matrix['Predicted Bad']['True Bad']
    return true_negatives/(true_negatives+false_positives)

print('Specificity: {:.2f}%'.format(
    specificity(summary_confusion_matrix)*100
))

Unsurprisingly, guessing each label uniformly at random with a 50/50 split did not work particularly well. In fact, it gave sensitivity and specificity measures of around $50\%$.
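The $50\%$ figure follows directly from the definitions, whatever the class balance. The sketch below works through the expected counts for a coin-flip predictor, treating `normal` as the positive class as the functions above do. The class sizes and the helper `sensitivity_and_specificity` are hypothetical, chosen only to make the arithmetic concrete.

```python
def sensitivity_and_specificity(true_positives, false_negatives, false_positives, true_negatives):
    """Compute both measures from the four cells of a binary confusion matrix."""
    sensitivity = true_positives / (true_positives + false_negatives)
    specificity = true_negatives / (true_negatives + false_positives)
    return sensitivity, specificity


# Hypothetical class sizes, deliberately imbalanced.
n_normal, n_bad = 1000, 4000
# Probability that the coin-flip predictor says 'normal'.
p_predict_normal = 0.5

# Expected confusion matrix cells, with 'normal' as the positive class.
expected_true_positives = n_normal * p_predict_normal         # normal predicted normal
expected_false_negatives = n_normal * (1 - p_predict_normal)  # normal predicted bad
expected_false_positives = n_bad * p_predict_normal           # bad predicted normal
expected_true_negatives = n_bad * (1 - p_predict_normal)      # bad predicted bad

expected_measures = sensitivity_and_specificity(
    expected_true_positives,
    expected_false_negatives,
    expected_false_positives,
    expected_true_negatives,
)
print(expected_measures)  # (0.5, 0.5)
```

Because the class sizes cancel out of both ratios, any useful model should be judged against this $50\%$ baseline rather than against the raw class proportions.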

## References

[^1]: rpy2, https://rpy2.bitbucket.io/.

[^2]: KDD-CUP-99 Task Description, http://kdd.ics.uci.edu/databases/kddcup99/task.html.

[^3]: Hettich, S. and Bay, S. D. (1999). The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science.

[^4]: Data Science Toolbox: Assignment 1, https://github.com/dj311/data-science-toolbox-1.

[^5]: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.