# Intrusion Detection on the KDD Cup 99 Data Set: Documentation
*— Daniel Jones*

--- 



## Notebook Setup

Libraries for Python session (see `readme.md` for installation instructions):

In [None]:
import matplotlib.pyplot  # also binds `matplotlib`; pyplot is needed later for matplotlib.pyplot.subplots()
import numpy
import pandas
import seaborn

from sklearn import metrics
from sklearn import model_selection

import warnings
warnings.filterwarnings('ignore')

Set up Jupyter with `rpy2` to allow embedding R, and with `matplotlib` to allow inline plots.

In [None]:
# For inline plots within the notebook
%matplotlib notebook
# Allows code cells to be interpreted as R (put %%R on the first line) [^1]
%load_ext rpy2.ipython

Libraries for R session (see `readme.md` for installation instructions):

In [None]:
%%R
library(ggplot2)

In [None]:
random_state = numpy.random.RandomState(0)

## Data Source

Each row in the data represents a single TCP connection, as described in the original task description [^2]:
> A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol.  Each connection is labeled as either normal, or as an attack, with exactly one specific attack type.



In [None]:
columns=['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'traffic_type']
kdd_connections = pandas.read_csv('http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz', names=columns)

In [None]:
kdd_connections.head(10).transpose()

The `traffic_type` column describes the source of each connection: either the name of the red-team attack that caused the event, or the string `normal.`, which indicates normal network behaviour.

The task is to create a model which can separate red-team behaviour from normal network behaviour. Group the data into two labels, `normal` and `bad`:

In [None]:
def generate_label(traffic_type):
    return 'normal' if traffic_type == 'normal.' else 'bad'

kdd_connections['traffic_type'] = kdd_connections['traffic_type'].apply(func=generate_label)
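
Before modelling, it can help to check how the two labels are balanced. The sketch below applies the same grouping rule to a small hypothetical list of raw labels (the sample values and the names `sample_traffic_types`, `label_counts` and `label_shares` are illustrative assumptions, not real counts from the data set). On the real frame, `kdd_connections['traffic_type'].value_counts()` should give the equivalent tally.

```python
from collections import Counter


def generate_label(traffic_type):
    # Same rule as the cell above: anything other than 'normal.' counts as an attack.
    return 'normal' if traffic_type == 'normal.' else 'bad'


# Hypothetical sample of raw labels, for illustration only (not real counts).
sample_traffic_types = ['normal.', 'smurf.', 'smurf.', 'neptune.', 'normal.', 'smurf.']

# Tally the grouped labels and convert the tallies into shares of the total.
label_counts = Counter(generate_label(t) for t in sample_traffic_types)
total = sum(label_counts.values())
label_shares = {label: count / total for label, count in label_counts.items()}

print(label_counts)
print(label_shares)
```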

Next, separate out the labels from the data set. 

In [None]:
traffic_types = kdd_connections['traffic_type']
del kdd_connections['traffic_type']

## Training and Testing Data

It is now necessary to split the data into training and testing sets. When doing this we should ask the following questions:

  1. Does the ratio of normal and bad connections need to be similar in the training and testing data? If so, we should use stratified sampling.
    - **TODO** What do we think? We should write down our reasoning when we make a decision.
  2. Should we consider k-fold cross-validation?
    - **TODO** What do we think? Write down reasoning. 
    - This is quick to implement, but would require each of us to write our models in such a way that they are repeatable.
    
For no particular reason, split the data into 90% training and 10% testing randomly and without stratification:

In [None]:
training_data, testing_data, training_labels, testing_labels = model_selection.train_test_split( 
    kdd_connections,
    traffic_types,
    test_size=1/10,
    random_state=random_state,
)
len(training_data), len(testing_data)
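
To make the two open questions concrete, here is a minimal standard-library sketch of what stratification and k-fold splitting do to the sample indices. The helpers `stratified_split` and `k_fold_indices` and the toy labels are hypothetical names introduced for illustration; they are not part of the notebook. In practice, `model_selection.train_test_split` accepts a `stratify` argument and `model_selection.StratifiedKFold` covers the k-fold case, which would probably be the simpler route.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices), keeping each label's share in both sets."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train_indices, test_indices = [], []
    for label_indices in indices_by_label.values():
        rng.shuffle(label_indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(label_indices) * test_fraction)
        test_indices.extend(label_indices[:n_test])
        train_indices.extend(label_indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of k shuffled folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test_fold)


# Hypothetical toy labels: 80 'bad' and 20 'normal' connections.
toy_labels = ['bad'] * 80 + ['normal'] * 20
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.1)
test_label_counts = {
    label: sum(1 for i in test_idx if toy_labels[i] == label)
    for label in ('bad', 'normal')
}
print(test_label_counts)  # {'bad': 8, 'normal': 2}
```

Fixing the seed in both helpers is one way to address the repeatability concern raised above: every model would then see identical folds.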

The cell below imports the training and testing sets into the R session, ready for analysis and modelling.

In [None]:
%R -i training_data -i testing_data -i training_labels -i testing_labels

## Data Model

Quick demo of using the data within the R session:

In [None]:
%%R
head(training_labels)

In [None]:
%%R
qplot(x=src_bytes, y=dst_bytes, data=testing_data, geom='point')

From our initial analysis of the data, it is clear that the `normal`/`bad` labels are assigned to the samples uniformly at random. The following model reflects this:

In [None]:
%%R
predicted_labels <- sample(c('normal', 'bad'), nrow(testing_data), replace=TRUE, prob=c(0.5, 0.5) )
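
As a rough guide to what this coin-flip baseline should score, the sketch below simulates it in Python on hypothetical, deliberately imbalanced labels (the 80/20 split, the seed and the helper `recall_for` are assumptions made for illustration). Whatever the class balance, a 50/50 random guesser is expected to get about half of each class right, so any real model should beat these numbers.

```python
import random

rng = random.Random(0)

# Hypothetical, deliberately imbalanced ground truth: 80% 'bad', 20% 'normal'.
true_labels = ['bad'] * 8000 + ['normal'] * 2000

# Coin-flip predictions, mirroring sample(..., prob=c(0.5, 0.5)) in the R cell.
predicted = [rng.choice(['normal', 'bad']) for _ in true_labels]


def recall_for(label):
    """Fraction of connections with this true label that were predicted correctly."""
    hits = sum(1 for t, p in zip(true_labels, predicted) if t == label and p == label)
    return hits / true_labels.count(label)


accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
print('Accuracy: {:.2f}'.format(accuracy))
print('Recall (normal): {:.2f}'.format(recall_for('normal')))
print('Recall (bad): {:.2f}'.format(recall_for('bad')))
```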

## Model Performance and Analysis


Import the predictions from the R session into Python:

In [None]:
%R -o predicted_labels

In [None]:
# In this case, rpy2 returns an R vector. Convert it into a numpy array for further processing:
predicted_labels = numpy.array(predicted_labels)

In [None]:
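# Fix the class order explicitly: by default sklearn sorts the labels alphabetically
# ('bad' before 'normal'), which would not match the row and column names below.
confusion_matrix = metrics.confusion_matrix(testing_labels, predicted_labels, labels=['normal', 'bad'])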
confusion_matrix = pandas.DataFrame(
    data=confusion_matrix, 
    index=['True Normal', 'True Bad'], 
    columns=['Predicted Normal', 'Predicted Bad'],
)
perfect_model_confusion_figure, perfect_model_confusion_axes = matplotlib.pyplot.subplots()
perfect_model_confusion_axes.set_title(
    'Confusion matrix showing the predicted vs. true \n'
    'class of "normal" and "bad" network connections.'
)
seaborn.heatmap(
    confusion_matrix,
    annot=True,
    fmt="d",
    cmap=seaborn.color_palette("Blues"),
    vmin=0,
    ax=perfect_model_confusion_axes,
)

In [None]:
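def sensitivity(confusion_matrix):
    # "Normal" is treated as the positive class here.
    true_positives = confusion_matrix['Predicted Normal']['True Normal']
    # False negatives are normal connections that were predicted as bad.
    false_negatives = confusion_matrix['Predicted Bad']['True Normal']
    return true_positives/(true_positives+false_negatives)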

print('Sensitivity: {:.2f}%'.format(
    sensitivity(confusion_matrix)*100
))

In [None]:
def specificity(confusion_matrix):
    false_positives = confusion_matrix['Predicted Normal']['True Bad']
    true_negatives = confusion_matrix['Predicted Bad']['True Bad']
    return true_negatives/(true_negatives+false_positives)

print('Specificity: {:.2f}%'.format(
    specificity(confusion_matrix)*100
))
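
As a quick sanity check on the two definitions, the worked example below computes both metrics from a small set of hypothetical counts, with "normal" treated as the positive class as in the cells above. The counts and the names `counts`, `sensitivity_from_counts` and `specificity_from_counts` are illustrative assumptions only.

```python
# Hypothetical counts keyed by (true label, predicted label).
# "normal" is treated as the positive class.
counts = {
    ('normal', 'normal'): 40,  # true positives
    ('normal', 'bad'): 10,     # false negatives
    ('bad', 'normal'): 30,     # false positives
    ('bad', 'bad'): 90,        # true negatives
}


def sensitivity_from_counts(counts):
    # Share of truly normal connections that were predicted normal.
    true_positives = counts[('normal', 'normal')]
    false_negatives = counts[('normal', 'bad')]
    return true_positives / (true_positives + false_negatives)


def specificity_from_counts(counts):
    # Share of truly bad connections that were predicted bad.
    true_negatives = counts[('bad', 'bad')]
    false_positives = counts[('bad', 'normal')]
    return true_negatives / (true_negatives + false_positives)


print('Sensitivity: {:.2f}%'.format(sensitivity_from_counts(counts) * 100))  # Sensitivity: 80.00%
print('Specificity: {:.2f}%'.format(specificity_from_counts(counts) * 100))  # Specificity: 75.00%
```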

Not only has this unique model proven truly groundbreaking, the visualisation method doesn't reflect the bias in class sizes at all.
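
One possible way to make the heatmap comparable across classes of different sizes is to divide each row of the confusion matrix by its total, so that every cell shows a share of the true class instead of a raw count. The sketch below does this on hypothetical counts; `normalise_rows` and the numbers are assumptions for illustration. Recent scikit-learn versions also accept `normalize='true'` in `metrics.confusion_matrix`, in which case the heatmap format would need to change from `"d"` to something like `".2f"`.

```python
def normalise_rows(matrix):
    """Divide every row by its total so that each non-empty row sums to 1."""
    normalised = []
    for row in matrix:
        total = sum(row)
        # Leave an all-zero row as zeros to avoid dividing by zero.
        normalised.append([cell / total if total else 0.0 for cell in row])
    return normalised


# Hypothetical counts: rows are true normal / true bad,
# columns are predicted normal / predicted bad.
raw_counts = [
    [30, 10],
    [200, 600],
]
print(normalise_rows(raw_counts))  # [[0.75, 0.25], [0.25, 0.75]]
```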

## References

[^1]: rpy2, https://rpy2.bitbucket.io/.

[^2]: KDD-CUP-99 Task Description, http://kdd.ics.uci.edu/databases/kddcup99/task.html.
