# Intrusion Detection on the KDD Cup 99 Data Set
*An investigation into models and performance metrics for the classification of network data.*

--- 

**TODO** Standardize on what we consider a positive and negative result e.g. is normal positive, or is bad positive? This will ensure our write-ups and code are consistent and make sense together
  - Positive :: detection of intrusion (`bad.` label)
  - Negative :: detected as normal traffic (`normal.` label)

**TODO** Use the same labels for normal and bad data. We've decided on `normal.` and `bad.`.

**TODO** Introduction. Should include:

  * Introduce the KDD Cup:
    - what were its aims?
    - what does the data set look like? features etc.
  * Prior work on the kdd 99 data set
    - extensively studied
    - papers, code we found, etc.
  * Introduce our intended approach, talk about:
    - choose a number of approaches to compare
    - lots of thought has gone into our performance metric <-- make this clear
    
We decided to consider three approaches to this problem - logistic regression, logistic regression with penalisation, and support vector machines. We have allocated the work as follows:
  1. Shanglin will work on the logistic regression model.
  2. Daniel will work on an extension of logistic regression using penalisation and feature selection.
  3. Kishalay will work on a model using support vector machines.



## Preliminaries

The notebook needs to be set up with the required libraries and dependencies loaded.

In [None]:
import matplotlib 
import numpy
import pandas
import seaborn

from sklearn import decomposition
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing
from statsmodels import api as sm

import warnings
warnings.filterwarnings('ignore')

In [None]:
# For inline plots within the notebook
%matplotlib inline
# Allows code cells to be interpreted as R (put %%R on the first line) [1]
%load_ext rpy2.ipython
# Render R output as HTML
from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")
html.init_printing()

In [None]:
%%R
library(caret)
library(ggplot2)
library(data.table)
library(lattice)

Seed the R and Python random number generators to ensure we get consistent results:

In [None]:
random_state = numpy.random.RandomState(0)

In [None]:
%%R
set.seed(1)

## Data Source (TODO: Dan)

The dataset we used for this project was originally used in the 1999 KDD Intrusion Detection Contest [2] [3]. This data

Each row in the data represents a single TCP connection, as described in the original task description [2]:
> A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol.  Each connection is labeled as either normal, or as an attack, with exactly one specific attack type.




In [None]:
columns=['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'connection_label']
connection_events = pandas.read_csv('http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz', names=columns)  # [3]

In [None]:
connection_events.head(10).transpose()

The `connection_label` column describes the source of each connection: either the name of the red-team which caused the event, or the string `normal.`, which indicates normal network behaviour.

The task is to create a model which can separate red-team behaviour from normal network behaviour, so we group the data into two labels: `normal` and `bad`.

In [None]:
def generate_label(connection_label):
    return 'normal' if connection_label == 'normal.' else 'bad'

connection_events['connection_label'] = connection_events['connection_label'].apply(func=generate_label)

Next, separate out the labels from the data set. 

In [None]:
connection_labels = connection_events.filter(['connection_label'], axis='columns')
connection_events = connection_events.drop(['connection_label'], axis='columns')
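
After relabelling it is worth checking how many events fall into each class, since a heavily imbalanced data set affects how the metrics later in this notebook should be read. The sketch below applies the same relabelling rule to a small, hypothetical list of raw labels (`smurf.` and `neptune.` are attack labels from the data set, but the counts here are made up) and uses only the standard library, so it does not depend on the downloaded data.

```python
from collections import Counter

def generate_label(connection_label):
    # Same rule as the notebook: anything that is not 'normal.' is 'bad'
    return 'normal' if connection_label == 'normal.' else 'bad'

# Hypothetical raw labels, for illustration only
raw_labels = ['normal.', 'smurf.', 'smurf.', 'neptune.', 'normal.', 'smurf.']

binary_labels = [generate_label(label) for label in raw_labels]
class_counts = Counter(binary_labels)
bad_fraction = class_counts['bad'] / len(binary_labels)

print(class_counts)
print(round(bad_fraction, 2))
```

On the real data frame, `connection_labels['connection_label'].value_counts()` should give the equivalent counts.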

## Performance Metrics for Classification
From spec:
> - Together agree and test a performance metric.
>   - Half of the effort should be devoted to exploring appropriate performance measures.
>   - You should create a test and validation dataset, but you may choose how to do this.


From spec:
> * Think about the circumstances by which your chosen performance metric will lead to real-world generalisability, and how it might compromise this for the purpose of standardization.
> * Demonstrate this with data and/or simulation; for example, if you believe that you can predict new types of data, you could demonstrate this by leaving out some types of data and observing your performance.
> * Examine in what sense your group’s best method is truly best.
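
One way to act on the spec's suggestion of leaving out some types of data is to hold out every event of one attack type from training and test only on those events. The function below is a minimal sketch of that idea. The name `hold_out_attack_type` and the example labels are hypothetical, and the split has to be made on the original attack labels, before they are collapsed into `normal` and `bad`.

```python
def hold_out_attack_type(raw_labels, held_out):
    """Return (training_indices, testing_indices).

    Every event whose raw label equals `held_out` is excluded from the
    training set and placed in the test set instead.
    """
    training_indices = [i for i, label in enumerate(raw_labels) if label != held_out]
    testing_indices = [i for i, label in enumerate(raw_labels) if label == held_out]
    return training_indices, testing_indices

# Hypothetical raw labels, for illustration only
raw_labels = ['normal.', 'smurf.', 'neptune.', 'smurf.', 'normal.', 'back.']
training_indices, testing_indices = hold_out_attack_type(raw_labels, 'smurf.')

print(training_indices)
print(testing_indices)
```

A test set built this way contains only attacks, so it measures sensitivity on an unseen attack type. Normal traffic would need to be added to it before specificity could be estimated.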

### Cross-Validation (TODO: Kish)
  - Decided on k-fold, why? Brief comparison to other types.
  

The code below performs stratified 10-fold cross-validation, providing the list `train_test_splits`. It does this using the [model selection](http://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold) part of the scikit-learn library [4].

In [None]:
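# random_state only has an effect when shuffle=True (recent scikit-learn versions
# raise an error if it is set without shuffling).
k_fold_splitter = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)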
train_test_splits = k_fold_splitter.split(
    connection_events,  # data to be split
    connection_labels,  # target/class to split by
)

# Force evaluation of the train_test_splits generator into a list. This needs to be
# done before it's sent to R.
train_test_splits = [
    [training_indices, testing_indices]
    for training_indices, testing_indices in train_test_splits
]

# Here train_test_splits is a list, where each item represents a single 90/10 split of 
# training and testing data respectively. The values contained are 0-based indices
# of the rows in each part of the split, i.e:
#
#   train_test_splits = [
#      (indices_of_training_samples, indices_of_test_samples),  # first split
#      (indices_of_training_samples, indices_of_test_samples),  # second split
#      ...
#      (indices_of_training_samples, indices_of_test_samples),  # kth split
#  ]
#
# These indices can then be used to fetch the data samples and their labels,
# ready for training and testing.
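
To make the effect of stratification concrete, the sketch below assigns rows to folds so that every fold keeps the class proportions of the full data. It is a simplified, pure-Python illustration of the idea, not scikit-learn's implementation: it does no shuffling, and `stratified_folds` is a hypothetical helper name.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each row index to one of n_splits test folds so that every
    fold has (almost) the same class proportions as the full data."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for class_indices in indices_by_class.values():
        # Deal the rows of each class round-robin across the folds
        for position, index in enumerate(class_indices):
            folds[position % n_splits].append(index)
    return folds

# Hypothetical labels: two thirds 'bad', one third 'normal'
labels = ['bad'] * 8 + ['normal'] * 4
folds = stratified_folds(labels, n_splits=4)

# (number of 'bad' rows, number of 'normal' rows) in each fold
fold_class_counts = [
    (sum(labels[i] == 'bad' for i in fold), sum(labels[i] == 'normal' for i in fold))
    for fold in folds
]
print(fold_class_counts)
```

Each fold here holds two `bad` rows and one `normal` row, matching the 2:1 ratio of the whole list. A plain (unstratified) split gives no such guarantee.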

The cell below imports the training and testing sets into the R session, ready for analysis and modelling.

In [None]:
%Rpush train_test_splits connection_events connection_labels

The helper functions below extract training and test rows from a data frame:

In [None]:
%%R

get_training_rows <- function(dataframe, train_test_split) {
    indices <- train_test_split[[1]]  # Training indices are the first item in a train_test_split
    
    # Python indices start at 0, whilst R indices start at 1. Correct for this by incrementing each index by 1:
    indices <- indices + 1
    
    dataframe[indices,]
}

get_testing_rows <- function(dataframe, train_test_split) {
    indices <- train_test_split[[2]]  # Testing indices are the second item in a train_test_split
    
    # Python indices start at 0, whilst R indices start at 1. Correct for this by incrementing each index by 1:
    indices <- indices + 1  
    
    dataframe[indices,]
}

### Confusion Matrix (TODO: Shanglin)
- We chose to use one, why?
- What other options were there
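
The summary statistics in the next section index into a 2x2 confusion matrix, so it helps to pin down its layout. The sketch below builds one from predicted and true labels, assuming the layout implied by those functions and by the kappa comment (rows are predictions, columns are truths) with the positive class `bad` first. `build_confusion_matrix` is a hypothetical helper and the labels are made up.

```python
def build_confusion_matrix(predicted, actual, positive='bad', negative='normal'):
    """Rows are predictions, columns are truths, positive class first:

        [[true_positives,  false_positives],
         [false_negatives, true_negatives]]
    """
    classes = [positive, negative]
    matrix = [[0, 0], [0, 0]]
    for predicted_label, actual_label in zip(predicted, actual):
        matrix[classes.index(predicted_label)][classes.index(actual_label)] += 1
    return matrix

# Hypothetical labels, for illustration only
actual = ['bad', 'bad', 'bad', 'normal', 'normal', 'bad']
predicted = ['bad', 'bad', 'normal', 'normal', 'bad', 'bad']

confusion_matrix = build_confusion_matrix(predicted, actual)
print(confusion_matrix)
```

Note that `sklearn.metrics.confusion_matrix` puts true labels on the rows and predictions on the columns, so its output would need transposing (and its label order checking) before being passed to functions that assume this layout.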


### Summary Statistics (TODO: Shanglin, Dan, Kish)

Wikipedia has a table covering many of these statistics: https://en.wikipedia.org/wiki/Sensitivity_and_specificity. A comparison of two further ones, including the Matthews correlation coefficient, is at: http://standardwisdom.com/softwarejournal/2011/12/matthews-correlation-coefficient-how-well-does-it-do/

  - Accuracy

In [None]:
def accuracy(confusion_matrix):
    true_positives = confusion_matrix[0][0]
    true_negatives = confusion_matrix[1][1]
    false_positives = confusion_matrix[0][1]
    false_negatives = confusion_matrix[1][0]
    total = true_positives + false_positives + true_negatives + false_negatives
    return (true_positives + true_negatives) / total

  - Sensitivity
  

In [None]:
def sensitivity(confusion_matrix):
    true_positives = confusion_matrix[0][0]
    false_negatives = confusion_matrix[1][0]
    return true_positives/(true_positives+false_negatives)

  - Specificity
  

In [None]:
def specificity(confusion_matrix):
    false_positives = confusion_matrix[0][1]
    true_negatives = confusion_matrix[1][1]
    return true_negatives/(true_negatives+false_positives)
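
Accuracy alone can be misleading when one class dominates. The sketch below uses a hypothetical test set of 100 events, 80 of them attacks, and a degenerate classifier that labels every event `bad`. The three functions are compact restatements of the definitions in this section, using the same matrix layout (rows are predictions, columns are truths, positive class first).

```python
def accuracy(confusion_matrix):
    (tp, fp), (fn, tn) = confusion_matrix
    return (tp + tn) / (tp + fp + fn + tn)

def sensitivity(confusion_matrix):
    (tp, fp), (fn, tn) = confusion_matrix
    return tp / (tp + fn)

def specificity(confusion_matrix):
    (tp, fp), (fn, tn) = confusion_matrix
    return tn / (tn + fp)

# Hypothetical counts: every event is predicted 'bad', so the second
# (predicted 'normal') row is empty.
always_bad = [[80, 20],
              [0, 0]]

print(accuracy(always_bad))     # 0.8
print(sensitivity(always_bad))  # 1.0
print(specificity(always_bad))  # 0.0
```

The classifier scores 80% accuracy and perfect sensitivity while never recognising normal traffic, which is why a single headline number is not enough to compare the models.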

  - Kappa  (https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english). Here is an article on the kappa statistic: https://standardwisdom.com/softwarejournal/2011/12/confusion-matrix-another-single-value-metric-kappa-statistic/

In [None]:
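def kappa(confusion_matrix):
    # Worked example (rows are predictions, columns are truths):
    #
    #               Cats  Dogs   (truths)
    #     Cats    |  22 |   9 |
    #     Dogs    |   7 |  13 |
    #     (predictions)
    #
    #     Ground truth: Cats (29), Dogs (22)
    #     Machine Learning Classifier: Cats (31), Dogs (20)
    #     Total: (51)
    #     Observed Accuracy: (22 + 13) / 51 = 0.686
    #     Expected Accuracy: ((29 * 31 / 51) + (22 * 20 / 51)) / 51 = 0.515
    #     Kappa: (0.686 - 0.515) / (1 - 0.515) = 0.353
    matrix = numpy.asarray(confusion_matrix, dtype=float)
    total = matrix.sum()
    observed_accuracy = numpy.trace(matrix) / total
    # Chance agreement: for each class, (truth total * prediction total) / total
    expected_accuracy = (matrix.sum(axis=0) * matrix.sum(axis=1)).sum() / total ** 2
    return (observed_accuracy - expected_accuracy) / (1 - expected_accuracy)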
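
The Matthews correlation coefficient, covered in the comparison linked at the top of this section, is another single-value summary of a confusion matrix. The sketch below is a standard-library version that assumes the same layout as the other metric functions (rows are predictions, columns are truths). Returning 0.0 when a marginal total is zero is a common convention rather than something decided in this notebook.

```python
import math

def matthews_correlation(confusion_matrix):
    (tp, fp), (fn, tn) = confusion_matrix
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when any row or column total is zero; 0.0 by convention
        return 0.0
    return (tp * tn - fp * fn) / denominator

# The cats-and-dogs matrix from the kappa comment above
example_matrix = [[22, 9],
                  [7, 13]]
mcc = matthews_correlation(example_matrix)
print(round(mcc, 3))

# A classifier that predicts a single class for every event
degenerate_mcc = matthews_correlation([[80, 20], [0, 0]])
print(degenerate_mcc)
```

Unlike accuracy, the coefficient gives no credit to a classifier that predicts one class for everything.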

## Data Models

**TODO** Dan

From spec:
> For example, you could look to predict the next event on each
edge based on past events on this edge, or you could model the network at a
more global level, and many other approaches are possible.

**TODO**: We decided to model it on a global scale, why?
   - In real life it is hard to keep track of state between events (especially for high-traffic websites); it is easier to classify each incoming event individually, as it arrives, with a previously trained model.

### Logistic Regression (TODO: Shanglin)
**TODO** k-fold cross validation.

In [None]:
%%R

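# The data file has no header row, and the cells below rely on factor columns
kddata<-read.csv("../data/kddcup.data_10_percent.gz", header=FALSE, stringsAsFactors=TRUE)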

kddnames=read.table(
    "../data/kddcup.names",
    sep=":",
    skip=1,
    as.is=T
)

colnames(kddata)=c(kddnames[,1], "normal")

In [None]:
%%R

#set normal col to two levels:normal=0,non-normal=1
levels(kddata$normal)[which(levels(kddata$normal) !='normal.')] <- "non-normal."
kddata$normal<-as.numeric(factor(kddata$normal,levels = c("normal.","non-normal."))) -1 

kddata_normal<-kddata[which(kddata$normal==0),]
kddata_nonnormal<-kddata[which(kddata$normal==1),]

In [None]:
%%R

#set service col to three levels:private,http,others
levels(kddata$service)[which(levels(kddata$service) != "http" & levels(kddata$service) != "private")] <- "others"
kddata$service<-as.numeric(factor(kddata$service,levels = c("private","http","others")))

In [None]:
%%R

#train and test
kddata_normal<-kddata[which(kddata$normal==0),]
kddata_nonnormal<-kddata[which(kddata$normal==1),]

sample_normal<-sample(seq_len(nrow(kddata_normal)),size=floor(.80*nrow(kddata_normal)))
sample_nonnormal<-sample(seq_len(nrow(kddata_nonnormal)),size=floor(.80*nrow(kddata_nonnormal)))

train_normal<-kddata_normal[sample_normal,]
test_normal<-kddata_normal[-sample_normal,]

train_nonnormal<-kddata_nonnormal[sample_nonnormal,]
test_nonnormal<-kddata_nonnormal[-sample_nonnormal,]

train<-rbind(train_normal,train_nonnormal)
test<-rbind(test_normal,test_nonnormal)

head(train)

In [None]:
%%R

#fit in logit model
model1<-glm(normal~service+logged_in+srv_count, family=binomial(link='logit'), data=train)

summary(model1)

In [None]:
%%R

fitted.results <- predict(model1,newdata=test,type = 'response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
confusion<-table(fitted.results,test$normal)
accuracy<-(confusion[1,1]+confusion[2,2])/length(fitted.results)

In [None]:
%%R

accuracy

In [None]:
%%R

confusion
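
The 0.5 cut-off applied to the fitted probabilities is a choice, and moving it trades sensitivity against specificity. The sketch below recomputes the confusion matrix at several thresholds for a handful of hypothetical predicted probabilities. It uses the R cells' coding (1 = non-normal, 0 = normal) and the matrix layout from the summary statistics section (rows are predictions, columns are truths, positive class first). `confusion_at_threshold` is a hypothetical helper.

```python
def confusion_at_threshold(probabilities, actual, threshold):
    """Return [[tp, fp], [fn, tn]], treating 1 (non-normal) as positive."""
    tp = fp = fn = tn = 0
    for probability, truth in zip(probabilities, actual):
        predicted = 1 if probability > threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return [[tp, fp], [fn, tn]]

# Hypothetical model outputs and true classes, for illustration only
probabilities = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
actual = [1, 1, 0, 1, 0, 0]

results = {
    threshold: confusion_at_threshold(probabilities, actual, threshold)
    for threshold in (0.25, 0.5, 0.75)
}
for threshold, matrix in results.items():
    print(threshold, matrix)
```

Lowering the threshold catches every attack in this toy example at the cost of more false alarms, and raising it does the opposite.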

### Logistic Regression with Penalisation (TODO: Dan)

### Support Vector Machines (TODO: Kish)


In [None]:
%%R

library(data.table)
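# The data file has no header row, and the cells below rely on factor columns
kddata = read.csv("../data/kddcup.data_10_percent.gz", header = FALSE, stringsAsFactors = TRUE)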
kddnames = read.table("../data/kddcup.names",sep = ":", skip = 1, as.is = T)
colnames(kddata)=c(kddnames[,1],"normal")
kddata
length(kddata$normal)

In [None]:
%%R

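# unique_temp was not defined anywhere in the notebook; it is assumed here to hold
# the unique values of each column, so that constant columns can be found
unique_temp = lapply(kddata, unique)
colcounts = lapply(unique_temp,length)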
constants = colcounts==1
constants
not_constants = !constants
not_constants
pruned_kddata = kddata[,not_constants]
pruned_kddata

In [None]:
%%R

levels(pruned_kddata$normal) <- c(levels(pruned_kddata$normal), 'bad.')
pruned_kddata[which(pruned_kddata$normal!='normal.'), 'normal'] <- 'bad.'
pruned_kddata[which(pruned_kddata$normal!='normal.'),]
pruned_kddata$normal = factor(pruned_kddata$normal, levels = c('normal.','bad.'))
pruned_kddata


In [None]:
%%R

dataset1 = pruned_kddata[pruned_kddata$normal=="normal.",]
dataset1
dataset2 = pruned_kddata[pruned_kddata$normal=="bad.",]
dataset2
length(dataset2$normal)  

In [None]:
%%R

sample1_indexes = sample(nrow(dataset1), size = floor(0.8*nrow(dataset1)), prob=NULL)
sample1 = dataset1[sample1_indexes,]
sample1
length((sample1$normal))

In [None]:
%%R

sample2_indexes = sample(nrow(dataset2), size = floor(0.2*nrow(dataset2)),prob=NULL)
sample2 = dataset2[sample2_indexes,]
sample2
length((sample2$normal))

In [None]:
%%R

main_sample1 = rbind(sample1,sample2)
main_sample1
test_data1 = dataset1[-sample1_indexes,]
test_data1
length(test_data1$normal)
test_data2 = dataset2[-sample2_indexes,]
length(test_data2$normal)
main_test_data = rbind(test_data1,test_data2)
main_test_data

In [None]:
%%R

install.packages('e1071')
library(e1071)
classifier = svm(formula = normal ~ ., data = main_sample1, type = "C-classification", kernel = "linear")

y_pred = predict(classifier, newdata = main_test_data)
table(main_test_data$normal, y_pred)

In [None]:
%%R

install.packages('caret')
library(caret)
library(e1071)
folds = createFolds(main_sample1$normal, k = 10)
cv = lapply(folds, function(x) {
  training_fold = main_sample1[-x, ]
  test_fold = main_sample1[x, ]
  classifier = svm(formula = normal ~ ., data = training_fold, type = "C-classification", kernel = "linear")
  y_pred = predict(classifier, newdata = test_fold)
  cm = table(test_fold$normal, y_pred)
  accuracy = (cm[1,1] + cm[2,2])/(cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
  })
cv
Macc = mean(as.numeric(cv))
Macc
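
The mean accuracy across folds hides how much the folds disagree with each other. The sketch below reports the spread alongside the mean, using the standard library and a hypothetical list of per-fold accuracies (the values are made up and are not results from the model above).

```python
import statistics

# Hypothetical per-fold accuracies, for illustration only
fold_accuracies = [0.97, 0.98, 0.96, 0.99, 0.97]

mean_accuracy = statistics.mean(fold_accuracies)
accuracy_spread = statistics.stdev(fold_accuracies)  # sample standard deviation

print(round(mean_accuracy, 3))
print(round(accuracy_spread, 4))
```

A large spread relative to the gap between two models' means would suggest that the difference between them is not reliable.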

## Model Evaluation
Here we apply the performance metric to the models.

## Evaluation of Performance Metric
Here we evaluate how good our performance metric was.

## References

[1]: rpy2, https://rpy2.bitbucket.io/.

[2]: KDD-CUP-99 Task Description, http://kdd.ics.uci.edu/databases/kddcup99/task.html.

[3]: Hettich, S. and Bay, S. D. (1999). The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science.

[4]: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
