# Intrusion Detection on the KDD Cup 99 Data Set
*An investigation into models and performance metrics for the classification of network data.*

--- 

**TODO** Standardize on what we consider a positive and negative result e.g. is normal positive, or is bad positive? This will ensure our write-ups and code are consistent and make sense together
  - Positive :: detection of intrusion (`bad.` label)
  - Negative :: detected as normal traffic (`normal.` label)

**TODO** Use the same labels for normal and bad data. We've decided on `normal.` and `bad.`.

**TODO** Introduction. Should include:

  * Introduce the KDD Cup:
    - what were its aims?
    - what does the data set look like? features etc.
  * Prior work on the kdd 99 data set
    - extensively studied
    - papers, code we found, etc.
  * Introduce our intended approach, talk about:
    - choose a number of approaches to compare
    - lots of thought has gone into our performance metric <-- make this clear
    
We decided to consider three approaches to this problem - logistic regression, logistic regression with penalisation, and support vector machines. We have allocated the work as follows:
  1. Shanglin will work on the logistic regression model.
  2. Daniel will work on an extension of logistic regression using penalisation and feature selection.
  3. Kishalay will work on a model using support vector machines.



## Preliminaries

The notebook needs to be set up with the required libraries and dependencies loaded.

In [1]:
import matplotlib 
import numpy
import pandas
import seaborn

from sklearn import decomposition
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing
from statsmodels import api as sm

import warnings
warnings.filterwarnings('ignore')

In [2]:
# For inline plots within the notebook
%matplotlib inline
# Allows code cells to be interpreted as R (put %%R on the first line) [1]
%load_ext rpy2.ipython
# Render R output as HTML
from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")
html.init_printing()

In [7]:
%%R

library(caret)
library(ggplot2)
library(data.table)
library(lattice)
library(e1071)

Seed the R and Python random number generators to ensure we get consistent results:

In [8]:
random_state = numpy.random.RandomState(0)

In [9]:
%%R
set.seed(1)

## Data Source

This project uses the same dataset as was originally used in the 1999 KDD Intrusion Detection Contest [2] [3]. The aim of the contest was to survey and evaluate research in intrusion detection. To do this, an environment was created to simulate the conditions of a typical U.S. Air Force intranet, with the addition of malicious traffic.

The dataset is structured such that each row represents a single Transmission Control Protocol (TCP) connection, summarising a sequence of packets between a source and destination computer within the network. The code below loads in this data set:

In [None]:
columns = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'connection_label']
connection_events = pandas.read_csv('../data/kddcup.data_10_percent.gz', names=columns)  # sourced from http://kdd.ics.uci.edu/ [3]

Each row has a `connection_label` feature, describing the source of each connection; either an identifier of the attack-type it is associated with, or the string `normal.` in the case of normal network traffic.

Since the task is to separate malicious behaviour from normal network behaviour, the code below separates these labels from the rest of the dataset, then groups all malicious traffic under the label `bad.`:

In [None]:
connection_labels = connection_events.filter(['connection_label'], axis='columns')
connection_events = connection_events.drop(['connection_label'], axis='columns')

In [13]:
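def generate_label(label):
    """Map every attack label to 'bad.' and leave 'normal.' unchanged."""
    return 'normal.' if label == 'normal.' else 'bad.'

# connection_labels is a one-column DataFrame, so apply the function element-wise
# to that column. Calling DataFrame.apply directly would pass a whole column to
# generate_label and fail on the comparison.
connection_labels['connection_label'] = connection_labels['connection_label'].apply(generate_label)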
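
As a small illustration of the grouping step, the sketch below applies the same rule to a handful of labels and counts the resulting classes. The sample labels are chosen for illustration only and are not read from the data file.

```python
from collections import Counter

def generate_label(label):
    """Collapse every attack type into the single 'bad.' class."""
    return 'normal.' if label == 'normal.' else 'bad.'

# Hypothetical sample of raw labels, for illustration only.
raw_labels = ['normal.', 'smurf.', 'neptune.', 'normal.', 'back.']
binary_labels = [generate_label(label) for label in raw_labels]

# Count how many connections fall into each class after grouping.
class_counts = Counter(binary_labels)
print(class_counts['normal.'], class_counts['bad.'])  # 2 3
```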

## Performance Metrics for Classification
From spec:
> - Together agree and test a performance metric.
>   - Half of the effort should be devoted to exploring appropriate performance measures.
>   - You should create a test and validation dataset, but you may choose how to do this.
> 
> * Think about the circumstances by which your chosen performance metric will lead to real-world generalisability, and how it might compromise this for the purpose of standardization.
> * Demonstrate this with data and/or simulation; for example, if you believe that you can predict new types of data, you could demonstrate this by leaving out some types of data and observing your performance.
> * Examine in what sense your group’s best method is truly best.

It is important to carefully consider different performance metrics in relation to the application of our models, before choosing what we consider to be the most appropriate metric. In this section we compare a number of metrics.

### Cross-Validation 
Cross-validation is a model validation technique for assessing how well the results of a statistical analysis generalise to an independent dataset. It is used in prediction problems to estimate how accurately a fitted model will perform in practice. In a typical problem, the available dataset is divided into a training set and a test set. The model is trained on the training set, and its effectiveness is then measured on the test set, which is the part of the data that was not used to fit the model. This allows us to test the accuracy of the model on observations that were not used to estimate it.

However, the result obtained from a single train/test split depends on which observations happen to fall into each set. To smooth out this sampling variability, the procedure is usually repeated many times with different splits. This produces several accuracy values, which are then averaged. The mean is taken as the indication of the model’s predictive power, and is a more stable estimate than that obtained from a single iteration. There are a variety of methods by which this technique is implemented. In our project, we have used K-fold cross-validation.

In K-fold cross-validation, the dataset is randomly divided into K equally sized samples (folds). The model is trained on K-1 folds, and the remaining fold is used for testing. This process is repeated K times, so that every fold is used as the test data exactly once and every observation is used for both training and testing. The mean of the accuracy values across all iterations is then reported. Averaging over the folds reduces the variance of the estimate compared with a single split.
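
The partitioning described above can be sketched in a few lines of plain Python. This is a minimal, unstratified illustration; the function name `k_fold_indices` and its parameters are our own and do not come from any library.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of rows to folds
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_indices = folds[i]
        train_indices = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_indices, test_indices

splits = list(k_fold_indices(n_samples=23, k=5))

# Every row appears in exactly one test fold across the k iterations.
all_test_indices = sorted(idx for _, test in splits for idx in test)
assert all_test_indices == list(range(23))
```

Unlike this sketch, a stratified splitter also keeps the class proportions roughly equal in every fold.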

Since K-fold cross-validation gives us more reliable (less variable) estimates of the accuracy of our models, we consider it a useful tool for evaluating performance when comparing them.


The code below performs 10-fold cross-validation, providing the list `train_test_splits`. It does this using the [model selection](http://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold) part of the scikit-learn library [4]. Stratified k-fold validation was chosen to preserve the normal/bad class ratios between the training and testing sets.


In [None]:
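# shuffle=True is needed for random_state to take effect; recent scikit-learn
# versions raise an error if random_state is set without it.
k_fold_splitter = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)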
train_test_splits = k_fold_splitter.split(
    connection_events,  # data to be split
    connection_labels,  # target/class to split by
)

# Force evaluation of the train_test_splits generator into a list. This needs to be
# done before it's sent to R.
train_test_splits = [
    [training_indices, testing_indices]
    for training_indices, testing_indices in train_test_splits
]

# Here train_test_splits is a list, where each item represents a single 90/10 split of 
# training and testing data respectively. The values contained are 0-based indices
# of the rows in each part of the split, i.e:
#
#   train_test_splits = [
#      (indices_of_training_samples, indices_of_test_samples),  # first split
#      (indices_of_training_samples, indices_of_test_samples),  # second split
#      ...
#      (indices_of_training_samples, indices_of_test_samples),  # kth split
#  ]
#
# These indices can then be used to fetch the data samples and their labels,
# ready for training and testing.

The cell below imports the training and testing sets into the R session, ready for analysis and modelling.

In [None]:
%Rpush train_test_splits connection_events connection_labels

The helper functions below extract training and test rows from a data frame:

In [None]:
%%R

get_training_rows <- function(dataframe, train_test_split) {
    indices <- train_test_split[[1]]  # Training indices are the first item in a train_test_split
    
    # Python indices start at 0, whilst R indices start at 1. Correct for this by incrementing each index by 1:
    indices <- indices + 1
    
    dataframe[indices,]
}

get_testing_rows <- function(dataframe, train_test_split) {
    indices <- train_test_split[[2]]  # Testing indexes are the second item in a train_test_split
    
    # Python indices start at 0, whilst R indices start at 1. Correct for this by incrementing each index by 1:
    indices <- indices + 1  
    
    dataframe[indices,]
}

### Confusion Matrix (TODO: Shanglin)
- A confusion matrix is a table that summarises actual against predicted classifications. The table below shows a confusion matrix for a two-class classifier.
- It is a 2x2 table with four cells, each holding a different count:

 1. True Positive (TP): an actual positive that is predicted as positive.
 2. True Negative (TN): an actual negative that is predicted as negative.
 3. False Positive (FP, Type I error): an actual negative that is predicted as positive.
 4. False Negative (FN, Type II error): an actual positive that is predicted as negative.

- The confusion matrix is a simple and direct way to compare our models, and the counts it contains are used to calculate summary statistics.
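
The four counts can be tallied directly from paired labels. The sketch below assumes `bad.` is the positive class and uses a `[[TP, FP], [FN, TN]]` layout (rows are predictions, columns are actual classes); the helper name `confusion_matrix_counts` and the sample labels are hypothetical.

```python
def confusion_matrix_counts(actual, predicted, positive='bad.'):
    """Return [[TP, FP], [FN, TN]], treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1  # intrusion correctly flagged
        elif guess == positive:
            fp += 1  # normal traffic flagged as an intrusion (false alarm)
        elif truth == positive:
            fn += 1  # intrusion missed
        else:
            tn += 1  # normal traffic correctly passed
    return [[tp, fp], [fn, tn]]

# Made-up labels for illustration; not taken from the KDD data.
actual = ['bad.', 'bad.', 'bad.', 'normal.', 'normal.', 'bad.']
predicted = ['bad.', 'normal.', 'bad.', 'bad.', 'normal.', 'bad.']
matrix = confusion_matrix_counts(actual, predicted)
print(matrix)  # [[3, 1], [1, 1]]
```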

In [None]:
confusionmatrix = pandas.DataFrame({" ":["Predicted Positive","Predicted Negative"], "Actual Positive":["True Positive","False Negative (Type II Error)"], "Actual Negative":["False Positive (Type I Error)","True Negative"]})
confusionmatrix.set_index([" "])

In [None]:
# For our topic, the confusion matrix takes this form:
confusionmatrix = pandas.DataFrame({" ":["Predicted Bad","Predicted Normal"], "Actual bad":["",""], "Actual normal":["",""]})
confusionmatrix.set_index([" "])

### Summary Statistics (TODO: Shanglin, Dan, Kish)

Look here for table with lots of them on: https://en.wikipedia.org/wiki/Sensitivity_and_specificity and here is a comparison of two new ones: http://standardwisdom.com/softwarejournal/2011/12/matthews-correlation-coefficient-how-well-does-it-do/

Based on the confusion matrix, we can calculate a number of statistics:

  - Accuracy: (TP+TN)/(TP+TN+FP+FN) is the proportion of all predictions that were correct.
  - Sensitivity: TP/(TP+FN) is the proportion of actual positives that were correctly predicted.
  - Specificity: TN/(TN+FP) is the proportion of actual negatives that were correctly predicted.

For our topic, accuracy is not a good performance metric since the dataset is unbalanced: "bad" connections make up a large proportion (80%) of it. For example, if a model predicts every connection as bad, it still achieves 80% accuracy, with a sensitivity of 1 but a specificity of 0, which is not useful. Sensitivity and specificity considered together are therefore more informative for our models. We also have other metrics:
  - ROC curve: a plot with the false positive rate on the X axis and the true positive rate on the Y axis.
    - The ROC curve allows us to consider the trade-off between the sensitivity and specificity of a model. Unfortunately, it is hard to use for choosing between models in this setting because it does not produce a single value with which models can be compared.
  - Kappa Statistics: 
  
  
  - Accuracy:
     - Using a simple summary measure such as accuracy can cause us to optimise away the prediction of rare classes [6].
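
The arithmetic behind this concern can be checked with a short sketch. It takes `bad.` as the positive class and uses invented counts (1,000 connections with an 80/20 bad/normal split) for a model that labels every connection as bad.

```python
# Illustrative counts only: a model that predicts 'bad.' for every connection.
true_positives = 800   # bad connections flagged as bad
false_positives = 200  # normal connections flagged as bad
false_negatives = 0    # no bad connection is ever labelled normal
true_negatives = 0     # no normal connection is ever labelled normal

total = true_positives + false_positives + false_negatives + true_negatives
accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

print(accuracy, sensitivity, specificity)  # 0.8 1.0 0.0
```

The accuracy looks respectable even though the model never recognises normal traffic, which is why accuracy alone is not relied on here.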

In [None]:
def accuracy(confusion_matrix):
    # Expects [[true_positives, false_positives], [false_negatives, true_negatives]]
    true_positives = confusion_matrix[0][0]
    true_negatives = confusion_matrix[1][1]
    false_positives = confusion_matrix[0][1]
    false_negatives = confusion_matrix[1][0]
    return (true_positives + true_negatives) / (true_positives + false_positives + true_negatives + false_negatives)

  - Sensitivity
  

In [None]:
def sensitivity(confusion_matrix):
    true_positives = confusion_matrix[0][0]
    false_negatives = confusion_matrix[1][0]
    return true_positives/(true_positives+false_negatives)

  - Specificity
  

In [None]:
def specificity(confusion_matrix):
    false_positives = confusion_matrix[0][1]
    true_negatives = confusion_matrix[1][1]
    return true_negatives/(true_negatives+false_positives)

  - Kappa  (https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english). Here is an article on the kappa statistic: https://standardwisdom.com/softwarejournal/2011/12/confusion-matrix-another-single-value-metric-kappa-statistic/
    - The Kappa statistic is useful since it is a function of all elements of the confusion matrix. This has the potential to give a more balanced view of performance than using a metric like sensitivity on its own.
    - We've included an implementation of Cohen's Kappa statistic in the function below:

In [None]:
def kappa(confusion_matrix):
    """
    Calculates the Kappa summary statistic [5] on the confusion matrix given. 
    
    The confusion matrix should have the following structure:
        [
            [true_positives, false_positives],
            [false_negatives, true_negatives],
        ]
    """
    true_positives = confusion_matrix[0][0]
    true_negatives= confusion_matrix[1][1]
    false_positives = confusion_matrix[0][1]
    false_negatives = confusion_matrix[1][0]
    
    total = true_positives+false_positives+true_negatives+false_negatives
    
    observed_accuracy = (true_positives + true_negatives) / total
    random_accuracy = (
        (true_negatives+false_positives)*(true_negatives+false_negatives) 
        + (false_negatives+true_positives)*(false_positives+true_positives)
    ) / (
        total * total
    )
    
    return (observed_accuracy - random_accuracy) / (1 - random_accuracy)
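
To show how the statistic behaves, the sketch below restates the same formula in a compact standalone helper (the name `kappa_from_counts` is ours) and evaluates it on two invented sets of counts: a model that labels everything as bad, and a model that separates the classes reasonably well.

```python
def kappa_from_counts(tp, fp, fn, tn):
    """Cohen's kappa from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (total * total)
    return (observed - expected) / (1 - expected)

# Illustrative counts only. A model that labels everything 'bad.' ...
always_bad = kappa_from_counts(tp=80, fp=20, fn=0, tn=0)
# ... versus a model that separates the two classes reasonably well.
reasonable = kappa_from_counts(tp=70, fp=5, fn=10, tn=15)

print(round(always_bad, 3), round(reasonable, 3))  # 0.0 0.571
```

The trivial model scores 0 despite its 80% accuracy, because its agreement with the true labels is no better than chance.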

### Considering Compliance Budget

Within the study of usable security and security economics, a compliance budget considers the impact of a security system on end-users [7]. Each end-user develops a mental model of their perceived benefits of the policy, against the effort required to follow it [8]. A compliance budget is the amount of effort a user will put into complying with security policy before they attempt to bypass it.

In the context of intrusion detection systems (IDS), we can think of the compliance budget as the number of false alarms a member of the security team of an organisation is able to deal with, before considering all alarms as false. Within this report, we consider the total number of false alarms a security team would be able to handle each day, and use this to calculate a worst-case false-positive rate that each model must satisfy. This cut-off is then combined with a summary statistic to rank each model.

This number will be unique to each organisation, and dependent on a number of factors which we are unable to predict. For example, a large organisation may have a far larger security team than a small start-up. The KDD 99 dataset is said to simulate a "typical U.S. Air Force LAN" [2], so we have assumed a well-resourced security team that is able to adequately deal with 500 false alarms per day. We have also assumed that true intrusions are extremely rare, so they do not have an effect on the security team's compliance budget.

In the code below, we calculate the maximum false-positive rate that each of our models must satisfy. It assumes that the number of connections per day is independent of the day of the week, which is a potential source of error.

In [17]:
number_of_weeks = 9  # sourced from the KDD 99 Task Description [2]
number_of_days = number_of_weeks * 7

total_number_of_connections = len(connection_events)

# We are using the 10 percent data set, so multiply the number of connection events by 10:
total_number_of_connections = total_number_of_connections * 10

average_connections_per_day = total_number_of_connections / number_of_days
maximum_false_positives_per_day = 500
maximum_false_positive_rate = maximum_false_positives_per_day/average_connections_per_day

print("The Maximum False Positive Rate is {:.2f}%".format(maximum_false_positive_rate*100))

The Maximum False Positive Rate is 0.64%
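
One way the cut-off might be applied is sketched below: compute each model's false-positive rate and keep only the models that stay within the budget. The confusion matrices are invented for illustration, the cut-off is hard-coded from the printed value, and the helper names are hypothetical. The sketch uses the usual definition FP/(FP+TN), which lines up with the per-connection budget only under the stated assumption that almost all traffic is normal.

```python
def false_positive_rate(confusion_matrix):
    """FP / (FP + TN) for a matrix laid out as [[TP, FP], [FN, TN]]."""
    false_positives = confusion_matrix[0][1]
    true_negatives = confusion_matrix[1][1]
    return false_positives / (false_positives + true_negatives)

def within_budget(confusion_matrix, maximum_false_positive_rate):
    """True if the model raises no more false alarms than the budget allows."""
    return false_positive_rate(confusion_matrix) <= maximum_false_positive_rate

# Hypothetical confusion matrices, invented for illustration only.
candidate_models = {
    'model_a': [[7900, 10], [100, 1990]],
    'model_b': [[7990, 60], [10, 1940]],
}
cutoff = 0.0064  # the 0.64% rate, hard-coded for this sketch

eligible = [
    name for name, matrix in candidate_models.items()
    if within_budget(matrix, cutoff)
]
print(eligible)  # ['model_a']
```

The eligible models could then be ranked by a summary statistic such as kappa.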


## Data Models

**TODO** Dan

From spec:
> For example, you could look to predict the next event on each
edge based on past events on this edge, or you could model the network at a
more global level, and many other approaches are possible.

**TODO**: We decided to model it on a global scale, why?
   - In real life it is hard to keep track of state between events (especially for high-traffic websites); it is easier to classify each incoming event individually, as it arrives, with a previously trained model.
   
   
There are a number of approaches we could have taken when developing our models. The need to use a single performance metric to compare three models limited the potential variation in approaches. 

### Logistic Regression 
Logistic regression is an appropriate model when the dependent variable is binary (in our case, normal vs bad). In logistic regression analysis, the regression describes the relationship between the binary dependent variable and the independent variables [8].

The model can be written as:
$\operatorname{logit}(\pi) = \beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k$

where $\pi$ is the probability that the binary variable equals 1,

$\operatorname{logit}(\pi) = \log(\pi/(1-\pi))$,

$\beta_0, \dots, \beta_k$ are the parameters of the model, and

$x_1, \dots, x_k$ are the independent variables.
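
To make the link between the linear predictor and the predicted probability concrete, here is a minimal sketch of the logit and its inverse. The coefficients and feature values are hypothetical and are not fitted to the KDD data.

```python
import math

def logit(probability):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(probability / (1 - probability))

def inverse_logit(linear_predictor):
    """Map a linear predictor back to a probability (the logistic function)."""
    return 1 / (1 + math.exp(-linear_predictor))

def predict_probability(intercept, coefficients, features):
    """Probability that the response equals 1 for one observation."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, features))
    return inverse_logit(linear_predictor)

# Hypothetical parameters: beta_0 = -1, beta_1 = 0.5, beta_2 = 2,
# evaluated at x_1 = 2 and x_2 = 0.25, giving a linear predictor of 0.5.
probability = predict_probability(intercept=-1.0, coefficients=[0.5, 2.0], features=[2.0, 0.25])
print(round(probability, 3))  # 0.622
```

Applying `logit` to the result recovers the linear predictor, which is the sense in which the model is linear on the log-odds scale.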

In [None]:
%%R

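# The file has no header row, so header=FALSE stops the first connection being
# read as column names. stringsAsFactors=TRUE keeps the text columns as factors
# (this is no longer the default from R 4.0 onwards).
kddata<-read.csv("../data/kddcup.data_10_percent.gz", header=FALSE, stringsAsFactors=TRUE)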

kddnames=read.table(
    "../data/kddcup.names",
    sep=":",
    skip=1,
    as.is=T
)

colnames(kddata)=c(kddnames[,1], "normal")

In [None]:
%%R

# Recode the label column to a binary response: normal = 0, bad = 1
kddata$normal <- ifelse(kddata$normal == "normal.", 0, 1)

In [None]:
%%R

# Collapse the service column to three levels: private, http, others
service <- as.character(kddata$service)
service[!(service %in% c("private", "http"))] <- "others"
# Encode as numbers: private = 1, http = 2, others = 3
kddata$service <- as.numeric(factor(service, levels = c("private", "http", "others")))

In [None]:
%%R

#split data into train and test
#first we split data into "normal" and "bad"
kddata_normal<-kddata[which(kddata$normal==0),]
kddata_bad<-kddata[which(kddata$normal==1),]
# define an 80%/20% train/test split of the dataset
# we sample "normal" and "bad" dataset
sample_normal<-sample(seq_len(nrow(kddata_normal)),size=floor(.80*nrow(kddata_normal)))
sample_bad<-sample(seq_len(nrow(kddata_bad)),size=floor(.80*nrow(kddata_bad)))
# we get train and test for "normal" dataset
train_normal<-kddata_normal[sample_normal,]
test_normal<-kddata_normal[-sample_normal,]
# we get train and test for "bad" dataset
train_bad<-kddata_bad[sample_bad,]
test_bad<-kddata_bad[-sample_bad,]
# combine the "normal" and "bad" training sets to get the training set, and likewise for the test set
train<-rbind(train_normal,train_bad)
test<-rbind(test_normal,test_bad)

#### Model Selection

In [None]:
%%R
 
#model 1: variable:service(network service on the destination), logged_in(1 if successfully logged in; 0 otherwise), 
#                  srv_count(	number of connections to the same service as the current connection in the past two seconds),
#                  count(number of connections to the same host as the current connection in the past two seconds )
#                   
model1<-glm(normal~service+logged_in+srv_count+count, family=binomial(link=logit), data=train)
summary(model1)
fitted_results <- predict(model1,newdata=test,type = 'response')
fitted_results <- ifelse(fitted_results > 0.5,1,0)

confusionMatrix(factor(fitted_results),factor(test$normal),positive="1",dnn = c("prediction","actual"))

In [None]:
%%R

#model 2: variable:service(network service on the destination), logged_in(1 if successfully logged in; 0 otherwise), 
#                  srv_count(	number of connections to the same service as the current connection in the past two seconds)
#                   
model2<-glm(normal~service+logged_in+srv_count, family=binomial(link=logit), data=train)
summary(model2)
fitted_results2 <- predict(model2,newdata=test,type = 'response')
fitted_results2 <- ifelse(fitted_results2 > 0.5,1,0)

confusionMatrix(factor(fitted_results2),factor(test$normal),positive="1",dnn = c("prediction","actual"))

Based on the confusion matrices of model 1 and model 2, we can clearly see that both the sensitivity and the specificity of model 1 are larger than those of model 2, so we choose model 1.

In [None]:
%%R

#model 3: variable:service(network service on the destination), logged_in(1 if successfully logged in; 0 otherwise), 
#                  count(number of connections to the same host as the current connection in the past two seconds )
# 
model3<-glm(normal~service+logged_in+count, family=binomial(link=logit), data=train)
summary(model3)
fitted_results3 <- predict(model3,newdata=test,type = 'response')
fitted_results3 <- ifelse(fitted_results3 > 0.5,1,0)

confusionMatrix(factor(fitted_results3),factor(test$normal),positive="1",dnn = c("prediction","actual"))


Based on the confusion matrices of model 1 and model 3, we cannot easily pick which model is better, since the sensitivity of model 1 is smaller than that of model 3, but its specificity is larger. We therefore use a likelihood ratio test to decide between them. Model 3 is model 1 without `srv_count`, so the null hypothesis is that the coefficient of `srv_count` is 0.

In [None]:
%%R

#likelihood ratio test
anova(model3, model1,test="LRT")

The p-value is very small, so we reject the hypothesis that the coefficient of `srv_count` is 0 and choose model 1 as our model.
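
For readers who want to see what the test computes, the sketch below works through a likelihood ratio test for two nested models that differ by a single parameter. The log-likelihood values are hypothetical and are not taken from our fitted models. For one degree of freedom, the chi-squared tail probability can be written with the complementary error function, so only the standard library is needed.

```python
import math

def likelihood_ratio_test(log_likelihood_reduced, log_likelihood_full):
    """LRT for nested models that differ by one parameter (1 degree of freedom)."""
    statistic = 2 * (log_likelihood_full - log_likelihood_reduced)
    # For a chi-squared variable with 1 df, P(X > d) = erfc(sqrt(d / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical log-likelihoods: the full model fits better by 10 units.
statistic, p_value = likelihood_ratio_test(
    log_likelihood_reduced=-1050.0,
    log_likelihood_full=-1040.0,
)
print(statistic, p_value < 0.05)  # 20.0 True
```

A small p-value means the extra parameter improves the fit by more than chance alone would explain, so the larger model is preferred.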

Furthermore, we use k-fold cross-validation with 10 folds to evaluate model 1.

In [None]:
%%R

#k-fold cv
#Create 10 equally size folds
folds = createFolds(kddata$normal, k = 10)
cv = lapply(folds, function(x) {
  training_fold = kddata[-x, ]
  test_fold = kddata[x, ]
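  model = glm(normal~service+logged_in+srv_count+count, family=binomial(link=logit), data=training_fold)
  # type = 'response' returns probabilities, so the 0.5 threshold is applied on the probability scale
  y_pred = predict(model, newdata = test_fold, type = 'response')
  y_pred <- ifelse(y_pred > 0.5,1,0)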
  cm<-confusionMatrix(factor(y_pred),factor(test_fold$normal),positive="1",dnn = c("prediction","actual"))
  return(cm$table)
})
cv
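
The per-fold tables can be combined into a single figure per metric. The sketch below is one possible approach, assuming each fold's table has first been rearranged into the `[[TP, FP], [FN, TN]]` layout; the fold counts are invented for illustration and the helper name `pooled_rates` is hypothetical.

```python
def pooled_rates(fold_matrices):
    """Sum per-fold [[TP, FP], [FN, TN]] matrices, then compute the rates."""
    tp = sum(m[0][0] for m in fold_matrices)
    fp = sum(m[0][1] for m in fold_matrices)
    fn = sum(m[1][0] for m in fold_matrices)
    tn = sum(m[1][1] for m in fold_matrices)
    return {
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
    }

# Hypothetical confusion matrices from three folds, for illustration only.
fold_matrices = [
    [[95, 2], [5, 48]],
    [[97, 1], [3, 49]],
    [[96, 3], [4, 47]],
]
rates = pooled_rates(fold_matrices)
print(rates)
```

Pooling the counts before dividing weights every observation equally, which differs slightly from averaging the per-fold rates when fold sizes are unequal.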

### Logistic Regression with Penalisation (TODO: Dan)

### Support Vector Machine (TODO: Kish)


In [15]:
%%R

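# header = FALSE because the file has no header row; stringsAsFactors = TRUE keeps text columns as factors
kddata = read.csv("../data/kddcup.data_10_percent.gz", header = FALSE, stringsAsFactors = TRUE)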
kddnames = read.table("../data/kddcup.names",sep = ":", skip = 1, as.is = T)
colnames(kddata)=c(kddnames[,1],"normal")

In [7]:
%%R

# Drop columns that hold a single constant value, since they carry no information
constants = sapply(kddata, function(column) length(unique(column)) == 1)
not_constants = !constants
pruned_kddata = kddata[,not_constants]

In [None]:
%%R

# Group every attack type under the single label 'bad.'
pruned_kddata$normal = factor(
    ifelse(pruned_kddata$normal == 'normal.', 'normal.', 'bad.'),
    levels = c('normal.','bad.')
)

summary(pruned_kddata)

In [None]:
%%R

dataset1 = pruned_kddata[pruned_kddata$normal=="normal.",]
dataset2 = pruned_kddata[pruned_kddata$normal=="bad.",]

c(nrow(dataset1) , nrow(dataset2))

In [None]:
%%R

sample1_indexes = sample(nrow(dataset1), size = floor(0.9*nrow(dataset1)), prob=NULL)
sample1 = dataset1[sample1_indexes,]

length((sample1$normal))

In [None]:
%%R

sample2_indexes = sample(nrow(dataset2), size = floor(0.9*nrow(dataset2)),prob=NULL)
sample2 = dataset2[sample2_indexes,]

length((sample2$normal))

In [None]:
%%R

# Training set: the 90% samples of both classes
main_sample1 = rbind(sample1,sample2)
# Test set: the remaining 10% of each class
test_data1 = dataset1[-sample1_indexes,]
length(test_data1$normal)
test_data2 = dataset2[-sample2_indexes,]
length(test_data2$normal)
main_test_data = rbind(test_data1,test_data2)

c(length(main_sample1$normal), length(main_test_data$normal))

In [None]:
%%R

classifier = svm(formula = normal ~ ., data = main_sample1, type = "C-classification", kernel = "linear")

y_pred = predict(classifier, newdata = main_test_data)
table(main_test_data$normal, y_pred)

In [None]:
%%R

folds = createFolds(pruned_kddata$normal, k = 10)
cv = lapply(folds, function(x) {
  training_fold = pruned_kddata[-x, ]
  test_fold = pruned_kddata[x, ]
  classifier = svm(formula = normal ~ ., data = training_fold, type = "C-classification", kernel = "linear")
  y_pred = predict(classifier, newdata = test_fold)
  cm = table(test_fold$normal, y_pred)
  accuracy = (cm[1,1] + cm[2,2])/(cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
  })
cv
Macc = mean(as.numeric(cv))
Macc

## Model Evaluation
### Plan

Here we apply our performance metric to the models.

Create a summary table with:
  - model
  - summary statistic 
    - maybe multiple?
    
Talk about how some models performed better under different conditions.

How applicable is each model for use in live intrusion detection in e.g. a corporate internal network.


### Writeup

## Evaluation of Performance Metric

### Plan
Here we evaluate how good our performance metric was.

Key points:
  - Limitations of our performance metric:
    - In the end, in order to pick a "winner" we need to decide on a single summary statistic. This approach to picking a model then becomes brute force and loses the subtle differences between the models.
    - The class ratios in our dataset do not match real-world ratios, which means our performance metric may overly value certain criteria over others. (TODO: this needs to be more concrete).
    - Our performance metric is unable to take into account the conditions that an intrusion detection system would need to run under. There is always a trade-off between model performance and speed.

### Writeup

## References

[1]: rpy2, https://rpy2.bitbucket.io/.

[2]: KDD-CUP-99 Task Description, http://kdd.ics.uci.edu/databases/kddcup99/task.html.

[3]: Hettich, S. and Bay, S. D. (1999). The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science.

[4]: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

[5]: Confusion Matrix – Another Single Value Metric – Kappa Statistic, https://standardwisdom.com/softwarejournal/2011/12/confusion-matrix-another-single-value-metric-kappa-statistic/

[6]: Practical Statistics for Data Science, 1st ed., by Peter Bruce and Andrew Bruce (O’Reilly Media, 2017).

[7]: Beautement, Adam, M. Angela Sasse, and Mike Wonham. "The compliance budget: managing security behaviour in organisations." Proceedings of the 2008 New Security Paradigms Workshop. ACM, 2009.

[8]: What-is-logistic-regression: https://www.statisticssolutions.com/what-is-logistic-regression/



[8]: Weirich, Dirk. “Persuasive password security.” CHI Extended Abstracts (2001).
