# 02 - Data Setup

This section explores the KDD Cup 1999 dataset briefly and acts as the setup for our subsequent sections.

## Importing and Installing Modules

To install the required libraries, uncomment the cell below and run it in a fresh environment.

In [1]:
# %pip install -r ../requirements.txt

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Discussing the Data

For this assessment, we use a pre-processed version of the 10% subset of the KDD Cup 1999 dataset, found [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)[1]. The pre-processing was done by the author in a previous assessment; the details can be found [on this GitHub repository](https://github.com/billnunn/Assessment-1-Bill-Mo)[2] under [01 - Data Setup](https://github.com/billnunn/Assessment-1-Bill-Mo/blob/main/Report/01%20-%20Data%20Setup.ipynb) in the 'Report' folder.

It is worth noting that the classification task in this assessment was also carried out using RandomForests in the previous assessment, with reasonable success. So, while a neural network may not be necessary for this particular dataset, it remains a suitable setting for assessing the performance and convergence rate of different optimizer choices.

## Loading Data

Uncomment the cells below to download the data from Google Drive. If this doesn't work, please download the file manually from [here](https://drive.google.com/file/d/1XSs3-GjPwB0FAYtGVNiwdKdiJ916BI75/view?usp=sharing) and save it as '../data/kdd_log_df.csv'.

In [3]:
# from google_drive_downloader import GoogleDriveDownloader as gdd

In [4]:
# gdd.download_file_from_google_drive(file_id='1XSs3-GjPwB0FAYtGVNiwdKdiJ916BI75',
                                    # dest_path='../data/kdd_log_df.csv')

In [5]:
df = pd.read_csv('../data/kdd_log_df.csv')
df.head()

Unnamed: 0,duration,src_bytes,dst_bytes,urgent,hot,num_failed_logins,num_compromised,num_root,num_file_creations,num_access_files,...,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH,connection_type,connection_category
0,0.0,5.204007,8.603554,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
1,0.0,5.480639,6.188264,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
2,0.0,5.463832,7.198931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
3,0.0,5.393628,7.198931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
4,0.0,5.384495,7.617268,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal


## Exploring Data

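We first examine the labels in our dataset: we count the distinct connection types and connection categories, then list how often each (type, category) pair occurs.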

In [6]:
num_type = df.connection_type.nunique()
num_cat = df.connection_category.nunique()
print('Total of {} connection types and {} connection categories'.format(num_type,num_cat))
print()
df[['connection_type','connection_category']].value_counts()

Total of 23 connection types and 5 connection categories



connection_type  connection_category
smurf            dos                    280790
neptune          dos                    107201
normal           normal                  97278
back             dos                      2203
satan            probe                    1589
ipsweep          probe                    1247
portsweep        probe                    1040
warezclient      r2l                      1020
teardrop         dos                       979
pod              dos                       264
nmap             probe                     231
guess_passwd     r2l                        53
buffer_overflow  u2r                        30
land             dos                        21
warezmaster      r2l                        20
imap             r2l                        12
rootkit          u2r                        10
loadmodule       u2r                         9
ftp_write        r2l                         8
multihop         r2l                         7
phf              r2l   

One issue with the dataset is visible here: there are fewer normal connections than attacks, whereas in a realistic cybersecurity setting we would expect the opposite, with normal connections outnumbering attacks by orders of magnitude.

We also have an overall class imbalance when treating each connection category as a class, as seen below. This is not an issue in the cybersecurity sense, since we do expect some attacks to be more prevalent than others, but it is something to keep in mind when choosing a metric for assessing model/optimizer performance.

In [7]:
sum(100*df.connection_category.value_counts()[:2])/len(df)

98.93020742033234

In [8]:
print(df.connection_category.value_counts())
print()
print(100*df.connection_category.value_counts()/len(df))

dos       391458
normal     97278
probe       4107
r2l         1126
u2r           52
Name: connection_category, dtype: int64

dos       79.239142
normal    19.691066
probe      0.831341
r2l        0.227926
u2r        0.010526
Name: connection_category, dtype: float64
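To illustrate why this imbalance matters for metric choice, here is a minimal sketch using the category counts printed above. It scores a hypothetical baseline that always predicts `dos` (this baseline is an illustration, not a model used in the report) with plain accuracy and with scikit-learn's balanced accuracy, which averages recall over classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Category counts copied from the output above.
counts = {"dos": 391458, "normal": 97278, "probe": 4107, "r2l": 1126, "u2r": 52}

# Rebuild a label vector with those counts (row order is irrelevant for these metrics).
y_true = np.repeat(list(counts.keys()), list(counts.values()))

# Hypothetical baseline: always predict the majority category.
y_pred = np.full(y_true.shape, "dos")

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.4f}, balanced accuracy: {bal_acc:.4f}")
```

The baseline reaches roughly 79% accuracy while learning nothing about four of the five categories, whereas its balanced accuracy is only 20%. Which metric is most appropriate for the later sections is a separate decision; this only shows how far the two can diverge on this data.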


## Train Test Split

In [9]:
#first 115 columns are our features
features = df[df.columns[:-2]]
#last two columns are connection category and connection type
targets = df[df.columns[-2:]]

In the following, we create our training and test dataset split but we do something unusual: we assign 50% of our dataset for training. There are a few reasons for this:
- The class imbalances are such that if we only assigned 10% or 20% of the data to training, we would only have 5 or 10 u2r connections, respectively, so assessing performance vs randomness on these becomes increasingly hard.
- Given that RandomForests perform so well, it is reasonable to expect that neural networks would perform at least as well. Indeed, we found that with a larger training set our performance was too good to allow meaningful comparison in some cases, so we artificially make the problem more difficult for our model.

Concretely, we place ourselves in a scenario where the available training data is very limited and assess how we perform on a large amount of out-of-sample data. We still stratify the split so that no connection category is lost from either dataset purely by chance.
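The arithmetic behind the first reason can be sketched as follows. The counts are those printed in the exploration above; the expected sizes are simple products and only approximate what a stratified split would actually allocate.

```python
# Category counts copied from the exploration output above.
counts = {"dos": 391458, "normal": 97278, "probe": 4107, "r2l": 1126, "u2r": 52}

# Candidate training fractions discussed in the text.
fractions = [0.1, 0.2, 0.5]

# Approximate number of u2r rows that would land in the training set.
expected_u2r = {frac: counts["u2r"] * frac for frac in fractions}

for frac, n in expected_u2r.items():
    print(f"train fraction {frac:.0%}: about {n:.1f} u2r connections")
```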

In [10]:
X_train, X_test, y_train, y_test = train_test_split(features, targets,
                                                    test_size=0.5, 
                                                    random_state = 42, shuffle = True, 
                                                    stratify = targets)

A brief sanity check that our data has been shuffled and that the row indices of X and y match.

In [11]:
X_train.head()

Unnamed: 0,duration,src_bytes,dst_bytes,urgent,hot,num_failed_logins,num_compromised,num_root,num_file_creations,num_access_files,...,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
360984,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
112306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
404506,0.0,6.25575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
222435,0.0,6.940222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
234492,0.0,6.940222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [12]:
y_train.head()

Unnamed: 0,connection_type,connection_category
360984,neptune,dos
112306,neptune,dos
404506,smurf,dos
222435,smurf,dos
234492,smurf,dos
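The visual check above can also be done programmatically. A minimal sketch on toy data (the frames `features_toy` and `targets_toy` are illustrative stand-ins, not the real dataset) compares the row indices returned by `train_test_split` directly:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features and targets.
features_toy = pd.DataFrame({"x": range(20)})
targets_toy = pd.DataFrame({"label": ["a", "b"] * 10})

X_tr, X_te, y_tr, y_te = train_test_split(
    features_toy, targets_toy,
    test_size=0.5, random_state=42, shuffle=True,
    stratify=targets_toy,
)

# Features and targets should carry the same row index, in the same order.
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)

# Train and test should not share any rows.
disjoint = X_tr.index.intersection(X_te.index).empty

print("aligned:", aligned, "| disjoint:", disjoint)
```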


In [13]:
y_train.connection_type.nunique(), y_test.connection_type.nunique()

(23, 23)

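We also check that our y's have been stratified correctly: the per-class counts in the training and test sets should differ by at most one.

In [14]:
#compute the counts once, then compare train and test for each (type, category) pair
train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
differences = [(ind, train_counts[ind] - test_counts[ind]) for ind in train_counts.index]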

for a in differences:
    print('Index :', a[0][0], a[0][1], 'has difference:', a[1])

Index : smurf dos has difference: 0
Index : neptune dos has difference: -1
Index : normal normal has difference: 0
Index : back dos has difference: -1
Index : satan probe has difference: -1
Index : ipsweep probe has difference: -1
Index : portsweep probe has difference: 0
Index : warezclient r2l has difference: 0
Index : teardrop dos has difference: -1
Index : pod dos has difference: 0
Index : nmap probe has difference: -1
Index : guess_passwd r2l has difference: 1
Index : buffer_overflow u2r has difference: 0
Index : land dos has difference: 1
Index : warezmaster r2l has difference: 0
Index : imap r2l has difference: 0
Index : rootkit u2r has difference: 0
Index : loadmodule u2r has difference: 1
Index : ftp_write r2l has difference: 0
Index : multihop r2l has difference: 1
Index : perl u2r has difference: 1
Index : phf r2l has difference: 0
Index : spy r2l has difference: 0


## Creating a Validation Set

Here a problem arises, as can be seen below: one of our connection types (spy) occurs only once in our training dataset. It is therefore impossible to split into train/validation sets such that both contain at least one spy attack. Moreover, since the whole dataset contains only 2 spy attacks, no train/test split could have left more spy attacks to share between train and validation.

For this reason, we stratify this split by connection category instead.

In [15]:
y_train.value_counts()

connection_type  connection_category
smurf            dos                    140395
neptune          dos                     53600
normal           normal                  48639
back             dos                      1101
satan            probe                     794
ipsweep          probe                     623
portsweep        probe                     520
warezclient      r2l                       510
teardrop         dos                       489
pod              dos                       132
nmap             probe                     115
guess_passwd     r2l                        27
buffer_overflow  u2r                        15
land             dos                        11
warezmaster      r2l                        10
imap             r2l                         6
rootkit          u2r                         5
loadmodule       u2r                         5
ftp_write        r2l                         4
multihop         r2l                         4
perl             u2r   

In [16]:
#we keep only 20% of the training split (equivalent to 10% of the whole data) for training;
#the remaining 80% becomes the validation set
#this is to artificially make the problem more difficult for our model as stated previously
X_train_small, X_val, y_train_small, y_val = train_test_split(X_train, y_train, 
                                                             test_size = 0.8,
                                                             random_state = 42, shuffle = True,
                                                             stratify = y_train.connection_category)
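Why the split is stratified on category rather than type can be reproduced on toy data. In the sketch below (the toy frame is illustrative and much smaller than the real data), a type with a single member makes `train_test_split` raise a `ValueError` when stratifying on type, while stratifying on the coarser category succeeds.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels: 'spy' occurs only once, like in our training split.
toy_y = pd.DataFrame({
    "connection_type": ["smurf"] * 10 + ["normal"] * 10 + ["warezclient"] * 4 + ["spy"],
    "connection_category": ["dos"] * 10 + ["normal"] * 10 + ["r2l"] * 5,
})
toy_X = pd.DataFrame({"x": range(len(toy_y))})

# Stratifying on type fails: a class with one member cannot appear in both sets.
try:
    train_test_split(toy_X, toy_y, test_size=0.8, random_state=42,
                     stratify=toy_y.connection_type)
    type_split_ok = True
except ValueError as err:
    type_split_ok = False
    print("Stratifying on type failed:", err)

# Stratifying on category works: every category has enough members.
X_small, X_rest, y_small, y_rest = train_test_split(
    toy_X, toy_y, test_size=0.8, random_state=42,
    stratify=toy_y.connection_category,
)
print(y_small.connection_category.value_counts())
```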

In [17]:
#turn target into binary with 0 = normal, 1 = attack
y_train_bin = y_train_small.connection_category.apply(lambda x: 0 if x == 'normal' else 1)
y_val_bin = y_val.connection_category.apply(lambda x: 0 if x == 'normal' else 1)
y_test_bin = y_test.connection_category.apply(lambda x: 0 if x == 'normal' else 1)

In [18]:
#create multi category classifier variable e.g. normal,u2r,dos... 
y_train_multi = y_train_small.connection_category
y_val_multi = y_val.connection_category
y_test_multi = y_test.connection_category

We do not take further steps for our categorical y's because the way they get saved and reloaded into the other notebooks causes issues with how TensorFlow likes to take in categorical data. Therefore, we process categorical y's in the models' notebooks instead.
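As a rough idea of what that later processing could look like, here is a minimal sketch that one-hot encodes category strings with a fixed class order, so that train, validation and test share the same mapping. The helper `one_hot` and the order in `CATEGORIES` are assumptions made for illustration; the model notebooks may do this differently.

```python
import numpy as np
import pandas as pd

# Assumed fixed class order (hypothetical; any order works if used consistently).
CATEGORIES = ["normal", "dos", "probe", "r2l", "u2r"]

def one_hot(labels, categories=CATEGORIES):
    """Map string labels to one-hot rows using a fixed category order."""
    codes = pd.Categorical(labels, categories=categories).codes
    if (codes < 0).any():
        # pd.Categorical assigns code -1 to labels outside `categories`.
        raise ValueError("found a label that is not in `categories`")
    return np.eye(len(categories), dtype="float32")[codes]

encoded = one_hot(pd.Series(["dos", "normal", "u2r"]))
print(encoded)
```

Fixing the order up front avoids the pitfall of encoding each split separately, where the same category could end up with a different column in train and test.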

We now save our variables to a data folder so that we have the same data split to load from across all notebooks.

In [19]:
#forward slashes keep the paths portable and consistent with the read_csv call above
X_train_small.to_csv('../data/X_train.csv', index=False)
X_val.to_csv('../data/X_val.csv', index=False)
X_test.to_csv('../data/X_test.csv', index=False)

y_train_bin.to_csv('../data/y_train_bin.csv', index=False)
y_val_bin.to_csv('../data/y_val_bin.csv', index=False)
y_test_bin.to_csv('../data/y_test_bin.csv', index=False)

y_train_multi.to_csv('../data/y_train_multi.csv', index=False)
y_val_multi.to_csv('../data/y_val_multi.csv', index=False)
y_test_multi.to_csv('../data/y_test_multi.csv', index=False)
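One consequence of saving with `index=False` is that the shuffled row index is dropped, so features and labels are matched purely by row position when reloaded. A minimal round-trip sketch with toy data and a temporary directory (both illustrative, not the real files) shows this:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy features and binary labels sharing a shuffled-looking index.
X_toy = pd.DataFrame({"duration": [0.0, 1.5, 0.0]}, index=[360984, 112306, 404506])
y_toy = pd.Series([1, 1, 0], index=X_toy.index, name="connection_category")

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Save without the index, as in the cell above.
    X_toy.to_csv(data_dir / "X_train.csv", index=False)
    y_toy.to_csv(data_dir / "y_train_bin.csv", index=False)

    # Reload: both frames get a fresh 0..n-1 index.
    X_back = pd.read_csv(data_dir / "X_train.csv")
    y_back = pd.read_csv(data_dir / "y_train_bin.csv")

print(X_back.index.tolist(), y_back.connection_category.tolist())
```

Because alignment is positional after reloading, any later reordering or filtering has to be applied to the feature and label files together.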

## Appendix

The code below is for connection_type classification, which would need more tweaking of the neural network's architecture. The network would have to output some form of distance rather than the probability of a certain type of attack, in case some attack type was not seen in training but was seen in test.

For example, given the way that we split the data, we have `rootkit`, `ftp_write`, `warezmaster`, `multihop`, `spy`, `phf` attacks that exist in either the training or the test data, but not in both (you can see this for yourself by uncommenting the two code cells below and running them). Types that exist in the training data alone are less of an issue, but if a type exists only in the test data then we may get nonsense classifications for a class that we have not seen. One solution may be to stratify the data, but as seen above when listing the attack types, attacks like `phf`, `perl`, and `spy` appear fewer than 5 times each, so even a stratified sample won't be very representative if we take a 1:9 test:train split. Alternatively, a selection process can be used wherein the training data takes the first occurrence of each connection type, which is then added on top of a stratified sample. That way the training data will always consist of 90% of the data plus 23 datapoints, and will always contain all connection types. This, however, assumes that our dataset is inclusive of all possible attack types, and introducing new data with new attacks would mean our model may not generalise very well.
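The selection process described above could be sketched roughly as follows. This is an illustration on toy data, not code from the original notebooks: the helper `split_with_all_types` is hypothetical, and it stratifies the remainder by category (an assumption made here, since stratifying by type fails for singleton types).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_all_types(df, type_col, strat_col, test_size, seed=42):
    """Reserve the first row of every type for training, then split the rest."""
    first = df.groupby(type_col, sort=False).head(1)  # one row per connection type
    rest = df.drop(first.index)                       # assumes a unique row index
    train_rest, test = train_test_split(
        rest, test_size=test_size, random_state=seed, stratify=rest[strat_col]
    )
    return pd.concat([first, train_rest]), test

# Toy data in which 'spy' occurs only once.
toy = pd.DataFrame({
    "connection_type": ["smurf"] * 10 + ["normal"] * 10 + ["warezclient"] * 5 + ["spy"],
    "connection_category": ["dos"] * 10 + ["normal"] * 10 + ["r2l"] * 6,
})

train, test = split_with_all_types(
    toy, "connection_type", "connection_category", test_size=0.1
)
print(sorted(train.connection_type.unique()))
```

Every type is guaranteed to appear in training, but a singleton such as `spy` then never appears in the test set, so performance on it cannot be measured.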

A model selection approach to resolve this would be to create a neural network that outputs some sort of distance from each class rather than a probability of belonging to a class, then to assign a new observation to an `other` class if its distance from all classes exceeds a certain threshold. This, however, falls outside the scope of this assessment, so we do not attempt it unless we find ourselves with enough time to try this implementation.
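To make the thresholding idea concrete, here is a minimal sketch that swaps the neural network for a much simpler stand-in: Euclidean distance to per-class centroids in feature space. The function names, toy points and threshold are all illustrative assumptions; a learned embedding would replace the raw features in a real implementation.

```python
import numpy as np

def fit_centroids(X, y):
    """Compute one centroid (mean feature vector) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_with_other(X, classes, centroids, threshold):
    """Assign the nearest class, or 'other' if every centroid is too far away."""
    # Pairwise distances, shape (n_samples, n_classes).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = classes[dists.argmin(axis=1)].astype(object)
    labels[dists.min(axis=1) > threshold] = "other"
    return labels

# Toy training data with two well-separated classes.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y_toy = np.array(["a", "a", "b", "b"])
classes, centroids = fit_centroids(X_toy, y_toy)

# The last query point is far from both centroids.
queries = np.array([[0.0, 0.4], [10.0, 10.6], [50.0, 50.0]])
predicted = predict_with_other(queries, classes, centroids, threshold=2.0)
print(predicted)
```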

We started working on the code in a previous iteration of this file (which is why some of the variables have different names) but this was abandoned quickly for the reasons mentioned above.

In [20]:
# #create multiclass category for all connection types 'normal','buffer_overflow','perl'...
# y_train_allclass = train_data.connection_type.copy()
# y_test_allclass = test_data.connection_type.copy()
# #as above, factorize then turn into tf.keras categorical
# y_train_allclass, train_classes = pd.factorize(y_train_allclass)
# y_test_allclass, test_classes = pd.factorize(y_test_allclass)
# #turn into tf.keras categorical
# y_train_allclass = to_categorical(y_train_allclass, num_classes = len(train_classes))
# y_test_allclass = to_categorical(y_test_allclass, num_classes = len(test_classes))

In [21]:
# all_classes = set(list(train_classes)+list(test_classes))
# inc_classes = [c for c in train_classes if c in test_classes]
# exc_classes = [c for c in all_classes if c not in inc_classes]

# exc_classes

## References for Section 2

- [1] KDD Cup 1999 http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
- [2] Assessment 1 https://github.com/billnunn/Assessment-1-Bill-Mo