## Importing

In [1]:
import pandas as pd
import numpy as np

## Loading Data

Loading pre-processed data from [Assessment 1](https://github.com/billnunn/Assessment-1-Bill-Mo). This was a fairly clean dataset, and the pre-processing was done thoroughly, with appropriate manual scaling, before the data was run through RandomForest classifiers. The data should remain appropriate for a DNN, though it may not be complex enough to require one as a classifier (particularly as RandomForests already did a good job in the assessment above).

Nevertheless, this dataset should still allow for a comparison between optimizers, which is the scientific question that this report asks.

In [2]:
df = pd.read_csv('../data/kdd_log_df.csv')
df.head()

Unnamed: 0,duration,src_bytes,dst_bytes,urgent,hot,num_failed_logins,num_compromised,num_root,num_file_creations,num_access_files,...,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH,connection_type,connection_category
0,0.0,5.204007,8.603554,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
1,0.0,5.480639,6.188264,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
2,0.0,5.463832,7.198931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
3,0.0,5.393628,7.198931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal
4,0.0,5.384495,7.617268,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,normal,normal


## KFold

In [3]:
# np.random.seed(42)
# from sklearn.model_selection import KFold
# kf = KFold(n_splits=10,shuffle=True)
# kfsplit=kf.split(df)
# ## We're going to extract out the "test" dataset from the first fold, to do our testing on
# # kf.split returns an iterator, i.e. it creates a function that creates a test/split
# # which we can either loop over or get the first using next
# ninefolds,onefold = next(kfsplit) 
# train_data = df.loc[ninefolds]
# test_data = df.loc[onefold]

Examine the labels for our dataset. We have 23 connection types grouped into 5 connection categories.

In [4]:
num_type = df.connection_type.nunique()
num_cat = df.connection_category.nunique()
print('Total of {} connection types and {} connection categories'.format(num_type,num_cat))
print()
df[['connection_type','connection_category']].value_counts()

Total of 23 connection types and 5 connection categories



connection_type  connection_category
smurf            dos                    280790
neptune          dos                    107201
normal           normal                  97278
back             dos                      2203
satan            probe                    1589
ipsweep          probe                    1247
portsweep        probe                    1040
warezclient      r2l                      1020
teardrop         dos                       979
pod              dos                       264
nmap             probe                     231
guess_passwd     r2l                        53
buffer_overflow  u2r                        30
land             dos                        21
warezmaster      r2l                        20
imap             r2l                        12
rootkit          u2r                        10
loadmodule       u2r                         9
ftp_write        r2l                         8
multihop         r2l                         7
phf              r2l   

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
df.columns[-2:]

Index(['connection_type', 'connection_category'], dtype='object')

In [7]:
features = df[df.columns[:-2]]
targets = df[df.columns[-2:]]

In [11]:
X_train, X_test, y_train, y_test = train_test_split(features, targets,
                                                    test_size=0.5, 
                                                    random_state = 42, shuffle = True, 
                                                    stratify = targets)

In [12]:
X_train.head()

Unnamed: 0,duration,src_bytes,dst_bytes,urgent,hot,num_failed_logins,num_compromised,num_root,num_file_creations,num_access_files,...,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
360984,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
112306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
404506,0.0,6.25575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
222435,0.0,6.940222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
234492,0.0,6.940222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [13]:
y_train.head()

Unnamed: 0,connection_type,connection_category
360984,neptune,dos
112306,neptune,dos
404506,smurf,dos
222435,smurf,dos
234492,smurf,dos


In [15]:
y_train.connection_type.nunique(), y_test.connection_type.nunique()

(23, 23)

In [48]:
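# count each (connection_type, connection_category) pair once per split
train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
# per-class difference in row counts between the train and test halves
differences = [(ind, train_counts[ind] - test_counts[ind]) for ind in train_counts.index]

for a in differences:
    print('Index :', a[0][0], a[0][1], 'has difference:', a[1])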

Index : smurf dos has difference: 0
Index : neptune dos has difference: -1
Index : normal normal has difference: 0
Index : back dos has difference: -1
Index : satan probe has difference: -1
Index : ipsweep probe has difference: -1
Index : portsweep probe has difference: 0
Index : warezclient r2l has difference: 0
Index : teardrop dos has difference: -1
Index : pod dos has difference: 0
Index : nmap probe has difference: -1
Index : guess_passwd r2l has difference: 1
Index : buffer_overflow u2r has difference: 0
Index : land dos has difference: 1
Index : warezmaster r2l has difference: 0
Index : imap r2l has difference: 0
Index : rootkit u2r has difference: 0
Index : loadmodule u2r has difference: 1
Index : ftp_write r2l has difference: 0
Index : multihop r2l has difference: 1
Index : perl u2r has difference: 1
Index : phf r2l has difference: 0
Index : spy r2l has difference: 0


In [34]:
y_train.connection_category.value_counts()

dos       195728
normal     48639
probe       2052
r2l          564
u2r           27
Name: connection_category, dtype: int64

In [53]:
X_train_small, X_val, y_train_small, y_val = train_test_split(X_train, y_train, 
                                                             test_size = 0.8,
                                                             random_state = 42, shuffle = True,
                                                             stratify = y_train.connection_category)

In [56]:
#turn target into binary with 0 = normal, 1 = attack
y_train_bin = y_train_small.connection_category.apply(lambda x: 0 if x == 'normal' else 1)
y_val_bin = y_val.connection_category.apply(lambda x: 0 if x == 'normal' else 1)
y_test_bin = y_test.connection_category.apply(lambda x: 0 if x == 'normal' else 1)

In [59]:
#create multi category classifier variable e.g. normal,u2r,dos... 
#these will get loaded slightly differently into the other files
y_train_multi = y_train_small.connection_category
y_val_multi = y_val.connection_category
y_test_multi = y_test.connection_category

In [None]:
# #last 2 columns are labels so we do not grab them for our X
# X_train = train_data[train_data.columns[:-2]]
# X_test = test_data[test_data.columns[:-2]]


# #create binary label arrays for attack/not attack 
# y_train_bin = train_data.connection_category.copy()
# y_test_bin = test_data.connection_category.copy()
# #turn our labels into 0 = not attack, 1 = attack
# y_test_bin = y_test_bin.apply(lambda x: 0 if x=='normal' else 1)
# y_train_bin = y_train_bin.apply(lambda x: 0 if x=='normal' else 1)

# #create multi category classifier e.g. normal,u2r,dos... 
# #these will get loaded slightly differently into the other files
# y_train_multi =  train_data.connection_category.copy()
# y_test_multi =  test_data.connection_category.copy()

In [60]:
# forward slashes match the read_csv path above and work on every OS
X_train_small.to_csv('../data/X_train.csv', index=False)
X_val.to_csv('../data/X_val.csv', index=False)
X_test.to_csv('../data/X_test.csv', index=False)

y_train_bin.to_csv('../data/y_train_bin.csv', index=False)
y_val_bin.to_csv('../data/y_val_bin.csv', index=False)
y_test_bin.to_csv('../data/y_test_bin.csv', index=False)

y_train_multi.to_csv('../data/y_train_multi.csv', index=False)
y_val_multi.to_csv('../data/y_val_multi.csv', index=False)
y_test_multi.to_csv('../data/y_test_multi.csv', index=False)

The code below is for `connection_type` classification, but that would need more tweaking of the neural network's architecture. The network would have to output some form of distance rather than the probability of a certain type of attack, in case an attack type was not seen in training but appears in the test data.

For example, given the way the KFold split (commented out above) divided the data, the `rootkit`, `ftp_write`, `warezmaster`, `multihop`, `spy`, and `phf` attacks exist in either the training or the test data, but not in both (you can see this for yourself by un-commenting the two code cells below and running them). A class that exists in the training data alone is less of an issue, but if it exists only in the test data then we may get a nonsense classification for a class that the model has never seen. One solution may be to stratify the data, but as seen above when listing the attack types, attacks like `phf`, `perl`, and `spy` appear fewer than 5 times each, so even a stratified sample won't be very representative with a 1:9 test:train split.

Alternatively, a selection process can be used in which the training data takes the first occurrence of each connection type, with a stratified sample added on top. The training data would then always consist of a 90% stratified sample plus 23 datapoints, and would always contain every connection type. This, however, assumes that our dataset includes all possible attack types; new data containing new attacks could mean that our model does not generalise very well.
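The selection process described above could be sketched roughly as follows. This is an illustration only, not code that was run for this report: the helper name `split_keeping_all_types` and the toy class counts are made up, and the remainder is stratified on `connection_category` (an assumption), because sklearn's stratified split rejects any class with fewer than two rows, which would rule out stratifying on the rarest connection types directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_keeping_all_types(df, type_col="connection_type",
                            strat_col="connection_category",
                            test_size=0.1, random_state=42):
    """Hypothetical helper: force one row of every type into the training set."""
    # first occurrence of each connection type always goes to training
    first_rows = df.groupby(type_col, sort=False).head(1)
    remainder = df.drop(index=first_rows.index)
    # stratify the rest on the coarser category column (assumed choice)
    rest_train, test = train_test_split(remainder, test_size=test_size,
                                        random_state=random_state, shuffle=True,
                                        stratify=remainder[strat_col])
    train = pd.concat([first_rows, rest_train])
    return train, test


# toy data with made-up counts, including very rare types
toy = pd.DataFrame({
    "connection_type": ["smurf"] * 20 + ["normal"] * 20 + ["satan"] * 9
                       + ["warezclient"] * 8 + ["phf"] * 2 + ["spy"],
    "connection_category": ["dos"] * 20 + ["normal"] * 20 + ["probe"] * 9
                           + ["r2l"] * 11,
})
toy["src_bytes"] = range(len(toy))

train_toy, test_toy = split_keeping_all_types(toy)
print(train_toy.connection_type.nunique(), test_toy.connection_type.nunique())
```

Every type that reaches the test set is then guaranteed to have at least one training example, although the rarest types may not reach the test set at all.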

A model selection approach to resolve this would be to create a neural network that outputs some sort of distance from each class rather than a probability of belonging to a class, and then to assign a new observation to an `other` class if its distance from all classes exceeds a certain threshold. This, however, feels like it falls outside the scope of this assessment, so we do not attempt it unless we find ourselves with enough time to try the implementation.
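To make the thresholding idea concrete, here is a minimal sketch that swaps the neural network for a plain nearest-centroid rule in feature space. It is a stand-in, not the architecture discussed above: the function names, the Euclidean distance, the toy points, and the threshold of 2.0 are all assumptions chosen for illustration.

```python
import numpy as np


def fit_centroids(X, y):
    """Mean feature vector per class (stand-in for a learned embedding)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_with_other(X, classes, centroids, threshold, other_label="other"):
    """Assign the nearest class, or `other_label` if every class is too far."""
    # Euclidean distance from every sample to every centroid: (n_samples, n_classes)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = classes[dists.argmin(axis=1)].astype(object)
    # samples farther than the threshold from all classes become 'other'
    labels[dists.min(axis=1) > threshold] = other_label
    return labels


# toy 2-D features with two known classes (made-up values)
X_known = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y_known = np.array(["normal", "normal", "dos", "dos"])
classes, centroids = fit_centroids(X_known, y_known)

# the last point is far from both centroids, so it should fall into 'other'
X_new = np.array([[0.0, 0.4], [10.0, 10.0], [50.0, -50.0]])
pred = predict_with_other(X_new, classes, centroids, threshold=2.0)
print(list(pred))
```

In a real model the distances would be computed in a learned representation and the threshold tuned on validation data; both are left open here.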

The `connection_type` code is left below, commented out, for the time being.

In [None]:
# #create multiclass category for all connection types 'normal','buffer_overflow','perl'...
# from tensorflow.keras.utils import to_categorical
# y_train_allclass = train_data.connection_type.copy()
# y_test_allclass = test_data.connection_type.copy()
# #factorize, then turn into tf.keras categorical
# #NOTE: factorizing train and test separately can give the same class different
# #integer codes in each split; a shared mapping is needed before comparing them
# y_train_allclass, train_classes = pd.factorize(y_train_allclass)
# y_test_allclass, test_classes = pd.factorize(y_test_allclass)
# #turn into tf.keras categorical
# y_train_allclass = to_categorical(y_train_allclass, num_classes = len(train_classes))
# y_test_allclass = to_categorical(y_test_allclass, num_classes = len(test_classes))

In [None]:
# all_classes = set(list(train_classes)+list(test_classes))
# inc_classes = [c for c in train_classes if c in test_classes]
# exc_classes = [c for c in all_classes if c not in inc_classes]

# exc_classes