# Reading this section

Read the documents in numerical order. The appendix contains additional material used in the assessment, such as shell scripts and other useful files.

# Preparing our data in Python for use in Neural Networks.

Before we can compile and fit a neural network that classifies attacks, we first need to pre-process our data into a format suitable for the TensorFlow package.

Throughout the report you will find similar methods used for each network type that we attempted to implement. These data processing techniques were shared amongst the group in order to achieve more consistent results. The most important step is splitting the data for training and testing, where those of us using Python made sure to specify the same `random_state` value in order to produce the same split.
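As a minimal sketch of why a shared `random_state` matters, the toy example below shows that two independent calls with the same seed return identical partitions, provided the input rows are in the same order. The array `data` is an illustrative stand-in, not the project's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix (illustrative only).
data = np.arange(20).reshape(10, 2)

# Two independent calls, e.g. made by two group members on different machines.
train_a, test_a = train_test_split(data, test_size=0.1, random_state=10)
train_b, test_b = train_test_split(data, test_size=0.1, random_state=10)

# The same seed on the same input gives the same partition.
same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_split)
```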

We each also recreate these steps in the `.py` file that we submit to the HPC, in order to avoid package discrepancy issues. At first I tried to save my training and testing data as pickle files, copy them to the HPC and load them from a Python script. Unfortunately, due to discrepancies between Python package versions, this proved to be more trouble than it was worth, so most of this data processing was ported to the final Python script I submitted to the HPC.

We start by importing important packages:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import datetime as dt
import squarify
import tensorflow as tf
from tensorflow.keras import layers
import sklearn
from sklearn.model_selection import train_test_split
import gzip
from urllib.request import urlopen

Now we import our data from the GitHub repository, where it is stored in ZIP format to save space. The data is split into 4 files, as the original data was.

In [None]:
start = dt.datetime.now()
print("Reading df1")
df1 = pd.read_csv("https://github.com/Galeforse/DST-Assessment-04/raw/main/Data/UNS1.zip",header=None)
print("Reading df2")
df2 = pd.read_csv("https://github.com/Galeforse/DST-Assessment-04/raw/main/Data/UNS2.zip",header=None)
print("Reading df3")
df3 = pd.read_csv("https://github.com/Galeforse/DST-Assessment-04/raw/main/Data/UNS3.zip",header=None)
print("Reading df4")
df4 = pd.read_csv("https://github.com/Galeforse/DST-Assessment-04/raw/main/Data/UNS4.zip",header=None)
print("Data fetched in:" ,dt.datetime.now()-start)
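The reads above rely on pandas inferring the compression from the `.zip` extension, which works when the archive contains a single CSV file. As a small offline sketch of the same call, the example below builds a stand-in archive first. The file names and values are hypothetical, not the project's data.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a small stand-in archive (hypothetical names and values).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "sample.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("sample.csv", "10.0.0.1,80,tcp\n10.0.0.2,443,udp\n")

# The ".zip" extension lets pandas infer the compression; header=None keeps
# the first row as data and labels the columns 0, 1, 2, ...
sample = pd.read_csv(zip_path, header=None)
print(sample.shape)
```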

We now concatenate the data frames (ignoring the index in order to avoid any issues that may arise from duplicate indexes) and then add column names to the data. These column names are adapted from the information provided alongside the data set on its website.

In [None]:
df = pd.concat([df1,df2,df3,df4],ignore_index=True)
df.columns = ['source_ip', 'source_port', 'dest_ip', 'dest_port', 'proto', 'state', 'duration', 'source_bytes', 'dest_bytes', 'source_ttl',
             'dest_ttl', 'source_loss', 'dest_loss', 'service', 'source_load', 'dest_load', 'source_pkts', 'dest_pkts', 'source_TP_win', 'dest_TP_win', 
             'source_tcp_bn', 'dest_tcp_bn', 'source_mean_sz', 'dest_mean_sz', 'trans_depth', 'res_bdy_len', 'source_jitter', 'dest_jitter', 'start_time',
             'last_time', 'source_int_pk_time', 'dest_int_pk_time', 'tcp_rtt', 'synack', 'ackdat', 'is_sm_ips_ports', 'count_state_ttl', 
             'count_flw_http_mthd', 'is_ftp_login', 'count_ftp_cmd', 'count_srv_source', 'count_srv_dest', 'count_dest_ltm',
             'count_source_ltm', 'count_source_destport_ltm', 'count_dest_sourceport_ltm', 'counts_dest_source_ltm', 'attack_cat', 'Label']
df.head(10)
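To illustrate the duplicate-index issue that `ignore_index=True` avoids, here is a minimal sketch on two toy frames (`part1` and `part2` are illustrative names only).

```python
import pandas as pd

part1 = pd.DataFrame({"x": [1, 2]})
part2 = pd.DataFrame({"x": [3, 4]})

# Without ignore_index each frame keeps its own labels, so 0 and 1 repeat.
kept = pd.concat([part1, part2])

# With ignore_index=True the result is relabelled 0..n-1.
reset = pd.concat([part1, part2], ignore_index=True)

print(kept.index.is_unique, reset.index.is_unique)
```

With repeated labels, a lookup such as `kept.loc[0]` returns two rows rather than one, which is the kind of surprise we want to rule out.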

Next we are going to check some of the features in our data. We check the size of the data frame and then split its columns into three slices, as it is impossible to view all 49 columns in one view.

In [None]:
df.shape

In [None]:
df_1 = df.iloc[:,0:16]
df_2 = df.iloc[:,16:32]
df_3 = df.iloc[:,32:49]
df_1.head()

In [None]:
df_2.head()

In [None]:
df_3.head()

We briefly check that our slices are the size we expect them to be (same number of rows as the main data frame, and columns adding up to 49).

In [None]:
print("shape of 1st slice:")
print(df_1.shape)
print("shape of 2nd slice:")
print(df_2.shape)
print("shape of 3rd slice:")
print(df_3.shape)

We will describe each slice to check for any anomalous data values such as `NaN` and `inf`.

In [None]:
df_1.describe()

In [None]:
df_2.describe()

In [None]:
df_3.describe()
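Reading `describe()` output by eye can miss things, so a direct count may be a useful complement. The sketch below, on a toy frame with illustrative column names, counts `NaN` and `inf` values per column; it is an optional extra check rather than part of the pipeline above.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN, one inf and a string column (illustrative only).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [0.5, np.inf, 2.0],
    "c": ["x", "y", "z"],
})

# NaN counts work on every column.
nan_counts = toy.isna().sum()

# np.isinf only makes sense for numeric columns, so select those first.
numeric = toy.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()

print(nan_counts)
print(inf_counts)
```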

We want to check that our data is suitable for use on different types of machine: most PCs are 64 bit, whereas the HPC runs a 32 bit system, and we don't want any errors to arise from this difference.

We start by defining the maximum value of each variable as follows:

In [None]:
dfmax = df.max()
dfmax

We define the following function to iterate through a data set and check that every numeric value fits within the floating-point limit for 64 bit first (more likely to pass than 32 bit) and then for 32 bit. We then run the function on our `dfmax`, as it holds the maximum of each column.

In [None]:
def bitcheckmax(data):
    # Check the float64 limit first (more likely to pass), then float32.
    for bits, dtype in ((64, np.float64), (32, np.float32)):
        # Initialise once per pass so the counter and flag persist across entries.
        count = 0
        fail = False
        for i in data:
            if isinstance(i, str):
                continue  # string maxima have no numeric limit to check
            count = count + 1
            # "not <=" also catches NaN, which compares False against any limit
            if not (float(i) <= np.finfo(dtype).max):
                print("Fails " + str(bits) + " bit check at row: " + str(count))
                fail = True
                break
        if not fail:
            print("Passes " + str(bits) + " bit check.")

In [None]:
bitcheckmax(dfmax)
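To make the limits concrete, the sketch below prints nothing project-specific: it compares two illustrative values against `np.finfo(np.float32).max` and shows what happens when a value beyond that limit is cast to 32 bit.

```python
import numpy as np

f32_max = np.finfo(np.float32).max  # roughly 3.4e38
f64_max = np.finfo(np.float64).max  # roughly 1.8e308

# Illustrative values: the second exceeds the float32 range.
values = np.array([1.0e10, 1.0e39])

# Casting an out-of-range value to float32 overflows to inf.
with np.errstate(over="ignore"):  # silence the overflow warning for the demo
    as_f32 = values.astype(np.float32)

fits_f32 = np.abs(values) <= f32_max
print(fits_f32, np.isinf(as_f32))
```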

We are interested in classifying attack types. We notice from our earlier look at the data that the `attack_cat` column for normal traffic is NaN (as it obviously isn't an attack, it has no attack category!). We therefore fill in these missing values with the label "Normal", to designate normal traffic.

In [None]:
df['attack_cat'] = df['attack_cat'].fillna('Normal')
df.head()

In [None]:
attackcount = pd.DataFrame(df['attack_cat'].value_counts())
attackcount

Above we check how many of each attack are present in the data. We notice a problem with the strings used to represent certain attacks, but we'll address this later in the processing. For now we make a visual plot of the representation of each attack type in our data.

In [None]:
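# Labels come from the index; counts come from the first (only) column, whose
# name depends on the pandas version ("attack_cat" in older releases, "count" in 2.x).
ac = attackcount.index.tolist()
an = attackcount.iloc[:, 0].tolist()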

In [None]:
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(18,8)
squarify.plot(label=ac,sizes=an, color = ["cyan","magenta","yellow","lime"])
plt.axis("off")
plt.show()

We will now deal with missing values in our data. We check which columns contain missing values and then fill them with 0, as this seems to be the easiest course of action.

In [None]:
l = []
colnames = df.columns

for name in colnames:
    if df[name].isnull().values.any():
        l.append(name)
        
print('The columns with missing values in them are: ' + str(l))

In addition to the above, analysis by others in the group showed that the `count_ftp_cmd` column is also missing data. These missing values are stored not as NaN but as blank single-space strings (" "), so we include this column in our processing even though, to Python, it does not appear to have any missing data.

In [None]:
print("percentage of missing data in count_flw_http_mthd column:" + str(df["count_flw_http_mthd"].isnull().sum()*100/len(df)))
print("percentage of missing data in is_ftp_login column:" + str(df["is_ftp_login"].isnull().sum()*100/len(df)))
print("percentage of missing data in count_ftp_cmd column:" + str(df["count_ftp_cmd"].isnull().sum()*100/len(df)))

In [None]:
df = df.fillna(0)
print("percentage of missing data in count_flw_http_mthd column:" + str(df["count_flw_http_mthd"].isnull().sum()*100/len(df)))
print("percentage of missing data in is_ftp_login column:" + str(df["is_ftp_login"].isnull().sum()*100/len(df)))
print("percentage of missing data in count_ftp_cmd column:" + str(df["count_ftp_cmd"].isnull().sum()*100/len(df)))

As `count_ftp_cmd` is numeric data, we apply the pandas function `pd.to_numeric` with the `errors="coerce"` parameter, which coerces the blank spaces (which count as strings) into NaN values. These then show up when we rerun the same checks.

In [None]:
df['count_ftp_cmd'] = pd.to_numeric(df['count_ftp_cmd'], errors="coerce")
print("percentage of missing data in count_flw_http_mthd column:" + str(df["count_flw_http_mthd"].isnull().sum()*100/len(df)))
print("percentage of missing data in is_ftp_login column:" + str(df["is_ftp_login"].isnull().sum()*100/len(df)))
print("percentage of missing data in count_ftp_cmd column:" + str(df["count_ftp_cmd"].isnull().sum()*100/len(df)))
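The effect is easier to see on a small example. In the sketch below, `col` is a toy stand-in for `count_ftp_cmd`: the blanks are invisible to `isnull()` until they have been coerced.

```python
import pandas as pd

# Toy stand-in for count_ftp_cmd: numbers mixed with single-space strings.
col = pd.Series([0, " ", 2, " ", 1], dtype=object)

# Blanks are ordinary strings, so isnull() does not flag them.
missing_before = col.isnull().sum()

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
coerced = pd.to_numeric(col, errors="coerce")
missing_after = coerced.isnull().sum()

print(missing_before, missing_after)
```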

We fill the NAs with 0 again.

In [None]:
df = df.fillna(0)
print("percentage of missing data in count_flw_http_mthd column:" + str(df["count_flw_http_mthd"].isnull().sum()*100/len(df)))
print("percentage of missing data in is_ftp_login column:" + str(df["is_ftp_login"].isnull().sum()*100/len(df)))
print("percentage of missing data in count_ftp_cmd column:" + str(df["count_ftp_cmd"].isnull().sum()*100/len(df)))

Now we deal with the problem we noticed earlier: some of the attack categories are duplicated because their strings differ in whitespace or pluralisation. Based on our earlier findings, we write the following block of code to fix this. We will then see that there are no longer any duplicates, and we can repeat our visualisation.

In [None]:
df['attack_cat'] = df['attack_cat'].map({'Normal': 'Normal', 'Exploits': 'Exploits', ' Fuzzers ': 'Fuzzers', 'DoS': 'DoS',
                                          ' Reconnaissance ': 'Reconnaissance', ' Fuzzers': 'Fuzzers', 'Analysis': 'Analysis',
                                         'Backdoor': 'Backdoor', 'Reconnaissance': 'Reconnaissance',  ' Shellcode ': 'Shellcode',
                                         'Backdoors': 'Backdoor', 'Shellcode': 'Shellcode',  'Worms': 'Worms', 'Generic': 'Generic'})
df.groupby('attack_cat').size()
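One thing to be aware of with `.map` and a dictionary is that any value missing from the dictionary becomes NaN. As a hedged alternative sketch (not the approach used above), stripping whitespace and replacing only the plural spelling leaves unlisted values untouched. The series `raw` is a toy sample of the labels.

```python
import pandas as pd

# Toy sample of the inconsistent labels (illustrative only).
raw = pd.Series([" Fuzzers ", " Fuzzers", "Backdoors", "Backdoor", "Normal", " Shellcode "])

# Alternative: strip surrounding whitespace, then merge the plural spelling.
cleaned = raw.str.strip().replace({"Backdoors": "Backdoor"})
print(sorted(cleaned.unique()))

# With .map, any label absent from the dictionary silently becomes NaN,
# so counting NaN afterwards is a cheap safety check.
partial = {"Normal": "Normal", "Backdoor": "Backdoor"}
unmapped = raw.map(partial).isna().sum()
print(unmapped)
```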

In [None]:
attackcount = pd.DataFrame(df['attack_cat'].value_counts())
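# Labels from the index, counts from the first (only) column, independent of
# the column name that the installed pandas version assigns.
ac = attackcount.index.tolist()
an = attackcount.iloc[:, 0].tolist()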
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(18,8)
squarify.plot(label=ac,sizes=an, color = ["cyan","magenta","yellow","lime"])
plt.axis("off")
plt.show()

For neural network prediction we want our data to be numeric; however, when looking at the data types in the next block, we see that several different types are present. We use dictionary mappings to convert these columns into integer codes, which are more appropriate. (We also drop the `Label` column here, as it is not useful.)

In [None]:
df.dtypes

In [None]:
df = df.drop('Label',axis=1)
df_source_ip = pd.DataFrame(df['source_ip'])
df_source_port = pd.DataFrame(df['source_port'])
df_dest_ip = pd.DataFrame(df['dest_ip'])
df_dest_port = pd.DataFrame(df['dest_port'])
df_proto = pd.DataFrame(df['proto'])
df_state = pd.DataFrame(df['state'])
df_service = pd.DataFrame(df['service'])
df_count_ftp_cmd = pd.DataFrame(df['count_ftp_cmd'])
df_attack_cat = pd.DataFrame(df['attack_cat'])

# we now create dictionaries to allow us to map onto the data frame

sips = df.source_ip.unique()
sip_dict = dict(zip(sips,range(len(sips))))

sp = df.source_port.unique()
sp_dict = dict(zip(sp,range(len(sp))))
               
dips = df.dest_ip.unique()
dip_dict = dict(zip(dips,range(len(dips))))

dp = df.dest_port.unique()
dp_dict = dict(zip(dp,range(len(dp))))

p = df.proto.unique()
p_dict = dict(zip(p,range(len(p))))

states = df.state.unique()
state_dict = dict(zip(states,range(len(states))))

services = df.service.unique()
service_dict = dict(zip(services,range(len(services))))

cfc = df.count_ftp_cmd.unique()
cfc_dict = dict(zip(cfc,range(len(cfc))))

ac = df.attack_cat.unique()
ac_dict = dict(zip(ac,range(len(ac))))

df['source_ip_int'] = df['source_ip'].map(sip_dict)
df['source_port_int'] = df['source_port'].map(sp_dict)
df['dest_ip_int'] = df['dest_ip'].map(dip_dict)
df['dest_port_int'] = df['dest_port'].map(dp_dict)
df['proto_int'] = df['proto'].map(p_dict)
df['state_int'] = df['state'].map(state_dict)
df['service_int'] = df['service'].map(service_dict)
df['count_ftp_cmd_int'] = df['count_ftp_cmd'].map(cfc_dict)
df['attack_cat_int'] = df['attack_cat'].map(ac_dict)

# drop the original columns now that their integer encodings exist
df = df.drop(columns=['source_ip', 'source_port', 'dest_ip', 'dest_port', 'proto',
                      'state', 'service', 'count_ftp_cmd', 'attack_cat'])

df.dtypes
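The dictionary mappings above assign each distinct value an integer in order of first appearance. As a sketch on a toy column, the example below shows that `pd.factorize` produces the same codes in one call. The series `proto` is illustrative, and this is offered as an equivalent shorthand rather than what the notebook runs.

```python
import pandas as pd

# Toy column standing in for one of the categorical features.
proto = pd.Series(["tcp", "udp", "tcp", "arp", "udp"])

# Dictionary approach: codes follow the order of first appearance.
values = proto.unique()
mapping = dict(zip(values, range(len(values))))
codes_from_dict = proto.map(mapping)

# pd.factorize assigns codes in the same first-appearance order.
codes, uniques = pd.factorize(proto)

print(codes_from_dict.tolist(), codes.tolist())
```

Either way the codes are arbitrary labels: the ordering of the integers carries no meaning, which is worth bearing in mind when they are fed to a network as numeric inputs.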

Just in case, we'll once again check for missingness.

In [None]:
l = []
colnames = df.columns

for name in colnames:
    if df[name].isnull().values.any():
        l.append(name)
        
print('The columns with na/nan values in them are: ' + str(l))

It is good practice to scale our data so that certain features do not heavily weight the learning process. We will also separate out our `attack_cat` column, as this is what we are going to predict with our neural network.

In [None]:
from sklearn.preprocessing import StandardScaler

def preprocess(data, scaling=None):
    # np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
    data = data.astype(np.float64)
    if scaling is None:
        # no scaler supplied: fit a new one on this data
        scaling = StandardScaler()
        datat = scaling.fit_transform(data)
    else:
        # reuse a previously fitted scaler
        datat = scaling.transform(data)
    return datat, scaling

In [None]:
Y = df['attack_cat_int']
X = df.drop('attack_cat_int',axis=1)

In [None]:
X_scaled, scaling = preprocess(X.values)
print(X.shape)
print(X_scaled.shape)
print(Y.shape)
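As a quick sanity check of what `StandardScaler` does, the sketch below scales a toy matrix whose columns are on very different scales and confirms that each column ends up with mean 0 and standard deviation 1. The array `features` is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative only).
features = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Each column is centred on 0 and has unit (population) standard deviation.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```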

We now split our data into training and testing data, holding out 10% of the rows for testing.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size = 0.1, random_state = 10)
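In this notebook the scaler is fitted on the full data set before splitting. The optional `scaling` argument of `preprocess` also allows a fitted scaler to be reused, so a possible variant, sketched below on random toy data and not what was run here, is to split first, fit the scaler on the training rows only and apply it to the test rows. All names ending in `_toy`, `_tr` or `_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data standing in for the real features and labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split first, using the same settings as above.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=10)

# Fit the scaler on the training rows only, then reuse it for the test rows,
# so that no test-set statistics influence the transformation.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```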

The next 2 documents are the Python scripts that I submitted to the HPC. They repeat a lot of what has happened in this document and then proceed to define neural network models for classification. I will address these models in detail in the document after; the basics are that each one has a different layer configuration.

### References

[So much use of the SLURM documentation in order to understand the available function on the HPC.](https://slurm.schedmd.com/documentation.html)

[DST HPC documentation was a good start for using the HPC.](https://dsbristol.github.io/dst/coursebook/appendix5-bluecrystal.html)

[BlueCrystal Phase 4 Documentation.](https://www.acrc.bris.ac.uk/protected/bc4-docs/index.html)

[Tensorflow tutorial for classification.](https://www.tensorflow.org/tutorials/structured_data/feature_columns)