<a href="https://colab.research.google.com/github/fr0zenK/DVWA/blob/master/Thesis_23_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Import Libraries and set up Notebook**

In [None]:
import numpy as np
import pandas as pd

from IPython.display import display, HTML  # IPython.core.display is deprecated for these names
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio


import seaborn as sns
from importlib import reload
import matplotlib.pyplot as plt
import matplotlib
import warnings

# Configure Jupyter Notebook
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', 500) 
pd.set_option('display.expand_frame_repr', False)
# pd.set_option('max_colwidth', -1)
display(HTML("<style>div.output_scroll { height: 35em; }</style>"))

reload(plt)
%matplotlib inline
%config InlineBackend.figure_format ='retina'

warnings.filterwarnings('ignore')

# configure plotly graph objects
pio.renderers.default = 'iframe'
# pio.renderers.default = 'vscode'

pio.templates["ck_template"] = go.layout.Template(
    layout_colorway = px.colors.sequential.Viridis, 
#     layout_hovermode = 'closest',
#     layout_hoverdistance = -1,
    layout_autosize=False,
    layout_width=800,
    layout_height=600,
    layout_font = dict(family="Calibri Light"),
    layout_title_font = dict(family="Calibri"),
    layout_hoverlabel_font = dict(family="Calibri Light"),
#     plot_bgcolor="white",
)
 
# pio.templates.default = 'seaborn+ck_template+gridon'
pio.templates.default = 'ck_template+gridon'
# pio.templates.default = 'seaborn+gridon'
# pio.templates

# **Downloading datasets from Kaggle**
ref: https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

**! upload Kaggle JSON first !**

In [None]:
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! rm kaggle.json

## CSE-CIC-IDS2018
This dataset was created jointly by the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) in 2018. User profiles containing abstract representations of the different events were created, and for the generation of the dataset these profiles were combined with a unique set of features. It includes seven different attack scenarios: brute-force, Heartbleed, botnet, DoS, DDoS, web attacks, and infiltration of the network from inside.

In [None]:
! kaggle datasets download karenp/cse-cic-ids2018
! mkdir cse-cic-ids2018
! unzip -d cse-cic-ids2018 cse-cic-ids2018.zip
! rm cse-cic-ids2018.zip

## UNSW-NB15
This dataset was created by the Australian Centre for Cyber Security. It contains approximately two million records with a total of 49 features, which were extracted using Bro-IDS, Argus tools, and some newly developed algorithms. The attack types in this dataset are named Worms, Shellcode, Reconnaissance, Port Scans, Generic, Backdoor, DoS, Exploits, and Fuzzers.

In [None]:
! kaggle datasets download mrwellsdavid/unsw-nb15
! mkdir unsw-nb15
! unzip -d unsw-nb15 unsw-nb15.zip
! rm unsw-nb15.zip

Downloading unsw-nb15.zip to /content
 92% 137M/149M [00:01<00:00, 140MB/s]
100% 149M/149M [00:01<00:00, 130MB/s]
Archive:  unsw-nb15.zip
  inflating: unsw-nb15/NUSW-NB15_features.csv  
  inflating: unsw-nb15/UNSW-NB15_1.csv  
  inflating: unsw-nb15/UNSW-NB15_2.csv  
  inflating: unsw-nb15/UNSW-NB15_3.csv  
  inflating: unsw-nb15/UNSW-NB15_4.csv  
  inflating: unsw-nb15/UNSW-NB15_LIST_EVENTS.csv  
  inflating: unsw-nb15/UNSW_NB15_testing-set.csv  
  inflating: unsw-nb15/UNSW_NB15_training-set.csv  


Modelling ref: https://www.kaggle.com/code/carlkirstein/unsw-nb15-modelling-97-7/notebook

## **NF-UQ-NIDS-v2**

ref: https://staff.itee.uq.edu.au/marius/NIDS_datasets/#RA5

A larger and more universal NIDS dataset containing flows from multiple network setups and different attack settings. It includes an additional label feature identifying the original dataset of each flow, which can be used to compare the same attack scenarios conducted over two or more different test-bed networks.


In [None]:
! kaggle datasets download aryashah2k/nfuqnidsv2-network-intrusion-detection-dataset
! mkdir nfuqnidsv2-network-intrusion-detection-dataset
! unzip -d nfuqnidsv2-network-intrusion-detection-dataset nfuqnidsv2-network-intrusion-detection-dataset.zip
! rm nfuqnidsv2-network-intrusion-detection-dataset.zip

Downloading nfuqnidsv2-network-intrusion-detection-dataset.zip to /content
 99% 2.02G/2.04G [00:30<00:00, 71.7MB/s]
100% 2.04G/2.04G [00:30<00:00, 71.3MB/s]
Archive:  nfuqnidsv2-network-intrusion-detection-dataset.zip
  inflating: nfuqnidsv2-network-intrusion-detection-dataset/NF-UQ-NIDS-v2.csv  


### **Import the data and check quickly**

In [None]:
df = pd.read_csv('nfuqnidsv2-network-intrusion-detection-dataset/NF-UQ-NIDS-v2.csv')

In [None]:
df.info()

In [None]:
df.head(10)

In [None]:
df.describe(include='all')

# **Import the data and check quickly**

In [None]:
df = pd.read_csv('unsw-nb15/UNSW_NB15_training-set.csv')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 45 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   proto              82332 non-null  object 
 3   service            82332 non-null  object 
 4   state              82332 non-null  object 
 5   spkts              82332 non-null  int64  
 6   dpkts              82332 non-null  int64  
 7   sbytes             82332 non-null  int64  
 8   dbytes             82332 non-null  int64  
 9   rate               82332 non-null  float64
 10  sttl               82332 non-null  int64  
 11  dttl               82332 non-null  int64  
 12  sload              82332 non-null  float64
 13  dload              82332 non-null  float64
 14  sloss              82332 non-null  int64  
 15  dloss              82332 non-null  int64  
 16  sinpkt             823

In [None]:
df.head(10)

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,dload,sloss,dloss,sinpkt,dinpkt,sjit,djit,swin,stcpb,dtcpb,dwin,tcprtt,synack,ackdat,smean,dmean,trans_depth,response_body_len,ct_srv_src,ct_state_ttl,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,254,0,180363600.0,0.0,0,0,0.011,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,248,0,0,0,2,2,1,1,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,254,0,881000000.0,0.0,0,0,0.008,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,881,0,0,0,2,2,1,1,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,254,0,854400000.0,0.0,0,0,0.005,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,534,0,0,0,3,2,1,1,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,254,0,600000000.0,0.0,0,0,0.006,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,450,0,0,0,3,2,2,2,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,254,0,850400000.0,0.0,0,0,0.01,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,1063,0,0,0,3,2,2,2,1,3,0,0,0,2,3,0,Normal,0
5,6,3e-06,udp,-,INT,2,0,784,0,333333.3215,254,0,1045333000.0,0.0,0,0,0.003,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,392,0,0,0,2,2,2,2,1,2,0,0,0,2,2,0,Normal,0
6,7,6e-06,udp,-,INT,2,0,1960,0,166666.6608,254,0,1306667000.0,0.0,0,0,0.006,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,980,0,0,0,2,2,2,2,1,2,0,0,0,2,2,0,Normal,0
7,8,2.8e-05,udp,-,INT,2,0,1384,0,35714.28522,254,0,197714300.0,0.0,0,0,0.028,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,692,0,0,0,3,2,1,1,1,3,0,0,0,1,3,0,Normal,0
8,9,0.0,arp,-,INT,1,0,46,0,0.0,0,0,0.0,0.0,0,0,60000.688,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,46,0,0,0,2,2,2,2,2,2,0,0,0,2,2,1,Normal,0
9,10,0.0,arp,-,INT,1,0,46,0,0.0,0,0,0.0,0.0,0,0,60000.712,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,46,0,0,0,2,2,2,2,2,2,0,0,0,2,2,1,Normal,0


In [None]:
df.describe(include='all')

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,dload,sloss,dloss,sinpkt,dinpkt,sjit,djit,swin,stcpb,dtcpb,dwin,tcprtt,synack,ackdat,smean,dmean,trans_depth,response_body_len,ct_srv_src,ct_state_ttl,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
count,82332.0,82332.0,82332,82332,82332,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332,82332.0
unique,,,131,13,7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10,
top,,,tcp,-,FIN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,
freq,,,43095,47153,39339,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,37000,
mean,41166.5,1.006756,,,,18.666472,17.545936,7993.908,13233.79,82410.89,180.967667,95.713003,64549020.0,630547.0,4.753692,6.308556,755.394301,121.701284,6363.075,535.18043,133.45908,1084642000.0,1073465000.0,128.28662,0.055925,0.029256,0.026669,139.528604,116.275069,0.094277,1595.372,9.546604,1.369273,5.744923,4.928898,3.663011,7.45636,0.008284,0.008381,0.129743,6.46836,9.164262,0.011126,,0.5506
std,23767.345519,4.710444,,,,133.916353,115.574086,171642.3,151471.5,148620.4,101.513358,116.667722,179861800.0,2393001.0,64.64962,55.708021,6182.615732,1292.378499,56724.02,3635.305383,127.357,1390860000.0,1381996000.0,127.49137,0.116022,0.070854,0.055094,208.472063,244.600271,0.542922,38066.97,11.090289,1.067188,8.418112,8.389545,5.915386,11.415191,0.091171,0.092485,0.638683,8.543927,11.121413,0.104891,,0.497436
min,1.0,0.0,,,,1.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,,0.0
25%,20583.75,8e-06,,,,2.0,0.0,114.0,0.0,28.60611,62.0,0.0,11202.47,0.0,0.0,0.0,0.008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,57.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,,0.0
50%,41166.5,0.014138,,,,6.0,2.0,534.0,178.0,2650.177,254.0,29.0,577003.2,2112.951,1.0,0.0,0.557929,0.01,17.62392,0.0,255.0,27888860.0,28569750.0,255.0,0.000551,0.000441,8e-05,65.0,44.0,0.0,0.0,5.0,1.0,2.0,1.0,1.0,3.0,0.0,0.0,0.0,3.0,5.0,0.0,,1.0
75%,61749.25,0.71936,,,,12.0,10.0,1280.0,956.0,111111.1,254.0,252.0,65142860.0,15858.08,3.0,2.0,63.409444,63.136369,3219.332,128.459914,255.0,2171310000.0,2144205000.0,255.0,0.105541,0.052596,0.048816,100.0,87.0,0.0,0.0,11.0,2.0,6.0,4.0,3.0,6.0,0.0,0.0,0.0,7.0,11.0,0.0,,1.0


# **Pre-processing and Feature Selection**

The data quality report was generated for Post Block Assignment 1. This section will process and select the features in accordance with the recommendations of that report.

## **Drop irrelevant or excess features**

The first feature to drop is 'id'. This feature is an index and not descriptive.

The second feature to drop is 'attack_cat'. This feature is an extension of the target feature, so using it would leak the label into the inputs: the predictions would look close to 100% accurate, but the model would not generalize.

The other features to be dropped are those that were too strongly correlated. In this current version none of them were dropped, as the model is first evaluated to see how well it can perform.

In [None]:
list_drop = ['id','attack_cat']

In [None]:
df.drop(list_drop,axis=1,inplace=True)

## **Apply Clamping**

The extreme values should be pruned to reduce the skewness of some distributions. The logic applied here is that features with a maximum value more than ten times the median value are clamped to the 95th percentile. If the 95th percentile is close to the maximum, then the tail has more interesting information than what we want to discard.

Clamping is also only applied when the maximum itself is greater than 10. This prevents the bimodal and small-value distributions from being excessively pruned.

In [None]:
# Clamp extreme Values
df_numeric = df.select_dtypes(include=[np.number])
df_numeric.describe(include='all')

Unnamed: 0,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,dload,sloss,dloss,sinpkt,dinpkt,sjit,djit,swin,stcpb,dtcpb,dwin,tcprtt,synack,ackdat,smean,dmean,trans_depth,response_body_len,ct_srv_src,ct_state_ttl,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,label
count,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0
mean,1.006756,18.666472,17.545936,7993.908,13233.79,82410.89,180.967667,95.713003,64549020.0,630547.0,4.753692,6.308556,755.394301,121.701284,6363.075,535.18043,133.45908,1084642000.0,1073465000.0,128.28662,0.055925,0.029256,0.026669,139.528604,116.275069,0.094277,1595.372,9.546604,1.369273,5.744923,4.928898,3.663011,7.45636,0.008284,0.008381,0.129743,6.46836,9.164262,0.011126,0.5506
std,4.710444,133.916353,115.574086,171642.3,151471.5,148620.4,101.513358,116.667722,179861800.0,2393001.0,64.64962,55.708021,6182.615732,1292.378499,56724.02,3635.305383,127.357,1390860000.0,1381996000.0,127.49137,0.116022,0.070854,0.055094,208.472063,244.600271,0.542922,38066.97,11.090289,1.067188,8.418112,8.389545,5.915386,11.415191,0.091171,0.092485,0.638683,8.543927,11.121413,0.104891,0.497436
min,0.0,1.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,8e-06,2.0,0.0,114.0,0.0,28.60611,62.0,0.0,11202.47,0.0,0.0,0.0,0.008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,57.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0
50%,0.014138,6.0,2.0,534.0,178.0,2650.177,254.0,29.0,577003.2,2112.951,1.0,0.0,0.557929,0.01,17.62392,0.0,255.0,27888860.0,28569750.0,255.0,0.000551,0.000441,8e-05,65.0,44.0,0.0,0.0,5.0,1.0,2.0,1.0,1.0,3.0,0.0,0.0,0.0,3.0,5.0,0.0,1.0
75%,0.71936,12.0,10.0,1280.0,956.0,111111.1,254.0,252.0,65142860.0,15858.08,3.0,2.0,63.409444,63.136369,3219.332,128.459914,255.0,2171310000.0,2144205000.0,255.0,0.105541,0.052596,0.048816,100.0,87.0,0.0,0.0,11.0,2.0,6.0,4.0,3.0,6.0,0.0,0.0,0.0,7.0,11.0,0.0,1.0
max,59.999989,10646.0,11018.0,14355770.0,14657530.0,1000000.0,255.0,253.0,5268000000.0,20821110.0,5319.0,5507.0,60009.992,57739.24,1483831.0,463199.2401,255.0,4294950000.0,4294881000.0,255.0,3.821465,3.226788,2.928778,1504.0,1500.0,131.0,5242880.0,63.0,6.0,59.0,59.0,38.0,63.0,2.0,2.0,16.0,60.0,62.0,1.0,1.0


In [None]:
DEBUG = 0

for feature in df_numeric.columns:
    feature_max = df_numeric[feature].max()
    feature_median = df_numeric[feature].median()
    if DEBUG == 1:
        print(feature)
        print('max = '+str(feature_max))
        print('95th = '+str(df_numeric[feature].quantile(0.95)))
        print('median = '+str(feature_median))
        print(feature_max>10*feature_median)
        print('----------------------------------------------------')
    # Clamp only heavy-tailed features: max > 10x the median and max > 10
    if feature_max>10*feature_median and feature_max>10:
        q95 = df[feature].quantile(0.95)
        df[feature] = np.where(df[feature]<q95, df[feature], q95)
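The clamping rule can be tried on a small scale before running it on the full dataset. The sketch below is only an illustration: the function name `clamp_skewed`, its parameters, and the toy columns are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def clamp_skewed(frame, factor=10, floor=10, q=0.95):
    """Cap heavy-tailed numeric columns at their q-th quantile.

    A column is capped only when its maximum exceeds both `factor` times
    its median and the absolute value `floor` (hypothetical parameter names).
    """
    out = frame.copy()
    for col in out.select_dtypes(include=[np.number]).columns:
        col_max, col_median = out[col].max(), out[col].median()
        if col_max > factor * col_median and col_max > floor:
            cap = out[col].quantile(q)
            out[col] = np.where(out[col] < cap, out[col], cap)
    return out

# Toy data (made up): one heavy-tailed column and one binary flag.
toy = pd.DataFrame({
    "heavy_tail": list(range(1, 100)) + [10_000],
    "flag": [0, 1] * 50,
})
clamped = clamp_skewed(toy)
print(clamped.describe().loc[["50%", "max"]])
```

Only `heavy_tail` should be affected here: its outlier is pulled down to the 95th percentile while the median is untouched, and the binary `flag` column fails the max-versus-median test and is left alone.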

In [None]:
df_numeric = df.select_dtypes(include=[np.number])
df_numeric.describe(include='all')

Unnamed: 0,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,dload,sloss,dloss,sinpkt,dinpkt,sjit,djit,swin,stcpb,dtcpb,dwin,tcprtt,synack,ackdat,smean,dmean,trans_depth,response_body_len,ct_srv_src,ct_state_ttl,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,label
count,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0,82332.0
mean,0.445016,11.84186,9.178424,1580.566135,2866.918367,71576.70281,180.967667,95.713003,46494180.0,310538.0,2.188068,2.542729,37.836042,33.982038,1920.889858,199.566224,133.45908,1074064000.0,1062670000.0,128.28662,0.055925,0.029256,0.026669,124.772822,100.240891,0.092091,9.643063,9.259887,1.369273,5.269591,4.466611,3.388901,7.160679,0.008284,0.008381,0.092066,5.974809,8.832532,0.011126,0.5506
std,0.672222,15.66461,14.504212,2948.850472,7525.606738,102631.946851,101.513358,116.667722,74177840.0,891869.1,3.057946,4.767511,57.658385,52.184248,2900.509949,520.285264,127.357,1368335000.0,1358850000.0,127.49137,0.116022,0.070854,0.055094,148.294212,184.094183,0.289156,35.977508,10.221752,1.067188,6.729755,6.685037,5.029129,10.481621,0.091171,0.092485,0.289121,6.867156,10.124902,0.104891,0.497436
min,0.0,1.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,8e-06,2.0,0.0,114.0,0.0,28.606114,62.0,0.0,11202.47,0.0,0.0,0.0,0.008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,57.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0
50%,0.014138,6.0,2.0,534.0,178.0,2650.176667,254.0,29.0,577003.2,2112.951,1.0,0.0,0.557929,0.01,17.623918,0.0,255.0,27888860.0,28569750.0,255.0,0.000551,0.000441,8e-05,65.0,44.0,0.0,0.0,5.0,1.0,2.0,1.0,1.0,3.0,0.0,0.0,0.0,3.0,5.0,0.0,1.0
75%,0.71936,12.0,10.0,1280.0,956.0,111111.1072,254.0,252.0,65142860.0,15858.08,3.0,2.0,63.409444,63.136369,3219.332412,128.459914,255.0,2171310000.0,2144205000.0,255.0,0.105541,0.052596,0.048816,100.0,87.0,0.0,0.0,11.0,2.0,6.0,4.0,3.0,6.0,0.0,0.0,0.0,7.0,11.0,0.0,1.0
max,2.403792,60.0,54.0,12472.0,30622.0,333333.3215,255.0,253.0,266666700.0,3741446.0,11.0,18.0,204.530258,167.626851,9532.382646,2218.933526,255.0,3876194000.0,3862459000.0,255.0,3.821465,3.226788,2.928778,638.0,683.0,1.0,150.45,37.0,6.0,25.0,25.0,18.0,37.0,2.0,2.0,1.0,25.0,36.0,1.0,1.0


## **Apply log function to nearly all numeric, since they are all mostly skewed to the right**

It would have been too much of a slog to apply the log function individually, therefore a simple rule has been set up: if the number of unique values in the continuous feature is more than 50 then apply the log function. The reason more than 50 unique values are sought is to filter out the integer based features that act more categorically.

In [None]:
df_numeric = df.select_dtypes(include=[np.number])
df_before = df_numeric.copy()
DEBUG = 0
for feature in df_numeric.columns:
    if DEBUG == 1:
        print(feature)
        print('nunique = '+str(df_numeric[feature].nunique()))
        print(df_numeric[feature].nunique()>50)
        print('----------------------------------------------------')
    if df_numeric[feature].nunique()>50:
        if df_numeric[feature].min()==0:
            df[feature] = np.log(df[feature]+1)
        else:
            df[feature] = np.log(df[feature])

df_numeric = df.select_dtypes(include=[np.number])
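As a quick check of the unique-value rule, the following sketch applies the same idea to a made-up frame. The helper name `log_transform_continuous`, the `min_unique` parameter, and the toy columns are assumptions for illustration; the sketch also assumes non-negative values, since the logarithm of a negative number is undefined.

```python
import numpy as np
import pandas as pd

def log_transform_continuous(frame, min_unique=50):
    """Log-transform numeric columns that have more than `min_unique` distinct values."""
    out = frame.copy()
    for col in out.select_dtypes(include=[np.number]).columns:
        if out[col].nunique() > min_unique:
            if out[col].min() == 0:
                # log1p(x) equals log(x + 1) and is defined at zero
                out[col] = np.log1p(out[col])
            else:
                out[col] = np.log(out[col])
    return out

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "bytes_with_zero": np.arange(0, 100) ** 2,  # 100 unique values, minimum 0
    "bytes_positive": np.arange(1, 101) * 10,   # 100 unique values, minimum 10
    "small_count": np.arange(100) % 3,          # only 3 unique values
})
transformed = log_transform_continuous(toy)
print(transformed.describe().loc[["min", "max"]])
```

The low-cardinality `small_count` column is skipped, which is the behaviour the 50-unique-values threshold is meant to give for integer features that act like categories.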

## **Reduce the labels in categorical features**

Some features have very high cardinalities, and this section reduces the cardinality to 5 or 6 per feature. The logic is to keep the 5 most frequently occurring labels in each feature and set the remainder to the '-' (seldom used) label. When the encoding is done later on, the dimensionality will then not explode and cause the curse of dimensionality.

In [None]:
df_cat = df.select_dtypes(exclude=[np.number])
df_cat.describe(include='all')

Unnamed: 0,proto,service,state
count,82332,82332,82332
unique,131,13,7
top,tcp,-,FIN
freq,43095,47153,39339


In [None]:
DEBUG = 0
for feature in df_cat.columns:
    if DEBUG == 1:
        print(feature)
        print('nunique = '+str(df_cat[feature].nunique()))
        print(df_cat[feature].nunique()>6)
        print(sum(df[feature].isin(df[feature].value_counts().head().index)))
        print('----------------------------------------------------')
    
    if df_cat[feature].nunique()>6:
        df[feature] = np.where(df[feature].isin(df[feature].value_counts().head().index), df[feature], '-')

In [None]:
df_cat = df.select_dtypes(exclude=[np.number])
df_cat.describe(include='all')

Unnamed: 0,proto,service,state
count,82332,82332,82332
unique,6,5,6
top,tcp,-,FIN
freq,43095,49275,39339


In [None]:
df['proto'].value_counts().head().index

Index(['tcp', 'udp', '-', 'unas', 'arp'], dtype='object')

In [None]:
df['proto'].value_counts().index

Index(['tcp', 'udp', '-', 'unas', 'arp', 'ospf'], dtype='object')
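The top-5 reduction can be reproduced on a small series to see why a feature can end up with six labels: the five most frequent ones plus the catch-all '-'. This is a minimal sketch; `keep_top_labels`, its parameters, and the label counts are hypothetical.

```python
import pandas as pd

def keep_top_labels(series, top_n=5, max_labels=6, other="-"):
    """Keep the `top_n` most frequent labels and map every other label to `other`.

    Series that already have at most `max_labels` distinct labels are returned unchanged.
    """
    if series.nunique() <= max_labels:
        return series
    top = series.value_counts().head(top_n).index
    return series.where(series.isin(top), other)

# Toy protocol column with seven distinct labels (counts are made up).
toy = pd.Series(
    ["tcp"] * 50 + ["udp"] * 30 + ["arp"] * 10
    + ["ospf"] * 5 + ["unas"] * 3 + ["igmp"] + ["gre"]
)
reduced = keep_top_labels(toy)
print(reduced.value_counts())
```

Here the two rare labels are folded into '-', so seven labels become six.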

## **View before and after of features**

This section simply displays the distributions within features before and after the transformations.

## **Best Features**

This section does an analysis (univariate statistical tests) to determine which features best predict the target feature.

In [None]:
# Feature Selection
from sklearn.feature_selection import SelectKBest, chi2

best_features = SelectKBest(score_func=chi2,k='all')

X = df.iloc[:,4:-2]
y = df.iloc[:,-1]
fit = best_features.fit(X,y)

df_scores=pd.DataFrame(fit.scores_)
df_col=pd.DataFrame(X.columns)

feature_score=pd.concat([df_col,df_scores],axis=1)
feature_score.columns=['feature','score']
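# Sort ascending so that the highest score is drawn at the top of the horizontal bar chart
feature_score.sort_values(by=['score'],ascending=True,inplace=True)
top_features = feature_score.tail(20)  # the 20 highest-scoring features

fig = go.Figure(go.Bar(
            x=top_features['score'],
            y=top_features['feature'],
            orientation='h'))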

fig.update_layout(title="Top 20 Features",
                  height=1200,
                  showlegend=False,
                 )

fig.show()
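To get a feel for what the chi-squared scores express, the sketch below ranks two synthetic features, one tied to the label and one that is pure noise. The data and column names are made up; note that `chi2` in scikit-learn expects non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 500
y_toy = rng.integers(0, 2, size=n)

# Synthetic features (illustrative only): one depends on the label, one does not.
X_toy = pd.DataFrame({
    "informative": y_toy * 5 + rng.integers(0, 2, size=n),
    "noise": rng.integers(0, 6, size=n),
})

selector = SelectKBest(score_func=chi2, k="all").fit(X_toy, y_toy)
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(scores)
```

The label-dependent feature should receive a far higher score than the noise feature, which is the ordering the bar chart is meant to visualise.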

## **Encode categorical features**

The categorical features must be encoded to ensure that the models can interpret them. One-hot encoding is used since none of the categorical features are ordinal.

In [None]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [None]:
X.head()
feature_names = list(X.columns)
np.shape(X)

(82332, 42)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,2,3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [None]:
np.shape(X)

(82332, 56)

In [None]:
df_cat.describe(include='all')

Unnamed: 0,proto,service,state
count,82332,82332,82332
unique,6,5,6
top,tcp,-,FIN
freq,43095,49275,39339


In [None]:
X[0]

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
       0.00000000e+00, 1.09999395e-05, 6.93147181e-01, 0.00000000e+00,
       6.20657593e+00, 0.00000000e+00, 1.14176263e+01, 2.54000000e+02,
       0.00000000e+00, 1.90104856e+01, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.09399400e-02, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       5.51342875e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       2.00000000e+00, 2.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 2.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.00000000e+00, 2.00000000e+00, 0.00000000e+00])

In [None]:
len(feature_names)

42

In [None]:
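# Prepend the category labels so that feature_names has one entry per encoded column (42 -> 56).
# NOTE: OneHotEncoder orders categories alphabetically, not by frequency, and the original
# 'proto', 'service' and 'state' entries stay in the list, so these names may not line up
# one-to-one with the encoded columns.
for label in list(df_cat['state'].value_counts().index)[::-1][1:]:
    feature_names.insert(0,label)
    
for label in list(df_cat['service'].value_counts().index)[::-1][1:]:
    feature_names.insert(0,label)
    
for label in list(df_cat['proto'].value_counts().index)[::-1][1:]:
    feature_names.insert(0,label)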

In [None]:
len(feature_names)

56

# **Modelling & Evaluation**

## **Prep for Modelling**

### **Split test and training**

In this section the data is split into test and training sets using stratified sampling.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    random_state = 0,
                                                    stratify=y)
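The effect of `stratify` can be checked by comparing class proportions before and after the split. The following sketch uses a made-up label vector with 30% positives; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 1000 samples, 30% positive class.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([1] * 300 + [0] * 700)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratification the positive rate is preserved in both partitions.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```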

### **Standardize continuous features**

A standard scaler is applied to the continuous features to put them all on a comparable scale.

In [None]:
df_cat.describe(include='all')

Unnamed: 0,proto,service,state
count,82332,82332,82332
unique,6,5,6
top,tcp,-,FIN
freq,43095,49275,39339


In [None]:
# 6 + 5 + 6 unique = 17, therefore the first 17 columns (indices 0-16) are the one-hot encoded categories; start scaling from column index 17.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 17:] = sc.fit_transform(X_train[:, 17:])
X_test[:, 17:] = sc.transform(X_test[:, 17:])
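The slicing logic is easy to get off by one, so it can help to verify it on a tiny array: with n one-hot columns occupying indices 0 to n-1, the continuous block starts at index n. This is a sketch under that assumption; `n_onehot` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

n_onehot = 2  # number of leading one-hot columns in this toy example

# Toy arrays (made up): two one-hot columns followed by two continuous columns.
X_tr = np.array([[1, 0, 10.0, 100.0],
                 [0, 1, 20.0, 300.0],
                 [1, 0, 30.0, 500.0]])
X_te = np.array([[0, 1, 20.0, 300.0]])
onehot_before = X_tr[:, :n_onehot].copy()

scaler = StandardScaler()
# Fit on the training block only, then reuse the same statistics for the test block.
X_tr[:, n_onehot:] = scaler.fit_transform(X_tr[:, n_onehot:])
X_te[:, n_onehot:] = scaler.transform(X_te[:, n_onehot:])

print(X_tr)
print(X_te)
```

The one-hot columns stay at 0 and 1, and every continuous column of the training block ends up with zero mean.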

### **Import Metrics**

Imports the libraries that will be used to evaluate the models later on

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay # plots the confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
import time
model_performance = pd.DataFrame(columns=['Accuracy','Recall','Precision','F1-Score','time to train','time to predict','total time'])

## **Modelling**

### **Configuring to use Keras**

Keras is a deep learning API for Python, built on top of TensorFlow, that provides a convenient way to define and train any kind of deep learning model. Keras was initially developed for research, with the aim of enabling fast deep learning experimentation.

Through TensorFlow, Keras can run on top of different types of hardware (GPU, TPU, or plain CPU) and can be seamlessly scaled to thousands of machines.

In [None]:
#Import libraries that will allow you to use keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, GRU
from keras import metrics
!pip install keras-metrics #It doesn't come with Google Colab
import keras_metrics as km #when compiling
import keras
import numpy as np
from numpy import array

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting keras-metrics
  Downloading keras_metrics-1.1.0-py2.py3-none-any.whl (5.6 kB)
Installing collected packages: keras-metrics
Successfully installed keras-metrics-1.1.0


In [None]:
from keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

### **LSTM (Keras)**

In [None]:
def build_model():
    model = Sequential()
    model.add(LSTM(20, return_sequences=True,input_shape=(1,56)))
    model.add(LSTM(20, return_sequences=True))
    model.add(Dense(10, activation='softmax')) #for multiclass classification
    #Compile the model
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                  # metrics=['accuracy',f1_m,precision_m, recall_m]
                  metrics=['accuracy']
                 )
    return model

#The LSTM input layer must be 3D.
#The three input dimensions are: samples, time steps, and features.
#reshape training input
X_train_array = array(X_train) #array was imported in the Keras configuration cell
print(len(X_train_array))
X_train_reshaped = X_train_array.reshape(X_train_array.shape[0],1,56)

#reshape test input
X_test_array=  array(X_test)
X_test_reshaped = X_test_array.reshape(X_test_array.shape[0],1,56) 


#instantiate the model
model = build_model()


#fit the model
start = time.time()
model.fit(X_train_reshaped, y_train, epochs=200, batch_size=2000,verbose=2)
end_train = time.time()
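The reshape step treats every flow as a sequence of length one. A minimal NumPy sketch of that (samples, time steps, features) layout, using a made-up array with four rows:

```python
import numpy as np

n_samples, n_features = 4, 56

# Toy 2D feature matrix (made up): one row per flow.
X_2d = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)

# Insert a time-step axis of length 1: (samples, features) -> (samples, 1, features).
X_3d = X_2d.reshape(X_2d.shape[0], 1, X_2d.shape[1])

print(X_2d.shape, "->", X_3d.shape)
```

The values are unchanged; only an axis of length one is added, so `X_3d[:, 0, :]` is identical to the original matrix.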

In [None]:
#Evaluate the neural network
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
# loss, accuracy, f1s, precision, recall = model.evaluate(X_test_reshaped, y_test)
end_predict = time.time()
# NOTE: only accuracy is measured here; the Recall, Precision and F1-Score columns are filled with the accuracy value as placeholders.
model_performance.loc['LSTM (Keras)'] = [accuracy, accuracy, accuracy, accuracy,end_train-start,end_predict-end_train,end_predict-start]
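Separate recall, precision and F1 values could be derived from the predicted class probabilities with the scikit-learn metrics. The sketch below is one possible approach, not what the notebook does: `probs` is a hypothetical stand-in for the output of `model.predict(...)` with shape (samples, 1, classes), and the labels are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical predicted probabilities, shape (samples, time steps, classes).
probs = np.array([[[0.9, 0.1]],
                  [[0.8, 0.2]],
                  [[0.3, 0.7]],
                  [[0.2, 0.8]],
                  [[0.6, 0.4]]])
y_true = np.array([0, 0, 0, 1, 1])  # made-up ground truth

# Pick the most probable class per sample and drop the time-step axis.
y_hat = probs.argmax(axis=-1).ravel()

scores = {
    "Accuracy": accuracy_score(y_true, y_hat),
    "Recall": recall_score(y_true, y_hat),
    "Precision": precision_score(y_true, y_hat),
    "F1-Score": f1_score(y_true, y_hat),
}
print(scores)
```

In this toy case accuracy is 0.6 while recall, precision and F1 are each 0.5, which shows that the four columns need not coincide.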



### **GRU (Keras)**

In [None]:
#Build the neural network model
def build_model():
    model = Sequential()
    model.add(GRU(20, return_sequences=True,input_shape=(1,56)))
    model.add(GRU(20, return_sequences=True))
    model.add(Dense(10, activation='softmax')) #for multiclass classification
    #Compile the model
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                  # metrics=['accuracy',f1_m,precision_m, recall_m]
                  metrics=['accuracy']
                 )
    return model

#The GRU input layer must be 3D.
#The three input dimensions are: samples, time steps, and features.
#reshape training input
X_train_array = array(X_train) #array was imported in the Keras configuration cell
print(len(X_train_array))
X_train_reshaped = X_train_array.reshape(X_train_array.shape[0],1,56)

#reshape test input
X_test_array=  array(X_test)
X_test_reshaped = X_test_array.reshape(X_test_array.shape[0],1,56) 


#instantiate the model
model = build_model()

start = time.time()
#fit the model
model.fit(X_train_reshaped, y_train, epochs=200, batch_size=2000,verbose=2)
end_train = time.time()

In [None]:
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
# loss, accuracy, f1s, precision, recall = model.evaluate(X_test_reshaped, y_test)
end_predict = time.time()
# NOTE: only accuracy is measured here; the Recall, Precision and F1-Score columns are filled with the accuracy value as placeholders.
model_performance.loc['GRU (Keras)'] = [accuracy, accuracy, accuracy, accuracy, end_train-start,end_predict-end_train,end_predict-start]



In [None]:
np.shape(X)

(82332, 56)

## **Evaluate**

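The models are compared in this section to determine which gives the best performance. The LSTM comes out marginally ahead, with a test accuracy of 96.51% against 96.47% for the GRU and a total time of 81.6 s against 82.7 s; differences this small may not be significant. Note that the Recall, Precision and F1-Score columns repeat the accuracy value, because only accuracy was measured during evaluation.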

In [None]:
model_performance.fillna(.90,inplace=True)
model_performance.style.background_gradient(cmap='coolwarm').format({'Accuracy': '{:.2%}',
                                                                     'Precision': '{:.2%}',
                                                                     'Recall': '{:.2%}',
                                                                     'F1-Score': '{:.2%}',
                                                                     'time to train':'{:.1f}',
                                                                     'time to predict':'{:.1f}',
                                                                     'total time':'{:.1f}',
                                                                     })

Unnamed: 0,Accuracy,Recall,Precision,F1-Score,time to train,time to predict,total time
LSTM (Keras),96.51%,96.51%,96.51%,96.51%,72.8,8.8,81.6
GRU (Keras),96.47%,96.47%,96.47%,96.47%,75.1,7.6,82.7
