# Technical Exploration 4
Build the architecture proposed in [Network Intrusion Detection System using Deep Learning](https://www.sciencedirect.com/science/article/pii/S1877050921011078).

In [14]:
# Setup
%matplotlib inline

import numpy as np
import pandas as pd
import tensorflow as tf
from keras import datasets, layers, models
import matplotlib.pyplot as plt

## Import and explore data
We'll use the [UNSW-NB15 dataset](https://research.unsw.edu.au/projects/unsw-nb15-dataset). These papers describe its creation:
- Moustafa, Nour, and Jill Slay. "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)." Military Communications and Information Systems Conference (MilCIS), 2015. IEEE, 2015.
- Moustafa, Nour, and Jill Slay. "The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set." Information Security Journal: A Global Perspective (2016): 1-14.

In [27]:
train: pd.DataFrame = pd.read_csv("./UNSW-NB15/UNSW_NB15_training-set.csv")
test: pd.DataFrame = pd.read_csv("./UNSW-NB15/UNSW_NB15_testing-set.csv")

In [8]:
train.info()
train.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175341 entries, 0 to 175340
Data columns (total 45 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 175341 non-null  int64  
 1   dur                175341 non-null  float64
 2   proto              175341 non-null  object 
 3   service            175341 non-null  object 
 4   state              175341 non-null  object 
 5   spkts              175341 non-null  int64  
 6   dpkts              175341 non-null  int64  
 7   sbytes             175341 non-null  int64  
 8   dbytes             175341 non-null  int64  
 9   rate               175341 non-null  float64
 10  sttl               175341 non-null  int64  
 11  dttl               175341 non-null  int64  
 12  sload              175341 non-null  float64
 13  dload              175341 non-null  float64
 14  sloss              175341 non-null  int64  
 15  dloss              175341 non-null  int64  
 16  si

| | id | dur | spkts | dpkts | sbytes | dbytes | rate | sttl | dttl | sload | ... | ct_src_dport_ltm | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | ... | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 | 175341.0 |
| mean | 87671.0 | 1.359389 | 20.298664 | 18.969591 | 8844.844 | 14928.92 | 95406.19 | 179.546997 | 79.609567 | 73454030.0 | ... | 5.383538 | 4.206255 | 8.729881 | 0.014948 | 0.014948 | 0.133066 | 6.955789 | 9.100758 | 0.015752 | 0.680622 |
| std | 50616.731112 | 6.480249 | 136.887597 | 110.258271 | 174765.6 | 143654.2 | 165401.0 | 102.940011 | 110.506863 | 188357400.0 | ... | 8.047104 | 5.783585 | 10.956186 | 0.126048 | 0.126048 | 0.701208 | 8.321493 | 10.756952 | 0.124516 | 0.466237 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 43836.0 | 8e-06 | 2.0 | 0.0 | 114.0 | 0.0 | 32.78614 | 62.0 | 0.0 | 13053.34 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| 50% | 87671.0 | 0.001582 | 2.0 | 2.0 | 430.0 | 164.0 | 3225.807 | 254.0 | 29.0 | 879674.8 | ... | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | 0.0 | 1.0 |
| 75% | 131506.0 | 0.668069 | 12.0 | 10.0 | 1418.0 | 1102.0 | 125000.0 | 254.0 | 252.0 | 88888890.0 | ... | 5.0 | 3.0 | 12.0 | 0.0 | 0.0 | 0.0 | 9.0 | 12.0 | 0.0 | 1.0 |
| max | 175341.0 | 59.999989 | 9616.0 | 10974.0 | 12965230.0 | 14655550.0 | 1000000.0 | 255.0 | 254.0 | 5988000000.0 | ... | 51.0 | 46.0 | 65.0 | 4.0 | 4.0 | 30.0 | 60.0 | 62.0 | 1.0 | 1.0 |


We'll use `attack_cat` as our target class.

In [28]:
print(train["attack_cat"].value_counts(normalize=True))
categories = list(train["attack_cat"].unique())

attack_cat
Normal            0.319378
Generic           0.228127
Exploits          0.190446
Fuzzers           0.103706
DoS               0.069944
Reconnaissance    0.059832
Analysis          0.011406
Backdoor          0.009958
Shellcode         0.006462
Worms             0.000741
Name: proportion, dtype: float64


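Nine of these categories are attacks; the tenth, `Normal`, represents benign traffic.
Normal traffic makes up only about 32% of the training set, followed by `Generic` attacks at about 23%. The classes are heavily imbalanced: `Worms` accounts for less than 0.1% of the rows.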
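One common way to compensate for skewed class proportions like these is to weight each class inversely to its frequency during training. The sketch below is an assumption on my part, not something the paper prescribes; `inverse_frequency_weights` is a hypothetical helper, and the toy labels stand in for `train["attack_cat"]`.

```python
import pandas as pd


def inverse_frequency_weights(labels: pd.Series) -> dict:
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = labels.value_counts()
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels for illustration only; in the notebook, pass train["attack_cat"].
toy_labels = pd.Series(["Normal"] * 6 + ["Generic"] * 3 + ["Worms"] * 1)
class_weights = inverse_frequency_weights(toy_labels)
print(class_weights)
```

Rare classes such as `Worms` receive the largest weights. Note that the `class_weight` argument of Keras `model.fit` expects integer class indices as keys, so these names would need to be mapped to indices first.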

In [31]:
X_train = train.drop(columns=["attack_cat"])
y_train = train["attack_cat"].copy()

X_test = test.drop(columns=["attack_cat"])
y_test = test["attack_cat"].copy()
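The split above still leaves a few things to handle before the features can feed a network. The `label` column appears to be a binary attack flag (its mean of 0.68 in the summary matches the share of non-`Normal` rows), so keeping it would likely leak the target; `id` is just a row counter; and `proto`, `service`, and `state` are strings. The sketch below shows one possible way to deal with all three. `prepare_features` and `encode_labels` are hypothetical helpers, the toy frames are made up for illustration, and one-hot encoding is my assumption rather than a step confirmed by the paper.

```python
import pandas as pd


def prepare_features(train_df, test_df, target="attack_cat", drop=("id", "label")):
    """Drop the target and leaky/uninformative columns, then one-hot encode."""
    cols_to_drop = [target, *[c for c in drop if c in train_df.columns]]
    X_tr = train_df.drop(columns=cols_to_drop)
    X_te = test_df.drop(columns=cols_to_drop)
    # Encode both splits together so they end up with identical columns,
    # even when a category appears in only one of them.
    combined = pd.concat([X_tr, X_te], keys=["train", "test"])
    encoded = pd.get_dummies(combined, dtype="float32")
    return encoded.loc["train"], encoded.loc["test"]


def encode_labels(y_tr, y_te):
    """Map class names to integer indices using the sorted training classes."""
    classes = sorted(y_tr.unique())
    to_index = {name: i for i, name in enumerate(classes)}
    return y_tr.map(to_index), y_te.map(to_index), classes


# Toy frames for illustration only; in the notebook, pass train and test.
toy_train = pd.DataFrame({
    "id": [1, 2, 3],
    "dur": [0.1, 0.2, 0.3],
    "proto": ["tcp", "udp", "tcp"],
    "attack_cat": ["Normal", "DoS", "Normal"],
    "label": [0, 1, 0],
})
toy_test = pd.DataFrame({
    "id": [1, 2],
    "dur": [0.5, 0.6],
    "proto": ["tcp", "arp"],
    "attack_cat": ["DoS", "Normal"],
    "label": [1, 0],
})

X_tr, X_te = prepare_features(toy_train, toy_test)
y_tr_idx, y_te_idx, classes = encode_labels(toy_train["attack_cat"], toy_test["attack_cat"])
print(list(X_tr.columns), classes)
```

Encoding the two splits together keeps the column layout consistent. Numeric columns are left unscaled here; given the wide ranges in the summary above (for example `sload`), some form of scaling would probably be needed as well.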