<h2> Problem Description: </h2>

We set out to build a network intrusion detector: a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" or normal connections. We start with two recent datasets that have some features in common and that each include a wide variety of intrusions.

<h3> What is an INTRUSION DETECTOR? </h3>

An intrusion detector is software used to detect network intrusions. It protects a computer network from unauthorized users, possibly including insiders.

<h3> The starting Datasets </h3>

The two datasets are UNSW-NB15 and CICDDoS2019.

CICDDoS2019 contains benign traffic and the most up-to-date common DDoS attacks, and it resembles true real-world data (PCAPs). It also includes the results of network traffic analysis using CICFlowMeter-V3, with flows labeled based on the timestamp, source and destination IPs, source and destination ports, protocols, and attack (CSV files).

UNSW-NB15 is a network intrusion dataset. It contains nine different attack types, including DoS, Worms, Backdoors, and Fuzzers. The dataset contains raw network packets. The training set has 175,341 records and the testing set has 82,332 records, covering both attack and normal traffic.

# References:

- For the UNSW-NB15 dataset: https://research.unsw.edu.au/projects/unsw-nb15-dataset

- For the CICDDoS2019 dataset: https://arxiv.org/pdf/1811.05372

In [1]:
import warnings
warnings.filterwarnings("ignore")
import shutil
import os
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pickle
from sklearn.manifold import TSNE
from sklearn import preprocessing
from multiprocessing import Process  # used for multiprocessing (separate processes, not threads)
import multiprocessing
import random as r
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import MiniBatchKMeans

In [3]:
# Collect the paths of all CSV files under the CICDDoS2019 training directory

import os
dspaths = []
for dirname, _, filenames in os.walk('./dataset/cicds/training'):
    for filename in filenames:
        if filename.endswith('.csv'):
            pds = os.path.join(dirname, filename)
            dspaths.append(pds)
            print(pds)


./dataset/cicds/training/UDPLag-training.csv
./dataset/cicds/training/LDAP-training.csv
./dataset/cicds/training/UDP-training.csv
./dataset/cicds/training/Portmap-training.csv
./dataset/cicds/training/Syn-training.csv
./dataset/cicds/training/MSSQL-training.csv
./dataset/cicds/training/NetBIOS-training.csv


In [5]:
from fastai.tabular.all import df_shrink
from fastcore.parallel import *

In [6]:
col_name_consistency = {
'Flow ID': 'Flow ID',
'Source IP': 'Source IP',
'Src IP':  'Source IP',
'Source Port': 'Source Port',
'Src Port': 'Source Port',
'Destination IP': 'Destination IP',
'Dst IP': 'Destination IP',
'Destination Port': 'Destination Port',
'Dst Port': 'Destination Port',
'Protocol': 'Protocol',
'Timestamp': 'Timestamp',
'Flow Duration': 'Flow Duration',
'Total Fwd Packets': 'Total Fwd Packets',
'Tot Fwd Pkts': 'Total Fwd Packets',
'Total Backward Packets': 'Total Backward Packets',
'Tot Bwd Pkts': 'Total Backward Packets',
'Total Length of Fwd Packets': 'Fwd Packets Length Total',
'TotLen Fwd Pkts': 'Fwd Packets Length Total',
'Total Length of Bwd Packets': 'Bwd Packets Length Total',
'TotLen Bwd Pkts': 'Bwd Packets Length Total',
'Fwd Packet Length Max': 'Fwd Packet Length Max',
'Fwd Pkt Len Max': 'Fwd Packet Length Max',
'Fwd Packet Length Min': 'Fwd Packet Length Min',
'Fwd Pkt Len Min': 'Fwd Packet Length Min',
'Fwd Packet Length Mean': 'Fwd Packet Length Mean',
'Fwd Pkt Len Mean': 'Fwd Packet Length Mean',
'Fwd Packet Length Std': 'Fwd Packet Length Std',
'Fwd Pkt Len Std': 'Fwd Packet Length Std',
'Bwd Packet Length Max': 'Bwd Packet Length Max',
'Bwd Pkt Len Max': 'Bwd Packet Length Max',
'Bwd Packet Length Min': 'Bwd Packet Length Min',
'Bwd Pkt Len Min': 'Bwd Packet Length Min',
'Bwd Packet Length Mean': 'Bwd Packet Length Mean',
'Bwd Pkt Len Mean': 'Bwd Packet Length Mean',
'Bwd Packet Length Std': 'Bwd Packet Length Std',
'Bwd Pkt Len Std': 'Bwd Packet Length Std',
'Flow Bytes/s': 'Flow Bytes/s',
'Flow Byts/s': 'Flow Bytes/s',
'Flow Packets/s': 'Flow Packets/s',
'Flow Pkts/s': 'Flow Packets/s',
'Flow IAT Mean': 'Flow IAT Mean',
'Flow IAT Std': 'Flow IAT Std',
'Flow IAT Max': 'Flow IAT Max',
'Flow IAT Min': 'Flow IAT Min',
'Fwd IAT Total': 'Fwd IAT Total',
'Fwd IAT Tot': 'Fwd IAT Total',
'Fwd IAT Mean': 'Fwd IAT Mean',
'Fwd IAT Std': 'Fwd IAT Std',
'Fwd IAT Max': 'Fwd IAT Max',
'Fwd IAT Min': 'Fwd IAT Min',
'Bwd IAT Total': 'Bwd IAT Total',
'Bwd IAT Tot': 'Bwd IAT Total',
'Bwd IAT Mean': 'Bwd IAT Mean',
'Bwd IAT Std': 'Bwd IAT Std',
'Bwd IAT Max': 'Bwd IAT Max',
'Bwd IAT Min': 'Bwd IAT Min',
'Fwd PSH Flags': 'Fwd PSH Flags',
'Bwd PSH Flags': 'Bwd PSH Flags',
'Fwd URG Flags': 'Fwd URG Flags',
'Bwd URG Flags': 'Bwd URG Flags',
'Fwd Header Length': 'Fwd Header Length',
'Fwd Header Len': 'Fwd Header Length',
'Bwd Header Length': 'Bwd Header Length',
'Bwd Header Len': 'Bwd Header Length',
'Fwd Packets/s': 'Fwd Packets/s',
'Fwd Pkts/s': 'Fwd Packets/s',
'Bwd Packets/s': 'Bwd Packets/s',
'Bwd Pkts/s': 'Bwd Packets/s',
'Min Packet Length': 'Packet Length Min',
'Pkt Len Min': 'Packet Length Min',
'Max Packet Length': 'Packet Length Max',
'Pkt Len Max': 'Packet Length Max',
'Packet Length Mean': 'Packet Length Mean',
'Pkt Len Mean': 'Packet Length Mean',
'Packet Length Std': 'Packet Length Std',
'Pkt Len Std': 'Packet Length Std',
'Packet Length Variance': 'Packet Length Variance',
'Pkt Len Var': 'Packet Length Variance',
'FIN Flag Count': 'FIN Flag Count',
'FIN Flag Cnt': 'FIN Flag Count',
'SYN Flag Count': 'SYN Flag Count',
'SYN Flag Cnt': 'SYN Flag Count',
'RST Flag Count': 'RST Flag Count',
'RST Flag Cnt': 'RST Flag Count',
'PSH Flag Count': 'PSH Flag Count',
'PSH Flag Cnt': 'PSH Flag Count',
'ACK Flag Count': 'ACK Flag Count',
'ACK Flag Cnt': 'ACK Flag Count',
'URG Flag Count': 'URG Flag Count',
'URG Flag Cnt': 'URG Flag Count',
'CWE Flag Count': 'CWE Flag Count',
'CWE Flag Cnt': 'CWE Flag Count',
'ECE Flag Count': 'ECE Flag Count',
'ECE Flag Cnt': 'ECE Flag Count',
'Down/Up Ratio': 'Down/Up Ratio',
'Average Packet Size': 'Avg Packet Size',
'Pkt Size Avg': 'Avg Packet Size',
'Avg Fwd Segment Size': 'Avg Fwd Segment Size',
'Fwd Seg Size Avg': 'Avg Fwd Segment Size',
'Avg Bwd Segment Size': 'Avg Bwd Segment Size',
'Bwd Seg Size Avg': 'Avg Bwd Segment Size',
'Fwd Avg Bytes/Bulk': 'Fwd Avg Bytes/Bulk',
'Fwd Byts/b Avg': 'Fwd Avg Bytes/Bulk',
'Fwd Avg Packets/Bulk': 'Fwd Avg Packets/Bulk',
'Fwd Pkts/b Avg': 'Fwd Avg Packets/Bulk',
'Fwd Avg Bulk Rate': 'Fwd Avg Bulk Rate',
'Fwd Blk Rate Avg': 'Fwd Avg Bulk Rate',
'Bwd Avg Bytes/Bulk': 'Bwd Avg Bytes/Bulk',
'Bwd Byts/b Avg': 'Bwd Avg Bytes/Bulk',
'Bwd Avg Packets/Bulk': 'Bwd Avg Packets/Bulk',
'Bwd Pkts/b Avg': 'Bwd Avg Packets/Bulk',
'Bwd Avg Bulk Rate': 'Bwd Avg Bulk Rate',
'Bwd Blk Rate Avg': 'Bwd Avg Bulk Rate',
'Subflow Fwd Packets': 'Subflow Fwd Packets',
'Subflow Fwd Pkts': 'Subflow Fwd Packets',
'Subflow Fwd Bytes': 'Subflow Fwd Bytes',
'Subflow Fwd Byts': 'Subflow Fwd Bytes',
'Subflow Bwd Packets': 'Subflow Bwd Packets',
'Subflow Bwd Pkts': 'Subflow Bwd Packets',
'Subflow Bwd Bytes': 'Subflow Bwd Bytes',
'Subflow Bwd Byts': 'Subflow Bwd Bytes',
'Init_Win_bytes_forward': 'Init Fwd Win Bytes',
'Init Fwd Win Byts': 'Init Fwd Win Bytes',
'Init_Win_bytes_backward': 'Init Bwd Win Bytes',
'Init Bwd Win Byts': 'Init Bwd Win Bytes',
'act_data_pkt_fwd': 'Fwd Act Data Packets',
'Fwd Act Data Pkts': 'Fwd Act Data Packets',
'min_seg_size_forward': 'Fwd Seg Size Min',
'Fwd Seg Size Min': 'Fwd Seg Size Min',
'Active Mean': 'Active Mean',
'Active Std': 'Active Std',
'Active Max': 'Active Max',
'Active Min': 'Active Min',
'Idle Mean': 'Idle Mean',
'Idle Std': 'Idle Std',
'Idle Max': 'Idle Max',
'Idle Min': 'Idle Min',
'Label': 'Label'
}

In [7]:
drop_columns = [ # this list includes all spellings across CIC NIDS datasets
    "Flow ID",    
    'Fwd Header Length.1',
    "Source IP", "Src IP",
    "Source Port", "Src Port",
    "Destination IP", "Dst IP",
    "Destination Port", "Dst Port",
    "Timestamp", 
    # NOTE: no comma follows "SimillarHTTP" below, so Python joins it with "Active Max" on the
    # next line into the single name "SimillarHTTPActive Max"; neither column is dropped here.
    "Unnamed: 0", "Inbound", "SimillarHTTP" # CIC-DDoS other undocumented columns
    "Active Max", "Subflow Bwd Packets", # The following features are being dropped based on the findings from 
    "Bwd IAT Max" , "Active Min",  "Idle Min", # https://dalspace.library.dal.ca/bitstream/handle/10222/78536/Junhong-Li-MCS-FCS-April-2020.pdf?sequence=7
    "Fwd Bulk Rate Avg",  "Bwd PSH Flags",  
    "Packet Length Min",  "Bwd URG Flags",  
    "Idle Max",  "Fwd IAT Total",  "Bwd Bytes/Bulk Avg", 
    "Bwd Segment Size Avg",  "Fwd Packet/Bulk Avg",  "Fwd URG Flags", 
    "URG Flag Count",  "Active Mean", "ECE Flag Count",  
    "Fwd Bytes/Bulk Avg",  "ACK Flag Count",  "Fwd Packets/s",  
    "Bwd Packet Length Std", "Subflow Bwd Bytes",  
    "Fwd IAT Std",  "Bwd IAT Std",  "Fwd IAT Mean",  "Active Std",  
    "Packet Length Mean", "Fwd IAT Max",  "Fwd Segment Size Avg"
]
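A long hand-written list like this is easy to get subtly wrong, for example when two adjacent string literals are missing the comma between them and Python silently joins them. The sketch below shows one way to check a drop list against the actual column names before dropping. The helper `unmatched_drop_names` and the miniature frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def unmatched_drop_names(drop_list, columns):
    """Return the names in drop_list that match no column (after stripping whitespace)."""
    present = {str(c).strip() for c in columns}
    return [name for name in drop_list if name not in present]

# Hypothetical miniature: the comma after "SimillarHTTP" is missing, so Python
# concatenates the two adjacent string literals into a single name.
demo_drop = ["Flow ID", "SimillarHTTP" "Active Max", "Timestamp"]
demo_df = pd.DataFrame(
    columns=["Flow ID", " Timestamp", "SimillarHTTP", "Active Max", "Label"]
)

missing = unmatched_drop_names(demo_drop, demo_df.columns)
print(missing)
```

Because `drop(..., errors='ignore')` skips unknown names without complaint, printing the unmatched names is a cheap way to notice entries that never take effect.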

## Just reading the data
It's in standard CSV format

In [8]:
individual_dfs = [pd.read_csv(dsp, sep=',', encoding='utf-8') for dsp in dspaths]
[i.shape for i in individual_dfs]

[(40680, 88),
 (51240, 88),
 (31340, 88),
 (47340, 88),
 (357900, 88),
 (27940, 88),
 (13210, 88)]

## Drops and Renames
Some columns are metadata (Flow ID, IP addresses, timestamps, etc.) and are not intended for use as features.

If you are unsure why the source and destination ports are also removed, you can read [Establishing the Contaminating Effect of Metadata Feature Inclusion in Machine-Learned Network Intrusion Detection Models](https://link.springer.com/chapter/10.1007/978-3-031-09484-2_2). In short: any included metadata feature will act as a (very) powerful shortcut predictor.

In [9]:
for df in individual_dfs:
    df.columns = df.columns.str.strip() # sometimes there's leading / trailing whitespace
    df.drop(columns=drop_columns, inplace=True, errors='ignore')    
    df.rename(columns=col_name_consistency, inplace=True)
    df['Label'] = df['Label'].replace({'BENIGN': 'Benign'})  # assign back instead of a chained inplace call, which newer pandas versions warn about
[i.shape for i in individual_dfs]

[(40680, 57),
 (51240, 57),
 (31340, 57),
 (47340, 57),
 (357900, 57),
 (27940, 57),
 (13210, 57)]

In [10]:
individual_dfs[0].dtypes

Protocol                      int64
Flow Duration                 int64
Total Fwd Packets             int64
Total Backward Packets        int64
Fwd Packets Length Total    float64
Bwd Packets Length Total    float64
Fwd Packet Length Max       float64
Fwd Packet Length Min       float64
Fwd Packet Length Mean      float64
Fwd Packet Length Std       float64
Bwd Packet Length Max       float64
Bwd Packet Length Min       float64
Bwd Packet Length Mean      float64
Flow Bytes/s                float64
Flow Packets/s              float64
Flow IAT Mean               float64
Flow IAT Std                float64
Flow IAT Max                float64
Flow IAT Min                float64
Fwd IAT Min                 float64
Bwd IAT Total               float64
Bwd IAT Mean                float64
Bwd IAT Min                 float64
Fwd PSH Flags                 int64
Fwd Header Length             int64
Bwd Header Length             int64
Bwd Packets/s               float64
Packet Length Min           

## Downsizing
The individual frames are optimized by downsizing their types to more appropriate forms than the default float64, int64 or object (str) types. 

In [11]:
individual_dfs = parallel(f=df_shrink, items=individual_dfs, progress=True)

In [12]:
individual_dfs[0].dtypes

Protocol                        int8
Flow Duration                  int32
Total Fwd Packets              int16
Total Backward Packets         int16
Fwd Packets Length Total     float32
Bwd Packets Length Total     float32
Fwd Packet Length Max        float32
Fwd Packet Length Min        float32
Fwd Packet Length Mean       float32
Fwd Packet Length Std        float32
Bwd Packet Length Max        float32
Bwd Packet Length Min        float32
Bwd Packet Length Mean       float32
Flow Bytes/s                 float64
Flow Packets/s               float64
Flow IAT Mean                float32
Flow IAT Std                 float32
Flow IAT Max                 float32
Flow IAT Min                 float32
Fwd IAT Min                  float32
Bwd IAT Total                float32
Bwd IAT Mean                 float32
Bwd IAT Min                  float32
Fwd PSH Flags                   int8
Fwd Header Length              int64
Bwd Header Length              int16
Bwd Packets/s                float32
P
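If fastai is not available, a similar downsizing can be sketched with plain pandas. The helper `shrink_numeric` below is a hypothetical name, not part of fastai. It relies on `pd.to_numeric(..., downcast=...)` and only touches numeric columns, so it approximates `df_shrink` rather than replacing it.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def shrink_numeric(df):
    """Return a copy of df with integer and float columns downcast where values allow."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Small made-up frame using column names from the dataset
demo = pd.DataFrame({
    "Protocol": [6, 17, 6],
    "Fwd Packet Length Mean": [401.0, 6.0, 6.5],
    "Label": ["UDP", "Syn", "Syn"],
})
small = shrink_numeric(demo)
print(small.dtypes)
```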

## Removing NaN


In [13]:
for df in individual_dfs:
    df.replace([np.inf, -np.inf], np.nan, inplace=True)    
    # print(df.isna().any(axis=1).sum(), "rows with at least one NaN to remove")
    df.dropna(inplace=True)    
[i.shape for i in individual_dfs]

[(38076, 57),
 (49985, 57),
 (30699, 57),
 (45069, 57),
 (336214, 57),
 (27080, 57),
 (12717, 57)]

## Dropping duplicates
Duplicates should be removed because they can bias training and lead to over-optimistic estimates of classification performance during testing.

In [14]:
for df in individual_dfs:
    print(df.duplicated().sum(), "fully duplicate rows to remove")
    df.drop_duplicates(inplace=True)
    df.reset_index(inplace=True, drop=True)
[i.shape for i in individual_dfs]

24945 fully duplicate rows to remove
42834 fully duplicate rows to remove
12641 fully duplicate rows to remove
39762 fully duplicate rows to remove
264912 fully duplicate rows to remove
15748 fully duplicate rows to remove
10926 fully duplicate rows to remove


[(13131, 57),
 (7151, 57),
 (18058, 57),
 (5307, 57),
 (71302, 57),
 (11332, 57),
 (1791, 57)]

In [15]:
Cic_df = pd.DataFrame()

In [16]:
from functools import reduce

In [17]:
Cic_df = pd.concat(individual_dfs)  # concatenate all cleaned frames in one call

In [18]:
Cic_df.head()

Unnamed: 0,Protocol,Flow Duration,Total Fwd Packets,Total Backward Packets,Fwd Packets Length Total,Bwd Packets Length Total,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,Subflow Fwd Bytes,Init Fwd Win Bytes,Init Bwd Win Bytes,Fwd Act Data Packets,Fwd Seg Size Min,Active Max,Idle Mean,Idle Std,SimillarHTTP,Label
0,17,2,2,0,802.0,0.0,401.0,401.0,401.0,0.0,...,802,-1,-1,1,8,0.0,0.0,0.0,0,UDP
1,6,40833176,10,4,60.0,24.0,6.0,6.0,6.0,0.0,...,60,5840,0,9,20,305540.0,13509211.0,2733040.25,0,Syn
2,6,1,2,0,12.0,0.0,6.0,6.0,6.0,0.0,...,12,5840,-1,1,20,0.0,0.0,0.0,0,Syn
3,6,103,2,2,12.0,12.0,6.0,6.0,6.0,0.0,...,12,5840,0,1,20,0.0,0.0,0.0,0,Syn
4,6,41,2,2,12.0,12.0,6.0,6.0,6.0,0.0,...,12,5840,0,1,20,0.0,0.0,0.0,0,Syn


In [19]:
Cic_df.shape

(128072, 57)

In [20]:
Cic_df.to_csv('s3://sagemaker-us-east-1-123251154742/cicddos_training.csv') 

In [21]:
def reorder_columns(columns, first_cols=(), last_cols=(), drop_cols=()):
    # Note: set difference does not preserve order, so the middle columns come out in arbitrary order
    columns = list(set(columns) - set(first_cols))
    columns = list(set(columns) - set(drop_cols))
    columns = list(set(columns) - set(last_cols))
    new_order = list(first_cols) + columns + list(last_cols)
    return new_order

In [22]:
my_list = Cic_df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['Label'], last_cols=['Protocol'], drop_cols=[])
Cic_df_training = Cic_df[reordered_cols]

In [23]:
Cic_df_training.head()

Unnamed: 0,Label,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,Down/Up Ratio,...,Bwd Avg Bulk Rate,Fwd IAT Min,Bwd Packets/s,Flow IAT Min,Fwd Packet Length Mean,Avg Fwd Segment Size,Init Fwd Win Bytes,Fwd Packets Length Total,Bwd Packet Length Mean,Protocol
0,UDP,401000000.0,401.0,0.0,-1,0,0.0,2,0.0,0.0,...,0,2.0,0.0,2.0,401.0,401.0,-1,802.0,0.0,17
1,Syn,2.057151,6.0,305540.0,0,0,13509211.0,40833176,24.0,0.0,...,0,0.0,0.09796,0.0,6.0,6.0,5840,60.0,6.0,6
2,Syn,12000000.0,6.0,0.0,-1,0,0.0,1,0.0,0.0,...,0,1.0,0.0,1.0,6.0,6.0,5840,12.0,0.0,6
3,Syn,233009.7,6.0,0.0,0,0,0.0,103,12.0,1.0,...,0,1.0,19417.476562,1.0,6.0,6.0,5840,12.0,6.0,6
4,Syn,585365.9,6.0,0.0,0,0,0.0,41,12.0,1.0,...,0,1.0,48780.488281,1.0,6.0,6.0,5840,12.0,6.0,6
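The columns between `Label` and `Protocol` above appear in an arbitrary order because `reorder_columns` goes through Python sets. If a reproducible order is wanted, an order-preserving variant could look like the sketch below. `reorder_columns_stable` is a hypothetical name introduced here for illustration.

```python
def reorder_columns_stable(columns, first_cols=(), last_cols=(), drop_cols=()):
    """Like reorder_columns, but keeps the remaining columns in their original order."""
    excluded = set(first_cols) | set(last_cols) | set(drop_cols)
    middle = [c for c in columns if c not in excluded]
    return list(first_cols) + middle + list(last_cols)

demo_cols = ["Protocol", "Flow Duration", "Total Fwd Packets", "Label"]
stable_order = reorder_columns_stable(
    demo_cols, first_cols=["Label"], last_cols=["Protocol"]
)
print(stable_order)
```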


In [24]:
Cic_df_training.to_csv('s3://sagemaker-us-east-1-123251154742/cicddos_training.csv') 

In [25]:
unsw_df = pd.read_csv('s3://sagemaker-us-east-1-123251154742/UNSW_NB15_training-set.csv')
unsw_df.head()

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0


In [26]:
features = list(unsw_df)

In [27]:
print('The no of data points are:',unsw_df.shape[0])
print('='*40)
print('The no of features are:',unsw_df.shape[1])
print('='*40)
print('Some of the features are:',features)

The no of data points are: 82332
The no of features are: 45
Some of the features are: ['id', 'dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth', 'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm', 'ct_srv_dst', 'is_sm_ips_ports', 'attack_cat', 'label']


In [28]:
output = unsw_df['attack_cat'].values
labels = set(output)

In [29]:
print('The different type of output labels are:',labels)
print('='*100)
print('No. of different output labels are:', len(labels))

The different type of output labels are: {'Worms', 'Reconnaissance', 'Exploits', 'Fuzzers', 'DoS', 'Shellcode', 'Generic', 'Backdoor', 'Normal', 'Analysis'}
No. of different output labels are: 10


<h2> Data Cleaning:- </h2>

<h6>Checking for NULL values:-</h6>

In [30]:
print('Null values in the dataset are: ',len(unsw_df[unsw_df.isnull().any(axis=1)]))  # counts rows with at least one null

Null values in the dataset are:  0


<h6>Checking for DUPLICATE values:-</h6>

In [31]:
duplicateRowsDF = unsw_df[unsw_df.duplicated()]

In [32]:
duplicateRowsDF.head(5)

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label


In [33]:
#data.drop_duplicates(subset=features, keep='first', inplace=True)
unsw_df.shape

(82332, 45)

In [34]:
unsw_df.to_pickle('data.pkl')
unsw_df = pd.read_pickle('data.pkl')

<h5> Protocol_type:- </h5>

In [35]:
protocols = list(unsw_df['proto'].values)
protocols = list(set(protocols))
print('Protocol types are:', protocols)

Protocol types are: ['rvd', 'mhrp', 'idpr-cmtp', 'sps', 'mobile', 'stp', 'sdrp', 'iso-tp4', 'wb-mon', 'smp', 'pgm', 'zero', 'igp', 'st2', 'ip', 'larp', 'prm', 'ipnip', 'vrrp', 'ipip', 'sat-expak', 'sprite-rpc', 'trunk-2', 'nsfnet-igp', 'qnx', 'iatp', 'rdp', 'mux', 'idpr', 'ipv6-route', 'emcon', 'vmtp', 'ipv6-frag', 'iplt', 'swipe', 'argus', 'ifmp', 'trunk-1', 'ippc', 'sat-mon', 'tp++', 'wsn', 'dgp', 'chaos', 'cbt', 'leaf-1', 'ipv6', 'netblt', 'cftp', 'crtp', 'mfe-nsp', 'secure-vmtp', 'br-sat-mon', 'irtp', 'compaq-peer', 'fc', 'gre', 'cphb', 'il', 'uti', 'fire', 'sccopmce', 'ipx-n-ip', 'sep', 'skip', 'hmp', 'pri-enc', 'leaf-2', 'ib', 'isis', 'tcf', 'ax.25', 'ospf', 'ipv6-no', '3pc', 'pipe', 'ddp', 'sun-nd', 'gmtp', 'scps', 'ipv6-opts', 'pnni', 'sctp', 'vines', 'ipcv', 'srp', 'igmp', 'xnet', 'snp', 'crudp', 'xtp', 'encap', 'iso-ip', 'ttp', 'ggp', 'l2tp', 'cpnx', 'egp', 'rsvp', 'eigrp', 'xns-idp', 'pvp', 'merit-inp', 'etherip', 'mtp', 'a/n', 'sm', 'dcn', 'visa', 'pup', 'any', 'ptp', 'tlsp

In [36]:
len(protocols)

131

In [37]:
service_list = list(unsw_df['service'].values)
service_list = list(set(service_list))
print('Service types are:\n', service_list)
len(service_list)

Service types are:
 ['-', 'dhcp', 'ssl', 'ftp', 'irc', 'pop3', 'dns', 'ssh', 'http', 'smtp', 'snmp', 'radius', 'ftp-data']


13

In [38]:
state_list = list(unsw_df['state'].values)
state_list = list(set(state_list))
print('state types are:', state_list)

state types are: ['REQ', 'RST', 'FIN', 'INT', 'CLO', 'ACC', 'CON']


In [39]:
#Helper functions to normalize proto, service and state

In [40]:
def protocs(proto):
 
    if proto in protocols:
        return protocols.index(proto)
        
unsw_df['protocol'] = unsw_df['proto'].apply(lambda proto: 
   protocs(proto=proto) 
)
#drop the proto column
unsw_df.drop(columns=['proto'],
                   inplace=True)

In [41]:
def services(service):
 
    if service in service_list:
        return service_list.index(service)
        
unsw_df['sevices'] = unsw_df['service'].apply(lambda service: 
   services(service=service) 
)
#drop the service column
unsw_df.drop(columns=['service'],
                   inplace=True)

In [42]:
def states(state):
 
    if state in state_list:
        return state_list.index(state)
        
unsw_df['states'] = unsw_df['state'].apply(lambda state: 
   states(state=state) 
)
#drop the state column
unsw_df.drop(columns=['state'],
                   inplace=True)

In [43]:
unsw_df.head()

Unnamed: 0,id,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,...,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label,protocol,sevices,states
0,1,1.1e-05,2,0,496,0,90909.0902,254,0,180363632.0,...,0,0,1,2,0,Normal,0,120,0,3
1,2,8e-06,2,0,1762,0,125000.0003,254,0,881000000.0,...,0,0,1,2,0,Normal,0,120,0,3
2,3,5e-06,2,0,1068,0,200000.0051,254,0,854400000.0,...,0,0,1,3,0,Normal,0,120,0,3
3,4,6e-06,2,0,900,0,166666.6608,254,0,600000000.0,...,0,0,2,3,0,Normal,0,120,0,3
4,5,1e-05,2,0,2126,0,100000.0025,254,0,850400000.0,...,0,0,2,3,0,Normal,0,120,0,3
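The helper functions above assign each category the position it happens to have in a list built from a `set`, so the integer codes can change from one run to the next. A minimal sketch of a run-independent alternative, assuming a sorted category order is acceptable, is shown below. `encode_sorted` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

def encode_sorted(series):
    """Map each distinct value to its rank in sorted order (stable across runs)."""
    categories = sorted(series.unique())
    mapping = {value: code for code, value in enumerate(categories)}
    return series.map(mapping), mapping

demo = pd.DataFrame({"proto": ["udp", "tcp", "udp", "arp"]})
demo["protocol"], proto_mapping = encode_sorted(demo["proto"])
print(proto_mapping)
print(demo["protocol"].tolist())
```

Keeping the returned mapping around also makes it possible to apply the same codes to a test set later.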


In [44]:
def reorder_columns(columns, first_cols=(), last_cols=(), drop_cols=()):
    # Note: set difference does not preserve order, so the middle columns come out in arbitrary order
    columns = list(set(columns) - set(first_cols))
    columns = list(set(columns) - set(drop_cols))
    columns = list(set(columns) - set(last_cols))
    new_order = list(first_cols) + columns + list(last_cols)
    return new_order

In [45]:
my_list = unsw_df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['attack_cat'], last_cols=['id'], drop_cols=[])
unsw_df = unsw_df[reordered_cols]

In [46]:
unsw_df.head()

Unnamed: 0,attack_cat,ct_state_ttl,dbytes,protocol,tcprtt,dtcpb,sloss,sevices,sjit,states,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,Normal,2,0,120,0.0,0,0,0,0.0,3,...,1,1,0,0,0.011,0.0,0.0,248,2,1
1,Normal,2,0,120,0.0,0,0,0,0.0,3,...,1,1,0,0,0.008,0.0,0.0,881,2,2
2,Normal,2,0,120,0.0,0,0,0,0.0,3,...,1,1,0,0,0.005,0.0,0.0,534,3,3
3,Normal,2,0,120,0.0,0,0,0,0.0,3,...,2,2,0,0,0.006,0.0,0.0,450,3,4
4,Normal,2,0,120,0.0,0,0,0,0.0,3,...,2,2,0,0,0.01,0.0,0.0,1063,3,5


In [47]:
cic_features = list(Cic_df_training)

In [48]:
cic_features

['Label',
 'Flow Bytes/s',
 'Packet Length Max',
 'Active Max',
 'Init Bwd Win Bytes',
 'FIN Flag Count',
 'Idle Mean',
 'Flow Duration',
 'Bwd Packets Length Total',
 'Down/Up Ratio',
 'Subflow Fwd Packets',
 'Bwd Avg Bytes/Bulk',
 'Flow IAT Std',
 'Packet Length Std',
 'Avg Bwd Segment Size',
 'SimillarHTTP',
 'Bwd Packet Length Max',
 'CWE Flag Count',
 'PSH Flag Count',
 'Idle Std',
 'Fwd Act Data Packets',
 'Avg Packet Size',
 'Bwd Packet Length Min',
 'Fwd PSH Flags',
 'Flow IAT Mean',
 'Bwd IAT Total',
 'Bwd IAT Mean',
 'RST Flag Count',
 'Packet Length Min',
 'Subflow Fwd Bytes',
 'Bwd Avg Packets/Bulk',
 'Fwd Seg Size Min',
 'Fwd Avg Bulk Rate',
 'Fwd Avg Bytes/Bulk',
 'Fwd Packet Length Min',
 'Flow Packets/s',
 'Bwd IAT Min',
 'Packet Length Variance',
 'Total Fwd Packets',
 'Fwd Packet Length Max',
 'Fwd Avg Packets/Bulk',
 'Fwd Packet Length Std',
 'Fwd Header Length',
 'SYN Flag Count',
 'Total Backward Packets',
 'Bwd Header Length',
 'Flow IAT Max',
 'Bwd Avg Bulk Rate'

In [49]:
unsw_df.rename(columns = {'dur':'Flow Duration'}, inplace = True)

In [50]:
from functools import reduce

In [51]:
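# Outer merge on 'Flow Duration'. Note that UNSW-NB15 'dur' is in seconds while the CICFlowMeter
# 'Flow Duration' is in microseconds, so equal key values do not describe equal durations.
merged_df = Cic_df_training.merge(unsw_df, on=['Flow Duration'], how='outer', suffixes=['_COPYL', '_COPYR'])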

In [52]:
merged_df.head()

Unnamed: 0,Label,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,Down/Up Ratio,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,,,,,,,,,,
1,Syn,6000000.0,6.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,,,,,,,,,,
2,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,,,,,,,,,,
3,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,,,,,,,,,,
4,UDP,383000000.0,383.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,,,,,,,,,,
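The merged frame reported further down has 210404 rows, which equals 128072 + 82332, so no `Flow Duration` value was shared between the two datasets and the outer merge behaved like stacking the rows. The sketch below, on two made-up miniature frames, shows how to check for shared keys and compares the merge with a plain `pd.concat`.

```python
import pandas as pd

# Made-up miniature frames; only the column names come from the datasets
left = pd.DataFrame({"Flow Duration": [2.0, 103.0], "Label": ["UDP", "Syn"]})
right = pd.DataFrame(
    {"Flow Duration": [1.1e-05, 8e-06], "attack_cat": ["Normal", "Normal"]}
)

# Keys present in both frames; an empty set means the merge cannot match any rows
shared_keys = set(left["Flow Duration"]) & set(right["Flow Duration"])

merged = left.merge(right, on="Flow Duration", how="outer")
stacked = pd.concat([left, right], ignore_index=True)

print(len(shared_keys), merged.shape, stacked.shape)
```

When the key sets are disjoint, both results have the same shape and the same pattern of missing values; `pd.concat` states that intent more directly and does not depend on the key column at all.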


In [53]:
y1 = merged_df['Label']
y2= merged_df['attack_cat']

In [54]:
y=pd.concat([y1, y2])

In [55]:
y

0            UDP
1            Syn
2            UDP
3            UDP
4            UDP
           ...  
210399    Normal
210400    Normal
210401    Normal
210402    Normal
210403    Normal
Length: 420808, dtype: object
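Concatenating the two label columns doubles the length (420808 = 2 × 210404) and relies on row order once the nulls are removed. An alternative that keeps every label on its own row is to fill one column from the other. The sketch below uses a made-up frame and assumes each row has a value in exactly one of the two columns.

```python
import pandas as pd

demo = pd.DataFrame({
    "Label": ["UDP", None, "Syn", None],
    "attack_cat": [None, "Normal", None, "DoS"],
})

# Take 'Label' where present, otherwise fall back to 'attack_cat' (aligned by index)
y_demo = demo["Label"].fillna(demo["attack_cat"])
print(y_demo.tolist())
```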

In [56]:
merged_df.drop(['Label', 'attack_cat'], inplace=True, axis=1)  # inplace drop returns None, so there is nothing to assign

In [57]:
merged_df.head()

Unnamed: 0,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,Down/Up Ratio,Subflow Fwd Packets,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,,,,,,,,,,
1,6000000.0,6.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,,,,,,,,,,
2,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,,,,,,,,,,
3,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,,,,,,,,,,
4,383000000.0,383.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,,,,,,,,,,


In [58]:
check_list2=merged_df.isnull().sum()

In [59]:
check_list2.to_csv('nulls_merged_df.csv')

In [60]:
from csv import DictReader
# open file in read mode
with open("nulls_merged_df.csv", 'r') as f:
    dict_reader = DictReader(f)
    list_of_dict2 = list(dict_reader)

list_of_dict2

[OrderedDict([('', 'Flow Bytes/s'), ('0', '82332')]),
 OrderedDict([('', 'Packet Length Max'), ('0', '82332')]),
 OrderedDict([('', 'Active Max'), ('0', '82332')]),
 OrderedDict([('', 'Init Bwd Win Bytes'), ('0', '82332')]),
 OrderedDict([('', 'FIN Flag Count'), ('0', '82332')]),
 OrderedDict([('', 'Idle Mean'), ('0', '82332')]),
 OrderedDict([('', 'Flow Duration'), ('0', '0')]),
 OrderedDict([('', 'Bwd Packets Length Total'), ('0', '82332')]),
 OrderedDict([('', 'Down/Up Ratio'), ('0', '82332')]),
 OrderedDict([('', 'Subflow Fwd Packets'), ('0', '82332')]),
 OrderedDict([('', 'Bwd Avg Bytes/Bulk'), ('0', '82332')]),
 OrderedDict([('', 'Flow IAT Std'), ('0', '82332')]),
 OrderedDict([('', 'Packet Length Std'), ('0', '82332')]),
 OrderedDict([('', 'Avg Bwd Segment Size'), ('0', '82332')]),
 OrderedDict([('', 'SimillarHTTP'), ('0', '82332')]),
 OrderedDict([('', 'Bwd Packet Length Max'), ('0', '82332')]),
 OrderedDict([('', 'CWE Flag Count'), ('0', '82332')]),
 OrderedDict([('', 'PSH Fla

In [61]:
merged_df.isnull().sum()

Flow Bytes/s           82332
Packet Length Max      82332
Active Max             82332
Init Bwd Win Bytes     82332
FIN Flag Count         82332
                       ...  
synack                128072
ackdat                128072
smean                 128072
ct_srv_src            128072
id                    128072
Length: 99, dtype: int64
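The per-column null counts can also be inspected in memory, without writing them to a CSV file and reading them back. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 1.0],
    "c": [1, 2, 3],
})

null_counts = demo.isnull().sum()
# Keep only the columns that actually contain nulls, largest count first
with_nulls = null_counts[null_counts > 0].sort_values(ascending=False).to_dict()
print(with_nulls)
```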

In [62]:
merged_features = list(merged_df)

In [63]:
check_list2 = merged_df.isnull().sum()

In [64]:
check_list2.to_csv("nulls_check_list2.csv")

In [65]:
from csv import DictReader
# open file in read mode
with open("nulls_check_list2.csv", 'r') as f:
    dict_reader = DictReader(f)
    list_of_dict2 = list(dict_reader)

list_of_dict2

[OrderedDict([('', 'Flow Bytes/s'), ('0', '82332')]),
 OrderedDict([('', 'Packet Length Max'), ('0', '82332')]),
 OrderedDict([('', 'Active Max'), ('0', '82332')]),
 OrderedDict([('', 'Init Bwd Win Bytes'), ('0', '82332')]),
 OrderedDict([('', 'FIN Flag Count'), ('0', '82332')]),
 OrderedDict([('', 'Idle Mean'), ('0', '82332')]),
 OrderedDict([('', 'Flow Duration'), ('0', '0')]),
 OrderedDict([('', 'Bwd Packets Length Total'), ('0', '82332')]),
 OrderedDict([('', 'Down/Up Ratio'), ('0', '82332')]),
 OrderedDict([('', 'Subflow Fwd Packets'), ('0', '82332')]),
 OrderedDict([('', 'Bwd Avg Bytes/Bulk'), ('0', '82332')]),
 OrderedDict([('', 'Flow IAT Std'), ('0', '82332')]),
 OrderedDict([('', 'Packet Length Std'), ('0', '82332')]),
 OrderedDict([('', 'Avg Bwd Segment Size'), ('0', '82332')]),
 OrderedDict([('', 'SimillarHTTP'), ('0', '82332')]),
 OrderedDict([('', 'Bwd Packet Length Max'), ('0', '82332')]),
 OrderedDict([('', 'CWE Flag Count'), ('0', '82332')]),
 OrderedDict([('', 'PSH Fla

In [66]:
merged_df.drop('SimillarHTTP', axis=1, inplace=True)

In [67]:
merged_features = list(merged_df)

In [68]:
for feat in merged_features:
    # Treat +/-inf as missing first, then fill every missing value with the column median
    col = merged_df[feat].replace([np.inf, -np.inf], np.nan)
    merged_df[feat] = col.fillna(col.median())
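To see what median imputation does to a single column, and why infinite values are best converted to NaN before the median is computed, here is a small worked example on made-up numbers:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"rate": [1.0, np.inf, 3.0, np.nan, 5.0]})

median_raw = demo["rate"].median()       # inf still counts as a (very large) value
cleaned = demo["rate"].replace([np.inf, -np.inf], np.nan)
median_clean = cleaned.median()          # NaN entries are skipped
demo["rate"] = cleaned.fillna(median_clean)

print(median_raw, median_clean, demo["rate"].tolist())
```

With the infinite value left in place the median is 4.0; after converting it to NaN the median is 3.0, and both gaps are filled with that value.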

In [69]:
merged_df.head()

Unnamed: 0,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,Down/Up Ratio,Subflow Fwd Packets,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
1,6000000.0,6.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
2,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
3,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
4,383000000.0,383.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,2.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5


In [70]:
merged_df.isnull().sum()

Flow Bytes/s          0
Packet Length Max     0
Active Max            0
Init Bwd Win Bytes    0
FIN Flag Count        0
                     ..
synack                0
ackdat                0
smean                 0
ct_srv_src            0
id                    0
Length: 98, dtype: int64

In [71]:
merged_df.shape

(210404, 98)

In [72]:
# Keep only the non-null labels
not_null = pd.notnull(y)
y = y[not_null]

In [73]:
y

0            UDP
1            Syn
2            UDP
3            UDP
4            UDP
           ...  
210399    Normal
210400    Normal
210401    Normal
210402    Normal
210403    Normal
Length: 210404, dtype: object

In [74]:
merged_df.insert(1, "tmp", np.arange(len(merged_df)), True)  # row counter derived from the frame instead of a hard-coded length

In [75]:
y_df = pd.DataFrame({'attack_cat':y.values})

In [76]:
y_df

Unnamed: 0,attack_cat
0,UDP
1,Syn
2,UDP
3,UDP
4,UDP
...,...
210399,Normal
210400,Normal
210401,Normal
210402,Normal


In [77]:
y_df.insert(1, "tmp", np.arange(len(y_df)), True)  # row counter derived from the frame instead of a hard-coded length

In [78]:
from functools import reduce

In [79]:
merged_train = reduce(lambda l,r: l.merge(r, on=['tmp'], how='outer', suffixes=['_COPYL', '_COPYR']), [y_df, merged_df])

In [80]:
merged_train.head()

Unnamed: 0,attack_cat,tmp,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,UDP,0,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
1,Syn,1,6000000.0,6.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
2,UDP,2,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
3,UDP,3,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
4,UDP,4,383000000.0,383.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5


In [81]:
merged_train.drop('tmp', axis=1, inplace=True)
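The cells above attach the labels to the features by adding a temporary `tmp` key to both frames, merging on it, and dropping it again. When both frames have the same number of rows in the same order, a column-wise `pd.concat` gives the same layout in one step. This is a minimal sketch on hypothetical demo data; `attach_labels` is an illustrative helper, not part of the notebook:

```python
import pandas as pd


def attach_labels(features: pd.DataFrame, labels: pd.Series,
                  name: str = "attack_cat") -> pd.DataFrame:
    """Put the label column in front of the features, aligning rows by position."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    # Rebuild the labels with a fresh 0..n-1 index so rows align by position.
    label_col = pd.Series(labels.to_numpy(), name=name)
    return pd.concat([label_col, features.reset_index(drop=True)], axis=1)


# Hypothetical stand-ins for merged_df and y.
demo_features = pd.DataFrame({"dpkts": [2.0, 4.0, 6.0], "smean": [65.0, 70.0, 80.0]})
demo_labels = pd.Series(["UDP", "Syn", "Normal"])

demo_train = attach_labels(demo_features, demo_labels)
print(demo_train)
```

The length check guards against the silent misalignment that could occur if rows had been dropped from only one of the two frames.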

In [82]:
merged_train.head()

Unnamed: 0,attack_cat,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,Down/Up Ratio,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
1,Syn,6000000.0,6.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
2,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
3,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
4,UDP,383000000.0,383.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5


In [83]:
merged_train.to_csv('s3://sagemaker-us-east-1-123251154742/research/training.csv')

In [84]:
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import shap
import time

In [85]:
merged_df = pd.read_csv('s3://sagemaker-us-east-1-123251154742/research/training.csv')

In [86]:
merged_df.head()

Unnamed: 0.1,Unnamed: 0,attack_cat,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,0,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
1,1,Syn,6000000.0,6.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
2,2,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
3,3,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
4,4,UDP,383000000.0,383.0,0.0,-1.0,0.0,0.0,2.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5


In [87]:
merged_df.drop('Unnamed: 0', axis=1, inplace=True)
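The `Unnamed: 0` column dropped here appears because `to_csv` writes the row index as an extra, unnamed first column by default, and `read_csv` then reads it back as data. Passing `index=False` when writing avoids the round trip. The sketch below demonstrates this with an in-memory buffer instead of the S3 path used in the notebook; the demo frame is hypothetical:

```python
import io

import pandas as pd

demo = pd.DataFrame({"attack_cat": ["UDP", "Syn"], "dpkts": [2.0, 4.0]})

# Default behaviour: the index is written and comes back as "Unnamed: 0".
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False the file contains only the real columns.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```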

In [88]:
merged_df.head()

Unnamed: 0,attack_cat,Flow Bytes/s,Packet Length Max,Active Max,Init Bwd Win Bytes,FIN Flag Count,Idle Mean,Flow Duration,Bwd Packets Length Total,Down/Up Ratio,...,ct_src_ltm,ct_src_dport_ltm,ct_ftp_cmd,dpkts,sinpkt,synack,ackdat,smean,ct_srv_src,id
0,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
1,Syn,6000000.0,6.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
2,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
3,UDP,401000000.0,401.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5
4,UDP,383000000.0,383.0,0.0,-1.0,0.0,0.0,2.0,0.0,0.0,...,3.0,1.0,0.0,2.0,0.557929,0.000441,8e-05,65.0,5.0,41166.5


<h2> Train Test Split:- </h2>

In [89]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(merged_df.drop('attack_cat', axis=1), merged_df['attack_cat'], stratify=merged_df['attack_cat'], test_size=0.20)

In [90]:
print('Train data')
print(X_train.shape)
print(Y_train.shape)
print('='*20)
print('Test data')
print(X_test.shape)
print(Y_test.shape)

Train data
(168323, 98)
(168323,)
====================
Test data
(42081, 98)
(42081,)
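The split uses `stratify`, which keeps the class proportions the same in the training and test parts. It does not set `random_state`, so each run produces a different split. The following sketch on synthetic data (all names and values are illustrative) shows both points: the proportions are preserved, and a fixed seed makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
# Imbalanced hypothetical labels: 70% / 20% / 10%.
demo_y = pd.Series(["Normal"] * 700 + ["DoS"] * 200 + ["Worms"] * 100)

Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, stratify=demo_y, test_size=0.20, random_state=42
)

# Class shares in each part; with stratify they match the full data set.
train_share = ytr.value_counts(normalize=True)
test_share = yte.value_counts(normalize=True)
print(train_share)
print(test_share)

# Repeating the call with the same seed selects the same rows.
Xtr_again, _, _, _ = train_test_split(
    demo_X, demo_y, stratify=demo_y, test_size=0.20, random_state=42
)
same_rows = Xtr.index.equals(Xtr_again.index)
```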


<h2> Applying Machine Learning Algorithms:- </h2>

<h5> Utility Functions:- </h5>

In [91]:
import datetime as dt
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import joblib  # sklearn.externals.joblib is deprecated and removed in newer scikit-learn releases

In [92]:
!pip install joblib



In [93]:
import joblib
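This section imports `joblib` but does not show how it is used. A typical use is to save a fitted estimator to disk and load it back later. The sketch below illustrates that on synthetic data; the file name `demo_model.joblib` and the small decision tree are assumptions made for the example only.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem standing in for the real training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(clf, model_path)        # serialize the fitted estimator
    restored = joblib.load(model_path)  # load it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same_predictions)
```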

In [94]:
attacks = list(set(Y_train))

In [95]:
attacks

['Worms',
 'Reconnaissance',
 'MSSQL',
 'Portmap',
 'Exploits',
 'NetBIOS',
 'UDP',
 'DoS',
 'Normal',
 'Generic',
 'Benign',
 'LDAP',
 'Syn',
 'Shellcode',
 'Backdoor',
 'UDPLag',
 'Fuzzers',
 'Analysis']

In [97]:
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
import numpy as np

In [96]:
X_train.shape,X_test.shape

((168323, 98), (42081, 98))

In [98]:
#Implementing a Random Forest Classifier

In [99]:
from sklearn.ensemble import RandomForestClassifier

In [100]:
model=RandomForestClassifier()

In [101]:
model.fit(X_train,Y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [102]:
Y_pred=model.predict(X_test)
Y_pred

array(['Normal', 'Syn', 'Normal', ..., 'Benign', 'Benign', 'Benign'],
      dtype=object)

In [103]:
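# Note: scikit-learn metrics expect (y_true, y_pred). Predictions are passed first here,
# so the "precision" and "recall" columns below are swapped and "support" counts predicted labels.
accuracy_score(Y_pred,Y_test)
print(classification_report(Y_pred,Y_test))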

                precision    recall  f1-score   support

      Analysis       0.07      0.08      0.08       119
      Backdoor       0.00      0.00      0.00       127
        Benign       1.00      1.00      1.00      9285
           DoS       0.36      0.43      0.39       683
      Exploits       0.78      0.71      0.75      2452
       Fuzzers       0.89      0.85      0.87      1263
       Generic       0.98      1.00      0.99      3689
          LDAP       0.95      0.95      0.95       454
         MSSQL       0.96      0.95      0.95      1799
       NetBIOS       0.65      0.59      0.62       190
        Normal       1.00      1.00      1.00      7400
       Portmap       0.43      0.53      0.48       146
Reconnaissance       0.84      0.87      0.85       675
     Shellcode       0.57      0.74      0.64        58
           Syn       1.00      1.00      1.00     10067
           UDP       0.98      0.99      0.98      3671
        UDPLag       0.17      0.67      0.27  
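scikit-learn's metric functions take the true labels first and the predictions second. Accuracy does not depend on the order, but precision and recall do: swapping the arguments exchanges the two values. The toy labels below are hypothetical and only illustrate the effect.

```python
from sklearn.metrics import precision_score, recall_score

# Three true "DoS" rows; the model predicts "DoS" twice, once correctly.
y_true = ["DoS", "DoS", "DoS", "Normal", "Normal", "Normal", "Normal", "Normal"]
y_pred = ["DoS", "Normal", "Normal", "Normal", "Normal", "Normal", "Normal", "DoS"]

# Documented order: (y_true, y_pred).
precision_ok = precision_score(y_true, y_pred, pos_label="DoS")   # 1 of 2 predicted DoS is right
recall_ok = recall_score(y_true, y_pred, pos_label="DoS")         # 1 of 3 true DoS is found

# Swapped order: the two metrics trade places.
precision_swapped = precision_score(y_pred, y_true, pos_label="DoS")
recall_swapped = recall_score(y_pred, y_true, pos_label="DoS")

print(precision_ok, recall_ok)
print(precision_swapped, recall_swapped)
```

When reading the reports in this section, the column labelled precision is therefore the per-class recall, and vice versa.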

In [104]:
confusion_matrix(Y_pred,Y_test)

array([[   10,    59,     0,    17,    11,    21,     0,     0,     0,
            0,     0,     0,     1,     0,     0,     0,     0,     0],
       [   70,     0,     0,     9,    47,     1,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0],
       [    0,     0,  9279,     0,     0,     0,     0,     0,     0,
            0,     0,     2,     0,     0,     4,     0,     0,     0],
       [   19,     7,     0,   296,   284,    30,     9,     0,     0,
            0,     0,     0,    33,     5,     0,     0,     0,     0],
       [   20,    46,     0,   410,  1747,    76,    73,     0,     0,
            0,     0,     0,    63,     9,     0,     0,     0,     8],
       [   16,     3,     0,    43,    87,  1074,     9,     0,     0,
            0,     0,     0,    16,    15,     0,     0,     0,     0],
       [    0,     0,     0,     3,     5,     0,  3680,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     1],

In [105]:
#Thus, Random Forest Classifier shows 95% accuracy on the test set

In [106]:
#Implementing a KNN

In [107]:
from sklearn.neighbors import KNeighborsClassifier

In [108]:
knn=KNeighborsClassifier(n_neighbors=10)

In [109]:
knn.fit(X_train,Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [110]:
Y_pred_knn=knn.predict(X_test)
Y_pred_knn

array(['Exploits', 'Syn', 'Normal', ..., 'Benign', 'Benign', 'Benign'],
      dtype=object)

In [113]:
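# Note: arguments are passed as (y_pred, y_true), so precision and recall are swapped in the report below.
accuracy_score(Y_pred_knn,Y_test)
print(classification_report(Y_pred_knn,Y_test))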

                precision    recall  f1-score   support

      Analysis       0.01      0.02      0.01        59
      Backdoor       0.00      0.00      0.00        20
        Benign       0.98      0.99      0.98      9254
           DoS       0.32      0.41      0.36       640
      Exploits       0.28      0.28      0.28      2192
       Fuzzers       0.17      0.33      0.23       646
       Generic       0.96      0.99      0.98      3669
          LDAP       0.87      0.87      0.87       455
         MSSQL       0.94      0.87      0.91      1911
       NetBIOS       0.79      0.57      0.66       241
        Normal       0.89      0.74      0.81      8944
       Portmap       0.21      0.41      0.28        93
Reconnaissance       0.38      0.92      0.53       288
     Shellcode       0.03      0.40      0.05         5
           Syn       0.99      0.99      0.99     10082
           UDP       0.94      0.97      0.96      3579
        UDPLag       0.17      0.67      0.27  

In [114]:
confusion_matrix(Y_pred_knn,Y_test)

array([[   1,    7,    0,    3,   34,   13,    0,    0,    0,    0,    1,
           0,    0,    0,    0,    0,    0,    0],
       [   2,    0,    0,    4,    9,    5,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0],
       [   0,    0, 9116,    0,    0,    0,    0,    3,    3,    1,    3,
          14,    0,    0,   92,   21,    1,    0],
       [  20,    4,    0,  264,  243,   42,    8,    0,    0,    0,   21,
           0,   34,    4,    0,    0,    0,    0],
       [  73,   66,    0,  315,  619,  339,   28,    0,    0,    0,  612,
           0,  122,   15,    0,    0,    0,    3],
       [  31,   34,    0,   46,  150,  211,    4,    0,    0,    0,  147,
           0,   17,    6,    0,    0,    0,    0],
       [   0,    0,    0,    6,    5,    6, 3636,    0,    0,    0,    9,
           0,    7,    0,    0,    0,    0,    0],
       [   0,    0,    6,    0,    0,    0,    0,  394,   34,    6,    0,
           1,    0,    0,    0,   14,    0,    0],


In [115]:
#Thus, the KNN Classifier shows 86% accuracy on the test set, i.e. about 9 percentage points lower than the Random Forest Classifier
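KNN classifies by distance, so features with a large numeric range dominate the neighbour search. Here KNN is fitted on unscaled features whose ranges differ by many orders of magnitude (for example `Flow Bytes/s` versus `synack`). Whether this explains the gap to the tree-based models is not verified in this notebook. The sketch below only illustrates the general effect on synthetic data, using a scaling step inside a pipeline; every name and value in it is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)
# Informative feature on a small scale: class means at -2 and +2.
informative = rng.normal(loc=np.where(labels == 1, 2.0, -2.0), scale=1.0)
# Pure-noise feature on a very large scale.
noise = rng.normal(scale=1e6, size=n)
X_demo = np.column_stack([informative, noise])

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, labels, stratify=labels, test_size=0.25, random_state=0
)

# Same classifier, without and with standardisation.
unscaled = KNeighborsClassifier(n_neighbors=10).fit(Xtr, ytr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10)).fit(Xtr, ytr)

unscaled_acc = unscaled.score(Xte, yte)
scaled_acc = scaled.score(Xte, yte)
print(unscaled_acc, scaled_acc)
```

Putting the scaler inside the pipeline also ensures that it is fitted on the training rows only.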

In [116]:
#Implementing a Decision Tree

In [117]:
from sklearn.tree import DecisionTreeClassifier

In [118]:
dtc=DecisionTreeClassifier()

In [119]:
dtc.fit(X_train,Y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [120]:
Y_pred_dtc=dtc.predict(X_test)
Y_pred_dtc

array(['Normal', 'Syn', 'Normal', ..., 'Benign', 'Benign', 'Benign'],
      dtype=object)

In [121]:
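# Note: arguments are passed as (y_pred, y_true), so precision and recall are swapped in the report below.
accuracy_score(Y_pred_dtc,Y_test)
print(classification_report(Y_pred_dtc,Y_test))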

                precision    recall  f1-score   support

      Analysis       0.09      0.09      0.09       134
      Backdoor       0.09      0.09      0.09       119
        Benign       1.00      1.00      1.00      9283
           DoS       0.38      0.39      0.39       798
      Exploits       0.71      0.71      0.71      2231
       Fuzzers       0.83      0.82      0.82      1219
       Generic       0.98      0.98      0.98      3765
          LDAP       0.95      0.93      0.94       464
         MSSQL       0.95      0.94      0.94      1784
       NetBIOS       0.77      0.60      0.68       225
        Normal       1.00      1.00      1.00      7400
       Portmap       0.41      0.58      0.48       126
Reconnaissance       0.84      0.83      0.84       702
     Shellcode       0.57      0.51      0.53        85
           Syn       1.00      1.00      1.00     10056
           UDP       0.98      0.98      0.98      3673
        UDPLag       0.25      0.75      0.38  

In [122]:
confusion_matrix(Y_pred_dtc,Y_test)

array([[   12,    45,     0,    19,    16,    40,     0,     0,     0,
            0,     0,     0,     2,     0,     0,     0,     0,     0],
       [   48,    11,     0,    11,    43,     6,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0],
       [    0,     0,  9272,     0,     0,     0,     0,     0,     0,
            1,     0,     3,     0,     0,     7,     0,     0,     0],
       [   18,     9,     0,   314,   357,    42,    16,     0,     0,
            0,     0,     0,    34,     7,     0,     0,     0,     1],
       [   24,    40,     0,   372,  1589,    92,    41,     0,     0,
            0,     0,     0,    56,    14,     0,     0,     0,     3],
       [   33,     9,     0,    45,    97,  1002,     8,     0,     0,
            0,     0,     0,    17,     8,     0,     0,     0,     0],
       [    0,     0,     0,    16,    33,     7,  3704,     0,     0,
            0,     0,     0,     3,     1,     0,     0,     0,     1],

In [123]:
#Thus, the Decision tree Classifier shows 95% accuracy on the test set.

In [124]:
#Implementing a Naive Bayes Classifier.

In [125]:
from sklearn.naive_bayes import GaussianNB

In [126]:
nb=GaussianNB()

In [127]:
nb.fit(X_train,Y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [128]:
Y_pred_nb=nb.predict(X_test)
Y_pred_nb

array(['Reconnaissance', 'NetBIOS', 'Normal', ..., 'UDPLag', 'Benign',
       'UDPLag'], dtype='<U14')

In [129]:
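# Note: arguments are passed as (y_pred, y_true), so precision and recall are swapped in the report below.
accuracy_score(Y_pred_nb,Y_test)
print(classification_report(Y_pred_nb,Y_test))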

                precision    recall  f1-score   support

      Analysis       0.01      0.02      0.01        44
      Backdoor       0.00      0.00      0.00         1
        Benign       0.09      1.00      0.16       825
           DoS       0.00      0.04      0.00        49
      Exploits       0.03      0.56      0.06       119
       Fuzzers       0.12      0.29      0.17       500
       Generic       0.97      0.50      0.66      7249
          LDAP       0.73      0.47      0.58       697
         MSSQL       0.09      0.55      0.15       281
       NetBIOS       0.00      0.00      0.00       656
        Normal       0.35      0.88      0.50      2891
       Portmap       0.01      0.06      0.01        17
Reconnaissance       0.50      0.06      0.11      5609
     Shellcode       0.01      0.33      0.03         3
           Syn       0.77      0.84      0.80      9135
           UDP       0.06      1.00      0.11       215
        UDPLag       0.92      0.00      0.00  

In [130]:
confusion_matrix(Y_pred_nb,Y_test)

array([[   1,    3,    0,    7,    7,    7,    0,    0,    0,    0,    3,
           0,   16,    0,    0,    0,    0,    0],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    1,
           0,    0,    0,    0,    0,    0,    0],
       [   0,    0,  822,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    3,    0,    0,    0],
       [   0,    0,    0,    2,    7,    0,    4,    0,    0,    0,   35,
           0,    0,    0,    0,    0,    0,    1],
       [   0,    0,    0,   18,   67,    6,    1,    0,    0,    0,   25,
           0,    1,    0,    0,    0,    0,    1],
       [   0,    0,    0,   10,   56,  145,   15,    0,    0,    0,  269,
           0,    1,    4,    0,    0,    0,    0],
       [ 125,  107,    0,  587,  687,  393, 3659,    0,    0,    0, 1323,
           0,  331,   37,    0,    0,    0,    0],
       [   0,    0,   16,    0,    0,    0,    0,  331,  349,    0,    0,
           1,    0,    0,    0,    0,    0,    0],
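The four classifiers in this section are fitted and scored in separate cells. A loop over a dictionary of estimators keeps the evaluation identical for all of them and makes it easy to report macro-averaged F1 next to accuracy, which is informative when classes are imbalanced. This is a sketch on synthetic data from `make_classification`; it does not reproduce the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Imbalanced three-class toy problem (stand-in for the merged data set).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=0
)

candidates = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=10),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(Xtr, ytr)
    pred = clf.predict(Xte)
    # True labels first, predictions second.
    scores[name] = {
        "accuracy": accuracy_score(yte, pred),
        "macro_f1": f1_score(yte, pred, average="macro"),
    }

for name, result in scores.items():
    print(f"{name:>12}: accuracy={result['accuracy']:.3f} macro_f1={result['macro_f1']:.3f}")
```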


In [132]:
#Random Forest and Decision Tree Classifiers show the best performance with 95% accuracy, followed by KNN with about 85 to 86% accuracy and Naive Bayes with 38% accuracy. Thus, the RF and DT Classifiers exhibit the best performance and Naive Bayes the worst.

In [131]:
#We now study how the same datasets can be used in a neural network model.

In [133]:
!pip install keras

In [134]:
!pip install tensorflow

In [135]:
# Numerical and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, Adam, Adadelta, RMSprop
import keras.backend as K
# Train-Test
from sklearn.model_selection import train_test_split
# Scaling data
from sklearn.preprocessing import StandardScaler
# Classification Report
from sklearn.metrics import classification_report
# One-hot encoding of class labels (keras.utils.np_utils is not available in recent Keras releases)
from keras.utils import to_categorical

In [136]:
#Categorizing attack_cat:

In [137]:
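# Replace each class name with its position in the attacks list
attack_to_index = {attack: index for index, attack in enumerate(attacks)}
merged_train["attack_cat"] = merged_train["attack_cat"].map(attack_to_index)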
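The integer codes assigned here depend on the order of `attacks`, which was built with `list(set(Y_train))`. The iteration order of a set of strings can differ between Python sessions, so the number assigned to a class is not guaranteed to be the same from run to run. scikit-learn's `LabelEncoder` is one alternative: it sorts the class names, so the mapping is stable, and it can translate the numbers back to names. The sketch below uses a few hypothetical labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

demo_labels = np.array(["UDP", "Syn", "Normal", "UDP", "Benign"])

encoder = LabelEncoder()
# Classes are sorted alphabetically, then numbered from 0.
encoded = encoder.fit_transform(demo_labels)
# inverse_transform recovers the original names from the integer codes.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
print(list(decoded))
```

Keeping the fitted encoder around also makes it possible to label the rows of a classification report or confusion matrix with class names instead of indices.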

In [138]:
# Remove Missing Values 
na = pd.notnull(merged_train["attack_cat"])
merged_train = merged_train[na]

In [139]:
x = merged_train.drop("attack_cat", axis = 1)

In [140]:
#Standard Scaling of features:

In [141]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = pd.DataFrame(sc.fit_transform(x))
y = merged_train["attack_cat"]
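In this cell the scaler is fitted on all rows, and the train/test split happens afterwards, so the test rows contribute to the mean and variance used for scaling. A common alternative, which is not what the notebook does, is to split first and fit the scaler on the training rows only. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features and labels standing in for x and y.
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_demo = rng.integers(0, 3, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()
Xtr_scaled = scaler.fit_transform(Xtr)  # statistics come from the training rows only
Xte_scaled = scaler.transform(Xte)      # test rows reuse the training statistics

print(Xtr_scaled.mean(axis=0).round(6))
print(Xte_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 by construction, while the test columns are only close to 0, which is expected when the test set is treated as unseen data.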

In [142]:
y_cat = to_categorical(y)

In [143]:
y_cat

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
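The printed corners of `y_cat` show only zeros because the middle columns are elided; each row contains exactly one 1, in the column of its class index. The NumPy sketch below builds the same kind of one-hot matrix for a few integer labels. `one_hot` is a hypothetical helper written for this illustration, not a Keras function.

```python
import numpy as np


def one_hot(class_ids, num_classes):
    """Return a float32 matrix with a single 1 per row, at the class index."""
    class_ids = np.asarray(class_ids, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[class_ids]


demo_ids = [2, 0, 1, 2]
demo_one_hot = one_hot(demo_ids, num_classes=3)
print(demo_one_hot)

# argmax along each row recovers the original class index.
recovered_ids = demo_one_hot.argmax(axis=1)
```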

In [190]:
from keras.layers import Dense, Dropout, Activation

In [191]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x.values, y_cat, test_size=0.2)

In [192]:
model = Sequential()
model.add(Dense(60, input_shape = (98,), activation = "relu"))
model.add(Dense(15, activation = "relu"))
model.add(Dropout(0.2))
model.add(Dense(18, activation = "softmax"))
model.compile(optimizer=Adam(learning_rate=0.01), loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 60)                5940      
                                                                 
 dense_7 (Dense)             (None, 15)                915       
                                                                 
 dropout (Dropout)           (None, 15)                0         
                                                                 
 dense_8 (Dense)             (None, 18)                288       
                                                                 
Total params: 7,143
Trainable params: 7,143
Non-trainable params: 0
_________________________________________________________________
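The parameter counts in the summary can be checked by hand: a Dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases, and the Dropout layer adds no parameters. The short sketch below reproduces the three counts and the total; `dense_params` is an illustrative helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Input width followed by the widths of the three Dense layers.
layer_sizes = [98, 60, 15, 18]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # parameters of dense_6, dense_7, dense_8
print(total_params)
```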


In [193]:
#The summary above describes the neural network model we have built. The model has two hidden Dense layers with ReLU activation (60 and 15 units), a Dropout layer with rate 0.2, and a final Dense layer with 18 units and softmax activation that outputs multi-class probabilities. We use categorical cross-entropy as the loss function with the Adam optimizer.

In [194]:
#Fit the model and run for 10 epochs:

In [195]:
#Implementing a NN on the combined dataset

In [196]:
model.fit(x_train, y_train, verbose=1, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc640069690>

In [197]:
#We could achieve around 91% accuracy within the first 10 epochs

In [199]:
#Confusion Matrix:

In [200]:
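from sklearn.metrics import confusion_matrix
# Predicted class probabilities, one row per test sample
y_pred_class = model.predict(x_test)
# Most probable class index for each sample
classes_x = np.argmax(y_pred_class, axis=1)
# Recover the true class index from the one-hot labels
y_test_class = np.argmax(y_test, axis=1)
# Rows are true classes, columns are predicted classes
confusion_matrix(y_test_class, classes_x)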



array([[   0,    0,    0,    0,    9,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    1,    0],
       [   0,  473,    0,    0,  185,    0,    0,    0,    0,    2,    0,
           1,    0,    0,    0,    0,   53,    0],
       [   0,    0, 1538,    0,    0,    0,    7,    0,    0,    0,    2,
         246,    0,    0,    0,    0,    0,    0],
       [   0,    0,   12,   87,    0,   35,    7,    0,    0,    0,   20,
          15,    0,    0,    0,    0,    0,    0],
       [   0,   17,    0,    0, 1979,    0,    0,    0,    0,    3,    0,
          82,    0,    0,    0,    0,  134,    0],
       [   0,    0,    8,   96,    0,   63,   11,    0,    0,    0,    1,
           1,    0,    0,    0,    0,    0,    0],
       [   0,    0,  342,    2,    0,    0, 3331,    0,    0,    0,    0,
           3,    0,    0,    0,    0,    0,    0],
       [   0,   21,    0,    0,  727,    0,    0,    0,    0,    0,    0,
          13,    0,    0,    0,    0,   63,    0],
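With the true labels passed first, each row of this matrix is a true class and each column a predicted class. The diagonal holds the correctly classified samples, so dividing it by the row totals gives the recall of each class. The sketch below walks through the same steps (argmax decoding, confusion matrix, per-class recall) on a few hypothetical probability rows.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical softmax outputs for five samples and three classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
    [0.6, 0.3, 0.1],
])
# One-hot encoded true labels for the same samples.
true_one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
])

pred_ids = probs.argmax(axis=1)         # most probable class per sample
true_ids = true_one_hot.argmax(axis=1)  # class index from the one-hot rows

cm = confusion_matrix(true_ids, pred_ids, labels=[0, 1, 2])

# Recall per class: correct predictions divided by the number of true samples.
row_totals = cm.sum(axis=1)
per_class_recall = np.divide(
    cm.diagonal(), row_totals,
    out=np.zeros(len(cm)), where=row_totals > 0,
)

print(cm)
print(per_class_recall)
```

The `where` argument leaves the recall at 0 for a class that has no true samples, instead of dividing by zero.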


In [201]:
#Classification Report:

In [202]:
from sklearn.metrics import classification_report
print(classification_report(y_test_class, classes_x))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.82      0.66      0.73       714
           2       0.79      0.86      0.82      1793
           3       0.47      0.49      0.48       176
           4       0.57      0.89      0.69      2215
           5       0.62      0.35      0.45       180
           6       0.99      0.91      0.95      3678
           7       0.00      0.00      0.00       824
           8       1.00      1.00      1.00      7422
           9       1.00      0.96      0.98      3723
          10       0.99      0.99      0.99      9347
          11       0.50      0.99      0.66       482
          12       1.00      0.99      0.99      9960
          13       0.00      0.00      0.00        80
          14       0.00      0.00      0.00       133
          15       0.00      0.00      0.00        17
          16       0.72      0.75      0.73      1188
          17       0.00    