# **Network IPS with enhancements for DDoS attack detection**

In this project we build a network intrusion prevention system (IPS) that detects different kinds of attacks on an SDN controller, especially DDoS attacks. We used both machine learning and deep learning algorithms to build a better system.

The following are the detailed steps and code explaining the whole system.


# ***1) Importing needed libraries***
We thought it best to collect all the Python libraries used in one section, so they are imported all at once instead of being scattered throughout the project.

In [None]:
%matplotlib inline
# Start Python Imports
import math, time, random, datetime

# Data Manipulation
import numpy as np
import pandas as pd
from numpy import mean
from numpy import std

# Visualization 
import matplotlib.pyplot as plt

#import missingno
import seaborn as sns
plt.style.use('seaborn-whitegrid')

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn import model_selection, tree, preprocessing, metrics, linear_model
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, StratifiedKFold, cross_validate, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
!pip install category_encoders
import category_encoders as ce


# Deep learning libraries
import keras
from keras import layers
from keras import Model
from keras.optimizers import RMSprop
import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras import initializers
from keras.layers import Input
from keras.layers import Dense
from keras.models import Model
from keras.utils import plot_model

# Let's be rebels and ignore warnings for now
import warnings
warnings.filterwarnings('ignore')

# To import dataset files from google drive into google colab
from google.colab import drive
drive.mount('/content/drive')




Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/44/57/fcef41c248701ee62e8325026b90c432adea35555cbc870aff9cfba23727/category_encoders-2.2.2-py2.py3-none-any.whl (80kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.2.2




Mounted at /content/drive


# ***2) Data preprocessing***

In this section we import the data files used for training and testing the model. We then apply several operations to the data, such as filling nulls, removing outliers, and encoding (label, hashing).

We used the UNSW-NB15 dataset for training and testing the model. You can find out more about it and download it from the following link: https://www.kaggle.com/mrwellsdavid/unsw-nb15
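
As a rough illustration of the null-filling and outlier-removal steps mentioned above, here is a minimal sketch on a toy frame. The column names come from the dataset, but the toy values, the fill value of 0, and the 1.5 × IQR rule are assumptions made for illustration, not the project's confirmed choices.

```python
import numpy as np
import pandas as pd

# Toy frame with UNSW-NB15 column names; the values are made up.
toy = pd.DataFrame({
    "dur": [0.01, 0.02, 0.03, 0.02, 50.0],
    "sbytes": [100, 120, 110, 115, 105],
    "is_ftp_login": [0.0, np.nan, 1.0, np.nan, 0.0],
})

# Fill missing flags with 0 (assumption: a missing flag means "no FTP login").
toy["is_ftp_login"] = toy["is_ftp_login"].fillna(0)

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# The 50-second flow is far outside the interquartile range and is dropped.
cleaned = drop_iqr_outliers(toy, "dur")
print(cleaned)
```

On heavily skewed traffic features a hard IQR cut can remove genuine attack flows, so clipping or a log transform may be a gentler alternative.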



In [None]:
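# Reading data files
# Note: the raw UNSW-NB15 part files appear to have no header row, so read_csv
# treats the first record of each file as column names, and that record is lost
# once the names are assigned below. Passing header=None would keep it.
df1 = pd.read_csv('/content/drive/MyDrive/Dataset/UNSW-NB15_1.csv')
df2 = pd.read_csv('/content/drive/MyDrive/Dataset/UNSW-NB15_2.csv')
df3 = pd.read_csv('/content/drive/MyDrive/Dataset/UNSW-NB15_3.csv')
df4 = pd.read_csv('/content/drive/MyDrive/Dataset/UNSW-NB15_4.csv')

# Adding column names to the data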
columns=['srcip','sport','dstip','dsport','proto','state','dur','sbytes','dbytes','sttl', 'dttl', 'sloss', 'dloss','service', 'sload','dload','spkts','dpkts','swin','dwin','stcpb','dtcpb','smeansz','dmeansz','trans_depth','res_bdy_len', 'sjit','djit', 'stime','ltime','sintpkt','dintpkt','tcprtt','synack','ackdat','is_sm_ips_ports','ct_state_ttl','ct_flw_http_mthd' ,'is_ftp_login','ct_ftp_cmd','ct_srv_src','ct_srv_dst','ct_dst_ltm','ct_src_ ltm','ct_src_dport_ltm','ct_dst_sport_ltm','ct_dst_src_ltm','attack_cat','label']
df1.columns = columns
df2.columns = columns
df3.columns = columns
df4.columns = columns

# Combining all files into one DataFrame called (data), with a fresh 0..n-1 index
dfs = [df1, df2, df3, df4]
data = pd.concat(dfs, ignore_index=True)

data.head()



Unnamed: 0,srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,service,sload,dload,spkts,dpkts,swin,dwin,stcpb,dtcpb,smeansz,dmeansz,trans_depth,res_bdy_len,sjit,djit,stime,ltime,sintpkt,dintpkt,tcprtt,synack,ackdat,is_sm_ips_ports,ct_state_ttl,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,attack_cat,label
0,59.166.0.0,33661,149.171.126.9,1024,udp,CON,0.036133,528,304,31,29,0,0,-,87676.08594,50480.17188,4,4,0,0,0,0,132,76,0,0,9.89101,10.682733,1421927414,1421927414,7.005,7.564333,0.0,0.0,0.0,0,0,0.0,0.0,0,2,4,2,3,1,1,2,,0
1,59.166.0.6,1464,149.171.126.7,53,udp,CON,0.001119,146,178,31,29,0,0,dns,521894.5313,636282.375,2,2,0,0,0,0,73,89,0,0,0.0,0.0,1421927414,1421927414,0.017,0.013,0.0,0.0,0.0,0,0,0.0,0.0,0,12,8,1,2,2,1,1,,0
2,59.166.0.5,3593,149.171.126.5,53,udp,CON,0.001209,132,164,31,29,0,0,dns,436724.5625,542597.1875,2,2,0,0,0,0,66,82,0,0,0.0,0.0,1421927414,1421927414,0.043,0.014,0.0,0.0,0.0,0,0,0.0,0.0,0,6,9,1,1,1,1,1,,0
3,59.166.0.3,49664,149.171.126.0,53,udp,CON,0.001169,146,178,31,29,0,0,dns,499572.25,609067.5625,2,2,0,0,0,0,73,89,0,0,0.0,0.0,1421927414,1421927414,0.005,0.003,0.0,0.0,0.0,0,0,0.0,0.0,0,7,9,1,1,1,1,1,,0
4,59.166.0.0,32119,149.171.126.9,111,udp,CON,0.078339,568,312,31,29,0,0,-,43503.23438,23896.14258,4,4,0,0,0,0,142,78,0,0,29.682221,34.37034,1421927414,1421927414,21.003,24.315,0.0,0.0,0.0,0,0,0.0,0.0,0,2,4,2,3,1,1,2,,0
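
One detail worth checking when loading header-less CSV files is how `pd.read_csv` treats the first line. The sketch below uses a tiny in-memory CSV (made-up values, three hypothetical columns) to show the difference between the default behaviour and `header=None`; it assumes the raw part files have no header row, which the manual column assignment above suggests.

```python
import io
import pandas as pd

# Two records and no header line (made-up values).
raw = "59.166.0.0,1390,udp\n59.166.0.6,1464,udp\n"
names = ["srcip", "sport", "proto"]

# Default: the first record is consumed as the header and disappears
# when the real column names are assigned afterwards.
lossy = pd.read_csv(io.StringIO(raw))
lossy.columns = names

# header=None keeps every record; names= labels the columns in one step.
full = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(len(lossy), len(full))  # 1 2
```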


In [None]:
# Finding missing values in the dataset
data.isnull().sum().sort_values(ascending=False)[0:5]

attack_cat          2218760
is_ftp_login        1429877
ct_flw_http_mthd    1348143
label                     0
sloss                     0
dtype: int64

In [None]:
# Showing the attack samples in the dataset (rows where label == 1)
data.query('label == 1')

Unnamed: 0,srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,service,sload,dload,spkts,dpkts,swin,dwin,stcpb,dtcpb,smeansz,dmeansz,trans_depth,res_bdy_len,sjit,djit,stime,ltime,sintpkt,dintpkt,tcprtt,synack,ackdat,is_sm_ips_ports,ct_state_ttl,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,attack_cat,label
19,175.45.176.3,21223,149.171.126.18,32780,udp,INT,0.000021,728,0,254,0,0,0,-,1.386667e+08,0.000000,2,0,0,0,0,0,364,0,0,0,0.000000,0.000000,1421927415,1421927415,0.021000,0.000000,0.000000,0.000000,0.000000,0,2,0.0,0.0,0,1,1,1,1,1,1,1,Exploits,1
20,175.45.176.2,23357,149.171.126.16,80,tcp,FIN,0.240139,918,25552,62,252,2,10,http,2.805042e+04,815794.187500,12,24,255,255,1708297952,1939490744,77,1065,1,12026,1170.481668,1144.383360,1421927416,1421927416,21.830818,9.570304,0.051475,0.006528,0.044947,0,1,1.0,0.0,0,3,2,2,1,1,1,1,Exploits,1
21,175.45.176.0,13284,149.171.126.16,80,tcp,FIN,2.390390,1362,268,254,252,6,1,http,4.233619e+03,749.668518,14,6,255,255,3897219059,2466816006,97,45,1,0,18786.711400,941.724938,1421927414,1421927416,183.579303,474.259406,0.066088,0.017959,0.048129,0,1,1.0,0.0,0,5,2,2,1,1,1,1,Reconnaissance,1
38,175.45.176.2,13792,149.171.126.16,5555,tcp,FIN,0.175190,8168,268,254,252,4,1,-,3.463668e+05,10228.894530,14,6,255,255,2505143795,3592239707,583,45,0,0,774.788316,47.765387,1421927417,1421927417,11.837692,33.287000,0.054878,0.008744,0.046134,0,1,0.0,0.0,0,1,1,1,1,1,1,1,Exploits,1
39,175.45.176.2,26939,149.171.126.10,80,tcp,FIN,0.190600,844,268,254,252,2,1,http,3.189927e+04,9401.888672,10,6,255,255,3006332195,1452987536,84,45,1,0,996.632407,59.532129,1421927418,1421927418,18.573778,36.845602,0.050675,0.006354,0.044321,0,1,1.0,0.0,0,3,1,1,1,1,1,1,Exploits,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2540023,175.45.176.0,47439,149.171.126.10,53,udp,INT,0.000001,114,0,254,0,0,0,dns,4.560000e+08,0.000000,2,0,0,0,0,0,57,0,0,0,0.000000,0.000000,1424262068,1424262068,0.001000,0.000000,0.000000,0.000000,0.000000,0,2,,,,15,15,15,15,15,15,15,Generic,1
2540024,175.45.176.0,17293,149.171.126.17,110,tcp,CON,0.942984,574,676,62,252,5,6,-,4.470914e+03,5259.898438,12,12,255,255,3026824982,3748412468,48,56,0,0,3903.523582,95.650531,1424262068,1424262069,79.714089,80.827180,0.139446,0.053884,0.085562,0,3,,,,2,1,2,4,2,2,2,Exploits,1
2540025,175.45.176.0,33654,149.171.126.12,80,tcp,CON,2.579405,269883,1300,62,252,103,1,-,8.330169e+05,3898.573486,208,30,255,255,183420721,3548597985,1298,43,1,0,1701.614470,138.857703,1424262066,1424262069,12.239136,86.655617,0.159923,0.066388,0.093535,0,3,2.0,,,2,1,2,4,2,2,2,DoS,1
2540026,175.45.176.0,33654,149.171.126.12,80,tcp,CON,2.579405,269883,1300,62,252,103,1,http,8.330169e+05,3898.573486,208,30,255,255,183420721,3548597985,1298,43,1,0,1701.614470,138.857703,1424262066,1424262069,12.239136,86.655617,0.159923,0.066388,0.093535,0,3,2.0,,,1,1,2,4,2,2,2,DoS,1


In [None]:
# Attack types in the datasets
data.attack_cat.unique()

array([nan, 'Exploits', 'Reconnaissance', 'DoS', 'Generic', 'Shellcode',
       ' Fuzzers', 'Worms', 'Backdoors', 'Analysis', ' Reconnaissance ',
       'Backdoor', ' Fuzzers ', ' Shellcode '], dtype=object)
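
The output above contains the same category under several spellings (`' Fuzzers'` and `' Fuzzers '`, `'Backdoors'` and `'Backdoor'`). This project drops `attack_cat` later, but if the column were used as a multi-class target, the labels would need normalizing first. A minimal sketch on a sample of the values shown above follows; treating NaN as normal traffic is an assumption based on the label-0 rows having no category.

```python
import numpy as np
import pandas as pd

# Sample of the raw labels shown in the output above.
attack_cat = pd.Series([np.nan, "Exploits", " Fuzzers", " Fuzzers ",
                        "Backdoors", "Backdoor", " Shellcode ", "Shellcode"])

cleaned = (
    attack_cat
    .fillna("Normal")                     # assumption: NaN means benign traffic
    .str.strip()                          # ' Fuzzers ' -> 'Fuzzers'
    .replace({"Backdoors": "Backdoor"})   # merge the two spellings
)

print(sorted(cleaned.unique()))
# ['Backdoor', 'Exploits', 'Fuzzers', 'Normal', 'Shellcode']
```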

In [None]:
# Dropping unused features
# We only used about the first 20 features
# We dropped port numbers and IPs as they would cause overfitting, and also dropped some columns that are mostly NaN

#data = data.drop(['srcip','sport','dstip','dsport','is_ftp_login','ct_flw_http_mthd','ct_dst_sport_ltm','ct_src_dport_ltm','ct_src_ ltm','ct_dst_ltm','ct_srv_dst','ct_srv_src','ct_ftp_cmd','ct_state_ttl','attack_cat','ct_dst_src_ltm','is_sm_ips_ports','ackdat','synack','tcprtt','dintpkt','sintpkt','res_bdy_len','trans_depth','state','sttl','dttl'], axis=1)

# List of features that the implementation subteam succeeded in extracting, plus the label
imp_features= ['proto','dur','sbytes','dbytes','sloss','dloss','service','sload','dload','spkts','dpkts','swin','dwin','stcpb','dtcpb','smeansz','dmeansz','sjit','djit','stime','ltime','label']
data = data.loc[:,imp_features]

In [None]:
# Looking at the different values in the service column in the dataset
data.service.value_counts()

-           1246395
dns          781667
http         206273
ftp-data     125783
smtp          81644
ftp           49090
ssh           47160
pop3           1533
dhcp            172
ssl             142
snmp            113
radius           40
irc              31
Name: service, dtype: int64

In [None]:
# Replacing '-' values in service with 'notservice'
# (assigning the result back avoids relying on inplace changes through attribute access)
data['service'] = data['service'].replace('-', 'notservice')

In [None]:
# Exploring protocols values in the dataset
data.proto.unique()

array(['udp', 'arp', 'tcp', 'ospf', 'icmp', 'igmp', 'sctp', 'udt', 'sep',
       'sun-nd', 'swipe', 'mobile', 'pim', 'rtp', 'ipnip', 'ip', 'ggp',
       'st2', 'egp', 'cbt', 'emcon', 'nvp', 'igp', 'xnet', 'argus',
       'bbn-rcc', 'chaos', 'pup', 'hmp', 'mux', 'dcn', 'prm', 'trunk-1',
       'xns-idp', 'trunk-2', 'leaf-1', 'leaf-2', 'irtp', 'rdp', 'iso-tp4',
       'netblt', 'mfe-nsp', 'merit-inp', '3pc', 'xtp', 'idpr', 'tp++',
       'ddp', 'idpr-cmtp', 'ipv6', 'il', 'idrp', 'ipv6-frag', 'sdrp',
       'ipv6-route', 'gre', 'rsvp', 'mhrp', 'bna', 'esp', 'i-nlsp',
       'narp', 'ipv6-no', 'tlsp', 'skip', 'ipv6-opts', 'any', 'cftp',
       'sat-expak', 'kryptolan', 'rvd', 'ippc', 'sat-mon', 'ipcv', 'visa',
       'cpnx', 'cphb', 'wsn', 'pvp', 'br-sat-mon', 'wb-mon', 'wb-expak',
       'iso-ip', 'secure-vmtp', 'vmtp', 'vines', 'ttp', 'nsfnet-igp',
       'dgp', 'tcf', 'eigrp', 'sprite-rpc', 'larp', 'mtp', 'ax.25',
       'ipip', 'micp', 'aes-sp3-d', 'encap', 'etherip', 'pri-enc', 'gmtp'

In [None]:
# Applying a label encoder to the dataset that will be used in the supervised model
Supervised_data = data.copy()

# Collect the categorical (object-typed) columns
col = [column for column in Supervised_data.columns
       if Supervised_data[column].dtype == object]

# Encode each categorical column as integers
for column in col:
    le = LabelEncoder()
    print(column)
    Supervised_data[column] = le.fit_transform(Supervised_data[column])

proto
service
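
The preprocessing section also mentions hashing as an encoding option. The sketch below illustrates the general idea of the hashing trick using only the standard library: each category string is mapped to one of a fixed number of buckets, so previously unseen protocols still get a valid encoding. It is not the `category_encoders` implementation, and the helper names and the bucket count of 8 are hypothetical.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 8) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(value: str, n_buckets: int = 8) -> list:
    """One-hot vector over the hash buckets for a single category value."""
    vec = [0] * n_buckets
    vec[hash_bucket(value, n_buckets)] = 1
    return vec

# Known and never-seen categories are encoded the same way.
for proto in ["tcp", "udp", "ospf", "never-seen-before"]:
    print(proto, hash_encode(proto))
```

The trade-off is that two different categories can collide in the same bucket, whereas a label encoder keeps them distinct but cannot encode values it was not fitted on.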


# ***3) The Models*** 
Here we have two models: a supervised machine learning model and an unsupervised deep learning model.

The unsupervised model runs first, before the supervised one, to check whether the flow (data) is suspicious. If the flow looks normal, it is allowed to pass without further checks and it does not invoke any other investigation (the supervised model).

If the flow looks suspicious, the supervised model is invoked to investigate the flow, determine whether it really is an attack, and inform the controller so that it can take the needed action.
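
The two-stage decision described above can be sketched as a small function. The detector callables, their thresholds, and the `'allow'` / `'block'` action names are hypothetical stand-ins; in the real system the first stage would be the unsupervised model and the second the trained classifier.

```python
from typing import Callable, Sequence

Flow = Sequence[float]

def inspect_flow(
    flow: Flow,
    looks_suspicious: Callable[[Flow], bool],
    is_attack: Callable[[Flow], bool],
) -> str:
    """Return the action for one flow: 'allow' or 'block'."""
    # Stage 1: unsupervised screen. Normal-looking flows pass immediately.
    if not looks_suspicious(flow):
        return "allow"
    # Stage 2: supervised classifier, invoked only for suspicious flows.
    return "block" if is_attack(flow) else "allow"

# Stand-in detectors (hypothetical thresholds, for demonstration only).
screen = lambda flow: max(flow) > 0.8
classifier = lambda flow: sum(flow) > 2.0

print(inspect_flow([0.1, 0.2, 0.3], screen, classifier))  # allow
print(inspect_flow([0.9, 0.1, 0.1], screen, classifier))  # allow
print(inspect_flow([0.9, 0.9, 0.9], screen, classifier))  # block
```

Running the supervised stage only on flagged flows keeps the common path (normal traffic) cheap.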

***A) Unsupervised deep learning model***

Here we built an unsupervised deep learning model to perform an initial investigation of the data.

The model consists of () hidden layers and each layer has () neurons. We used 'tanh' as the activation function and the 'Adam' optimizer, which handles sparse gradients well, along with other algorithms for a more efficient model.
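
The text does not state the architecture, so the following is only a sketch of one common way to build such an unsupervised screen: an autoencoder with a `tanh` hidden layer that flags flows whose reconstruction error is unusually high. Everything specific here is an assumption: the autoencoder design itself, the single hidden layer of 3 units, the synthetic data, the 99th-percentile threshold, and the use of plain gradient descent in NumPy instead of Adam to keep the example dependency-free.

```python
import numpy as np

def train_autoencoder(X, hidden=3, epochs=300, lr=0.05, seed=0):
    """Fit a one-hidden-layer tanh autoencoder with plain gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1, b1 = rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (hidden, d)), np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)               # encoder
        err = (H @ W2 + b2) - X                # reconstruction minus input
        g_out = 2 * err / n                    # gradient of the squared error, averaged over samples
        g_hid = (g_out @ W2.T) * (1 - H ** 2)  # backpropagate through tanh
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return W1, b1, W2, b2

def reconstruction_error(X, params):
    """Mean squared reconstruction error for each row of X."""
    W1, b1, W2, b2 = params
    R = np.tanh(X @ W1 + b1) @ W2 + b2
    return ((R - X) ** 2).mean(axis=1)

# Synthetic "normal" flows: six scaled features clustered around 0.5.
rng = np.random.default_rng(1)
normal = 0.5 + 0.05 * rng.standard_normal((200, 6))

params = train_autoencoder(normal)
threshold = np.percentile(reconstruction_error(normal, params), 99)

# A flow far from anything seen in training should exceed the threshold.
odd_flow = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0]])
is_suspicious = reconstruction_error(odd_flow, params)[0] > threshold
print(is_suspicious)
```

Because the model is trained only on normal traffic, it needs no attack labels; the threshold controls how many flows get passed on for further inspection.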

***B) Supervised machine learning model***

In this part we used a supervised machine learning model to check the suspicious data (flow) passed on by the unsupervised DL model, and to inform the controller of the proper action.

Here we used the random forest algorithm to classify the data. We did not select random forest arbitrarily: we tried many different algorithms, such as gradient boosting, k-NN, support vector machines, and even a deep learning model. We then compared their efficiency and run time, because the model needs to detect attacks quickly and accurately. In the end, we chose random forest because it gave the best results.
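
A comparison of this kind can be sketched as a loop that records both accuracy and prediction time for each candidate. The data below is synthetic (`make_classification`), and the hyperparameters are scikit-learn defaults or arbitrary small values, so the numbers it prints say nothing about the project's actual results.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; the real comparison would use the encoded UNSW-NB15 frame.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boost": GradientBoostingClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "svm": SVC(),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    start = time.perf_counter()
    accuracy = clf.score(X_te, y_te)  # score() predicts on the test set internally
    results[name] = (accuracy, time.perf_counter() - start)

# Print the candidates from most to least accurate.
for name, (accuracy, seconds) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:15s} accuracy={accuracy:.3f} predict_time={seconds:.4f}s")
```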


In [None]:
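# Split into 80% training and 20% test data (the label is the last column)
X_train, X_test, y_train, y_test = train_test_split(Supervised_data.iloc[:,:-1],Supervised_data.iloc[:,-1], test_size=0.2, random_state=42,shuffle=True)

# Note: this cell fits a single decision tree, not the random forest described in the text above
model = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

# Normalization: fit the min-max scaler on the training data only,
# then apply the same scaling to both the training and the test data
scaler = MinMaxScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)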


model.fit(X_train,y_train)

print("Train Accuracy : ",model.score(X_train, y_train) *100,"%")
print("Test Accuracy : ",model.score(X_test, y_test) *100,"%")

# Measuring test time
start = time.time()
y_pred = model.predict(X_test) 
finish = time.time()
print ('Test time >>> ', finish - start , 'seconds')
cm = confusion_matrix(y_test, y_pred)
print(cm)



Train Accuracy :  99.99576778735 %
Test Accuracy :  99.56122824595627 %
Test time >>>  0.06930303573608398 seconds
[[442553   1116]
 [  1113  63227]]
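
Accuracy alone can hide how the classifier behaves on the minority (attack) class, so it helps to derive precision, recall, and the false-positive rate from the confusion matrix printed above. The sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels) with class 0 as normal traffic and class 1 as attack.

```python
# Confusion matrix values copied from the output above.
tn, fp = 442553, 1116   # true normal flows: correctly passed / falsely flagged
fn, tp = 1113, 63227    # true attack flows: missed / correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # share of raised alarms that were real attacks
recall = tp / (tp + fn)               # share of real attacks that were caught
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of normal flows wrongly flagged

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} fpr={false_positive_rate:.4f}")
```

With these counts, precision and recall both come out at roughly 98.3% and the false-positive rate at roughly 0.25%, so about 1.7% of attack flows in the test split were missed.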


# ***Results & Conclusion***

