### Problem Statement


* Modern computer networks are constantly exposed to a wide range of cyber-attacks such as denial-of-service, probing, and privilege escalation attacks. Detecting malicious network traffic in real time is a critical requirement for ensuring the security and reliability of information systems.


* The objective of this project is to build and evaluate multiple machine learning classification models that can accurately distinguish between normal network traffic and malicious traffic using the UNSW-NB15 dataset. By comparing the performance of traditional machine learning models and ensemble techniques using standard evaluation metrics, this project aims to identify the most effective model for network intrusion detection.

In [14]:
import pandas as pd 
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('../data/UNSW_NB15_training-set.csv')

In [10]:
df.shape

(175341, 36)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175341 entries, 0 to 175340
Data columns (total 36 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   dur                175341 non-null  float64
 1   proto              175341 non-null  object 
 2   service            175341 non-null  object 
 3   state              175341 non-null  object 
 4   spkts              175341 non-null  int64  
 5   dpkts              175341 non-null  int64  
 6   sbytes             175341 non-null  int64  
 7   dbytes             175341 non-null  int64  
 8   rate               175341 non-null  float64
 9   sload              175341 non-null  float64
 10  dload              175341 non-null  float64
 11  sloss              175341 non-null  int64  
 12  dloss              175341 non-null  int64  
 13  sinpkt             175341 non-null  float64
 14  dinpkt             175341 non-null  float64
 15  sjit               175341 non-null  float64
 16  dj

In [5]:
df.columns

Index(['dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
       'dbytes', 'rate', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt',
       'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt',
       'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
       'response_body_len', 'ct_src_dport_ltm', 'ct_dst_sport_ltm',
       'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'is_sm_ips_ports',
       'attack_cat', 'label'],
      dtype='object')

In [6]:
df.iloc[:10,:18]

Unnamed: 0,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,sload,dload,sloss,dloss,sinpkt,dinpkt,sjit,djit,swin
0,0.121478,tcp,-,FIN,6,4,258,172,74.08749,14158.942,8495.365,0,0,24.2956,8.375,30.177547,11.830604,255
1,0.649902,tcp,-,FIN,14,38,734,42014,78.47337,8395.112,503571.3,2,17,49.915,15.432865,61.426933,1387.7783,255
2,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,1572.2719,60929.23,1,6,231.87556,102.737206,17179.586,11420.926,255
3,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,2740.179,3358.622,1,3,152.87654,90.235725,259.08017,4991.7847,255
4,0.449454,tcp,-,FIN,10,6,534,268,33.373825,8561.499,3987.0598,2,1,47.75033,75.6596,2415.8376,115.807,255
5,0.380537,tcp,-,FIN,10,6,534,268,39.41798,10112.025,4709.135,2,1,39.92878,52.241,2223.7302,82.5505,255
6,0.637109,tcp,-,FIN,10,8,534,354,26.683033,6039.783,3892.5837,2,1,68.26778,81.13771,4286.8286,119.42272,255
7,0.521584,tcp,-,FIN,10,8,534,354,32.593025,7377.5273,4754.747,2,1,55.794,66.05414,3770.5808,118.96263,255
8,0.542905,tcp,-,FIN,10,8,534,354,31.31303,7087.7964,4568.0186,2,1,60.210888,68.109,4060.6255,106.61155,255
9,0.258687,tcp,-,FIN,10,6,534,268,57.985134,14875.12,6927.291,2,1,27.505112,39.1068,1413.6864,57.200394,255


In [7]:
df.iloc[:10,18:37]

Unnamed: 0,stcpb,dtcpb,dwin,tcprtt,synack,ackdat,smean,dmean,trans_depth,response_body_len,ct_src_dport_ltm,ct_dst_sport_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,is_sm_ips_ports,attack_cat,label
0,621772692,2202533631,255,0.0,0.0,0.0,43,43,0,0,1,1,0,0,0,0,Normal,0
1,1417884146,3077387971,255,0.0,0.0,0.0,52,1106,0,0,1,1,0,0,0,0,Normal,0
2,2116150707,2963114973,255,0.111897,0.061458,0.050439,46,824,0,0,1,1,0,0,0,0,Normal,0
3,1107119177,1047442890,255,0.0,0.0,0.0,52,64,0,0,1,1,1,1,0,0,Normal,0
4,2436137549,1977154190,255,0.128381,0.071147,0.057234,53,45,0,0,2,1,0,0,0,0,Normal,0
5,3984155503,1796040391,255,0.172934,0.119331,0.053603,53,45,0,0,2,1,0,0,0,0,Normal,0
6,1787309226,1767180493,255,0.143337,0.069136,0.074201,53,44,0,0,1,1,0,0,0,0,Normal,0
7,205985702,316006300,255,0.116615,0.059195,0.05742,53,44,0,0,3,1,0,0,0,0,Normal,0
8,884094874,3410317203,255,0.118584,0.066133,0.052451,53,44,0,0,3,1,0,0,0,0,Normal,0
9,3368447996,584859215,255,0.087934,0.063116,0.024818,53,45,0,0,3,1,0,0,0,0,Normal,0


The `label` column is used as the target variable: 0 denotes normal network traffic and 1 denotes malicious traffic. The `attack_cat` column, which names the attack category, is excluded so that the task remains a binary classification problem.
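As a quick sanity check on this setup, the sketch below tests whether a binary label agrees with an attack-category column. It runs on a made-up four-row frame rather than the real data. The names `toy`, `derived`, and `consistent` are illustrative, and the assumption that every non-`Normal` category maps to label 1 is inferred from the preview rows above, not verified on the full dataset.

```python
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration.
toy = pd.DataFrame({
    "attack_cat": ["Normal", "DoS", "Normal", "Exploits"],
    "label": [0, 1, 0, 1],
})

# Assumption: any category other than "Normal" counts as malicious (label 1).
derived = (toy["attack_cat"] != "Normal").astype(int)

# True when the derived binary label matches the provided one on every row.
consistent = bool((derived == toy["label"]).all())
print(consistent)
```

If the same check passed on the real `df`, dropping `attack_cat` would lose no information needed for the binary target.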

In [18]:
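# Draw a 20,000-row sample, stratified on the label to preserve the class ratio.
sampled_df, _ = train_test_split(df, train_size=20000, stratify=df['label'], random_state=42)

# Class counts and shape of the sample.
sampled_df["label"].value_counts(), sampled_df.shape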

(label
 1    13612
 0     6388
 Name: count, dtype: int64,
 (20000, 36))
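The sample above is about 68% malicious (13,612 of 20,000 rows). Because the split is stratified on `label`, that proportion should closely track the proportion in the full training set. The sketch below illustrates this on a synthetic frame with a similar imbalance. The frame `toy` and its 680/320 split are assumptions chosen for illustration, not values read from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 680 rows of label 1 and 320 rows of label 0 (illustrative only).
toy = pd.DataFrame({
    "feature": range(1000),
    "label": [1] * 680 + [0] * 320,
})

# Same call pattern as the notebook: keep the first split, discard the remainder.
toy_sample, _ = train_test_split(
    toy, train_size=200, stratify=toy["label"], random_state=42
)

# Fraction of label-1 rows before and after sampling.
full_ratio = toy["label"].mean()
sample_ratio = toy_sample["label"].mean()
print(f"full: {full_ratio:.3f}, sample: {sample_ratio:.3f}")
```

Without `stratify`, the sampled ratio would vary with the random seed, which matters more as the sample gets smaller.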

In [28]:
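target_column = 'label'
# Columns excluded from modelling. Note: 'id' does not appear in df.columns above,
# and this list is not applied anywhere in the cells shown.
drop_columns = ['id', 'attack_cat']
num_features = ["dur", "sbytes", "dbytes", "spkts", "dpkts", "rate", "sload", "dload", "sloss", "dloss", "sinpkt", "dinpkt", "sjit", "djit", "swin", "dwin", "tcprtt", "synack", "ackdat"]
cat_features = ["proto", "service", "state"]
features = num_features + cat_features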

In [30]:
X=sampled_df[features]
y=sampled_df[target_column]
X.shape,y.shape

((20000, 22), (20000,))

Feature selection was performed by choosing a subset of relevant numerical and categorical attributes commonly used in network traffic analysis.

A total of 22 features (19 numerical and 3 categorical) were selected, while non-informative attributes and the multi-class `attack_cat` attribute were excluded to maintain a binary classification setup.
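Splitting the features into numerical and categorical groups suggests that each group will be preprocessed differently before modelling. The cells shown here do not specify that step, so the following is only a minimal sketch of one common approach: scaling the numerical columns and one-hot encoding the categorical ones with scikit-learn's `ColumnTransformer`. The frame `toy_X`, the lists `toy_num` and `toy_cat`, and the choice of `StandardScaler` and `OneHotEncoder` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame using a few of the selected column names (values are made up).
toy_X = pd.DataFrame({
    "dur": [0.12, 0.65, 1.62, 0.45],
    "sbytes": [258, 734, 364, 534],
    "proto": ["tcp", "tcp", "udp", "tcp"],
    "service": ["-", "ftp", "dns", "-"],
})
toy_num = ["dur", "sbytes"]
toy_cat = ["proto", "service"]

# Scale numerical columns; one-hot encode categorical columns.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), toy_num),
    ("cat", OneHotEncoder(handle_unknown="ignore"), toy_cat),
])

X_encoded = preprocessor.fit_transform(toy_X)
print(X_encoded.shape)
```

On the real data, the same pattern would take `num_features` and `cat_features` in place of the toy lists, and fitting the transformer on training rows only would avoid leaking test-set statistics into the scaler.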