# Intrusion Detection System

Implementation of an intrusion detection system (IDS) with machine learning techniques, using the public UNSW-NB15 dataset.

In [1]:
import pandas as pd
import numpy as np

# Dataset exploration and processing
The UNSW-NB15 dataset was created with tools that generate a hybrid of realistic normal traffic and synthetic attack behaviors. The resulting records are distributed over four ".csv" files.

Each instance of the dataset is described by 49 features, listed in the features file (UNSW-NB15_features.csv).
UNSW-NB15 contains nine categories of attack:
- `Fuzzers`: Massive input of data
- `Analysis`: Different types of intrusion into web apps
- `Backdoor`: Unauthorized remote access
- `DoS`: Denial of service
- `Exploit`: Activities that use bugs or vulnerabilities to attack
- `Generic`: Attacks against block ciphers
- `Reconnaissance`: Probing activities
- `Shellcode`: Code used to exploit software vulnerabilities
- `Worm`: A type of malware

In [2]:
# Feature descriptions, used as a reference for column names and types
# (read_csv already returns a DataFrame, so no extra conversion is needed)
features = pd.read_csv("UNSW-NB15/UNSW-NB15 - CSV Files/UNSW-NB15_features.csv")

In [3]:
# Normalize a string: trim surrounding whitespace and lowercase it
def makeLowerString(x):
    return x.strip().lower()

features['Type'] = features['Type'].apply(makeLowerString)
features['Name'] = features['Name'].apply(makeLowerString)
features

Unnamed: 0,No.,Name,Type,Description
0,1,srcip,nominal,Source IP address
1,2,sport,integer,Source port number
2,3,dstip,nominal,Destination IP address
3,4,dsport,integer,Destination port number
4,5,proto,nominal,Transaction protocol
5,6,state,nominal,Indicates to the state and its dependent proto...
6,7,dur,float,Record total duration
7,8,sbytes,integer,Source to destination transaction bytes
8,9,dbytes,integer,Destination to source transaction bytes
9,10,sttl,integer,Source to destination time to live value


The dataset contains several types of features: nominal, numerical (integer or float) and binary.

In [4]:
# Retrieve the row indices of each feature type in the features table
nominal_idx = features[features.Type=="nominal"].index.tolist()
binary_idx = features[features.Type=="binary"].index.tolist()
integer_idx = features[features.Type=="integer"].index.tolist()
float_idx = features[features.Type=="float"].index.tolist()
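
The positional indices above are only valid while the dataframe keeps all 49 columns in their original order. As a hedged alternative, the sketch below groups feature names by declared type, so that the lookup still works after columns are dropped. The helper `columns_by_type` and the toy table are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the features table (the real one has 49 rows).
toy_features = pd.DataFrame({
    "Name": ["srcip", "sport", "proto", "dur", "label"],
    "Type": ["nominal", "integer", "nominal", "float", "binary"],
})

def columns_by_type(feature_table):
    """Map each declared type to the list of feature names of that type."""
    grouped = feature_table.groupby("Type")["Name"].apply(list)
    return grouped.to_dict()

type_to_columns = columns_by_type(toy_features)
print(type_to_columns["nominal"])
```

With a mapping like this, a column list can be filtered against `df.columns` at any point in the pipeline instead of relying on positions.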

In [5]:
# Load the four parts of the dataset (the files have no header row)
df1 = pd.read_csv("UNSW-NB15/UNSW-NB15 - CSV Files/UNSW-NB15_1.csv",header=None)
df2 = pd.read_csv("UNSW-NB15/UNSW-NB15 - CSV Files/UNSW-NB15_2.csv",header=None)
df3 = pd.read_csv("UNSW-NB15/UNSW-NB15 - CSV Files/UNSW-NB15_3.csv",header=None)
df4 = pd.read_csv("UNSW-NB15/UNSW-NB15 - CSV Files/UNSW-NB15_4.csv",header=None)



In [6]:
df = pd.concat([df1,df2,df3,df4],ignore_index=True)
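
The load-and-concatenate steps can also be written as one small helper. This is a minimal sketch, assuming headerless CSV parts that share the same columns; the function name `load_parts` and the temporary demo files are hypothetical. Passing `low_memory=False` asks pandas to infer each column's type from the whole file rather than chunk by chunk.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_parts(paths, column_names):
    """Read headerless CSV parts and stack them into one DataFrame."""
    frames = [
        pd.read_csv(p, header=None, names=column_names, low_memory=False)
        for p in paths
    ]
    return pd.concat(frames, ignore_index=True)

# Demo on two tiny temporary files standing in for the real parts.
with tempfile.TemporaryDirectory() as tmp:
    part_paths = []
    for i, rows in enumerate(["1,udp\n2,tcp\n", "3,udp\n"]):
        path = Path(tmp) / f"part_{i}.csv"
        path.write_text(rows)
        part_paths.append(path)
    combined = load_parts(part_paths, ["sport", "proto"])

print(combined.shape)
```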

In [7]:
# The complete dataset, with column names taken from the features table
df.columns = features.Name
df.shape

(2540047, 49)

In [8]:
df.columns

Index(['srcip', 'sport', 'dstip', 'dsport', 'proto', 'state', 'dur', 'sbytes',
       'dbytes', 'sttl', 'dttl', 'sloss', 'dloss', 'service', 'sload', 'dload',
       'spkts', 'dpkts', 'swin', 'dwin', 'stcpb', 'dtcpb', 'smeansz',
       'dmeansz', 'trans_depth', 'res_bdy_len', 'sjit', 'djit', 'stime',
       'ltime', 'sintpkt', 'dintpkt', 'tcprtt', 'synack', 'ackdat',
       'is_sm_ips_ports', 'ct_state_ttl', 'ct_flw_http_mthd', 'is_ftp_login',
       'ct_ftp_cmd', 'ct_srv_src', 'ct_srv_dst', 'ct_dst_ltm', 'ct_src_ ltm',
       'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'attack_cat',
       'label'],
      dtype='object', name='Name')

## Preprocessing of data within the dataset

In [9]:
df.iloc[0]

Name
srcip                  59.166.0.0
sport                        1390
dstip               149.171.126.6
dsport                         53
proto                         udp
state                         CON
dur                      0.001055
sbytes                        132
dbytes                        164
sttl                           31
dttl                           29
sloss                           0
dloss                           0
service                       dns
sload                 500473.9375
dload                 621800.9375
spkts                           2
dpkts                           2
swin                            0
dwin                            0
stcpb                           0
dtcpb                           0
smeansz                        66
dmeansz                        82
trans_depth                     0
res_bdy_len                     0
sjit                          0.0
djit                          0.0
stime                  1421927414
ltime    

In [10]:
df.head()

Name,srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,...,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,attack_cat,label
0,59.166.0.0,1390,149.171.126.6,53,udp,CON,0.001055,132,164,31,...,0,3,7,1,3,1,1,1,,0
1,59.166.0.0,33661,149.171.126.9,1024,udp,CON,0.036133,528,304,31,...,0,2,4,2,3,1,1,2,,0
2,59.166.0.6,1464,149.171.126.7,53,udp,CON,0.001119,146,178,31,...,0,12,8,1,2,2,1,1,,0
3,59.166.0.5,3593,149.171.126.5,53,udp,CON,0.001209,132,164,31,...,0,6,9,1,1,1,1,1,,0
4,59.166.0.3,49664,149.171.126.0,53,udp,CON,0.001169,146,178,31,...,0,7,9,1,1,1,1,1,,0


In [11]:
# Running this reveals inconsistencies in the attack categories: "Backdoor" and "Backdoors" both appear,
# and some labels are duplicated because they differ only in surrounding whitespace.
# There is also no "normal" category for genuine traffic.
df.attack_cat.value_counts()

Generic             215481
Exploits             44525
 Fuzzers             19195
DoS                  16353
 Reconnaissance      12228
 Fuzzers              5051
Analysis              2677
Backdoor              1795
Reconnaissance        1759
 Shellcode            1288
Backdoors              534
Shellcode              223
Worms                  174
Name: attack_cat, dtype: int64

In [12]:
# Check for null values
df.columns[df.isnull().any()].tolist()

['ct_flw_http_mthd', 'is_ftp_login', 'attack_cat']

In [13]:
df.attack_cat.isnull().sum()

2218764

All null values in the *attack_cat* feature correspond to the missing category *normal*.

In [14]:
# The "normal" label is missing, so assign it to the NaN values
df.attack_cat = df.attack_cat.fillna(value="normal")

# Normalize the labels in attack_cat: strip whitespace and lowercase
df['attack_cat'] = df['attack_cat'].apply(makeLowerString)
df.attack_cat.value_counts()


normal            2218764
generic            215481
exploits            44525
fuzzers             24246
dos                 16353
reconnaissance      13987
analysis             2677
backdoor             1795
shellcode            1511
backdoors             534
worms                 174
Name: attack_cat, dtype: int64

In [15]:
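# Merge the duplicated category; assign the result back instead of using inplace on a column selection
df['attack_cat'] = df['attack_cat'].replace("backdoors", "backdoor")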
df.attack_cat.value_counts()

normal            2218764
generic            215481
exploits            44525
fuzzers             24246
dos                 16353
reconnaissance      13987
analysis             2677
backdoor             2329
shellcode            1511
worms                 174
Name: attack_cat, dtype: int64
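
The label cleanup steps above (fill missing values with "normal", strip whitespace, lowercase, merge "backdoors" into "backdoor") can be collected into a single function. The sketch below is illustrative only: the helper name `normalize_attack_cat` and the toy series are assumptions.

```python
import numpy as np
import pandas as pd

def normalize_attack_cat(series):
    """Fill missing labels with 'normal', then trim, lowercase and merge duplicates."""
    cleaned = series.fillna("normal").str.strip().str.lower()
    return cleaned.replace({"backdoors": "backdoor"})

# Toy labels reproducing the kinds of noise seen in the raw column.
raw = pd.Series([" Fuzzers ", " Fuzzers", "Backdoors", "Backdoor", np.nan])
clean = normalize_attack_cat(raw)
print(clean.value_counts())
```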

Now let's handle the other two columns with null values:
- **ct_flw_http_mthd**: Number of flows that have methods such as GET and POST in the HTTP service.
- **is_ftp_login**: 1 if the FTP session is accessed with a user and password, otherwise 0.

In [16]:
# Check for null values
df.columns[df.isnull().any()].tolist()

['ct_flw_http_mthd', 'is_ftp_login']

In [17]:
# Assuming 0 for null values
df.ct_flw_http_mthd = df.ct_flw_http_mthd.fillna(value=0)
df.is_ftp_login = df.is_ftp_login.fillna(value=0)

In [18]:
# No null values remain
df.columns[df.isnull().any()].tolist()

[]
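
The checks above list which columns contain nulls, but not how many. A per-column count can help judge whether filling with 0 is a reasonable choice. The following is a minimal sketch on toy data; `null_counts` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def null_counts(frame):
    """Return the number of nulls per column, keeping only columns that have any."""
    counts = frame.isnull().sum()
    return counts[counts > 0]

toy = pd.DataFrame({
    "is_ftp_login": [1.0, np.nan, 0.0],
    "ct_flw_http_mthd": [np.nan, np.nan, 2.0],
    "label": [0, 1, 0],
})
missing = null_counts(toy)
print(missing)
```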

In [19]:
# Scratch cell (to delete): show the full feature descriptions
pd.set_option('display.max_colwidth', None)
features[features.Name=="ct_flw_http_mthd"]
features[features.Name=="is_ftp_login"]

Unnamed: 0,No.,Name,Type,Description
38,39,is_ftp_login,binary,If the ftp session is accessed by user and password then 1 else 0.


Check that all binary variables take only the values 0 and 1.

In [20]:
binary_cols = df.columns[binary_idx]
df[binary_cols].describe().transpose()

Name,count,mean,std,min,25%,50%,75%,max
is_sm_ips_ports,2540047.0,0.001652,0.040606,0.0,0.0,0.0,0.0,1.0
is_ftp_login,2540047.0,0.017351,0.133457,0.0,0.0,0.0,0.0,4.0
label,2540047.0,0.126487,0.332398,0.0,0.0,0.0,0.0,1.0


In [21]:
# According to the feature descriptions, is_ftp_login is binary, but it also contains non-binary values
df.is_ftp_login.value_counts()

0.0    2496472
1.0      43389
4.0        156
2.0         30
Name: is_ftp_login, dtype: int64

In [22]:
# Assume that is_ftp_login values greater than 1 correspond to the binary value 0
# (assign the result back instead of using inplace on a column selection)
df['is_ftp_login'] = df['is_ftp_login'].replace({4: 0, 2: 0})
df.is_ftp_login.describe()

count    2.540047e+06
mean     1.708197e-02
std      1.295769e-01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.000000e+00
Name: is_ftp_login, dtype: float64
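
The binary check above can be generalized so that any declared binary column is tested without reading a `describe()` table by eye. This is a sketch under the same assumption the notebook makes (out-of-range values are mapped to 0); the helper `non_binary_values` and the toy data are illustrative.

```python
import pandas as pd

def non_binary_values(frame, binary_columns):
    """Return, per column, the sorted values that are neither 0 nor 1."""
    report = {}
    for col in binary_columns:
        values = frame[col].dropna().unique()
        extra = sorted(float(v) for v in values if v not in (0, 1))
        if extra:
            report[col] = extra
    return report

toy = pd.DataFrame({
    "is_ftp_login": [0.0, 1.0, 4.0, 2.0],
    "label": [0, 1, 0, 1],
})
violations = non_binary_values(toy, ["is_ftp_login", "label"])
print(violations)

# Keep valid values and map everything else to 0 (the notebook's assumption).
fixed = toy["is_ftp_login"].where(toy["is_ftp_login"].isin([0, 1]), 0)
```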

In [23]:
# Check that the nominal columns declared in features.csv match the object-typed dataframe columns
df_object_cols = df.columns[df.dtypes=="object"]
nominal_cols = df.columns[nominal_idx]
np.array_equal(df_object_cols, nominal_cols)  # False
print(df_object_cols, nominal_cols)
# The two lists differ: some columns stored as objects are not declared as nominal

Index(['srcip', 'sport', 'dstip', 'dsport', 'proto', 'state', 'service',
       'ct_ftp_cmd', 'attack_cat'],
      dtype='object', name='Name') Index(['srcip', 'dstip', 'proto', 'state', 'service', 'attack_cat'], dtype='object', name='Name')


According to the features file, the attributes sport, dsport and ct_ftp_cmd should be integers, but in the dataset they are stored as objects, like the nominal variables.
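
The discrepancy can also be computed directly, rather than read off two printed lists. The sketch below returns the object-typed columns that are not declared nominal. The helper name and toy frame are assumptions; a column mixing integers and strings is stored with `object` dtype, which is what the check relies on.

```python
import pandas as pd

def undeclared_object_columns(frame, declared_nominal):
    """List object-typed columns that the feature table does not declare as nominal."""
    object_cols = frame.columns[frame.dtypes == "object"]
    declared = set(declared_nominal)
    return [col for col in object_cols if col not in declared]

toy = pd.DataFrame({
    "proto": ["udp", "tcp", "udp"],
    "ct_ftp_cmd": [0, "1", " "],  # mixed integers and strings
    "dur": [0.1, 0.2, 0.3],
})
suspects = undeclared_object_columns(toy, ["proto"])
print(suspects)
```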

The source IP address, source port number, destination IP address and destination port number are specific to the laboratory environment used to generate the dataset, so they are not relevant to the implementation of the IDS.

In [24]:
features_to_drop = ['srcip', 'sport', 'dstip', 'dsport']
df.drop(features_to_drop, axis = 1, inplace=True)

In [25]:
df.ct_ftp_cmd.unique()

array([0, 1, 6, 2, 4, 8, 5, 3, '0', '1', ' ', '2', '4'], dtype=object)

In [26]:
df[["ct_ftp_cmd",'label']].value_counts()
# The most common label for a missing ct_ftp_cmd is 0, which is also the most common label for a real 0, so let's assume 0 for the missing values

ct_ftp_cmd  label
            0        1132696
0           0        1034172
            1         297183
0           1          22167
1           0          21129
            0          17039
0           0          10159
1           1           1861
2           0           1234
4           0            804
3           0            729
6           0            332
5           0            290
4           0            140
1           1             48
2           0             22
8           0             18
4           1             16
2           1              8
dtype: int64

In [27]:
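# Replace the blank placeholder with 0 (assign back instead of using inplace on a column selection)
df["ct_ftp_cmd"] = df["ct_ftp_cmd"].replace(' ', 0)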

In [28]:
df.ct_ftp_cmd = df.ct_ftp_cmd.astype(int)
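
The two steps above (replace the blank placeholder, then cast) can be folded into one reusable conversion for columns that mix integers, numeric strings and blanks. This is a hedged sketch: `to_int_column` is a hypothetical helper, and it assumes every non-blank value is a whole number written without a decimal point.

```python
import pandas as pd

def to_int_column(series, fill_value=0):
    """Convert a column of ints, numeric strings and blanks to integers."""
    cleaned = series.astype(str).str.strip()
    # Blank entries become the fill value (an assumption, as in the notebook).
    cleaned = cleaned.replace("", str(fill_value))
    return cleaned.astype(int)

toy = pd.Series([0, 1, "0", " ", "2"], dtype=object)
converted = to_int_column(toy)
print(converted.tolist())
```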

In [29]:
# Check the consistency of the remaining nominal columns [proto, state, service]
df.proto.unique() # no noise detected
# state and service use 'no' and '-' when no state or service is detected; the next cell replaces both with 'none'
df.state.unique()
df.service.unique()

array(['dns', '-', 'http', 'smtp', 'ftp-data', 'ftp', 'ssh', 'pop3',
       'snmp', 'ssl', 'irc', 'radius', 'dhcp'], dtype=object)

In [30]:
# Assign the results back instead of using inplace on a column selection
df["state"] = df["state"].replace('no', 'none')
df["service"] = df["service"].replace('-', 'none')
df.state = df.state.apply(makeLowerString)
df.service = df.service.apply(makeLowerString)
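
The placeholder handling above can be sketched as one function driven by a mapping, which makes it easy to add further placeholders later. Note the assumptions: the helper `normalize_nominal` is hypothetical, it applies both placeholders to any column it is given, and it normalizes case before replacing, which differs slightly from the order used in the cell above.

```python
import pandas as pd

# Placeholder values observed in state ('no') and service ('-').
PLACEHOLDERS = {"no": "none", "-": "none"}

def normalize_nominal(series, placeholders=PLACEHOLDERS):
    """Trim and lowercase a nominal column, then map placeholders to 'none'."""
    cleaned = series.str.strip().str.lower()
    return cleaned.replace(placeholders)

toy_service = pd.Series(["dns", "-", "HTTP "])
toy_state = pd.Series(["CON", "no", "FIN"])
print(normalize_nominal(toy_service).tolist())
print(normalize_nominal(toy_state).tolist())
```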


In [None]:
df.to_csv('foo.csv')