<a href="https://colab.research.google.com/github/ammarSherif/KDD-ML-IDS/blob/main/Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML IDS using Random Forest
In this notebook, I will try to achieve high intrusion-detection accuracy using a Random Forest model.

## Import Dataset
Previously, I added a feature header to the dataset so that it can be read as a CSV file. I will train on the 10% subset, which is named `kddcup.data_10_percent` on the official website.

In [1]:
# ================================================================
# Import some needed packages
# ================================================================
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import matplotlib.pyplot as plt

In [2]:
# ================================================================
# Importing the training data from the drive
# ================================================================
f = '/content/drive/MyDrive/Data/IDS/kddcup.data_10_percent'
data10 = pd.read_csv(f)
# ================================================================
# We print its shape
# ================================================================
data10.shape

(494021, 42)

In [3]:
# ================================================================
# Importing a new labeled test data to test our model on it
# ================================================================
f = '/content/drive/MyDrive/Data/IDS/corrected'
testCorrect = pd.read_csv(f)
# ================================================================
# We print its shape
# ================================================================
testCorrect.shape

(311029, 42)

Notice that the labeled test dataset `corrected` includes **17 new attacks** that were not present in the original training dataset, as seen below. The idea is that security domain experts claim that new attacks are similar to the older ones, so our model should be able to detect them as well.

In [4]:
data10.label.unique()

array(['normal.', 'buffer_overflow.', 'loadmodule.', 'perl.', 'neptune.',
       'smurf.', 'guess_passwd.', 'pod.', 'teardrop.', 'portsweep.',
       'ipsweep.', 'land.', 'ftp_write.', 'back.', 'imap.', 'satan.',
       'phf.', 'nmap.', 'multihop.', 'warezmaster.', 'warezclient.',
       'spy.', 'rootkit.'], dtype=object)

In [5]:
testCorrect.label.unique()

array(['normal.', 'snmpgetattack.', 'named.', 'xlock.', 'smurf.',
       'ipsweep.', 'multihop.', 'xsnoop.', 'sendmail.', 'guess_passwd.',
       'saint.', 'buffer_overflow.', 'portsweep.', 'pod.', 'apache2.',
       'phf.', 'udpstorm.', 'warezmaster.', 'perl.', 'satan.', 'xterm.',
       'mscan.', 'processtable.', 'ps.', 'nmap.', 'rootkit.', 'neptune.',
       'loadmodule.', 'imap.', 'back.', 'httptunnel.', 'worm.',
       'mailbomb.', 'ftp_write.', 'teardrop.', 'land.', 'sqlattack.',
       'snmpguess.'], dtype=object)

In [6]:
# ================================================================
# Print the attacks not in the previous dataset
# ================================================================
newAttacks = set(testCorrect.label.unique())- set(
                 data10.label.unique())
print(newAttacks)
print(len(newAttacks))

{'snmpgetattack.', 'apache2.', 'mscan.', 'worm.', 'snmpguess.', 'xsnoop.', 'processtable.', 'named.', 'sqlattack.', 'xterm.', 'udpstorm.', 'httptunnel.', 'ps.', 'mailbomb.', 'sendmail.', 'xlock.', 'saint.'}
17


## Feature Engineering
Now, we will manipulate the features to make training more efficient. To do so, we first check some information about the data:

In [7]:
# ================================================================
# Check the datatypes
# ================================================================
data10.dtypes

duration                         int64
protocol_type                   object
service                         object
flag                            object
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate          

### Feature Selection
We remove the zero-variance features, since a feature that takes the same value in every record cannot help separate the classes.

In [8]:
# ================================================================
# Check the variance of numeric features
# ================================================================
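# numeric_only=True skips the object (categorical) columns, which
# recent pandas versions no longer drop silently
data10.var(numeric_only=True)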

duration                       5.009051e+05
src_bytes                      9.765750e+11
dst_bytes                      1.091642e+09
land                           4.453063e-05
wrong_fragment                 1.817245e-02
urgent                         3.036294e-05
hot                            6.116844e-01
num_failed_logins              2.408579e-04
logged_in                      1.262699e-01
num_compromised                3.233977e+00
root_shell                     1.113191e-04
su_attempted                   6.072496e-05
num_root                       4.051035e+00
num_file_creations             9.296022e-03
num_shells                     1.214406e-04
num_access_files               1.330914e-03
num_outbound_cmds              0.000000e+00
is_host_login                  0.000000e+00
is_guest_login                 1.384661e-03
count                          4.543182e+04
srv_count                      6.067493e+04
serror_rate                    1.449454e-01
srv_serror_rate                1

In [9]:
# ================================================================
# Delete the 0 variance features
# ================================================================
del data10['num_outbound_cmds']
del data10['is_host_login']
del testCorrect['num_outbound_cmds']
del testCorrect['is_host_login']
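
Rather than hard-coding the column names, the zero-variance columns could also be found programmatically. The following is a minimal sketch on a toy frame; the values are made up for illustration, and `zero_variance_columns` is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd

def zero_variance_columns(df):
    """Return the names of numeric columns whose variance is exactly 0."""
    variances = df.var(numeric_only=True)
    return list(variances[variances == 0].index)

# Toy data (made up): two constant columns, one varying, one categorical
toy = pd.DataFrame({
    "duration": [0, 3, 12, 0],
    "num_outbound_cmds": [0, 0, 0, 0],
    "is_host_login": [0, 0, 0, 0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
})

constant_cols = zero_variance_columns(toy)
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
print(list(reduced.columns))
```

The list should be computed on the training data only and then dropped from both the training and the test frames, so that both keep the same columns.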

### New attacks records
While testing, we would also like to check the performance of our model on the unseen attacks; therefore, we copy their records into a separate dataset as below.

In [10]:
newRecords = testCorrect.loc[testCorrect['label'].isin(
                            newAttacks)].copy()
print(newRecords.shape)

(18729, 40)


In [11]:
# ================================================================
# Make sure  it equals the  new ones.
# ================================================================
print('Equal new attacks:',set(newRecords.label.unique()
                              ) == newAttacks)

Equal new attacks: True


### Replicating Datasets
As we are about to encode the values of the datasets for training, we copy them first so that the originals stay untouched.

In [12]:
dTrain = data10.copy()
dTest  = testCorrect.copy()
dNewTst= newRecords.copy()

### Categorical Features [Label Encoding]
Notice that we have some categorical features: `protocol_type`, `service`, and `flag`. Because we will use a **tree-based** model (Random Forest), we encode their values using **label encoding**: trees split on thresholds, so the arbitrary integer codes should not noticeably affect the training of our model. If we were using a neural network, for example, we might have used one of the techniques below instead:
- Embedding
- Dummy encoding
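
For comparison, dummy (one-hot) encoding can be sketched with `pd.get_dummies`. The toy frame below is made up for illustration; this notebook itself keeps label encoding.

```python
import pandas as pd

# Toy data (made up) with one categorical and one numeric feature
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "src_bytes": [181, 239, 1032, 0],
})

# One 0/1 indicator column is created per category value
dummies = pd.get_dummies(toy, columns=["protocol_type"], dtype=int)
print(dummies.columns.tolist())
print(dummies["protocol_type_tcp"].tolist())
```

A feature with many distinct values, such as `service`, would add one column per value under this scheme, which is one reason label encoding is convenient for tree-based models.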


In [13]:
# ================================================================
# Get the indices of categorical features
# ================================================================
objList = list(testCorrect.select_dtypes(
                include = "object").columns)[:-1]
print(objList)

['protocol_type', 'service', 'flag']


In [14]:
# ================================================================
# Store the size of original dataset
# ================================================================
s1 = len(dTrain)
s2 = len(dTest)
s3 = len(dNewTst)
# ================================================================
# Merge the datasets so that they are encoded consistently.
# ----------------------------------------------------------------
# Note: if we encoded them individually, an item 'A' might get the
# code 0 in one dataset and 2 in another, which would make testing
# the model on the other dataset inappropriate.
# DataFrame.append was removed in pandas 2.0, so we use pd.concat;
# the original indices are kept on purpose.
# ================================================================
data = pd.concat([dTrain, dTest, dNewTst])
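
The inconsistency described in the comment above can be reproduced on a tiny example. This is a sketch with made-up values, not the notebook's data: fitting a separate `LabelEncoder` per dataset gives `'tcp'` two different codes, while fitting once on the union keeps the codes aligned.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values standing in for one categorical column
train_values = ["tcp", "udp"]
test_values = ["icmp", "tcp"]

# Encoders fitted separately: 'tcp' is 0 in train but 1 in test
sep_train = LabelEncoder().fit_transform(train_values)
sep_test = LabelEncoder().fit_transform(test_values)

# One encoder fitted on the union: 'tcp' gets the same code everywhere
shared = LabelEncoder().fit(train_values + test_values)
joint_train = shared.transform(train_values)
joint_test = shared.transform(test_values)

print(sep_train[0], sep_test[1])      # codes of 'tcp' when fitted separately
print(joint_train[0], joint_test[1])  # codes of 'tcp' when fitted jointly
```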

In [15]:
# ================================================================
# Create label encoding, to train it later on
# ================================================================
le = LabelEncoder()
for feature in objList:
  data[feature] = le.fit_transform(data[feature])

In [16]:
# ================================================================
# Split them again
# ================================================================
dTrain = data.iloc[:s1,:].copy()
dTest  = data.iloc[s1:(s1+s2),:].copy()
dNewTst= data.iloc[(s1+s2):,:].copy()

## Training [Malicious vs Normal]
Now, we will train our model using all the features to detect whether a record is normal or malicious. Therefore, we first encode the labels into $0$ (normal) and $1$ (malicious).

In [17]:
# ================================================================
# Encode the labels into 0 for normal, 1 otherwise
# ================================================================
datasets = [dTrain,dTest,dNewTst]
for d in datasets:
  d['label'] = d['label'].map(lambda x: 0 if x =='normal.' else 1)

In [18]:
print(dTrain.label.value_counts()/len(dTrain))
print(dTest.label.value_counts()/len(dTest))
print(dNewTst.label.value_counts()/len(dNewTst))

1    0.803089
0    0.196911
Name: label, dtype: float64
1    0.805185
0    0.194815
Name: label, dtype: float64
1    1.0
Name: label, dtype: float64


In [19]:
# ================================================================
# Split the data from the target label
# ================================================================
colNum = len(dTrain.columns)
y = dTrain.iloc[:,colNum-1].copy()
x = dTrain.iloc[:,0:colNum-1].copy()
yT = dTest.iloc[:,colNum-1].copy()
xT = dTest.iloc[:,0:colNum-1].copy()
yN= dNewTst.iloc[:,colNum-1].copy()
xN= dNewTst.iloc[:,0:colNum-1].copy()

In [20]:
# ================================================================
# Split the training features to have 30% for testing and  70% for
# training.
# ================================================================
x_train,x_test,y_train,y_test = train_test_split(x,y, 
                                                 test_size = 0.30) 
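
The split above is unseeded, so the reported accuracies can change slightly from run to run. A sketch of a reproducible, stratified variant on toy data follows; `random_state=42` and `stratify` are assumptions added for illustration and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 20 records, 20% normal (0) and 80% malicious (1)
x_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 4 + [1] * 16)

# stratify keeps the class ratio similar in both parts;
# random_state makes the split repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=42)

print(len(y_tr), len(y_te))
```

Stratification matters most for rare classes, which could otherwise end up almost entirely in one part of the split.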

In [21]:
# ================================================================
# Train our model on the training part of the split
# ================================================================
rf = RandomForestClassifier(n_estimators = 100,n_jobs=-1,verbose=0)
rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [22]:
y_pred = rf.predict(x_test)
print("Accuracy (same distribution): ", metrics.accuracy_score(y_test, y_pred))
y_pred = rf.predict(xT)
print("Accuracy (out of distribution): ", metrics.accuracy_score(yT, y_pred))
y_pred = rf.predict(xN)
print("Accuracy (Unknown attacks): ", metrics.accuracy_score(yN, y_pred))

Accuracy (same distribution):  0.9998178223700634
Accuracy (out of distribution):  0.9281739001829411
Accuracy (Unknown attacks):  0.09482620534999199
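
Since about 80% of the records are malicious, accuracy alone does not show how the errors are distributed between the two classes. The sketch below uses made-up labels, not the notebook's predictions, to show how a detection rate and a false-alarm rate could be read from a confusion matrix.

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predictions: 0 = normal, 1 = malicious
y_true = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 1, 1, 1, 0, 0, 1])

# With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = metrics.confusion_matrix(
    y_true, y_hat, labels=[0, 1]).ravel()

detection_rate = tp / (tp + fn)    # share of attacks that are caught
false_alarm_rate = fp / (fp + tn)  # share of normal traffic flagged
print(detection_rate, false_alarm_rate)
```

On the unknown-attacks set every record is malicious, so the accuracy reported there is the same quantity as the detection rate.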


## Training [Categories]
Now, we will train our model using all the features to predict which of the categories below a record belongs to:

<table style="width: 80%">
  <tr>
    <th>Category</th>
    <th>Attack Name</th>
    <th>Category</th>
    <th>Attack Name</th>
    <th>Category</th>
    <th>Attack Name</th>
    <th>Category</th>
    <th>Attack Name</th>
  </tr>
  <tr>
    <td rowspan="8"><center>DOS</center></td>
    <td align='center'>smurf</td>
    <td rowspan="8"><center>U2R</center></td>
    <td align='center'>buffer_overflow</td>
    <td rowspan="8"><center>R2L</center></td>
    <td align='center'>ftp_write</td>
    <td rowspan="8"><center>PROBE</center></td>
    <td align='center'>ipsweep</td>
  </tr>
  <tr>
    <td align='center'>neptune</td>
    <td align='center'>loadmodule</td>
    <td align='center'>guess_passwd</td>
    <td align='center'>nmap</td>
  </tr>
  <tr>
    <td align='center'>back</td>
    <td align='center'>perl</td>
    <td align='center'>imap</td>
    <td align='center'>portsweep</td>
  </tr>
  <tr>
    <td align='center'>pod</td>
    <td align='center'>rootkit</td>
    <td align='center'>multihop</td>
    <td align='center'>satan</td>
  </tr>
  <tr>
    <td align='center'>teardrop</td>
    <td align='center'>&nbsp;</td>
    <td align='center'>phf</td>
    <td align='center'>&nbsp;</td>
  </tr>
  <tr>
    <td align='center'>land</td>
    <td align='center'>&nbsp;</td>
    <td align='center'>spy</td>
    <td align='center'>&nbsp;</td>
  </tr>
  <tr>
    <td align='center'>&nbsp;</td>
    <td align='center'>&nbsp;</td>
    <td align='center'>warezclient</td>
    <td align='center'>&nbsp;</td>
  </tr>
  <tr>
    <td align='center'>&nbsp;</td>
    <td align='center'>&nbsp;</td>
    <td align='center'>warezmaster</td>
    <td align='center'>&nbsp;</td>
  </tr>
</table>

In [23]:
# ================================================================
# Define the mapping table of attack categories
# ================================================================
attackMap = {
    'normal.':'normal', 'smurf.':'dos', 'neptune.':'dos',
    'back.':'dos', 'pod.':'dos', 'teardrop.':'dos', 'land.':'dos',
    'buffer_overflow.':'u2r', 'loadmodule.':'u2r', 'perl.':'u2r',
    'rootkit.':'u2r', 'ftp_write.':'r2l', 'guess_passwd.':'r2l',
    'imap.':'r2l', 'multihop.':'r2l', 'phf.':'r2l', 'spy.':'r2l',
    'warezclient.':'r2l', 'warezmaster.':'r2l',
    'ipsweep.':'probe', 'nmap.':'probe', 'portsweep.':'probe',
    'satan.':'probe',
}

# ================================================================
# Define a function to map into malicious/normal
# ================================================================
def mapMalNorm(res):
  return np.array([0 if x=='normal' else 1 for x in res])

In [24]:
# ================================================================
# Update the label data back
# ================================================================
dTrain['label'] = data10['label']
dTest['label']  = testCorrect['label']
dNewTst['label']= newRecords['label']
# ================================================================
# Encode the labels into categories
# ================================================================
datasets = [dTrain,dTest,dNewTst]
for d in datasets:
  d['label'] = d['label'].map(lambda x: attackMap[x] if x 
                              in attackMap else 'other')
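
The same lookup-with-default could also be written with `Series.map` followed by `fillna`. This is a sketch on toy labels; `toy_map` is a hypothetical, trimmed-down stand-in for `attackMap`.

```python
import pandas as pd

# Hypothetical, shortened mapping used only for this illustration
toy_map = {'normal.': 'normal', 'smurf.': 'dos', 'satan.': 'probe'}

labels = pd.Series(['normal.', 'smurf.', 'mscan.', 'satan.'])

# Labels missing from the mapping become NaN, then fall back to 'other'
categories = labels.map(toy_map).fillna('other')
print(categories.tolist())
```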

In [25]:
# ================================================================
# Print the percentage of each attack
# ================================================================
print(dTrain.label.value_counts()/len(dTrain))
print(dTest.label.value_counts()/len(dTest))
print(dNewTst.label.value_counts()/len(dNewTst))

dos       0.792391
normal    0.196911
probe     0.008313
r2l       0.002279
u2r       0.000105
Name: label, dtype: float64
dos       0.717933
normal    0.194815
other     0.060216
r2l       0.019268
probe     0.007642
u2r       0.000125
Name: label, dtype: float64
other    1.0
Name: label, dtype: float64


In [26]:
# ================================================================
# Split the data from the target label
# ================================================================
colNum = len(dTrain.columns)
y = dTrain.iloc[:,colNum-1].copy()
x = dTrain.iloc[:,0:colNum-1].copy()
yT = dTest.iloc[:,colNum-1].copy()
xT = dTest.iloc[:,0:colNum-1].copy()
yN= dNewTst.iloc[:,colNum-1].copy()
xN= dNewTst.iloc[:,0:colNum-1].copy()

In [27]:
# ================================================================
# Split the training features to have 30% for testing and  70% for
# training.
# ================================================================
x_train,x_test,y_train,y_test = train_test_split(x,y, 
                                                 test_size = 0.30) 

In [28]:
# ================================================================
# Train our model on the training part of the split
# ================================================================
rf = RandomForestClassifier(n_estimators =100,n_jobs=-1,verbose=0)
rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [29]:
y_pred = rf.predict(x_test)
print("Accuracy (same distribution): ", metrics.accuracy_score(y_test, y_pred))
y_pred = rf.predict(xT)
print("Accuracy (out of distribution): ", metrics.accuracy_score(yT, y_pred))
# y_pred = rf.predict(xN)
# print("Accuracy (Unknown attacks): ", metrics.accuracy_score(yN, y_pred))
# ================================================================
# After Mapping
# ================================================================
print("=====After Mapping into Malicious/Normal=====")
y_pred = rf.predict(x_test)
print("Accuracy (same distribution): ", metrics.accuracy_score(
    mapMalNorm(y_test), mapMalNorm(y_pred)))
y_pred = rf.predict(xT)
print("Accuracy (out of distribution): ", metrics.accuracy_score(
    mapMalNorm(yT), mapMalNorm(y_pred)))
y_pred = rf.predict(xN)
print("Accuracy (Unknown attacks): ", metrics.accuracy_score(
    mapMalNorm(yN), mapMalNorm(y_pred)))

Accuracy (same distribution):  0.9997503491737907
Accuracy (out of distribution):  0.9204318568365008
=====After Mapping into Malicious/Normal=====
Accuracy (same distribution):  0.9997503491737907
Accuracy (out of distribution):  0.9245600892521276
Accuracy (Unknown attacks):  0.06786267286027017
