This project studies the network intrusion detection problem using different machine learning (ML) algorithms.

Network security is a major concern, and machine learning can be a useful tool for detecting network attacks and protecting the network. We therefore apply several machine learning algorithms to the detection of malicious network packets and compare them to find the best one.

In this project, we use the [NSL-KDD](https://www.unb.ca/cic/datasets/nsl.html) dataset and try three representative algorithms to classify the malicious packets in the dataset. The three are a **conventional supervised algorithm (Random Forest/SVM), a neural network algorithm, and an unsupervised algorithm**. We also apply some preprocessing, such as **feature scaling**, and discuss its influence.

In the end, we will evaluate the detection accuracy and speed to find out which is the best for network intrusion detection applications.

Part 1: importing the dataset.

The dataset is already split into training and test sets, so we only need to read the two CSV files and name the columns.

In [10]:
import pandas as pd
import numpy as np

# The files have no header row, so the 43 column names are supplied here:
# 41 features followed by "labels" and "diff". Both files share the same layout.
columns = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","labels", "diff"]

df_train = pd.read_csv('KDDTrain+.txt', header=None, names=columns)
df_test = pd.read_csv('KDDTest+.txt', header=None, names=columns)
# check the input
print(df_train.head())
print(df_test.head())

   duration protocol_type   service  ... dst_host_srv_rerror_rate   labels  diff
0         0           tcp  ftp_data  ...                     0.00   normal    20
1         0           udp     other  ...                     0.00   normal    15
2         0           tcp   private  ...                     0.00  neptune    19
3         0           tcp      http  ...                     0.01   normal    21
4         0           tcp      http  ...                     0.00   normal    21

[5 rows x 43 columns]
   duration protocol_type   service  ... dst_host_srv_rerror_rate   labels  diff
0         0           tcp   private  ...                     1.00  neptune    21
1         0           tcp   private  ...                     1.00  neptune    21
2         2           tcp  ftp_data  ...                     0.00   normal    21
3         0          icmp     eco_i  ...                     0.00    saint    15
4         1           tcp    telnet  ...                     0.71    mscan    11

[5 rows x 43 columns]

Part 2: preprocessing.

The original labels include many attack types, but we only want to detect whether a packet is malicious, so we map the labels to 0 (normal) and 1 (malicious). We then separate the features from the labels, encode the categorical features, check for missing data, and scale the features.
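
The label mapping described above can also be written as a single vectorized comparison, which avoids depending on the order of two in-place assignments. This is a minimal sketch on a made-up five-row frame: the `toy` DataFrame and its values are illustrative assumptions, and only the `labels` column name comes from this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train / df_test.
toy = pd.DataFrame({"labels": ["normal", "neptune", "normal", "saint", "mscan"]})

# True wherever the record is any kind of attack, then cast to 0/1.
toy["binary"] = (toy["labels"] != "normal").astype(int)

print(toy["binary"].tolist())
```

The same expression could be applied to `df_train['labels']` and `df_test['labels']` in place of the four `.loc` assignments.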


In [11]:
# turn all the attack types to only normal and abnormal
df_train.loc[df_train['labels'] != 'normal', 'labels'] = 1
df_train.loc[df_train['labels'] == 'normal', 'labels'] = 0
df_test.loc[df_test['labels'] != 'normal', 'labels'] = 1
df_test.loc[df_test['labels'] == 'normal', 'labels'] = 0


x_train = df_train.iloc[:, 0:41].values
y_train = df_train.iloc[:, 41].values
x_test = df_test.iloc[:, 0:41].values
y_test = df_test.iloc[:, 41].values

y_train=y_train.astype(float)
y_test=y_test.astype(float)
print(y_train[0:5])
print(y_test[0:5])

# print(x_train[0:5, :])
# print(x_test[0:5, :])

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
ct = ColumnTransformer(transformers = [('encoder', OrdinalEncoder(), [1, 2, 3])], remainder = 'passthrough')
x_train = np.array(ct.fit_transform(x_train))
x_test = np.array(ct.transform(x_test))

# check whether we need to process missing value
print("the number of nan in train and test")
print(np.count_nonzero(np.isnan(np.array(x_train, dtype=np.float64))))
print(np.count_nonzero(np.isnan(np.array(x_test, dtype=np.float64))))

# feature scaling, we choose Standardization instead of Normalization in case there are outliers.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
print ('Train set:', x_train.shape)
print ('Test set:', x_test.shape)

# print(x_train[0:5, :])
# print(x_test[0:5, :])

[0. 0. 1. 0. 0.]
[1. 1. 0. 1. 1.]
the number of nan in train and test
0
0
Train set: (125973, 41)
Test set: (22544, 41)
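
Standardization rescales each feature to zero mean and unit variance using statistics from the training set only, which is why the cell above calls `fit_transform` on `x_train` but only `transform` on `x_test`. The sketch below reproduces that arithmetic with plain NumPy on made-up arrays (`train` and `test` are illustrative assumptions, not NSL-KDD data). It divides constant columns by 1 instead of 0, which is our understanding of how `StandardScaler` treats zero-variance features.

```python
import numpy as np

# Made-up data: 3 features, the last one constant in the training set.
train = np.array([[1.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [3.0, 60.0, 0.0]])
test = np.array([[4.0, 30.0, 0.0]])

mean = train.mean(axis=0)          # per-feature mean, training data only
std = train.std(axis=0)            # population standard deviation (ddof=0)
std[std == 0.0] = 1.0              # avoid dividing by zero for constant features

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics on the test set

print(train_scaled.mean(axis=0))   # approximately 0 for every feature
print(test_scaled)
```

Reusing the training mean and standard deviation keeps information about the test set out of the preprocessing step.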


The evaluation metrics: accuracy, ROC AUC score, confusion matrix, and classification report.

In [12]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
def print_result(labels, yhat):
    print("The accuracy is", accuracy_score(labels, yhat))
    print("The roc_auc_score is", roc_auc_score(labels, yhat))
    print("The confusion matrix is")
    print(confusion_matrix(labels, yhat))
    print("The classification report is")
    print(classification_report(labels, yhat))

Classify with Random Forest.

In [13]:
import time
beforeRF = time.time()

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 128, max_depth = 15)
rf.fit(x_train, y_train)
yhat_test_rf = rf.predict(x_test)

afterRF = time.time()
print("The time used to classify using random forest is", str(afterRF - beforeRF), "seconds")

print("\n/**** Result using the Random Forest ****/")
print_result(y_test, yhat_test_rf)

The time used to classify using random forest is 16.57157301902771 seconds

/**** Result using the Random Forest ****/
The accuracy is 0.7621540099361249
The roc_auc_score is 0.7875782160866079
The confusion matrix is
[[9431  280]
 [5082 7751]]
The classification report is
              precision    recall  f1-score   support

         0.0       0.65      0.97      0.78      9711
         1.0       0.97      0.60      0.74     12833

    accuracy                           0.76     22544
   macro avg       0.81      0.79      0.76     22544
weighted avg       0.83      0.76      0.76     22544
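
The numbers above can be cross-checked by hand from the confusion matrix. Because `print_result` passes hard 0/1 predictions to `roc_auc_score`, the reported value coincides with balanced accuracy, that is, the mean of the recall on each class, rather than an AUC computed from continuous scores. A short sketch of the arithmetic (the variable names are ours):

```python
# Random Forest confusion matrix from the output above, in scikit-learn's
# layout: rows are true classes, columns are predicted classes.
tn, fp = 9431, 280
fn, tp = 5082, 7751

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_normal = tn / (tn + fp)       # true negative rate
recall_malicious = tp / (tp + fn)    # true positive rate
balanced_accuracy = (recall_normal + recall_malicious) / 2

print("accuracy:", accuracy)
print("recall per class:", recall_normal, recall_malicious)
print("balanced accuracy:", balanced_accuracy)
```

The recall of about 0.60 on the malicious class means that roughly 40% of the attacks in the test set (5082 of 12833) are classified as normal.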



Classify with a linear SVM.

In [17]:
beforeSVM = time.time()

from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0, tol=1e-5)

import warnings
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    clf.fit(x_train, y_train)

yhat_test_SVM = clf.predict(x_test)

afterSVM = time.time()
print("The time used to classify using Support Vector Machine is", str(afterSVM - beforeSVM), "seconds")
print_result(y_test, yhat_test_SVM)

The time used to classify using Support Vector Machine is 46.19372606277466 seconds
The accuracy is 0.7508427963094393
The roc_auc_score is 0.7731210242990396
The confusion matrix is
[[9070  641]
 [4976 7857]]
The classification report is
              precision    recall  f1-score   support

         0.0       0.65      0.93      0.76      9711
         1.0       0.92      0.61      0.74     12833

    accuracy                           0.75     22544
   macro avg       0.79      0.77      0.75     22544
weighted avg       0.80      0.75      0.75     22544



Classify with MLP.

In [18]:
beforeMLP = time.time()

from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam',
                    alpha=0.0001, batch_size='auto', learning_rate='constant',
                    learning_rate_init=0.001)
mlp.fit(x_train, y_train)

yhat_test_mlp = mlp.predict(x_test)
afterMLP = time.time()
print("The time used to classify using MLP is", str(afterMLP - beforeMLP), "seconds")

print("\n/ Result using the MLP /")
print_result(y_test, yhat_test_mlp)

The time used to classify using MLP is 82.81490516662598 seconds

/ Result using the MLP /
The accuracy is 0.7962650816181689
The roc_auc_score is 0.8176778187523264
The confusion matrix is
[[9442  269]
 [4324 8509]]
The classification report is
              precision    recall  f1-score   support

         0.0       0.69      0.97      0.80      9711
         1.0       0.97      0.66      0.79     12833

    accuracy                           0.80     22544
   macro avg       0.83      0.82      0.80     22544
weighted avg       0.85      0.80      0.79     22544
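
The timings reported for the three supervised models each cover training and prediction together, as well as the `import` statement inside the timed region. For an intrusion detection application the two phases matter differently, since training can happen offline while prediction runs on live traffic. One way to separate them is a small helper built on `time.perf_counter`. The `timed` function and the `DummyModel` stand-in below are hypothetical and only illustrate the pattern.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

class DummyModel:
    """Hypothetical stand-in with the fit/predict interface used in this notebook."""
    def fit(self, x, y):
        self.majority = max(set(y), key=y.count)  # remember the most common label
        return self

    def predict(self, x):
        return [self.majority for _ in x]

x_demo, y_demo = [[0], [1], [2]], [0, 1, 1]
model = DummyModel()
_, fit_seconds = timed(model.fit, x_demo, y_demo)
pred, predict_seconds = timed(model.predict, x_demo)
print("fit:", fit_seconds, "s; predict:", predict_seconds, "s")
```

With the real models this would become, for example, `_, fit_seconds = timed(rf.fit, x_train, y_train)` followed by `yhat_test_rf, predict_seconds = timed(rf.predict, x_test)`.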



Classify with K-Means. We use K-Means to divide the data into two clusters (cluster 1 and cluster 2) and compute the accuracy from that partition. Since we do not know in advance which cluster is normal and which is malicious, there are two possible accuracy scores: one if cluster 1 is normal and cluster 2 is malicious, and one for the opposite assignment.

In [22]:
def result_statistic(data, total):
    gp = data.groupby(by=['labels', 'result'])
    newdf = gp.size()
    newdf = newdf.reset_index(name='count')
    print("cluster statistics:")
    print(newdf)
    # treating cluster 1 as "malicious": (label 0, cluster 0) is TN, (label 0, cluster 1) is FP,
    # (label 1, cluster 0) is FN and (label 1, cluster 1) is TP
    tn = 0
    fp = 0
    fn = 0
    tp = 0
    i = 0
    while i < newdf.shape[0]:
        if newdf.iat[i, 0] == 0 and newdf.iat[i, 1] == 0:
            tn = newdf.iat[i, 2]
        elif newdf.iat[i, 0] == 0 and newdf.iat[i, 1] == 1:
            fp = newdf.iat[i, 2]
        elif newdf.iat[i, 0] == 1 and newdf.iat[i, 1] == 0:
            fn = newdf.iat[i, 2]
        elif newdf.iat[i, 0] == 1 and newdf.iat[i, 1] == 1:
            tp = newdf.iat[i, 2]
        else:
            print("There are other labels/outliers. The result is")
            print(newdf)
        i = i + 1

    print("The confusion matrix ([[TN, FP] [FN, TP]]) is: [[", tn, fp, "][", fn, tp, "]]")
    accuracy1 = (tp + tn) / total
    accuracy2 = (fp + fn) / total
    print("The accuracy is", accuracy1, "or", accuracy2)


beforeKM = time.time()

from sklearn.cluster import KMeans
km = KMeans(n_clusters=2).fit(x_train)
df_test_result = df_test.copy()
df_test_result["result"] = km.predict(x_test)
result_statistic(df_test_result, df_test.shape[0])

afterKM = time.time()
print("The time used to classify using K-Means is", str(afterKM - beforeKM), "seconds")

cluster statistics:
   labels  result  count
0       0       0   9549
1       0       1    162
2       1       0   5623
3       1       1   7210
The confusion matrix ([[TN, FP] [FN, TP]]) is: [[ 9549 162 ][ 5623 7210 ]]
The accuracy is 0.7433907026259758 or 0.2566092973740241
The time used to classify using K-Means is 2.185791254043579 seconds
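
K-Means cluster indices are arbitrary, so the cluster-to-class assignment has to be chosen after clustering. A minimal sketch of one simple choice, scoring both assignments and keeping the better one, using the counts printed in the cluster statistics above. Picking the assignment this way uses the true labels, so the resulting accuracy should be read as an upper bound for a fully unsupervised deployment (this caveat is ours).

```python
# (true label, cluster id) -> count, copied from the cluster statistics above
counts = {(0, 0): 9549, (0, 1): 162, (1, 0): 5623, (1, 1): 7210}
total = sum(counts.values())

# Assignment A: cluster 0 = normal, cluster 1 = malicious
acc_a = (counts[(0, 0)] + counts[(1, 1)]) / total
# Assignment B: cluster 0 = malicious, cluster 1 = normal
acc_b = (counts[(0, 1)] + counts[(1, 0)]) / total

best = "cluster 1 = malicious" if acc_a >= acc_b else "cluster 0 = malicious"
print(best, max(acc_a, acc_b))
```

The two scores always sum to 1, so the second accuracy printed by `result_statistic` carries no extra information beyond telling us which assignment to pick.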
