<a href="https://colab.research.google.com/github/fwangliberty/AIoTDesign-Frontend/blob/master/cnn_model_balanced_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intrusion Detection based on CICIDS 2017 Data Set (2)

We use the pre-processed dataset from mlp4nids (Multi-layer perceptron for network intrusion detection): https://github.com/ArnaudRosay/mlp4nids. A separate Colab script converts the parquet files to CSV files.

In [56]:
import os
from os.path import join
import glob
import pandas as pd
import numpy as np
import time
import seaborn as sns
import matplotlib.pyplot as plt
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [57]:
def display_all(df):
    with pd.option_context("display.max_rows", 100, "display.max_columns", 100): 
        print(df)

In [58]:
def make_value2index(attacks):
    # map each attack name to an integer index, in sorted order
    return {attack: index for index, attack in enumerate(sorted(attacks))}

In [59]:
# changes label from string to integer/index
def encode_label(Y_str):
    labels_d = make_value2index(np.unique(Y_str))
    Y = [labels_d[y_str] for y_str in Y_str]
    return np.array(Y)

## Step 1. Loading csv files

Connect to Google Drive

In [60]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [61]:
# All columns
col_names = np.array(['Source Port', 'Destination Port',
                      'Protocol', 'Flow Duration', 'Total Fwd Packets', 'Total Backward Packets', 'Total Length of Fwd Packets',
                      'Total Length of Bwd Packets', 'Fwd Packet Length Max', 'Fwd Packet Length Min', 'Fwd Packet Length Mean',
                      'Fwd Packet Length Std', 'Bwd Packet Length Max', 'Bwd Packet Length Min', 'Bwd Packet Length Mean', 'Bwd Packet Length Std',
                      'Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Total',
                      'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max',
                      'Bwd IAT Min', 'Fwd PSH Flags', 'Fwd URG Flags', 'Fwd Header Length', 'Bwd Header Length',
                      'Fwd Packets/s', 'Bwd Packets/s', 'Min Packet Length', 'Max Packet Length', 'Packet Length Mean', 'Packet Length Std',
                      'Packet Length Variance', 'FIN Flag Count', 'SYN Flag Count', 'RST Flag Count', 'PSH Flag Count', 'ACK Flag Count',
                      'URG Flag Count', 'CWE Flag Count', 'ECE Flag Count', 'Down/Up Ratio', 'Average Packet Size', 'Avg Fwd Segment Size',
                      'Avg Bwd Segment Size','Subflow Fwd Packets', 'Subflow Fwd Bytes',
                      'Subflow Bwd Packets', 'Subflow Bwd Bytes', 'Init_Win_bytes_forward', 'Init_Win_bytes_backward',
                      'act_data_pkt_fwd', 'min_seg_size_forward', 'Active Mean', 'Active Std', 'Active Max', 'Active Min', 'Idle Mean',
                      'Idle Std', 'Idle Max', 'Idle Min', 'Label'])

In [62]:
# load three csv files generated by mlp4nids (Multi-layer perceptron for network intrusion detection )
# first load the train set
df_train = pd.read_csv('/content/drive/My Drive/CICIDS2017/train_set.csv',names=col_names, skiprows=1)  

In [63]:
# Here we can see the number of rows and columns for each table.
print(df_train.shape)

(556548, 72)


Count the number of attacks

In [64]:
df_train['Label'].value_counts()

BENIGN                        278274
DoS Hulk                      115062
PortScan                       79402
DDoS                           64012
DoS GoldenEye                   5146
FTP-Patator                     3967
SSH-Patator                     2948
DoS slowloris                   2898
DoS Slowhttptest                2749
Bot                              978
Web Attack  Brute Force         753
Web Attack  XSS                 326
Infiltration                      18
Web Attack  Sql Injection        10
Heartbleed                         5
Name: Label, dtype: int64

Read test and validation sets

In [65]:
df_test = pd.read_csv('/content/drive/My Drive/CICIDS2017/test_set.csv',names=col_names, skiprows=1)  
print('Test set size: ', df_test.shape)

df_val = pd.read_csv('/content/drive/My Drive/CICIDS2017/crossval_set.csv',names=col_names, skiprows=1)  
print('Validation set size: ', df_val.shape)

Test set size:  (278270, 72)
Validation set size:  (278270, 72)


Distribution of different attack cases

In [66]:
print('Test set: ')
df_test['Label'].value_counts()

Test set: 


BENIGN                        139135
DoS Hulk                       57531
PortScan                       39701
DDoS                           32006
DoS GoldenEye                   2573
FTP-Patator                     1983
SSH-Patator                     1474
DoS slowloris                   1449
DoS Slowhttptest                1374
Bot                              489
Web Attack  Brute Force         376
Web Attack  XSS                 163
Infiltration                       9
Web Attack  Sql Injection         5
Heartbleed                         2
Name: Label, dtype: int64

In [67]:
print('Validation set: ')
df_val['Label'].value_counts()

Validation set: 


BENIGN                        139135
DoS Hulk                       57531
PortScan                       39701
DDoS                           32006
DoS GoldenEye                   2573
FTP-Patator                     1983
SSH-Patator                     1474
DoS slowloris                   1449
DoS Slowhttptest                1374
Bot                              489
Web Attack  Brute Force         376
Web Attack  XSS                 163
Infiltration                       9
Web Attack  Sql Injection         5
Heartbleed                         2
Name: Label, dtype: int64

## Step 2. Normalization

Continuous feature values are normalized to a common scale. This matters when features are measured in different units and ranges, and many machine learning algorithms require it. The values of this dataset are therefore normalized with Min-Max scaling, which brings them all into the range [0, 1].
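As a minimal sketch of what Min-Max scaling computes, the snippet below applies the formula x' = (x - min) / (max - min) column by column with NumPy. The toy matrix `X` and its values are made up for illustration and are not taken from CICIDS 2017.

```python
import numpy as np

# Toy feature matrix (hypothetical values): rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [3.0, 400.0],
              [5.0, 1000.0]])

col_min = X.min(axis=0)  # per-column minimum
col_max = X.max(axis=0)  # per-column maximum

# Min-Max scaling: x' = (x - min) / (max - min), applied per column.
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

A constant column would make the denominator zero; this sketch does not handle that case.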

### Step 2.1 Encoding train dataset
Encode the labels as integers and generate NumPy arrays. Note that the labels are not yet one-hot encoded; one-hot encoding is applied later.

In [68]:
df_label = df_train['Label']
data = df_train.drop(columns=['Label'])
X_train = data.values
y_train = encode_label(df_label.values)
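For reference, the integer encoding done by `encode_label` (sorted unique labels mapped to 0, 1, 2, ...) can also be sketched with `np.unique(..., return_inverse=True)`. The short label array below is made up for illustration. Note that encoding each split separately yields the same mapping only when every split contains the same set of labels, which the value counts above suggest is the case here.

```python
import numpy as np

# Hypothetical label column with three of the class names.
labels = np.array(["DDoS", "BENIGN", "Bot", "BENIGN", "DDoS"])

# `classes` holds the sorted unique labels; `encoded` holds, for each sample,
# the index of its label within `classes`.
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)
print(encoded)
```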

### Step 2.2 Normalizing train dataset

In [69]:
from sklearn.preprocessing import MinMaxScaler

In [70]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_train

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.64199192e-01, 2.67500000e-01, 6.67991250e-02],
       [8.60215054e-04, 3.36651758e-04, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [0.00000000e+00, 6.98552397e-03, 1.61354247e-02, ...,
        0.00000000e+00, 6.92500000e-01, 6.92500000e-01],
       [0.00000000e+00, 3.19819170e-03, 2.81067803e-03, ...,
        9.18180611e-05, 2.06666667e-01, 2.06666667e-01],
       [0.00000000e+00, 4.59155592e-02, 5.83030421e-02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

### Step 2.3. Encoding test dataset

In [71]:
df_label = df_test['Label']
data = df_test.drop(columns=['Label'])
X_test = data.values
y_test = encode_label(df_label.values)

### Step 2.4. Normalizing test dataset

In [72]:
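# note: a new scaler is fitted on the test set here, so its min/max differ from the train-set scaler
scaler = MinMaxScaler()
X_test = scaler.fit_transform(X_test)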
X_test

array([[0.00000000e+00, 1.88574865e-02, 3.72734867e-02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.61501211e-02, 9.09200241e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.79176755e-02, 6.22970535e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.45278450e-02, 5.05111245e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 5.40889958e-03, 1.83091266e-02, ...,
        8.72072356e-05, 8.33333333e-02, 8.32840583e-02],
       [0.00000000e+00, 6.50330728e-03, 1.88504848e-02, ...,
        0.00000000e+00, 8.22500000e-01, 8.22500000e-01]])

### Step 2.5 Encoding validation dataset

In [73]:
df_label = df_val['Label']
data = df_val.drop(columns=['Label'])
X_val = data.values
y_val = encode_label(df_label.values)

### Step 2.6. Normalizing validation dataset

In [74]:
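# note: a new scaler is fitted on the validation set here, so its min/max differ from the train-set scaler
scaler = MinMaxScaler()
X_val = scaler.fit_transform(X_val)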
X_val

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.0234824 , 0.03596298, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.00078058, 0.00059772, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.00169125, 0.00177089, ..., 0.        , 0.        ,
        0.        ],
       [0.03922518, 0.01580671, 0.        , ..., 0.        , 0.        ,
        0.        ]])
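The cells above fit a separate `MinMaxScaler` on each split. A common alternative, sketched here on made-up toy arrays rather than the notebook's data, is to fit the scaler on the training split only and reuse it for the other splits, so that every split goes through the same mapping. The names `X_train_toy` and `X_test_toy` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature splits.
X_train_toy = np.array([[0.0], [10.0]])
X_test_toy = np.array([[5.0], [20.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns min=0, max=10
X_test_scaled = scaler.transform(X_test_toy)        # reuses the train min/max

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

With this approach, test values outside the training range map outside [0, 1], as the second test sample shows.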

## Step 3 One-hot encoding for labels

y_train, y_test and y_val have to be one-hot encoded. That means they must have dimension (number_of_samples, 15), where 15 is the number of classes.

In [75]:
from tensorflow.keras.utils import to_categorical

In [76]:
y_train = to_categorical(y_train, 15)
y_test = to_categorical(y_test, 15)
y_val = to_categorical(y_val, 15)
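To illustrate what one-hot encoding produces, here is a small NumPy sketch that mirrors the effect of `to_categorical` on integer labels. The four-class label vector is made up; the notebook itself uses 15 classes.

```python
import numpy as np

num_classes = 4              # hypothetical; the notebook uses 15
y = np.array([0, 2, 3, 2])   # integer class indices

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with y one-hot encodes every label.
y_onehot = np.eye(num_classes, dtype="float32")[y]

print(y_onehot.shape)
print(y_onehot)
```

Taking `np.argmax(y_onehot, axis=1)` recovers the original integer labels.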

## Step 4. Build the model

In [77]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, BatchNormalization, Flatten, Dense, Activation,Dropout
from tensorflow.keras.constraints import max_norm

In [80]:
#hyper-params
batch_size = 1024 # increase the batch size when more GPUs are added
input_dim = X_train.shape[1]
num_class = 15                   # 15 traffic classes: 14 attack types plus benign traffic
num_epochs = 30
learning_rates = 1e-3
regularizations = 1e-3
optim = tf.keras.optimizers.Adam(learning_rate=learning_rates, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

print(input_dim)
print(num_class)

71
15


In [81]:
#X_train_r = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_train_r = np.zeros((len(X_train), input_dim, 1))
X_train_r[:, :, 0] = X_train[:, :input_dim]
print(X_train_r.shape)

(556548, 71, 1)


In [82]:
X_test_r = np.zeros((len(X_test), input_dim, 1))
X_test_r[:, :, 0] = X_test[:, :input_dim]
print(X_test_r.shape)

(278270, 71, 1)


In [83]:
X_val_r = np.zeros((len(X_val), input_dim, 1))
X_val_r[:, :, 0] = X_val[:, :input_dim]
print(X_val_r.shape)

(278270, 71, 1)
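The three cells above add a trailing channel axis of size 1, because `Conv1D` expects input of shape (samples, steps, channels). As a sketch on a made-up array, the same reshaping can be written with `np.newaxis`:

```python
import numpy as np

# Hypothetical 2-sample, 3-feature matrix.
X = np.arange(6, dtype=float).reshape(2, 3)

# Approach used in the notebook: allocate zeros, then copy into channel 0.
X_r = np.zeros((len(X), X.shape[1], 1))
X_r[:, :, 0] = X

# Equivalent one-liner: append a new axis of length 1.
X_alt = X[..., np.newaxis]

print(X_r.shape, X_alt.shape)
```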


In [84]:
model = Sequential()

# input layer
model.add(Conv1D(filters=64, kernel_size=3, padding='same', input_shape=(71,1)))
model.add(BatchNormalization(axis=1))
model.add(Activation('relu'))


model.add(Conv1D(filters=128, kernel_size=3))
model.add(BatchNormalization(axis=1))
model.add(Activation('relu'))

model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(num_class))
model.add(Activation('softmax'))


model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_2 (Conv1D)            (None, 71, 64)            256       
_________________________________________________________________
batch_normalization_2 (Batch (None, 71, 64)            284       
_________________________________________________________________
activation_3 (Activation)    (None, 71, 64)            0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 69, 128)           24704     
_________________________________________________________________
batch_normalization_3 (Batch (None, 69, 128)           276       
_________________________________________________________________
activation_4 (Activation)    (None, 69, 128)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8832)             
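The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes stride 1 and the standard Conv1D parameter formula (kernel_size × in_channels × filters, plus one bias per filter); the helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size, padding):
    # Output length of a stride-1 1D convolution.
    if padding == "same":
        return length
    if padding == "valid":
        return length - kernel_size + 1
    raise ValueError("padding must be 'same' or 'valid'")


def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter.
    return kernel_size * in_channels * filters + filters


len1 = conv1d_output_length(71, 3, "same")     # first Conv1D, padding='same'
len2 = conv1d_output_length(len1, 3, "valid")  # second Conv1D, default padding
params1 = conv1d_params(3, 1, 64)
params2 = conv1d_params(3, 64, 128)

# BatchNormalization(axis=1) keeps 4 values (gamma, beta, moving mean,
# moving variance) per position along axis 1.
bn1 = 4 * len1
bn2 = 4 * len2

flat = len2 * 128  # Flatten output size

print(len1, len2, params1, params2, bn1, bn2, flat)
```

These values agree with the rows of the summary shown above.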

In [85]:
model.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy']) 

## Step 5. Training the model

In [86]:
# fit network
model.fit(X_train_r, y_train, epochs=num_epochs, batch_size=batch_size, validation_data=(X_val_r, y_val), verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7f523913c7b8>

In [87]:
# evaluate model; with metrics=['accuracy'], evaluate returns [loss, accuracy]
loss, accuracy = model.evaluate(X_test_r, y_test, batch_size=batch_size, verbose=0)

Save the model

In [88]:
model.save('/content/drive/My Drive/CICIDS2017/cicids2017cnn.h5')

## Step 6. Calculate Precision, Recall, and F-score

Classification accuracy is the number of correct predictions divided by the total number of predictions made for a dataset. As a performance measure, accuracy is inappropriate for imbalanced classification problems: examples from the majority class (or classes) far outnumber those from the minority classes, so even an unskillful model can reach accuracy scores of 90 percent or 99 percent, depending on how severe the class imbalance is.

An alternative to using classification accuracy is to use precision and recall metrics.
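To make the imbalance argument concrete, the sketch below uses the test-set label counts printed in Step 1. It computes the accuracy of a trivial model that always predicts the most frequent class, and the share of samples that belong to the three rarest classes.

```python
# Test-set label counts, copied from the value_counts() output in Step 1.
test_counts = {
    "BENIGN": 139135, "DoS Hulk": 57531, "PortScan": 39701, "DDoS": 32006,
    "DoS GoldenEye": 2573, "FTP-Patator": 1983, "SSH-Patator": 1474,
    "DoS slowloris": 1449, "DoS Slowhttptest": 1374, "Bot": 489,
    "Web Attack  Brute Force": 376, "Web Attack  XSS": 163,
    "Infiltration": 9, "Web Attack  Sql Injection": 5, "Heartbleed": 2,
}

total = sum(test_counts.values())

# Accuracy of a model that always predicts the most frequent class.
majority_label = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_label] / total

# Share of samples in the three rarest classes: misclassifying all of them
# lowers accuracy by only this much.
rare = ["Infiltration", "Web Attack  Sql Injection", "Heartbleed"]
rare_share = sum(test_counts[name] for name in rare) / total

print(majority_label, baseline_accuracy)
print(rare_share)
```

Missing every Infiltration, Sql Injection and Heartbleed sample would barely move the accuracy, which is why per-class metrics are needed.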

In [89]:
# demonstration of calculating metrics for a neural network model using sklearn
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

In [90]:
df_label = df_test['Label']
ytest = encode_label(df_label.values)

In [91]:
# predict probabilities for test set
yhat_probs = model.predict(X_test_r, verbose=0)

# predict crisp classes for test set
# (Sequential.predict_classes was removed in later TensorFlow releases;
#  argmax over the class axis gives the same labels)
yhat_classes = np.argmax(yhat_probs, axis=1)

# keep only the predicted probability of class 0 as a 1d array
yhat_probs = yhat_probs[:, 0]



In [92]:
print(yhat_probs.shape)
print(ytest.shape)

(278270,)
(278270,)


**Accuracy**

In [93]:
# accuracy: (tp + tn) / (p + n)
accuracy = accuracy_score(ytest, yhat_classes)
print('Accuracy: %f' % accuracy)

Accuracy: 0.983814


**Precision**

Precision is the fraction of predicted positives that are correct. It is calculated as the number of correctly predicted positive examples divided by the total number of examples predicted as positive. In a multi-class setting it is computed per class and then averaged.

In [94]:
# precision tp / (tp + fp)
# labels=[1, 2] restricts the micro-average to classes 1 and 2; omit `labels` to average over all classes
precision = precision_score(ytest, yhat_classes, labels=[1,2], average='micro')
print('Precision: %f' % precision)

Precision: 0.999441


**Recall**

Recall is the ratio of true positives to the ground-truth positives in the sample. Unlike Precision, Recall also considers the number of positive (minority) cases that were not classified as such.

In [95]:
# recall: tp / (tp + fn)
# with average='weighted', recall equals the overall accuracy
recall = recall_score(ytest, yhat_classes, average='weighted')
print('Recall: %f' % recall)

Recall: 0.983814
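The weighted recall is dominated by the large classes. As a sketch, per-class recall can be read off the first rows of the confusion matrix printed at the end of this section: the diagonal entry divided by the row total. The class names attached to rows 0 to 3 are an assumption based on the sorted label order used by `encode_label`; the row totals match the test-set counts for those labels.

```python
# (true positives, row total) for the first four confusion-matrix rows.
# Class names are assumed from the sorted label order.
rows = {
    "BENIGN": (136519, 139135),
    "Bot": (182, 489),
    "DDoS": (31984, 32006),
    "DoS GoldenEye": (2539, 2573),
}

# recall = tp / (tp + fn) = diagonal entry / row total
recalls = {name: tp / row_total for name, (tp, row_total) in rows.items()}

for name, value in recalls.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the Bot class is recalled far less often than the overall score suggests, since most of its samples fall into column 0.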


**F-Score**

Precision is appropriate when we are more concerned with minimizing false positives, while Recall is appropriate when the number of false negatives is more critical. The F-score combines the two as the harmonic mean of precision and recall.

In [96]:
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(ytest, yhat_classes, average='weighted')
print('F1 score: %f' % f1)

F1 score: 0.982994
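For illustration, per-class precision, recall and F1 can be derived directly from a confusion matrix. The sketch below uses a made-up 3×3 matrix (rows are true classes, columns are predicted classes, following the scikit-learn convention); the helper `per_class_scores` is hypothetical and does not guard against classes with no predictions.

```python
import numpy as np


def per_class_scores(cm):
    """Return per-class precision, recall and F1 from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)            # correct predictions per class
    fp = cm.sum(axis=0) - tp    # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp    # actually the class, but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Hypothetical confusion matrix for three classes.
toy_cm = [[8, 2, 0],
          [1, 5, 4],
          [0, 0, 10]]

precision, recall, f1 = per_class_scores(toy_cm)
print(precision)
print(recall)
print(f1)
```

The F1 values equal the harmonic mean 2·P·R / (P + R) of the corresponding precision and recall.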


In [97]:
# kappa
kappa = cohen_kappa_score(ytest, yhat_classes)
print('Cohens kappa: %f' % kappa)

# confusion matrix
matrix = confusion_matrix(ytest, yhat_classes)
print(matrix)

Cohens kappa: 0.975991
[[136519      4     14     36    411     40      7      5      0      0
    1983      0    111      0      5]
 [   307    182      0      0      0      0      0      0      0      0
       0      0      0      0      0]
 [    18      0  31984      3      1      0      0      0      0      0
       0      0      0      0      0]
 [    10      0      0   2539     22      2      0      0      0      0
       0      0      0      0      0]
 [   391      0      0      2  57138      0      0      0      0      0
       0      0      0      0      0]
 [    14      0      0      1      0   1346     10      0      0      0
       0      0      3      0      0]
 [    43      0      0      0      0      8   1397      0      0      0
       0      0      1      0      0]
 [    29      0      0      0      0      0      0   1948      0      0
       0      5      1      0      0]
 [     0      0      0      0      0      0      0      0      2      0
       0      0      0   