# Unsupervised Pretraining Using Stacked Autoencoders
This notebook illustrates how to train a Network Intrusion Detection System (NIDS) when only a limited number of labeled samples is available. An autoencoder is first trained without supervision on the entire dataset, labeled and unlabeled samples alike; the labels are ignored during this phase.

Pre-training lets the autoencoder learn meaningful representations of the data without relying on labels. We then reuse the pre-trained encoder as the first layers of a new neural network that implements the NIDS. The rationale is that, during pre-training, the initial layers learn general features and patterns present in the traffic, so the classifier stacked on top of them can learn to detect network intrusions from a limited set of labeled examples.

| <img src="./autoencoder_pretraining.png" width="100%">  |
|--|
| The two stages of unsupervised pretraining using stacked Autoencoders|


We use benign traffic and various DDoS attacks from the CIC-DDoS2019 dataset (https://www.unb.ca/cic/datasets/ddos-2019.html).
The network traffic has been pre-processed so that packets are grouped into bi-directional traffic flows using the 5-tuple (source IP, destination IP, source port, destination port, protocol). Each flow is represented by 21 packet-header features computed from at most 1000 packets:

| Feature nr.         | Feature Name |
|---------------------|---------------------|
| 00 | timestamp (mean IAT) | 
| 01 | packet_length (mean)| 
| 02 | IP_flags_df (sum) |
| 03 | IP_flags_mf (sum) |
| 04 | IP_flags_rb (sum) | 
| 05 | IP_frag_off (sum) |
| 06 | protocols (mean) |
| 07 | TCP_length (mean) |
| 08 | TCP_flags_ack (sum) |
| 09 | TCP_flags_cwr (sum) |
| 10 | TCP_flags_ece (sum) |
| 11 | TCP_flags_fin (sum) |
| 12 | TCP_flags_push (sum) |
| 13 | TCP_flags_res (sum) |
| 14 | TCP_flags_reset (sum) |
| 15 | TCP_flags_syn (sum) |
| 16 | TCP_flags_urg (sum) |
| 17 | TCP_window_size (mean) |
| 18 | UDP_length (mean) |
| 19 | ICMP_type (mean) |
| 20 | Packets (counter)|
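
The flow grouping described above can be sketched as follows. This is only an illustration of the idea, not the pre-processing tool used to build the dataset: the helper names (`flow_key`, `summarize_flows`) and the packet dictionary fields are hypothetical, and only three of the 21 features are computed.

```python
from collections import defaultdict

MAX_PACKETS = 1000  # per-flow cap mentioned in the text


def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Direction-independent key: both directions of a conversation map to the same tuple."""
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (low[0], low[1], high[0], high[1], protocol)


def summarize_flows(packets):
    """Group packets into bi-directional flows and compute a few per-flow statistics."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["dst_ip"],
                       pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        if len(flows[key]) < MAX_PACKETS:  # ignore packets beyond the cap
            flows[key].append(pkt)

    summary = {}
    for key, pkts in flows.items():
        summary[key] = {
            "packet_length_mean": sum(p["length"] for p in pkts) / len(pkts),
            "tcp_flags_syn_sum": sum(p["syn"] for p in pkts),
            "packets": len(pkts),
        }
    return summary


# Hypothetical packets: two from client to server, one in the opposite direction
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000, "dst_port": 80,
     "protocol": 6, "length": 60, "syn": 1},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 80, "dst_port": 40000,
     "protocol": 6, "length": 60, "syn": 1},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000, "dst_port": 80,
     "protocol": 6, "length": 120, "syn": 0},
]
flows = summarize_flows(packets)
```

Sorting the two (IP, port) endpoints before building the key is what makes the flow bi-directional: the request and the reply end up in the same group.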

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import time
import pprint
import numpy as np
import random
# Import everything from tensorflow.keras to avoid mixing it with the standalone keras package
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Flatten, Reshape
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import set_random_seed
import matplotlib.pyplot as plt
import tensorflow as tf
from util_functions import *
DATASET_FOLDER = "./DOS2019_Binary"
X_train, y_train = load_dataset(DATASET_FOLDER + "/*" + '-train.hdf5')
X_val, y_val = load_dataset(DATASET_FOLDER + "/*" + '-val.hdf5')
X_test, y_test = load_dataset(DATASET_FOLDER + "/*" + '-test.hdf5')

# disable GPUs for test reproducibility
tf.config.set_visible_devices([], 'GPU')

SEED=1
PATIENCE = 25

random.seed(SEED)
np.random.seed(SEED)
set_random_seed(SEED)

In [None]:
def report_results(Y_true, Y_pred, model_name, data_source, prediction_time):
    # Fraction of samples predicted as DDoS (class 1)
    ddos_rate = '{:04.3f}'.format(sum(Y_pred) / Y_pred.shape[0])

    # Fields that do not depend on the labels, so that `row` is always defined
    row = {'Model': model_name, 'Time': '{:04.3f}'.format(prediction_time),
           'Samples': Y_pred.shape[0], 'DDOS%': ddos_rate}

    if Y_true is not None and len(Y_true.shape) > 0:  # if we have the labels, we can compute the classification accuracy
        Y_true = Y_true.reshape((Y_true.shape[0], 1))
        accuracy = accuracy_score(Y_true, Y_pred)

        f1 = f1_score(Y_true, Y_pred)
        tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred, labels=[0, 1]).ravel()
        tnr = tn / (tn + fp)
        fpr = fp / (fp + tn)
        fnr = fn / (fn + tp)
        tpr = tp / (tp + fn)

        row.update({'Accuracy': '{:05.4f}'.format(accuracy), 'F1Score': '{:05.4f}'.format(f1),
                    'TPR': '{:05.4f}'.format(tpr), 'FPR': '{:05.4f}'.format(fpr),
                    'TNR': '{:05.4f}'.format(tnr), 'FNR': '{:05.4f}'.format(fnr)})

    row['Source'] = data_source
    pprint.pprint(row, sort_dicts=False)

# Supervised Training
In this first step, we implement and train an MLP model for binary traffic classification using supervised learning, assuming that all the samples are labelled. This model is the baseline against which we evaluate the MLP built on the pre-trained encoder.
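
The baseline and the pre-trained model are compared through the rates printed by `report_results`. As a minimal sketch of how these rates follow from the confusion matrix, the helper below (`binary_rates`, a hypothetical name) uses made-up counts rather than results from this notebook, with class 1 standing for DDoS and class 0 for benign traffic.

```python
def binary_rates(tp, fp, tn, fn):
    """Rates derived from the confusion matrix of a binary classifier.

    Assumes every denominator is non-zero, i.e. both classes are present
    and at least one sample is predicted as positive.
    """
    tpr = tp / (tp + fn)          # detected attacks over all attacks
    fpr = fp / (fp + tn)          # benign flows flagged as attacks over all benign flows
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"TPR": tpr, "FPR": fpr, "TNR": 1 - fpr, "FNR": 1 - tpr,
            "F1": f1, "Accuracy": accuracy}


# Made-up counts, for illustration only
rates = binary_rates(tp=90, fp=5, tn=95, fn=10)
```

For an intrusion detector, TPR and FPR are usually read together: a high accuracy can hide a false positive rate that is too high to be usable in practice.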

In [None]:
# MLP model
def create_model():
    model = Sequential(name="mlp", layers=[Input(shape=(X_train.shape[1],)),
                                           Dense(18, activation="relu"),
                                           Dense(12, activation="relu"),
                                           Dense(10, activation="relu"),
                                           Dense(1, activation="sigmoid")])
    model.compile(optimizer=Adam(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()  # summary() prints the table itself and returns None
    return model

mlp_model = create_model()

early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE, restore_best_weights=True)
start_time = time.time()
history = mlp_model.fit(X_train, y_train, epochs=1000, batch_size=64, validation_data=(X_val, y_val), callbacks=[early_stopping])
print("Training time (sec): {:.2f}".format(time.time() - start_time))

In [None]:
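# Predict using the trained model and measure the prediction time
start_time = time.time()
Y_pred = np.squeeze(mlp_model.predict(X_test, batch_size=16) > 0.5, axis=1)
prediction_time = time.time() - start_time
report_results(np.squeeze(y_test), Y_pred, 'MLP', '', prediction_time)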

# Unsupervised pre-training
Here we define an autoencoder whose encoder has the same architecture as the first two hidden layers of the previous MLP model (18 and 12 units). We first perform unsupervised pre-training of the autoencoder using all the samples of the training and validation sets, without their labels.
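
The objective of this phase is reconstruction: the target of each sample is the sample itself, so the labels never enter the loss. A minimal sketch of the per-sample mean squared error, with a hypothetical helper name and made-up values:

```python
def reconstruction_mse(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Made-up feature vector and reconstruction, for illustration only
x = [0.0, 0.5, 1.0, 0.25]
x_hat = [0.0, 0.5, 0.5, 0.75]
loss = reconstruction_mse(x, x_hat)
```

A decoder with a sigmoid output can only produce values between 0 and 1, so this objective is a good fit when the input features are scaled to that range, which is assumed here since the scaling step is not shown in this notebook.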

In [None]:
# Define the architecture of the stacked autoencoder
print("Number of features per sample: ", X_train.shape[1])

stacked_encoder = Sequential(name='Encoder', layers=[Input(shape=(X_train.shape[1],)),
                              Dense(18, activation="relu", name='encoder1'),
                              Dense(12, activation="relu", name='encoder2')])

# The decoder mirrors the encoder; an explicit Input layer replaces the legacy input_shape argument
stacked_decoder = Sequential(name='Decoder', layers=[Input(shape=(12,)),
                              Dense(18, activation="relu"),
                              Dense(X_train.shape[1], activation="sigmoid")])
stacked_ae = Sequential([stacked_encoder, stacked_decoder])

# Compile the model
stacked_ae.compile(optimizer=Adam(learning_rate=0.01), loss='mean_squared_error')
stacked_encoder.summary()  # summary() prints the table itself and returns None
stacked_decoder.summary()

early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE, restore_best_weights=True)
# Train the stacked autoencoder
history = stacked_ae.fit(X_train, X_train, epochs=1000, batch_size=64, shuffle=True, validation_data=(X_val, X_val),callbacks=[early_stopping])
stacked_ae.save("./stacked_ae.h5")

# Training the MLP model by using the pre-trained Autoencoder
In this last step, we perform supervised training of the MLP model with only a small portion of the training set. Thus, we simulate the situation in which only part of the training set is labelled.
To address this limitation, we re-use the pre-trained encoder of the Autoencoder to build the MLP model. During the training process, the encoder layers are not trained (they are "frozen").
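
To get a feeling for what freezing implies, the sketch below counts the parameters of the frozen encoder and of the new layers trained on top of it. It assumes 21 input features, one per entry of the feature table; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Number of weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out


def count_params(layer_sizes):
    """Total parameters of a stack of dense layers given its layer widths."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


N_FEATURES = 21  # assumption: one input per feature listed in the table

frozen = count_params([N_FEATURES, 18, 12])  # pre-trained encoder, not updated
trainable = count_params([12, 10, 1])        # new layers trained on the labelled subset
```

Under this assumption the encoder holds 624 parameters and the new layers 141, so the small labelled set only has to fit a fraction of the network.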

In [None]:
# Simulate that only part of the training set is labelled: keep 10% of the training samples
X_train_small, _, y_train_small, _ = train_test_split(X_train, y_train, train_size=0.1, shuffle=True, random_state=SEED)
# Split the labelled subset into 90% training and 10% validation
X_train_small, X_val_small, y_train_small, y_val_small = train_test_split(X_train_small, y_train_small, train_size=0.9, shuffle=True, random_state=SEED)

# Freeze the encoder layers
for layer in stacked_encoder.layers:
    layer.trainable = False

mlp_pretrained = Sequential([stacked_encoder, 
                             Dense(10, activation='relu'), 
                             Dense(1, activation='sigmoid')])
mlp_pretrained.compile(optimizer=Adam(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE, restore_best_weights=True)
history = mlp_pretrained.fit(X_train_small, y_train_small, epochs=1000, batch_size=64, validation_data=(X_val_small, y_val_small), callbacks= [early_stopping])
mlp_pretrained.save("./mlp_pretrained.h5")

In [None]:
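# Predict using the model built on the pre-trained encoder and measure the prediction time
start_time = time.time()
Y_pred = np.squeeze(mlp_pretrained.predict(X_test, batch_size=16) > 0.5, axis=1)
prediction_time = time.time() - start_time
report_results(np.squeeze(y_test), Y_pred, 'MLP pre-trained', '', prediction_time)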