# Anomaly detection with Stacked Autoencoders
In this laboratory, you will implement and tune the hyperparameters of a Stacked Autoencoder for network anomaly detection. The autoencoder must be trained and validated using benign traffic samples and then tested on network anomalies (in this case, network attacks). 
You will tune the model using Grid or Random search in order to find the best hyperparameters for this task. In particular, you will focus on *learning rate*, size of the *coding layer*, *batch size* and *optimizer*.

| <img src="./autoencoder.png" width="80%">  |
|--|
| Architecture of a stacked autoencoder|

We will use benign traffic and various DDoS attacks from the CIC-DDoS2019 dataset (https://www.unb.ca/cic/datasets/ddos-2019.html).
The network traffic has been pre-processed so that packets are grouped into bi-directional traffic flows using the 5-tuple (source IP, destination IP, source port, destination port, protocol). Each flow is represented by 21 packet-header features computed from at most 1000 packets:

| Feature nr.         | Feature Name |
|---------------------|---------------------|
| 00 | timestamp (mean IAT) | 
| 01 | packet_length (mean)| 
| 02 | IP_flags_df (sum) |
| 03 | IP_flags_mf (sum) |
| 04 | IP_flags_rb (sum) | 
| 05 | IP_frag_off (sum) |
| 06 | protocols (mean) |
| 07 | TCP_length (mean) |
| 08 | TCP_flags_ack (sum) |
| 09 | TCP_flags_cwr (sum) |
| 10 | TCP_flags_ece (sum) |
| 11 | TCP_flags_fin (sum) |
| 12 | TCP_flags_push (sum) |
| 13 | TCP_flags_res (sum) |
| 14 | TCP_flags_reset (sum) |
| 15 | TCP_flags_syn (sum) |
| 16 | TCP_flags_urg (sum) |
| 17 | TCP_window_size (mean) |
| 18 | UDP_length (mean) |
| 19 | ICMP_type (mean) |
| 20 | Packets (counter)|
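
The bi-directional grouping described above can be illustrated with a small sketch. This is not the pre-processing code used to build the dataset: the helper names (`flow_key`, `group_packets`), the packet dictionary layout, and the choice of sorting the two endpoints are assumptions made only to show how both directions of a conversation can end up in the same flow.

```python
from collections import defaultdict

def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Return the same key for both directions of a flow (hypothetical helper)."""
    endpoint_a = (src_ip, src_port)
    endpoint_b = (dst_ip, dst_port)
    # Sorting the two endpoints maps A->B and B->A to a single key
    low, high = sorted([endpoint_a, endpoint_b])
    return (low, high, protocol)

def group_packets(packets, max_packets=1000):
    """Group packets into bi-directional flows, keeping at most max_packets each."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["dst_ip"],
                       pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        if len(flows[key]) < max_packets:
            flows[key].append(pkt)
    return dict(flows)

# Toy packets (made-up values): a TCP request, its reply, and an unrelated UDP packet
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000, "dst_port": 80, "protocol": 6, "length": 60},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 80, "dst_port": 40000, "protocol": 6, "length": 1500},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 5353, "dst_port": 53, "protocol": 17, "length": 80},
]
flows = group_packets(packets)

# A per-flow feature such as packet_length (mean) is then an aggregate over the flow
tcp_key = flow_key("10.0.0.1", "10.0.0.2", 40000, 80, 6)
mean_length = sum(p["length"] for p in flows[tcp_key]) / len(flows[tcp_key])
```

The request and its reply share one key, so the toy capture yields two flows, and the per-flow features in the table are aggregates (mean, sum, counter) over the packets of each flow.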

**IMPORTANT**: The traffic features of the dataset used in this laboratory have already been normalised between 0 and 1. Since the *Sigmoid* function also produces values in this range, you can use it as the activation function of the output layer.
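
To see why the two ranges match, the following NumPy sketch applies min-max scaling to a toy feature matrix and evaluates the sigmoid on a few inputs. The dataset provided for this laboratory is already normalised, so the `min_max_normalise` helper and the toy values are purely illustrative assumptions, not the procedure used to prepare the data.

```python
import numpy as np

def min_max_normalise(features, feature_min, feature_max):
    """Scale each column to [0, 1] given its minimum and maximum (illustrative helper)."""
    # Avoid division by zero for constant columns
    span = np.where(feature_max > feature_min, feature_max - feature_min, 1.0)
    return np.clip((features - feature_min) / span, 0.0, 1.0)

def sigmoid(z):
    """Logistic function, whose output always lies strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy matrix: 3 flows x 2 features (made-up values)
raw = np.array([[60.0, 2.0], [1500.0, 10.0], [780.0, 6.0]])
normalised = min_max_normalise(raw, raw.min(axis=0), raw.max(axis=0))

# Whatever the pre-activation value, the output stays inside (0, 1)
outputs = sigmoid(np.array([-5.0, 0.0, 5.0]))
```

Because both the targets and the network outputs live in the same interval, the reconstruction error is not inflated by a mismatch of ranges.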

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import random
import numpy as np
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Flatten, Reshape
from tensorflow.keras.callbacks import EarlyStopping
# Note: this scikit-learn wrapper is only available in older TensorFlow/Keras releases
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from tensorflow.keras.optimizers import Adam, SGD
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from tensorflow.keras.utils import set_random_seed
import matplotlib.pyplot as plt
from scipy.stats import uniform, randint
from util_functions import *
DATASET_FOLDER = "./DOS2019_Anomaly_Flatten"
X_train, _ = load_dataset(DATASET_FOLDER + "/*" + '-benign-train.hdf5')
X_val, _ = load_dataset(DATASET_FOLDER + "/*" + '-benign-val.hdf5')
X_test, _ = load_dataset(DATASET_FOLDER + "/*" + '-benign-test.hdf5')
X_test_anomalies, _ = load_dataset(DATASET_FOLDER + "/*" + '-anomaly-test.hdf5')

SEED = 0
PATIENCE = 25

random.seed(SEED)
np.random.seed(SEED)
set_random_seed(SEED)

# Model definition
Define the Stacked Autoencoder here by adding the missing hidden layers to both *encoder* and *decoder*. Keep in mind that the input shape of the decoder must be the same as the encoder's output shape (the *coding_layer_size*).
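
The shape constraint mentioned above can be traced with a NumPy-only sketch, independent of Keras. It pushes a random batch through an encoder and a mirrored decoder with random, untrained weights, so it shows only how the layer sizes fit together. The `dense` helper, the ReLU activation in the hidden layers, and the specific sizes are assumptions for illustration, not the expected solution.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(inputs, n_out, activation):
    """Apply one fully connected layer with random, untrained weights."""
    weights = rng.normal(scale=0.1, size=(inputs.shape[1], n_out))
    bias = np.zeros(n_out)
    z = inputs @ weights + bias
    if activation == "relu":
        return np.maximum(z, 0.0)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

n_features, hidden_units, coding_layer_size = 21, 10, 2
batch = rng.random((4, n_features))  # 4 fake flows with 21 features each

# Encoder: 21 -> hidden_units -> coding_layer_size
hidden_enc = dense(batch, hidden_units, "relu")
codings = dense(hidden_enc, coding_layer_size, "relu")

# Decoder: its input size equals the encoder's output size (coding_layer_size)
hidden_dec = dense(codings, hidden_units, "relu")
reconstruction = dense(hidden_dec, n_features, "sigmoid")
```

The reconstruction has the same shape as the input batch, which is what allows the mean squared error between the two to be computed.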

In [None]:
def create_model(hidden_units=10, coding_layer_size=2, learning_rate=0.001, optimizer=SGD):
    stacked_encoder = Sequential(name='Encoder',layers=[Input(shape=(X_train.shape[1],)), 
                              ### ADD YOUR CODE HERE ###


                              ##########################
                              ]) 

    stacked_decoder = Sequential(name='Decoder',layers=[ 
                            ### ADD YOUR CODE HERE ###


                            ##########################
                            ]) 
    stacked_ae = Sequential([stacked_encoder, stacked_decoder])

    # Compile the model
    stacked_ae.compile(optimizer=optimizer(learning_rate=learning_rate), loss='mean_squared_error')
    # summary() prints the layer table itself and returns None
    stacked_encoder.summary()
    stacked_decoder.summary()
    return stacked_ae


# Training and tuning the Stacked Autoencoder for anomaly detection
First, implement hyperparameter tuning for your autoencoder. Focus on *learning rate*, size of the *coding layer*, *batch size* and *optimizer*.
Then, configure *early stopping* and train the autoencoder using *random search*. 
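
The mechanics behind these two techniques can be sketched in plain Python, without scikit-learn or Keras. Random search draws each hyperparameter independently from a chosen distribution, and patience-based early stopping halts training once the validation loss has not improved for a given number of consecutive epochs. The value ranges, the helper names, and the list of validation losses below are illustrative assumptions, not recommended settings.

```python
import random

def sample_configuration(rng):
    """Draw one hyperparameter configuration (ranges are illustrative assumptions)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "coding_layer_size": rng.randint(2, 10),     # inclusive on both ends
        "batch_size": rng.choice([64, 128, 256]),
        "optimizer": rng.choice(["SGD", "Adam"]),
    }

def stopping_epoch(val_losses, patience):
    """Return the epoch index at which patience-based early stopping halts."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # patience never exhausted: train to the end

rng = random.Random(0)
configurations = [sample_configuration(rng) for _ in range(5)]

# Made-up validation losses: the improvement at the last epoch is never reached
halt = stopping_epoch([0.9, 0.5, 0.4, 0.41, 0.42, 0.43, 0.3], patience=3)
```

In the laboratory, `RandomizedSearchCV` and the `EarlyStopping` callback take care of these steps; the sketch only shows what they do conceptually. Note the trade-off in the last line: a small patience saves time but can stop before a late improvement.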

In [None]:
# Define the architecture of the stacked autoencoder
print ("Shape of the samples: ", X_train.shape[1])

# Wrap create_model in a KerasRegressor so that it can be used with scikit-learn's search tools
stacked_ae = KerasRegressor(build_fn=create_model, batch_size=128, verbose=1)

# Define the hyperparameters to tune and their possible values
param_dist = {
    ### ADD YOUR CODE HERE ###


    ##########################
}

### ADD YOUR CODE HERE ###

random_search = ### DEFINE RANDOM SEARCH HERE ###
early_stopping = ### DEFINE EARLY STOPPING HERE

# Train the stacked autoencoder
search_result = ### RUN HYPERPARAMETER TUNING WITH RANDOM SEARCH AND EARLY STOPPING ###

##########################

# Print the best hyperparameters found by the search
print("Best parameters found: ", search_result.best_params_)

# Save the best model for later
best_model = search_result.best_estimator_

# Test the autoencoder
In this last step, you can evaluate the autoencoder on unseen data. In particular, you can test the ability of the model to detect network anomalies and its sensitivity to benign outliers by measuring the False Positive Rate.

An important parameter here is the **anomaly threshold, defined as the sum of mean and standard deviation of the reconstruction error measured on the benign validation data**.
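
In symbols, using notation introduced here only for illustration ($N$ benign validation samples, $d$ features, $x_{ij}$ the $j$-th feature of sample $i$ and $\hat{x}_{ij}$ its reconstruction), the threshold $\tau$ can be written as:

$$
e_i = \frac{1}{d}\sum_{j=1}^{d}\left(x_{ij}-\hat{x}_{ij}\right)^2,\qquad
\mu_e = \frac{1}{N}\sum_{i=1}^{N} e_i,\qquad
\sigma_e = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(e_i-\mu_e\right)^2},\qquad
\tau = \mu_e + \sigma_e
$$

A sample is flagged as anomalous when its reconstruction error exceeds $\tau$.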

**DEFINITION**: The **standard deviation** is a measure of the amount of variability or spread in a set of data values. It indicates how much individual data points differ from the mean (average) of the data set. A low standard deviation means that the data points are close to the mean, while a high standard deviation indicates that they are spread out over a wider range of values.

By setting the threshold to the mean plus the standard deviation, we aim to ensure that most benign samples (which are expected to produce low error) are classified correctly.
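
The effect of this choice can be explored with a small NumPy sketch. The multiplier `k` is an assumption added here for illustration (`k=1` reproduces the mean-plus-standard-deviation rule above), and the error values are made up to show how a single benign outlier influences the threshold.

```python
import numpy as np

def anomaly_threshold(validation_errors, k=1.0):
    """Mean plus k standard deviations of the validation reconstruction errors."""
    return float(np.mean(validation_errors) + k * np.std(validation_errors))

def flagged_fraction(errors, threshold):
    """Fraction of samples whose reconstruction error exceeds the threshold."""
    return float(np.mean(errors > threshold))

# Made-up reconstruction errors: four typical benign flows and one benign outlier
validation_errors = np.array([0.01, 0.02, 0.03, 0.02, 0.5])

threshold = anomaly_threshold(validation_errors)          # k = 1
benign_flagged = flagged_fraction(validation_errors, threshold)

# A larger multiplier raises the threshold and flags fewer benign samples
stricter_threshold = anomaly_threshold(validation_errors, k=2.0)
benign_flagged_stricter = flagged_fraction(validation_errors, stricter_threshold)
```

Raising the threshold lowers the false positive rate, but attacks whose reconstruction error falls below the new threshold would then go undetected, so the choice is a trade-off.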

In [None]:
# Compute the anomaly threshold using the error on the validation data
reconstructed_benign_validation = best_model.predict(X_val)
reconstruction_error_benign_validation = np.mean(np.square(X_val - reconstructed_benign_validation), axis=1)
# Set a threshold for anomaly detection (adjust as needed)
anomaly_threshold = np.mean(reconstruction_error_benign_validation) + np.std(reconstruction_error_benign_validation)

# Evaluate the model on unseen benign and anomalous traffic
reconstructed_benign_test = best_model.predict(X_test)
reconstructed_anomalies = best_model.predict(X_test_anomalies)

# Calculate reconstruction errors on unseen data
reconstruction_error_benign_test = np.mean(np.square(X_test - reconstructed_benign_test), axis=1)
reconstruction_error_anomalies = np.mean(np.square(X_test_anomalies - reconstructed_anomalies), axis=1)

# Identify anomalies
false_positives = np.where(reconstruction_error_benign_test > anomaly_threshold)[0]
anomalies = np.where(reconstruction_error_anomalies > anomaly_threshold)[0]

# Print the indices of detected anomalies
print("Detected anomalies:", anomalies)
print("Detection rate (true positive rate): ", float(len(anomalies))/X_test_anomalies.shape[0])
print("False positive rate: ", float(len(false_positives))/X_test.shape[0])