# Anomaly detection with Autoencoders
In this lab, we implement and configure an Autoencoder for network anomaly detection. The autoencoder is trained and validated on samples of benign traffic, then tested on network anomalies (in this case, cyber attacks).
To find the best Autoencoder configuration, we use the ```Grid Search``` technique.

| <img src="./autoencoder.png" width="80%">  |
|--|
| Autoencoder architecture |

We train the Autoencoder on benign network traffic from the CIC-DDoS2019 dataset of the University of New Brunswick. The traffic has been pre-processed so that packets are grouped into bidirectional traffic flows using the 5-tuple (source IP, destination IP, source port, destination port, protocol). Each flow is represented by 21 packet-header features (attributes) computed from at most 1000 packets:

| Feature no. | Feature name |
|---------------------|---------------------|
| 00 | timestamp (mean IAT) | 
| 01 | packet_length (mean)| 
| 02 | IP_flags_df (sum) |
| 03 | IP_flags_mf (sum) |
| 04 | IP_flags_rb (sum) | 
| 05 | IP_frag_off (sum) |
| 06 | protocols (mean) |
| 07 | TCP_length (mean) |
| 08 | TCP_flags_ack (sum) |
| 09 | TCP_flags_cwr (sum) |
| 10 | TCP_flags_ece (sum) |
| 11 | TCP_flags_fin (sum) |
| 12 | TCP_flags_push (sum) |
| 13 | TCP_flags_res (sum) |
| 14 | TCP_flags_reset (sum) |
| 15 | TCP_flags_syn (sum) |
| 16 | TCP_flags_urg (sum) |
| 17 | TCP_window_size (mean) |
| 18 | UDP_length (mean) |
| 19 | ICMP_type (mean) |
| 20 | Packets (counter)|

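The grouping of packets into bidirectional flows described above can be illustrated with a minimal sketch. This is not the preprocessing tool used for the dataset: the packet tuples, the helper names `flow_key` and `group_into_flows`, and the choice to ignore packets beyond the first 1000 of a flow are all assumptions made for illustration.

```python
from collections import defaultdict

def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Return a direction-independent key for a packet's 5-tuple."""
    # Sort the two (IP, port) endpoints so that A->B and B->A share one key.
    endpoint_a, endpoint_b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (endpoint_a, endpoint_b, protocol)

def group_into_flows(packets, max_packets=1000):
    """Group packets by bidirectional 5-tuple, keeping at most max_packets per flow."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(*pkt[:5])
        # Assumption: packets beyond the limit are simply ignored.
        if len(flows[key]) < max_packets:
            flows[key].append(pkt)
    return dict(flows)

# Hypothetical packets: (src_ip, dst_ip, src_port, dst_port, protocol, length)
packets = [
    ("10.0.0.1", "10.0.0.2", 5000, 80, 6, 60),    # TCP request
    ("10.0.0.2", "10.0.0.1", 80, 5000, 6, 1500),  # TCP reply, same flow
    ("10.0.0.3", "10.0.0.2", 6000, 53, 17, 80),   # separate UDP flow
]

flows = group_into_flows(packets)

# One per-flow feature in the spirit of "packet_length (mean)"
mean_length = {key: sum(p[5] for p in pkts) / len(pkts) for key, pkts in flows.items()}
```

The request and the reply end up in the same flow because the key does not depend on direction, so per-flow statistics such as the mean packet length cover both directions.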

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Corso di Algoritmi di Machine Learning per la rilevazione di attacchi informatici
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import random
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform, randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Import everything from tensorflow.keras: mixing `keras` and `tensorflow.keras` can cause errors
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Flatten, Reshape
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import set_random_seed
# Note: this scikit-learn wrapper is not available in recent TensorFlow/Keras releases,
# so an older TensorFlow 2.x version is assumed here
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from util_functions import *

# Load the benign training/validation/test sets and the anomalous test set
DATASET_FOLDER = "./DOS2019_Anomaly_Flatten"
X_train, _ = load_dataset(DATASET_FOLDER + "/*" + '-benign-train.hdf5')
X_val, _ = load_dataset(DATASET_FOLDER + "/*" + '-benign-val.hdf5')
X_test, _ = load_dataset(DATASET_FOLDER + "/*" + '-benign-test.hdf5')
X_test_anomalies, _ = load_dataset(DATASET_FOLDER + "/*" + '-anomaly-test.hdf5')

SEED = 0

random.seed(SEED)
np.random.seed(SEED)
set_random_seed(SEED)

# Defining the Autoencoder model
In the next cell, we implement the Autoencoder by defining both the *encoder* and the *decoder*.
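
To make the roles of the two halves concrete, here is a minimal NumPy sketch of the forward pass of an encoder and a decoder with the same layer layout as the model in this lab. The weights are random and untrained, the input is synthetic, and the layer sizes are only the defaults of `create_model`; it illustrates the shapes involved, not the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, hidden_units, coding_layer_size = 21, 10, 2

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense(n_in, n_out):
    """Random (untrained) weights and zero biases for one dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

# Encoder: n_features -> hidden_units -> coding_layer_size
W1, b1 = dense(n_features, hidden_units)
W2, b2 = dense(hidden_units, coding_layer_size)
# Decoder: coding_layer_size -> hidden_units -> n_features
W3, b3 = dense(coding_layer_size, hidden_units)
W4, b4 = dense(hidden_units, n_features)

def encode(x):
    """Compress each flow into coding_layer_size values."""
    return relu(relu(x @ W1 + b1) @ W2 + b2)

def decode(z):
    """Rebuild all features from the compressed code."""
    return sigmoid(relu(z @ W3 + b3) @ W4 + b4)

X = rng.random((5, n_features))      # 5 synthetic flows with values in [0, 1)
codes = encode(X)                    # shape (5, 2)
reconstruction = decode(codes)       # shape (5, 21)

# Per-flow reconstruction error (mean squared error over the features)
mse_per_flow = np.mean(np.square(X - reconstruction), axis=1)
print(codes.shape, reconstruction.shape, mse_per_flow.shape)
```

Training adjusts the weights so that this reconstruction error becomes small on benign flows. The narrow coding layer forces the network to keep only the structure that benign traffic has in common.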

In [None]:
def create_model(hidden_units=10, coding_layer_size=2, learning_rate=0.001, optimizer=Adam):
    """Build and compile a stacked autoencoder (encoder followed by decoder)."""
    n_features = X_train.shape[1]
    stacked_encoder = Sequential(name='Encoder', layers=[
        Input(shape=(n_features,)),
        ### ADD YOUR CODE HERE ###
        Dense(hidden_units, activation="relu"),
        Dense(coding_layer_size, activation="relu")
        ##########################
    ])

    stacked_decoder = Sequential(name='Decoder', layers=[
        ### ADD YOUR CODE HERE ###
        Dense(hidden_units, activation="relu", input_shape=[coding_layer_size]),
        # Sigmoid output: this assumes the input features are scaled to [0, 1]
        Dense(n_features, activation="sigmoid")
        ##########################
    ])
    stacked_ae = Sequential([stacked_encoder, stacked_decoder])

    # Compile the model
    stacked_ae.compile(optimizer=optimizer(learning_rate=learning_rate), loss='mean_squared_error')
    # summary() prints the layer table itself and returns None, so it is not wrapped in print()
    stacked_encoder.summary()
    stacked_decoder.summary()
    return stacked_ae


# Training the Autoencoder
The autoencoder is configured by searching for the optimal number of neurons in the compressed representation of benign network flows (the ```coding layer```) and the number of neurons in the other layers (```hidden_units```).
Set a value for ```PATIENCE``` and fill the square brackets in the cell below with comma-separated lists of integers.
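
As a rough illustration of what a grid search evaluates, the sketch below enumerates every combination of two parameter lists using only the standard library. The values in `example_grid` are arbitrary placeholders, not suggested answers for the exercise. The fit count assumes scikit-learn's default behaviour of refitting the best configuration once on the full training set.

```python
from itertools import product

# Arbitrary illustrative values, not a recommended configuration
example_grid = {
    "hidden_units": [8, 16],
    "coding_layer_size": [2, 4, 6],
}
cv = 2  # number of cross-validation folds, as in the cell below

# Cartesian product of all parameter lists: one dict per configuration
names = list(example_grid)
combinations = [dict(zip(names, values)) for values in product(*example_grid.values())]

# Each configuration is trained once per fold, plus one final refit of the best one
n_fits = len(combinations) * cv + 1

for config in combinations:
    print(config)
print("Configurations:", len(combinations), "- model fits:", n_fits)
```

The number of trainings grows with the product of the list lengths, so short lists keep the search time manageable.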

In [None]:
# Wrap create_model in a scikit-learn compatible KerasRegressor so it can be used with GridSearchCV
stacked_ae = KerasRegressor(build_fn=create_model, batch_size=128, verbose=1)

PATIENCE = 

param_dist = {
    'hidden_units': [],
    'coding_layer_size': []
    ##########################
}


grid_search = GridSearchCV(estimator=stacked_ae, param_grid=param_dist, cv=2)
early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE, restore_best_weights=True)
# Train one autoencoder per configuration (input and target are both X_train)
search_result = grid_search.fit(X_train, X_train, epochs=100, validation_data=(X_val, X_val), callbacks=[early_stopping])

# Print the best hyperparameter configuration found by the grid search
print("Best configuration: ", search_result.best_params_)

# Save the best model for later
best_model = search_result.best_estimator_

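The effect of the patience value used for early stopping during training can be sketched in a few lines of plain Python. This is a simplified model of the rule (stop after `patience` consecutive epochs without a new best validation loss), not the Keras implementation. The function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs bring no improvement
    over the best loss seen so far. Epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never exhausted: training runs through all epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.35, 0.36, 0.37]

print(early_stopping_epoch(losses, patience=2))  # stops before the later improvement
print(early_stopping_epoch(losses, patience=3))  # waits long enough to reach it
```

A small patience can stop training before a later improvement is reached, while a large one spends more epochs per configuration. With `restore_best_weights=True`, the weights from the best epoch are kept rather than those of the last epoch.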
# Testing the Autoencoder
In this last step, we evaluate the autoencoder on unseen data. In particular, we test the model's ability to detect network anomalies and its sensitivity to benign outliers by measuring the **False Positive** Rate.
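
The thresholding idea can be tried on a toy example before running it on the real model. The sketch below uses hand-written reconstruction errors instead of model output, and the helper `detection_rates` with its `n_std` parameter is an illustrative assumption: it sets the threshold to the mean validation error plus `n_std` standard deviations.

```python
import numpy as np

def detection_rates(val_errors, benign_test_errors, anomaly_errors, n_std=1.0):
    """Return (threshold, false_positive_rate, detection_rate)."""
    # Threshold derived from benign validation errors only
    threshold = np.mean(val_errors) + n_std * np.std(val_errors)
    # Fraction of benign test flows wrongly flagged as anomalous
    fpr = np.mean(benign_test_errors > threshold)
    # Fraction of anomalous flows correctly flagged
    tpr = np.mean(anomaly_errors > threshold)
    return threshold, fpr, tpr

# Made-up reconstruction errors, for illustration only
val_errors = np.array([0.01, 0.02, 0.03, 0.02, 0.02])
benign_test_errors = np.array([0.01, 0.02, 0.05, 0.02])
anomaly_errors = np.array([0.20, 0.02, 0.30, 0.15])

threshold, fpr, tpr = detection_rates(val_errors, benign_test_errors, anomaly_errors)
print("Threshold:", threshold)
print("False positive rate:", fpr)
print("Detection rate:", tpr)
```

A higher `n_std` raises the threshold, which tends to reduce false positives and to miss more anomalies whose reconstruction error is close to that of benign traffic.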

In [None]:
# Compute the anomaly threshold using the reconstruction error on the validation data
reconstructed_benign_validation = best_model.predict(X_val)
reconstruction_error_benign_validation = np.mean(np.square(X_val - reconstructed_benign_validation), axis=1)
# Set a threshold for anomaly detection (adjust as needed)
anomaly_threshold = np.mean(reconstruction_error_benign_validation) + np.std(reconstruction_error_benign_validation)

# Evaluate the model on unseen benign and anomalous traffic
reconstructed_benign_test = best_model.predict(X_test)
reconstructed_anomalies = best_model.predict(X_test_anomalies)

# Calculate reconstruction errors on unseen data
reconstruction_error_benign_test = np.mean(np.square(X_test - reconstructed_benign_test), axis=1)
reconstruction_error_anomalies = np.mean(np.square(X_test_anomalies - reconstructed_anomalies), axis=1)

# Identify anomalies
false_positives = np.where(reconstruction_error_benign_test > anomaly_threshold)[0]
anomalies = np.where(reconstruction_error_anomalies > anomaly_threshold)[0]

# Print the indices of detected anomalies, the detection rate and the false positive rate
print("Detected anomalies:", anomalies)
print("Anomaly detection rate (TPR): ", float(len(anomalies))/X_test_anomalies.shape[0])
print("False positive rate: ", float(len(false_positives))/X_test.shape[0])