# System Deployment
This laboratory encompasses the entire life-cycle of a NIDS. It spans from the ANN model design, through the phases of model training and testing, culminating in the deployment onto the target machine. Follow the steps of the laboratory.

## Model implementation
1. Implement an MLP model from scratch
2. Define the ranges of hyperparameters and execute training
3. Execute hyperparameter tuning
4. Test the trained model on the test set and on one or more pcap files using the notebook

## Model deployment
1. Export the notebook to a stand-alone Python script.
2. Execute the script using the arguments ```--model``` and ```--predict``` to indicate the paths to the trained model and to the folder with the test set respectively. 
3. Execute the script using the arguments ```--model``` and ```--predict_live``` to indicate the paths to the trained model and to a ```pcap``` file respectively.
4. 
    (a) Execute the script using the arguments ```--model``` and ```--predict_live lo```, where ```lo``` is the loopback interface (special network interface that the system uses to communicate with itself). 
    (b) In another terminal, simulate a network attack by injecting the traffic traces in the ```DOS2019_Binary_5_Attacks_PCAPs``` folder into the ```lo``` interface using ```tcpreplay -i lo traffic_trace.pcap```

You will use a dataset of benign and various DDoS attacks from the CIC-DDoS2019 dataset (https://www.unb.ca/cic/datasets/ddos-2019.html).
The network traffic has been pre-processed so that packets are grouped into bi-directional traffic flows using the 5-tuple (source IP, destination IP, source port, destination port, protocol). Each flow is represented by 21 packet-header features computed from at most 1000 packets:

| Feature nr.         | Feature Name |
|---------------------|---------------------|
| 00 | timestamp (mean IAT) | 
| 01 | packet_length (mean)| 
| 02 | IP_flags_df (sum) |
| 03 | IP_flags_mf (sum) |
| 04 | IP_flags_rb (sum) | 
| 05 | IP_frag_off (sum) |
| 06 | protocols (mean) |
| 07 | TCP_length (mean) |
| 08 | TCP_flags_ack (sum) |
| 09 | TCP_flags_cwr (sum) |
| 10 | TCP_flags_ece (sum) |
| 11 | TCP_flags_fin (sum) |
| 12 | TCP_flags_push (sum) |
| 13 | TCP_flags_res (sum) |
| 14 | TCP_flags_reset (sum) |
| 15 | TCP_flags_syn (sum) |
| 16 | TCP_flags_urg (sum) |
| 17 | TCP_window_size (mean) |
| 18 | UDP_length (mean) |
| 19 | ICMP_type (mean) |
| 20 | Packets (counter)|
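
The grouping step described before the table can be sketched as follows. This is a minimal illustration, not the pre-processing code used to build the dataset: the `Packet` record, its field names, and the three features computed here are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Hypothetical packet record; the field names are assumptions for this sketch.
@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    length: int
    syn: int  # 1 if the TCP SYN flag is set, 0 otherwise

MAX_FLOW_LEN = 1000  # at most 1000 packets contribute to a flow's features

def flow_key(pkt):
    """Return a direction-independent key built from the packet's 5-tuple."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    # Sorting the two endpoints maps both directions to the same key.
    first, second = sorted([a, b])
    return (first, second, pkt.protocol)

def group_flows(packets):
    """Group packets into bi-directional flows, capped at MAX_FLOW_LEN packets."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt)
        if len(flows[key]) < MAX_FLOW_LEN:
            flows[key].append(pkt)
    return flows

def flow_features(pkts):
    """Compute three of the per-flow features (a mean, a sum, and a counter)."""
    return {
        "packet_length_mean": mean(p.length for p in pkts),
        "TCP_flags_syn_sum": sum(p.syn for p in pkts),
        "packets": len(pkts),
    }

packets = [
    Packet("10.0.0.1", "10.0.0.2", 1234, 80, 6, 60, 1),
    Packet("10.0.0.2", "10.0.0.1", 80, 1234, 6, 60, 1),   # reverse direction, same flow
    Packet("10.0.0.1", "10.0.0.2", 1234, 80, 6, 120, 0),
    Packet("10.0.0.3", "10.0.0.2", 5555, 53, 17, 90, 0),  # a separate UDP flow
]
flows = group_flows(packets)
features = {key: flow_features(pkts) for key, pkts in flows.items()}
```

The first three packets end up in one flow because both directions of the connection map to the same key.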

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import time
import argparse
import pyshark
import numpy as np
import pprint
from scipy.stats import uniform, randint
import tensorflow as tf
import matplotlib.pyplot as plt
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
from tensorflow.keras.models import Sequential,load_model, save_model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Conv2D, GlobalMaxPooling2D, Flatten, Dropout
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import set_random_seed
from lucid_dataset_parser import *
from util_functions import *

# We need the following to get around “RuntimeError: This event loop is already running” when using Pyshark within Jupyter notebooks.
# Not needed in stand-alone Python projects
import nest_asyncio
nest_asyncio.apply()  

EPOCHS = 100
TEST_ITERATIONS=100

# disable GPUs for test reproducibility
tf.config.set_visible_devices([], 'GPU')

SEED = 0
np.random.seed(SEED)
set_random_seed(SEED)

# Argument parser
The following cell defines the arguments accepted by the NIDS. The arguments ```--train``` and ```--predict``` indicate the folder with the dataset, from which the script will load the ```hdf5``` files for training and testing respectively. The argument ```--predict_live``` can be used to perform predictions on live traffic captured from a network interface (e.g., ```eth0``` or ```lo```) or on a pre-recorded traffic trace (e.g., ```ddos-chunk.pcap```). In both cases the argument is a string: the name of the interface in the first case, the path to the ```pcap``` file in the second. The argument ```--model``` indicates the path to a trained model that will be used to make predictions (with ```--predict``` or ```--predict_live```).
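
As a small illustration of how these arguments behave, the sketch below parses two example command lines and classifies the ```--predict_live``` value by its file extension, which is the same check used later in ```predict_live```. The helper names ```build_parser``` and ```source_kind``` are hypothetical and only serve this example.

```python
import argparse

def build_parser():
    """Build a reduced parser with the prediction-related arguments only."""
    parser = argparse.ArgumentParser(description="Argument handling sketch")
    parser.add_argument('-p', '--predict', type=str, default=None)
    parser.add_argument('-pl', '--predict_live', type=str, default=None)
    parser.add_argument('-m', '--model', type=str, default=None)
    return parser

def source_kind(source):
    """Classify a --predict_live value: a pcap file path or an interface name."""
    return "pcap" if source.endswith('.pcap') else "interface"

parser = build_parser()

# Equivalent to: python script.py --model ./nids_model.h5 --predict_live lo
live_args = parser.parse_args(['--model', './nids_model.h5', '--predict_live', 'lo'])

# Equivalent to: python script.py --model ./nids_model.h5 --predict_live ddos-chunk.pcap
file_args = parser.parse_args(['--model', './nids_model.h5', '--predict_live', 'ddos-chunk.pcap'])

print(source_kind(live_args.predict_live))
print(source_kind(file_args.predict_live))
```

Arguments that are not given on the command line keep their default value of ```None```, which is why the functions in this notebook first check their inputs against ```None```.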

In [None]:
parser = argparse.ArgumentParser(description="A DL-based NIDS for DDoS attack detection")
parser.add_argument('-t', '--train', nargs='?', type=str,  default=None, help="Start the training process")
parser.add_argument('-p', '--predict', nargs='?', type=str,  default=None, help="Perform a prediction on pre-preprocessed data")
parser.add_argument('-pl', '--predict_live', nargs='?', type=str, default=None, help='Perform a prediction on live traffic or on a pre-recorded traffic trace in pcap format')
parser.add_argument('-m', '--model', type=str, default = None, help='File containing the model in h5 format')

args, unknown = parser.parse_known_args()
print("see all args:", args)

In [None]:
def report_results(Y_true, Y_pred, model_name, data_source, prediction_time):
    ddos_rate = '{:04.3f}'.format(sum(Y_pred) / Y_pred.shape[0])

    if Y_true is not None and len(Y_true.shape) > 0:  # if we have the labels, we can compute the classification accuracy
        Y_true = Y_true.reshape((Y_true.shape[0], 1))
        accuracy = accuracy_score(Y_true, Y_pred)

        f1 = f1_score(Y_true, Y_pred)
        tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred, labels=[0, 1]).ravel()
        tnr = tn / (tn + fp)
        fpr = fp / (fp + tn)
        fnr = fn / (fn + tp)
        tpr = tp / (tp + fn)

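        row = {'Model': model_name, 'Time': '{:04.3f}'.format(prediction_time),
               'Samples': Y_pred.shape[0], 'DDOS%': ddos_rate, 'Accuracy': '{:05.4f}'.format(accuracy), 'F1Score': '{:05.4f}'.format(f1),
               'TPR': '{:05.4f}'.format(tpr), 'FPR': '{:05.4f}'.format(fpr), 'TNR': '{:05.4f}'.format(tnr), 'FNR': '{:05.4f}'.format(fnr), 'Source': data_source}
    else:  # no labels available: report only the fields that do not depend on them
        row = {'Model': model_name, 'Time': '{:04.3f}'.format(prediction_time),
               'Samples': Y_pred.shape[0], 'DDOS%': ddos_rate, 'Accuracy': "N/A", 'F1Score': "N/A",
               'TPR': "N/A", 'FPR': "N/A", 'TNR': "N/A", 'FNR': "N/A", 'Source': data_source}

    pprint.pprint(row, sort_dicts=False)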

# Implement an MLP model
Finalise the model by using all four arguments of ```create_model```. You will use ALL of these arguments to tune the model afterwards.
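
To get a feeling for how ```dense_layers``` and ```hidden_units``` affect the size of the network, the sketch below computes the layer shapes and the number of trainable parameters of a fully connected network. It assumes 21 input features (one per flow feature), ```dense_layers``` hidden layers with ```hidden_units``` neurons each, and a single output unit for binary classification; your own architecture may differ. The helper names are hypothetical.

```python
def mlp_layer_shapes(input_dim=21, dense_layers=4, hidden_units=2):
    """Return (inputs, outputs) for each dense layer, output layer included."""
    shapes = []
    fan_in = input_dim
    for _ in range(dense_layers):
        shapes.append((fan_in, hidden_units))
        fan_in = hidden_units
    shapes.append((fan_in, 1))  # single output unit (assumed)
    return shapes

def mlp_param_count(shapes):
    """Each dense layer has fan_in * fan_out weights plus fan_out biases."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)

# Defaults of create_model: 4 hidden layers with 2 units each
default_shapes = mlp_layer_shapes()
print(default_shapes, mlp_param_count(default_shapes))

# A wider, shallower alternative
wide_shapes = mlp_layer_shapes(dense_layers=2, hidden_units=64)
print(wide_shapes, mlp_param_count(wide_shapes))
```

Comparing the two configurations shows how quickly the parameter count grows with ```hidden_units```, which is worth keeping in mind when choosing the ranges for hyperparameter tuning.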

In [None]:
def create_model(optimizer=SGD, dense_layers=4, hidden_units=2, learning_rate = 0.001):
    model = Sequential(name  = "mlp")
    ### Add YOUR CODE HERE ###
    
    ##########################
    model.summary()
    return model

# Prediction on static test set

In [None]:
def predict(dataset_path, model_path):
    if dataset_path is not None:
        X_test, y_test = load_dataset(dataset_path + "/*" + '-test.hdf5')

        # Only Keras models saved in HDF5 format (.h5) are accepted
        if model_path is None or not model_path.endswith('.h5'):
            print ("No valid model specified!")
            exit(-1)

        model = load_model(model_path)

        pt0 = time.time()
        for i in range(TEST_ITERATIONS):
            Y_pred = np.squeeze(model.predict(X_test, batch_size=16) > 0.5,axis=1)
        pt1 = time.time()
        prediction_time = pt1 - pt0

        report_results(np.squeeze(y_test), Y_pred,  model.name, '', prediction_time/TEST_ITERATIONS)

# Prediction on live traffic
Complete the following cell by printing the identifiers of the DDoS flows (```flow_ids```).
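
One possible way to select the flows to print is sketched below. It assumes that ```flow_ids``` and the predictions are aligned element by element, and that each identifier is a tuple describing the flow; the sample identifiers and the helper name ```ddos_flow_ids``` are made up for this example.

```python
def ddos_flow_ids(flow_ids, y_pred):
    """Return the identifiers of the flows predicted as DDoS."""
    return [fid for fid, is_ddos in zip(flow_ids, y_pred) if is_ddos]

# Hypothetical flow identifiers: (source IP, source port, destination IP, destination port, protocol)
flow_ids = [
    ("192.168.0.10", 40000, "192.168.0.1", 80, 6),
    ("172.16.0.5", 900, "192.168.0.1", 53, 17),
    ("192.168.0.11", 40001, "192.168.0.1", 443, 6),
]
y_pred = [False, True, False]  # one boolean prediction per flow

for fid in ddos_flow_ids(flow_ids, y_pred):
    print("DDoS flow:", fid)
```

The same helper also works when the predictions are a NumPy boolean array, since iterating over such an array yields one truth value per flow.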

In [None]:
def predict_live(source,model_path):
    if source is not None:
        if source.endswith('.pcap'):
            pcap_file = source
            cap = pyshark.FileCapture(pcap_file)
            data_source = pcap_file.split('/')[-1].strip()
        else:
            cap = pyshark.LiveCapture(interface=source)
            data_source = source

        print ("Prediction on network traffic from: ", source)

        if model_path is not None:
            model = load_model(model_path)
        else:
            print ("Invalid model path: ", model_path) 
            return

        # load the labels, if available
        labels = parse_labels('DOS2019')

        mins, maxs = static_min_max(flatten=True,time_window=10,max_flow_len=1000)

        while (True):
            samples = process_live_traffic(cap, 'DOS2019', labels, max_flow_len=1000, traffic_type="all")
            if len(samples) > 0:
                X,Y_true,flow_ids = dataset_to_list_of_fragments(samples)
                X_flatten = flatten_samples(X)
                X = np.array(normalize(X_flatten, mins, maxs))
                if labels is not None:
                    Y_true = np.array(Y_true)
                else:
                    Y_true = None
                
                pt0 = time.time()
                Y_pred = np.squeeze(model.predict(X, batch_size=2048) > 0.5,axis=1)
                pt1 = time.time()
                prediction_time = pt1 - pt0

                report_results(np.squeeze(Y_true), Y_pred,  model.name, data_source, prediction_time)
                
                ### ADD YOUR CODE HERE ###

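                ##########################
            elif isinstance(cap, pyshark.FileCapture):
                # Assumption: an empty result from a pcap file means that the trace has been fully read
                print("\nNo more packets in file ", data_source)
                break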


# Hyperparameter tuning
Complete the code of the cell below by implementing random search on the 4 hyperparameters of the ```create_model``` method above. Use k-fold cross-validation with ```k=2``` and a maximum of ```10``` iterations for the random search.
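
The procedure itself can be illustrated without any ML library. The sketch below is a simplified stand-in, not the code expected in the cell: it samples 10 random configurations, scores each one with 2-fold cross-validation, and keeps the best. The candidate values in ```param_dist``` are illustrative, and ```toy_evaluate``` is a placeholder for training and validating a real model.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train_idx, val_idx) pairs."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, val_idx))
    return splits

def random_search(param_dist, evaluate, n_samples, n_iter=10, k=2, seed=0):
    """Sample n_iter configurations and keep the one with the best mean k-fold score."""
    rng = random.Random(seed)
    splits = k_fold_indices(n_samples, k)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and at random
        params = {name: rng.choice(values) for name, values in param_dist.items()}
        score = mean(evaluate(params, tr, va) for tr, va in splits)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Candidate values for the four arguments of create_model (illustrative ranges)
param_dist = {
    "optimizer": ["SGD", "Adam"],
    "dense_layers": [1, 2, 3, 4],
    "hidden_units": [8, 16, 32, 64],
    "learning_rate": [0.1, 0.01, 0.001],
}

def toy_evaluate(params, train_idx, val_idx):
    # Placeholder score: a real evaluate() would train on train_idx
    # and return the accuracy measured on val_idx.
    return 1.0 / (1.0 + abs(params["hidden_units"] - 32) + abs(params["dense_layers"] - 2))

best_params, best_score = random_search(param_dist, toy_evaluate, n_samples=100)
print(best_params, best_score)
```

In the notebook, scikit-learn's ```RandomizedSearchCV``` takes care of the sampling and the cross-validation loop through its ```n_iter``` and ```cv``` parameters, so only the parameter distributions and the estimator need to be provided.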

In [None]:
def train(dataset_path, model_path):
    if dataset_path is not None:
        X_train, y_train = load_dataset(dataset_path + "/*" + '-train.hdf5')
        X_val, y_val = load_dataset(dataset_path + "/*" + '-val.hdf5')

        param_dist = {
            ### ADD YOUR CODE HERE ###

            ##########################
        }

        model = KerasClassifier(build_fn=create_model, batch_size=100, verbose=1)

        ### ADD YOUR CODE HERE ###
        random_search = ### ADD YOUR CODE HERE ###
        early_stopping = ### ADD YOUR CODE HERE ###
        random_search_result = ### ADD YOUR CODE HERE ###
        ##########################

        # Print the best parameters and corresponding accuracy
        print("Best parameters found: ", random_search_result.best_params_)
        print("Best cross-validated accuracy: {:.2f}".format(random_search_result.best_score_))

        # Save the best model
        best_model = random_search.best_estimator_.model
        if model_path is not None:
            save_model(best_model,model_path)
        else:
            print ("Invalid model path: ", model_path)
            print ("Model saved as: " + './nids_model.h5')
            save_model(best_model,'./nids_model.h5')

# Train your model
Train your model by executing the method above with appropriate arguments (```args....```, see the ```argparse``` cell above). This will prepare the code for the stand-alone script, which will accept arguments from the command line. 
If you want to train the model in this notebook for testing purposes, you can first call the ```train``` method with a static dataset path (```'./DOS2019_Binary_5_Attacks_MLP'```) and model path (e.g., ```'./nids_model.h5'```). 

**NOTE1**: these arguments are strings, so remember to enclose both paths in quotes (```''```).

**NOTE2**: before exporting the notebook to a Python script, remember to replace the static parameters with the corresponding ```args....``` values.

In [None]:
# Train the model
### ADD YOUR CODE HERE ###

##########################

# Make predictions on the test set
In the following cell, add the code to make predictions on the test set with the previously saved model. 
If you want to test the model in this notebook, you can first call the ```predict``` method with a static dataset path (```'./DOS2019_Binary_5_Attacks_MLP'```) and model path (e.g., ```'./nids_model.h5'```). 

**NOTE1**: these arguments are strings, so remember to enclose both paths in quotes (```''```).

**NOTE2**: before exporting the notebook to a Python script, remember to replace the static parameters with the corresponding ```args....``` values.

In [None]:
# Predictions on the test set

### ADD YOUR CODE HERE ###

##########################

# Make predictions using a pcap file
In the following cell, add the code to make predictions on a pcap file with the previously saved model. 
If you want to test the model in this notebook, you can first call the ```predict_live``` method using the path to a ```pcap``` file (e.g., ```'./DOS2019_Binary_5_Attacks_PCAPs/ddos-chunk.pcap'```) and model path (e.g., ```'./nids_model.h5'```). 

**NOTE1**: these arguments are strings, so remember to enclose both paths in quotes (```''```).

**NOTE2**: before exporting the notebook to a Python script, remember to replace the static parameters with the corresponding ```args....``` values.

In [None]:
# Predictions on a pcap file

### ADD YOUR CODE HERE ###

##########################