# Deep Learning Network traffic classifier
This notebook demonstrates how to train and evaluate two deep learning models, an MLP and a CNN, using datasets generated by LUCX. The key cells are located at the bottom of the notebook and enable both training and testing of the models.  
The training phase is performed with randomized search, which tunes the main parameters of the deep learning models. The final two cells implement the testing phase: the first uses the test set produced by the parser, while the second allows the user to evaluate the model on live traffic, either captured directly from a network interface or obtained from a pre-recorded traffic trace in `pcap` format.

## Python script
This notebook can be converted into a Python script by running the following command:
```jupyter nbconvert --to python NIDS.ipynb```.

Once exported, the resulting `NIDS.py` script can be executed to train and test the model using the arguments defined in the second cell of the notebook.

**Note**: Before exporting the notebook, modify the code in the last three cells by commenting out the line that uses the static paths for training or prediction, and uncommenting the line that uses the args parameters. For example, the current training cell is written as follows (to allow testing directly within the notebook):

```train('./Dataset/DOS2019_Binary_5_Attacks_PCAPs','./nids_model.keras')```
```#train(args.train,args.model)```

Before converting the notebook into a Python script, it should be updated as:

```#train('./Dataset/DOS2019_Binary_5_Attacks_PCAPs','./nids_model.keras')```
```train(args.train,args.model)```

Apply the same modification to the other two cells.


In [None]:
# Copyright (c) 2025 @ FBK - Fondazione Bruno Kessler
# Author: Roberto Doriguzzi-Corin
# Project: LUCX: LUCID network traffic parser eXtended
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import time
import argparse
import pyshark
import numpy as np
import pprint
from scipy.stats import uniform, randint
import tensorflow as tf
import matplotlib.pyplot as plt
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from tensorflow.keras.models import Sequential,load_model, save_model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Input, Conv2D, Dropout, GlobalMaxPooling2D, Flatten
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import set_random_seed
from lucx_network_traffic_parser import *
from util_functions import *

# We need the following to get around “RuntimeError: This event loop is already running” when using Pyshark within Jupyter notebooks.
# Not needed in stand-alone Python projects
import nest_asyncio
nest_asyncio.apply()  

EPOCHS = 1000
TEST_ITERATIONS=10

### SELECT THE MODEL_TYPE HERE ('MLP' or 'CNN') ###
MODEL_TYPE = 'CNN' 
###################################################

SEED = 0
np.random.seed(SEED)
set_random_seed(SEED)

# Argument parser
The following cell defines the arguments accepted by the `NIDS.py` Python script exported from this notebook. The arguments ```--train``` and ```--predict``` take the path of the folder containing the dataset; the script loads the ```hdf5``` files for training and testing, respectively. The argument ```--predict_live``` performs predictions either on live traffic captured from a network interface (e.g., ```eth0``` or ```lo```) or on a pre-recorded traffic trace (e.g., ```ddos-chunk.pcap```). In both cases the argument is a string: the name of the interface in the first case, the path to the ```pcap``` file in the second. The argument ```--model``` indicates the path to the trained model used to make predictions (```--predict``` or ```--predict_live```).
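
As a quick illustration of how these options behave, the sketch below rebuilds a parser with the same option names and parses a few example argument lists. The paths and the interface name are made-up values used only for demonstration.

```python
import argparse

# Same option names as the notebook's parser; all values below are made-up examples.
parser = argparse.ArgumentParser(description="A DL-based NIDS for DDoS attack detection")
parser.add_argument('-t', '--train', nargs='?', type=str, default=None)
parser.add_argument('-p', '--predict', nargs='?', type=str, default=None)
parser.add_argument('-pl', '--predict_live', nargs='?', type=str, default=None)
parser.add_argument('-m', '--model', type=str, default=None)

# Training: dataset folder plus the path of the model file
train_args = parser.parse_args(['--train', './Dataset', '--model', './nids_model.keras'])
print(train_args.train, train_args.model)

# Live prediction: the same option takes an interface name or a pcap path
live_args = parser.parse_args(['--predict_live', 'eth0', '--model', './nids_model.keras'])
pcap_args = parser.parse_args(['-pl', 'ddos-chunk.pcap', '-m', './nids_model.keras'])
print(live_args.predict_live, pcap_args.predict_live)

# Options that are not given keep their default value (None)
print(pcap_args.train)
```

Options left at `None` are how the functions in this notebook decide whether to do anything: each one returns immediately when its first argument is `None`.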

In [None]:
parser = argparse.ArgumentParser(description="A DL-based NIDS for DDoS attack detection")
parser.add_argument('-t', '--train', nargs='?', type=str,  default=None, help="Start the training process")
parser.add_argument('-p', '--predict', nargs='?', type=str,  default=None, help="Perform a prediction on preprocessed data")
parser.add_argument('-pl', '--predict_live', nargs='?', type=str, default=None, help='Perform a prediction on live traffic or on a pre-recorded traffic trace in pcap format')
parser.add_argument('-m', '--model', type=str, default = None, help='File containing the model in keras format')

args, unknown = parser.parse_known_args()
print("see all args:", args)

In [None]:
def report_results(Y_true, Y_pred, model_name, data_source, prediction_time):
    ddos_rate = '{:04.3f}'.format(sum(Y_pred) / Y_pred.shape[0])

    if Y_true is not None and len(Y_true.shape) > 0:  # if we have the labels, we can compute the classification accuracy
        Y_true = Y_true.reshape((Y_true.shape[0], 1))
        accuracy = accuracy_score(Y_true, Y_pred)

        f1 = f1_score(Y_true, Y_pred)
        tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred, labels=[0, 1]).ravel()
        tnr = tn / (tn + fp)
        fpr = fp / (fp + tn)
        fnr = fn / (fn + tp)
        tpr = tp / (tp + fn)

        row = {'Model': model_name, 'Time': '{:04.3f}'.format(prediction_time),
               'Samples': Y_pred.shape[0], 'DDOS%': ddos_rate, 'Accuracy': '{:05.4f}'.format(accuracy), 'F1Score': '{:05.4f}'.format(f1),
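               'TPR': '{:05.4f}'.format(tpr), 'FPR': '{:05.4f}'.format(fpr), 'TNR': '{:05.4f}'.format(tnr), 'FNR': '{:05.4f}'.format(fnr), 'Source': data_source}
    else:  # without labels, report only the statistics that do not need the ground truth
        row = {'Model': model_name, 'Time': '{:04.3f}'.format(prediction_time),
               'Samples': Y_pred.shape[0], 'DDOS%': ddos_rate, 'Source': data_source}

    pprint.pprint(row, sort_dicts=False)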

# Implementation of the DL models
The following cell implements two deep learning models: an MLP and a CNN. Both can be trained using the traffic flow representations generated by LUCX. The MLP is trained on the flattened version of the data, whereas the CNN is trained on the array-like version.
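
The difference between the two input layouts can be sketched with NumPy. The sizes below (4 samples, 10 packets per flow, 11 features per packet) are assumptions chosen for illustration; the actual dimensions depend on how the dataset was generated.

```python
import numpy as np

# Hypothetical sizes: 4 flow samples, 10 packets per flow, 11 features per packet
n_samples, max_flow_len, n_features = 4, 10, 11
X = np.zeros((n_samples, max_flow_len, n_features), dtype=np.float32)

# MLP input: each sample becomes one flat feature vector
X_mlp = X.reshape(n_samples, -1)
print(X_mlp.shape)   # (4, 110)

# CNN input: keep the packets-by-features grid and add a trailing channel axis
X_cnn = np.expand_dims(X, axis=-1)
print(X_cnn.shape)   # (4, 10, 11, 1)
```

The trailing axis of size 1 matches the single channel that `create_cnn_model` declares in its `Input` layer.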

In [None]:
def create_mlp_model(input_shape,optimizer=Adam, dense_layers=4, hidden_units=2, learning_rate = 0.01,dropout_rate=0):
    model = Sequential(name  = "mlp")

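    # input_shape is the shape of the whole training set (samples, features): keep only the feature dimension
    model.add(Input(shape=(input_shape[1],)))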
    model.add(Dense(hidden_units, activation='relu'))
    for layer in range(dense_layers):
        model.add(Dense(hidden_units, activation='relu', name='hidden-fc' + str(layer)))
        model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid', name='fc2'))
    model.compile(optimizer=optimizer(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])

    model.summary()
    return model

def create_cnn_model(input_shape,optimizer=Adam, filters = 100, kernel_size=(3,3), strides=(1,1), padding='same',learning_rate = 0.01,dropout_rate=0.1):
    model = Sequential(name  = "cnn")

    model.add(Input(shape=(input_shape[1], input_shape[2], 1)))
    model.add(Conv2D(filters=filters, kernel_size=kernel_size, data_format='channels_last', activation='relu', padding=padding, strides=strides))
    model.add(Dropout(dropout_rate))
    model.add(GlobalMaxPooling2D())
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid',name='output'))
    model.compile(optimizer=optimizer(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])

    model.summary()
    return model

# Prediction on static test set

In [None]:
def predict(dataset_path, model_path):
    if dataset_path is not None:
        X_test, y_test = load_dataset(dataset_path + "/*" + '-test.hdf5')

        # Only models saved in the native Keras format are accepted
        if model_path is None or not model_path.endswith('.keras'):
            print ("No valid model specified!")
            return

        model = load_model(model_path)

        pt0 = time.time()
        for i in range(TEST_ITERATIONS):
            Y_pred = np.squeeze(model.predict(X_test, batch_size=16) > 0.5,axis=1)
        pt1 = time.time()
        prediction_time = pt1 - pt0

        report_results(np.squeeze(y_test), Y_pred,  model.name, '', prediction_time/TEST_ITERATIONS)

# Prediction on live traffic

In [None]:
def predict_live(source,model_path):
    if source is not None:
        if source.endswith('.pcap'):
            pcap_file = source
            cap = pyshark.FileCapture(pcap_file)
            data_source = pcap_file.split('/')[-1].strip()
        else:
            cap =  pyshark.LiveCapture(interface=source)
            data_source = source

        print ("Prediction on network traffic from: ", source)

        if model_path is not None:
            model = load_model(model_path)
        else:
            print ("Invalid model path: ", model_path) 
            return

        # load the labels, if available
        labels = parse_labels('DOS2019')

        # Statistics on live traffic are computed considering benign vs malicious only
        mc_labels = {'benign':0, 'malicious':1}


        if MODEL_TYPE == 'MLP':
            MAX_FLOW_LEN=10
            mins, maxs = static_min_max(flatten=True,time_window=10,max_flow_len=MAX_FLOW_LEN)
        else:
            MAX_FLOW_LEN=10
            mins, maxs = static_min_max(flatten=False,time_window=10,max_flow_len=MAX_FLOW_LEN)

        while (True):
            samples = process_live_traffic(cap, labels, mc_labels, max_flow_len=MAX_FLOW_LEN, traffic_type="all",time_window=10)
            if len(samples) > 0:
                X,Y_true,flow_ids = dataset_to_list_of_fragments(samples)
                if MODEL_TYPE == 'MLP':
                    X = flatten_samples(X)
                    X = np.array(normalize(X, mins, maxs))
                else:
                    X = np.array(normalize_and_padding(X, mins, maxs, MAX_FLOW_LEN))
                if labels is not None:
                    Y_true = np.array(Y_true)
                else:
                    Y_true = None
                
                pt0 = time.time()
                Y_pred = np.squeeze(model.predict(X, batch_size=2048) > 0.5,axis=1)
                pt1 = time.time()
                prediction_time = pt1 - pt0

                report_results(np.squeeze(Y_true), Y_pred,  MODEL_TYPE, data_source, prediction_time)
            elif isinstance(cap, pyshark.FileCapture):
                # Assumption: a pcap file that yields no samples has been fully read, so stop looping
                print ("No more packets in file: ", data_source)
                break

# Hyperparameter tuning
Hyperparameter tuning with random search and k-fold cross-validation.
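
The idea can be sketched without any deep learning library. The toy example below is only an illustration, not scikit-learn's implementation: a 1-D k-nearest-neighbours classifier stands in for the Keras model, and the helper names (`make_folds`, `knn_predict`, `cross_val_accuracy`, `random_search`) are hypothetical. Each iteration samples one candidate configuration at random, scores it as the mean validation accuracy over the k folds, and keeps the best one.

```python
import random

def make_folds(n_samples, k):
    """Return k (train_idx, val_idx) pairs that together cover range(n_samples)."""
    folds = []
    for f in range(k):
        val_idx = [i for i in range(n_samples) if i % k == f]
        train_idx = [i for i in range(n_samples) if i % k != f]
        folds.append((train_idx, val_idx))
    return folds

def knn_predict(x, train_X, train_y, n_neighbors):
    """Toy 1-D k-nearest-neighbours classifier (stand-in for the real model)."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > n_neighbors else 0   # ties are resolved as class 0

def cross_val_accuracy(X, y, n_neighbors, k):
    """Mean validation accuracy over k folds for one hyperparameter value."""
    scores = []
    for train_idx, val_idx in make_folds(len(X), k):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(X[i], train_X, train_y, n_neighbors) == y[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)

def random_search(X, y, n_iter, k, rng):
    """Sample n_iter random configurations and keep the best cross-validated one."""
    best_params, best_score = None, -1.0
    for _ in range(n_iter):
        params = {'n_neighbors': rng.randint(1, 9)}   # bounds are inclusive
        score = cross_val_accuracy(X, y, params['n_neighbors'], k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

rng = random.Random(0)
# Toy data: class 0 is centred at 0.0, class 1 at 3.0
y = [0] * 20 + [1] * 20
rng.shuffle(y)
X = [rng.gauss(3.0 * label, 1.0) for label in y]

# n_iter=2 and k=2 mirror the small search budget used in the training cell
best_params, best_score = random_search(X, y, n_iter=2, k=2, rng=rng)
print(best_params, best_score)
```

With such a small budget only two configurations are ever tried, so increasing the number of iterations is the first thing to change when a more thorough search is needed.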

In [None]:
def train(dataset_path, model_path):
    if dataset_path is not None:
        X_train, y_train = load_dataset(dataset_path + "/*" + '-train.hdf5')
        X_val, y_val = load_dataset(dataset_path + "/*" + '-val.hdf5')

        param_mlp_dist = {
            'optimizer': [SGD, Adam],
            'model__learning_rate': uniform(0.0001, 0.01),
            'model__hidden_units': randint(8,32),
            'model__dense_layers': randint(2,8)
        }

        param_cnn_dist = {
            ### ADD YOUR CODE HERE ###
            'model__learning_rate' : uniform(0.0001, 0.01),
            'model__filters' : randint(16,64),
            'optimizer' : [SGD,Adam],
            'model__kernel_size': [(2,2),(3,3),(2,3)],
            'model__strides': [(1,1),(2,2)],
            'model__padding' : ['same', 'valid']
            ##########################
        }

        if MODEL_TYPE == 'MLP':
            param_dist = param_mlp_dist
            create_model = create_mlp_model
        else:
            param_dist = param_cnn_dist
            create_model = create_cnn_model


        early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, restore_best_weights=True)
        model = KerasClassifier(model=create_model, input_shape=X_train.shape, batch_size=32, verbose=1,callbacks=[early_stopping])
        random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=2, cv=2, random_state=SEED)
        random_search_result = random_search.fit(X_train, y_train,epochs=EPOCHS, validation_data=(X_val, y_val))


        # Print the best parameters and corresponding accuracy
        print("Best parameters found: ", random_search.best_params_)
        print("Best cross-validated accuracy: {:.2f}".format(random_search.best_score_))


        # Save the best model (fall back to a default file name when no path is given)
        best_model = random_search.best_estimator_.model_
        if model_path is None:
            model_path = './nids_model-' + MODEL_TYPE + '.keras'
        best_model.save(model_path)
        print ("Model saved as: " + model_path)

# Train your model
Train the model by calling the ```train``` function defined above. Calling it with the parsed arguments (```args....```, see the ```argparse``` cell at the beginning of the notebook) prepares the code for the stand-alone script, which accepts arguments from the command line. 
If you want to train the model within this notebook for testing purposes, call ```train``` with a static dataset path (```'./DOS2019_Binary_5_Attacks_PCAPs'```) and model path (e.g., ```'./nids_model-MLP.keras'```) instead. 

**NOTE**: Before exporting the notebook to a Python script, remember to replace the static parameters with the corresponding ```args....``` values.

In [None]:
# Train the model
model_path = './nids_model-' + MODEL_TYPE + '.keras'
# Save to model_path so that the prediction cells below load the model trained here
train('./Dataset/DOS2019_Binary_5_Attacks_PCAPs',model_path)
#train(args.train,args.model)


# Make predictions on the test set
The following cell tests a pre-trained model on the test set. 
If you want to test the model in this notebook, call the ```predict``` function with a static dataset path (```'./DOS2019_Binary_5_Attacks_PCAPs'```) and model path (e.g., ```'./nids_model-MLP.keras'```). 

**NOTE**: Before exporting the notebook to a Python script, remember to replace the static parameters with the corresponding ```args....``` values.

In [None]:
# Predictions on the test set

model_path = './nids_model-' + MODEL_TYPE + '.keras'
predict('./Dataset/DOS2019_Binary_5_Attacks_PCAPs',model_path)
#predict(args.predict,args.model)

# Make predictions using a pcap file
The following cell tests a pre-trained model on network traffic, either collected live from a network interface or read from a pre-recorded traffic trace in `pcap` format. 
In the first case, call the ```predict_live``` function with the name of the interface (e.g., `eth0` or `lo`). In the second case, use the path to a ```pcap``` file (e.g., ```'./DOS2019_Binary_5_Attacks_PCAPs/ddos-chunk.pcap'```). The `model_path` argument is the path to the model saved after training (e.g., ```'./nids_model-MLP.keras'```). 

**NOTE1**: Both arguments are strings, so remember to enclose them in quotes (```''```).

**NOTE2**: Before exporting the notebook to a Python script, remember to replace the static parameters with the corresponding ```args....``` values.
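
The way the single string argument selects between the two modes can be sketched as follows. `describe_source` is a hypothetical helper written for this illustration (it is not part of the notebook), and the path is a made-up example; it applies the same `.pcap` suffix check that `predict_live` uses.

```python
import os

def describe_source(source):
    """Classify a predict_live-style argument (hypothetical helper for illustration)."""
    if source.endswith('.pcap'):
        # A pcap path: report the file name without its directory
        return 'file', os.path.basename(source)
    # Anything else is treated as the name of a network interface
    return 'interface', source

print(describe_source('eth0'))                      # ('interface', 'eth0')
print(describe_source('./traces/ddos-chunk.pcap'))  # ('file', 'ddos-chunk.pcap')
```

Because the check relies only on the suffix, a trace saved with a different extension would be treated as an interface name.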

In [None]:
# Predictions on a pcap file

model_path = './nids_model-' + MODEL_TYPE + '.keras'
predict_live('./Dataset/DOS2019_Binary_5_Attacks_PCAPs_test/ddos-chunk.pcap',model_path)
#predict_live(args.predict_live,args.model)