# TMA01 Question 2 (50 marks)

**Name**: \[Enter your name here\]

**PI**: \[Enter your student ID here\]

In this question you will attempt to classify flows by type of traffic, given that each flow has already been identified as VPN or non-VPN traffic.

Each flow is labelled according to the type of traffic.

| Traffic | Content |
|---------|---------|
| Web Browsing | Firefox and Chrome |
| Chat | ICQ, AIM, Skype, Facebook and Hangouts |
| Streaming | Vimeo and YouTube |
| Email | SMTPS, POP3S and IMAPS |
| VoIP | Facebook, Skype and Hangouts voice calls (1h duration) |
| P2P | uTorrent and Transmission (BitTorrent) |
| FT (File Transfer) | Skype, FTPS and SFTP using FileZilla and an external service |

Note that there are **two** collections of datasets: one for traffic sent over a VPN, and one for traffic not sent over a VPN. Each collection has training, validation, and test data.

## Completing the question
The tasks in this notebook can be addressed using the techniques discussed in the Foundation and Block 1 of the module materials, and the associated notebooks.

> **You should be able to complete this question when you have completed the practical activities in Block 1**
>
> You should look at the notebooks for Block 1 while working through this question. You will find many useful examples in those notebooks which will help you in this assignment.

Record all your activity and observations in this notebook. Insert additional notebook cells as required. Remember to run each cell in sequence and to rerun cells if you make any changes in earlier cells. 

Include Markdown cells (like this one) liberally in your solutions, to describe what you are doing. This will help your tutor give full credit for all you have done, and is invaluable in reminding you what you were doing when you return to the TMA after a few days away.

Before you submit your notebook make sure you run all cells in order and check that you get the results you expect. (It is not unknown to receive notebooks which don't work when the cells are run in order.)

See the VLE for details of how to submit your completed notebook. You should submit only this notebook file for this question.

## Marks are based on process, not results

In this notebook, you will be asked to create, train, and evaluate several neural networks. Training neural networks is inherently a stochastic process, based on the random allocation of initial weights and the shuffled order of training examples. Therefore, your results will differ from results generated by other students, and those generated by the module team and presented in the tutor's marking guide.

The marks in this question are awarded solely on your ability to carry out the steps of training and evaluation, not on any particular results you may achieve. **There are no thresholds for accuracy (or any other metric) you must achieve.** You will gain credit for carrying out the tasks specified in this question, including honest evaluations of how the models perform. 

## Setup

The cell below imports the required libraries. The training, validation, and testing datasets are loaded in the next section.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, optimizers, metrics, Sequential, utils

import os
import json

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import HTML, display

## Loading and preparing the dataset

This section of the notebook loads the dataset and makes it available for training.

First, we define some constants we will use later, along with some helper functions and metrics to use for model evaluation.

In [None]:
BATCH_SIZE = 64

In [None]:
# Read one class name per line; the `with` block closes the file afterwards.
with open('/datasets/cybersecurity/vpn-nonvpn/class_names.txt') as f:
    class_names = {i: n.strip() for i, n in enumerate(f)}
class_names

In [None]:
class_names_array = np.array([class_names[i] for i in sorted(class_names)])
class_names_array

In [None]:
def pretty_cm(cm):
    """Display a confusion matrix as an HTML table (rows: actual, columns: predicted)."""
    n = len(class_names)
    result_table  = '<h3>Confusion matrix</h3>\n'
    result_table += '<table border=1>\n'
    result_table += f'<tr><td>&nbsp;</td><td>&nbsp;</td><th colspan={n}>Predicted labels</th></tr>\n'
    result_table += '<tr><td>&nbsp;</td><td>&nbsp;</td>'

    # Header row: one column per predicted class.
    for cn in class_names.values():
        result_table += f'<td><strong>{cn}</strong></td>'
    result_table += '</tr>\n'

    # One row per actual class; the first row also carries the side heading.
    for ai, an in class_names.items():
        result_table += '<tr>\n'
        if ai == 0:
            result_table += f'  <th rowspan={n}>Actual labels</th>\n'
        result_table += f'  <td><strong>{an}</strong></td>\n'
        for pi in class_names:
            result_table += f'  <td>{cm[ai, pi]}</td>\n'
        result_table += '</tr>\n'
    result_table += "</table>"
    display(HTML(result_table))

In [None]:
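def multi_class_precision(cmatrix):
    """Per-class precision: correct predictions / all predictions of each class."""
    # Cast first so integer confusion matrices divide as floats.
    cmatrix = tf.cast(cmatrix, tf.float32)
    # Rows are actual labels, columns are predicted labels,
    # so column sums count how often each class was predicted.
    return tf.linalg.diag_part(cmatrix) / tf.reduce_sum(cmatrix, axis=0)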

In [None]:
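def multi_class_recall(cmatrix):
    """Per-class recall: correct predictions / all actual members of each class."""
    # Cast first so integer confusion matrices divide as floats.
    cmatrix = tf.cast(cmatrix, tf.float32)
    # Rows are actual labels, columns are predicted labels,
    # so row sums count how many examples each class really has.
    return tf.linalg.diag_part(cmatrix) / tf.reduce_sum(cmatrix, axis=1)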
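
The arithmetic behind the two metric functions above can be checked by hand on a small matrix. The sketch below is a NumPy illustration only: the 3-class confusion matrix is made up, and it assumes the same convention as `pretty_cm` (rows are actual labels, columns are predicted labels).

```python
import numpy as np

# Toy 3-class confusion matrix (rows = actual, columns = predicted).
# These counts are invented purely for illustration.
cm = np.array([[9, 1, 0],
               [2, 6, 2],
               [1, 1, 8]])

# Correctly classified examples sit on the diagonal.
true_positives = np.diag(cm)

# Precision: correct predictions of a class / all predictions of that class.
precision = true_positives / cm.sum(axis=0)

# Recall: correct predictions of a class / all actual members of that class.
recall = true_positives / cm.sum(axis=1)

print(precision)
print(recall)
```

Here the first class is predicted 12 times but is correct only 9 times, so its precision (0.75) is lower than its recall (0.9).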

The datasets are stored in the following directory.

In [None]:
base_dir = '/datasets/cybersecurity/vpn-nonvpn/'

In [None]:
train_vpn_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a2_15s_VPN_train'))
train_vpn_data = train_vpn_data.cache()
train_vpn_data = train_vpn_data.batch(BATCH_SIZE, num_parallel_calls=tf.data.AUTOTUNE)
train_vpn_data = train_vpn_data.shuffle(1000)
train_vpn_data

In [None]:
validation_vpn_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a2_15s_VPN_validation'))
validation_vpn_data = validation_vpn_data.cache()
validation_vpn_data = validation_vpn_data.batch(BATCH_SIZE, num_parallel_calls=tf.data.AUTOTUNE)
validation_vpn_data

In [None]:
test_vpn_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a2_15s_VPN_test'))
test_vpn_data = test_vpn_data.cache()
test_vpn_data = test_vpn_data.batch(BATCH_SIZE, num_parallel_calls=tf.data.AUTOTUNE)
test_vpn_data

In [None]:
train_no_vpn_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a2_15s_NO-VPN_train'))
train_no_vpn_data = train_no_vpn_data.cache()
train_no_vpn_data = train_no_vpn_data.batch(BATCH_SIZE, num_parallel_calls=tf.data.AUTOTUNE)
train_no_vpn_data = train_no_vpn_data.shuffle(1000)
train_no_vpn_data

In [None]:
validation_no_vpn_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a2_15s_NO-VPN_validation'))
validation_no_vpn_data = validation_no_vpn_data.cache()
validation_no_vpn_data = validation_no_vpn_data.batch(BATCH_SIZE, num_parallel_calls=tf.data.AUTOTUNE)
validation_no_vpn_data

In [None]:
test_no_vpn_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a2_15s_NO-VPN_test'))
test_no_vpn_data = test_no_vpn_data.cache()
test_no_vpn_data = test_no_vpn_data.batch(BATCH_SIZE, num_parallel_calls=tf.data.AUTOTUNE)
test_no_vpn_data

In [None]:
input_shape = (train_vpn_data.element_spec[0].shape[1],)
num_classes = train_vpn_data.element_spec[1].shape[1]
input_shape, num_classes

## Validation and test labels    

Use these for generating confusion matrices.

In [None]:
validation_vpn_labels = np.array(list(validation_vpn_data.unbatch().map(lambda x, y: y).as_numpy_iterator()))
validation_vpn_labels = np.argmax(validation_vpn_labels, axis=1)
validation_vpn_labels.shape

In [None]:
validation_no_vpn_labels = np.array(list(validation_no_vpn_data.unbatch().map(lambda x, y: y).as_numpy_iterator()))
validation_no_vpn_labels = np.argmax(validation_no_vpn_labels, axis=1)
validation_no_vpn_labels.shape

In [None]:
test_vpn_labels = np.array(list(test_vpn_data.unbatch().map(lambda x, y: y).as_numpy_iterator()))
test_vpn_labels = np.argmax(test_vpn_labels, axis=1)
test_vpn_labels.shape

In [None]:
test_no_vpn_labels = np.array(list(test_no_vpn_data.unbatch().map(lambda x, y: y).as_numpy_iterator()))
test_no_vpn_labels = np.argmax(test_no_vpn_labels, axis=1)
test_no_vpn_labels.shape
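
The label arrays above hold class indices rather than one-hot vectors, which is the form a confusion matrix needs. As a rough sketch of how such indices might be turned into a confusion matrix, the NumPy example below uses made-up labels and made-up model outputs for three classes. The module notebooks may use a different route, such as `tf.math.confusion_matrix`.

```python
import numpy as np

# Toy one-hot labels for 5 examples and 3 classes (illustrative only).
one_hot_labels = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1],
                           [0, 1, 0],
                           [1, 0, 0]])
# argmax recovers the class index from each one-hot row.
true_idx = np.argmax(one_hot_labels, axis=1)

# Toy model outputs: one row of class probabilities per example.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.4, 0.3],
                  [0.4, 0.5, 0.1]])
# The predicted class is the one with the highest probability.
pred_idx = np.argmax(probs, axis=1)

# Count each (actual, predicted) pair: rows are actual, columns predicted.
num_toy_classes = one_hot_labels.shape[1]
cm = np.zeros((num_toy_classes, num_toy_classes), dtype=int)
np.add.at(cm, (true_idx, pred_idx), 1)

print(cm)
```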

# Part a (14 marks)

Create a four-layer model. The first two layers should have 64 units, the third should have 32 units, and the last should have 7 units (for the seven classes of traffic). Note that you will need an initial `Input` layer with `shape=input_shape`.

The first three layers should use `sigmoid` activation. The last layer should use `softmax` activation. 

The model summary should look like this.

```
Model: "sequential"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 64)             │         1,536 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 64)             │         4,160 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 32)             │         2,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 7)              │           231 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 8,007 (31.28 KB)

 Trainable params: 8,007 (31.28 KB)

 Non-trainable params: 0 (0.00 B)
```
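
The parameter counts in the summary can be checked by hand: a dense layer has one weight per (input, unit) pair plus one bias per unit. The sketch below assumes 23 input features, a value inferred from the first layer's count (1,536 = (23 + 1) × 64) rather than stated in this notebook.

```python
# Units in each Dense layer, taken from the model summary.
layer_units = [64, 64, 32, 7]

# Assumption: 23 input features, inferred from 1,536 = (23 + 1) * 64.
n_inputs = 23

params = []
fan_in = n_inputs
for units in layer_units:
    # One weight per (input, unit) pair, plus one bias per unit.
    params.append((fan_in + 1) * units)
    # The outputs of this layer are the inputs of the next.
    fan_in = units

print(params)       # per-layer parameter counts
print(sum(params))  # total parameter count
```

If your own summary shows different counts, the most likely cause is a layer with the wrong number of units or a missing `Input` layer.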

Store the model in a variable called `model_a`.

Use the RMSprop optimiser, with its default parameters, and train this model for **200** epochs. Train using the `train_vpn_data` and `validation_vpn_data` defined above, as per the module notebooks. 

Plot how the accuracy and loss change during training. Comment on your observations of training. 

Evaluate the model on the test dataset and generate a confusion matrix. Store the confusion matrix in a variable called `cmatrix_a`. Comment on this evaluation and confusion matrix.

#### Important
This is a multi-class classification task. You must use the `categorical_crossentropy` loss function for training. The final output layer should use the `softmax` activation function.
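
To see why `softmax` and `categorical_crossentropy` belong together, it can help to compute both by hand. The NumPy sketch below is illustrative only: the logits and targets are made up, and library implementations typically add numerical safeguards that are omitted here.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    # Subtracting the row maximum avoids overflow without changing the result.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(one_hot_targets, probs):
    """Mean over examples of -sum(target * log(probability))."""
    per_example = -np.sum(one_hot_targets * np.log(probs), axis=-1)
    return float(np.mean(per_example))

# Two toy examples with three classes each (values invented for illustration).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0]])

probs = softmax(logits)
loss = categorical_crossentropy(targets, probs)

print(probs)
print(loss)
```

Because the targets are one-hot, only the probability assigned to the correct class contributes to the loss for each example.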

> **Hints**
> * When training for many epochs, you may want to pass in the parameter `verbose=0` to `model_a.fit()` to reduce the output verbiage.
> * You may want to save the model and the training history so you can reload it in a later session.
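
One possible way to keep a training history between sessions is to write it to a JSON file. In the sketch below, `history_dict` is a made-up stand-in for the `history.history` dictionary returned by training, and the file name is hypothetical.

```python
import json
import os
import tempfile

# Stand-in for `history.history`, the dict of per-epoch metrics held by the
# History object that `model.fit()` returns. These values are made up.
history_dict = {
    "loss": [1.9, 1.5],
    "accuracy": [0.30, 0.45],
    "val_loss": [1.8, 1.6],
    "val_accuracy": [0.32, 0.41],
}

# Hypothetical file name, written to a temporary directory for this demo.
history_path = os.path.join(tempfile.mkdtemp(), "model_a_history.json")
with open(history_path, "w") as f:
    json.dump(history_dict, f)

# In a later session, reload the history for plotting.
with open(history_path) as f:
    reloaded = json.load(f)

# The model itself could be saved with model_a.save("model_a.keras") and
# reloaded with tf.keras.models.load_model("model_a.keras").
print(reloaded.keys())
```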

In [None]:
# Your solution here
# Use additional cells as needed.

# Part b (7 marks)

Create another model with the same structure as in part (a). Store this model in a variable called `model_b`. 

Train this model with the same hyperparameters as in part (a), but now using the `train_no_vpn_data` and `validation_no_vpn_data` datasets. 

Plot how the accuracy and loss change during training. Comment on your observations of training.

Evaluate the model on the test dataset and generate a confusion matrix. Store the confusion matrix in a variable called `cmatrix_b`. Comment on this evaluation and confusion matrix.

In [None]:
# Your solution here
# Use additional cells as needed.

# Part c (8 marks)

You now have two models, one trained on the VPN traffic and one trained on the non-VPN traffic. How good are these models, and how do they compare?

You can also now replicate the results presented by Draper-Gil _et al._ (2016) in Figure 3(a)-(d).

Using the example below for guidance, generate the multi-class precision and multi-class recall scores from the evaluation of each of the models created in parts (a) and (b) above. Plot those results in the four subplots of one figure.
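
As a rough sketch of the figure layout only, the cell below draws four bar charts in a 2×2 grid. The class labels and scores are placeholders, not real results. In your solution the values would instead come from the outputs of `multi_class_precision` and `multi_class_recall`, and the labels from `class_names_array`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder class names and scores, invented to demonstrate the layout.
class_labels = ["class A", "class B", "class C"]
scores = {
    "VPN precision": np.array([0.75, 0.75, 0.80]),
    "VPN recall": np.array([0.90, 0.60, 0.80]),
    "Non-VPN precision": np.array([0.70, 0.65, 0.85]),
    "Non-VPN recall": np.array([0.80, 0.55, 0.90]),
}

# One subplot per (dataset, metric) pair, sharing the y-axis for comparison.
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharey=True)
for ax, (title, values) in zip(axes.flat, scores.items()):
    ax.bar(class_labels, values)
    ax.set_title(title)
    ax.set_ylim(0, 1)
    ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

Sharing the y-axis and fixing its range to 0 to 1 makes the four panels directly comparable.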

Comment on your results.

Compare your results to those presented by Draper-Gil _et al._ (2016) (paying most attention to the "15s" data, as that is what you're using). 

> Reminder: there are no marks awarded for how well your models do in comparison to those created by Draper-Gil _et al._ This question is about generating results and comparing them.

In [None]:
# Your solution here
# Use additional cells as needed.

# Part d (12 marks)

Create two four-layer models with the same structure as in parts (a) and (b) above, but using `relu` activation rather than `sigmoid` in the first three layers. (The last layer must still use `softmax` activation.)
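
For reference, the difference between the two activation functions can be seen on a handful of inputs. This NumPy sketch is illustrative only and uses hand-written definitions rather than the Keras layers.

```python
import numpy as np

def sigmoid(x):
    """Squash any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Pass positive inputs through unchanged; clamp negatives to zero."""
    return np.maximum(0.0, x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

sig = sigmoid(x)
rel = relu(x)

# Gradients of each activation with respect to its input.
sigmoid_grad = sig * (1.0 - sig)        # never larger than 0.25
relu_grad = (x > 0).astype(float)       # 1 for positive inputs, else 0

print(sig, sigmoid_grad)
print(rel, relu_grad)
```

The sigmoid gradient shrinks towards zero for large positive or negative inputs, whereas the `relu` gradient stays at 1 for any positive input. This may be worth bearing in mind when you compare the training curves.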

Store the models in variables called `model_d1` and `model_d2`.

Use the SGD optimiser, with its default parameters, and train these models for **300** epochs. 

Train `model_d1` using the `train_vpn_data` and `validation_vpn_data` defined above. Train `model_d2` using the `train_no_vpn_data` and `validation_no_vpn_data` defined above.

Plot how the accuracy and loss change during training. Comment on your observations of training. 

Evaluate the models on the appropriate test datasets and generate confusion matrices. Comment on these evaluations and confusion matrices.

Compare the performance of the models against those you generated in parts (a) and (b) above.

In [None]:
# Your solution here
# Use additional cells as needed.

# Part e (3 marks)

One feature of neural network based models is that they only perform well on data that is similar to what they've been trained on. 

From the work above, you have two datasets (VPN and non-VPN) and a model trained on each. At first glance, the flows in the two datasets should be similar. How well does each model from part (d) work on the dataset it was _not_ trained on?

Evaluate each model from part (d) against each test dataset. Compare the results between models and datasets. Comment on what you find.

In [None]:
# Your solution here
# Use additional cells as needed.

# Part f (6 marks)

You can now address a simplified version of "scenario B" from the Draper-Gil _et al._ (2016) paper. This scenario attempts to classify flows by type, when the flows are a mixture of VPN and non-VPN traffic. However, you will only need to classify the flows into the current seven classes, rather than the fourteen in the paper.

The cell below combines the `vpn` and `no_vpn` datasets. 

Use this combined dataset to train another model, with the same structure as in part (d) above. Evaluate the model's performance on the `vpn`, `no_vpn`, and `all` datasets. How does this model compare to the models created in parts (a), (b), and (d)?
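
As a minimal sketch of how two batched datasets might be combined, the cell below uses `tf.data.Dataset.concatenate` on toy data. The helper `toy_dataset` and the names `toy_vpn`, `toy_no_vpn`, and `toy_all` are hypothetical, and this is not necessarily how the module team's combined datasets are built.

```python
import tensorflow as tf

def toy_dataset(n_examples, label):
    """Hypothetical helper: a small batched dataset of (features, one-hot label)."""
    features = tf.random.uniform((n_examples, 3))
    labels = tf.one_hot([label] * n_examples, depth=2)
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

toy_vpn = toy_dataset(4, 0)      # 2 batches
toy_no_vpn = toy_dataset(6, 1)   # 3 batches

# Append one dataset to the other, then shuffle so that batches from the
# two sources are interleaved during training.
toy_all = toy_vpn.concatenate(toy_no_vpn).shuffle(100)

n_batches = sum(1 for _ in toy_all)
print(n_batches)
```

For validation and test data, it is probably safer to leave out the `shuffle` step, so that the order of examples stays aligned with any separately extracted label arrays.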

In [None]:
# Your solution here
# Use additional cells as needed.