# TMA01 Question 1 (35 marks)

Name: [Enter your name here]

PI: [Enter your student ID here]

In this question you will attempt to distinguish between traffic sent over a VPN link and traffic that is not. In Question 2, you will classify the traffic into different types.

This question uses datasets of preprocessed flows. Each flow is labelled with whether that flow went over a VPN or not.

## Completing the question
The tasks in this notebook can be addressed using the techniques discussed in the Foundation and Block 1 of the module materials, and the associated notebooks.

> **You should be able to complete this question when you have completed the practical activities in Block 1**
>
> You should look at the notebooks for Block 1 while working through this question. You will find many useful examples in those notebooks which will help you in this assignment.

Record all your activity and observations in this notebook. Insert additional notebook cells as required. Remember to run each cell in sequence and to rerun cells if you make any changes in earlier cells. 

Include Markdown cells (like this one) liberally in your solutions, to describe what you are doing. This will help your tutor give full credit for all you have done, and is invaluable in reminding you what you were doing when you return to the TMA after a few days away.

Before you submit your notebook make sure you run all cells in order and check that you get the results you expect. (It is not unknown to receive notebooks which don't work when the cells are run in order.)

See the VLE for details of how to submit your completed notebook. You should submit only this notebook file for this question.

## Marks are based on process, not results

In this notebook, you will be asked to create, train, and evaluate several neural networks. Training neural networks is inherently a stochastic process, based on the random allocation of initial weights and the shuffled order of training examples. Therefore, your results will differ from results generated by other students, and those generated by the module team and presented in the tutor's marking guide.

The marks in this question are awarded solely on your ability to carry out the steps of training and evaluation, not on any particular results you may achieve. **There are no thresholds for accuracy (or any other metric) you must achieve.** You will gain credit for carrying out the tasks specified in this question, including honest evaluations of how the models perform. 

## Setup

This imports the required libraries.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, optimizers, metrics, Sequential, utils

import os
import json
import sklearn.metrics
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We define some constants we will use later and define some metrics to use for model evaluation.

In [None]:
BATCH_SIZE = 64

METRICS = [
      lambda : metrics.TruePositives(name='tp'),
      lambda : metrics.FalsePositives(name='fp'),
      lambda : metrics.TrueNegatives(name='tn'),
      lambda : metrics.FalseNegatives(name='fn'), 

      lambda : metrics.BinaryAccuracy(name='accuracy'),
      lambda : metrics.Precision(name='precision'),
      lambda : metrics.Recall(name='recall'),
      lambda : metrics.AUC(name='auc'),
]

def fresh_metrics():
    return [metric() for metric in METRICS]
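
The four counting metrics above (`tp`, `fp`, `tn`, `fn`) determine accuracy, precision and recall. The following is a minimal NumPy sketch of that relationship, not part of the supplied setup code. The labels and predicted probabilities are made up for illustration, and a decision threshold of 0.5 is assumed.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.55])

# Assumed decision threshold of 0.5
y_pred = (y_prob > 0.5).astype(int)

# Confusion-matrix counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# Derived metrics
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, tn, fn)
print(accuracy, precision, recall)
```

Here precision and recall differ because the two kinds of error (false positives and false negatives) occur in different numbers.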

Define a function for plotting ROC curves.

In [None]:
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

def plot_roc(name, labels, predictions, show_points=True, show_point_labels=True, **kwargs):
    """Plot the ROC curve for a binary classifier, given some labels and predictions for those labels.
    name is the name shown in the legend.
    If show_points is True, mark the locations on the curve where the threshold is 0.25, 0.5, and 0.75.
    If show_point_labels is True, label each marked point with its threshold."""
    
    # Calculate the points of the curve
    fpr, tpr, ths = sklearn.metrics.roc_curve(labels, predictions)

    # Plot as percentages
    plt.plot(100*fpr, 100*tpr, label=name, linewidth=2, **kwargs)
    plt.xlabel('False positive rate [%]')
    plt.ylabel('True positive rate [%]')
    text_x_offset = 3
    text_y_offset = 3
    plt.grid(True)
    ax = plt.gca()
    ax.set_aspect('equal')
    
    # Show the locations of various thresholds.
    if show_points:
        for pt in [0.25, 0.5, 0.75]:  
            pi = np.argmax(ths < pt)
            px = fpr[pi] * 100
            py = tpr[pi] * 100
            plt.plot(px, py, marker="o", markersize=10, **kwargs)
            if show_point_labels:
                ax.text(x = px + text_x_offset, y = py - text_y_offset, s = f'{pt}')
    
    return fpr, tpr, ths
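
The threshold markers in `plot_roc` rely on `roc_curve` returning its thresholds in decreasing order, so `np.argmax(ths < pt)` finds the index of the first threshold strictly below `pt`. The sketch below illustrates that lookup on its own, using a made-up threshold array rather than real model output.

```python
import numpy as np

# Hypothetical thresholds, in decreasing order as roc_curve returns them
ths = np.array([np.inf, 0.9, 0.6, 0.4, 0.1])

marked = {}
for pt in [0.25, 0.5, 0.75]:
    # argmax of a boolean array gives the index of the first True value
    marked[pt] = int(np.argmax(ths < pt))

for pt, idx in marked.items():
    print(f'threshold {pt}: index {idx}, nearest threshold below is {ths[idx]}')
```

Each marker therefore sits at the nearest available threshold below the requested value, not at the exact value. If no threshold is below `pt`, `np.argmax` returns 0 and the marker falls on the first point of the curve.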

## Loading and preparing the dataset

This section of the notebook loads the dataset and makes it available for training.

In [None]:
class_names = {0: 'Not VPN', 1: 'VPN'}

Where to find the data.

In [None]:
base_dir = '/datasets/cybersecurity/vpn-nonvpn/'

In [None]:
train_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a1_15s_train'))
train_data = train_data.cache()
train_data = train_data.batch(BATCH_SIZE, num_parallel_calls=tf.data.AUTOTUNE)
train_data = train_data.shuffle(1000)
train_data

In [None]:
validation_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a1_15s_validation'))
validation_data = validation_data.batch(BATCH_SIZE)
validation_data

In [None]:
test_data = tf.data.Dataset.load(os.path.join(base_dir, 'scenario_a1_15s_test'))
test_data = test_data.batch(BATCH_SIZE)
test_data

In [None]:
input_shape = (train_data.element_spec[0].shape[1],)
input_shape

## Jittered labels

The cells below extract the labels of the validation and test sets and add random jitter to them. The jittered labels may be useful for charts similar to those in the Foundations notebooks.

In [None]:
validation_labels = np.array(list(validation_data.unbatch().map(lambda x, y: y).as_numpy_iterator()))
validation_labels.shape

In [None]:
jittered_validation_labels = validation_labels + (np.random.random(validation_labels.shape) * 0.8)
jittered_validation_labels.shape

In [None]:
test_labels = np.array(list(test_data.unbatch().map(lambda x, y: y).as_numpy_iterator()))
test_labels.shape

In [None]:
jittered_labels = test_labels + (np.random.random(test_labels.shape) * 0.8)
jittered_labels.shape
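
To see what the jitter does, here is a small sketch using made-up labels (the seeded generator is an assumption, used only to make the illustration repeatable). Adding a random value in the range 0 to 0.8 moves each label 0 into the band [0, 0.8) and each label 1 into the band [1, 1.8), so the two classes stay separated while individual points no longer overlap on a chart.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so this illustration is repeatable

# Hypothetical binary labels, for illustration only
labels = np.array([0, 1, 1, 0, 1, 0])

# Same jitter scheme as above: add a random offset in [0, 0.8)
jittered = labels + rng.random(labels.shape) * 0.8

print(jittered[labels == 0])  # all values fall in [0, 0.8)
print(jittered[labels == 1])  # all values fall in [1, 1.8)
```

Plotting predictions against jittered labels, rather than against the raw 0 and 1 values, spreads the points out so that dense clusters are easier to see.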

# Define and train a sample model

We now create and train a simple model using these datasets.

You should use this example as a basis for the models of your own that you create in this question.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(input_shape),
    tf.keras.layers.Dense(8, activation='relu'), 
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Note that we're using **binary** cross-entropy as the loss function (as there are two classes). Categorical cross-entropy is used when there are multiple classes, one-hot encoded.
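
As an illustration of what this loss measures, the sketch below computes binary cross-entropy by hand with NumPy. It is not the Keras implementation: the function name, the example values, and the clipping value `eps` are assumptions made for this illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clipping avoids log(0) (assumed value)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_example = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_example))

# Hypothetical labels and two sets of predicted probabilities
y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])  # mostly right, and confident
unsure = np.array([0.5, 0.5, 0.5, 0.5])     # always predicts 0.5

loss_confident = binary_cross_entropy(y_true, confident)
loss_unsure = binary_cross_entropy(y_true, unsure)

print(loss_confident, loss_unsure)
```

A model that always predicts 0.5 has a loss of ln 2 (about 0.693), which gives a useful reference point when reading the loss plots: values well below this indicate the model has learned something beyond guessing.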

In [None]:
opt = optimizers.RMSprop()
model.compile(optimizer=opt, 
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
history = model.fit(train_data,
    validation_data=validation_data,
    epochs=5,
    verbose=0)

Save and reload the model and the training history.

In [None]:
model.save('q1_sample.keras')

with open('q1_sample_history.json', 'w') as f:
    json.dump(history.history, f)

In [None]:
model = tf.keras.models.load_model('q1_sample.keras')

with open('q1_sample_history.json') as f:
    sample_history = json.load(f)

Plot the training history.

In [None]:
acc = sample_history['accuracy']
val_acc = sample_history['val_accuracy']
loss = sample_history['loss']
val_loss = sample_history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'ro', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'ro', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

Update the metrics used on the model and evaluate them on the validation data.

In [None]:
model.compile(metrics=fresh_metrics(), loss='binary_crossentropy')
model.evaluate(validation_data, return_dict=True)

**You are now able to work on the tasks in this TMA question.**

# (a) (4 marks): examine the model

Referring to the sample model above:
* show how many trainable parameters it has
* extract the shape of the input to the model

In [None]:
# Your solution here
# Use additional cells as needed

# (b) (6 marks): new model, train it

Following the sample model defined above, create and train a new classifier model for this dataset. Your new model should have an `Input` layer and four `Dense` layers.

The `Dense` layers should have these parameters:
* 64 units, `relu` activation
* 64 units, `relu` activation
* 32 units, `relu` activation
* 1 unit, `sigmoid` activation

Training should use the `RMSprop` optimiser with the default learning rate.

Remember to use `binary_crossentropy` as the loss function.

Train your modified model for **300** epochs. Use `verbose=0`. Show plots of how the accuracy and loss changed over training, for both the training and validation datasets.

(You may wish to save your model and the training history.)

In [None]:
# Your solution here
# Use additional cells as needed

# (c) (5 marks): comment on training

Comment on the plots of loss and accuracy, for both training and validation data, during the training of this model. Do you think this model would benefit from additional training?

In [None]:
# Your solution here
# Use additional cells as needed

# (d) (10 marks): evaluate with new metrics

Recompile the model from part (b) above to use the metrics defined by the `fresh_metrics` function defined above. 

Evaluate the model, using these metrics, on all three of the **train**, **validation**, and **test** datasets. 

Use that model to generate predicted classes for all elements in the **test** dataset. Plot a scatter chart of the predicted results with the actual results (defined above as either `test_labels` or `jittered_labels`.)

Generate and plot the ROC curve for this, using the **test** dataset.

Comment on these results. 

In [None]:
# Your solution here
# Use additional cells as needed

# (e) (10 marks): experiments

Neural network models with the same structure can differ in their hyperparameters. In this part, you will train and evaluate two variations of the model you used in part (b) above, to see how these changes affect model training and performance.

1. The first variation should use a new model with the same definition as in part (b), but you should use the RMSprop optimiser with the default learning rate and train the model for 300 epochs. Store the model in a variable called `model_e1`.
2. The second variation should use a new model with the same definition as in part (b), but with `sigmoid` activation in all `Dense` layers, the  RMSprop optimiser with the default learning rate, and train the model for **600** epochs. Store the model in a variable called `model_e2`.

For each model:
* plot and comment on the metrics generated during training
* evaluate the model (with `fresh_metrics`) on the test dataset
* generate a scatter diagram of predictions
* comment on the evaluation and scatter diagram

Finally, generate an ROC curve with all three models: `model_b`, `model_e1`, and `model_e2`. Comment on these curves.

In [None]:
# Your solution here
# Use additional cells as needed