# The NetSecure NIDS - Example answers
The startup company NetSecure has been observing the recent hacks on the news. Because they are working on a new product, and they want to protect their intellectual property, they have decided to improve their network security.
	
To this end, they want to deploy a Network Intrusion Detection System (NIDS) that analyses all the traffic passing between their internal network and the internet. Unfortunately, all security solutions provided by security companies are too expensive for NetSecure. Therefore they approached you with the assignment of building a NIDS.

## Network setup
NetSecure is a small company with only 3 employees, each of whom has a single desktop machine for their daily work. Furthermore, the company runs a server which provides various web services to the internet. Figure 1 gives an overview of the network structure of NetSecure. It also shows where the NIDS will be placed.

<img src="images/network.png" width=600>
<div align="center">Figure 1. Overview of the network setup of company NetSecure, including the IP addresses of their machines.</div>

## Available data
In order to train the NIDS that you are creating you are given two datasets of network traffic, both captured at the router as given in Figure 1.
	
 1. `data/benign.csv`, a dataset of benign traffic.
 2. `data/unknown.csv`, a dataset of unknown traffic.

It is up to you to train a machine learning algorithm with the benign traffic. Once you have trained the NIDS, you will test the unknown traffic to determine which parts should be classified as benign and which parts should be classified as malicious.
	
The network traffic that is captured is provided as a .csv file of individual TCP and UDP packets. Each packet has the following features:

| Feature   | Type   | Description                                           |
|-----------|--------|-------------------------------------------------------|
| timestamp | int    | Timestamp at which packet was sent.                   |
| protocol  | string | Indicating whether a packet was sent over UDP or TCP. |
| src       | string | Source IP address of the packet.                      |
| sport     | int    | Source port of the packet.                            |
| dst       | string | Destination IP of the packet.                         |
| dport     | int    | Destination port of the packet.                       |
| size      | int    | Number of bytes in packet that was sent.              |

Furthermore, the dataset of benign traffic also contains labels of the applications that are generating the network traffic.

## Assignment
In this assignment we will walk through all the steps necessary to create a proper NIDS for NetSecure. The first thing we need to do is load the assignment.

<div class="alert alert-block alert-info">
<b>Hint:</b> This assignment works with the `assignment` class, which provides functionality for running your code and checks for common (albeit not all) bugs. Please see the <b>reference.html</b> provided with the assignment for a description of the functionality.
</div>

In [None]:
# Imports numpy and pandas libraries
import numpy  as np
import pandas as pd
# Imports type hints
from typing import Literal, List, Optional, Tuple

# Imports assignment from backend
from ml4sec import Assignment

# Initialises assignment with given files.
# HINT: make sure you extract the benign.csv and unknown.csv files into a data directory located next to assignment.ipynb
assignment = Assignment(
    file_benign  = 'data/benign.csv',
    file_unknown = 'data/unknown.csv',
)

## 1. Feature selection/extraction
The network packets in the transport layer (i.e. TCP and UDP packets) all belong to a network flow. Such a flow is given by the 5-tuple (*protocol*, *src*, *sport*, *dst*, *dport*). All of the packets within such a flow belong to the same application. We can leverage this knowledge to group packets together and extract statistical features from them as a group.
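The grouping by 5-tuple can be illustrated with a few lines of pandas. This is only a sketch: the `Assignment` backend already groups the packets for you, and the packets and IP addresses below are made up for illustration.

```python
import pandas as pd

# Hand-made packets using the column names from the feature table above.
# All values (including the IP addresses) are invented for this sketch.
packets = pd.DataFrame({
    "timestamp": [0, 1, 2, 5],
    "protocol":  ["TCP", "TCP", "UDP", "TCP"],
    "src":       ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "sport":     [50000, 50000, 53000, 50000],
    "dst":       ["10.0.0.9", "10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "dport":     [443, 443, 53, 443],
    "size":      [100, 1500, 60, 400],
})

# A flow is identified by the 5-tuple (protocol, src, sport, dst, dport).
flow_key = ["protocol", "src", "sport", "dst", "dport"]

# Map each flow to the timestamps and sizes of its packets.
flows = {
    key: (group["timestamp"].to_numpy(), group["size"].to_numpy())
    for key, group in packets.groupby(flow_key)
}

for key, (timestamps, sizes) in flows.items():
    print(key, timestamps, sizes)
```

Each dictionary entry holds exactly the per-flow `timestamps` and `sizes` arrays that a feature extraction function would work on.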

**Question 1.1.** We can compute some statistical features of each group. These features may give us an advantage in the detection phase of our NIDS. Which statistical features can be computed?

*(Hint 1: An example feature could be the maximum length of a packet in a flow.)*

*(Hint 2: The timestamp itself does not say much about the origin of a packet, however the frequency (i.e. time between packets) might give us some more information.)*

**NB:** in this question it is up to you to determine how many features you want to extract. Keep in mind that this will have influence on the performance of your NIDS so think carefully about the features you could extract.
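As a starting point, the sketch below computes a handful of size and timing statistics for a single flow. The function name `flow_features_sketch` and the choice of features are our own assumptions, not the expected answer; you will probably want to experiment with other features.

```python
import numpy as np

def flow_features_sketch(timestamps, sizes):
    """Illustrative statistics for one flow (hypothetical feature choice)."""
    timestamps = np.asarray(timestamps, dtype=float)
    sizes = np.asarray(sizes, dtype=float)

    # Time between consecutive packets; a single-packet flow has no gaps,
    # so we fall back to 0 to avoid taking the mean of an empty array.
    gaps = np.diff(np.sort(timestamps))
    mean_gap = float(gaps.mean()) if gaps.size else 0.0

    return [
        float(sizes.size),   # number of packets in the flow
        float(sizes.min()),  # smallest packet
        float(sizes.max()),  # largest packet (Hint 1)
        float(sizes.mean()), # average packet size
        float(sizes.std()),  # spread of packet sizes
        mean_gap,            # average time between packets (Hint 2)
    ]

features = flow_features_sketch([0, 1, 5], [100, 1500, 400])
print(features)
```

Note the explicit handling of single-packet flows: statistics over differences are undefined there, so some default value has to be chosen.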

**Question 1.2.** We now want to extract these new features for each flow. To this end, you will implement the function `extract()`, which has access to the `protocol`, `src`, `sport`, `dst`, `dport` of a flow, as well as the `timestamps` and `sizes` of each packet within the flow. The function should return a list of extracted features as described in the previous question. Note that this function is called once for each flow. To test your method for X flows, please run `assignment.test_extract(X)` as described below.

In [None]:
def extract(
        protocol  : Literal['UDP', 'TCP'],
        src       : str,
        sport     : int,
        dst       : str,
        dport     : int,
        timestamps: np.ndarray,
        sizes     : np.ndarray,
    ) -> List[float]:
    # This part must be implemented
    pass
    # This part must be implemented
    

# Set extract method
assignment.extract = extract

You can use the code below to test your implementation of `extract()` with a given number of flows.

In [None]:
# Change the number to test different numbers of flows
assignment.test_extract(1)

Now that we extracted the features, we represent them in a feature matrix. Each row in this matrix represents a flow and each column represents a feature. Run the code below to show the feature matrix for your feature extraction.

In [None]:
# Shows feature matrix
assignment.show_matrix()

## Information Gain (for repair only)
<div class="alert alert-block alert-info">
<b>Note:</b> When talking about the log function, we mean the <b>natural logarithm</b> as implemented by either `math.log` or `np.log`.
</div>

In this repair assignment, we will evaluate how much information your selected features give with respect to the application to which they belong. In machine learning, we often compute the information that a feature gives for a specific label using the `information gain`.

Without any features, we could only guess the label that we have. However, since there are many labels, this guess will often be wrong. The difficulty of this guess can be expressed in the entropy of the labels. The entropy $H(X)$ for a variable $X$ is defined as:
$$
H(X) = - \sum_{i=1}^n p(x_i) \cdot \log p(x_i)
$$
where $p(x_i)$ is the probability of observing $x_i$.

As an example, consider the following labels `[a, b, b, b, b]`. We see that the probability $p(a) = 0.2$ and the probability $p(b) = 0.8$. The entropy in this case will be $H(X) = -(0.2 \cdot \log 0.2) - (0.8 \cdot \log 0.8) \approx 0.5004$.
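The example above can be reproduced with a few lines of numpy. This is a minimal sketch under the definition given here (natural logarithm, probabilities estimated from counts); the name `entropy_sketch` is our own.

```python
import numpy as np

def entropy_sketch(values):
    """Entropy of a sequence of categorical values, using the natural log."""
    # Count how often each distinct value occurs.
    _, counts = np.unique(np.asarray(values), return_counts=True)
    # Turn counts into probabilities p(x_i).
    p = counts / counts.sum()
    # H(X) = - sum_i p(x_i) * log p(x_i)
    return float(-np.sum(p * np.log(p)))

h = entropy_sketch(['a', 'b', 'b', 'b', 'b'])
print(round(h, 4))  # approximately 0.5004
```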

We can also compute the entropy of our labels, given that we know a specific feature. This is the conditional entropy. The conditional entropy $H(Y|X)$ for a variable $Y$ given the observation $X$ is defined as:
$$
H(Y|X) = - \sum_{x \in X, y \in Y} p(x, y) \cdot \log \frac{p(x, y)}{p(x)}
$$
where $p(x)$ is the probability of observing $x$ and $p(x, y)$ is the joint probability of observing both $x$ and $y$ simultaneously.

Consider the following features and their corresponding labels

| Feature (X) | Label (Y) |
|:-----------:|:---------:|
| 1           | a         |
| 1           | a         |
| 2           | b         |
| 3           | b         |
| 3           | c         |

This gives a conditional entropy of
$$
H(Y|X) = - (0.4 * \log \frac{0.4}{0.4}) - (0.2 * \log \frac{0.2}{0.2}) - (0.2 * \log \frac{0.2}{0.4}) - (0.2 * \log \frac{0.2}{0.4}) \approx 0.2773
$$

This consists of the following parts:
 * X = 1 and Y = a: $p(X=1, Y=a) = 0.4$, $p(X=1) = 0.4 \Rightarrow (0.4 * \log \frac{0.4}{0.4})$
 * X = 2 and Y = b: $p(X=2, Y=b) = 0.2$, $p(X=2) = 0.2 \Rightarrow (0.2 * \log \frac{0.2}{0.2})$
 * X = 3 and Y = b: $p(X=3, Y=b) = 0.2$, $p(X=3) = 0.4 \Rightarrow (0.2 * \log \frac{0.2}{0.4})$
 * X = 3 and Y = c: $p(X=3, Y=c) = 0.2$, $p(X=3) = 0.4 \Rightarrow (0.2 * \log \frac{0.2}{0.4})$
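The same computation can be sketched in Python by counting joint and marginal occurrences. The name `conditional_entropy_sketch` is our own, and probabilities are simply estimated as relative frequencies.

```python
import numpy as np
from collections import Counter

def conditional_entropy_sketch(Y, X):
    """H(Y|X) for two equally long sequences of categorical values."""
    n = len(X)
    joint = Counter(zip(X, Y))  # counts of each (x, y) pair
    marginal = Counter(X)       # counts of each x
    total = 0.0
    for (x, _y), count_xy in joint.items():
        p_xy = count_xy / n
        p_x = marginal[x] / n
        # Each observed pair contributes -p(x, y) * log(p(x, y) / p(x)).
        total -= p_xy * np.log(p_xy / p_x)
    return float(total)

X = ['1', '1', '2', '3', '3']
Y = ['a', 'a', 'b', 'b', 'c']
h_cond = conditional_entropy_sketch(Y, X)
print(round(h_cond, 4))  # approximately 0.2773
```

Only pairs that actually occur are summed, which sidesteps the undefined $\log 0$ term for combinations that never appear.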

The `information gain` is the difference between the original entropy of our labels, $H(labels)$, and the conditional entropy of our labels given a feature, $H(labels | feature)$. It is computed separately for each feature.
If the information gain is high, the feature is very useful for determining the label. Conversely, if the information gain is close to 0, the feature is not useful for determining the label.
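For the example table above, this works out as follows. The symbol $IG$ is our own shorthand and is not used elsewhere in the assignment; $H(Y)$ is computed from the label column `[a, a, b, b, c]`, so $p(a) = 0.4$, $p(b) = 0.4$ and $p(c) = 0.2$.

$$
IG(Y, X) = H(Y) - H(Y|X)
$$
$$
H(Y) = -(0.4 \cdot \log 0.4) - (0.4 \cdot \log 0.4) - (0.2 \cdot \log 0.2) \approx 1.0549
$$
$$
IG(Y, X) \approx 1.0549 - 0.2773 = 0.7776
$$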

**Question (repair).** Implement the `entropy` and `conditional_entropy` functions using the formulas described above.

In [None]:
def entropy(X):
    # This part must be implemented
    pass
    # This part must be implemented
    
    
def conditional_entropy(Y, X):
    # This part must be implemented
    pass
    # This part must be implemented    


# Sets entropy methods
assignment.entropy             = entropy
assignment.conditional_entropy = conditional_entropy

The code below runs the examples given in the assignment so that you can debug your own implementation.

In [None]:
example_input_1   = np.asarray(['a', 'b', 'b', 'b', 'b'])
example_input_2_X = np.asarray(['1', '1', '2', '3', '3'])
example_input_2_Y = np.asarray(['a', 'a', 'b', 'b', 'c'])

# Test case 1
result_1 = assignment.entropy(example_input_1)
print("Test case - entropy            : {}".format(result_1))

# Test case 2
result_2 = assignment.conditional_entropy(
    Y = example_input_2_Y,
    X = example_input_2_X,
)
print("Test case - conditional entropy: {}".format(result_2))

# Check if output is correct
assert abs(result_1 - 0.5004024) < 1e-6, "Test case 1: entropy() is computed incorrectly"
assert abs(result_2 - 0.2772588) < 1e-6, "Test case 2: conditional_entropy() is computed incorrectly"

Using the code below, you can compute the information gain for your own extracted features.

_Important: Your features are likely continuous (any real number) rather than categorical. The information gain as described here is only defined for categorical values. Therefore, the code below automatically transforms your features into categories by putting them into bins. You can play around with the number of bins, but note that for a high number of bins, each value gets its own bin and the information gain will always look high (even though the feature may not be informative)._
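To give an idea of what binning means, the sketch below puts a few continuous values into equal-width bins with numpy. This is an assumption about the general technique only; the binning strategy used inside `assignment.show_info_gain` may differ.

```python
import numpy as np

# Made-up continuous feature values.
values = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.5])
n_bins = 4

# Equal-width bin edges between the smallest and largest value.
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Using only the interior edges yields bin indices 0 .. n_bins - 1,
# with the maximum value falling into the last bin.
bin_ids = np.digitize(values, edges[1:-1])
print(bin_ids)
```

The resulting bin indices are categorical and can be passed to an entropy computation in place of the raw values.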

In [None]:
# Computes the information gain for your computed features
assignment.show_info_gain(bins=20)

### Repair submission
To check if you correctly implemented the `entropy` and `conditional_entropy` methods, you can test your implementation using the code below.

<div class="alert alert-block alert-info">
<b>Note:</b> In order to pass the repair, you need to submit <b>both this repair assignment and the complete notebook</b>. The notebook is equivalent to the original assignment with some additional documentation and test functions to help with debugging. Therefore, if you already completed the notebook, but submitted late, you can reuse your code from the previous submission.
</div>

In [None]:
# Fill in your credentials to submit the repair of the assignment.
assignment.submit_repair(
    student_numbers = ["<your_student_numbers>", "<your_student_numbers>"],
)

## 2. Data Preprocessing
The data that we have stored is composed of different features, each of which takes values within its own range. For example, the length of a packet is limited by the Maximum Transmission Unit (MTU), which by default is 1500 bytes, giving this feature a range of [0, 1500]. Conversely, we may observe from the data that the average time between packets is in the range [0, 10] seconds. When a machine learning algorithm tries to compare two datapoints, it will put a much greater weight on the feature with range [0, 1500] than on the feature with range [0, 10] because the absolute differences are much larger. Instead, we would like to compare the relative distances for each feature. To this end, we apply a technique called `scaling`.

There are many different scaling techniques, but for the purpose of creating a NIDS, we will use min-max scaling. This scaling technique scales the original values in such a way that the minimum value for each feature is mapped to 0, and the maximum value for each feature is mapped to 1. The relative distances between the values of a feature are preserved. The formula for calculating the new value $z_i$ of value $x_i$ from feature $X$ is given as

$$z_i = \frac{x_i - \min{(X)}}{\max{(X)} - \min{(X)}}.$$

**Question 2.1.** Implement the `scale` function using the given formula. First compute the minimum and maximum values for each column from the matrix. Next use these values to scale the matrix.

_Important: The method should return the scaled matrix, computed minimum, and computed maximum values for the given matrix._
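The formula can be applied to all columns at once with numpy broadcasting. The following is a minimal sketch, assuming the matrix has one row per flow and one column per feature; the name `min_max_scale_sketch` and the example values are our own.

```python
import numpy as np

def min_max_scale_sketch(matrix, minimum=None, maximum=None):
    """Column-wise min-max scaling; reuses minimum/maximum when given."""
    matrix = np.asarray(matrix, dtype=float)
    if minimum is None:
        minimum = matrix.min(axis=0)  # one minimum per feature (column)
    if maximum is None:
        maximum = matrix.max(axis=0)  # one maximum per feature (column)
    # Broadcasting applies the per-column values to every row.
    scaled = (matrix - minimum) / (maximum - minimum)
    return scaled, minimum, maximum

train = np.array([[0.0, 100.0], [5.0, 1500.0], [10.0, 800.0]])
scaled, lo, hi = min_max_scale_sketch(train)
print(scaled)

# Reusing the training minimum and maximum on new data keeps both
# datasets on the same scale; values may then fall outside [0, 1].
new_scaled, _, _ = min_max_scale_sketch(np.array([[20.0, 100.0]]), lo, hi)
print(new_scaled)
```

Passing in a previously computed minimum and maximum is what makes it possible to scale unseen data consistently with the training data. Note that this sketch does not guard against a column whose values are all identical, in which case the denominator is 0.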

In [None]:
def scale(
        matrix: np.ndarray,
        minimum: Optional[np.ndarray] = None,
        maximum: Optional[np.ndarray] = None,
    ) -> Tuple[list, list, list]:
    
    # If minimum value is not given, compute the minimum for each feature
    if minimum is None:
        # This part must be implemented
        minimum = None
        # This part must be implemented
        
    # If maximum value is not given, compute the maximum for each feature
    if maximum is None:
        # This part must be implemented
        maximum = None
        # This part must be implemented
        
        
    # Use the minimum and maximum to compute the scaled matrix
        
    # This part must be implemented
    scaled = None
    # This part must be implemented
    
    return scaled, minimum, maximum
    
    
# Sets scale method
assignment.scale = scale

You can use the code below to test your implementation of `scale()` for various test cases. Set verbose to `True` to show which test cases are running.

In [None]:
# Test your implementation of scale()
assignment.test_scale(verbose=False)

Run the code below to show the scaled feature matrix produced by your scale function. All values should be between 0 and 1.

_Hint: If your matrix shows NaN values, this likely means that there was a divide by 0 error. This happens if you have features that are all the same. If this is the case for you, please revisit your_ `extract()` _function and check which feature produces values that are all the same._

In [None]:
# Show scaled feature matrix
assignment.show_scaled()

Running the function below plots the benign and unknown data in both unscaled and scaled form.

**NB**: The plot shows the features of each flow compressed to a 2D space using [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis). Therefore, if you have more than 2 features, these plots will show all the features "compressed" into a 2D mapping.

In [None]:
# Plots unscaled and scaled features
assignment.plot_scaled()

**Question 2.2.** Look at the above plots: the left plots show the unscaled features and the right plots show the scaled features. As we can see, the scaled features are separated better than the unscaled features. However, are all applications separated properly? If you see some overlap, is this something you could explain or do you need to choose different features?

**NB:** Note that some overlap might be caused by the dimensionality reduction from the PCA in the plotting method. However, this will very likely result in clusters being very close together, but will likely not result in a *complete* overlap.

**NB:** If there is a weird overlap in features, or if the scaled data does not seem to be separated very well, try to extract different features in your `extract()` implementation.  If it looks difficult to separate the applications in the scaled version of the dataset, your NIDS will likely also have difficulties separating the applications.

## 3. Model/Parameter selection
Currently we have a scaled matrix where each row contains a data sample, and each column a feature. Now that we have prepared the features that we feed into our NIDS, we have to choose an anomaly detection algorithm to observe new behaviour. Given that we only have access to benign data and unknown data, we cannot train a classifier. Therefore, we need to train a novelty/outlier detection algorithm. In this assignment we will use a One-Class SVM.

As you have learned during the lecture, a One-Class SVM uses a kernel function K to define the geometric relationship between a feature vector X and the support vectors Y. For an RBF (Gaussian) kernel, this relationship is defined as:

$$K(X, Y) = e^{-\frac{||x - y||^2}{2\sigma^2}}$$

**NB:** We use $||x - y||$ to denote the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between $x$ and $y$, so $||x - y||^2$ is the squared Euclidean distance.
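For two single feature vectors, the formula translates almost literally into numpy. This is a sketch under the formula above; the name `rbf_kernel_sketch` is our own, and it handles one pair of vectors rather than whole matrices.

```python
import numpy as np

def rbf_kernel_sketch(x, y, sigma=1.0):
    """RBF kernel between two feature vectors x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared Euclidean distance between the two vectors.
    squared_distance = np.sum((x - y) ** 2)
    return float(np.exp(-squared_distance / (2.0 * sigma ** 2)))

print(rbf_kernel_sketch([0, 0], [0, 0]))  # identical points give 1.0
print(rbf_kernel_sketch([0, 0], [1, 1]))  # exp(-1), roughly 0.368
```

The kernel value is 1 for identical vectors and decays towards 0 as the vectors move apart; a larger `sigma` makes this decay slower.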

**Question 3.1.** Implement the RBF kernel function.

In [None]:
def K(X: np.ndarray, Y: np.ndarray, sigma: float=1) -> float:
    # The code between this text should be implemented
    pass
    # The code between this text should be implemented

A support vector machine (SVM) uses its kernel function to compute a score between a feature vector $X$ and support vector $Y$. You can think of the score as the likelihood **(note that it is not a probability)** that $X$ and $Y$ originate from the same underlying distribution. Because an SVM contains multiple support vectors $Y_i$, we can assign a score to feature vector $X$ based on the likelihood that it shares the same underlying distribution with any of our support vectors $Y_i \in Y$. We compute this overall score by simply adding the (weighted) scores for $X$ with each support vector $Y_i$, which is called the soft score:

$$\text{soft score} = \sum_{i=0}^N w_iK(X, Y_i) \leq \tau$$

We compare this soft score with a threshold $\tau$ to determine if a point falls within or outside of our model. The assignment automatically computes the soft score with all weights as 1 for all given points $X$, support vectors $Y$ and your implemented kernel $K$.
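The sum in the soft score can be sketched as follows, assuming an RBF kernel and, as in the assignment, all weights equal to 1 by default. The name `soft_score_sketch` and the support vectors are made up for illustration; the assignment computes this score for you.

```python
import numpy as np

def soft_score_sketch(x, support_vectors, weights=None, sigma=1.0):
    """Weighted sum of RBF kernel values between x and each support vector."""
    x = np.asarray(x, dtype=float)
    support_vectors = np.asarray(support_vectors, dtype=float)
    if weights is None:
        weights = np.ones(len(support_vectors))  # all weights set to 1
    # Squared Euclidean distance from x to every support vector.
    squared_distances = np.sum((support_vectors - x) ** 2, axis=1)
    kernel_values = np.exp(-squared_distances / (2.0 * sigma ** 2))
    return float(np.sum(weights * kernel_values))

support_vectors = [[0.0, 0.0], [1.0, 0.0]]
near = soft_score_sketch([0.0, 0.0], support_vectors)
far = soft_score_sketch([10.0, 10.0], support_vectors)
print(near, far)
```

A point close to the support vectors receives a much larger score than a point far away from all of them, which is what the comparison with $\tau$ exploits.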

We show the influence of different kernel values on the decision function for random support vectors and datapoints. All datapoints that appear in the red area are considered to fall within the soft hypersphere. All datapoints outside the area are considered anomalous. We see that sigma changes the shape of the hypersphere, while the threshold changes the volume and preserves the shape.

In [None]:
# Plots random kernels and shows the spheres produced by your kernel
assignment.plot_kernels(K, sigmas=[1, 2, 3], thresholds=[3, 2, 1])

Later in this assignment, we will use the `OneClassSVM` implementation from the `scikit-learn` library. This implementation automatically chooses optimal values for the weights $w_i$ and threshold $\tau$. However, we can still set two values to influence this decision: `gamma` and `nu`. The `gamma` value determines the size of the kernel, similar to the value of $\sigma$. Additionally, the `nu` value sets an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. We recall from the lecture that we want to minimize the threshold $\tau$. The `nu` value allows us to adjust the number of support vectors as well as set the allowed number of errors the OneClassSVM can make during training while keeping the $\tau$ value minimal.

## 4. Evaluation

Now that we have taken a closer look at One-Class SVMs, we have to think about how we can train, test and evaluate the model. We do this to get an idea of how well the NIDS will perform in a real-world scenario, which allows us to optimise its parameters before deployment and classification of the unknown data. In order to evaluate our model, we require labelled data, i.e. data from the `benign.csv` file. Given that all of this data is benign, it might seem difficult to assess whether the NIDS can detect malicious network connections. However, we can redefine our problem such that benign data alone suffices to train and test the NIDS.

All data in the `benign.csv` file is labelled with the corresponding application that produced the data. We know that a malicious program is just another application that we have not seen in our benign data. We will leverage this observation to train and test our model.

To this end, we randomly select a couple of applications that we leave out of the training data and only use for the test dataset. These applications simulate the 'unknown' apps or malware that we might encounter in the `unknown.csv` dataset. All other applications will be present in both the train and test set. However, for these applications, we still need to select a portion of the data that we use in the test set, and a portion of the data that we use in the train set. We can define this portion with a `ratio`.

As an example, consider a dataset of 5 applications: $Apps = [\text{Firefox}, \text{Word}, \text{Git}, \text{Excel}, \text{Outlook}]$. Now we randomly select $[\text{Firefox}, \text{Excel}]$ to be our apps for the testing set, meaning $[\text{Word}, \text{Git}, \text{Outlook}]$ are the apps present in both the train and test datasets. Next we define our `ratio` for training items as `0.75`. This means that we select 75% of the flows from $[\text{Word}, \text{Git}, \text{Outlook}]$ as our train data and add the remaining 25% of flows to the test dataset.

_Hint: we round the ratio if it does not split properly_

Therefore, if we have the following input:

**Apps train:** $[\text{Word}, \text{Git}, \text{Outlook}]$ <br>
**Apps test:** $[\text{Firefox}, \text{Excel}]$ <br>
**Ratio:** 0.75 <br>

We will get the following split:

<div style="-webkit-column-count: 3; -moz-column-count: 3; column-count: 3; -webkit-column-rule: 1px dotted #e0e0e0; -moz-column-rule: 1px dotted #e0e0e0; column-rule: 1px dotted #e0e0e0; text-align: center;">
    <div style="display: inline-block;">
<h3 style="text-align: center;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Original data and Labels</h3>
        
|  Label  | feature 1 | feature 2 | ... | feature n |
|:-------:|:---------:|:---------:|:---:|:---------:|
| Firefox | value     | value     | ... | value     |
| Firefox | value     | value     | ... | value     |
| Word    | value     | value     | ... | value     |
| Word    | value     | value     | ... | value     |
| Word    | value     | value     | ... | value     |
| Git     | value     | value     | ... | value     |
| Git     | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
         
</div>
<div style="display: inline-block;">
<h3 style="text-align: center;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Train set</h3>
        
|  Label  | feature 1 | feature 2 | ... | feature n |
|:-------:|:---------:|:---------:|:---:|:---------:|
| Word    | value     | value     | ... | value     |
| Word    | value     | value     | ... | value     |
| Git     | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
        
</div>
<div style="display: inline-block;">
<h3 style="text-align: center;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Test set</h3>

|  Label  | feature 1 | feature 2 | ... | feature n |
|:-------:|:---------:|:---------:|:---:|:---------:|
| Firefox | value     | value     | ... | value     |
| Firefox | value     | value     | ... | value     |
| Word    | value     | value     | ... | value     |
| Git     | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Excel   | value     | value     | ... | value     |
| Outlook | value     | value     | ... | value     |
        
</div>
</div>
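One possible way to implement this split is sketched below. Several choices are assumptions on our part: the function name `split_sketch`, the `seed` parameter, shuffling the flows of each app before splitting, rounding the number of training flows down (which reproduces the row counts in the tables above), and returning the result in the order train data, train labels, test data, test labels.

```python
import numpy as np

def split_sketch(data, labels, apps_train, apps_test, ratio, seed=0):
    """Split flows into train and test sets per application (illustrative)."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []

    for app in apps_train:
        # Shuffle the flows of this app, then cut at the requested ratio.
        idx = rng.permutation(np.flatnonzero(labels == app))
        n_train = int(ratio * len(idx))  # assumption: round down
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])

    for app in apps_test:
        # Apps reserved for testing never appear in the training set.
        test_idx.extend(np.flatnonzero(labels == app))

    train_idx = np.sort(np.array(train_idx, dtype=int))
    test_idx = np.sort(np.array(test_idx, dtype=int))
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]

# Labels taken from the "Original data and Labels" table; features are dummies.
labels = np.array(['Firefox'] * 2 + ['Word'] * 3 + ['Git'] * 2
                  + ['Excel'] * 4 + ['Outlook'] * 4)
data = np.arange(len(labels)).reshape(-1, 1)

X_tr, y_tr, X_te, y_te = split_sketch(
    data, labels,
    apps_train=['Word', 'Git', 'Outlook'],
    apps_test=['Firefox', 'Excel'],
    ratio=0.75,
)
print(len(y_tr), len(y_te))
```

With this rounding choice, the sketch yields 6 training rows and 9 test rows, matching the tables. Check `reference.html` and the test cases for the rounding and ordering that the assignment actually expects.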

**Question 4.1.** Implement the function `split()` that implements the split technique described above.

In [None]:
def split(
        data: np.ndarray,
        labels: np.ndarray,
        apps_train: List[str],
        apps_test: List[str],
        ratio: float,
    ) -> Tuple[list, list, list, list]:
    # The code between this text should be implemented
    pass
    # The code between this text should be implemented
    
    
# Sets split method
assignment.split = split

You can use the code below to test your implementation of `split()` for several test cases. Set `verbose=True` to print each test case.

In [None]:
# Tests split method
assignment.test_split(verbose=False)

The following code uses your `split()` method to obtain the train and test split for the data in `benign.csv`.

**Important**: Note that the labels for this split are integers that are `+1` if the corresponding application is in the training data or `-1` if the corresponding application is not in the training data. We will use this to determine whether our NIDS can correctly determine whether apps are present in the training set (`+1`, benign) or whether they are not present in the training set (`-1`, malicious).
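The mapping from application names to `+1`/`-1` labels can be pictured as follows. This is only an illustration with made-up labels; `assignment.get_split` performs this conversion for you.

```python
import numpy as np

# Made-up application labels of four test flows.
app_labels = np.array(['Word', 'Firefox', 'Git', 'Excel'])
apps_train = ['Word', 'Git', 'Outlook']

# +1 if the app was part of the training data (known, benign),
# -1 if it was not (unknown, treated as malicious).
y = np.where(np.isin(app_labels, apps_train), 1, -1)
print(y)
```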

In [None]:
# Gets training and testing data
X_train, y_train, X_test, y_test = assignment.get_split(
    
        # Select the apps used for training, the rest will be used for testing
        apps_train = [
            'IntelliJ',
            'PowerPoint',
            'Thunderbird',
            'Web Server',
            'Chrome',
            'Word',
            'Mail Server',
            'Anti Virus',
            'Outlook',
            'DNS Server',
        ],
    
    
        # Select the ratio of flows from training apps used for training
        ratio     = 0.75,
)

# Prints overview of data
print(
"""
Your data is structured as follows:
-----------------------------------
Train data   (X_train): np.array of shape={}
Train labels (y_train): np.array of shape={}
Test  data   (X_test ): np.array of shape={}
Test  labels (y_test ): np.array of shape={}
""".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
)

Now that we have split our data into training and testing sets with the corresponding labels, we want to compute some metrics over a prediction. We recall from the lecture that most metrics are based on the True Positive (TP), False Positive (FP), True Negative (TN) and False Negative values (FN).

 * True Positives (TP), the number of samples predicted as *unknown* (-1) that should be labelled as *unknown* (-1).
 * True Negatives (TN), the number of samples predicted as *known* (+1) that should be labelled as *known* (+1).
 * False Positives (FP), the number of samples predicted as *unknown* (-1) that should be labelled as *known* (+1).
 * False Negatives (FN), the number of samples predicted as *known* (+1) that should be labelled as *unknown* (-1).
 
Or in a diagram:

| $ $                | Actual Unknown      | Actual Known        |
|--------------------|---------------------|---------------------|
| Predicted Unknown  | True Positive (TP)  | False Positive (FP) |
| Predicted Known    | False Negative (FN) | True Negative (TN)  |
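The four cells of this table can be counted with boolean masks. The sketch below follows the convention used here, where *unknown* (-1) is the positive class; the name `confusion_counts_sketch` and the example labels are our own.

```python
import numpy as np

def confusion_counts_sketch(y_true, y_pred):
    """Return (TP, TN, FP, FN) with -1 (unknown) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == -1) & (y_true == -1)))  # unknown, flagged
    tn = int(np.sum((y_pred == 1) & (y_true == 1)))    # known, accepted
    fp = int(np.sum((y_pred == -1) & (y_true == 1)))   # known, flagged
    fn = int(np.sum((y_pred == 1) & (y_true == -1)))   # unknown, accepted
    return tp, tn, fp, fn

y_true_example = [-1, -1, 1, 1, 1, -1]
y_pred_example = [-1, 1, 1, -1, 1, -1]
counts = confusion_counts_sketch(y_true_example, y_pred_example)
print(counts)
```

Since every sample falls into exactly one of the four cells, the counts always add up to the number of samples.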

**Question 4.2.** Given these definitions of TP, TN, FP, FN, implement the methods that compute each value given the predictions (`y_pred`) and actual labels (`y_true`).

In [None]:
def TP(y_true: np.ndarray, y_pred: np.ndarray) -> int:
    # The code between this text should be implemented
    pass
    # The code between this text should be implemented
    
def TN(y_true: np.ndarray, y_pred: np.ndarray) -> int:
    # The code between this text should be implemented
    pass
    # The code between this text should be implemented
    
def FP(y_true: np.ndarray, y_pred: np.ndarray) -> int:
    # The code between this text should be implemented
    pass
    # The code between this text should be implemented
    
def FN(y_true: np.ndarray, y_pred: np.ndarray) -> int:
    # The code between this text should be implemented
    pass
    # The code between this text should be implemented
    
    
# Sets True/False Positive/Negative methods
assignment.TP = TP
assignment.TN = TN
assignment.FP = FP
assignment.FN = FN

You can use the code below to test your implementations of the True/False Positive/Negative values for several test cases. Set `verbose=True` to print each test case.

In [None]:
# Tests your implementation of True/False Positive/Negative values
assignment.test_metrics(verbose=False)

We use your implementation provided above to compute several performance metrics. There are various metrics that are used widely in the evaluation of machine learning techniques. We recall 5 of these from the lecture:

The True Positive Rate (TPR), also called sensitivity, measures the proportion of actual positives that are correctly identified as such.
$$\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}} = 1-\text{FNR}$$

The True Negative Rate (TNR), also called specificity, measures the proportion of actual negatives that are correctly identified as such.
$$\text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}} = 1-\text{FPR}$$

The False Positive Rate (FPR) measures the proportion of actual negatives that are identified as positives. If a NIDS has a high FPR, it will raise alarms for benign network traffic, causing a lot of frustration for the people on the network.
$$\text{FPR} = \frac{\text{FP}}{\text{TN}+\text{FP}} = 1-\text{TNR}$$

The False Negative Rate (FNR) measures the proportion of actual positives that are identified as negatives. If a NIDS has a high FNR, it will not find malicious software, making it useless.
$$\text{FNR} = \frac{\text{FN}}{\text{TP}+\text{FN}} = 1-\text{TPR}$$

The Accuracy (ACC) measures the fraction of all samples, positive and negative, that are classified correctly, giving a single overview of the system performance. An accuracy of 1.0 means everything is correctly classified. An accuracy of 0.0 means nothing is correctly classified.
$$\text{ACC} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$$

Additionally, you will also often encounter the [Precision, recall](https://en.wikipedia.org/wiki/Precision_and_recall) and [F1-score](https://en.wikipedia.org/wiki/F-score). The precision gives the ratio of 'how many selected items are relevant' and the recall gives the ratio of 'how many relevant items are selected'. The F1-score gives the harmonic mean between precision and recall and is often used in place of the accuracy. For reference, the formulas for all three metrics are given below.

$$\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}$$

$$\text{Recall} = \text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}$$

$$\text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \text{TP}}{2 \text{TP} + \text{FP} + \text{FN}}$$

We compute these metrics based on your True Positive, True Negative, False Positive, and False Negative implementations.
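To see how the formulas fit together, the sketch below computes all of them from four counts. The counts are made up for illustration, the name `metrics_sketch` is our own, and zero denominators are not handled.

```python
def metrics_sketch(tp, tn, fp, fn):
    """Compute the metrics above from the four confusion counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (tn + fp),
        "FNR": fn / (tp + fn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "Precision": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical outcome: 10 unknown flows and 90 known flows.
metrics = metrics_sketch(tp=8, tn=85, fp=5, fn=2)
for name, value in metrics.items():
    print("{:9s}: {:.4f}".format(name, value))
```

In this made-up case the accuracy is 0.93 while the F1-score is only about 0.70, which illustrates why the F1-score is often reported alongside the accuracy when the classes are imbalanced.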

## 5. Classifier
Now that we have prepared the data and implemented some of the evaluation metrics, we are ready to create our Network Intrusion Detection System (NIDS). To this end, we will use the [OneClassSVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html) implementation from the scikit-learn library. This model uses the `fit()` method with train data of known classes to learn its model. Subsequently, its `predict()` method labels each test sample as +1 if it fits the learned model, or -1 if it is considered anomalous.

**Question 5.1.** We import the One-class SVM from [Scikit-Learn - OneClassSVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html). Given this class, initialize it with different parameters and run the subsequent code to fit it on our training data and predict our testing data. Try to get the best performance by adjusting the `kernel`, `gamma` and `nu` values.

**NB**: The OneClassSVM implementation of scikit-learn uses the `gamma` parameter for the RBF kernel, which in this library is defined as $\gamma = \frac{1}{2\sigma^2}$. This changes the kernel function from
$K(x, y) = e^{-\frac{||x - y||^2}{2\sigma^2}}$ to $K(x, y) = e^{-\gamma||x - y||^2}$.
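To see that both forms of the RBF kernel give the same value, the sketch below evaluates them for one pair of points. The helper names `rbf_sigma` and `rbf_gamma` and the example values are illustrative only; they are not part of scikit-learn or the assignment.

```python
import math


def rbf_sigma(x, y, sigma):
    """RBF kernel written in terms of sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def rbf_gamma(x, y, gamma):
    """RBF kernel written in terms of gamma, as used by scikit-learn."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


sigma = 2.0
gamma = 1 / (2 * sigma ** 2)          # conversion from sigma to gamma
x, y = (1.0, 2.0), (3.0, 5.0)         # arbitrary example points

k_sigma = rbf_sigma(x, y, sigma)
k_gamma = rbf_gamma(x, y, gamma)
print(gamma, k_sigma, k_gamma)
```

A larger `gamma` corresponds to a smaller $\sigma$, so the kernel value drops off faster as the distance between two points grows.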

_Hint 1: You may notice the performance is not that good. If that is the case, go back to the `assignment.get_split()` method and try different values for `ratio` and `apps_train`._

_Hint 2: From your answer at question 2.2, you should be able to explain why some combinations of training versus test app combinations give bad results._

<div class="alert alert-block alert-info">
    <b>Hint:</b> You can use the kernel="rbf", or any of the other kernel values documented at <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html">Scikit-Learn - OneClassSVM</a>. Please <b>do not</b> use your own implementation of K() because this uses a slightly different API than the sklearn.svm.OneClassSVM expects.
</div>

In [None]:
from sklearn.svm import OneClassSVM

assignment.NIDS = OneClassSVM(
    # The code between this text should be implemented
#     kernel = "<your kernel>",
#     gamma  = "<gamma value>",
#     nu     = "<   nu value>",
    # The code between this text should be implemented
)

# This code fits the NIDS using the training data X_train
assignment.NIDS.fit(X_train)
# This code lets the NIDS predict the test data X_test
y_pred = assignment.NIDS.predict(X_test)

# We evaluate the prediction y_pred and compare it with the actual values y_test
assignment.prediction_report(y_test, y_pred)
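To get a feeling for how `gamma` and `nu` interact, the sketch below sweeps a few combinations on synthetic two-dimensional data. The data, the variable names and the parameter grid are assumptions made purely for illustration; they stand in for `X_train` and `X_test` and say nothing about which values work best on the real traffic.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Synthetic stand-ins for the assignment data (NOT the real X_train / X_test)
X_benign_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_benign_test = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_malicious_test = rng.normal(loc=6.0, scale=1.0, size=(100, 2))

results = []
for gamma in (0.01, 0.1, 1.0):
    for nu in (0.01, 0.05, 0.2):
        clf = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X_benign_train)
        # predict() returns +1 for inliers (benign) and -1 for outliers (anomalous)
        benign_flagged = float(np.mean(clf.predict(X_benign_test) == -1))
        malicious_flagged = float(np.mean(clf.predict(X_malicious_test) == -1))
        results.append((gamma, nu, benign_flagged, malicious_flagged))

for gamma, nu, benign_flagged, malicious_flagged in results:
    print(f"gamma={gamma:<5} nu={nu:<5} "
          f"benign flagged={benign_flagged:.2f} malicious flagged={malicious_flagged:.2f}")
```

The fraction of benign test samples that is flagged corresponds to false alarms, while the fraction of malicious samples that is flagged corresponds to detections. On the real data you would run the same kind of loop with `X_train` and `X_test` and compare the output of `assignment.prediction_report`.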

## 6. Submission
If you played around with different combinations of training and testing data and selected good parameter values for your OneClassSVM, you can submit your assignment, which will be evaluated on the `data/unknown.csv` data. Please fill out your `student_numbers` and `highscore_name`. To pass you will need to achieve an Accuracy $\geq$ 0.9 and a False Positive Rate $\leq$ 0.05.

_Hint: If you cannot seem to get a good performance, perhaps the features that you have chosen for your NIDS in the `extract()` method need to be improved, or the model parameters that you have chosen for the `OneClassSVM` are not suitable for the NIDS_

**Important notices:**
 1. This assignment is graded in pairs. **Please use your actual student numbers, e.g. `["s1234567", "s9876543"]` and only your own student numbers. We log each sign of misconduct**.
 2. The `submit` function evaluates the performance of your NIDS on the `data/unknown.csv` data. Note that this dataset is different from the data you tested with in question 5.1. Therefore, the performance of your NIDS may differ. If you are unable to achieve a good performance, you may have overfitted your NIDS on the training data. Also see the [FAQ](#FAQ) below.
 3. We store your submissions on our server and these values will be used to check if you passed the assignment or not. Therefore, make sure that you submit your prediction correctly using the `assignment.submit()` method. If any error occurs, the error message will be printed. If you are unable to fix these error messages, please contact one of the TAs.
 4. The `assignment.submit()` method requires internet access.
 5. If you want to participate in the highscores, please provide a `highscore_name`. If you **do not** want to participate in the highscores please set this field to `None`. You can check the highscores at [ml4sec.eemcs.utwente.nl](https://ml4sec.eemcs.utwente.nl/).

In [None]:
# Fill in your credentials to submit the assignment.
assignment.submit(
    student_numbers = ["<your_student_numbers>", "<your_student_numbers>"],
    highscore_name  = None, # Set to string value to enter highscores
)

<div class="alert alert-block alert-info">
<h1>FAQ</h1>
<ul>
    <li><b>My accuracy is 85-89%, why is this not enough to pass the assignment?</b></li>
    <ul>
      <li>Accuracy is a metric that takes into account the overall performance of the system. However, the metric does not take into account class balance. This means that if 85% of the data is benign and 15% is malicious, you can classify all data as benign and still get an accuracy of 85%. In such an extreme case, the NIDS would not add any value for classification. In intrusion detection, this kind of imbalance is very common as the majority of traffic is benign. Pay close attention to the other metrics as well. If your True Positive Rate is low, this means that your NIDS is more inclined to label data as benign (i.e., malicious traffic is not detected). Vice versa, if your True Negative Rate is low, benign data gets classified as malicious. These metrics should give you a hint on how you can improve your NIDS.</li>
    </ul>
  <li><b>My performance on the benign.csv dataset is very high (~99% accuracy). Why is my accuracy on the unknown.csv dataset so much worse?</b></li>
    <ul>
      <li>If this happens to you, you may be overfitting your NIDS on the benign.csv dataset. Try some different <i>apps_train</i> and <i>ratio</i> values for your split function. If the problem persists, have a look at the features that you are using. The unknown.csv data was captured at a later time than the original dataset. This means that some features will be slightly different (e.g., timestamps, dst IP address due to load balancing). Can your NIDS deal with these changes?</li>
    </ul>
</ul>
</div>
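
The class imbalance example from the first FAQ entry can be reproduced in a few lines. The labels below are synthetic, and the encoding (1 for malicious as the positive class, 0 for benign) is an assumption made for this sketch only.

```python
# 100 synthetic samples: 15 malicious (1, the positive class) and 85 benign (0)
y_true = [1] * 15 + [0] * 85
# A useless "NIDS" that labels every sample as benign
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)
print(f"accuracy = {accuracy:.2f}, TPR = {tpr:.2f}")
```

The accuracy looks reasonable, yet the True Positive Rate shows that not a single malicious sample is detected.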

## 7. Optimisation (optional)
If you passed the assignment, you can try to improve your performance by selecting different features and optimizing the `gamma` and `nu` values of your `OneClassSVM`. However, there are also other methods we can explore. Consider a benign application $B$ that only runs on our server with IP address `10.0.0.1`. We are able to detect this application in the network traffic using our previous NIDS. Now consider a malicious application $M$ that behaves very similarly to application $B$, so much so that our NIDS would be unable to distinguish application $B$ from application $M$. As we have seen in question 2.2, this is something that will occur in practice.

As a partial solution, we can try to create a more fine-grained NIDS that does not make a single model for all traffic on the network, but instead creates a model **per device**. This way, if malicious application $M$ infects any of our desktop machines `10.0.0.2-4`, we would still be able to detect that there is a new application running. This solution is not foolproof, because this new NIDS would still not be able to detect application $M$ on the server `10.0.0.1`. Nevertheless, it is a slight improvement on our original NIDS design.

**Question 7.1. (optional)** Create the NIDS described above by implementing the DeviceNIDS class. This class should support the `fit(self, X)` and `predict(self, X)` methods that take a scaled feature matrix `X` to fit and predict, respectively.

In [None]:
class DeviceNIDS(object):
    
    def __init__(self):
        # The code between this text should be implemented
#         self.classifiers = {
#             '10.0.0.1': OneClassSVM(kernel="<your_kernel>", gamma="<your_gamma>", nu="<your_nu>"),
#             '10.0.0.2': OneClassSVM(kernel="<your_kernel>", gamma="<your_gamma>", nu="<your_nu>"),
#             '10.0.0.3': OneClassSVM(kernel="<your_kernel>", gamma="<your_gamma>", nu="<your_nu>"),
#             '10.0.0.4': OneClassSVM(kernel="<your_kernel>", gamma="<your_gamma>", nu="<your_nu>"),
#         }
        pass  # Placeholder so the class definition is valid; remove once implemented
        # The code between this text should be implemented
    
    
    def fit(self, X: np.ndarray) -> 'DeviceNIDS':
        # The code between this text should be implemented
        pass
        # The code between this text should be implemented
        
        # Return self
        return self
    
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        # The code between this text should be implemented
        pass
        # The code between this text should be implemented
        
    
    
# Set the NIDS to the newly implemented DeviceNIDS
assignment.NIDS = DeviceNIDS()

# Fill in your credentials to submit the assignment.
assignment.submit(
    student_numbers = ["<your_student_numbers>", "<your_student_numbers>"],
    highscore_name  = None, # Set to string value to enter highscores
)
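
The per-device idea can be sketched independently of the assignment data. The class below is a minimal illustration, not the reference solution. It assumes that one column of the feature matrix (the hypothetical `device_col`) holds a value that is identical for all samples of the same device. How the device is actually encoded depends on your own `extract()` implementation, and the parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import OneClassSVM


class PerDeviceSketch:
    """One OneClassSVM per distinct value of a (hypothetical) device column."""

    def __init__(self, device_col=0, **svm_params):
        self.device_col = device_col   # assumption: column identifying the device
        self.svm_params = svm_params   # forwarded to every OneClassSVM
        self.classifiers = {}

    def _features(self, X):
        # Drop the device column so each model only sees the remaining features
        return np.delete(X, self.device_col, axis=1)

    def fit(self, X):
        devices = X[:, self.device_col]
        features = self._features(X)
        for device in np.unique(devices):
            mask = devices == device
            self.classifiers[device] = OneClassSVM(**self.svm_params).fit(features[mask])
        return self

    def predict(self, X):
        devices = X[:, self.device_col]
        features = self._features(X)
        # Default to -1: traffic from a device without a model is treated as anomalous
        y_pred = np.full(X.shape[0], -1, dtype=int)
        for device, clf in self.classifiers.items():
            mask = devices == device
            if mask.any():
                y_pred[mask] = clf.predict(features[mask])
        return y_pred


# Synthetic demo: two devices (0.0 and 1.0) with two random features each
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.repeat([0.0, 1.0], 50), rng.normal(size=(100, 2))])
model = PerDeviceSketch(device_col=0, kernel="rbf", gamma=0.5, nu=0.1).fit(X_demo)
y_demo = model.predict(np.array([[0.0, 0.1, 0.1], [2.0, 0.1, 0.1]]))
print(y_demo)
```

Labelling samples from an unseen device as -1 is a design choice of this sketch. You could also fall back to a single network-wide model for such samples.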