# Anomaly detection - the data
## Load settings and functions

In [None]:
%run helpers.py

## Create dataset

In [None]:
dataset = create_dataset()

<div class="alert alert-info">
<h3>Important</h3>

<p>When training the model, we only pass it the features of the samples. The model never sees the class labels, and hence cannot gain any feedback from comparing them against its own 'decisions'.<br>
Thus anomaly detection is an <b>unsupervised</b> machine learning approach.</p>
    
<p>However, we will still use the class labels here: first to analyse and visualise the data, and later to evaluate our model in more detail.</p>
</div>

## Explore the training data and visualise 1000 samples

1) We look at the distribution of different attack types in our training data.

2) We use a t-SNE plot to explore part of our training data.  
t-SNE reduces our 51 features to 2 dimensions while trying to preserve local structure, i.e. samples that are close together in the original feature space should stay close together in the plot.
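
As a rough illustration of what such a projection involves, here is a minimal sketch using scikit-learn's `TSNE` on synthetic data. The array `X`, its two-cluster layout and the chosen `perplexity` are assumptions made for this example; the helper function used in this notebook may work differently.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 300 samples with
# 51 features, drawn from two well-separated clusters.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 51)),
    rng.normal(5.0, 1.0, size=(100, 51)),
])

# Reduce the 51 features to 2 dimensions. The perplexity roughly controls
# how many neighbours each point tries to stay close to.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (300, 2)
```

Each row of `embedding` is one sample's position in the 2D plot. Distances between far-apart groups in a t-SNE plot should be read with care, since the method focuses on local neighbourhoods.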

In [None]:
explore_and_visualise_training_data(dataset)

<div class="alert alert-success">
<h2>Questions - Part 1</h2>
    
The purpose of these questions is to create some expectations of the performance of our model and the problems it might encounter.

Note: <i>These questions only make sense for training data that has been contaminated</i>.
    
What do you observe about <br>
<ul>
<li> the <b>frequency of the different malware types</b> in our data set? </li>
<li> the <b>distribution of the different malware types</b> in the t-SNE plot? </li>
<li> the <b>distribution of the normal samples</b> in the t-SNE plot? </li>
</ul>

<b>Which types of malware</b> do you expect to be <b>easy to find</b>, and which ones would be <b>harder to detect</b>?
</div>

My observations:

-  
-  

# Anomaly detection - the model
## Training our model

The **expected_contamination** is the proportion of anomalies (attacks) that we expect in the real-life test set. This value is independent of the *proportion of attacks* we chose to place in the training data.

**PCA (Principal Component Analysis)** transforms our data by identifying the directions (i.e. combinations of features) along which the data varies most. This might help the Isolation Forest algorithm isolate anomalies faster, or it might not.
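
To make these two settings more concrete, here is a minimal sketch using scikit-learn's `IsolationForest` and `PCA` on synthetic data. It is not the implementation of `build_anomaly_detector`; the function `fit_detector`, the parameter `with_pca`, the synthetic arrays and `n_components=5` are all assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data (assumption):
# 950 "normal" rows plus 50 shifted rows, i.e. 5% actual contamination.
X_normal = rng.normal(0.0, 1.0, size=(950, 10))
X_attack = rng.normal(4.0, 1.0, size=(50, 10))
X_train = np.vstack([X_normal, X_attack])


def fit_detector(X, expected_contamination=0.1, with_pca=False):
    """Fit an Isolation Forest, optionally after a PCA step (hypothetical helper)."""
    forest = IsolationForest(contamination=expected_contamination, random_state=0)
    steps = [PCA(n_components=5), forest] if with_pca else [forest]
    return make_pipeline(*steps).fit(X)


detector = fit_detector(X_train, expected_contamination=0.1)

# Negative decision scores are treated as anomalies.
scores = detector.decision_function(X_train)
flagged_fraction = float(np.mean(scores < 0))
print(flagged_fraction)
```

In this sketch about 10% of the training rows receive a negative score even though only 5% of them are synthetic attacks: the contamination setting shifts the decision threshold, regardless of how many attacks the training data really contains.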

In [None]:
model, dataset = build_anomaly_detector(
    dataset,
    expected_contamination=0.1,  # value between 0 and 0.5
    with_PCA=False,              # True / False
)

## Evaluating the performance

1) We take a look at the scores that the **decision function** assigns to our samples. Samples with negative scores are marked as anomalies.
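
The sign rule can be sketched in a few lines of plain Python. The scores and labels below are made-up values for illustration only, and `count_errors` is a hypothetical helper, not part of this notebook's code.

```python
# Hypothetical decision scores and true labels (True = attack).
scores = [0.12, 0.08, -0.03, 0.15, -0.10, 0.02, -0.01, 0.09]
is_attack = [False, False, False, False, True, True, True, False]


def count_errors(scores, is_attack):
    """Count both kinds of misprediction under the 'negative score = anomaly' rule."""
    pairs = list(zip(scores, is_attack))
    # Normal samples with a negative score are flagged wrongly (false alarms).
    false_alarms = sum(1 for score, attack in pairs if score < 0 and not attack)
    # Attacks with a non-negative score slip through (missed attacks).
    missed_attacks = sum(1 for score, attack in pairs if score >= 0 and attack)
    return false_alarms, missed_attacks


false_alarms, missed_attacks = count_errors(scores, is_attack)
print(false_alarms, missed_attacks)  # 1 1
```

Keeping the two error counts separate matters because their costs usually differ: a false alarm costs an analyst some time, while a missed attack goes unnoticed.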

In [None]:
evaluate_model(dataset, model)

<div class="alert alert-success">
<h2>Questions - Part 2</h2>

<p>Let's take a look at the results above. First get an overview of everything, then try to answer the following questions:</p>

<ol>
<li>  In the first plot, how many blue samples are found on the left-hand side?</li>
<li>  In the first plot, how many orange samples are found on the right-hand side?</li>
<li>  Which of these two kinds of misprediction is worse for us?</li>
</ol>

</div>

My observations:

-  
-  

## Taking a more detailed look at our performance

1) We look at the performance of our predictions at the level of individual attack types.

2) We locate the misclassified samples (triangles, colour gives true label) inside our test set.
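
A per-type breakdown of this kind can be sketched in plain Python as follows. The type names `"dos"` and `"probe"`, the example values and the helper `detection_rate_by_type` are assumptions for illustration; the attack types in this data set may be named differently.

```python
from collections import Counter

# Hypothetical true types and model decisions (True = flagged as anomaly).
true_types = ["normal", "dos", "dos", "probe", "probe", "normal", "dos", "probe"]
predicted_anomaly = [False, True, True, False, True, True, True, False]


def detection_rate_by_type(true_types, predicted_anomaly):
    """Return the fraction of samples flagged as anomalous for each true type."""
    totals = Counter(true_types)
    flagged = Counter(
        sample_type
        for sample_type, is_flagged in zip(true_types, predicted_anomaly)
        if is_flagged
    )
    return {sample_type: flagged[sample_type] / totals[sample_type] for sample_type in totals}


rates = detection_rate_by_type(true_types, predicted_anomaly)
for sample_type, rate in rates.items():
    print(f"{sample_type}: {rate:.2f}")
```

For attack types the value is the share of attacks that were detected; for the `"normal"` type it is the false-alarm rate.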

In [None]:
detailed_evaluation(dataset, model)

<div class="alert alert-success">
<h2>Questions - Part 3</h2>

Let's compare the expectations we built in Part 1 with the actual performance.

<ol>
<li>  For which types of malware did our model perform well, and for which did it do poorly?</li>
<li>  Does this match the observations and expectations we made in Part 1?</li>
</ol>   
<p>Dark blue triangles indicate false alarms, i.e. normal behaviour that was predicted as malware. <br>
The other triangles indicate malware that was not detected.</p>

<ol start="3">
<li>  What do you observe about the location of the false alarms?</li>
<li>  What do you observe about the location of undetected malware?</li>
<li>  Does this match the observations and expectations we made in Part 1?</li>
</ol>

</div>

My observations:

-  
-  

# Additional tasks

<div class="alert alert-success">
<h3>Task 1</h3>

Let's vary the `expected_contamination` parameter, which the model uses to make decisions. <br>
You only need to rerun the cells in section 2 "Anomaly detection - the model". <br>
What do you observe about the distribution of the outlier scores and the position of the decision boundary?<br>
</div>

<div class="alert alert-success">
<h3>Task 2</h3>

You need to rerun the entire notebook here. <br>
1. Vary the size of your training data.<br> 
2. Vary the proportion of attacks in your training data.<br>
3. In particular, let's see how the model performs when we use only completely clean training data without any anomalies.<br>
</div>

<div class="alert alert-success">
<h3>Task 3</h3>
    
Let's see whether using PCA before training our model helps. <br> 
You only need to rerun the cells in section 2 "Anomaly detection - the model". <br>
    
Set `with_PCA=True` in the call to `build_anomaly_detector`. <br>
</div>