<div class="alert alert-danger">
<h3>Setup for Google Colab only (otherwise skip the first cell)</h3>
</div>

In [None]:
#@title << Setup Google Colab by running this cell {display-mode: "form"}
import sys
if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/amld20-anomaly-detection.git
        
    # Copy files required to run the code
    !cp -r "amld20-anomaly-detection/data" "amld20-anomaly-detection/anomaly_helpers.py" .
    
    # Install packages via pip
    !pip install -r "amld20-anomaly-detection/colab-requirements.txt"
    
    # Kill the current process so that Colab restarts the runtime and picks up the new packages
    import os
    os.kill(os.getpid(), 9)

## Load settings and functions

In [None]:
%run anomaly_helpers.py

## Create dataset

In [None]:
dataset = create_dataset()

<div class="alert alert-info">
<h3>Important</h3>

<p>When training the model, we will only pass it the features of the samples. The model never sees the class labels, and hence cannot get any feedback from comparing them against its own 'decisions'.<br>
We call this <b>unsupervised learning</b>.</p>
    
<p>However, here we will use the class labels to gain further insight when analysing and visualising the data, and later to evaluate our model in more detail.</p>
</div>

## Explore the training data and visualise 1000 samples

1) We look at the distribution of different attack types in our training data.

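2) We use a t-SNE plot to explore part of our training data.  
The t-SNE plot reduces our 51 features to 2 dimensions while trying to represent local structure faithfully: samples that are close together in the original feature space should stay close together in the plot.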

In [None]:
explore_and_visualise_training_data(dataset)
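
The dimensionality reduction described above can be sketched on synthetic data. This is a minimal illustration, assuming scikit-learn's `TSNE`; the notebook's helper function is not shown here, so it may be implemented differently. The array `X_tsne_demo` and its two clusters are made up for this example and only mimic the 51-feature shape of our data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the training features: 300 samples, 51 features,
# drawn from two well-separated clusters (hypothetical data, not the real set)
X_tsne_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 51)),
    rng.normal(4.0, 1.0, size=(100, 51)),
])

# Project the 51 features down to 2 dimensions for plotting
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_tsne_demo)

print(X_2d.shape)  # one 2-D point per sample
```

Each row of `X_2d` can then be drawn as a point in a scatter plot and coloured by its class label.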

## Training our model

The **expected_contamination** is the proportion of anomalies (attacks) that we expect in the real-life test set. This value is independent of the *proportion of attacks* we chose to place in the training data.

**PCA (Principal Component Analysis)** transforms our data by identifying the directions (= combinations of features) along which the data varies most. This might help the Isolation Forest algorithm to isolate anomalies faster, or it might not.

In [None]:
model, dataset = build_anomaly_detector(
    dataset,
    expected_contamination=0.1,  # value between 0 and 0.5
    with_PCA=False,              # True / False
)
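
To make the two parameters more concrete, here is a rough sketch of how such a detector could be assembled with scikit-learn. It is an assumption, not the actual code behind `build_anomaly_detector`: the function `sketch_detector`, the toy array `X_toy` and the choice to keep 95% of the variance in the PCA step are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline


def sketch_detector(X, expected_contamination=0.1, with_PCA=False):
    """Hypothetical stand-in for build_anomaly_detector, for illustration only."""
    steps = []
    if with_PCA:
        # Keep enough principal components to explain 95% of the variance
        # (an arbitrary choice made for this sketch)
        steps.append(PCA(n_components=0.95))
    steps.append(
        IsolationForest(contamination=expected_contamination, random_state=0)
    )
    # Only the features are passed to fit: no labels, so training is unsupervised
    return make_pipeline(*steps).fit(X)


rng = np.random.default_rng(0)

# Toy data: 450 "normal" samples plus 50 widely scattered ones
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(450, 10)),
    rng.uniform(-6.0, 6.0, size=(50, 10)),
])

for use_pca in (False, True):
    detector = sketch_detector(X_toy, expected_contamination=0.1, with_PCA=use_pca)
    # predict returns -1 for samples flagged as anomalies and 1 otherwise
    flagged_fraction = np.mean(detector.predict(X_toy) == -1)
    print(f"with_PCA={use_pca}: flagged fraction = {flagged_fraction:.3f}")
```

On the training data itself, the flagged fraction should land close to the contamination value, because scikit-learn sets the decision threshold from that parameter.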

## Evaluating the performance

1) We take a look at the scores assigned by the **decision function** to our samples. Samples with negative scores are marked as anomalies.

In [None]:
evaluate_model(dataset, model)
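
The link between negative scores and anomaly labels can be checked directly. The sketch below assumes the model is a scikit-learn `IsolationForest`, as the text suggests; `X_score_demo` is made-up data and `forest` is a fresh model, not the one trained above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Made-up data: a dense cluster plus some widely scattered points
X_score_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(450, 5)),
    rng.uniform(-8.0, 8.0, size=(50, 5)),
])

forest = IsolationForest(contamination=0.1, random_state=0).fit(X_score_demo)

scores = forest.decision_function(X_score_demo)  # shifted scores, threshold at 0
labels = forest.predict(X_score_demo)            # -1 = anomaly, 1 = normal

# A sample is labelled an anomaly exactly when its score is negative
print(np.array_equal(scores < 0, labels == -1))

# The decision function is the raw score shifted by the fitted offset
raw_scores = forest.score_samples(X_score_demo)
print(np.allclose(scores, raw_scores - forest.offset_))
```

In scikit-learn the offset is derived from the `contamination` parameter, so that parameter controls where the decision boundary sits.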

## Taking a more detailed look at our performance

1) We look at the performance of our predictions at the level of individual attack types.

2) We locate the misclassified samples (triangles, colour gives true label) inside our test set.

In [None]:
detailed_evaluation(dataset, model)
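
A per-attack-type breakdown like the one in point 1 can be computed with a simple group-by. This is only a sketch with a hand-made table: the attack names `dos` and `probe`, the column names and the values are hypothetical and are not taken from our dataset or from `detailed_evaluation`.

```python
import pandas as pd

# Hand-made results table: true type of each sample and whether it was flagged
results = pd.DataFrame({
    "attack_type": ["normal", "normal", "normal", "dos", "dos",
                    "probe", "probe", "probe"],
    "predicted_anomaly": [False, False, True, True, True,
                          True, False, False],
})

# Fraction of samples flagged as anomalies within each type
flag_rate = results.groupby("attack_type")["predicted_anomaly"].mean()
print(flag_rate)
```

For attack types this fraction is the detection rate; for the normal class it is the false-alarm rate.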

<div class="alert alert-success">
<h3>Task 1</h3>

Vary the `expected_contamination` parameter, which the model uses to make its decisions.<br>
What do you observe about the distribution of the outlier scores and the position of the decision boundary?<br>
</div>

<div class="alert alert-success">
<h3>Task 2</h3>
    
Vary the size of your training data.<br>
Vary the proportion of attacks in your training data.<br>
In particular, see how the model performs when the training data is totally clean (no attacks).<br>
</div>

<div class="alert alert-success">
<h3>Task 3</h3>
    
Train your model with PCA by setting `with_PCA = True`.
</div>