# Clustering Autoencoder Latent Space
-------------------------------------

In this notebook, we will use the latent space of the trained autoencoder to cluster the netflow data.

In [1]:
import sys
sys.path.append("../..")
import numpy as np

# 1. Load the trained autoencoder
-------------------------------------
- We previously trained an autoencoder on the netflow data.
- The checkpoints are saved in logs/autoencoder/[VERSION]
- We will load the weights of the trained model to run inference
- Recall that during the inference phase we only use the encoder part of the autoencoder to produce the latent representation of the netflow data

In [10]:
CHECKPOINT_PATH = r"F:\Docs\GS FORMATION\Machine Learning - Epita\Network Anomaly detection\logs\autoencoder_old\debug2\autoencoder-epoch=01-val_loss=0.016.ckpt"

[https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html](https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html)

In [13]:
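##########################################################
###Write your code here to load the trained autoencoder###
##########################################################
import torch
from network_ad.unsupervised.autoencoder_lightning import Autoencoder

# Hyperparameters must match the ones used when the checkpoint was trained
HIDDEN_DIM = 256
LATENT_DIM = 8
LEARNING_RATE = 1e-4
NUM_WORKERS = 1
N_EPOCHS = 2
DROPOUT_RATE = 0
INPUT_DIM = 704

model = Autoencoder(input_dim=INPUT_DIM,
                    hidden_dim=HIDDEN_DIM,
                    latent_dim=LATENT_DIM,
                    learning_rate=LEARNING_RATE,
                    dropout_rate=DROPOUT_RATE)

# Assuming the .ckpt file was written by Lightning, the weights sit under the
# "state_dict" key. Passing the whole checkpoint dict with strict=False would
# silently load nothing.
checkpoint = torch.load(CHECKPOINT_PATH, map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()  # switch to inference mode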


# 2. Load the netflow data
-------------------------------------
- We will load the netflow data to get the latent representation of the data
- We will use the datamodule to load the data
- But we will only use the test data

In [15]:
from network_ad.config import VAL_RATIO
from network_ad.unsupervised.autoencoder_datamodule import AutoencoderDataModule
BATCH_SIZE = 64
data_module = AutoencoderDataModule(batch_size=BATCH_SIZE, val_ratio=VAL_RATIO)
data_module.setup()

In [None]:
##########################################################
###Write your code here to print a sample ################
##########################################################

# 3. Inference
-------------------------------------
- We will use the encoder part of the autoencoder to get the latent representation of the netflow data

In [None]:
from network_ad.unsupervised.autoencoder import Autoencoder
###################################################################
## Write your code here to get the latent representation##########
## TIP: You should write a loop to iterate over the test dataloader
###################################################################
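
A minimal sketch of such a loop is shown below. It does not use the real model or datamodule: the small `encoder` network and the random tensors are stand-ins, and the idea that the trained model exposes its encoder as a submodule (for instance `model.encoder`) is an assumption to check against the `Autoencoder` class.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

INPUT_DIM, LATENT_DIM = 704, 8

# Stand-in encoder; with the trained model this might be `model.encoder` (assumption).
encoder = nn.Sequential(nn.Linear(INPUT_DIM, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
encoder.eval()  # disable dropout and other training-only behavior

# Stand-in for the test dataloader: 200 random flows in batches of 64.
test_loader = DataLoader(TensorDataset(torch.randn(200, INPUT_DIM)), batch_size=64)

latent_batches = []
with torch.no_grad():  # gradients are not needed at inference time
    for (batch,) in test_loader:
        latent_batches.append(encoder(batch).cpu().numpy())

# Stack the per-batch outputs into one (n_samples, latent_dim) array.
latent = np.concatenate(latent_batches, axis=0)
print(latent.shape)  # (200, 8)
```

The real dataloader may yield plain tensors rather than one-element tuples, in which case the loop variable needs no unpacking.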


# 4. Latent Space Visualization
-------------------------------------
- Make use of the plotly library to visualize the latent space
- We will first perform PCA on the latent space to reduce the dimensionality to 3 or 2

## 4.1 Perform PCA

- Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.
- We will use the PCA class from the sklearn library to perform PCA on the latent space

In [None]:
from sklearn.decomposition import PCA

In [8]:
##########################################################
###Write your code here to perform PCA####################
# Recommendation :  Use 3 principal components and create
# a new dataframe with the columns ['PC1', 'PC2', 'PC3']
##########################################################
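
As an illustration of the recommendation above, here is a small sketch on random data. The array `latent` is a placeholder for the encoder output, and the name `df_pca` is our own choice.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 8))  # placeholder for the (n_samples, 8) latent vectors

pca = PCA(n_components=3)
components = pca.fit_transform(latent)  # shape (200, 3)

df_pca = pd.DataFrame(components, columns=["PC1", "PC2", "PC3"])
print(df_pca.shape)
print(pca.explained_variance_ratio_)  # share of the variance kept by each component
```

Checking `explained_variance_ratio_` gives a rough idea of how much of the latent structure survives the projection to 3 dimensions.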

## 4.2 Visualize the latent space

-------------------------------------
- Make use of the plotly scatter plot to visualize the latent space
[https://plotly.com/python/hover-text-and-formatting/](https://plotly.com/python/hover-text-and-formatting)

In [27]:
import plotly.express as px

### 4.2.1 Load some raw data to annotate the plot

We use the following columns: 'NETWORK_IPV4_SRC_ADDR', 'NETWORK_IPV4_DST_ADDR', 'MAX_TTL',
'Label', 'Attack'

In [22]:
# We use the method load_data that we have implemented in the datamodule
df = data_module.load_data(mode="test")
# Only keep the columns that we need
df = df[['NETWORK_IPV4_SRC_ADDR', 'NETWORK_IPV4_DST_ADDR','MAX_TTL', 'Label','Attack']]
df.head()

| | NETWORK_IPV4_SRC_ADDR | NETWORK_IPV4_DST_ADDR | MAX_TTL | Label | Attack |
|---|---|---|---|---|---|
| 129520 | 59.166.0 | 149.171.126 | 32 | 0 | Benign |
| 274350 | 59.166.0 | 149.171.126 | 31 | 0 | Benign |
| 371847 | 59.166.0 | 149.171.126 | 32 | 0 | Benign |
| 218043 | 59.166.0 | 149.171.126 | 32 | 0 | Benign |
| 183089 | 59.166.0 | 149.171.126 | 32 | 0 | Benign |


### 4.2.2.  Visualize the latent space in 2D

----------------------------------------------
- Use the 2 principal components to visualize the latent space in 2D
- Use the plotly `px.scatter` function to visualize the latent space
- At this stage, you can use the `hover_data` argument of `px.scatter` to display the columns 'NETWORK_IPV4_SRC_ADDR', 'NETWORK_IPV4_DST_ADDR', 'MAX_TTL'
- Make sure to update the dataframe of principal components with the columns ['NETWORK_IPV4_SRC_ADDR', 'NETWORK_IPV4_DST_ADDR', 'MAX_TTL']

In [30]:
#############################################################
###Write your code here to visualize the latent space in 2D##
#############################################################

### 4.2.3.  Visualize the latent space in 3D
----------------------------------------------
- Similarly, use the 3 principal components to visualize the latent space in 3D
- Use the plotly scatter3d to visualize the latent space
- Example : [https://plotly.com/python/3d-scatter-plots/](https://plotly.com/python/3d-scatter-plots/)

In [31]:
#############################################################
###Write your code here to visualize the latent space in 3D##
#############################################################

# 5. Clustering

-------------------------------------
- Now let's use the KMeans algorithm to cluster the latent space
- We will use the KMeans class from the sklearn library

In [None]:
from sklearn.cluster import KMeans

## 5.1 KMeans Clustering with a fixed number of clusters
-------------------------------------
- We will first use a fixed number of clusters to perform the clustering
- We perform clustering, then update the dataframe with the cluster label
- We will then visualize the clusters in 2D and 3D by coloring the points according to the cluster label

In [33]:
# TODO : Define the number of clusters below
N_CLUSTERS = None

In [37]:
##########################################################
###Write your code here to perform KMeans clustering######
## TIP : Don't use the PCA components but the full
## dimension of the latent space
##########################################################

In [None]:
##########################################################
###Update the dataframe with the cluster label.############
##Recommendation : Add a new column 'Cluster' to the dataframe
##########################################################
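
Here is a minimal sketch of these two steps on synthetic data. The two Gaussian blobs stand in for the latent vectors, `n_clusters = 2` is an illustrative value only, and `df_pca` is a hypothetical name for the dataframe of principal components.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder latent vectors: two well-separated blobs in 8 dimensions.
latent = np.vstack([rng.normal(0, 1, size=(100, 8)),
                    rng.normal(10, 1, size=(100, 8))])
# Stand-in for the PCA dataframe (here simply the first 3 latent dimensions).
df_pca = pd.DataFrame(latent[:, :3], columns=["PC1", "PC2", "PC3"])

n_clusters = 2  # illustrative value only
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = kmeans.fit_predict(latent)  # fit on the full latent space, not on the PCA columns

# Store labels as strings so that plotly colors them as categories,
# not as a continuous scale.
df_pca["Cluster"] = labels.astype(str)
print(df_pca["Cluster"].value_counts())
```

Fixing `random_state` makes the cluster assignment reproducible between runs.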

### Visualize the clusters (In 2D and 3D)
----------------------------------------------
- This is similar to steps 4.2.2 and 4.2.3, but this time we will color the points according to the cluster label

In [38]:
#############################################################
###Write your code here to visualize the clusters in 2D and 3D##
## Recommendation : Do not copy code but refactor to create a
## visualization function that takes the dataframe and the
## columns to visualize as arguments
#############################################################


## 5.2 KMeans Clustering with an optimal number of clusters
-------------------------------------
- We will use the elbow method to find the optimal number of clusters
- We run the KMeans algorithm for a range of cluster counts and plot the inertia for each

In [39]:
NUM_CLUSTERS =  list(range(1, 16))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,15]

In [None]:
num_clusters_and_inertia = dict() # Store the number of clusters and the inertia
##########################################################
###Write your code here to perform KMeans clustering######
## A loop that will iterate over the number of clusters
##########################################################
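
A sketch of this loop on synthetic data follows. The three Gaussian blobs are placeholders for the latent vectors, and the range is shortened to keep the example fast.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder latent vectors: three separated blobs in 8 dimensions.
latent = np.vstack([rng.normal(c, 1.0, size=(100, 8)) for c in (0, 10, 20)])

num_clusters_and_inertia = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(latent)
    # inertia_ is the sum of squared distances of samples to their closest centroid.
    num_clusters_and_inertia[k] = km.inertia_

for k, inertia in num_clusters_and_inertia.items():
    print(k, round(inertia, 1))
```

Inertia always decreases as the number of clusters grows, which is why we look for the point where the decrease slows down instead of the minimum.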

#### Inertia plot

In [43]:
#############################################################
###Write your code here to plot the inertia vs the number of clusters##
## Use matplotlib or plotly
#############################################################

In [42]:
## TODO : Guess the optimal number of clusters visually by looking at the elbow

#### Elbow Method (programmatically)
-------------------------------------
The following code will help you to find the optimal number of clusters programmatically

In [None]:
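def find_optimal_number_of_clusters(num_clusters_and_inertia):
    """
    Find the optimal number of clusters using the elbow method.

    The elbow is taken as the point where the inertia curve bends the most,
    i.e. where its discrete second derivative is largest. The dictionary is
    assumed to be ordered by increasing, evenly spaced numbers of clusters.
    """
    cluster_counts = list(num_clusters_and_inertia.keys())
    inertias = list(num_clusters_and_inertia.values())
    # First derivative: change in inertia between consecutive cluster counts
    first_derivative = np.diff(inertias, 1)
    # Second derivative: change in that slope
    second_derivative = np.diff(first_derivative, 1)
    # second_derivative[i] is centered on cluster_counts[i + 1]
    optimal_number_of_clusters = cluster_counts[int(np.argmax(second_derivative)) + 1]
    return optimal_number_of_clusters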
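
To see what the second derivative does, here is the same computation worked through on a small set of invented inertia values, written out step by step.

```python
import numpy as np

# Illustrative values only: a sharp drop in inertia between 2 and 3 clusters.
inertia = {1: 1000.0, 2: 900.0, 3: 200.0, 4: 180.0, 5: 170.0}
ks = list(inertia)

first = np.diff(list(inertia.values()))  # [-100., -700.,  -20.,  -10.]
second = np.diff(first)                  # [-600.,  680.,   10.]

# second[i] describes the bend at ks[i + 1]; the largest bend is the elbow.
best_k = ks[int(np.argmax(second)) + 1]
print(best_k)  # 3
```

The heuristic picks the cluster count after which adding more clusters stops reducing inertia substantially. On noisy curves it can disagree with a visual reading of the plot.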

In [41]:
# Question : Is your guess correct? (Run the function find_optimal_number_of_clusters)

In [44]:
#############################################################
###Visualize the clusters with the optimal number of clusters(2D and 3D)##
#############################################################

## 5.3 KMeans clustering and True Labels
-------------------------------------
- Note that the true labels are available in the dataframe:
  - Binary labels in the column 'Label'
  - Multiclass labels in the column 'Attack'

In [47]:
#############################################################
###Write your code here to visualize the clusters with the true labels(2D and 3D)##
### TODO : Color the points according to the true labels
##  Compare the clusters with the true labels (Visual inspection)
#############################################################
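
Beyond visual inspection, a contingency table and an agreement score can complement the comparison. This is an addition to the exercise, not a required step. The sketch uses a toy dataframe with hypothetical values; the real dataframe would carry the 'Cluster' and 'Attack' columns.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy data for illustration only.
df_demo = pd.DataFrame({
    "Cluster": ["0", "0", "0", "1", "1", "1", "1", "0"],
    "Attack": ["Benign", "Benign", "Benign", "DoS", "DoS", "DoS", "Benign", "Benign"],
})

# Rows: clusters, columns: true classes, cells: number of points.
table = pd.crosstab(df_demo["Cluster"], df_demo["Attack"])
print(table)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(df_demo["Attack"], df_demo["Cluster"])
print(round(ari, 3))
```

A cluster whose row is dominated by a single class is consistent with the true labels; a row spread across several classes is not.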

In [48]:
#Question : Are the clusters consistent with the true labels?

# 6. Clustering Characterization
-------------------------------------
- In this last section, we characterize clusters by analysing the similarity between the points in the same cluster

*Global Question* : What are the features common to the points in the same cluster?

- Guidelines:
  - Enrich the dataframe with other columns (features) that were used to train the autoencoder
  - Choose the features that you think are relevant to characterize the clusters
  - In the plotly scatter plot, use the argument `hover_data` to display the columns that you have chosen
  - Inspect the plot and infer the common properties of the points in the same cluster
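
Alongside the hover-based inspection, a per-cluster summary table can help spot common properties. The sketch below is one possible approach on an invented toy dataframe; the column choices and the aggregate names (`n_points`, `median_ttl`, `top_dst`) are hypothetical.

```python
import pandas as pd

# Toy data for illustration only.
df_demo = pd.DataFrame({
    "Cluster": ["0", "0", "1", "1", "1"],
    "MAX_TTL": [32, 31, 254, 254, 255],
    "NETWORK_IPV4_DST_ADDR": ["149.171.126", "149.171.126", "10.0.0", "10.0.0", "149.171.126"],
})

summary = df_demo.groupby("Cluster").agg(
    n_points=("MAX_TTL", "size"),          # cluster size
    median_ttl=("MAX_TTL", "median"),      # typical value of a numeric feature
    top_dst=("NETWORK_IPV4_DST_ADDR", lambda s: s.mode().iloc[0]),  # most frequent value
)
print(summary)
```

Features whose summary differs strongly between clusters are good candidates to display through `hover_data`.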
