# Clustering Autoencoder Latent Space
-------------------------------------

In this notebook, we will use the latent space of the trained autoencoder to cluster the netflow data.

In [1]:
import sys
sys.path.append("../..")
import numpy as np

# 1. Load the trained autoencoder
-------------------------------------
- We previously trained an autoencoder on the netflow data.
- The checkpoints are saved in logs/autoencoder/[VERSION]
- We will load the weights of the trained model to run inference
- Recall that during the inference phase, we only use the encoder part of the autoencoder to produce the latent representation of the netflow data
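To make the last point concrete, here is a minimal sketch of encoder-only inference. `TinyAutoencoder`, its layer sizes and the random input batch are hypothetical stand-ins, not the trained model from this project; the only thing borrowed from the notebook is the idea of exposing an `encoder` sub-module.

```python
import torch
from torch import nn


class TinyAutoencoder(nn.Module):
    """Hypothetical stand-in for the trained model; layer sizes are illustrative only."""

    def __init__(self, input_dim: int = 16, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 12), nn.ReLU(), nn.Linear(12, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 12), nn.ReLU(), nn.Linear(12, input_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Training uses the full path: encode, then reconstruct the input
        return self.decoder(self.encoder(x))


toy_model = TinyAutoencoder()
toy_model.eval()
toy_batch = torch.randn(4, 16)  # 4 fake samples with 16 features each

with torch.no_grad():
    toy_codes = toy_model.encoder(toy_batch)   # inference: encoder only
    toy_reconstruction = toy_model(toy_batch)  # full autoencoder, for comparison

print(toy_codes.shape)           # torch.Size([4, 8])
print(toy_reconstruction.shape)  # torch.Size([4, 16])
```

The latent codes have `latent_dim` columns, while the reconstruction has the same width as the input.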

In [2]:
from network_ad.config import LOGS_DIR
CHECKPOINT_PATH = LOGS_DIR/"autoencoder/v3_latent_dim8_2_hidden/autoencoder-epoch=19-val_loss=0.012.ckpt"

[https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html](https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html)

In [3]:
##########################################################
###Write your code Here to load the trained autoencoder###
##########################################################
from network_ad.unsupervised.autoencoder_lightning import Autoencoder
model = Autoencoder.load_from_checkpoint(CHECKPOINT_PATH)

# 2. Load the netflow data
-------------------------------------
- We will load the netflow data to get the latent representation of the data
- We will use the datamodule to load the data
- But we will only use the test data

In [4]:
from network_ad.config import VAL_RATIO
from network_ad.unsupervised.autoencoder_datamodule import AutoencoderDataModule
BATCH_SIZE = 64
data_module = AutoencoderDataModule(batch_size=BATCH_SIZE, val_ratio=VAL_RATIO)
data_module.setup()

In [5]:
##########################################################
###Write your code here to get and print a sample ########
##########################################################
#1. Train dataloader
train_dataloader = data_module.train_dataloader()
first_batch= next(iter(train_dataloader))

In [6]:
first_batch

tensor([[-0.1947, -0.1909, -0.1728,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1947, -0.2153,  0.9181,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1947, -0.1909, -0.1728,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [-0.1947, -0.1909, -0.1728,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1947, -0.1909, -0.1728,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.2406,  5.2263, -4.5364,  ...,  0.0000,  0.0000,  0.0000]])

In [7]:
first_batch.shape

torch.Size([64, 704])

# 3. Inference
-------------------------------------
- We will use the encoder part of the autoencoder to get the latent representation of the netflow data

In [8]:
import torch
from tqdm import tqdm
###################################################################
## Write your code here to get the latent representation##########
## TIPS: You should write a loop to iterate over the test dataloader
###################################################################
test_dataloader = data_module.test_dataloader()
test_outputs = []

# Switch to evaluation mode so that layers such as dropout or batch norm (if any) behave deterministically
model.eval()
with torch.no_grad():
    # torch.no_grad() disables gradient tracking, which is only needed for training, and so reduces memory consumption
    for batch in tqdm(test_dataloader, "Running inference on test dataloader"):
        batch = batch.to(model.device)
        # Only the encoder is needed to obtain the latent representation
        test_outputs.append(model.encoder(batch).cpu())

Running inference on test dataloader: 100%|████████████████████████████████████████| 7470/7470 [03:16<00:00, 38.00it/s]


In [9]:
latent_vectors = torch.concat(test_outputs, dim=0).numpy()

In [11]:
latent_vectors.shape

(478055, 8)

# 4. Latent Space Visualization
-------------------------------------
- Make use of the plotly library to visualize the latent space
- We will first perform PCA on the latent space to reduce the dimensionality to 3 or 2

## 4.1 Perform PCA

- Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.
- We will use the PCA class from the sklearn library to perform PCA on the latent space
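As a rough illustration of what PCA computes, the sketch below reproduces the projection and the explained variance ratio with plain NumPy on synthetic data. The array `toy_points` and its per-axis scales are made up for the example. The signs of the projected axes may differ from scikit-learn's output, since scikit-learn applies its own sign convention.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the latent vectors: 200 points in 8 dimensions with unequal spread per axis
toy_points = rng.normal(size=(200, 8)) * np.array([5.0, 2.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5])

# PCA: centre the data, then take its singular value decomposition
centered = toy_points - toy_points.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Fraction of the total variance carried by each principal axis (sorted from largest to smallest)
toy_explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Project the centred data onto the first three principal axes
toy_points_3d = centered @ components[:3].T

print(toy_explained_variance_ratio[:3])
print(toy_points_3d.shape)  # (200, 3)
```

The first entries of the ratio indicate how much of the variance survives the reduction to 2 or 3 dimensions.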

In [12]:
# Instantiate the PCA class
from sklearn.decomposition import PCA
reducer_pca = PCA(n_components=3)
# Fit the PCA on the latent space
latent_vectors_3d = reducer_pca.fit_transform(latent_vectors)
print(reducer_pca.explained_variance_ratio_)

[0.65140598 0.14013773 0.07283445]


### 4.1.2. Alternative to PCA: UMAP

UMAP is a non-linear dimensionality reduction method that often separates groups of points more clearly than PCA, at a higher computational cost: https://umap-learn.readthedocs.io/en/latest/

In [13]:
# As an alternative to PCA, UMAP can be used to visualize the latent space
#!pip install umap-learn
# import umap
# reducer_umap = umap.UMAP(n_components=3, verbose=True)
# latent_vectors_3d = reducer_umap.fit_transform(latent_vectors)


### 4.1.3. Create a new dataframe with the columns ['PC1', 'PC2', 'PC3']

Each row of the dataframe will represent a projected netflow point in the latent space

In [14]:
##########################################################
###Write your code here to perform PCA####################
# Recommendation :  Use 3 principal components and create
# a new dataframe with the columns ['PC1', 'PC2', 'PC3']
##########################################################
import pandas as pd
df_latent = pd.DataFrame(latent_vectors_3d, columns=['PC1', 'PC2', 'PC3'])

## 4.2 Visualize the latent space

-------------------------------------
- Make use of the plotly scatter plot to visualize the latent space
[https://plotly.com/python/hover-text-and-formatting/](https://plotly.com/python/hover-text-and-formatting)

In [15]:
import plotly.express as px

### 4.2.1 Load some raw data to annotate the plot

We use the following columns: 'NETWORK_IPV4_SRC_ADDR', 'NETWORK_IPV4_DST_ADDR', 'HOST_IPV4_DST_ADDR', 'MAX_TTL', 'Label', 'Attack'

In [16]:
# We use the load_data method that we implemented in the datamodule
df_annotation = data_module.load_data(mode="test")
# Only keep the columns that we need
df_annotation = df_annotation[['NETWORK_IPV4_SRC_ADDR', 'NETWORK_IPV4_DST_ADDR', 'HOST_IPV4_DST_ADDR', 'MAX_TTL', 'Label', 'Attack']]
# Convert each column to string
df_annotation = df_annotation.astype(str)
df_annotation.head()

Unnamed: 0,NETWORK_IPV4_SRC_ADDR,NETWORK_IPV4_DST_ADDR,HOST_IPV4_DST_ADDR,MAX_TTL,Label,Attack
0,59.166.0,149.171.126,1,32,0,Benign
1,59.166.0,149.171.126,1,31,0,Benign
2,59.166.0,149.171.126,9,32,0,Benign
3,59.166.0,149.171.126,9,32,0,Benign
4,59.166.0,149.171.126,6,32,0,Benign


### 4.2.2. Visualize the latent space in 2D

----------------------------------------------
- Use the first 2 principal components to visualize the latent space in 2D
- Use the plotly `px.scatter` function to draw the plot
- At this stage, you can use the `hover_data` argument of `px.scatter` to display annotation columns such as 'HOST_IPV4_DST_ADDR' and 'MAX_TTL'
- Make sure to add these annotation columns to the dataframe of principal components before plotting

In [None]:
#############################################################
###Write your code here to visualize the latent space in 2D##
#############################################################
def visualize_latent_space_2D(df_latent, df_annotation, label='Attack', benign_frac=10):
    """
    Visualize the latent space in 2D.

    benign_frac is the percentage of benign points to keep.
    """
    df = pd.concat([df_latent, df_annotation], axis=1)
    # Only keep benign_frac % of the benign data to avoid cluttering the plot
    is_benign = df['Attack'] == 'Benign'
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    df = pd.concat([df[is_benign].sample(frac=benign_frac / 100), df[~is_benign]])

    fig = px.scatter(df, x='PC1',
                     y='PC2',
                     color=label,
                     )
    fig.show()


visualize_latent_space_2D(df_latent, df_annotation)

### 4.2.3. Visualize the latent space in 3D
----------------------------------------------
- Similarly, use the 3 principal components to visualize the latent space in 3D
- Use the plotly `px.scatter_3d` function to draw the plot
- Example: [https://plotly.com/python/3d-scatter-plots/](https://plotly.com/python/3d-scatter-plots/)

In [None]:
#############################################################
###Write your code here to visualize the latent space in 3D##
#############################################################
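def visualize_latent_space_3D(df_latent, df_annotation, label='Attack', benign_frac=10):
    """
    Visualize the latent space in 3D.

    benign_frac is the percentage of benign points to keep.
    """
    df = pd.concat([df_latent, df_annotation], axis=1)
    # Only keep benign_frac % of the benign data to avoid cluttering the plot
    is_benign = df['Attack'] == 'Benign'
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    df = pd.concat([df[is_benign].sample(frac=benign_frac / 100), df[~is_benign]])
    fig = px.scatter_3d(df, x='PC1',
                        y='PC2',
                        z='PC3',
                        color=label,
                        )
    fig.show()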

visualize_latent_space_3D(df_latent, df_annotation)

# 5. Clustering

-------------------------------------
- Now let's use the KMeans algorithm to cluster the latent space
- We will use the KMeans class from the sklearn library

In [None]:
from sklearn.cluster import KMeans

## 5.1 KMeans Clustering with a fixed number of clusters
-------------------------------------
- We will first use a fixed number of clusters to perform the clustering
- We perform the clustering, then update the dataframe with the cluster label
- We will then visualize the clusters in 2D and 3D by coloring the points according to the cluster label

In [None]:
# TODO : Define the number of clusters below
N_CLUSTERS =  2

In [None]:
##########################################################
###Write your code here to perform KMeans clustering######
## TIPS : Don't use the PCA components but the full
## dimension of the latent space
##########################################################
def kmeans_clustering(latent_vectors, n_clusters) -> np.ndarray:
    """
    Perform KMeans clustering on the latent space and return one cluster label per point
    """
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    # Fit on the array passed as argument rather than on a global variable
    kmeans.fit(latent_vectors)
    return kmeans.labels_

In [None]:
##########################################################
###Update the dataframe with the cluster label.############
##Recommendation : Add a new column 'Cluster' to the dataframe
##########################################################
# Cluster in the full 8-dimensional latent space, not on the PCA projection
df_annotation['Cluster'] = kmeans_clustering(latent_vectors, N_CLUSTERS)
# df_annotation['Cluster'] = df_annotation['Cluster'].astype(str)

### Visualize the clusters (In 2D and 3D)
----------------------------------------------
- This is similar to steps 4.2.2 and 4.2.3, but this time we color the points according to the cluster label

In [None]:
#############################################################
###Write your code here to visualize the clusters in 2D and 3D##
## Recommendation : Do not copy code but refactor to create a
## visualization function that takes the dataframe and the
## columns to visualize as arguments
#############################################################

In [None]:
#2D visualization
visualize_latent_space_2D(df_latent, df_annotation, label='Cluster')

In [None]:
#3D visualization
visualize_latent_space_3D(df_latent, df_annotation, label='Cluster')


## 5.2 KMeans Clustering with an optimal number of clusters
-------------------------------------
- We will use the elbow method to find the optimal number of clusters
- We run the KMeans algorithm for a range of cluster counts and plot the inertia obtained for each
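For reference, inertia is the sum of squared distances from each point to its nearest cluster centre. The sketch below computes it by hand on a tiny made-up dataset. The function name `toy_inertia_of` and the arrays are illustrative assumptions; the centroids are chosen by hand rather than fitted by KMeans.

```python
import numpy as np


def toy_inertia_of(points: np.ndarray, centroids: np.ndarray) -> float:
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids)
    squared_distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes the distance to its closest centroid only
    return float(squared_distances.min(axis=1).sum())


# Two well-separated pairs of points
toy_data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_centroid = toy_data.mean(axis=0, keepdims=True)      # a single centre at (5, 1)
two_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])      # one centre per pair

print(toy_inertia_of(toy_data, one_centroid))   # 104.0
print(toy_inertia_of(toy_data, two_centroids))  # 4.0
```

Inertia always decreases as clusters are added, which is why the elbow method looks for the point where the decrease slows down rather than for the minimum.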

In [None]:
NUM_CLUSTERS =  list(range(1, 16))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,15]

In [None]:
num_clusters_and_inertia = dict() # Store the number of clusters and the inertia
##########################################################
###Write your code here to perform KMeans clustering######
## A loop that will iterate over the number of clusters
##########################################################
for n_clusters in NUM_CLUSTERS:
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans.fit(latent_vectors)
    num_clusters_and_inertia[n_clusters] = kmeans.inertia_


#### Inertia plot

In [None]:
#############################################################
###Write your code here to plot the inertia vs the number of clusters##
## Use matplotlib or plotly
#############################################################
import matplotlib.pyplot as plt
plt.plot(list(num_clusters_and_inertia.keys()), list(num_clusters_and_inertia.values()))
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

In [None]:
## TODO : Guess the optimal number of clusters visually by looking at the elbow

#### Elbow Method (programmatically)
-------------------------------------
The following code will help you to find the optimal number of clusters programmatically

In [None]:
def find_optimal_number_of_clusters(num_clusters_and_inertia):
    """
    Find the optimal number of clusters using the elbow method.

    Assumes the dictionary keys are consecutive cluster counts in increasing order.
    """
    cluster_counts = list(num_clusters_and_inertia.keys())
    # Compute the first discrete difference of the inertia
    first_derivative = np.diff(list(num_clusters_and_inertia.values()), 1)
    # Compute the second discrete difference of the inertia
    second_derivative = np.diff(first_derivative, 1)
    # second_derivative[j] is centred on cluster_counts[j + 1]; the elbow is where it is largest
    optimal_number_of_clusters = cluster_counts[int(np.argmax(second_derivative)) + 1]
    return optimal_number_of_clusters

optimal_number_of_clusters = find_optimal_number_of_clusters(num_clusters_and_inertia)
print("Optimal number of clusters : ", optimal_number_of_clusters)
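To see how the second-difference heuristic behaves, here is a small worked example. The inertia values are invented for the illustration and do not come from the netflow data.

```python
import numpy as np

# Hypothetical inertia values for k = 1..5
toy_inertia = {1: 100.0, 2: 40.0, 3: 20.0, 4: 15.0, 5: 12.0}

toy_ks = np.array(list(toy_inertia.keys()))
toy_values = np.array(list(toy_inertia.values()))

toy_first_difference = np.diff(toy_values)             # [-60., -20.,  -5.,  -3.]
toy_second_difference = np.diff(toy_first_difference)  # [ 40.,  15.,   2.]

# toy_second_difference[j] is centred on toy_ks[j + 1], hence the offset of one position
toy_elbow = int(toy_ks[np.argmax(toy_second_difference) + 1])
print(toy_elbow)  # 2
```

The largest second difference marks the cluster count after which the inertia curve flattens the most, here k = 2.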

## 5.3 KMeans clustering and True Labels
-------------------------------------
- Note that the true labels are available in the dataframe
  - Binary labels in the column 'Label'
  - Multiclass labels in the column 'Attack'
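In addition to visual inspection, a contingency table can give a quick numerical view of how clusters line up with the true labels. The sketch below uses a tiny hand-made dataframe. The attack names other than 'Benign' and the cluster assignments are hypothetical, and the per-cluster "purity" ratio is one possible summary among others.

```python
import pandas as pd

# Hypothetical toy annotations; in the notebook these columns would come from df_annotation
toy_annotation = pd.DataFrame({
    "Attack": ["Benign", "Benign", "Benign", "DoS", "DoS", "Exploits"],
    "Cluster": [0, 0, 1, 1, 1, 1],
})

# Rows: cluster ids, columns: true classes, cells: number of flows
toy_contingency = pd.crosstab(toy_annotation["Cluster"], toy_annotation["Attack"])
print(toy_contingency)

# Share of each cluster taken by its most frequent true class
toy_purity = toy_contingency.max(axis=1) / toy_contingency.sum(axis=1)
print(toy_purity)
```

A cluster whose purity is close to 1 is dominated by a single class, while a low value means the cluster mixes several classes.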

In [47]:
#############################################################
###Write your code here to visualize the clusters with the true labels(2D and 3D)##
### TODO : Color the points according to the true labels
##  Compare the clusters with the true labels (Visual inspection)
#############################################################

In [48]:
#Question : Are the clusters consistent with the true labels?

# 6. Clustering Characterization
-------------------------------------
- In this last section, we characterize clusters by analysing the similarity between the points in the same cluster

 *Global Question* : What are the features common to the points in the same cluster?

- Guidelines:
  - Enrich the dataframe with other columns (features) that were used to train the autoencoder
  - Choose the features that you think are relevant to characterize the clusters
  - In the plotly scatter plot, use the argument `hover_data` to display the columns that you have chosen
  - Inspect the plot and infer the common properties of the points in the same cluster
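As a complement to hovering over points, per-cluster summary statistics can hint at what the members of a cluster have in common. The sketch below is only an illustration on made-up rows. The 'PROTOCOL' column and all values are hypothetical. Note also that `df_annotation` was converted to strings earlier in this notebook, so numeric columns would need to be cast back (for example with `pd.to_numeric`) before computing a median.

```python
import pandas as pd

# Hypothetical flows with a cluster label and two candidate features
toy_flows = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1],
    "MAX_TTL": [32, 32, 31, 254, 254, 255],
    "PROTOCOL": ["tcp", "tcp", "udp", "tcp", "tcp", "tcp"],  # hypothetical column
})

# One row per cluster: size, typical numeric value, most frequent category
toy_summary = toy_flows.groupby("Cluster").agg(
    n_flows=("MAX_TTL", "size"),
    median_max_ttl=("MAX_TTL", "median"),
    top_protocol=("PROTOCOL", lambda s: s.mode().iloc[0]),
)
print(toy_summary)
```

Features whose summary differs sharply between clusters are good candidates to pass to `hover_data`.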
