# **Fundamentals of Artificial Intelligence**

## MSc in Applied Artificial Intelligence 2025/2026 <br>
## Group 02 - Project 2 - K-Means
| Name                              | Student Number |
|-----------------------------------|------------:|
| Adelino Daniel da Rocha Vilaça    | a16939          |
| António Jorge Magalhães da Rocha  | a26052          |

---
# 0. - **INTRODUCTION**

## 0.1 - Goal
> This project aims to explore and demonstrate the application of K-Means clustering, an unsupervised machine learning algorithm, on a cybersecurity intrusion detection dataset. We will compare two distinct approaches: one where the `attack_detected` attribute is included as a feature during clustering, and another where it is explicitly excluded. The primary objective is to understand how feature selection, particularly the inclusion or exclusion of the target variable, impacts the ability of K-Means to identify meaningful patterns and groupings related to network intrusions.

## 0.2 - Environment
> This notebook is developed and executed within the Google Colaboratory (Colab) environment, a free cloud-based Jupyter Notebook service. Colab provides access to computational resources, including GPUs, and pre-installed libraries, making it an ideal platform for machine learning experimentation without local setup requirements. The development leverages Python 3 and standard data science libraries such as Pandas for data manipulation, NumPy for numerical operations, Matplotlib and Seaborn for data visualization, and Scikit-learn for machine learning algorithms, particularly K-Means clustering and data preprocessing tools like `StandardScaler`.

## 0.3 - Definitions

> * **K-Means Clustering**: An unsupervised machine learning algorithm that partitions `n` observations into `k` clusters, where each observation belongs to the cluster with the nearest mean (centroid), serving as a prototype of the cluster.
> * **Inertia (Elbow Method)**: A metric used in K-Means to measure the within-cluster sum of squares. It decreases as `k` increases. The 'elbow' point in the plot of inertia vs. `k` suggests an optimal `k`.
> * **Silhouette Score**: A metric to evaluate the quality of clusters created by clustering algorithms. It measures how similar an object is to its own cluster compared to other clusters. Scores range from -1 (poor clustering) to +1 (dense, well-separated clusters).
> * **StandardScaler**: A preprocessing technique that standardizes features by removing the mean and scaling to unit variance, ensuring all features contribute equally to distance-based algorithms like K-Means.
> * **One-Hot Encoding**: A technique to convert categorical variables into a numerical format that machine learning algorithms can understand. Each category is transformed into a new binary column.
> * **`attack_detected`**: The target variable in our cybersecurity dataset, indicating whether a network intrusion (1) or normal activity (0) was detected during a session.
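
As a rough illustration of the inertia and silhouette definitions above, the following sketch fits K-Means on synthetic data for several values of `k`. The blob centers, the variable names (`X_toy`, `toy_results`) and the range of `k` are assumptions made for this example only; none of them come from the project dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 300 points around 3 well-separated centers (not the project dataset)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)
X_toy = StandardScaler().fit_transform(X_toy)

toy_results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    # inertia_ is the within-cluster sum of squared distances to the centroids
    toy_results[k] = (model.inertia_, silhouette_score(X_toy, model.labels_))

for k, (inertia_k, silhouette_k) in toy_results.items():
    print(f"k={k}  inertia={inertia_k:.1f}  silhouette={silhouette_k:.3f}")
```

On data like this, inertia keeps falling as `k` grows, while the silhouette score peaks at the number of groups that were actually generated. This is why the two metrics are read together.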

---
# 1. - **AGENT DESIGN**

# Unsupervised Learning — K-MEANS

This notebook applies K-Means, a well-known unsupervised learning algorithm, to a cybersecurity intrusion detection dataset.

### K-Means clustering
Steps:
* load the cybersecurity intrusion detection dataset
* do some EDA visualizations/analysis
* prepare the dataset
* K hyperparameter tuning
* apply the model


## 1.1 - Platforms

### 1.1.1 - Jupyter Notebook <br>
 > A Jupyter Notebook is an open-source web application that allows creating and sharing documents containing live code, equations, visualizations, and narrative text. It's widely used in data science, machine learning, and scientific computing for interactive development, exploration, and documentation.

### 1.1.2 - Google Colaboratory <br>
  >Free cloud-based service that provides a hosted Jupyter Notebook environment. It allows writing and executing code in a browser for free and without any setup.

## 1.2 Packages and Libraries


### 1.2.1 Packages and Libraries

This notebook utilizes several key Python libraries for data manipulation, machine learning, and visualization:

*   **`pandas`**: For data manipulation and analysis, particularly for handling DataFrames.
*   **`numpy`**: For numerical operations, especially with arrays.
*   **`matplotlib.pyplot`**: For creating static, interactive, and animated visualizations in Python.
*   **`seaborn`**: A data visualization library based on matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics.
*   **`sklearn.cluster.KMeans`**: The K-Means clustering algorithm from scikit-learn.
*   **`sklearn.preprocessing.StandardScaler`**: For standardizing features by removing the mean and scaling to unit variance.
*   **`sklearn.metrics`**: For evaluating model performance, specifically for `silhouette_score`.

These libraries collectively enable the data loading, preprocessing, K-Means clustering, and visual analysis presented in this notebook.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

### Load the data

### Usage of `kaggle.json`

The `kaggle.json` file is the **authentication token** used to interact with the Kaggle API (Kaggle Application Programming Interface). It stores the user's Kaggle credentials (username and API key) in plain JSON, so it must be kept private and readable only by its owner.
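
Before calling the Kaggle CLI, it can help to check that the token file is well formed. The helper below is a minimal sketch, not part of the Kaggle tooling: the function name `check_kaggle_token` and its messages are hypothetical, and it only assumes that the file is a JSON object with `username` and `key` fields.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    token_path = Path(path).expanduser()
    if not token_path.is_file():
        return [f"{token_path} does not exist"]
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return [f"{token_path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]

    problems = [
        f"missing or empty field: {field}"
        for field in ("username", "key")
        if not token.get(field)
    ]
    # Any group/other permission bit means the file is not restricted to its owner
    if os.name == "posix" and stat.S_IMODE(token_path.stat().st_mode) & 0o077:
        problems.append("file is accessible to other users; consider chmod 600")
    return problems


print(check_kaggle_token())
```

An empty list means the file exists, parses, contains both fields and is restricted to its owner.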

In [None]:
from google.colab import files
files.upload()  # kaggle.json

### Token Permissions

In [None]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets list | head

## 1.3 DATASET


### 1.3.1 Used Dataset

> This dataset, named `cybersecurity_intrusion_data.csv`, focuses on **cybersecurity intrusion detection**.

* It contains information about network sessions, such as packet size, protocol type, session duration, and login attempts.
* It includes details about user behavior, such as browser type and failed login attempts.
* The main objective is to classify whether a given session indicates an `attack_detected` (attack detected) or not, with this being the target variable.

In [None]:
!kaggle datasets download -d dnkumars/cybersecurity-intrusion-detection-dataset -p /content/ --unzip

In [None]:
# Read the file
dataset = pd.read_csv('cybersecurity_intrusion_data.csv')
df = dataset
dataset_data = pd.DataFrame(dataset)

print("Shape:", df.shape)
dataset_data.head()

In [None]:
# Prepare the data
# Exclude 'session_id' and 'attack_detected' from the feature set
cyber_feature_names = df.drop(columns=['session_id', 'attack_detected']).columns.tolist()
cyber_data = df[cyber_feature_names]

# Set 'attack_detected' as the label_data
label_data = df['attack_detected']
label_names = list(set(label_data))

print('Features:', cyber_feature_names, '   Classes:', label_names)

### EDA visualizations/analysis

In [None]:
# EDA analysis
print(dataset_data.info())
print(dataset_data.describe())

In [None]:
# Visualizations for Cybersecurity Data
# Selecting key numerical features from the cybersecurity dataset and 'attack_detected' for hue
numerical_features_cyber = ['network_packet_size', 'session_duration', 'ip_reputation_score', 'failed_logins']
v_features_cyber = dataset_data[numerical_features_cyber + ['attack_detected']]

sns.pairplot(v_features_cyber, hue='attack_detected')
plt.suptitle('Pairplot of Selected Numerical Features by Attack Detection', y=1.02) # Add a suptitle for clarity
plt.show()

---
## Clustering WITH `attack_detected` attribute

### Data preparation

In [None]:
# Make a copy of the raw data to work with
df_cyber_temp = dataset_data.copy()

# Remove the 'session_id' column as it's not a feature for clustering
df_cyber_temp = df_cyber_temp.drop(columns=['session_id'])

# Identify categorical columns including 'attack_detected' for one-hot encoding
columns_to_encode = df_cyber_temp.select_dtypes(include='object').columns.tolist()
# Add 'attack_detected' as it's treated as a categorical feature here
if 'attack_detected' not in columns_to_encode and 'attack_detected' in df_cyber_temp.columns:
    columns_to_encode.append('attack_detected')

# One-hot encode the identified categorical features
cyber_data_with_labels_as_features = pd.get_dummies(df_cyber_temp, columns=columns_to_encode, drop_first=False)

# Display the updated DataFrame with one-hot encoded features
print("DataFrame with one-hot encoded labels as features:")
print(cyber_data_with_labels_as_features.head())

In [None]:
# Select relevant features (all columns from the processed dataframe)
# The cyber_data_with_labels_as_features dataframe already contains all desired features, including
# the one-hot encoded 'attack_detected'
cyber_features_with_labels = cyber_data_with_labels_as_features.copy()

# Get all column names to retain after scaling
all_cyber_features_with_labels = cyber_features_with_labels.columns.values.tolist()

# Standardize features
scaler = StandardScaler()
cyber_scaled_with_labels = scaler.fit_transform(cyber_features_with_labels)
cyber_scaled_with_labels = pd.DataFrame(cyber_scaled_with_labels, columns=all_cyber_features_with_labels)

# Display the head of the scaled data
print("Scaled data with 'attack_detected' as features:")
print(cyber_scaled_with_labels.head())
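
The effect of the two preparation steps above (one-hot encoding, then standardization) can be checked on a tiny frame. This is a sketch under assumptions: the four rows and the protocol values are invented for illustration, and only the column names mirror the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-frame: one numeric column, one categorical column and the label
toy = pd.DataFrame({
    "failed_logins": [0, 1, 4, 2],
    "protocol_type": ["TCP", "UDP", "TCP", "ICMP"],
    "attack_detected": [0, 0, 1, 1],
})

# One-hot encode the categorical column and the label, as in the cells above
toy_encoded = pd.get_dummies(
    toy, columns=["protocol_type", "attack_detected"], drop_first=False
)

# Standardize every column to zero mean and unit variance
toy_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy_encoded.astype(float)),
    columns=toy_encoded.columns,
)
print(toy_scaled.round(2))

# The label contributes two columns that carry the same information with opposite sign
label_corr = toy_scaled["attack_detected_0"].corr(toy_scaled["attack_detected_1"])
print(f"Correlation between the two label columns: {label_corr:.2f}")
```

Because `drop_first=False` keeps both label columns, the label is counted twice in every Euclidean distance that K-Means computes, which increases its influence on the clusters.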

### K hyperparameter tuning

In [None]:
# Drop a leftover 'cluster' column, if present, so that cluster IDs from a previous run are not used as a feature
cyber_scaled_with_labels = cyber_scaled_with_labels.drop(columns=['cluster'], errors='ignore')

# Test k values from 2 to 10
inertia = []
silhouette_scores = []
k_values = range(2, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=1234, n_init=10) # Added n_init for modern KMeans
    kmeans.fit(cyber_scaled_with_labels)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(metrics.silhouette_score(cyber_scaled_with_labels, kmeans.labels_))

plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k (Cybersecurity Data)')
plt.show()

### Interpreting K-Means Hyperparameter Tuning Results

#### 1. Elbow Method (Inertia Plot)

*   **Observation**: The inertia plot shows a significant drop from `k=8` to `k=9`, and a smaller but still noticeable drop from `k=4` to `k=5`.
*   **Interpretation**: The 'elbow' in the inertia plot marks the point of diminishing returns, after which adding clusters reduces inertia only marginally. The pronounced drop from `k=8` to `k=9` suggests that the ninth cluster still captures a substantial part of the within-cluster variance, making `k=9` a strong candidate for the number of clusters. The drop from `k=4` to `k=5` is smaller, so the benefit of adding that cluster is less significant.

#### 2. Silhouette Score Plot

*   **Observation**: We also need to consider the Silhouette Score plot, which measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Higher scores indicate better-defined and more separated clusters.
*   **Interpretation**: A peak in the Silhouette Score indicates the `k` value for which clusters are most distinct. We should examine this plot to see which `k` value provides the highest score. If `k=9` (or `k=5`) also shows a relatively high or peak silhouette score, it would further support its choice as the optimal number of clusters.

Considering both plots together helps in making a more informed decision for `optimal_k`. `k=9` appears to be a strong candidate because of the pronounced drop in inertia between `k=8` and `k=9`.
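
Reading the two plots together can also be done numerically. The helper below is a small sketch: `summarize_k_search` is a hypothetical name, and the numbers passed to it are invented for illustration, not results from this notebook. It reports the `k` with the highest silhouette score and the relative inertia reduction at each step.

```python
import numpy as np


def summarize_k_search(k_values, inertia, silhouette_scores):
    """Return the best-silhouette k and the relative inertia drop for each step in k."""
    k_values = list(k_values)
    inertia = np.asarray(inertia, dtype=float)
    silhouette_scores = np.asarray(silhouette_scores, dtype=float)

    # Relative inertia reduction when moving from k_values[i] to k_values[i + 1]
    relative_drop = (inertia[:-1] - inertia[1:]) / inertia[:-1]
    drops = {
        f"{a}->{b}": float(d)
        for a, b, d in zip(k_values[:-1], k_values[1:], relative_drop)
    }
    best_k = k_values[int(np.argmax(silhouette_scores))]
    return best_k, drops


# Hypothetical numbers, for illustration only
best_k, drops = summarize_k_search(
    range(2, 6), [100.0, 60.0, 50.0, 45.0], [0.40, 0.55, 0.50, 0.45]
)
print("Best k by silhouette:", best_k)
print("Relative inertia drops:", drops)
```

In the notebook, the same function could be called with the `k_values`, `inertia` and `silhouette_scores` lists built in the tuning cell.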

In [None]:
# Silhouette
plt.plot(k_values, silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal k')
plt.show()

### Interpreting the K-Means Clustering Results

After applying the K-Means algorithm with `optimal_k = 9`, a new column named `'cluster'` is added to the original `dataset_data` DataFrame. This column indicates which cluster each cybersecurity session has been assigned to by the K-Means model.

*   **`session_id`**: The unique identifier for each session.
*   **Original Features**: All the original features (`network_packet_size`, `protocol_type`, `login_attempts`, etc.) are still present, as they were the input to the clustering process.
*   **`cluster`**: This is the new column, where each row (cybersecurity session) has a value from 0 to 8 identifying the cluster it belongs to. These cluster assignments are based on the similarity of the session's features, as determined by the K-Means algorithm.

### Apply the model

In [None]:
# Choose optimal k based on the plots (example)
optimal_k = 9

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(cyber_scaled_with_labels)

# Add cluster assignment to dataset_data
dataset_data['cluster'] = clusters
dataset_data.head()

### Interpreting the Cluster Composition Results (for optimal_k = 9)

The `cluster_composition` table provides a crucial look into how the K-Means algorithm, with `optimal_k = 9`, has grouped your cybersecurity sessions relative to the actual `attack_detected` labels.

<br>

Here's what each part signifies:

*   **`attack_detected 0`**: Represents network sessions where **no intrusion or attack was detected** (normal activity).
*   **`attack_detected 1`**: Represents network sessions where **an intrusion or attack was detected** (malicious activity).

Let's analyze each cluster:

*   **Cluster 0**: This cluster is mixed, with `267` non-attack sessions and `195` attack sessions. It shows a slight dominance of normal activity but still contains a significant portion of attacks, suggesting these sessions share some characteristics that group them together, regardless of attack status.

*   **Cluster 1**: This cluster is composed entirely of attack sessions, with `0` non-attack sessions and `1520` attack sessions. This indicates that K-Means has isolated a group of sessions that are exclusively malicious activity.

*   **Cluster 2**: This cluster leans towards non-attack sessions, containing `999` instances where **no attack was detected** and `563` detected attacks (about 64% normal traffic). It is far from pure, but normal traffic is the majority.

*   **Cluster 3**: This cluster is quite mixed, with `790` non-attack sessions and `610` attack sessions. Similar to Cluster 0, this suggests a blend of characteristics that prevent a clear separation by K-Means.

*   **Cluster 4**: This cluster is remarkably 'pure' for non-attack sessions, containing `851` instances where **no attack was detected** and `0` detected attacks. This indicates an effective grouping of a segment of normal network traffic.

*   **Cluster 5**: This cluster is almost entirely composed of attack sessions, with `2` non-attack sessions and `815` detected attacks. This is another strong indicator of the algorithm successfully identifying and grouping malicious activities.

*   **Cluster 6**: This cluster is another 'pure' group for non-attack sessions, with `1942` instances where **no attack was detected** and `0` detected attacks. This represents a significant portion of normal behavior.

*   **Cluster 7**: This cluster is a mix, with `298` non-attack sessions and `194` attack sessions. It shows a higher proportion of non-attack sessions, but still includes a notable number of attacks.

*   **Cluster 8**: This cluster shows a higher proportion of attack sessions, with `124` non-attack sessions and `367` attack sessions. While not entirely pure, it leans towards malicious activity.
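
The counts listed above can be condensed into per-cluster shares. The sketch below copies the counts from the cluster descriptions and computes, for each cluster, the attack share and the purity (share of the majority label). The names `composition`, `summary` and `overall_purity` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the cluster descriptions above: column 0 = non-attack, column 1 = attack
composition = pd.DataFrame(
    {
        0: [267, 0, 999, 790, 851, 2, 1942, 298, 124],
        1: [195, 1520, 563, 610, 0, 815, 0, 194, 367],
    },
    index=pd.RangeIndex(9, name="cluster"),
)

totals = composition.sum(axis=1)
summary = pd.DataFrame({
    "total": totals,
    "attack_share": composition[1] / totals,
    # Purity: share of the majority label inside each cluster
    "purity": composition.max(axis=1) / totals,
})

# Size-weighted purity over all clusters
overall_purity = composition.max(axis=1).sum() / totals.sum()

print(summary.round(3))
print(f"Overall purity: {overall_purity:.3f}")
```

A purity close to 1 marks a cluster dominated by one label, while values near 0.5 mark the mixed clusters that say little about attack status.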

<br>

### Relevance of These Results

These results are **relevant**, but they need to be read with care. Because the `attack_detected` attribute was included as a feature (as two standardized one-hot columns), part of the alignment between clusters and labels is built in rather than discovered: the algorithm received the label as one of its inputs. The pure clusters therefore do not show, on their own, that an unsupervised model can separate attacks using session characteristics alone.

*   **Anomaly Detection**: Clusters 1 and 5 are composed almost entirely of attacks. This would be a strong outcome for intrusion detection if a similar separation also appears when `attack_detected` is excluded from the features.
*   **Normal Behavior Profiling**: Similarly, Cluster 4 and Cluster 6 give you strong profiles of 'normal' behavior. Any new session not falling into these clusters, especially if it resembles the attack-heavy clusters, would warrant further investigation.
*   **Further Investigation**: Mixed clusters (e.g., Cluster 0, 2, 3, 7, 8) are also very important. They could point to:
    *   Subtler attack vectors that are harder to distinguish.
    *   New, unknown threats that haven't been clearly defined.
    *   Normal traffic with unusual characteristics that mimic attacks.

Overall, these clustering results show how sessions group together when the label is part of the feature set. They serve as a reference point for the clustering performed without `attack_detected`, which is the setting that reflects how separable the sessions are based on their network and behavioral features alone.
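
Because `attack_detected` was one of the clustering inputs in this section, it is worth seeing how strongly a label column can shape the clusters by itself. The sketch below uses purely synthetic data: three noise features that carry no information about a balanced binary label. The adjusted Rand index (ARI) from scikit-learn measures agreement between clusters and labels. All names and sizes are assumptions for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 3 pure-noise features and a balanced binary label unrelated to them
n_samples = 600
noise = rng.normal(size=(n_samples, 3))
label = np.repeat([0, 1], n_samples // 2)


def fit_and_score(features):
    """Cluster standardized features with k=2 and compare the clusters to the label."""
    scaled = StandardScaler().fit_transform(features)
    assigned = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
    return adjusted_rand_score(label, assigned)


ari_without_label = fit_and_score(noise)
ari_with_label = fit_and_score(np.column_stack([noise, label]))

print(f"ARI without the label as a feature: {ari_without_label:.3f}")
print(f"ARI with the label as a feature:    {ari_with_label:.3f}")
```

Even though the other features are uninformative here, adding the label as a feature is enough for K-Means to reproduce it. Agreement between clusters and `attack_detected` is therefore only meaningful evidence when the label is kept out of the inputs.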

In [None]:
# Analyze the composition of each cluster with respect to 'attack_detected'
# This shows how many instances of attack_detected=0 and attack_detected=1 are in each cluster
cluster_composition = dataset_data.groupby('cluster')['attack_detected'].value_counts().unstack(fill_value=0)

print("Composition of each cluster by 'attack_detected' label:")
print(cluster_composition)

In [None]:
# Count 'attack_detected' labels per cluster and compute per-cluster proportions.
def analyze_clusters(data, clusters):

    # Cross-tabulate the K-Means cluster IDs against the actual 'attack_detected' label
    cluster_attack_counts = (
        data.assign(cluster=clusters)
        .groupby(['cluster', 'attack_detected'])
        .size()
        .unstack(fill_value=0)
    )

    # Index by cluster ID so that every column below aligns on the same rows
    cluster_stats = pd.DataFrame(index=cluster_attack_counts.index)
    cluster_stats['total_sessions_in_cluster'] = cluster_attack_counts.sum(axis=1)

    # One count column and one proportion column per 'attack_detected' value (0 and 1)
    for attack_status in sorted(data['attack_detected'].unique()):
        col_name_count = f'attack_detected_{attack_status}_count'
        col_name_proportion = f'attack_detected_{attack_status}_proportion'

        cluster_stats[col_name_count] = cluster_attack_counts[attack_status]
        cluster_stats[col_name_proportion] = cluster_attack_counts[attack_status] / cluster_stats['total_sessions_in_cluster']

    # Return 'cluster' as a regular column
    return cluster_stats.reset_index()

# Example usage with our cybersecurity dataset
cluster_analysis = analyze_clusters(dataset_data, clusters)
print("Detailed Cluster Analysis:")
cluster_analysis

### Interpreting Cluster Averages (for optimal_k = 9)

This table displays the mean values for key numerical features within each of the nine identified clusters (0 through 8). By examining these averages, we can gain insights into the typical characteristics of sessions belonging to each group.

Each row represents a cluster, and the columns show the average `network_packet_size`, `session_duration`, `ip_reputation_score`, and `failed_logins` for that cluster. This helps us to understand what differentiates the clusters in terms of these numerical attributes, complementing our understanding of their `attack_detected` composition:

* **`network_packet_size`**: The average network packet sizes across clusters are fairly consistent, hovering around 500-515. Cluster 4 shows a slightly higher average, while Cluster 1 has a slightly lower average. This suggests that packet size might not be the primary distinguishing factor between these clusters.

* **`session_duration`**: Similar to packet size, session durations are generally in the range of 770-850. Clusters 0, 1, and 5 show slightly longer session durations on average, with Cluster 5 having the highest. This could indicate certain types of activity (or attacks) that involve more prolonged interactions.

* **`ip_reputation_score`**: This feature shows more variability. Clusters 1 and 5 (which were identified as being rich in `attack_detected=1` sessions) have higher average IP reputation scores (around 0.37). Conversely, clusters 4 and 6 (which were identified as primarily `attack_detected=0` sessions) have lower average IP reputation scores (around 0.29-0.30). This suggests that a higher IP reputation score might be correlated with attack sessions in this dataset.

* **`failed_logins`**: This feature also shows interesting differences. Clusters 1 and 5, which are attack-heavy, exhibit the highest average number of failed logins (around 1.98). This is a strong indicator that failed login attempts are a key characteristic of the attack-related clusters. Clusters 4 and 6, on the other hand, have the lowest average failed logins (around 1.14-1.17), reinforcing their association with normal, non-attack behavior.

### Conclusion from Averages

By combining these numerical insights with the `attack_detected` composition analysis, we can build a richer profile for each cluster. For example:

* **Attack-prone clusters (e.g., 1 and 5)** tend to have higher `ip_reputation_score` and significantly more `failed_logins`.
* **Normal behavior clusters (e.g., 4 and 6)** are characterized by lower `ip_reputation_score` and fewer `failed_logins`.

The relatively stable `network_packet_size` and `session_duration` averages across all clusters suggest that while these are important features, the `ip_reputation_score` and `failed_logins` are more discriminative in distinguishing between the types of activities grouped by K-Means.
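
The comparison of cluster averages can be turned into a simple ranking. The sketch below is one possible heuristic, not a method used elsewhere in this notebook: for each feature it divides the range of the cluster means by the feature's overall standard deviation. The function name and the toy frame are hypothetical.

```python
import pandas as pd


def rank_features_by_cluster_spread(data, cluster_col, feature_cols):
    """Rank features by the range of their cluster means, in units of the overall std."""
    cluster_means = data.groupby(cluster_col)[feature_cols].mean()
    spread = (cluster_means.max() - cluster_means.min()) / data[feature_cols].std()
    return spread.sort_values(ascending=False)


# Hypothetical toy frame: 'failed_logins' differs between clusters, 'network_packet_size' does not
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "failed_logins": [0, 1, 1, 3, 4, 3],
    "network_packet_size": [500, 520, 480, 510, 490, 500],
})

ranking = rank_features_by_cluster_spread(
    toy, "cluster", ["failed_logins", "network_packet_size"]
)
print(ranking)
```

In the notebook, this could be applied to `dataset_data` with `'cluster'` and `numerical_features_cyber`; features with a spread near zero contribute little to telling the clusters apart.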

In [None]:
# Group data by cluster and calculate the mean of key numerical features
cluster_averages = dataset_data.groupby('cluster')[numerical_features_cyber].mean()

# Display the results
print(cluster_averages)

### Visualizing Cluster Assignments (for optimal_k = 9)

This code generates a scatter plot to visualize how the K-Means algorithm has grouped your cybersecurity sessions into the `optimal_k = 9` clusters.

*   **Code Explanation**:
    *   `plt.figure(figsize=(10, 6))`: Sets up the size of the plot for better readability.
    *   `for cluster_id in range(optimal_k)`: This loop iterates through each of the nine clusters (0 through 8) that K-Means identified.
    *   `cluster_data = dataset_data[dataset_data['cluster'] == cluster_id]`: For each iteration, it filters the original `dataset_data` to get only the sessions belonging to the current `cluster_id`.
    *   `plt.scatter(cluster_data['network_packet_size'], cluster_data['session_duration'], label=f'Cluster {cluster_id}')`: This is the core plotting step. It creates a scatter plot using two key numerical features:
        *   **`network_packet_size`**: The size of the network packets, plotted on the x-axis.
        *   **`session_duration`**: The duration of the session, plotted on the y-axis.
        Each cluster is plotted with a different color, and a label is assigned for the legend.
    *   `plt.xlabel`, `plt.ylabel`, `plt.title`, `plt.legend()`, `plt.show()`: These lines add labels to the axes, a title to the plot, display a legend to identify which color corresponds to which cluster, and finally show the plot.

*   **Interpreting the Results**:
    *   This plot allows you to visually inspect the separation and characteristics of your clusters based on `Network Packet Size` and `Session Duration`.
    *   You should observe if the clusters are distinct or if they overlap significantly in this 2D projection. Ideally, well-separated clusters would show clear groupings of points of the same color.
    *   For example, if one cluster (e.g., Cluster 1 or 5, which we identified as being mostly `attack_detected=1`) tends to have very high `session_duration` and large `network_packet_size` compared to other clusters, this visualization will make that pattern evident. This helps you understand the defining characteristics of each cluster and why K-Means grouped the data in a particular way.

    Given that we are now using `optimal_k=9` clusters, the visualization will likely show more granular groupings. It's important to see if the highly pure attack and non-attack clusters (like Cluster 1, 4, 5, and 6 from the `cluster_composition` analysis) form visually distinct groups in this 2D projection, or if their separation relies more on other features not visualized here.
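
When the separation relies on features that are not on the two plotted axes, a projection that uses all features can help. The sketch below swaps in PCA (principal component analysis) for that purpose, which this notebook does not otherwise use. The data is synthetic: two groups that differ only in their third and fourth features, so the first two features alone cannot separate them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical stand-in for a scaled feature matrix: the groups differ only in features 2 and 3
group_a = rng.normal(loc=[0, 0, -3, -3], scale=1.0, size=(100, 4))
group_b = rng.normal(loc=[0, 0, 3, 3], scale=1.0, size=(100, 4))
X = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Project all features onto the two directions of largest variance
coords = PCA(n_components=2).fit_transform(X)

# Compare cluster centroids in the raw (feature 0, feature 1) view and in the PCA view
for cluster_id in np.unique(labels):
    mask = labels == cluster_id
    raw_centroid = X[mask, :2].mean(axis=0).round(2)
    pca_centroid = coords[mask].mean(axis=0).round(2)
    print(f"Cluster {cluster_id}: raw view {raw_centroid}, PCA view {pca_centroid}")
```

The two columns of `coords` can be passed to `plt.scatter` in place of `network_packet_size` and `session_duration`, with the cluster IDs as colors.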

In [None]:
# Visualize cluster assignments for cybersecurity data
plt.figure(figsize=(10, 6))
for cluster_id in range(optimal_k):
    cluster_data = dataset_data[dataset_data['cluster'] == cluster_id]
    plt.scatter(cluster_data['network_packet_size'], cluster_data['session_duration'], label=f'Cluster {cluster_id}')

plt.xlabel('Network Packet Size')
plt.ylabel('Session Duration')
plt.title('Cluster Assignments (Network Packet Size vs. Session Duration)')
plt.legend()
plt.show()

---
## Clustering WITHOUT `attack_detected` attribute

### Read the file and prepare data

In [None]:
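# Make a copy of the raw data to work with, removing 'session_id' and 'attack_detected'.
# Also drop the 'cluster' column added by the previous section, if present: it was
# computed using 'attack_detected' and would leak the label into this clustering.
df_cyber_without_labels = dataset_data.drop(columns=['session_id', 'attack_detected', 'cluster'], errors='ignore')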

# Identify categorical columns for one-hot encoding
columns_to_encode_without_labels = df_cyber_without_labels.select_dtypes(include='object').columns.tolist()

# One-hot encode the identified categorical features
cyber_data_without_labels = pd.get_dummies(df_cyber_without_labels, columns=columns_to_encode_without_labels, drop_first=False)

# Display the updated DataFrame with one-hot encoded features
print("DataFrame with one-hot encoded features, without 'attack_detected':")
print(cyber_data_without_labels.head())

In [None]:
# Select relevant features (all columns from the processed dataframe without labels)
cyber_features_without_labels = cyber_data_without_labels.copy()

# Get all column names to retain after scaling
all_cyber_features_without_labels = cyber_features_without_labels.columns.values.tolist()

# Standardize features
scaler = StandardScaler()
cyber_scaled_without_labels = scaler.fit_transform(cyber_features_without_labels)
cyber_scaled_without_labels = pd.DataFrame(cyber_scaled_without_labels, columns=all_cyber_features_without_labels)

# Display the head of the scaled data
print("Scaled data without 'attack_detected' as features:")
print(cyber_scaled_without_labels.head())

### K hyperparameter tuning

### Interpreting K-Means Hyperparameter Tuning Results (WITHOUT `attack_detected`)

In this section, we are performing K-Means clustering without including the `attack_detected` attribute as a feature. The goal is to see if the algorithm can discover inherent groupings in the data based *only* on the other network and session characteristics. The interpretation of the Elbow Method and Silhouette Score plots remains similar, but the results might differ as K-Means now has to rely purely on the other features to find structure.

#### 1. Elbow Method (Inertia Plot)

*   **Observation**: You will look for a clear 'elbow' in the plot of inertia versus the number of clusters (`k`). This point signifies where adding more clusters no longer substantially decreases the within-cluster sum of squares, suggesting that additional clusters provide diminishing returns in explaining data variance.
*   **Interpretation**: A sharp bend or 'elbow' suggests an optimal `k` where the clusters are reasonably tight and distinct.

#### 2. Silhouette Score Plot

*   **Observation**: This plot shows the `silhouette score` for each `k`. A higher silhouette score indicates better-defined clusters, meaning objects are well-matched to their own cluster and poorly matched to neighboring clusters.
*   **Interpretation**: The `k` value corresponding to the *highest* silhouette score is often considered optimal as it indicates the best overall cluster quality (cohesion and separation).

Considering both plots together helps in making a more informed decision for `optimal_k` for this purely unsupervised clustering scenario. We'll be looking for a balance between reducing inertia and maximizing cluster distinction.

In [None]:
# Test k values from 2 to 10
inertia = []
silhouette_scores = []
k_values = range(2, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=1234, n_init=10) # Added n_init for modern KMeans
    kmeans.fit(cyber_scaled_without_labels)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(metrics.silhouette_score(cyber_scaled_without_labels, kmeans.labels_))

plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k (Cybersecurity Data WITHOUT attack_detected)')
plt.show()

### Apply the model

In [None]:
# Choose optimal k based on the plots (example)
optimal_k = 6

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters_without_labels = kmeans.fit_predict(cyber_scaled_without_labels)

# Add cluster assignment to a copy of the original dataset_data
dataset_data_without_labels_clustering = dataset_data.copy()
dataset_data_without_labels_clustering['cluster_without_labels'] = clusters_without_labels

# Display the head of the dataset with new cluster assignments
print("Dataset with new cluster assignments (without using 'attack_detected' as feature):")
dataset_data_without_labels_clustering.head()

In [None]:
# Counts actual labels per cluster and computes each label's share of the cluster.
def analyze_clusters(data, cluster_col_name, actual_label_col_name):

    # Rows: clusters assigned by K-Means; columns: actual label values; cells: session counts
    cluster_label_counts = data.groupby([cluster_col_name, actual_label_col_name]).size().unstack(fill_value=0)

    # Build the stats on the cluster index so every column aligns by cluster id
    # (the previous version relied on cluster ids matching a default 0..k-1 index)
    cluster_stats = pd.DataFrame(index=cluster_label_counts.index)
    cluster_stats['total_sessions_in_cluster'] = cluster_label_counts.sum(axis=1)

    # One count column and one proportion column per actual label value (0 and 1)
    for label_status in cluster_label_counts.columns:
        col_name_count = f'{actual_label_col_name}_{label_status}_count'
        col_name_proportion = f'{actual_label_col_name}_{label_status}_proportion'

        cluster_stats[col_name_count] = cluster_label_counts[label_status]
        cluster_stats[col_name_proportion] = cluster_label_counts[label_status] / cluster_stats['total_sessions_in_cluster']

    # Expose the cluster id as a regular 'cluster' column
    return cluster_stats.rename_axis('cluster').reset_index()

# Example usage with our cybersecurity dataset (without using 'attack_detected' as a feature for clustering)
cluster_analysis_without_labels = analyze_clusters(dataset_data_without_labels_clustering, 'cluster_without_labels', 'attack_detected')
print("Detailed Cluster Analysis (without 'attack_detected' as feature for clustering):")
cluster_analysis_without_labels
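
The per-cluster proportions above can be complemented by single-number agreement scores. The following is a sketch on a made-up toy table, not the project's results: it assumes scikit-learn's `adjusted_rand_score` and `homogeneity_score` are acceptable summaries, and the toy labels are deliberately arranged so that every cluster is evenly mixed.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score, homogeneity_score

# Toy data (made up): two clusters, each half attack and half non-attack.
toy = pd.DataFrame({
    "cluster_without_labels": [0, 0, 0, 0, 1, 1, 1, 1],
    "attack_detected":        [0, 0, 1, 1, 0, 1, 0, 1],
})

# Share of each actual label inside each cluster (rows sum to 1).
proportions = pd.crosstab(
    toy["cluster_without_labels"], toy["attack_detected"], normalize="index"
)

# Agreement between clusters and labels, independent of how clusters are numbered.
ari = adjusted_rand_score(toy["attack_detected"], toy["cluster_without_labels"])
homogeneity = homogeneity_score(toy["attack_detected"], toy["cluster_without_labels"])

print(proportions)
print("Adjusted Rand index:", round(ari, 3))
print("Homogeneity:", round(homogeneity, 3))
```

Scores near zero (or below, for the adjusted Rand index) indicate clusters that carry no information about the label, while values close to 1 indicate clusters that match it closely.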

In [None]:
# Make a copy of the raw data to work with, removing 'session_id' and 'attack_detected'
df_cyber_without_labels = dataset_data.drop(columns=['session_id', 'attack_detected'])

# Identify categorical columns for one-hot encoding
columns_to_encode_without_labels = df_cyber_without_labels.select_dtypes(include='object').columns.tolist()

# One-hot encode the identified categorical features
cyber_data_without_labels = pd.get_dummies(df_cyber_without_labels, columns=columns_to_encode_without_labels, drop_first=False)

# Display the updated DataFrame with one-hot encoded features
print("DataFrame with one-hot encoded features, without 'attack_detected':")
print(cyber_data_without_labels.head())
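
As an alternative to calling `pd.get_dummies` and scaling in separate steps, the same preprocessing can be bundled into one reusable object. This is a sketch under assumptions: the toy columns and values are made up, and unlike a pipeline that scales every column, it scales only the numeric ones and leaves the one-hot columns as 0/1.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the session table (values are made up).
toy = pd.DataFrame({
    "network_packet_size": [100, 200, 300, 400],
    "session_duration": [10.0, 20.0, 30.0, 40.0],
    "protocol_type": ["TCP", "UDP", "ICMP", "TCP"],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

Because the transformer is fitted once, the same encoding and scaling can later be applied to new sessions with `preprocess.transform(...)`.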

### Interpreting the Visualization for Clusters (WITHOUT `attack_detected`)

This scatter plot shows the cluster assignments obtained when `attack_detected` was *not* used as a feature, with `Network Packet Size` and `Session Duration` as the axes. It lets us visually inspect how the six clusters (`optimal_k = 6`) are distributed.

* **Code Explanation (as a reminder)**:
    * Each point on the plot represents a cybersecurity session.
    * The color of each point indicates the cluster (0 through 5) to which K-Means assigned that session.
    * The x-axis represents the `Network Packet Size`.
    * The y-axis represents the `Session Duration`.

* **Interpreting the Results**:
    * **Visual Overlap**: Expect a significant degree of overlap between the different colored points. K-Means formed the clusters in the full scaled feature space, while this plot shows only `network_packet_size` and `session_duration`, so clusters that differ on other features can still sit on top of each other here.
    * **No Distinct Boundaries**: Unlike the scenario where `attack_detected` was used as a feature, no cluster occupies a clearly bounded region; points from different clusters appear intermingled.
    * **Consistency with Numerical Analysis**: The plot is consistent with the numerical analysis in `cluster_analysis_without_labels`, which showed that the clusters did not effectively separate attack from non-attack sessions. Neither view reveals a grouping that tracks the label.

Taken together, the visualization supports the conclusion that when K-Means operates in a purely unsupervised mode (without the `attack_detected` label), it still groups the sessions, but the groups do not necessarily correspond to a meaningful separation of attacks from non-attacks, at least along these two features.

In [None]:
# Visualize cluster assignments
plt.figure(figsize=(10, 6))
for cluster_id in range(optimal_k):
    cluster_data = dataset_data_without_labels_clustering[dataset_data_without_labels_clustering['cluster_without_labels'] == cluster_id]
    plt.scatter(cluster_data['network_packet_size'], cluster_data['session_duration'], label=f'Cluster {cluster_id}')

plt.xlabel('Network Packet Size')
plt.ylabel('Session Duration')
plt.title('Cluster Assignments (Network Packet Size vs. Session Duration, WITHOUT attack_detected as feature)')
plt.legend()
plt.show()
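
Because the scatter plot above uses only two of the clustered features, a projection of the whole feature space can give a fairer picture of cluster separation. The sketch below uses PCA for that purpose on synthetic data; the blobs, the number of features, and `k = 4` are assumptions for illustration and do not reflect the project's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a scaled, multi-feature session matrix (an assumption).
X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full feature space, exactly as K-Means normally would.
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_scaled)

# Project to two principal components for plotting only.
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_scaled)

explained = float(np.sum(pca.explained_variance_ratio_))
print("Share of variance kept by the 2D projection:", round(explained, 3))
```

The projected coordinates can then be plotted with `plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)`. If the two components keep only a small share of the variance, overlap in the plot should still be interpreted with caution.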

---
# 3. - **CONCLUSION**

## 3.1 - Overall

> This project explored the application of K-Means clustering to a cybersecurity intrusion detection dataset, comparing two approaches: one where the `attack_detected` attribute was included as a feature for clustering, and another where it was explicitly excluded. In the first scenario, K-Means separated the data into highly pure clusters, with one cluster almost exclusively containing non-attack sessions and another almost exclusively containing attack sessions. This result should be read with care: `attack_detected` was itself one of the clustering inputs, so the purity shows that K-Means can recover the label when it is supplied, not that the remaining features separate attacks on their own. When `attack_detected` was excluded, K-Means struggled to form clusters that aligned with the presence or absence of an attack, producing mixed clusters with similar proportions of attack and non-attack sessions. This highlights the inherent challenge of purely unsupervised anomaly detection, where the patterns of interest are not explicitly provided.

## 3.2 - Challenges and solutions

> **Challenges Encountered:**
> 1.  **Ambiguity in Unsupervised Clustering**: The primary challenge was the difficulty in getting K-Means to naturally 'discover' attack patterns when the `attack_detected` label was withheld. Without this direct signal, the algorithm found groupings based on other feature similarities, but these did not strongly correlate with actual attack events.
> 2.  **Data Preprocessing for K-Means**: K-Means is sensitive to feature scaling and categorical data. Handling `object` (categorical) columns through one-hot encoding and ensuring numerical features were standardized were crucial steps.
> 3.  **Optimal K Determination**: Identifying the optimal number of clusters (`k`) proved to be an interpretative task, relying on tools like the Elbow Method and Silhouette Score, which can sometimes be ambiguous.
>
> **Solutions Implemented:**
> 1.  **Comprehensive Preprocessing**: Applied one-hot encoding to all categorical features (`protocol_type`, `encryption_used`, `browser_type`) and `StandardScaler` to all numerical features to ensure all attributes contributed fairly to the clustering process.
> 2.  **Comparative Analysis**: Performed K-Means in two distinct modes (with and without `attack_detected` as a feature) to illustrate the impact of feature selection on clustering outcomes and to gauge inherent data separability.
> 3.  **Hyperparameter Tuning**: Utilized the Elbow Method (inertia plot) and Silhouette Score plot to systematically test and identify a suitable `k` value for both clustering scenarios.

## 3.3 - Looking forward

> Future work could build upon this foundation in several ways:
> 1.  **Supervised Learning Transition**: Since labels are available and clustering without them did not recover the attack/non-attack split, a natural next step would be to transition to supervised learning models (e.g., SVM, Random Forest, Neural Networks) for more accurate and direct attack detection.
> 2.  **Advanced Feature Engineering**: Explore creating new, more complex features from the existing data that might enhance the distinctiveness of attack patterns, especially for the purely unsupervised approach.
> 3.  **Alternative Unsupervised Algorithms**: Experiment with other unsupervised algorithms like DBSCAN (which can find clusters of varying shapes and handle noise) or hierarchical clustering to see if they yield better results in discovering attack patterns without the `attack_detected` label.
> 4.  **Dimensionality Reduction**: Apply techniques like Principal Component Analysis (PCA) before clustering to reduce noise and potentially reveal clearer structures in a lower-dimensional space, which might improve the performance of K-Means when `attack_detected` is excluded.
> 5.  **Investigation of Mixed Clusters**: Delve deeper into the characteristics of Cluster 0 (the mixed cluster from the 'with `attack_detected`' experiment) to understand why these sessions are ambiguous. This could uncover subtle attack vectors or normal traffic with unusual patterns requiring specialized handling.
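
As a pointer for the DBSCAN suggestion above, here is a minimal sketch of how it labels isolated points as noise instead of forcing them into a cluster. The data is synthetic, and `eps=1.0` and `min_samples=5` are assumed values chosen to suit this toy example; they would need tuning on the real, scaled dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data (an assumption): two tight blobs plus one isolated point.
X, _ = make_blobs(
    n_samples=200,
    centers=[(0, 0), (10, 10)],
    cluster_std=0.3,
    random_state=42,
)
X = np.vstack([X, [[50.0, -50.0]]])  # the last row is the isolated point

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# DBSCAN marks noise with the label -1; it is not counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("Clusters found:", n_clusters)
print("Points labeled as noise:", n_noise)
```

Unlike K-Means, DBSCAN does not take the number of clusters as input, so its results depend on `eps` and `min_samples` instead of `k`.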

## 3.4 - In hindsight

> Reflecting on this project, the most significant insight is the critical role of feature selection in K-Means clustering, particularly in the context of unsupervised anomaly detection. K-Means grouped the data cleanly when the target variable was supplied as a feature, but its ability to 'discover' attacks without that information was limited, suggesting that the raw features alone did not form naturally distinct attack/non-attack clusters. This underscores the difference between validating known patterns and genuinely discovering unknown ones. The importance of careful data preprocessing (one-hot encoding and scaling) for K-Means was also reaffirmed, as was the iterative nature of choosing `k`. Understanding both the strengths and limitations of K-Means in different contexts is vital for its effective application in real-world cybersecurity scenarios.

Thank you, Professor Joaquim :)