# Basic Approach for Trace Clustering


Requirements:
- implement the basic approach (one-hot encoding + k-means)
- run hyperparameter experiments (grid search)
- interpret the clusters (does a cluster correspond to one municipality? etc.)
- deliverable: a notebook that clearly presents the clustering results for different hyperparameters

In [6]:
from practical.ProcessMining.group1.shared import utils
from joblib import Parallel, delayed
import numpy as np

from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Load the cleansed event log from the shared samples directory
BASE = utils.SAMPLES_PATH
real_path = BASE / "DomesticDeclarations_cleansed.csv"
event_log = utils.import_csv(real_path)

In [7]:
# One-hot encode the event log
one_hot_encoder = OneHotEncoder(sparse_output=False)
event_log_encoded = one_hot_encoder.fit_transform(event_log[["activity"]])

# Sample a subset of the data for faster computation
sample_indices = np.random.choice(event_log_encoded.shape[0], size=1000, replace=False)
event_log_sampled = event_log_encoded[sample_indices]

# Apply KMeans clustering with hyperparameter tuning
param_grid = {
    'n_clusters': [3, 4, 5, 6, 7, 8, 9, 10],
    'init': ['k-means++', 'random'],
    'n_init': [10, 15, 20],
    'max_iter': [300, 400, 500]
}

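def evaluate_model(params, log):
    """Fit KMeans with the given parameters and return (silhouette score, params)."""
    k_means = KMeans(**params, random_state=42)
    cluster_labels = k_means.fit_predict(log)
    score = silhouette_score(log, cluster_labels)
    return score, params

# Evaluate all parameter combinations in parallel on the sampled log
results = Parallel(n_jobs=-1)(
    delayed(evaluate_model)(params, event_log_sampled)
    for params in ParameterGrid(param_grid)
)
# Keep the combination with the highest silhouette score
best_score, best_params = max(results, key=lambda x: x[0])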

print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")

Best Score: 0.9855
Best Params: {'init': 'k-means++', 'max_iter': 300, 'n_clusters': 10, 'n_init': 10}


Improvement steps:
1. Computing the silhouette score on the full event log for every parameter combination was too time-consuming.
2. To speed this up, the grid search runs on a random sample of 1000 rows of the event log, and the parameter combinations are evaluated in parallel.
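
The sampling step in point 2 can be made repeatable by drawing the row indices from a seeded generator. The following is a minimal sketch, not the notebook's own code: the array `encoded` is a synthetic stand-in for the one-hot encoded log, and the helper name `sample_rows` and the seed 42 are assumptions chosen for illustration.

```python
import numpy as np


def sample_rows(encoded, size, seed=42):
    """Return `size` rows of `encoded`, drawn without replacement.

    The seed is an assumption for illustration; any fixed value
    makes the sample repeatable across runs.
    """
    rng = np.random.default_rng(seed)
    # Never ask for more rows than the array contains
    size = min(size, encoded.shape[0])
    indices = rng.choice(encoded.shape[0], size=size, replace=False)
    return encoded[indices]


# Synthetic stand-in for the one-hot encoded event log (5000 events, 4 activities)
encoded = np.eye(4)[np.arange(5000) % 4]

sample_a = sample_rows(encoded, size=1000)
sample_b = sample_rows(encoded, size=1000)
print(sample_a.shape)
print(np.array_equal(sample_a, sample_b))
```

With a fixed seed, repeated runs of the grid search see the same sample, so differences in the best score come from the parameters rather than from the draw.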

Best score range:
- 0.98 - 0.99

Best params (consistently):
- 'init': 'k-means++'
- 'max_iter': 300
- 'n_clusters': 10

Best params (in most runs):
- 'n_init': 10

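Explanation:
- n_clusters: each row is the one-hot vector of a single activity, so the data contains only as many distinct points as there are unique activities. The silhouette score measures how well each point fits its own cluster compared with the nearest other cluster, so it keeps improving as the number of clusters approaches the number of unique activities. This is an overfitting problem: more clusters give a better score as long as n_clusters <= number of unique activities in the event log.
- max_iter: k-means probably converges within 300 iterations on the sample, so a higher limit changes nothing.
- init: as expected, k-means++ performs better than random initialization.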
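
The n_clusters effect described above can be reproduced on synthetic data. This is a sketch under stated assumptions, not a result from the event log: it uses 4 hypothetical activities with 50 events each, builds the one-hot rows directly from an identity matrix, and compares silhouette scores for k = 2, 3, 4.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 4 hypothetical activities, 50 events each,
# one-hot encoded as rows of the 4x4 identity matrix
n_activities = 4
codes = np.repeat(np.arange(n_activities), 50)
encoded = np.eye(n_activities)[codes]

scores = {}
for k in range(2, n_activities + 1):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(encoded)
    scores[k] = silhouette_score(encoded, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

Once k equals the number of distinct activities, every cluster contains only identical points, which is why the score stops being a useful guide for choosing k on this encoding.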

Next, the best parameters found on the sample are used to cluster the full event log, and the resulting cluster label is added as a feature to each row of the event log.

In [8]:
k_means = KMeans(**best_params, random_state=42)
event_log['cluster'] = k_means.fit_predict(event_log_encoded)

# Split the event log into cluster sublogs
cluster_sublogs = {}
for cluster_label in event_log['cluster'].unique():
    cluster_sublogs[cluster_label] = event_log[event_log['cluster'] == cluster_label]

Finally, split the labeled log by cluster label and save each sublog to a CSV file, so that each one can be handled as a standalone event log.

In [9]:
import os

base_dir = "base_approach"
os.makedirs(base_dir, exist_ok=True)

for cluster_label, df in cluster_sublogs.items():
    file_path = os.path.join(base_dir, f"cluster_{cluster_label}.csv")
    df.to_csv(file_path, index=False)

print("Clusters exported to /base_approach")

Clusters exported to /base_approach
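
One way to approach the interpretation requirement is to count how many events of each activity fall into each cluster. The sketch below is illustrative only: it uses the column names `activity` and `cluster` from the cells above, but the miniature log `labeled_log`, its activity names, and its cluster labels are made up and are not results from the real data.

```python
import pandas as pd

# Hypothetical miniature of the labeled event log; the activity names
# and cluster labels below are made up for illustration
labeled_log = pd.DataFrame({
    "activity": ["submit", "approve", "pay", "submit", "approve", "pay"],
    "cluster": [0, 1, 2, 0, 1, 2],
})

# Rows: cluster labels, columns: activities, cells: number of events
composition = pd.crosstab(labeled_log["cluster"], labeled_log["activity"])
print(composition)

# A cluster is "pure" if all of its events share a single activity
is_pure = (composition > 0).sum(axis=1) == 1
print(is_pure)
```

Applied to the real log, the same table would show whether the clusters simply mirror individual activities, which is what the near-perfect silhouette score suggests.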
