# Basic Approach for Trace Clustering


Requirements:
- implement the basic approach (one-hot encoding + k-means)
- hyperparameter experiments (grid search)
- interpretation of clusters (e.g. does a cluster correspond to one municipality?)
- deliverable: a notebook that clearly presents the clustering results for different hyperparameters
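
The first requirement can also be read at trace level, with one feature vector per case rather than per event. The following is a minimal sketch under that assumption, using a small hypothetical log with `case_id` and `activity` columns. The toy rows and the names `toy_log`, `trace_features` and `trace_clusters` are illustrative and not taken from the project:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy log; only the column names mirror a typical event log
toy_log = pd.DataFrame({
    "case_id": ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4"],
    "activity": ["submit", "approve", "submit", "approve",
                 "submit", "reject", "submit", "reject"],
})

# One row per case, one binary column per activity (1 = activity occurs in the trace)
trace_features = pd.crosstab(toy_log["case_id"], toy_log["activity"]).clip(upper=1)

# Cluster whole traces instead of single events
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(trace_features)
trace_clusters = pd.Series(labels, index=trace_features.index, name="cluster")
print(trace_clusters)
```

With this encoding, cases that contain the same set of activities end up in the same cluster, which is closer to grouping traces than grouping individual events.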

In [17]:
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import OneHotEncoder

from practical.ProcessMining.group1.shared import utils

BASE = utils.SAMPLES_PATH
real_path = BASE / "DomesticDeclarations_cleansed.csv"
event_log = utils.import_csv(real_path)

In [38]:
# One-hot encode the activity of each event (one row per event, one column per activity)
one_hot_encoder = OneHotEncoder(sparse_output=False)
event_log_encoded = one_hot_encoder.fit_transform(event_log[["activity"]])

# Sample 1000 events without replacement to keep the silhouette computation fast.
# The sample is unseeded, so scores can differ slightly between runs.
sample_indices = np.random.choice(event_log_encoded.shape[0], size=1000, replace=False)
event_log_sampled = event_log_encoded[sample_indices]

# Apply KMeans clustering with hyperparameter tuning
param_grid = {
    'n_clusters': [3, 4, 5, 6, 7, 8, 9, 10],
    'init': ['k-means++', 'random'],
    'n_init': [10, 15, 20],
    'max_iter': [300, 400, 500]
}

def evaluate_model(params, log):
    """Fit KMeans with the given parameters and return (silhouette score, params)."""
    k_means = KMeans(**params, random_state=42)
    cluster_labels = k_means.fit_predict(log)
    score = silhouette_score(log, cluster_labels)
    return score, params


# Evaluate every parameter combination in parallel and keep the best-scoring one
results = Parallel(n_jobs=-1)(
    delayed(evaluate_model)(params, event_log_sampled)
    for params in ParameterGrid(param_grid)
)
best_score, best_params = max(results, key=lambda x: x[0])

print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")

Best Score: 0.9882608695652175
Best Params: {'init': 'k-means++', 'max_iter': 300, 'n_clusters': 10, 'n_init': 10}


Improvement steps:
1. Computing the silhouette score for every parameter combination on the full event log is too time-consuming.
2. To speed this up, the grid search runs on a sample of the event log and is executed in parallel.
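
Another way to reduce the cost named in step 1 is to cluster all points and subsample only inside the score computation. `silhouette_score` accepts `sample_size` and `random_state` for this. The following is a minimal sketch on synthetic data. `make_blobs` and the sizes chosen here are assumptions for illustration, not the notebook's setup:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the encoded event log
X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Score a random subset of 500 points instead of using all pairwise distances
approx_score = silhouette_score(X, labels, sample_size=500, random_state=42)
print(f"Approximate silhouette score: {approx_score:.3f}")
```

Fixing `random_state` makes the subsample, and therefore the score, reproducible across runs.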

Best score range: 
- 0.98 - 0.99

Best params:
- 'init': 'k-means++'
- 'max_iter': 300
- 'n_clusters': 10

Best in most runs:
- 'n_init': 10

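Explanation:
- n_clusters: the silhouette score measures how well each point fits its own cluster compared with the nearest other cluster. Because every event is encoded only by its activity, the data contain at most as many distinct points as there are unique activities, so the score keeps improving as n_clusters approaches that number. This is an overfitting problem: more clusters yield a better score as long as n_clusters <= number of unique activities in the event log.
- max_iter: for the sample size used, k-means has probably already converged within 300 iterations, so higher limits change nothing.
- init: as expected, k-means++ performs better than random initialization.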

In [39]:
# Refit KMeans with the best parameters on the full encoded log and label every event
k_means = KMeans(**best_params, random_state=42)
event_log['cluster'] = k_means.fit_predict(event_log_encoded)

# Split the event log into cluster sublogs
cluster_sublogs = {}
for cluster_label in event_log['cluster'].unique():
    cluster_sublogs[cluster_label] = event_log[event_log['cluster'] == cluster_label]


{1:                   case_id                 timestamp  \
0      declaration 100000 2018-01-30 08:20:07+00:00   
5      declaration 100005 2018-01-30 08:38:54+00:00   
10     declaration 100010 2018-01-30 08:43:21+00:00   
15     declaration 100015 2018-01-30 12:06:43+00:00   
21     declaration 100021 2018-01-31 10:05:55+00:00   
...                   ...                       ...   
56410   declaration 99973 2018-01-29 10:44:00+00:00   
56415   declaration 99978 2018-01-29 11:31:59+00:00   
56420   declaration 99983 2018-01-29 11:43:52+00:00   
56426   declaration 99989 2018-01-29 18:35:35+00:00   
56432   declaration 99995 2018-01-29 19:58:58+00:00   

                                activity   case:concept:name  \
0      Declaration SUBMITTED by EMPLOYEE  declaration 100000   
5      Declaration SUBMITTED by EMPLOYEE  declaration 100005   
10     Declaration SUBMITTED by EMPLOYEE  declaration 100010   
15     Declaration SUBMITTED by EMPLOYEE  declaration 100015   
21     Declarat

In [41]:
import os

base_dir = "base_approach"
os.makedirs(base_dir, exist_ok=True)

for cluster_label, df in cluster_sublogs.items():
    file_path = os.path.join(base_dir, f"cluster_{cluster_label}.csv")
    
    df.to_csv(file_path, index=False)

    print(f"Cluster {cluster_label} exported to {file_path}")

Cluster 1 exported to base_approach/cluster_1.csv
Cluster 5 exported to base_approach/cluster_5.csv
Cluster 2 exported to base_approach/cluster_2.csv
Cluster 4 exported to base_approach/cluster_4.csv
Cluster 3 exported to base_approach/cluster_3.csv
Cluster 0 exported to base_approach/cluster_0.csv
Cluster 7 exported to base_approach/cluster_7.csv
Cluster 6 exported to base_approach/cluster_6.csv
Cluster 8 exported to base_approach/cluster_8.csv
Cluster 9 exported to base_approach/cluster_9.csv
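
For the interpretation requirement, a cross-tabulation of cluster labels against activities shows which activities each cluster contains. The following is a minimal sketch on a hypothetical labelled log. The rows are made up, and only the `cluster` and `activity` column names follow the notebook:

```python
import pandas as pd

# Hypothetical labelled events standing in for event_log after clustering
labelled = pd.DataFrame({
    "activity": ["submit", "submit", "approve", "approve", "reject", "submit"],
    "cluster": [0, 0, 1, 1, 2, 0],
})

# Rows: clusters, columns: activities, cells: number of events
activity_by_cluster = pd.crosstab(labelled["cluster"], labelled["activity"])
print(activity_by_cluster)

# Most frequent activity in each cluster
dominant = activity_by_cluster.idxmax(axis=1)
print(dominant)
```

Because each event is encoded by its activity alone, clusters on the real log are likely to align closely with single activities, which this table would make visible.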
