**Data mining Project - University of Pisa, academic year 2023/24**
 
**Authors**: Giacomo Aru, Giulia Ghisolfi, Luca Marini, Irene Testa

# Clustering comparison

In this notebook, we compare the clustering results of the different methods. Since DBSCAN and hierarchical clustering were applied only to the incidents that happened in Illinois, we restrict the comparison to the incidents in this state. Recall that Illinois was chosen because it had few null values and the distributions of its variables were similar to those of the whole dataset.

We import the libraries:

In [1]:
import pandas as pd
from clustering_utils import *

We define the paths to the saved clustering results:

In [2]:
PATH = '../data/clustering_labels/'
clustering_name = ['KMeans', 'KMeansPCA', 'DBSCAN', 'Hierarchical']
labels_files = [PATH+'4-Means_clusters.csv', PATH+'4-Means_PCA_clusters.csv', PATH+'DBSCAN_clusters.csv', PATH+'hierarchical_clusters.csv']
external_scores_files = [PATH+'4-Means_external_scores.csv', PATH+'4-Means_PCA_external_scores.csv', PATH+'DBSCAN_external_scores.csv', PATH+'hierarchical_external_scores.csv']
internal_scores_files = [PATH+'4-Means_internal_scores.csv', PATH+'4-Means_PCA_internal_scores.csv', PATH+'DBSCAN_internal_scores.csv', PATH+'hierarchical_internal_scores.csv']

We concatenate the clustering results into a single dataframe:

In [3]:
clusters_df = pd.DataFrame(index=range(239379))
for name, labels_file in zip(clustering_name, labels_files):
    # one column of labels per method, e.g. 'clusterKMeans'
    clusters_curr_df = pd.read_csv(labels_file, index_col=0)
    clusters_curr_df = clusters_curr_df.rename(columns={'cluster':'cluster'+name})
    clusters_df = clusters_df.join(clusters_curr_df)
# keep only the incidents labelled by all the methods
clusters_df.dropna(inplace=True)
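
The join followed by `dropna` is what restricts the comparison to the incidents clustered by every method. The sketch below reproduces the mechanism on made-up labels; the toy frames and their values are assumptions for illustration only and do not come from the saved files.

```python
import pandas as pd

# Toy stand-ins for two saved label files (made-up values).
kmeans_labels = pd.DataFrame({'clusterKMeans': [0, 1, 2, 3, 0]}, index=[0, 1, 2, 3, 4])
# This method was run only on a subset of the rows.
dbscan_labels = pd.DataFrame({'clusterDBSCAN': [1, 2, 2]}, index=[1, 3, 4])

toy_df = pd.DataFrame(index=range(5))
for labels in (kmeans_labels, dbscan_labels):
    # Left join on the index: rows missing from `labels` get NaN.
    toy_df = toy_df.join(labels)

# Keep only the rows labelled by every method.
toy_df = toy_df.dropna()
# The NaNs upcast the labels to float; convert them back to integers.
toy_df = toy_df.astype(int)
print(toy_df)
```

Only rows 1, 3 and 4 survive, because they are the only ones with a label from both toy methods.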

We visualize the clustering results using a Sankey diagram:

In [4]:
sankey_plot(
    [clusters_df['clusterDBSCAN'], clusters_df['clusterKMeans'], clusters_df['clusterKMeansPCA'], clusters_df['clusterHierarchical']],
    labels_titles=['DBSCAN', 'KMeans', 'KMeansPCA', 'Hierarchical'],
    title='Clusterings comparison'
)

The clusters found by KMeans applied to the indicators and the clusters found by KMeansPCA applied to the first principal components of the indicators are very similar. Cluster 2 of DBSCAN groups almost all the points from clusters 0, 1 and 3 of KMeans, while cluster 1 of DBSCAN groups almost all the points from cluster 2 of KMeans. Cluster 2 of KMeansPCA groups points belonging mainly to clusters 1, 2 and 6 of the hierarchical clustering. There is a high overlap between clusters 0 and 3 of KMeansPCA and clusters 0 and 3 of the hierarchical clustering. Cluster 1 of KMeansPCA groups all the points in cluster 5 of the hierarchical clustering.

From this analysis, we can conclude that, despite the differences between the methods, the clusterings they find largely agree.
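The overlaps read off the Sankey diagram can also be quantified with a row-normalized contingency table. The following is a minimal sketch on made-up labels; the column names `clusterA` and `clusterB` are hypothetical, and the same call could be applied to any two columns of `clusters_df`.

```python
import pandas as pd

# Toy labels for two clusterings of the same eight points (made-up values).
toy = pd.DataFrame({
    'clusterA': [0, 0, 0, 1, 1, 2, 2, 2],
    'clusterB': [1, 1, 1, 1, 0, 0, 0, 0],
})

# Rows: clusters of A; columns: clusters of B.
# normalize='index' makes every row sum to 1, so each cell is the fraction
# of a cluster of A that falls into a cluster of B.
overlap = pd.crosstab(toy['clusterA'], toy['clusterB'], normalize='index')
print(overlap)
```

A row with a single value close to 1 means that the cluster of the first method is almost entirely contained in one cluster of the second method.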

Now we compare the internal scores of KMeans and KMeansPCA:

In [5]:
internal_scores_df = pd.DataFrame()
for name, internal_scores_file in zip(clustering_name[:2], internal_scores_files[:2]):
    internal_scores_curr_df = pd.read_csv(internal_scores_file, index_col=0).T
    internal_scores_df = pd.concat([internal_scores_df, internal_scores_curr_df])
internal_scores_df.rename(columns={'0':'silhouette_score'}, inplace=True)
internal_scores_df.drop(columns=['model'], inplace=True)
internal_scores_df

,BSS,SSE,calinski_harabasz_score,davies_bouldin_score,n_iter,silhouette_score
4means,57863.91870811196,51878.79845330116,49004.18671712359,1.2129926119852776,6,0.3268032005092862
4means PCA,19658.06936663372,18305.34787899599,47182.6219155728,1.203178019586672,5,0.3235205972416301


BSS and SSE are not comparable because the two algorithms were run on different feature spaces. The other scores are comparable: KMeans obtains the best Calinski-Harabasz and silhouette scores (higher is better for both), while KMeansPCA obtains the best Davies-Bouldin score (lower is better). In all three cases the difference between the two models is small.
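Which model is "best" depends on the direction of each score. The sketch below makes this explicit using the values from the table above, rounded; the `higher_is_better` dictionary is a helper introduced here for illustration and is not part of `clustering_utils`.

```python
import pandas as pd

# Values copied (rounded) from the internal scores table.
scores = pd.DataFrame(
    {
        'calinski_harabasz_score': [49004.19, 47182.62],
        'davies_bouldin_score': [1.2130, 1.2032],
        'silhouette_score': [0.3268, 0.3235],
    },
    index=['4means', '4means PCA'],
)

# True if larger values indicate a better clustering.
higher_is_better = {
    'calinski_harabasz_score': True,
    'davies_bouldin_score': False,
    'silhouette_score': True,
}

# Pick the winning model for each score according to its direction.
best = {
    metric: scores[metric].idxmax() if higher else scores[metric].idxmin()
    for metric, higher in higher_is_better.items()
}
# Relative gap between the two models for each score.
relative_gap = (scores.max() - scores.min()) / scores.max()

print(best)
print(relative_gap)
```

The relative gap stays below 4% for every score, so neither model is clearly preferable on internal scores alone.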

Now we compare the silhouette score of all the methods:

In [6]:
silhouette_df = internal_scores_df['silhouette_score'].to_frame()
DBSCAN_silhouette = pd.read_csv(PATH+'DBSCAN_internal_scores.csv', index_col=0)['silhouette_score'].values[0]
hierarchical_silhouette = pd.read_csv(PATH+'hierarchical_internal_scores.csv', index_col=0).T['silhouette_score'].values[0]
pd.concat([silhouette_df, pd.DataFrame({'silhouette_score': [DBSCAN_silhouette, hierarchical_silhouette]}, index=['DBSCAN', 'Hierarchical'])])

,silhouette_score
4means,0.3268032005092862
4means PCA,0.3235205972416301
DBSCAN,0.251352
Hierarchical,0.368106


According to the silhouette score, the best clustering is obtained by hierarchical clustering (0.37) and the worst by DBSCAN (0.25).

Finally, we display the external scores of all the methods:

In [7]:
# print one table of external scores per clustering method
for name, external_score_file in zip(clustering_name, external_scores_files):
    scores_curr_df = pd.read_csv(external_score_file, index_col='feature')
    print(name)
    display(scores_curr_df)

KMeans


feature,adjusted rand score,normalized mutual information,homogeneity,completeness
shots,0.06696,0.178677,0.316218,0.124518
aggression,0.129628,0.171629,0.257839,0.128623
suicide,0.005059,0.041971,0.250608,0.022903
injuries,0.192134,0.249401,0.376695,0.186409
death,0.286169,0.389875,0.622494,0.283816
drugs,-0.003387,0.056488,0.166964,0.033995
illegal_holding,-0.000474,0.040465,0.102789,0.025191
unharmed,0.313347,0.492696,0.843059,0.348051
arrested,0.324981,0.382347,0.576628,0.28599


KMeansPCA


feature,adjusted rand score,normalized mutual information,homogeneity,completeness
shots,0.013098,0.135908,0.235928,0.095445
aggression,0.119408,0.174358,0.257281,0.131859
suicide,0.018048,0.055002,0.320398,0.030083
injuries,0.196766,0.252092,0.373973,0.190128
death,0.167238,0.260939,0.409001,0.191583
drugs,-0.02125,0.051653,0.149292,0.031229
illegal_holding,-0.012607,0.040955,0.101807,0.025633
unharmed,0.339517,0.501209,0.84145,0.356898
arrested,0.453665,0.514149,0.761589,0.388066


DBSCAN


feature,adjusted rand score,normalized mutual information,homogeneity,completeness
shots,-0.075938,0.047912,0.08467,0.033408
aggression,0.364733,0.212332,0.2637,0.177713
suicide,0.039015,0.035117,0.295749,0.018667
injuries,0.352241,0.234914,0.271139,0.207227
death,0.837984,0.772892,0.985145,0.635888
drugs,-0.007725,0.003725,0.016119,0.002106
illegal_holding,-0.038983,0.008984,0.020715,0.005736
unharmed,-0.029378,0.009844,0.011332,0.008701
arrested,0.000346,0.026035,0.035774,0.020464


Hierarchical


feature,adjusted rand score,normalized mutual information,homogeneity,completeness
shots,0.045257,0.103206,0.326136,0.061303
aggression,0.141147,0.194396,0.399959,0.128402
suicide,0.005569,0.017797,0.304827,0.009166
injuries,0.197868,0.294542,0.551776,0.200889
death,0.112758,0.218996,0.4656,0.143167
drugs,0.006032,0.020277,0.173055,0.010769
illegal_holding,0.001334,0.024685,0.105911,0.01397
unharmed,0.235298,0.342733,0.640033,0.234026
arrested,0.182932,0.351128,0.819837,0.223405


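Regarding the external scores:
- the class 'death' is clustered best by DBSCAN, which obtains the highest value of every score for this class
- KMeans has scores similar to KMeansPCA for most classes; KMeansPCA is better at identifying incidents of the class 'arrested', for which it has the highest adjusted Rand score and normalized mutual information of all the methods, while KMeans is better on the class 'death'
- Hierarchical clustering has the highest homogeneity for the classes 'arrested', 'aggression' and 'injuries', and the highest normalized mutual information for 'injuries'; its adjusted Rand scores for these classes are, however, lower than those of KMeansPCA ('arrested') and DBSCAN ('aggression' and 'injuries')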
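
One caveat when reading these tables is that homogeneity does not penalize splitting a class across many clusters, so a clustering with more clusters can score higher on it without matching the class any better. The sketch below illustrates this with scikit-learn on made-up labels; the label vectors are assumptions for illustration, not data from the project.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
)

# Made-up binary class and two toy clusterings of the same eight points.
true_class = [0, 0, 0, 0, 1, 1, 1, 1]
two_clusters = [0, 0, 0, 1, 1, 1, 1, 1]    # one point is misplaced
eight_clusters = [0, 1, 2, 3, 4, 5, 6, 7]  # every point is its own cluster

results = {}
for name, labels in [('two', two_clusters), ('eight', eight_clusters)]:
    results[name] = {
        'homogeneity': homogeneity_score(true_class, labels),
        'completeness': completeness_score(true_class, labels),
        'adjusted_rand': adjusted_rand_score(true_class, labels),
    }
    print(name, results[name])
```

The eight singleton clusters are perfectly homogeneous but have low completeness and an adjusted Rand score of zero, whereas the two-cluster solution, which is much closer to the class, has lower homogeneity. For this reason it can help to read homogeneity together with completeness and the adjusted Rand score.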