# Homework: Cluster validation indices

| Student Name         | Student-ID |
|----------------------|------------|
| Marco Di Francesco   | 100632815  |
| Loreto García Tejada | 100643862  |
| György Bence Józsa   | 100633270  |
| József-Hunor Jánosi  | 100516724  |
| Sara-Jane Bittner    | 100498554  |

_Learning goal: To study different cluster validation indices on different datasets and different
clusterings._

In this task, you should study two internal clustering validation indices, **Silhouette index
(SI)** and **Davies-Bouldin index (DB)**, and one external index, **Normalized Mutual
Information (NMI)**, the version by Strehl and Ghosh, 2003 (see the slides of lecture 5).
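
As a point of reference for the NMI variant named above: to our understanding, the Strehl and Ghosh normalization divides the mutual information by the geometric mean of the two label entropies, $\mathrm{NMI}(U,V) = I(U;V) / \sqrt{H(U)\,H(V)}$. The sketch below checks this by hand against scikit-learn's `average_method="geometric"` option (whose default is `"arithmetic"`). The two label vectors are made-up toy values, not taken from the homework data.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

# Toy labelings (hypothetical values, for illustration only)
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0])

# Mutual information and entropies, all in nats
mi = mutual_info_score(labels_true, labels_pred)
h_true = entropy(np.bincount(labels_true))
h_pred = entropy(np.bincount(labels_pred))

# Geometric-mean normalization, computed by hand
nmi_manual = mi / np.sqrt(h_true * h_pred)

# The same quantity via scikit-learn
nmi_sklearn = normalized_mutual_info_score(
    labels_true, labels_pred, average_method="geometric"
)

print(f"manual: {nmi_manual:.6f}, sklearn: {nmi_sklearn:.6f}")
```

Both numbers should agree up to floating-point error; calling `normalized_mutual_info_score` without `average_method` uses the arithmetic mean instead and can give a slightly different value.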

Load two data sets, “balls.txt” and “spirals.txt”. Both are two-dimensional data, where
the third feature component (“class”) contains the ground-truth labels. Remember to discard
the label while running the clustering algorithms!

a) Cluster “balls.txt” with i) $K$-means and ii) hierarchical single linkage clustering, both
using the Euclidean distance measure.

Use values $K = 2, \dots , 5$ in $K$-means and similarly cut the dendrogram in $2, \dots , 5$
clusters. Plot the data points with different colors to visualize all your clustering results.

Determine the optimal number of clusters for both methods using all three indices SI,
DB and NMI. Report the results as a table.

Which clustering method and $K$ value seem to be the best for the data i) based on the
validation indices and ii) by visual observation?

In [None]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering



In [None]:
balls = pd.read_csv('balls.txt')
balls.head()

In [None]:
def compute_metrics_gen(df, df_noClass, cluster, pred_labels, metrics_dict):
    """Append SI, DB and NMI for one clustering result to metrics_dict."""
    metrics_dict['cluster_n'].append(cluster)

    # internal evaluation: uses only the features, never the labels
    metrics_dict['SI'].append(metrics.silhouette_score(df_noClass, pred_labels, metric='euclidean'))
    metrics_dict['DB'].append(metrics.davies_bouldin_score(df_noClass, pred_labels))

    # external evaluation: NMI against the ground-truth 'class' column;
    # average_method='geometric' is the Strehl and Ghosh normalization (sklearn defaults to 'arithmetic')
    metrics_dict['NMI'].append(
        metrics.normalized_mutual_info_score(df['class'], pred_labels, average_method='geometric'))

    return metrics_dict

In [None]:

#kmeans
balls_k_dict = {
        "cluster_n": [],
        "SI": [],
        "DB": [],
        "NMI": []
    }

balls_noClass = balls.drop(labels="class", axis=1)
for i in range(2,6):
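    # fixed seed and explicit n_init so that the results are reproducible across runs
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(balls_noClass)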

    balls_k_dict = compute_metrics_gen(balls,balls_noClass,i, kmeans.labels_, balls_k_dict)

    plt.scatter(balls_noClass['X'], balls_noClass['Y'], c=kmeans.labels_)
    plt.title('K-means, K value = %d' %i)
    plt.show()

In [None]:
balls_metrics = pd.DataFrame.from_dict(balls_k_dict)
balls_metrics.set_index("cluster_n")

## Balls - hierarchical

In [None]:
# hierarchical
balls_h_dict = {
        "cluster_n": [],
        "SI": [],
        "DB": [],
        "NMI": []
    }

for i in range(2,6):
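    # Euclidean distance is the default, so no distance argument is passed
    # (the 'affinity' keyword was removed in recent scikit-learn releases)
    agClus = AgglomerativeClustering(n_clusters=i, linkage="single")
    agClus.fit(balls_noClass)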

    balls_h_dict = compute_metrics_gen(balls,balls_noClass,i, agClus.labels_, balls_h_dict)

    plt.scatter(balls_noClass['X'], balls_noClass['Y'], c=agClus.labels_)
    plt.title('hierarchical single linkage clustering, n_clusters = %d' %i)
    plt.show()




In [None]:
Z = linkage(balls_noClass, 'single')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
plt.show()

In [None]:
#kmeans
balls_metrics = pd.DataFrame.from_dict(balls_k_dict)
print('kmeans:')
print(balls_metrics.set_index("cluster_n"))

# Hierarchical

print('Single Linkage:')
balls_h_metrics = pd.DataFrame.from_dict(balls_h_dict)
print(balls_h_metrics.set_index("cluster_n"))

Answer
Which clustering method and $K$ value seem to be the best for the data...

i)  ...based on the validation indices
    The indices are read as follows: **SI** lies in $[-1,1]$ and higher values indicate a better clustering; **DB** is non-negative and lower values are better; **NMI** lies in $[0,1]$ and the best possible value is 1.

|                               | $\boldsymbol{K}$-means | Hierarchical clustering |
|-------------------------------|------------------------|-------------------------|
| Silhouette index              | 3 clusters             | 3 clusters              |
| Davies-Bouldin index          | 3 clusters             | 3 clusters              |
| Normalized mutual information | 3 clusters             | 3 clusters              |

Overall, both methods select 3 clusters as the best number, consistently across all three validation indices. A closer look, however, shows that hierarchical single linkage clustering keeps a much lower **DB** index for more than 3 clusters, whereas for $K$-means the **DB** index rises sharply beyond 3 clusters. This indicates that, when the number of clusters is chosen too large, hierarchical clustering degrades less than $K$-means. The reason is that $K$-means places a new centroid and thereby reassigns a large portion of an existing cluster to a new one, while single linkage assigns only outlier points to the additional cluster.
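
The difference described above can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: it uses three Gaussian blobs from `make_blobs` as a stand-in for balls.txt (the blob centers, sample size and seeds are arbitrary choices, not taken from the homework data), asks both methods for one cluster too many, and compares the resulting cluster sizes.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs (hypothetical stand-in for balls.txt)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 9]],
    cluster_std=1.0,
    random_state=0,
)

# Deliberately request 4 clusters although the data contain 3 groups
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
single_labels = AgglomerativeClustering(n_clusters=4, linkage="single").fit_predict(X)

# Sorted cluster sizes show how each method spends the extra cluster
kmeans_sizes = np.sort(np.bincount(kmeans_labels))
single_sizes = np.sort(np.bincount(single_labels))

print("K-means cluster sizes:       ", kmeans_sizes)
print("Single linkage cluster sizes:", single_sizes)
```

If the reasoning above holds, the smallest single linkage cluster should contain only a few outlying points, while $K$-means should cut one blob into two sizeable parts.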

ii) ...by visual observation?
    For $K$-means, the best number of clusters visually is 3. Here the clusters are most compact and the distances between clusters are largest. With 2 clusters, the intra-cluster distances are drastically higher. With more than 3 clusters, the clusters become less distinct, as previously compact clusters are divided in two.
    For hierarchical clustering, 3 clusters are visually the best as well. As with $K$-means, 2 clusters show drastically higher intra-cluster distances, and with more than 3 clusters small parts of existing compact clusters are split off.
    However, with single linkage the split is less drastic as the number of clusters grows. Therefore, based on the plots, hierarchical clustering is the better-performing clustering method.


## Spirals - k-means

b) Repeat all steps of a) for “spirals.txt”.

In [None]:
spirals = pd.read_csv('spirals.txt')
spirals.head()

In [None]:
spirals_noClass = spirals.drop(labels="class", axis=1)

In [None]:

spiral_k_dict = {
        "cluster_n": [],
        "SI": [],
        "DB": [],
        "NMI": []
    }
for i in range(2,6):
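    # fixed seed and explicit n_init so that the results are reproducible across runs
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(spirals_noClass)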

    spiral_k_dict = compute_metrics_gen(spirals,spirals_noClass,i, kmeans.labels_, spiral_k_dict)

    plt.scatter(spirals_noClass['X'], spirals_noClass['Y'], c=kmeans.labels_)
    plt.title('K-means, K value = %d' %i)
    plt.show()

In [None]:
#kmeans
spirals_k_metrics = pd.DataFrame.from_dict(spiral_k_dict)
print('kmeans:')
print(spirals_k_metrics.set_index("cluster_n"))

Answer
Which clustering method and $K$ value seem to be the best for the data...

i)  ...based on the validation indices

$K$-means does not perform as well as hierarchical clustering here, and it is also less clear which number of clusters is best.


|                               | $\boldsymbol{K}$-means | Hierarchical clustering |
|-------------------------------|------------------------|-------------------------|
| Silhouette index              | 3 clusters             | 2 clusters              |
| Davies-Bouldin index          | 4 clusters             | 5 clusters              |
| Normalized mutual information | 5 clusters             | 3 clusters              |

For spirals.txt with single linkage clustering, the internal indices do not reflect the quality of the clustering correctly; only **NMI** is a good indicator of the best number of clusters in this case.

ii) ...by visual observation?

For $K$-means, the clustering is not viable for any number of clusters: the clusters do not correspond to the spirals. This was already hinted at by the fact that the different indices disagreed on the optimal number of clusters.

For hierarchical single linkage clustering, 3 clusters give the best result: the three clusters form 3 distinct spirals. With fewer than 3 clusters, 2 spirals are merged into one cluster, while with more than 3 clusters the far ends of a spiral are split off into new clusters.

Overall, based on the plots, hierarchical single linkage clustering performs best on the spirals data set, since its clusters follow the individual spirals.
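
The visual impression above can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions: `make_spirals` is a hypothetical helper that generates three noise-free Archimedean spiral arms as a stand-in for spirals.txt (the real file may be noisier and shaped differently), and NMI is computed with the geometric normalization.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import normalized_mutual_info_score


def make_spirals(n_arms=3, n_per_arm=200):
    """Noise-free Archimedean spiral arms (synthetic stand-in, not spirals.txt)."""
    t = np.linspace(1.0, 10.0, n_per_arm)
    points, labels = [], []
    for arm in range(n_arms):
        # Each arm is the same spiral rotated by a fraction of a full turn
        angle = t + 2.0 * np.pi * arm / n_arms
        points.append(np.column_stack((t * np.cos(angle), t * np.sin(angle))))
        labels.append(np.full(n_per_arm, arm))
    return np.vstack(points), np.concatenate(labels)


X, y = make_spirals()
nmi_kmeans, nmi_single = {}, {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sl = AgglomerativeClustering(n_clusters=k, linkage="single").fit_predict(X)
    nmi_kmeans[k] = normalized_mutual_info_score(y, km, average_method="geometric")
    nmi_single[k] = normalized_mutual_info_score(y, sl, average_method="geometric")
    print(f"K={k}: NMI K-means={nmi_kmeans[k]:.2f}, single linkage={nmi_single[k]:.2f}")
```

On this idealized data the gap between neighboring arms is larger than the spacing of points along an arm, so single linkage with 3 clusters should recover the arms, while $K$-means cuts across them.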



## Spirals - hierarchical

In [None]:
Z = linkage(spirals_noClass, 'single')
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
plt.show()

In [None]:
spiral_h_dict = {
        "cluster_n": [],
        "SI": [],
        "DB": [],
        "NMI": []
    }

for i in range(2,6):
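    # Euclidean distance is the default, so no distance argument is passed
    # (the 'affinity' keyword was removed in recent scikit-learn releases)
    agClus = AgglomerativeClustering(n_clusters=i, linkage="single")
    agClus.fit(spirals_noClass)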

    spiral_h_dict = compute_metrics_gen(spirals,spirals_noClass,i, agClus.labels_, spiral_h_dict)
    plt.scatter(spirals_noClass['X'], spirals_noClass['Y'], c=agClus.labels_)
    plt.title('hierarchical single linkage clustering, n_clusters = %d' %i)
    plt.show()



In [None]:
#kmeans
spirals_k_metrics = pd.DataFrame.from_dict(spiral_k_dict)
print('kmeans:')
print(spirals_k_metrics.set_index("cluster_n"))

# Hierarchical

print('Single Linkage:')
spirals_h_metrics = pd.DataFrame.from_dict(spiral_h_dict)
# print the spirals results (the original printed balls_h_metrics here by mistake)
print(spirals_h_metrics.set_index("cluster_n"))

c) Explain and analyze your observations. Which index captured the performance of the
clustering algorithm most accurately? Why might some indices have failed to reflect
good performance?

Answer

How well an index performs depends on what we mean by a good clustering. There are 2 cases: in the first, we want an index that gives a high value for the best number of clusters and a low value for a wrong number of clusters; in the second, we only want as many data points as possible to end up in the same cluster, and care little if some clusters contain only an outlier.

- **Targeting the largest number of data points clustered together**: in this case the index that best captured the clustering performance was **NMI**. This is because it uses the ground-truth labels (and also takes the number of clusters into account), while the other 2 indices only consider distances within and between clusters.
Looking at the 2 datasets in more detail:
    - balls dataset: it gave 1.00 to both algorithms with 3 clusters, as no points were misassigned
    - spirals dataset: the **NMI** values of $K$-means were very low, correctly indicating that it is fundamentally incapable of clustering this dataset accurately. For hierarchical clustering the values were much higher, closer to 1, indicating that the method captured the topology of the dataset rather well. With $K$-means the values were fairly uniform, while the hierarchical values indicated a clear best choice of parameters.
- **Targeting the best number of clusters**: the best index changed depending on the dataset. Considering the two datasets in detail:
    - balls dataset: in the case of the spirals dataset with $K$-means clustering, **SI** was easier to reason with, because its values were poor and similar to each other in all cases, while the **DB** values varied, hinting that 5 clusters was the best option
    - spirals dataset: for both clustering algorithms the indices show very poor values, suggesting that both clusterings are wrong. This follows from how the indices work: **DB** compares cluster centroids and **SI** compares average intra-cluster and nearest-cluster distances, so both favor compact, well-separated clusters. In this case the **NMI** metric would have given better results.
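
For reference, the two internal indices can be written out as follows. This is a sketch of the usual definitions, which we assume match the scikit-learn functions used above. The symbols $a(i)$, $b(i)$, $S_k$ and $c_k$ are notation introduced here, not taken from the task statement.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \mathrm{SI} = \frac{1}{N} \sum_{i=1}^{N} s(i)
$$

$$
\mathrm{DB} = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{S_k + S_l}{d(c_k, c_l)}
$$

Here $a(i)$ is the mean distance from point $i$ to the other points of its own cluster, $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster, $S_k$ is the mean distance of the points in cluster $k$ to its centroid $c_k$, and $d(c_k, c_l)$ is the distance between two centroids. For interleaved spirals, even the correct partition should have $a(i) \approx b(i)$ and centroids that lie close together, so **SI** stays near 0 and **DB** becomes large, which is consistent with the poor values observed above.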


In [None]:
#kmeans
balls_metrics = pd.DataFrame.from_dict(balls_k_dict)
print('balls kmeans:')
print(balls_metrics.set_index("cluster_n"))

# Hierarchical

print('balls Single Linkage:')
balls_h_metrics = pd.DataFrame.from_dict(balls_h_dict)
print(balls_h_metrics.set_index("cluster_n"))

#kmeans
spirals_k_metrics = pd.DataFrame.from_dict(spiral_k_dict)
print('Spirals kmeans:')
print(spirals_k_metrics.set_index("cluster_n"))

# Hierarchical

print('Spirals Single Linkage:')
spirals_h_metrics = pd.DataFrame.from_dict(spiral_h_dict)
print(spirals_h_metrics.set_index("cluster_n"))