# Hands-on: Clustering

Clustering belongs to a class of machine learning methods called unsupervised learning; its objective is to separate your data into homogeneous groups with common characteristics.

It is a technique that allows us to find groups of similar objects, objects that are more related to each other than to objects in other groups. Examples of business-oriented applications of clustering include the grouping of documents, music, and movies by different topics, or finding customers that share similar interests based on common purchase behaviors as a basis for recommendation engines.

The basic question is how do we measure similarity between objects? We can define similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the squared Euclidean distance between two points x and y in m-dimensional space:
$$
d^2(x,y) = \sum_{j=1}^m(x_j-y_j)^2 = \Vert x-y\Vert^2_2
$$

Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem: an iterative process that minimizes the within-cluster Sum of Squared Errors (SSE), which is sometimes also called cluster inertia or distortion:

$$
\mathrm{SSE} = \sum_{i=1}^n \sum_{j=1}^k w^{(i,j)} \Vert x^{(i)}-\mu^{(j)}\Vert^2_2
$$

where $\mu^{(j)}$ is the centroid of the $j$-th cluster, and $w^{(i,j)}=1$ if $x^{(i)}$ is in cluster $j$ and $0$ otherwise.
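
To make the SSE formula concrete, here is a minimal sketch that computes it by hand and compares it with the `inertia_` attribute of a fitted scikit-learn `KMeans` model. The small dataset and the `_demo` variable names are only illustrative assumptions, not part of the lab data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small synthetic dataset (centers chosen only for illustration)
X_demo, _ = make_blobs(n_samples=300, centers=[[-5, 0], [0, 5], [5, 0]],
                       cluster_std=1.0, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# w^{(i,j)} = 1 only for the cluster j that sample i is assigned to,
# so the double sum reduces to one squared distance per sample
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
sse_manual = np.sum((X_demo - assigned_centroids) ** 2)

print(f"manual SSE : {sse_manual:.4f}")
print(f"inertia_   : {km_demo.inertia_:.4f}")
```

The two numbers should agree up to floating-point precision, since `inertia_` is this same sum of squared distances to the closest centroid.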

The following four types are the most widely used types of clustering models.

* **Centroid Models**: use the distance between a data point and the centroid of the cluster to group data. K-means clustering is an example of a centroid model.
* **Distribution Models**: segment data based on their probability of belonging to the same distribution. Gaussian Mixture Model (GMM) is a popular distribution model.
* **Connectivity Models**: use the closeness of the data points to decide the clusters. Hierarchical clustering is a widely used connectivity model.
* **Density Models**: scan the data space and assign clusters based on the density of data points. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density model.

More Information here: https://scikit-learn.org/stable/modules/clustering.html

## K-means

The K-means algorithm is a very well known unsupervised algorithm in clustering. In this lab practice we will detail how it works and the useful ways to optimize it. 

This algorithm was designed in 1957 at Bell Laboratories by Stuart P. Lloyd as a pulse code modulation (PCM) technique. It was presented to the general public only in 1982. In 1965 Edward W. Forgy had already published a very similar algorithm, which is why K-means is often called the Lloyd-Forgy algorithm.

The fields of application are diverse: customer segmentation, data analysis, image segmentation, semi-supervised learning etc.

### The principle of the k-means algorithm

Given points and an integer $k$, the algorithm aims to divide the points into $k$ homogeneous and compact groups, called clusters. Let's look at the example below:

In [None]:
#Just the basics 
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import colors

# Mathematical Analysis
from scipy import linalg
from scipy.spatial import Voronoi, voronoi_plot_2d 

# Metrics
import sklearn.metrics as metrics
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.neighbors import NearestCentroid

# Dataset
from sklearn import datasets
from sklearn.datasets import make_blobs

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Modeling
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN


1. Let's create a dataset of size $N=2000$, distributed in three clusters with a Gaussian distribution:

In [None]:
n_samples = 2000
random_state = 130 # fix the random state for reproducibility
n_components=3
std_dev=1.0 
# This function makes some clusters with 2D coordinates in X and a label y
# Useful for testing unsupervised ML as well as classification

X,y = make_blobs(n_samples=n_samples, centers=n_components, cluster_std=std_dev, random_state=random_state)

figure, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
ax1.scatter(X[:,0], X[:,1])
ax1.set_title ('Clusters',fontsize=20)
ax2.scatter(X[:,0], X[:,1], c = y)
ax2.set_title ('Classes',fontsize=20)
xlim, ylim = ax1.get_xlim(),ax1.get_ylim()

2. Create a model with $k=3$ as visually this should represent a good clustering of the data

In [None]:
# Fix the number of clusters that you want 
nb_c = 3

# create a K-means model with that number of clusters  
model = KMeans(n_clusters=nb_c
               , init='random'
               , n_init=10
               , max_iter=300
               , tol=1e-04
               , random_state=random_state
              )

# cluster the data with that model 
model.fit(X)

# here for the demonstration use the same data to see how it performs
# y_pred :labels assigned to each data points
# centers : centroid positions of each cluster

y_pred = model.predict(X)
centers = model.cluster_centers_
# Plot the data

vor = Voronoi(centers) 
voronoi_plot_2d(vor
                , show_vertices=False
                , line_colors='red'
                , line_width=2
                , line_alpha=0.6
                , point_size=10) 

plt.scatter(X[:, 0], X[:,1], c = y_pred)
plt.xlim(xlim)
plt.ylim(ylim)
plt.title ('Clusters',fontsize=20)

**Note**: the cluster labels assigned by k-means can differ from the original class labels (they are arbitrary integers), but this is not important. Apart from this relabeling, the clustering appears to recover the 3 classes used to build the dataset very well.

**Interpretation**: k-means divides the space using hyperplanes: a data point is attributed to the cluster whose centroid is nearest. In 2D these boundaries can be visualized with a Voronoi diagram of the centroids. This becomes clearer below, when increasing the requested number of clusters and when predicting the cluster of a new data point.
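
The nearest-centroid rule behind these Voronoi cells can be checked directly. The sketch below, which uses an illustrative dataset and hypothetical `_demo` names, assigns new points to the closest centroid with NumPy and compares the result with `KMeans.predict`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=[[-5, 0], [0, 5], [5, 0]],
                       cluster_std=1.0, random_state=0)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# A few hypothetical new points drawn uniformly inside the bounding box of the data
rng = np.random.default_rng(0)
X_new_demo = rng.uniform(low=X_demo.min(axis=0), high=X_demo.max(axis=0), size=(20, 2))

# Distance from each new point to each centroid: array of shape (20, 3)
dists = np.linalg.norm(
    X_new_demo[:, None, :] - km_demo.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)  # index of the closest centroid

# Should print True: predict() applies the same nearest-centroid rule
print(np.array_equal(nearest, km_demo.predict(X_new_demo)))
```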

3. What if we change the requested number of clusters?

In [None]:
# Try every number of clusters from 1 to k_max
y_pred = np.zeros(n_samples) # placeholder row 0, so that y_p[k] holds the labels for k clusters
random_state = 130
k_max = 10 
nr = k_max//2

models = []

figure, ax = plt.subplots(nrows=nr, ncols = 2, figsize=(14,5*nr))  

for nb_c in range(1,k_max+1):

    # create a K-means model with that number of clusters
    model = KMeans(n_clusters=nb_c
               , init='random'
               , n_init=10
               , max_iter=300
               , tol=1e-04
               , random_state=random_state
              )
    # cluster the data with that model
    model.fit(X)

    # here for the demonstration use the same data to see how it performs
    y_pred = np.append(y_pred, model.predict(X))
    models.append(model)

y_p = np.reshape(y_pred,(k_max+1,n_samples))

for nb_c in range(1,k_max+1):
    ax[(nb_c-1)//2,1-nb_c%2].scatter(X[:, 0], X[:,1], c = y_p[nb_c,:])
    ax[(nb_c-1)//2,1-nb_c%2].set_title(f'number of clusters: {nb_c}', fontsize = 20)
    ax[(nb_c-1)//2,1-nb_c%2].scatter(models[nb_c-1].cluster_centers_[:,0], models[nb_c-1].cluster_centers_[:,1], 
                marker='*', 
                color='red', 
                s=200);

#     if nb_c>2:
#         vor = Voronoi(models[nb_c-1].cluster_centers_) 
#         voronoi_plot_2d(vor, ax[(nb_c-1)//2,1-nb_c%2]
#                 , show_vertices=False
#                 , line_colors='red'
#                 , line_width=2
#                 , line_alpha=0.6
#                 , point_size=10
#                 , point_alpha=0
#                 ) 
#     ax[(nb_c-1)//2,1-nb_c%2].set_xlim(xlim)
#     ax[(nb_c-1)//2,1-nb_c%2].set_ylim(ylim)


Except for $k=3$, one can see that the partitioning is inaccurate because the requested number of clusters is either lower or higher than the ideal number, which is $3$ here by construction.

3. Search for the optimal number of clusters

There are methods to determine the ideal number of clusters. 

####  Elbow method
The most common is the elbow method.
It is based on the notion of inertia, defined as the sum of the squared Euclidean distances between each point and its associated centroid (the SSE introduced above). The larger the requested number of clusters, the lower the inertia generally is: each point is more likely to be close to a centroid. One therefore looks for the "elbow", the number of clusters beyond which the inertia decreases only slowly.

Let's look at what this gives on our example

In [None]:
# Extract the inertia

elb = []
for i in range(k_max):
    elb.append(models[i].inertia_)
    

plt.plot(range(1,k_max+1),elb, marker = 'o')
plt.xticks(range(1,k_max+1))
plt.xlabel('Number of clusters in k-means', fontsize=16)
plt.ylabel('Inertia or Distortion', fontsize=16)
plt.show()

#### Silhouette coefficient

One can notice that the inertia levels off after 3 clusters, so the elbow method is conclusive here. Nevertheless, it can be complemented by a more precise approach that requires more computing time: the silhouette coefficient, defined for each observation as follows:

$$
s = \frac{b-a}{\max(a,b)}
$$

where $a$ is the mean distance to the other observations of the same cluster (*i.e.* the intra-cluster average), and $b$ is the mean distance to the observations of the nearest other cluster. This coefficient varies between $-1$ and $+1$. An $s$ coefficient close to $+1$ means that the observation is well located inside its own cluster, while a coefficient close to $0$ means that it is located near a border; finally, a coefficient close to $-1$ means that the observation is associated with the wrong cluster. The silhouette score of a clustering is the average of $s$ over all observations.

The calculation of this coefficient is provided by the `sklearn.metrics` module (`silhouette_samples` for the per-observation values, `silhouette_score` for their average).
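
As a sanity check of the definition, the following sketch computes $a$, $b$ and $s$ by hand for a single observation and compares the result with `silhouette_samples`. The dataset, the chosen index `i` and the `_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=[[-5, 0], [0, 5], [5, 0]],
                       cluster_std=1.0, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

i = 0  # index of the observation to inspect (arbitrary choice)
d = np.linalg.norm(X_demo - X_demo[i], axis=1)  # distances from sample i to every sample

own = labels_demo == labels_demo[i]
# a: mean distance to the *other* members of its own cluster
# (d[i] = 0 is in the sum, so we divide by the cluster size minus one)
a = d[own].sum() / (own.sum() - 1)
# b: smallest mean distance to the members of any other cluster
b = min(d[labels_demo == c].mean()
        for c in np.unique(labels_demo) if c != labels_demo[i])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_demo, labels_demo)[i]
print(f"manual: {s_manual:.6f}  sklearn: {s_sklearn:.6f}")

# silhouette_score is the mean of the per-sample values
mean_of_samples = silhouette_samples(X_demo, labels_demo).mean()
print(np.isclose(mean_of_samples, silhouette_score(X_demo, labels_demo)))
```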

As for the inertia, it is judicious to display the evolution of the coefficient as a function of the number of clusters as shown below:

In [None]:

# Extract the silhouette

sil = []
for i in range(2,k_max+1):
    sil.append(silhouette_score(X, y_p[i]))

    print(
        "For n_clusters =",
        i,
        "The average silhouette_score is :",
        sil[-1],
    )
plt.plot(range(2,k_max+1),sil, marker = 'o')
plt.xticks(range(2,k_max+1))
plt.xlabel('Number of clusters in k-means', fontsize=16)
plt.ylabel('Silhouette coefficient', fontsize=16)
plt.show()

4. Now predict which cluster new points belong to, according to the best model

It is useful to draw the Voronoi diagram to visually check the performance of the clustering and see how it works.

In [None]:
# Draw n random 2D points (one coordinate column at a time)
n = 100
dim = 1
np.random.seed(42)

x = np.array([])
x = np.append(x,np.random.rand(n, dim)*14-10)
x = np.append(x,np.random.rand(n, dim)*22-11)
X_new = np.reshape(x,(2,n)).transpose()

# Best model 
nb_c = 3

y_p_new = models[nb_c-1].predict(X_new)
centers = models[nb_c-1].cluster_centers_

figure, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, figsize=(14,5))

vor = Voronoi(centers) 

ax1.scatter(X[:,0],X[:,1], c = y_p[nb_c])
voronoi_plot_2d(vor
                 , ax1
                 , show_vertices=False
                 , line_colors='red'
                 , line_width=2
                 , line_alpha=0.6
                 , point_size=10) 

ax2.scatter(X_new[:,0],X_new[:,1], c = y_p_new)
voronoi_plot_2d(vor
                 , ax2
                 , show_vertices=False
                 , line_colors='red'
                 , line_width=2
                 , line_alpha=0.6
                 , point_size=10) 
ax1.set_xlim(xlim)
ax1.set_ylim(ylim)
ax2.set_xlim(xlim)
ax2.set_ylim(ylim)
plt.show()

#### Practice this on a trickier example

In [None]:
n_samples = 2000
random_state = 0 # fix the random state for reproducibility
n_components = 4
std_dev=1
# This function makes some clusters with 2D coordinates in X and a label y
# Useful for testing unsupervised ML as well as classification

X,y = make_blobs(n_samples=n_samples, centers=n_components, cluster_std=std_dev, random_state=random_state)

figure, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, figsize=(14,5))

ax1.scatter(X[:,0], X[:,1])
ax1.set_title ('Clusters',fontsize=20)
ax2.scatter(X[:,0], X[:,1], c = y)
ax2.set_title ('Classes',fontsize=20)
xlim, ylim = ax1.get_xlim(),ax1.get_ylim()
plt.show()

In [None]:
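k_max = 10 
nr = k_max//2
# Exercise: repeat the k-means loop above on this dataset, storing the models in
# `models_km` and the labels in `y_p_km` (both names are used in the comparison section)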


In [None]:
# Extract the inertia
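
One possible way to complete this exercise is sketched below. It simply repeats the loop of the first example on the trickier dataset. The names `models_km` and `y_p_km` are taken from the comparison section at the end of this notebook; `inertia_km` and `sil_km` are hypothetical names introduced here. The dataset is regenerated with the same parameters only so that the block is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same "tricky" dataset as above, regenerated so that the block runs on its own
n_samples = 2000
random_state = 0
X, y = make_blobs(n_samples=n_samples, centers=4, cluster_std=1,
                  random_state=random_state)

k_max = 10
models_km = []
# Row 0 is an unused placeholder so that y_p_km[k] holds the labels for k clusters
y_p_km = np.zeros((k_max + 1, n_samples), dtype=int)

for nb_c in range(1, k_max + 1):
    model = KMeans(n_clusters=nb_c, init='random', n_init=10, max_iter=300,
                   tol=1e-04, random_state=random_state)
    y_p_km[nb_c] = model.fit_predict(X)
    models_km.append(model)

# Inertia for k = 1..k_max and silhouette score for k = 2..k_max
inertia_km = [m.inertia_ for m in models_km]
sil_km = [silhouette_score(X, y_p_km[k]) for k in range(2, k_max + 1)]

for k, s in zip(range(2, k_max + 1), sil_km):
    print(f"k = {k:2d}  silhouette = {s:.3f}")
```

The two lists can then be plotted against the number of clusters exactly as in the first example.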


### Conclusion:

Here K-means still performs quite well, even if the inertia and silhouette score are less clear-cut.

* Advantages: K-means is fast and scalable.

* Drawbacks: The model performance is highly sensitive to the initial centroids, and some initializations produce sub-optimal results. K-means does not perform well when the clusters vary a lot in size, have different densities, or have a non-spherical shape.

* Extensions: K-means++, which initializes the centroids in a smarter way; a k-means-based initialization is also commonly used for the Gaussian Mixture Model.

## Gaussian Mixture Model

Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data are generated by a mixture of a finite number of Gaussian distributions, each one corresponding to a cluster. It is fitted with the expectation-maximization (EM) algorithm.

In the expectation step, the algorithm estimates the probability of each data point belonging to each cluster.
In the maximization step, each cluster is updated using the estimated probability of belonging to the cluster of all the data points.
The updates of the cluster are mostly impacted by the data points with high probabilities of belonging to the cluster.
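
The two steps can be illustrated on a fitted model. The sketch below recomputes the expectation step by hand (weighted Gaussian densities, normalized per data point) and compares it with `predict_proba`; it then applies the mean update of the maximization step. The dataset and the `_demo` names are illustrative assumptions, and only the means are updated here (weights and covariances are left out for brevity).

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=300, centers=[[-5, 0], [0, 5], [5, 0]],
                       cluster_std=1.0, random_state=0)
gmm_demo = GaussianMixture(n_components=3, random_state=0).fit(X_demo)

# E-step: the probability that sample i belongs to component k is proportional
# to weight_k * N(x_i | mean_k, cov_k), normalised over the components
dens = np.column_stack([
    w * multivariate_normal(mean=m, cov=c).pdf(X_demo)
    for w, m, c in zip(gmm_demo.weights_, gmm_demo.means_, gmm_demo.covariances_)
])
resp = dens / dens.sum(axis=1, keepdims=True)
print(np.allclose(resp, gmm_demo.predict_proba(X_demo), atol=1e-6))

# M-step (means only): probability-weighted average of all the data points,
# so points with a high probability dominate the update of their cluster
nk = resp.sum(axis=0)
means_update = (resp.T @ X_demo) / nk[:, None]
print(np.abs(means_update - gmm_demo.means_).max())  # small once EM has converged
```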

The Python implementation of GMM is similar to that of the K-means clustering model: we just need to change the estimator from KMeans to GaussianMixture (and the parameter n_clusters to n_components).

One difference is that n_init is changed from its default value of $1$ to $5$. n_init is the number of initializations to perform: with $5$, the model is initialized and fitted $5$ times, and the best result is kept.

In [None]:
k_max = 10 
nr = k_max//2

models_gmm = []
y_pred = np.zeros(n_samples) # placeholder row 0, so that y_p_gmm[k] holds the labels for k components

figure, ax = plt.subplots(nrows=nr, ncols = 2, figsize=(14,5*nr))  

for nb_c in range(1,k_max+1):

    # create a GMM model with that number of components
    model = GaussianMixture(n_components=nb_c
                            , n_init= 5
                            , random_state=random_state)

    # fit the model to the data
    model.fit(X)

    # here for the demonstration use the same data to see how it performs
    y_pred= np.append(y_pred, model.predict(X))
    models_gmm.append(model)

y_p_gmm  = np.reshape(y_pred,(k_max+1,n_samples))

for nb_c in range(1,k_max+1):
    ax[(nb_c-1)//2,1-nb_c%2].scatter(X[:, 0], X[:,1], c = y_p_gmm[nb_c,:])
    ax[(nb_c-1)//2,1-nb_c%2].set_title(f'number of clusters: {nb_c}', fontsize = 20)
    ax[(nb_c-1)//2,1-nb_c%2].scatter(models_gmm[nb_c-1].means_[:,0], models_gmm[nb_c-1].means_[:,1], 
                marker='*', 
                color='red', 
                s=200);


In [None]:
# Extract the log-likelihood, AIC and BIC

elb = []
aic = []
bic = []
for i in range(k_max):
    elb.append(models_gmm[i].score(X))
    aic.append(models_gmm[i].aic(X))
    bic.append(models_gmm[i].bic(X))
    
# Extract the silhouette
sil = []
for i in range(2,k_max+1):
    sil.append(silhouette_score(X, y_p_gmm[i]))

    print(
        "For n_clusters =",
        i,
        "The average silhouette_score is :",
        sil[-1],
    )
    
figure, axs = plt.subplots(nrows = 2, ncols=2, figsize=(14,10))

axs[0,0].plot(range(1,k_max+1),elb, marker = 'o')
axs[0,0].set_xticks(range(1,k_max+1))
axs[0,0].set_xlabel('Number of clusters in gmm', fontsize=16)
axs[0,0].set_ylabel('Average log-likelihood', fontsize=16)

axs[0,1].plot(range(2,k_max+1),sil, marker = 'o')
axs[0,1].set_xticks(range(2,k_max+1))
axs[0,1].set_xlabel('Number of clusters in gmm', fontsize=16)
axs[0,1].set_ylabel('Silhouette coefficient', fontsize=16)

axs[1,0].plot(range(1,k_max+1),aic, marker = 'o')
axs[1,0].set_xticks(range(1,k_max+1))
axs[1,0].set_xlabel('Number of clusters in gmm', fontsize=16)
axs[1,0].set_ylabel('Akaike information Criterion', fontsize=16)

axs[1,1].plot(range(1,k_max+1),bic, marker = 'o')
axs[1,1].set_xticks(range(1,k_max+1))
axs[1,1].set_xlabel('Number of clusters in gmm', fontsize=16)
axs[1,1].set_ylabel('Bayesian Information Criterion', fontsize=16)

plt.show()

## Hierarchical clustering

AgglomerativeClustering is a type of hierarchical clustering algorithm that will be used here as an example.

It uses a bottom-up approach and starts each data point as an individual cluster.
Then the clusters that are closest to each other are connected until all the clusters are connected into one.
The hierarchical clustering algorithms produce a binary tree, where the root of the tree includes all the data points, and the leaves of the tree are the individual data points.
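
The binary tree can be inspected explicitly. As a sketch (using SciPy's `linkage` rather than scikit-learn, on a small illustrative dataset with hypothetical `_demo` names), the merge table below has one row per merge, and cutting the tree at 3 clusters is compared with `AgglomerativeClustering`, whose default linkage is also Ward.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Small, well-separated dataset so that the tree stays readable
X_demo, _ = make_blobs(n_samples=50, centers=[[-8, 0], [0, 8], [8, 0]],
                       cluster_std=1.0, random_state=0)

# Ward linkage: each row of Z records one merge (the two clusters joined,
# the merge distance, and the size of the new cluster)
Z = linkage(X_demo, method="ward")
print(Z.shape)  # (49, 4): n_samples - 1 merges from the leaves up to the root

# Cut the tree so that at most 3 clusters remain
labels_tree = fcluster(Z, t=3, criterion="maxclust")

# Compare with scikit-learn's bottom-up implementation
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)
ari_demo = adjusted_rand_score(labels_tree, labels_agg)
print(f"agreement (adjusted Rand index): {ari_demo:.3f}")
```

The last row of `Z` is the root of the tree: its fourth column equals the total number of data points.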

The Python implementation of the hierarchical clustering model is similar to that of the K-means clustering model: we just need to change the estimator from KMeans to AgglomerativeClustering.

In [None]:
k_max = 10 
nr = k_max//2

models_hc = []
y_pred = np.zeros(n_samples) # placeholder row 0, so that y_p_hc[k] holds the labels for k clusters

figure, ax = plt.subplots(nrows=nr, ncols = 2, figsize=(14,5*nr))  

for nb_c in range(1,k_max+1):

    # create an agglomerative clustering model with that number of clusters
    model = AgglomerativeClustering(n_clusters=nb_c)

    # cluster the data with that model; this estimator has no separate predict(),
    # so fit_predict returns the labels of the training data directly
    y_pred= np.append(y_pred, model.fit_predict(X))
    models_hc.append(model)

y_p_hc  = np.reshape(y_pred,(k_max+1,n_samples))


# for this method calculate the centroids externally
clf = NearestCentroid()

for nb_c in range(1,k_max+1):
    ax[(nb_c-1)//2,1-nb_c%2].scatter(X[:, 0], X[:,1], c = y_p_hc[nb_c,:])
    ax[(nb_c-1)//2,1-nb_c%2].set_title(f'number of clusters: {nb_c}', fontsize = 20)
    if nb_c>1:
        clf.fit(X, y_p_hc[nb_c])
        ax[(nb_c-1)//2,1-nb_c%2].scatter(clf.centroids_[:,0], clf.centroids_[:,1], 
                marker='*', 
                color='red', 
                s=200);


In [None]:
# Extract the silhouette
sil = []
for i in range(2,k_max+1):
    sil.append(silhouette_score(X, y_p_hc[i]))

    print(
        "For n_clusters =",
        i,
        "The average silhouette_score is :",
        sil[-1],
    )
    
figure, (ax) = plt.subplots(nrows=1, ncols=1, figsize=(7,5))

ax.plot(range(2,k_max+1),sil, marker = 'o')
ax.set_xticks(range(2,k_max+1))
ax.set_xlabel('Number of clusters in hierarchical clustering', fontsize=16)
ax.set_ylabel('Silhouette coefficient', fontsize=16)



plt.show()

## Density-based spatial clustering of applications with noise  (DBSCAN)

DBSCAN defines clusters using data density. It has two important hyperparameters to tune, eps and min_samples.

* eps is the epsilon distance to be considered as the neighborhood of a data point. It is the most important parameter for DBSCAN [6].
* min_samples is the number of minimum data points in the neighborhood in order for a data point to be considered as a core data point. This number includes the data point itself [6].

All data points in the neighborhood of the core data points belong to the same cluster.
The data points that are not core data points and do not have a core data point in their neighborhood are considered outliers (noise). The label -1 in the prediction results represents these outliers.

**An important difference with respect to the other methods is that DBSCAN does not take a pre-defined number of clusters**: it identifies the number of clusters based on the density distribution of the dataset.
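
The roles of eps and min_samples can be made explicit with a small sketch. It separates core, border and noise points from a fitted `DBSCAN` model and recomputes the core points by counting neighbors by hand. The dataset, the added far-away point and the `_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=[[-8, 0], [0, 8], [8, 0]],
                       cluster_std=1.0, random_state=0)
# Add one isolated point, far from everything else
X_demo = np.vstack([X_demo, [[30.0, 30.0]]])

eps_demo, min_samples_demo = 0.8, 3
db_demo = DBSCAN(eps=eps_demo, min_samples=min_samples_demo).fit(X_demo)
labels_demo = db_demo.labels_

is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db_demo.core_sample_indices_] = True
is_noise = labels_demo == -1          # label -1 marks the outliers
is_border = ~is_core & ~is_noise      # in a cluster, but not a core point

# Core points by hand: at least min_samples points within eps, the point itself included
pair_dist = np.linalg.norm(X_demo[:, None, :] - X_demo[None, :, :], axis=2)
core_manual = (pair_dist <= eps_demo).sum(axis=1) >= min_samples_demo

n_clusters_demo = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)
print("clusters:", n_clusters_demo)
print("core:", is_core.sum(), "border:", is_border.sum(), "noise:", is_noise.sum())
print("label of the isolated point:", labels_demo[-1])
```

Increasing eps or decreasing min_samples turns more points into core points, which reduces the noise but can also merge neighboring clusters.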

In [None]:
# First go back to the first sample

X0,y0 = make_blobs(n_samples=n_samples, centers=3, cluster_std=1.0, random_state=130)
model_db0 = DBSCAN(eps = 0.8, min_samples = 3)
y_p_db0= model_db0.fit_predict(X0)


# Then the more tricky sample 
model_db = DBSCAN(eps = 0.3, min_samples = 3)
    
# cluster the data with that model 
y_p_db= model_db.fit_predict(X)

print('Simple sample: ')
labels0 = model_db0.labels_
n_clusters0 = len(set(labels0)) - (1 if -1 in labels0 else 0)
n_noise0 = list(labels0).count(-1)
print("Estimated number of clusters: %d" % n_clusters0)
print("Estimated number of noise points: %d" % n_noise0)
print('\n')
print('Tricky sample: ')
labels = model_db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters)
print("Estimated number of noise points: %d" % n_noise)

figure, (ax1,ax2) = plt.subplots(nrows=1, ncols = 2, figsize=(14,5))  

ax1.scatter(X0[:, 0], X0[:,1], c = y_p_db0)
ax1.set_title(f'Estimated number of clusters: {n_clusters0}', fontsize = 20)
ax2.scatter(X[:, 0], X[:,1], c = y_p_db)
ax2.set_title(f'Estimated number of clusters: {n_clusters}', fontsize = 20)

plt.show()

### Analysis

For the first, simple sample, DBSCAN identifies the correct number of clusters, provided that the two parameters are set correctly. In that case the tuning is fairly easy. Here there are 4 distinct labels, one of which (-1) corresponds to the noise. The number of noise points should be reduced, but beware: there is a trade-off between this noise and the chosen neighborhood.

For the trickier sample, it seems at first sight that DBSCAN has some difficulty identifying the clusters that overlap, and a deeper tuning of all the hyperparameters should be done.

DBSCAN Pros: works on datasets of any shape and identifies anomalies automatically.

DBSCAN Cons: It does not work well for identifying the clusters that are not well separated. Different clusters in the dataset need to have similar densities, otherwise, the DBSCAN does not perform well.


## Compare the clustering approaches

The comparison is first made visually, by comparing the centers and the histogram of labels. A more quantitative approach consists in comparing the predicted clusters with the ground truth labels, as they are known here. If the ground truth is not known, which is the usual case in a clustering experiment, the silhouette values can be used.

Various scores can be used (see Scikit-learn guide):
* **Homogeneity**: a clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
* **Completeness**: a clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way. This metric is not symmetric: switching label_true with label_pred will return the homogeneity_score, which will be different in general.
* **V-measure**: the harmonic mean between homogeneity and completeness. This score is identical to normalized_mutual_info_score with the 'arithmetic' option for averaging. This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way. This metric is furthermore symmetric: switching label_true with label_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.
* **Rand index**: the Rand index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and true clusterings. The code below uses its chance-adjusted version, adjusted_rand_score.
* **Adjusted Mutual Information (AMI)**: an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared.
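
These properties can be checked on a toy example. In the sketch below, the three small label lists are made up for illustration: `y_a` is the true partition with permuted labels, and `y_b` splits each class in two, which keeps the clusters pure (homogeneous) but breaks completeness.

```python
import numpy as np
from sklearn import metrics

y_true_demo = [0, 0, 0, 1, 1, 1]
y_a = [1, 1, 1, 0, 0, 0]   # same partition, label values swapped
y_b = [0, 0, 1, 2, 2, 3]   # each class split in two clusters

# Permuting the label values does not change the scores
print(f"ARI (permuted labels): {metrics.adjusted_rand_score(y_true_demo, y_a):.3f}")
print(f"V-measure (permuted labels): {metrics.v_measure_score(y_true_demo, y_a):.3f}")

# Over-splitting: pure clusters, but the classes are scattered over several clusters
h = metrics.homogeneity_score(y_true_demo, y_b)
c = metrics.completeness_score(y_true_demo, y_b)
v = metrics.v_measure_score(y_true_demo, y_b)
print(f"homogeneity: {h:.3f}  completeness: {c:.3f}  V-measure: {v:.3f}")

# V-measure is the harmonic mean of homogeneity and completeness
print(np.isclose(v, 2 * h * c / (h + c)))

# Swapping the arguments turns completeness into homogeneity
print(np.isclose(metrics.homogeneity_score(y_b, y_true_demo), c))
```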

In [None]:
n_bins = 10
k_max = 10
nr = k_max//2
figure, ax = plt.subplots(nrows=nr, ncols = 2, figsize=(14,5*nr))  

for nb_c in range(1,k_max+1):
    ax[(nb_c-1)//2,1-nb_c%2].hist([y_p_km[nb_c],y_p_gmm[nb_c],y_p_hc[nb_c]], bins=n_bins)
    ax[(nb_c-1)//2,1-nb_c%2].legend(['K-means','GMM','HC'])


In [None]:
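# Compare the centroids 
nb_c = 4
vor = Voronoi(models_km[nb_c-1].cluster_centers_)
# refit the centroid helper for nb_c clusters: after the hierarchical clustering
# loop above, clf still holds the centroids of the last iteration (k_max clusters)
clf.fit(X, y_p_hc[nb_c])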
print('------------')
print('CENTROIDS:')
print('------------')
print('k-means') 
print(models_km[nb_c-1].cluster_centers_)
print('Gaussian Mixture Model') 
print(models_gmm[nb_c-1].means_)
print('Hierarchical clustering') 
print(clf.centroids_)

print('------------')
print('PERFORMANCE:')
print('------------')
print('k-means') 
print(f"Homogeneity: {metrics.homogeneity_score(y, y_p_km[4]):.3f}")
print(f"Completeness: {metrics.completeness_score(y, y_p_km[4]):.3f}")
print(f"V-measure: {metrics.v_measure_score(y, y_p_km[4]):.3f}")
print(f"Adjusted Rand Index: {metrics.adjusted_rand_score(y, y_p_km[4]):.3f}")
print("Adjusted Mutual Information:" f" {metrics.adjusted_mutual_info_score(y, y_p_km[4]):.3f}")
print(f"Silhouette Coefficient: {metrics.silhouette_score(X, y_p_km[4]):.3f}")
print('\n')

print('Gaussian Mixture Model') 
print(f"Homogeneity: {metrics.homogeneity_score(y, y_p_gmm[4]):.3f}")
print(f"Completeness: {metrics.completeness_score(y, y_p_gmm[4]):.3f}")
print(f"V-measure: {metrics.v_measure_score(y, y_p_gmm[4]):.3f}")
print(f"Adjusted Rand Index: {metrics.adjusted_rand_score(y, y_p_gmm[4]):.3f}")
print("Adjusted Mutual Information:" f" {metrics.adjusted_mutual_info_score(y, y_p_gmm[4]):.3f}")
print(f"Silhouette Coefficient: {metrics.silhouette_score(X, y_p_gmm[4]):.3f}")
print('\n')

print('Hierarchical clustering') 
print(f"Homogeneity: {metrics.homogeneity_score(y, y_p_hc[4]):.3f}")
print(f"Completeness: {metrics.completeness_score(y, y_p_hc[4]):.3f}")
print(f"V-measure: {metrics.v_measure_score(y, y_p_hc[4]):.3f}")
print(f"Adjusted Rand Index: {metrics.adjusted_rand_score(y, y_p_hc[4]):.3f}")
print("Adjusted Mutual Information:" f" {metrics.adjusted_mutual_info_score(y, y_p_hc[4]):.3f}")
print(f"Silhouette Coefficient: {metrics.silhouette_score(X, y_p_hc[4]):.3f}")
print('\n')

print('DBSCAN') 
print(f"Homogeneity: {metrics.homogeneity_score(y, y_p_db):.3f}")
print(f"Completeness: {metrics.completeness_score(y, y_p_db):.3f}")
print(f"V-measure: {metrics.v_measure_score(y, y_p_db):.3f}")
print(f"Adjusted Rand Index: {metrics.adjusted_rand_score(y, y_p_db):.3f}")
print("Adjusted Mutual Information:" f" {metrics.adjusted_mutual_info_score(y, y_p_db):.3f}")
print(f"Silhouette Coefficient: {metrics.silhouette_score(X, y_p_db):.3f}")


figure, (ax,ax1) = plt.subplots(nrows = 1, ncols = 2, figsize = (14,5))

ax.set_title(f'number of clusters: {nb_c}', fontsize = 20)
voronoi_plot_2d(vor
                 , ax
                 , show_vertices=False
                 , line_colors='red'
                 , line_width=2
                 , line_alpha=0.6
                 , point_size=0) 

ax.scatter(models_km[nb_c-1].cluster_centers_[:,0], models_km[nb_c-1].cluster_centers_[:,1], 
                marker='*', 
#                color='lue', 
                s=200,
          label = 'k-means');
ax.scatter(models_gmm[nb_c-1].means_[:,0], models_gmm[nb_c-1].means_[:,1], 
                marker='*', 
#                color='lue', 
                s=200,
          label = 'GMM');
ax.scatter(clf.centroids_[:,0], clf.centroids_[:,1], 
                marker='*', 
#                color='red', 
                s=200,
          label = 'HC');
ax.scatter(X[:, 0], X[:,1], c = y, alpha = 0.1)
ax1.hist([y_p_km[nb_c],y_p_gmm[nb_c],y_p_hc[nb_c]], bins=n_bins)
xlim1=ax1.get_xlim()
ax1.plot(xlim1,[n_samples/4,n_samples/4], linestyle = 'dashed')
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.legend()
