#Q1
In the context of clustering evaluation, homogeneity and completeness are two metrics that assess the quality of a clustering solution by comparing the cluster assignments with ground-truth class labels.

Homogeneity:

Homogeneity measures the degree to which each cluster contains only members of a single class. In other words, it evaluates whether the clusters are made up of data points that belong to the same class or category.
The homogeneity score ranges from 0 to 1, where 1 indicates perfect homogeneity.

Completeness:
Completeness measures the degree to which all members of a given class are assigned to the same cluster. It evaluates whether all data points belonging to the same class are grouped together in a single cluster.
Like homogeneity, the completeness score also ranges from 0 to 1, with 1 indicating perfect completeness.

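These metrics are usually reported together because either one can be maximized trivially on its own: putting every point in its own cluster gives perfect homogeneity, and putting all points in a single cluster gives perfect completeness. A good clustering solution scores high on both.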

In [2]:
#1
from sklearn.metrics import homogeneity_score, completeness_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import warnings 
warnings.filterwarnings('ignore')

# Generate synthetic data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Calculate homogeneity and completeness
homogeneity = homogeneity_score(y, labels)
completeness = completeness_score(y, labels)

print(f'Homogeneity Score: {homogeneity:.4f}')
print(f'Completeness Score: {completeness:.4f}')

Homogeneity Score: 1.0000
Completeness Score: 1.0000
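
To show where these scores come from, the following is a minimal sketch that recomputes both metrics from conditional entropies and compares them with scikit-learn. The helper names (`entropy`, `conditional_entropy`, `manual_homogeneity`, `manual_completeness`) and the toy labels are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import homogeneity_score, completeness_score


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def conditional_entropy(a, b):
    """H(a | b): entropy of `a` inside each group of `b`, weighted by group size."""
    a, b = np.asarray(a), np.asarray(b)
    return float(sum((b == g).mean() * entropy(a[b == g]) for g in np.unique(b)))


def manual_homogeneity(classes, clusters):
    """1 - H(class | cluster) / H(class); taken to be 1 when there is a single class."""
    h = entropy(classes)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(classes, clusters) / h


def manual_completeness(classes, clusters):
    """Completeness is homogeneity with the roles of classes and clusters swapped."""
    return manual_homogeneity(clusters, classes)


# Hypothetical toy labels: class 1 is split, and cluster 2 mixes classes 1 and 2.
classes = [0, 0, 1, 1, 2, 2]
clusters = [0, 0, 1, 2, 2, 2]

print(manual_homogeneity(classes, clusters), homogeneity_score(classes, clusters))
print(manual_completeness(classes, clusters), completeness_score(classes, clusters))
```

Each printed pair should agree up to floating-point error, which illustrates that completeness is simply homogeneity evaluated with the two label vectors exchanged.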


#Q2
The V-measure is a metric used for clustering evaluation that combines homogeneity and completeness into a single score. It provides a balance between these two aspects of clustering quality. The V-measure is the harmonic mean of homogeneity and completeness, and it is defined as follows:

$$V = \frac{2 \times \text{homogeneity} \times \text{completeness}}{\text{homogeneity} + \text{completeness}}$$

The V-measure ranges from 0 to 1, with 1 indicating a perfect clustering solution where clusters are both homogeneous and complete.

In [3]:
#2
from sklearn.metrics import v_measure_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Calculate V-measure
v_measure = v_measure_score(y, labels)

print(f'V-measure Score: {v_measure:.4f}')

V-measure Score: 1.0000


The V-measure balances the trade-off between homogeneity and completeness in clustering evaluation. It is particularly valuable when you want a single measure that captures the overall quality of a clustering solution.
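
As a quick check of the harmonic-mean definition, the sketch below compares scikit-learn's V-measure with the value computed by hand. The toy labels are hypothetical and chosen so that homogeneity and completeness differ. It also uses the optional `beta` parameter of `v_measure_score`, which goes beyond the equal-weight case discussed here.

```python
from sklearn.metrics import homogeneity_completeness_v_measure, v_measure_score

# Hypothetical toy labels: class 0 is split across clusters 0 and 1.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 2, 2, 2]

h, c, v = homogeneity_completeness_v_measure(true_labels, cluster_labels)

# Harmonic mean of homogeneity and completeness (the beta = 1 case).
harmonic_mean = 2 * h * c / (h + c)
print(f'h={h:.4f}, c={c:.4f}, V={v:.4f}, harmonic mean={harmonic_mean:.4f}')

# beta > 1 weights completeness more strongly; beta < 1 favours homogeneity.
beta = 2.0
weighted = (1 + beta) * h * c / (beta * h + c)
v_beta = v_measure_score(true_labels, cluster_labels, beta=beta)
print(f'V(beta=2)={v_beta:.4f}, formula={weighted:.4f}')
```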

#Q3
The Silhouette Coefficient is a metric used to evaluate the quality of a clustering result. It measures how well-separated clusters are and takes into account both the cohesion within clusters and the separation between clusters. The Silhouette Coefficient for a single data point is defined as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

Where:

a(i) is the average distance from the i-th data point to the other data points in the same cluster.

b(i) is the smallest average distance from the i-th data point to the data points in any other cluster, that is, the average distance to the nearest neighboring cluster.

The Silhouette Coefficient for the entire clustering solution is the average of the silhouette scores for all data points.

The range of Silhouette Coefficient values is between -1 and 1. A higher value indicates better-defined clusters, with a score close to 1 suggesting well-separated clusters, a score around 0 indicating overlapping clusters, and negative values indicating that data points might be assigned to the wrong clusters.

In [6]:
#3
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering with different number of clusters
for n_clusters in range(2, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X)
    
    # Calculate Silhouette Coefficient
    silhouette_avg = silhouette_score(X, labels)
    
    print(f'Number of clusters: {n_clusters}, Silhouette Coefficient: {silhouette_avg:.4f}')

Number of clusters: 2, Silhouette Coefficient: 0.7049
Number of clusters: 3, Silhouette Coefficient: 0.8480
Number of clusters: 4, Silhouette Coefficient: 0.6630
Number of clusters: 5, Silhouette Coefficient: 0.5014
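
The per-point definition can also be evaluated directly. The following is a minimal sketch, assuming Euclidean distance and a hypothetical helper `manual_silhouette`, that computes a(i) and b(i) from a pairwise distance matrix and compares the result with scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples


def manual_silhouette(X, labels):
    """Per-point silhouette values computed directly from the definition."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = pairwise_distances(X)  # Euclidean distances between all pairs
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: the silhouette is taken to be 0
        # a(i): mean distance to the *other* members of the same cluster
        a = D[i, own].sum() / (own.sum() - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(D[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Hypothetical toy data: two tight groups on a line.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

print(manual_silhouette(X_toy, labels_toy))
print(silhouette_samples(X_toy, labels_toy))
```

Averaging the per-point values gives the overall score reported by `silhouette_score`.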


#Q4
The Davies-Bouldin Index is a metric used for evaluating the quality of a clustering result. It provides a measure of the compactness and separation of clusters in a clustering solution. The index is calculated based on the pairwise similarity between clusters, considering both intra-cluster cohesion and inter-cluster separation.

The index is non-negative and has no upper bound. The minimum value is 0, and lower values indicate better clustering solutions.

In [7]:
#4
from sklearn.metrics import davies_bouldin_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering with different number of clusters
for n_clusters in range(2, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X)
    
    # Calculate Davies-Bouldin Index
    db_index = davies_bouldin_score(X, labels)
    
    print(f'Number of clusters: {n_clusters}, Davies-Bouldin Index: {db_index:.4f}')

Number of clusters: 2, Davies-Bouldin Index: 0.4360
Number of clusters: 3, Davies-Bouldin Index: 0.2123
Number of clusters: 4, Davies-Bouldin Index: 0.7068
Number of clusters: 5, Davies-Bouldin Index: 0.9802


#Q5
Yes, it is possible for a clustering result to have high homogeneity but low completeness. Homogeneity measures whether each cluster contains only members of a single class, while completeness measures whether all members of a given class are assigned to the same cluster. The combination arises when a class is split across several clusters that are each pure.

Consider a scenario with two classes (A and B) and three clusters (Cluster 0, Cluster 1, and Cluster 2):

Cluster 0: Contains all instances of Class A.

Cluster 1: Contains half of the instances of Class B.

Cluster 2: Contains the other half of the instances of Class B.

In this case, every cluster contains instances of only one class, so homogeneity is perfect. However, the instances of Class B are divided between Cluster 1 and Cluster 2, so completeness is low.

In [8]:
#5
from sklearn.metrics import homogeneity_score, completeness_score

# Ground truth labels: class 0 (A) and class 1 (B)
true_labels = [0, 1, 1, 0, 1, 1]

# Cluster assignments: class 1 is split between clusters 1 and 2
cluster_assignments = [0, 1, 2, 0, 1, 2]

homogeneity = homogeneity_score(true_labels, cluster_assignments)
completeness = completeness_score(true_labels, cluster_assignments)

print(f'Homogeneity: {homogeneity:.4f}')
print(f'Completeness: {completeness:.4f}')

Homogeneity: 1.0000
Completeness: 0.5794
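
To see which of the two scores a given clustering sacrifices, it can help to inspect the contingency matrix of classes against clusters. The sketch below uses hypothetical toy labels to contrast a clustering that splits a class (high homogeneity, lower completeness) with one that merges classes (high completeness, lower homogeneity).

```python
from sklearn.metrics import homogeneity_score, completeness_score
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical toy labels: three classes with two points each.
classes = [0, 0, 1, 1, 2, 2]

# Case 1: class 0 is split over clusters 0 and 1, but every cluster is pure.
split = [0, 1, 2, 2, 3, 3]
# Case 2: classes 1 and 2 are merged into cluster 1, but no class is split.
merged = [0, 0, 1, 1, 1, 1]

for name, clusters in [('split', split), ('merged', merged)]:
    print(name)
    print(contingency_matrix(classes, clusters))  # rows: classes, columns: clusters
    print(f'  homogeneity={homogeneity_score(classes, clusters):.4f}, '
          f'completeness={completeness_score(classes, clusters):.4f}')
```

A row with more than one non-zero entry marks a class that is split across clusters, which lowers completeness. A column with more than one non-zero entry marks a mixed cluster, which lowers homogeneity.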


#Q6

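The V-measure can be used to determine the optimal number of clusters when ground-truth labels are available: compute the V-measure for each candidate number of clusters and choose the number that maximizes it. Because the score depends on the true labels, this approach is suited to benchmarking on labeled data rather than to purely unsupervised model selection. The V-measure is also not adjusted for chance, so comparisons across very different numbers of clusters should be read with care.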

In [9]:
#6
from sklearn.metrics import v_measure_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with ground truth labels
X, true_labels = make_blobs(n_samples=300, centers=3, random_state=42)

# Evaluate V-measure for different cluster numbers
for n_clusters in range(2, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_assignments = kmeans.fit_predict(X)
    
    v_measure = v_measure_score(true_labels, cluster_assignments)
    
    print(f'Number of clusters: {n_clusters}, V-measure: {v_measure:.4f}')

Number of clusters: 2, V-measure: 0.7337
Number of clusters: 3, V-measure: 1.0000
Number of clusters: 4, V-measure: 0.9091
Number of clusters: 5, V-measure: 0.8279
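
The selection step itself can be automated. The following is a minimal sketch that stores the score for each candidate and picks the maximizer. Setting `n_init=10` explicitly is an assumption made here so that the result does not depend on the library version's default number of restarts.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Same synthetic data as above, with ground truth labels.
X, true_labels = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for n_clusters in range(2, 6):
    # n_init=10 (an assumption): run KMeans from 10 initialisations, keep the best.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    scores[n_clusters] = v_measure_score(true_labels, kmeans.fit_predict(X))

# Candidate with the highest V-measure.
best_k = max(scores, key=scores.get)
print(f'Best number of clusters by V-measure: {best_k}')
```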


#Q7

Advantages:
Interpretability: The Silhouette Coefficient is bounded between -1 and 1, which makes it easy to understand and interpret.

No dependence on ground truth: Unlike homogeneity, completeness, and the V-measure, the Silhouette Coefficient does not require ground truth labels, making it applicable in unsupervised scenarios.

Disadvantages:
Sensitive to shape and density: The Silhouette Coefficient tends to favor compact, convex clusters, so it may score a sensible clustering poorly when clusters are elongated, non-convex, or of varying density.

Sensitive to varying cluster sizes: The Silhouette Coefficient may be misleading when clusters differ significantly in size.

In [10]:
#7
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering with different number of clusters
for n_clusters in range(2, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X)
    
    # Calculate Silhouette Coefficient
    silhouette_avg = silhouette_score(X, labels)
    
    print(f'Number of clusters: {n_clusters}, Silhouette Coefficient: {silhouette_avg:.4f}')

Number of clusters: 2, Silhouette Coefficient: 0.7049
Number of clusters: 3, Silhouette Coefficient: 0.8480
Number of clusters: 4, Silhouette Coefficient: 0.6630
Number of clusters: 5, Silhouette Coefficient: 0.5014
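
The sensitivity to cluster shape can be illustrated on non-convex data. The sketch below is an illustration under assumed settings (`make_moons` with `noise=0.05`, two clusters, `n_init=10`); it compares the silhouette of the true half-moon grouping with that of a KMeans partition. On data of this kind the KMeans partition typically scores higher even though it cuts across the moons.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving half-moons: the true groups are not convex.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

# KMeans separates the plane with a straight boundary, cutting through both moons.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

true_silhouette = silhouette_score(X_moons, y_moons)
kmeans_silhouette = silhouette_score(X_moons, kmeans_labels)

print(f'Silhouette of the true moon labels: {true_silhouette:.4f}')
print(f'Silhouette of the KMeans labels:    {kmeans_silhouette:.4f}')
```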


#Q8
Limitations:

Sensitive to dimensionality: The Davies-Bouldin Index tends to perform poorly in high-dimensional spaces.

Assumes spherical clusters: It assumes that clusters are spherical, which may not be true in real-world data.

Dependent on cluster shapes: Performance may vary based on the shapes and sizes of the clusters.

Overcoming Limitations:

Dimensionality reduction: Apply a dimensionality reduction technique such as PCA before computing the Davies-Bouldin Index.

Consider other indices: Combine the Davies-Bouldin Index with other metrics for a more comprehensive evaluation.

In [11]:
#8
from sklearn.metrics import davies_bouldin_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

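# Generate synthetic data with 3 clusters (make_blobs produces 2 features by default)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply PCA for dimensionality reduction
# Note: the data are already 2-D, so PCA with 2 components only rotates the axes
# and the scores below match those in Q4; the benefit appears with higher-dimensional data.
pca = PCA(n_components=2)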
X_pca = pca.fit_transform(X)

# Apply KMeans clustering with different number of clusters
for n_clusters in range(2, 6):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X_pca)
    
    # Calculate Davies-Bouldin Index
    db_index = davies_bouldin_score(X_pca, labels)
    
    print(f'Number of clusters: {n_clusters}, Davies-Bouldin Index: {db_index:.4f}')

Number of clusters: 2, Davies-Bouldin Index: 0.4360
Number of clusters: 3, Davies-Bouldin Index: 0.2123
Number of clusters: 4, Davies-Bouldin Index: 0.7068
Number of clusters: 5, Davies-Bouldin Index: 0.9802
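
For dimensionality reduction to change anything, the input has to be higher-dimensional to begin with. The following is a minimal sketch under an assumed setting of 20 features (`n_features=20`, not used in the original cell) that computes the index both in the original space and after projecting to 2 components. Whether the reduction helps depends on the data, so the two numbers are best read as an illustration rather than a general result.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score

# Assumption for illustration: 20 features instead of make_blobs' default of 2.
X_high, _ = make_blobs(n_samples=300, centers=3, n_features=20, random_state=42)

# Project onto the two directions of largest variance.
X_reduced = PCA(n_components=2).fit_transform(X_high)

# Cluster and score separately in each representation.
for name, data in [('original (20-D)', X_high), ('PCA (2-D)', X_reduced)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(data)
    print(f'{name}: Davies-Bouldin Index = {davies_bouldin_score(data, labels):.4f}')
```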


#Q9

Homogeneity: Measures the extent to which each cluster contains only members of a single class.

Completeness: Measures the extent to which all members of a given class are assigned to the same cluster.

V-measure: Represents the harmonic mean of homogeneity and completeness.

They can have different values for the same clustering result because they penalize different errors. For example, if a clustering solution splits one class across several clusters, each of which contains only that class, homogeneity stays high while completeness drops. Conversely, if it merges several classes into a single cluster, completeness stays high while homogeneity drops. The V-measure, being the harmonic mean, lies between the two.

In [12]:
#9
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Calculate homogeneity, completeness, and V-measure
homogeneity = homogeneity_score(y, labels)
completeness = completeness_score(y, labels)
v_measure = v_measure_score(y, labels)

print(f'Homogeneity Score: {homogeneity:.4f}')
print(f'Completeness Score: {completeness:.4f}')
print(f'V-measure Score: {v_measure:.4f}')

Homogeneity Score: 1.0000
Completeness Score: 1.0000
V-measure Score: 1.0000


#Q10
The Silhouette Coefficient can be used to compare different clustering algorithms on the same dataset: fit each algorithm, compute the Silhouette Coefficient of its labels on the same data with the same distance metric, and prefer the algorithm with the highest average score.

In [13]:
#10
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Generate synthetic data with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
kmeans_silhouette = silhouette_score(X, kmeans_labels)

# Apply AgglomerativeClustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_labels = agg_clustering.fit_predict(X)
agg_silhouette = silhouette_score(X, agg_labels)

print(f'KMeans Silhouette Coefficient: {kmeans_silhouette:.4f}')
print(f'AgglomerativeClustering Silhouette Coefficient: {agg_silhouette:.4f}')

KMeans Silhouette Coefficient: 0.8480
AgglomerativeClustering Silhouette Coefficient: 0.8480
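
When more than two candidates are involved, the comparison can be written as a loop over a dictionary of models. The sketch below is one possible arrangement; the choice of candidates (KMeans plus two linkage variants of `AgglomerativeClustering`) and the `n_init=10` setting are assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Candidate algorithms (an illustrative selection), all asked for 3 clusters.
models = {
    'KMeans': KMeans(n_clusters=3, n_init=10, random_state=42),
    'Agglomerative (ward)': AgglomerativeClustering(n_clusters=3, linkage='ward'),
    'Agglomerative (single)': AgglomerativeClustering(n_clusters=3, linkage='single'),
}

# Score every model on the same data with the same (Euclidean) metric.
scores = {name: silhouette_score(X, model.fit_predict(X)) for name, model in models.items()}

# Report from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.4f}')
```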


#Q11
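Davies-Bouldin Index: For each cluster, finds the most similar other cluster, then averages these worst-case similarity values over all clusters.

Here the similarity between two clusters is the ratio of within-cluster scatter to between-cluster separation: the sum of the two clusters' average point-to-centroid distances, divided by the distance between their centroids. Lower values indicate better separation and compactness. The underlying assumption is that good clusters are compact around their centroids and far apart from one another, which suits roughly convex clusters measured with Euclidean distance.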

In [14]:
#11
from sklearn.metrics import davies_bouldin_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Calculate Davies-Bouldin Index
db_index = davies_bouldin_score(X, labels)

print(f'Davies-Bouldin Index: {db_index:.4f}')


Davies-Bouldin Index: 0.2123
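
The calculation described above can be written out by hand. The following is a minimal sketch, assuming Euclidean distance and a hypothetical helper `manual_davies_bouldin`, that is compared against `davies_bouldin_score`. For brevity it uses the generating labels from `make_blobs` in place of fitted cluster assignments.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def manual_davies_bouldin(X, labels):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    # Scatter of each cluster: mean distance of its points to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ids, centroids)
    ])
    worst_ratios = []
    for i in range(len(ids)):
        # Similarity of cluster i to every other cluster j.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst_ratios.append(max(ratios))  # keep the most similar neighbour
    return float(np.mean(worst_ratios))


# Assumption for brevity: the generating labels stand in for cluster assignments.
X, y = make_blobs(n_samples=300, centers=3, random_state=42)
print(manual_davies_bouldin(X, y), davies_bouldin_score(X, y))
```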


#Q12

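Yes, the Silhouette Coefficient can be used to evaluate hierarchical clustering algorithms. Because a hierarchical algorithm produces a tree of nested merges rather than a single partition, the tree must first be cut into flat clusters. The Silhouette Coefficient can then be computed for the cluster assignments at each candidate cut level (number of clusters), and the levels compared.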

In [16]:
#12

from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply AgglomerativeClustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_labels = agg_clustering.fit_predict(X)

# Calculate Silhouette Coefficient
agg_silhouette = silhouette_score(X, agg_labels)

print(f'AgglomerativeClustering Silhouette Coefficient: {agg_silhouette:.4f}')

AgglomerativeClustering Silhouette Coefficient: 0.8480
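
To compare several cut levels without refitting the model each time, one option is to build the merge tree once with SciPy and cut it at different numbers of clusters. The sketch below assumes Ward linkage and uses `scipy.cluster.hierarchy.linkage` and `fcluster`, which are not used elsewhere in this notebook.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Build the full merge tree once (Ward linkage is an assumption made here).
Z = linkage(X, method='ward')

scores = {}
for n_clusters in range(2, 6):
    # Cut the tree so that at most n_clusters flat clusters remain.
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    scores[n_clusters] = silhouette_score(X, labels)
    print(f'Number of clusters: {n_clusters}, Silhouette Coefficient: {scores[n_clusters]:.4f}')

# Cut level with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(f'Best cut by Silhouette Coefficient: {best_k} clusters')
```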
