# Day 33: Clustering Evaluation Metrics

Welcome to Day 33! Unlike supervised learning, where you have a clear target variable to measure your model against, evaluating unsupervised clustering models is more nuanced. Since there's no "correct" answer, we use metrics that assess the quality and compactness of the clusters themselves.

## Topics Covered:

- The Challenge of Evaluating Unsupervised Models

- Common Clustering Metrics
    - Inertia

    - Silhouette Score

    - Davies-Bouldin Index

    - Calinski-Harabasz Index

- How to Interpret These Metrics

## The Challenge of Unsupervised Evaluation

In supervised learning, we have an answer key (the labels). We can easily measure how many predictions were correct using metrics like accuracy or the F1 score.

In unsupervised learning, however, we don't have labels. The model creates its own structure, and we need a way to measure how "good" that structure is without an answer key.

So, we use **internal evaluation metrics** to assess the compactness and separation of clusters:
- **Compactness**: Points within a cluster should be close to each other.
- **Separation**: Clusters should be well-separated from one another.

## Internal Metrics

### Inertia

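Inertia is the sum of squared distances from each sample to its closest cluster center. It measures how compact the clusters are: lower values mean tighter clusters. Because inertia keeps shrinking as more clusters are added, it is most useful for comparing models with the same number of clusters, or for spotting an "elbow" when plotted against the number of clusters.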
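As a minimal sketch of the definition above, the snippet below computes inertia by hand with only the standard library. The toy points, the centers, and the helper names `squared_distance` and `inertia` are made up for illustration; they are not part of any library.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertia(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data (illustrative): two tight pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

tight = inertia(points, [(0.0, 1.0), (10.0, 1.0)])  # one center per pair
loose = inertia(points, [(5.0, 1.0)])               # a single center in the middle

print(tight, loose)  # 4.0 104.0
```

If you cluster with scikit-learn's `KMeans`, the fitted model exposes the same quantity as its `inertia_` attribute.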

### Silhouette Score

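The silhouette score measures **how similar a point is to its own cluster compared with other clusters**.

- Ranges from **-1 to 1**:
  - +1 → well-clustered (far from neighboring clusters)
  - 0 → on or near the boundary between two clusters
  - -1 → probably assigned to the wrong cluster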

### 📐 Silhouette Score Formula

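For a point $i$:

- $a(i)$: average distance from $i$ to the other points in the **same** cluster
- $b(i)$: lowest average distance from $i$ to the points of any **other** cluster (that is, the average distance to the nearest neighboring cluster)

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$

The silhouette score of a whole clustering is the mean of $s(i)$ over all points.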
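To make the formula concrete, here is a small from-scratch sketch that computes the silhouette value of every point and then averages them. The one-dimensional toy data and the function name `silhouette_values` are illustrative assumptions. Scoring single-point clusters as 0 is a common convention adopted here, not something the formula itself dictates.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette values (b - a) / max(a, b).

    Assumes at least two clusters and that not all points coincide.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the OTHER points in the same cluster
        # (the point's distance to itself is 0, so only the divisor changes)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Toy data (illustrative): two obvious groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

good = silhouette_values(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_values(points, [0, 1, 0, 1])   # groups mixed up

good_score = sum(good) / len(good)  # close to +1
bad_score = sum(bad) / len(bad)     # negative
```

The sensible grouping scores about 0.90, while the mixed-up labeling comes out at about -0.45. For real datasets, scikit-learn provides `silhouette_score` and `silhouette_samples` in `sklearn.metrics`.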


### Davies-Bouldin Index

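The Davies-Bouldin Index measures the average similarity between each cluster and the cluster most similar to it. Here, similarity is the ratio of within-cluster spread to between-cluster distance, so compact, well-separated clusters have low similarity. A lower Davies-Bouldin score indicates a better clustering; the lowest possible score is 0.

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left(\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\right)$$

where $k$ is the number of clusters, $c_i$ is the centroid of cluster $i$, $\sigma_i$ is the average distance from the points in cluster $i$ to $c_i$, and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$.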
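The formula can be sketched directly in plain Python. The snippet below assumes Euclidean distance, takes the centroid as the coordinate-wise mean of each cluster, and uses the mean distance to the centroid as the spread $\sigma_i$. The toy data and the function name `davies_bouldin` are illustrative.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distances (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # c_i: coordinate-wise mean of the points in each cluster
    centroids = {
        lab: tuple(sum(coords) / len(members) for coords in zip(*members))
        for lab, members in clusters.items()
    }
    # sigma_i: mean distance from the cluster's points to its centroid
    sigma = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # worst-case (largest) similarity between cluster i and any other cluster
        total += max(
            (sigma[i] + sigma[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data (illustrative): two tight pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

db_good = davies_bouldin(points, [0, 0, 1, 1])  # pairs kept together
db_bad = davies_bouldin(points, [0, 1, 0, 1])   # pairs split across clusters
```

Keeping the pairs together gives a score of 0.2, while splitting them gives 5.0, which matches the "lower is better" reading. scikit-learn offers the same metric as `sklearn.metrics.davies_bouldin_score`.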

### Calinski-Harabasz Index

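Also known as the Variance Ratio Criterion, this index is the ratio of between-cluster variance to within-cluster variance. A higher score means the clusters are denser and better separated. The score has no fixed upper bound, so it is mainly useful for comparing different clusterings of the same dataset.

$$CH = \frac{SS_B / (k-1)}{SS_W / (N-k)}$$

where $SS_B$ is the between-cluster sum of squares (squared distances from each cluster centroid to the overall centroid, weighted by cluster size), $SS_W$ is the within-cluster sum of squares (squared distances from each point to its own cluster centroid), $N$ is the number of points, and $k$ is the number of clusters.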
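A rough from-scratch version of this ratio is shown below, using only the standard library. It assumes squared Euclidean distances and more than one but fewer than $N$ clusters. The helper names and toy data are illustrative.

```python
def mean_point(pts):
    """Coordinate-wise mean of a list of points (tuples)."""
    return tuple(sum(coords) / len(pts) for coords in zip(*pts))


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (SS_B / (k - 1)) / (SS_W / (N - k))."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    n, k = len(points), len(clusters)
    overall = mean_point(points)

    # SS_B: size-weighted squared distance of each centroid to the overall mean
    ss_b = sum(
        len(members) * squared_distance(mean_point(members), overall)
        for members in clusters.values()
    )
    # SS_W: squared distance of every point to its own cluster centroid
    ss_w = sum(
        squared_distance(p, mean_point(members))
        for members in clusters.values()
        for p in members
    )
    return (ss_b / (k - 1)) / (ss_w / (n - k))


# Toy data (illustrative): two tight pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

ch_good = calinski_harabasz(points, [0, 0, 1, 1])  # pairs kept together
ch_bad = calinski_harabasz(points, [0, 1, 0, 1])   # pairs split across clusters
```

On this toy data the sensible grouping scores 50.0 and the mixed-up one 0.08. The equivalent scikit-learn function is `sklearn.metrics.calinski_harabasz_score`.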

## External Metrics

### Adjusted Rand Index (ARI)