# Clustering

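Clustering is an unsupervised learning technique that groups observations so that items within a group are more similar to each other than to items in other groups.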

```{admonition} PCA vs Clustering

PCA: finds a low-dimensional representation <br>
Clustering: finds subgroups among observations

```

## K-Means Clustering

K-means clustering is a widely used unsupervised machine learning algorithm for partitioning a dataset into $ K $ clusters, where each data point belongs to the cluster with the nearest mean (centroid).
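
One common way to make "nearest mean" precise is to read K-means as minimizing the within-cluster sum of squared distances. The notation here is an assumption introduced for illustration: $ C_j $ is the set of points in cluster $ j $ and $ \mu_j $ is its centroid.

$$
\min_{C_1, \dots, C_K} \; \sum_{j=1}^{K} \sum_{x_i \in C_j} \| x_i - \mu_j \|_2^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

The algorithm generally finds a local minimum of this objective rather than the global one, so the result can depend on the initial centroids.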


### Mathematical Explanation

#### Step-by-Step Process

1. Initialization:
   - Choose the number of clusters $ K $.
   - Initialize $ K $ centroids randomly from the dataset.

2. Assignment Step:
   - Assign each data point to the nearest centroid. This is done by computing the Euclidean distance between each data point and each centroid:
     $$
     \text{distance}(x_i, \mu_j) = \| x_i - \mu_j \|_2
     $$
     where $ x_i $ is a data point and $ \mu_j $ is a centroid.

3. Update Step:
   - Update the centroid of each cluster to be the mean of the data points assigned to that cluster:
     $$
     \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
     $$
     where $ C_j $ is the set of points assigned to the $ j $-th cluster.

4. Convergence:
   - Repeat the Assignment and Update steps until the centroids no longer change significantly or a maximum number of iterations is reached.
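
The Assignment and Update steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `assign_step` and `update_step`, the toy data, and the choice to keep the old centroid for an empty cluster are all assumptions made for this example.

```python
import math

def assign_step(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))  # ties go to the lowest index
    return labels

def update_step(points, labels, centroids):
    """Return new centroids as the mean of each cluster's points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid
            new_centroids.append(old)
            continue
        dim = len(old)
        mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        new_centroids.append(mean)
    return new_centroids

# Hypothetical toy data: two well-separated pairs of points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]

labels = assign_step(points, centroids)
centroids = update_step(points, labels, centroids)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 1.0), (10.0, 1.0)]
```

Calling the two functions in a loop until the centroids stop moving gives the full procedure.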

### Example

Suppose we have a 2D dataset with the following points: $(1, 2)$, $(3, 4)$, $(5, 6)$, $(8, 8)$, $(9, 10)$, and we want to cluster them into $ K = 2 $ clusters.

1. Initialization:
   - Randomly select initial centroids, e.g., $(1, 2)$ and $(9, 10)$.

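2. First Assignment Step:
   - Compute the distances:
     - $(1, 2)$ to $(1, 2)$: $0$
     - $(1, 2)$ to $(9, 10)$: $11.31$
     - $(3, 4)$ to $(1, 2)$: $2.83$
     - $(3, 4)$ to $(9, 10)$: $8.49$
     - $(5, 6)$ to $(1, 2)$: $5.66$
     - $(5, 6)$ to $(9, 10)$: $5.66$
     - $(8, 8)$ to $(1, 2)$: $9.22$
     - $(8, 8)$ to $(9, 10)$: $2.24$
     - $(9, 10)$ to $(1, 2)$: $11.31$
     - $(9, 10)$ to $(9, 10)$: $0$

   - Assign points to the nearest centroids. The point $(5, 6)$ is equidistant from both centroids, so the tie has to be broken by some rule; here it is assigned to Cluster 2:
     - Cluster 1: $(1, 2)$, $(3, 4)$
     - Cluster 2: $(5, 6)$, $(8, 8)$, $(9, 10)$

3. First Update Step:
   - Compute new centroids:
     - Cluster 1: $(2, 3)$
     - Cluster 2: $(7.33, 8)$

4. Repeat until convergence. Here the second assignment step leaves every point in its current cluster, so the centroids do not move and the algorithm stops.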
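
The distances in the first assignment step can be double-checked with a few lines of plain Python. This is only a sketch: the `rows` list and the `"tie"` label are illustrative names, not part of the example itself.

```python
import math

points = [(1, 2), (3, 4), (5, 6), (8, 8), (9, 10)]
centroids = [(1, 2), (9, 10)]  # initial centroids from the example

rows = []
for p in points:
    d1 = math.dist(p, centroids[0])
    d2 = math.dist(p, centroids[1])
    if math.isclose(d1, d2):
        nearest = "tie"  # equidistant: a tie-breaking rule is needed
    elif d1 < d2:
        nearest = "Cluster 1"
    else:
        nearest = "Cluster 2"
    rows.append((p, round(d1, 2), round(d2, 2), nearest))
    print(p, round(d1, 2), round(d2, 2), nearest)
```

The row for $(5, 6)$ is reported as a tie, which is why that point's cluster depends on the tie-breaking rule.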


The PyTorch implementation below includes:

- Initialization: Randomly selecting initial centroids from the dataset.
- Distance Calculation: Using `torch.cdist` to compute Euclidean distances between points and centroids.
- Label Assignment: Assigning each point to the nearest centroid.
- Centroid Update: Updating centroids to be the mean of the assigned points.
- Convergence Check: Checking if the centroids have stabilized within a tolerance level.

This code can be adapted for different datasets and configurations by adjusting the input tensor `X`, the number of clusters `K`, and the maximum iterations `max_iters`.

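```python
import torch

def kmeans(X, K, max_iters=100, tol=1e-4):
    N, D = X.shape
    # Sample K distinct points as initial centroids. Sampling with
    # replacement could pick the same point twice, leaving a cluster
    # empty and its mean equal to nan.
    centroids = X[torch.randperm(N)[:K]]

    for _ in range(max_iters):
        # Pairwise Euclidean distances between points and centroids, shape (N, K)
        distances = torch.cdist(X, centroids)

        # Assign each point to its nearest centroid
        labels = torch.argmin(distances, dim=1)

        # Recompute each centroid as the mean of its assigned points;
        # a cluster that ends up empty keeps its previous centroid
        new_centroids = centroids.clone()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(dim=0)

        # Stop once the centroids have stabilized within the tolerance
        shift = torch.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return labels, centroids

X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [8.0, 8.0], [9.0, 10.0]])
K = 2
labels, centroids = kmeans(X, K)
print("Labels:", labels)
print("Centroids:", centroids)
```

The labels and centroids vary between runs because the initial centroids are chosen at random. One possible result for this dataset groups $(1, 2)$ and $(3, 4)$ in one cluster and the remaining three points in the other.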
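
Since the outcome of K-means depends on where the centroids start, it can help to compare candidate partitions numerically. The sketch below does this in plain Python for two ways of splitting the example dataset, assuming that a lower within-cluster sum of squared distances indicates a tighter clustering. The helper name `wcss` and the variables `split_a` and `split_b` are hypothetical.

```python
def wcss(clusters):
    """Within-cluster sum of squared distances for a list of clusters."""
    total = 0.0
    for members in clusters:
        dim = len(members[0])
        # Centroid of this cluster: coordinate-wise mean of its points
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

# Two candidate partitions of the example dataset
split_a = [[(1, 2), (3, 4)], [(5, 6), (8, 8), (9, 10)]]
split_b = [[(1, 2), (3, 4), (5, 6)], [(8, 8), (9, 10)]]

print(round(wcss(split_a), 2))  # 20.67
print(round(wcss(split_b), 2))  # 18.5
```

Running K-means several times from different initial centroids and keeping the run with the lowest value is one common way to reduce the sensitivity to initialization.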
