# K-Means Clustering:

[See this visualization of k-means](http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html)

## Algorithm:

1. Initialize the cluster centers $c_1^0, c_2^0, \ldots, c_K^0$ and set $t=1$.
2. Compute $\delta^t$: Assign each data point to the closest cluster center.
   - $\displaystyle\delta^t=\underset{\delta}{\operatorname{argmin}}\frac{1}{N}\sum_{j=1}^N\sum_{i=1}^K \delta_{ij}\lVert c_i^{t-1}-x_j\rVert^2$
   - where:
     - $t$: iteration count,
     - $j$: index of data point, $x_j$: $j$th data point, $N$: number of data points,
     - $i$: index of cluster center, $c_i$: $i$th cluster center, $K$: number of clusters,
     - $\delta^t$: the set of assignments at iteration $t$, with $\delta_{ij}^t=1$ if data point $x_j$ is assigned to the cluster with center $c_i$ and $0$ otherwise (each point belongs to exactly one cluster).
3. Compute $c^t$: Update each cluster center as the mean of the points belonging to that cluster.
    - $\displaystyle c^t=\underset{c}{\operatorname{argmin}}\frac{1}{N}\sum_{j=1}^N\sum_{i=1}^K \delta_{ij}^{t}\lVert c_i-x_j\rVert^2$
4. Update $t$: set $t=t+1$ and repeat steps 2-3 until the assignments no longer change.
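
Both minimizations in steps 2 and 3 have closed-form solutions, which is what "closest cluster center" and "mean of the points" refer to. As a sketch, assuming $\delta_{ij}\in\{0,1\}$ indicates whether $x_j$ is assigned to cluster $i$ and that no cluster is empty:

$$
\delta_{ij}^{t}=
\begin{cases}
1 & \text{if } i=\underset{l}{\operatorname{argmin}}\,\lVert c_l^{t-1}-x_j\rVert^2\\
0 & \text{otherwise}
\end{cases}
\qquad
c_i^{t}=\frac{\sum_{j=1}^{N}\delta_{ij}^{t}\,x_j}{\sum_{j=1}^{N}\delta_{ij}^{t}}
$$

The center update follows from setting the gradient of the objective with respect to $c_i$ to zero: $\sum_{j=1}^{N}\delta_{ij}^{t}\,2(c_i-x_j)=0$.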

## Facts:

- Unsupervised learning method.
- Converges to a local minimum, not necessarily the global one; the result depends on the initialization.
- Sensitive to outliers.
- Requires choosing $K$ in advance:
  - Finding a good number of clusters takes experiments, e.g. looking for the elbow/knee in the objective as $K$ grows.
- Treats all clusters alike: each cluster is represented only by its center, so clusters are implicitly assumed to have similar size and spread.
- Might be slow: $O(KNd)$ per iteration for $N$-many $d$-dimensional data points with $K$ clusters.
- Fits roughly spherical clusters best.
- For image segmentation, clustering pixels on (r,g,b,x,y) values instead of color alone enforces more spatial coherence.
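
To illustrate the last point, here is a minimal sketch of how (r,g,b,x,y) feature vectors might be built from an image before clustering. The function name `rgbxy_features` and the `spatial_weight` parameter are assumptions made for this example; how to balance color against position is a design choice that these notes do not specify.

```python
import numpy as np

def rgbxy_features(image, spatial_weight=1.0):
    """Build one (r, g, b, x, y) row per pixel from an (H, W, 3) image array.

    spatial_weight is a hypothetical knob: larger values make pixel position
    count more relative to color when distances are computed.
    """
    h, w, _ = image.shape
    # ys[i, j] = i (row index), xs[i, j] = j (column index)
    ys, xs = np.mgrid[0:h, 0:w]
    colors = image.reshape(-1, 3).astype(float)
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    return np.hstack([colors, spatial_weight * coords])
```

The result has shape `(H * W, 5)` and can be passed to any k-means routine that expects one feature vector per row. Color values and pixel coordinates usually live on different scales, so some rescaling is likely needed in practice.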

## Python Implementation:
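```python
import numpy as np

def kmeans(feat_vecs, k, max_iter=10):
    """Cluster the rows of feat_vecs (shape (n, m)) into k groups; return the (k, m) centers."""
    n, m = feat_vecs.shape
    if k > n:
        raise ValueError("k must not exceed the number of data points")
    # Bounding box of the data, used to draw the initial centers
    low, high = feat_vecs.min(axis=0), feat_vecs.max(axis=0)
    proceed = True
    centers = None
    while proceed:
        proceed = False
        # Draw initial centers uniformly within the bounding box of the data
        centers = np.random.uniform(low, high, size=(k, m))
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            # Update cluster assignments
            for i in range(n):
                # Euclidean distance from point i to every center
                dist = np.linalg.norm(centers - feat_vecs[i], axis=1)
                clusters[np.argmin(dist)].append(i)
            # Restart algorithm in case of empty clusters
            if any(len(c) == 0 for c in clusters):
                proceed = True
                break
            # Update centers
            for i in range(k):
                centers[i] = feat_vecs[clusters[i]].mean(axis=0)
    return centers
```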
[Benchmark of different k-means implementations in different libraries.](https://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html)
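
The assignment step and the objective from the Algorithm section can also be evaluated for a fixed set of centers in vectorized form. The sketch below is one possible way to do this with NumPy broadcasting; the function name `assign_and_score` is hypothetical and not part of the implementation above.

```python
import numpy as np

def assign_and_score(feat_vecs, centers):
    """Return (labels, objective) for fixed centers.

    labels[j] is the index of the center closest to point j; objective is the
    mean squared distance from each point to its assigned center.
    """
    # Pairwise squared distances, shape (n, k), via broadcasting
    diffs = feat_vecs[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    objective = sq_dists[np.arange(len(feat_vecs)), labels].mean()
    return labels, objective
```

Computing this objective for the centers obtained with several values of $K$ and plotting it against $K$ gives the curve used for elbow/knee finding. The intermediate `diffs` array has shape `(n, k, m)`, so this form trades memory for speed.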