# K-Means Clustering

Reference: An Introduction to Statistical Learning with Applications in R

## Algorithm

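Let $K$ be the number of clusters. We want to find a partition $\{C_1,\ldots,C_K\}$ of a given dataset that minimizes $\sum_{k=1}^{K} W(C_k)$, where $W(C_k) = \frac{1}{|C_k|} \sum_{i,j\in C_k} ||x_i - x_j||^2$ is the within-cluster variation of $C_k$. Since $W(C_k) = 2\sum_{i\in C_k} ||x_i - \mu_k||^2$, where $\mu_k$ is the mean of $C_k$, this is equivalent to minimizing the sum of squared distances from each instance to its cluster mean (called __inertia__).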
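The relation between the pairwise form of $W(C_k)$ and the squared distances to the cluster mean can be checked numerically. The following is a minimal sketch using NumPy on random data; the function names are illustrative only.

```python
import numpy as np

def within_cluster_variation(points):
    """W(C): sum of squared distances over all ordered pairs, divided by |C|."""
    diffs = points[:, None, :] - points[None, :, :]
    return (diffs ** 2).sum() / len(points)

def squared_distance_to_mean(points):
    """Sum of squared distances from each point to the cluster mean."""
    mean = points.mean(axis=0)
    return ((points - mean) ** 2).sum()

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 2))  # one synthetic cluster

w = within_cluster_variation(cluster)
s = squared_distance_to_mean(cluster)
print(np.isclose(w, 2 * s))  # True
```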

A simple algorithm that converges to a local minimum is as follows:

* Select $K$ centroids randomly (possibly from the dataset).

* Repeat until the assignments no longer change:

    1. Assign each instance to the closest centroid.
    1. Update each centroid to the mean of the instances assigned to it.
    
Because the result depends on the initial centroids, run the algorithm multiple times with different initializations and keep the run with the lowest inertia.
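The procedure above can be sketched in NumPy as follows. This is an illustrative implementation, not the one scikit-learn uses: it picks initial centroids uniformly from the dataset, keeps the old centroid if a cluster becomes empty (an assumption made here for simplicity), and computes inertia as the sum of squared distances to the assigned centroid. The names `kmeans_once` and `kmeans` are hypothetical.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=100):
    """One run of the assign/update loop from a random initialization."""
    # Select k distinct instances as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each instance to the closest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: move each centroid to the mean of its instances.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: keep old centroid if empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute assignments and inertia for the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return centroids, labels, inertia

def kmeans(X, k, n_init=10, seed=0):
    """Run kmeans_once n_init times and keep the lowest-inertia run."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])

# Usage on three synthetic, well-separated blobs.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
centroids, labels, inertia = kmeans(X, k=3)
print(centroids)
print(inertia)
```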


```
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10)
```
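As a short usage sketch, the estimator can be fit on synthetic data and inspected through its `cluster_centers_`, `labels_`, and `inertia_` attributes. The blob data and the parameter values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated blobs (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# n_init=10 repeats the algorithm from 10 initializations and keeps the best run.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one row per centroid
print(kmeans.inertia_)          # sum of squared distances to the closest centroid
```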

## Optimal number of clusters

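To choose the number of clusters, we can plot the __inertia__ or the __silhouette score__ against the number of clusters and analyze the graphs. Inertia tends to keep decreasing as $K$ grows, so look for an "elbow" where the decrease levels off; for the silhouette score, look for a peak.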
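A minimal sketch of such a sweep, assuming scikit-learn's `KMeans` and `silhouette_score` and a synthetic dataset with three blobs, could look like this. Plotting is left out; the collected values can be passed to any plotting library.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic, well-separated blobs (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

results = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, silhouette) in results.items():
    print(k, round(inertia, 2), round(silhouette, 3))

# Pick the K with the highest silhouette score.
best_k = max(results, key=lambda k: results[k][1])
print(best_k)
```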

The silhouette score of a dataset is the mean of the silhouette coefficients of all instances in the dataset. The silhouette coefficient of an instance $\mathbf{x}$ that belongs to a cluster $C$ with $|C|>1$ is defined by $(b - a)/\max\{a,b\}$, where $a = \frac{1}{|C|-1}\sum_{\mathbf{y}\in C,\, \mathbf{y}\neq\mathbf{x}} ||\mathbf{x} - \mathbf{y}||$ is the mean distance to the other instances in the same cluster and $b = \min_{C^\prime \neq C} \frac{1}{|C^\prime|} \sum_{\mathbf{y}\in C^\prime} ||\mathbf{x} - \mathbf{y}||$ is the mean distance to the instances of the nearest other cluster. If $|C|=1$, then the silhouette coefficient is defined to be 0.

The silhouette coefficient lies between -1 and 1:

* close to 1: the instance is in a proper cluster
* close to 0: the instance is close to a cluster boundary
* close to -1: the instance should probably belong to another cluster
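The definition can be turned into code directly and compared against scikit-learn's `silhouette_samples`. The following is a sketch on a small hand-made dataset; the helper `silhouette_coefficients` is a hypothetical name and uses Euclidean distance.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_coefficients(X, labels):
    """Silhouette coefficient of every instance, following the definition."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: coefficient is 0
        # a: mean distance to the other instances in the same cluster
        # (the distance from the instance to itself is 0, so the sum is unaffected).
        a = dist[i, own].sum() / (own.sum() - 1)
        # b: smallest mean distance to the instances of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 2])

manual = silhouette_coefficients(X, labels)
print(manual)
print(np.allclose(manual, silhouette_samples(X, labels)))
print(manual.mean())  # silhouette score of the dataset
```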


