# K-means Clustering

*Machine Learning | Andrew Ng*: https://www.youtube.com/watch?v=hDmNF9JG3lo

* Unsupervised
    * Unlabeled data
* Clustering data into different groups/clusters

## How it works

0. Randomly initialize cluster centroids.

Do the following 2 steps iteratively
1. Cluster assignment
    * Assign each data point to its closest centroid.
2. Move centroids
    * Average all of the data points assigned to a cluster. Make that average the new cluster centroid of that cluster.
    
**K-means algorithm**  

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, ..., \mu_K \in R^n$  

Repeat {  
$\quad$ for $i=1$ to $m$  
$\quad$$\quad$ $c^i$ := index (from $1$ to $K$) of cluster centroid closest to $x^i$  
$\quad$ for $k=1$ to $K$  
$\quad$$\quad$ $\mu_k$ := average (mean) of points assigned to cluster $k$  
}
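
The loop above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the `n_iters` and `seed` parameters, the early stop when the centroids stop moving, and the rule of keeping the old centroid for an empty cluster are all assumptions added here. Cluster indices run from $0$ to $K-1$ instead of $1$ to $K$.

```python
import numpy as np


def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means on an (m, n) array X; returns (c, mu)."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids as K distinct training examples.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Cluster assignment: squared distance from every point to every centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Move centroids: mean of the assigned points
        # (assumption: an empty cluster keeps its previous centroid).
        new_mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break  # centroids stopped moving
        mu = new_mu
    return c, mu


# Toy data: two well-separated blobs of 2-D points.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 0.5, size=(50, 2)),
               data_rng.normal(10.0, 0.5, size=(50, 2))])
c, mu = kmeans(X, K=2)
print(mu)
```

On this toy data the two centroids should land near the two blob centers, and `c` holds one cluster index per row of `X`.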

## Optimization Objective


$c^{i}$ = index of cluster $(1, 2, ..., K)$ to which example $x^{i}$ is currently assigned  
$\mu_{k}$ = location of cluster centroid $k$ ($\mu_k \in R^n$)  
$\mu_{c^i}$ = location of cluster centroid to which example $x^i$ is assigned  
$K$ = total number of clusters  

**Optimization Objective**:  

\begin{equation}
\begin{aligned}
& J(c^1, ..., c^m, \mu_1, ..., \mu_K) = \frac{1}{m}\sum_{i=1}^{m} \lVert x^i-\mu_{c^i} \rVert^2 \\
& \min_{c^1,...,c^m,\mu_1,...,\mu_K} J(c^1, ..., c^m, \mu_1, ..., \mu_K)
\end{aligned}
\end{equation}

The cluster assignment step minimizes the cost/distortion function $J$ with respect to $c^1, ..., c^m$ while holding $\mu_1, ..., \mu_K$ fixed. The move centroids step minimizes $J$ with respect to $\mu_1,...,\mu_K$ while holding the assignments $c^1, ..., c^m$ fixed. Since neither step can increase $J$, the distortion never goes up from one iteration to the next.
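
To see this two-step minimization in action, the sketch below records $J$ after every assignment step and every centroid step. The helper name `distortion`, the random toy data, and the choice of $K=3$ are assumptions made for illustration. The recorded values should form a non-increasing sequence.

```python
import numpy as np


def distortion(X, c, mu):
    """J = (1/m) * sum_i ||x_i - mu_{c_i}||^2."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # toy data, no real cluster structure needed
K = 3
mu = X[rng.choice(len(X), size=K, replace=False)].copy()

history = []
for _ in range(10):
    # Assignment step: minimizes J over c with mu held fixed.
    c = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    history.append(distortion(X, c, mu))
    # Centroid step: minimizes J over mu with c held fixed
    # (assumption: an empty cluster keeps its previous centroid).
    mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                   for k in range(K)])
    history.append(distortion(X, c, mu))

print(np.round(history, 4))
```

If $J$ ever went up between two consecutive entries of `history`, that would point to a bug in the implementation.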

## Random Initialization of Cluster Centroids

* Choose $K < m$ (fewer clusters than training examples)
* Randomly pick $K$ training examples
* Set $\mu_1, ..., \mu_K$ equal to these $K$ examples

K-means can converge to different solutions depending on how the centroids were initialized, and it can get stuck in a local optimum of $J$.

To reduce the risk of ending up in a poor local optimum, run K-means from multiple random initializations.

**Random Initialization**  

For $i=1$ to $100$ {  
$\quad$ Randomly initialize K-means.  
$\quad$ Run K-means. Get $c^1,...,c^m,\mu_1,...,\mu_K$.  
$\quad$ Compute cost function (distortion)  
$\quad$$\quad$ $J(c^1,...,c^m,\mu_1,...,\mu_K)$  
}

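Pick the clustering that gave the lowest cost $J$. Multiple random initializations help most when the **number of clusters is small** (roughly $K = 2$ to $10$); with many clusters, the first run usually already gives a decent solution and restarts improve it less.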
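
A minimal sketch of this restart procedure, assuming NumPy and a hypothetical helper `run_kmeans` that performs one K-means run with a fixed number of iterations. The three-blob toy data and all parameter values are illustrative assumptions.

```python
import numpy as np


def run_kmeans(X, K, rng, n_iters=50):
    """One K-means run from a random initialization; returns (J, c, mu)."""
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        c = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Assumption: an empty cluster keeps its previous centroid.
        mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                       for k in range(K)])
    # Final assignment so that c matches the returned centroids.
    c = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    J = np.mean(np.sum((X - mu[c]) ** 2, axis=1))
    return J, c, mu


rng = np.random.default_rng(0)
# Toy data: three blobs of 40 points each.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
X = np.vstack([rng.normal(ctr, 0.6, size=(40, 2)) for ctr in centers])

# 100 random initializations; keep the run with the lowest distortion.
runs = [run_kmeans(X, K=3, rng=rng) for _ in range(100)]
costs = np.array([J for J, _, _ in runs])
best_J, best_c, best_mu = runs[int(costs.argmin())]
print("best J:", best_J, "worst J:", costs.max())
```

Comparing the best and worst cost across runs gives a rough sense of how much the initialization matters for a given dataset.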

## Choosing the Number of Clusters

* **Elbow method**: Compute the cost/distortion function $J$ for the number of clusters equal to 1, then 2, then 3, etc. Plot the value of $J$ on the y-axis and the number of clusters on the x-axis. Pick the number of clusters at the "elbow": the point up to which the distortion drops rapidly and after which it drops slowly. The curve often has no clear elbow, in which case this method is inconclusive.
* **Better method**: Sometimes, you're running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose. For example, when sizing t-shirts from customers' heights and weights, you might compare 3 clusters against 5 clusters by how well each set of sizes serves the customers.
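
For the elbow method, the distortion curve can be computed along these lines. The helper `best_distortion`, the number of restarts per $K$, and the three-blob toy data are assumptions for illustration. The values are printed rather than plotted to keep the example dependency-free; the lists `Ks` and `costs` can be handed to any plotting library.

```python
import numpy as np


def best_distortion(X, K, rng, n_init=10, n_iters=50):
    """Lowest distortion J found over n_init random K-means initializations."""
    best = np.inf
    for _run in range(n_init):
        mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        for _step in range(n_iters):
            c = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
            # Assumption: an empty cluster keeps its previous centroid.
            mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        J = np.mean(np.sum((X - mu[c]) ** 2, axis=1))
        best = min(best, J)
    return best


rng = np.random.default_rng(0)
# Toy data with three blobs, so the elbow is expected around K = 3.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
X = np.vstack([rng.normal(ctr, 0.6, size=(40, 2)) for ctr in centers])

Ks = list(range(1, 7))
costs = [best_distortion(X, K, rng) for K in Ks]
for K, J in zip(Ks, costs):
    print(f"K={K}: J={J:.3f}")
```

On data like this, $J$ should fall steeply up to $K=3$ and only slowly afterwards. Real datasets often give a much smoother curve.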