# Kmeans

Kmeans is an unsupervised clustering algorithm: it groups observations into k clusters without using any labels.

## Algorithm
1. Initialize k cluster centers.
2. Assign each observation to its closest cluster center:
$$ z_i = \arg\min_j ||\mu_j - x_i||_2^2 $$
For each observation, calculate the distance to each of the k cluster means. The inferred label $z_i$ is the index of the cluster mean with the lowest distance.
3. Revise each cluster center as the mean of the observations assigned to it:
$$ \mu_j = \frac{1}{n_j}\sum_{i: z_i = j} x_i $$
where $n_j$ is the number of observations assigned to cluster j. Repeat steps 2 and 3 until convergence (convergence is discussed in a later section).
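
The three steps above can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the function names (`sq_dist`, `kmeans`), the choice of random data points as initial centers, the "no label changed" stopping rule and the toy data are all assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain Kmeans on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: initialize the centers with k distinct data points (an assumption).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(centers[j], x)) for x in points]
        # Stop once no assignment changes between two iterations.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned observations.
        for j in range(k):
            members = [x for x, z in zip(points, labels) if z == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Toy data (made up for illustration): two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

On this toy data the first three points share one label and the last three share the other. Which group gets label 0 depends on the random draw.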


The boundary achieved through this approach is called a "Voronoi tessellation". Each cell is drawn so that any new point falling inside it is closer to that cell's cluster mean than to any other cluster mean.

![title](Images\Kmeans_1.PNG)

## Kmeans as co-ordinate descent 

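Rewriting the Kmeans algorithm:

1. Assign each observation to its closest cluster center:
$$ z_i = \arg\min_j ||\mu_j - x_i||_2^2 $$
2. Revise each cluster center as the mean of its assigned observations:
$$ \mu_j = \arg\min_{\mu} \sum_{i: z_i = j} ||\mu - x_i||_2^2 $$
The second equation says that the new center is the point with the lowest total squared error over its assigned observations, and that point is exactly their mean. So the update step is itself a minimization.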

In summary, we minimize the same objective in two alternating steps:
1. z given $\mu$
2. $\mu$ given z

That is exactly co-ordinate descent: keep x fixed and update y, then in the next step keep y fixed and update x.
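
One way to see the co-ordinate descent view is to record the objective (sum of squared distances to the assigned centers) after every half-step and check that it never goes up. The sketch below does this on a made-up 1-D data set. The starting centers, the helper names and the fixed number of rounds are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def objective(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(centers[z], x) for x, z in zip(points, labels))

# Made-up 1-D data and deliberately poor starting centers.
points = [(0.0,), (1.0,), (2.0,), (9.0,), (10.0,)]
centers = [(0.0,), (1.0,)]
labels = [0] * len(points)
history = [objective(points, centers, labels)]

for _ in range(5):
    # Half-step 1: update z with mu held fixed.
    labels = [min(range(len(centers)), key=lambda j: sq_dist(centers[j], x)) for x in points]
    history.append(objective(points, centers, labels))
    # Half-step 2: update mu with z held fixed.
    for j in range(len(centers)):
        members = [x for x, z in zip(points, labels) if z == j]
        if members:
            centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    history.append(objective(points, centers, labels))

print(history)
```

Each entry of `history` is less than or equal to the previous one, because each half-step minimizes the objective over one block of variables while the other block is held fixed.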

### Convergence Criteria
Kmeans converges to a local minimum. Reaching the global minimum is not guaranteed because the objective is non-convex and has a complicated structure.

### Initialization Effect
Kmeans is very sensitive to the initialization of the cluster means. With different initial values we can get different results and different clusters. Points can change clusters between runs! This is not about two given points staying together in the same group while that group merely gets a different cluster color in a different run. It is about two given points ending up in different groups altogether in different runs!
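
A small made-up example shows the effect. Four points sit at the corners of a wide rectangle, and Kmeans is started from two different sets of initial centers. The helper `run_kmeans`, the data and both initializations are assumptions chosen so that the outcome is easy to verify by hand.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def run_kmeans(points, init_centers, max_iter=100):
    """Run Kmeans from the given starting centers; return (labels, heterogeneity)."""
    centers = list(init_centers)
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its closest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(centers[j], x)) for x in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean of its assigned points.
        for j in range(k):
            members = [x for x, z in zip(points, labels) if z == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(centers[z], x) for x, z in zip(points, labels))
    return labels, cost

# Corners of a rectangle that is 4 wide and 1 tall.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Initialization A: one center on the left, one on the right.
labels_a, cost_a = run_kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
# Initialization B: one center at the bottom, one at the top.
labels_b, cost_b = run_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

print(labels_a, cost_a)
print(labels_b, cost_b)
```

Run A groups the points left versus right, run B groups them bottom versus top. Both runs have converged, but run B is stuck in a much worse local minimum, and points that were together in run A are separated in run B.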



# Kmeans ++

The initialization of Kmeans is critical to the quality of the local optimum it reaches!

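Smart Initialization:
1. Choose the first cluster center at random from the data points.
2. For each data point, calculate the distance to its nearest cluster center chosen so far.
3. Choose the next cluster center from the data points, with the probability of a point being chosen proportional to that distance squared. The farthest point is not picked deterministically, but the next cluster center is more likely to be far away from the existing ones.
4. Repeat steps 2 and 3 until k cluster centers have been chosen.

This can be computationally costly compared with purely random initialization, but it improves the quality of the local optimum found when running Kmeans.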
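
The smart initialization can be sketched with the standard library, using `random.choices` for the weighted draw. The function name `kmeans_pp_init` and the toy data are assumptions. The sketch also assumes the data contains at least k distinct points, otherwise all weights become zero and the draw fails.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from the data with distance-squared weighting."""
    rng = random.Random(seed)
    # First center: uniformly at random from the data points.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest center chosen so far.
        d2 = [min(sq_dist(c, x) for c in centers) for x in points]
        # Draw the next center with probability proportional to d2.
        # Points already chosen have weight 0, so they are never drawn again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Toy data (made up): three loose groups.
points = [(0.0, 0.0), (0.5, 0.2), (10.0, 0.0), (10.3, 0.4), (5.0, 9.0), (5.2, 9.5)]
init_centers = kmeans_pp_init(points, k=3)
print(init_centers)
```

The returned centers would then replace the random starting centers of a regular Kmeans run.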

## Quality Metrics

### Cluster Heterogeneity
We want low heterogeneity, that is, less dissimilar data points within each cluster. Lower is better: a smaller sum of squared distances within all clusters.
Measure of quality:
$$ \sum_{j=1}^{k} \sum_{i: z_i = j} ||\mu_j - x_i||_2^2 $$

![title](Images\KMeans_Heterogenity.PNG)

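If k = N, each data point is its own cluster and heterogeneity = 0! As k increases, the lowest achievable heterogeneity decreases, so heterogeneity alone cannot pick k. Choose the best k using the elbow of the curve, the point where adding more clusters stops giving a large decrease.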
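
The heterogeneity measure and the k = N case can be checked by hand on a tiny made-up 1-D data set. The helper names and the hand-picked cluster assignments below are assumptions for illustration. No Kmeans run is involved, only the measure itself.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cluster_means(points, labels, k):
    """Mean of the points assigned to each of the k clusters."""
    means = []
    for j in range(k):
        members = [x for x, z in zip(points, labels) if z == j]
        means.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return means

def heterogeneity(points, labels, k):
    """Sum over clusters of squared distances to the cluster mean."""
    means = cluster_means(points, labels, k)
    return sum(sq_dist(means[z], x) for x, z in zip(points, labels))

points = [(0.0,), (2.0,), (10.0,), (12.0,)]

h1 = heterogeneity(points, [0, 0, 0, 0], k=1)  # everything in one cluster
h2 = heterogeneity(points, [0, 0, 1, 1], k=2)  # the two obvious groups
h4 = heterogeneity(points, [0, 1, 2, 3], k=4)  # k = N, one point per cluster

print(h1, h2, h4)
```

Here heterogeneity drops sharply from k = 1 to k = 2 and reaches 0 at k = N, which is the behaviour the elbow curve relies on.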

![title](Images\KMeans_Elbow.PNG)

## Elbow method is not quite good enough!

* Consider a case where we can see visually that there are 5 clusters, but the elbow is at 4!

* The elbow method doesn't make the best possible number of clusters stand out.

* Silhouette Score 
    * $\frac{b - a}{\max(a, b)}$, computed per data point and then averaged over all data

    * a = mean intra-cluster distance (mean distance to the other data in the same cluster)
    * b = mean nearest-cluster distance (mean distance to the data of the next closest cluster)

    * score
        * close to +1 -> the data point is well inside its cluster and the other clusters are far off (b much larger than a -> right cluster assignment)
        * close to 0 -> the data point is close to a cluster boundary
        * close to -1 -> the data point was probably assigned to the wrong cluster (a much larger than b)
    
    * Plotted against k, the score shows much more clearly which number of clusters to choose.
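
The silhouette coefficient can be sketched directly from the definitions of a and b above. The function name `silhouette_scores`, the score of 0 for single-point clusters (a common convention) and the toy data are assumptions. The sketch also assumes there are at least two clusters.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette coefficient (b - a) / max(a, b)."""
    clusters = {}
    for x, z in zip(points, labels):
        clusters.setdefault(z, []).append(x)
    scores = []
    for x, z in zip(points, labels):
        own = clusters[z]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the distance from x to itself is 0, so dividing by len - 1 is correct).
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(math.dist(x, y) for y in other) / len(other)
                for c, other in clusters.items() if c != z)
        scores.append((b - a) / max(a, b))
    return scores

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_scores(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_scores(points, [0, 1, 0, 1])   # deliberately wrong assignment

mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(mean_good, mean_bad)
```

The sensible assignment gives a mean score close to +1, while the deliberately wrong assignment gives a negative mean score.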

![title](Images\KMeans_SScore.PNG)

* But still we would want 5 to stand out!

* Silhouette Diagram


    * An even more informative visualization is obtained when you plot every instance’s silhouette coefficient, sorted by the cluster it is assigned to and by the value of the coefficient. This is called a silhouette diagram.
    * Each diagram contains one knife shape per cluster. Height -> number of data in the cluster, width -> sorted silhouette coefficients of the data in the cluster (wider is better).
    * The dashed line indicates the mean silhouette coefficient.
    * In the diagram below, the closer a knife extends towards +1 the better. Only then does it say that the cluster assignment is good and that the data is well contained in the cluster rather than close to a boundary.
    * With 5 clusters the diagram looks better overall than with 4.
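
The data behind a silhouette diagram is just the per-instance coefficients grouped by cluster and sorted. The sketch below prints a rough text version of the knife shapes. The function name `silhouette_diagram_rows` and the scores are hypothetical values made up for illustration, not results from any real data set.

```python
def silhouette_diagram_rows(scores, labels):
    """Group per-instance silhouette coefficients by cluster and sort them."""
    rows = {}
    for s, z in zip(scores, labels):
        rows.setdefault(z, []).append(s)
    # Sorting within each cluster gives each knife its smooth edge.
    return {z: sorted(vals) for z, vals in sorted(rows.items())}

# Hypothetical coefficients for six instances in two clusters.
scores = [0.8, 0.6, 0.7, 0.1, -0.2, 0.3]
labels = [0, 0, 0, 1, 1, 1]

rows = silhouette_diagram_rows(scores, labels)
mean_score = sum(scores) / len(scores)  # position of the dashed line

for z, vals in rows.items():
    for s in vals:
        # One text row per instance; bar length stands in for the coefficient.
        print(f"cluster {z} | {'#' * max(0, round(s * 20))} {s:+.2f}")
print(f"mean silhouette coefficient: {mean_score:.2f}")
```

In this made-up example cluster 0 has long bars well past the mean, while cluster 1 has short bars and one negative coefficient, which is the pattern that signals a poor cluster.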

![title](Images\KMeans_SDiagram.PNG)


## Disadvantages of Kmeans:

* Kmeans doesn't behave well when the clusters have varying sizes or different densities.

* i.e., if the data contains clusters of different sizes, densities and orientations, Kmeans still converges but tends to produce poor clusters.