# K means clustering

## Recap
---
- Supervised Learning
    - Function approximation
    - Use labeled training data to generalize labels to new instances
- Unsupervised Learning
    - Data Description
    - Make sense out of unlabeled data

## Basic Clustering Problem
---
- Given
    - Set of objects X
    - Inter-object distances $D(x,y) = D(y,x)$ for all $x,y \in X$
- Output
    - Partition $P_D$ with $P_D(x) = P_D(y)$ if x and y are in the same cluster

### Single Linkage Clustering
---
<img src="../images/single_linkage_clustering.png" width=500 align="right"/>  
- Consider each object a cluster (n objects, each object is its own cluster)
- Define intercluster distance as the distance between the closest two points in the two clusters
- Merge the two closest clusters
- Repeat n-k times to make k clusters
- Run time is $O(n^3)$
- Can make strange clusters as it walks around to find shortest distances
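
The steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than a reference implementation: the function name `single_linkage`, the 1-D sample data, and the absolute-difference distance are all assumptions made for the example. It recomputes every intercluster distance on each pass and makes no attempt to match the run time quoted above.

```python
from itertools import combinations


def single_linkage(points, k, dist):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with every object in its own cluster.
    clusters = [[p] for p in points]

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(dist(x, y) for x in a for y in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest intercluster distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i (i < j, so removing j keeps i valid).
        clusters[i] = clusters[i] + clusters[j]
        clusters.pop(j)
    return clusters


# Hypothetical 1-D data with three obvious groups.
points = [0.0, 0.5, 1.0, 10.0, 10.5, 30.0]
result = single_linkage(points, k=3, dist=lambda x, y: abs(x - y))
print(result)
```

Starting from n = 6 singleton clusters, the loop performs n - k = 3 merges before stopping.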

## K Means Clustering
---
- pick k centers (at random); a center does not have to be a point in the collection of objects
- each center "claims" its closest points
- recompute the centers by averaging the clustered points
- repeat until convergence
- K means is like hill climbing
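
A minimal sketch of the loop described above, in plain Python. The helper names, the sample points, and the choice to draw the starting centers from the data are assumptions for illustration. Convergence is detected here as "no point changed cluster", which is one common stopping rule among several.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iters=100):
    rng = random.Random(seed)
    # Pick k starting centers at random. They are drawn from the data here,
    # although a center does not have to be a data point.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Each center "claims" its closest points.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each center as the average of the points it claimed.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean(members)
    return centers, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

Which cluster is labeled 0 and which is labeled 1 depends on the random starting centers; only the grouping is meaningful.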

## K Means as optimization
---
- configurations: center, P (partition, cluster)
- scores: $E(P, center) = \sum_x \left\| center_{P(x)} - x \right\|^2$
- neighborhood: $N(P, center) = \{(P^{'}, center)\} \cup \{(P, center^{'})\}$ (change either the partition or the centers, not both)
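
To make the score concrete, the sketch below evaluates $E(P, center)$ for a small configuration and for one of its neighbors (same centers, different partition). The function name `kmeans_error` and the sample values are assumptions for illustration; the partition is represented as a list giving the cluster index of each point.

```python
def kmeans_error(points, centers, partition):
    """E(P, center): sum of squared distances from each point to its assigned center."""
    return sum(
        sum((c - xi) ** 2 for c, xi in zip(centers[partition[i]], x))
        for i, x in enumerate(points)
    )


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(1.0, 0.0), (10.0, 0.0)]

partition = [0, 0, 1]                      # P(x): cluster index of each point
error = kmeans_error(points, centers, partition)

neighbor_partition = [0, 1, 1]             # neighbor: same centers, one point moved
neighbor_error = kmeans_error(points, centers, neighbor_partition)

print(error, neighbor_error)
```

Viewed as hill climbing, k-means alternates between the two kinds of neighbor: it first picks the best partition for the current centers, then the best centers for that partition.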

## K Means in Euclidean Space
---
<img src="../images/kmeans_euclidean.png" width=600 align="left"/>  

## Properties of K-means clustering
---
- each iteration polynomial: $O(kn)$
- finite (exponential) iterations: $O(k^n)$
- error never increases (if ties are broken consistently); it strictly decreases on every iteration except when nothing changes, which is convergence
- can get stuck
    - for example k = 3
    - if the random initialization picks two centers close to each other in the same cluster, k-means could get stuck in a local optimum
    - could use random restarts to help fix this
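
The sketch below illustrates both the stuck case and the random-restart remedy on 1-D data. Everything here is an assumption for illustration: the function names, the sample data (three well-separated pairs), the deliberately bad starting centers, and the choice of 20 restarts.

```python
import random


def k_means_1d(xs, centers, max_iters=100):
    """Run k-means on 1-D data from the given starting centers; return (error, centers)."""
    centers = list(centers)
    for _ in range(max_iters):
        # Assignment step: each point joins its closest center.
        clusters = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2)
            clusters[j].append(x)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    error = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return error, centers


def k_means_with_restarts(xs, k, restarts=20, seed=0):
    """Keep the lowest-error result over several random initializations."""
    rng = random.Random(seed)
    return min(k_means_1d(xs, rng.sample(xs, k)) for _ in range(restarts))


xs = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]    # three well-separated pairs, k = 3

# Bad start: two of the three centers fall inside the same pair.
stuck_error, stuck_centers = k_means_1d(xs, [0.0, 1.0, 10.0])

best_error, best_centers = k_means_with_restarts(xs, k=3)
print(stuck_error, stuck_centers)
print(best_error, best_centers)
```

With the bad start, the centers at 0 and 1 each keep a single point while the third center has to cover the remaining four points, and no reassignment improves the error. Restarting from several random initializations and keeping the lowest-error run makes this outcome much less likely, though it does not rule it out.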

## Soft Clustering
---
Lean on probability theory: treat each point as having been generated probabilistically from one of several clusters.   

Assume the data was generated by:  
1. Select one of K Gaussians (fixed known variance) uniformly
2. Sample $x_i$ from that Gaussian
3. Repeat n times

Task: find a hypothesis $h = \langle \mu_1, \ldots, \mu_k \rangle$ that maximizes the probability of the data (maximum likelihood, ML)   

**Maximum Likelihood Gaussian**   
The ML mean of the Gaussian $\mu$ is the mean of the data.   
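
As a quick numerical check of this claim, the sketch below compares the Gaussian log-likelihood at the sample mean with the log-likelihood at a few nearby values of $\mu$. The data values, the candidate offsets, and the unit variance are assumptions chosen for illustration.

```python
import math


def log_likelihood(data, mu, sigma=1.0):
    """Log-likelihood of the data under a Gaussian with mean mu and fixed sigma."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


data = [2.0, 3.5, 4.0, 6.5]
sample_mean = sum(data) / len(data)

# Try the sample mean and a few values on either side of it.
candidates = [sample_mean + d for d in (-1.0, -0.1, 0.0, 0.1, 1.0)]
best = max(candidates, key=lambda mu: log_likelihood(data, mu))
print(sample_mean, best)
```

Among these candidates the sample mean scores highest, because the only term that depends on $\mu$ is the sum of squared deviations, which the mean minimizes.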

What if there are K of them?  Add hidden variables.    
$\langle X, Z_1, Z_2, \ldots, Z_k \rangle$: the Z's are indicator variables for which cluster the point came from

## Expectation Maximization (EM)
---
- move back and forth between two probabilistic calculations
- Expectation
- Maximization
- move back and forth between soft clustering (expectation) and computing means for the soft cluster (maximization)
- $E[Z_{ij}]$: likelihood that data element i comes from cluster j
    - use Bayes rule
    - probability data element i was produced by cluster j (normalize)
- $\mu_j$: average of the $x_i$'s, each weighted by the likelihood that it came from cluster j (normalize)
- becomes k-means if cluster assignments use argmax (hard assignments instead of soft ones)
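
The two alternating steps can be sketched as follows for 1-D data, under the generative assumptions stated earlier (K Gaussians with the same fixed, known variance, chosen uniformly). The function name `em_gaussian_means`, the sample data, the starting means, and the fixed iteration count are assumptions for illustration. A practical version would also guard against numerical underflow when a point is far from every mean.

```python
import math


def em_gaussian_means(xs, mus, sigma=1.0, iters=50):
    """EM for the means of K equally likely 1-D Gaussians with fixed sigma."""
    mus = list(mus)
    resp = []
    for _ in range(iters):
        # Expectation: E[Z_ij] is proportional to P(x_i | mu_j);
        # normalize over the clusters j so each row sums to 1.
        resp = []
        for x in xs:
            weights = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # Maximization: mu_j is the average of the x_i weighted by E[Z_ij].
        mus = [
            sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(len(mus))
        ]
    # resp holds the soft assignments from the last expectation step.
    return mus, resp


# Hypothetical data: two groups centered near 1 and 11.
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
mus, resp = em_gaussian_means(xs, mus=[4.0, 6.0])
print(mus)
```

Replacing each row of `resp` with a one-hot vector at its largest entry (the argmax) would turn the maximization step into the plain cluster average used by k-means.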


**NOTE: left summation should be with respect to variable j**
<img src="../images/expectation_maximization.png" width=600 align="left"/>  

## Properties of EM
---
- monotonically non-decreasing likelihood (not getting worse on each step)
- not guaranteed to converge (in practice it does)
- will not diverge
- can get stuck (random restart)
- works with any distribution (if E, M solvable)

## Clustering Properties
---
A clustering algorithm takes a set of distances (D) and maps them to clusters (partitions)   

$P_D \leftarrow$ clustering scheme  
- Richness:  For any assignment of objects to clusters, there is some distance matrix D that makes the scheme return that clustering: $\forall C, \exists D, P_D = C$
- Scale-invariance:  Scaling distances by a positive value does not change the clustering 
- Consistency:  Shrinking intracluster distances and expanding intercluster distances does not change the clustering  

**Impossibility Theorem**  
No clustering scheme can achieve all three (richness, scale invariance, consistency)  