# K Means From Scratch

## The Basic Idea

Suppose we are given a dataset in which each observed example has a set of features but
**no** labels. What can we do?

One task we can perform on an unlabeled dataset is to find groups of data points
that are similar to one another, called clusters.

K Means is a clustering algorithm. It stores k centroids that it uses to define clusters. A point
is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other
centroid.
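
To make the membership rule concrete, here is a minimal sketch that assigns a single point to its nearest centroid. The centroid and point values are made up for illustration only.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # k = 3 centroids
point = np.array([1.0, 3.5])

# Euclidean distance from the point to each centroid.
distances = np.linalg.norm(centroids - point, axis=1)

# The point belongs to the cluster whose centroid is nearest.
cluster = int(np.argmin(distances))
print(distances, cluster)
```

Here the point ends up in the cluster of the third centroid (index 2), because that centroid is the closest of the three.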

## The Algorithm

We are given a training set $\{ x_1, x_2,..., x_m \}$, and want to group the data into a few
cohesive "clusters". 

Here, each data point is a feature vector $x_i \in\mathbb{R}^n$.
Our goal is to find k centroids and a label $c_i$ for each datapoint.

1. Initialize **cluster centroids** $\mu_1, \mu_2,...,\mu_k\in\mathbb{R}^n$ randomly.
2. Repeat until convergence:

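$$
\text{Set } c_i := \arg\min_{j} \Vert x_i - \mu_j\Vert ^2 \quad \text{for every } i
$$

$$
\text{Set } \mu_j := \frac{\sum_{i=1}^m \chi_{\{c_i = j\}}x_i}{\sum_{i=1}^m \chi_{\{c_i = j\}}} \quad \text{for every } j
$$

Here $\chi_{\{c_i = j\}}$ is the indicator that equals 1 if $c_i = j$ and 0 otherwise, so $\mu_j$ is the mean of the points currently assigned to cluster $j$.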

We can write the algorithm as 5 key steps:
1. Randomly select k initial centroids.
2. Compute the distance from each point to every centroid and assign each point to the nearest centroid.
3. Compute the mean of the points assigned to each cluster.
4. Move each centroid to that mean.
5. Repeat steps 2-4 until the cluster assignments no longer change.
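
As a compact illustration, steps 2 to 4 can be sketched as one vectorized NumPy pass. The function name `kmeans_step` and the toy data are hypothetical. Keeping the old centroid when a cluster is empty is an assumption of this sketch, not something the algorithm statement above prescribes.

```python
import numpy as np

def kmeans_step(X, mu):
    """One assignment + update pass; X has shape (m, n), mu has shape (k, n)."""
    # Squared distances ||x_i - mu_j||^2 for every pair, shape (m, k).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # Step 2: label each point with the index of its nearest centroid.
    c = sq_dists.argmin(axis=1)
    # Steps 3-4: move each centroid to the mean of its assigned points.
    new_mu = mu.astype(float).copy()
    for j in range(len(mu)):
        members = X[c == j]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
            new_mu[j] = members.mean(axis=0)
    return c, new_mu

# Hypothetical toy data: two obvious groups of two points each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
mu0 = np.array([[1.0, 1.0], [8.0, 1.0]])
c, mu1 = kmeans_step(X, mu0)
print(c, mu1)
```

Calling `kmeans_step` repeatedly until `c` stops changing corresponds to step 5.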

## Implementation of steps from 1 to 5

In [None]:
from copy import deepcopy
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

In [None]:
MAX_ITERATIONS = 300

In [None]:
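# step 1
def get_random_centroids(dataset, k):
    """Draw k random centroids, one coordinate per feature column."""
    temp = []
    n_features = dataset.shape[1]
    for i in range(n_features):
        min_col_val = np.min(dataset[:, i])
        max_col_val = np.max(dataset[:, i])
        # Sample floats rather than integers: an integer array would truncate
        # the centroid means written into it later.
        # The 0.8 factor pulls both bounds toward the origin.
        rand_pos = np.random.uniform(0.8 * min_col_val, 0.8 * max_col_val, size=k)
        temp.append(rand_pos)

    # Rows are centroids, columns are features: shape (k, n_features).
    c_positions = np.array(temp).T

    return c_positions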
        

In [None]:
def get_dist(a, b):
    # Euclidean distance between two points.
    return np.linalg.norm(a - b)

def should_stop(old_centroids, centroids, iterations):
    # Stop at the iteration cap or once the centroids no longer move.
    if iterations >= MAX_ITERATIONS:
        return True
    return np.array_equal(old_centroids, centroids)

def kmeans(dataset, centroids, k):
    n = len(dataset)
    clusters = np.zeros(n, dtype=int)
    # Work on a float copy so the caller's array is not modified and the
    # means are not truncated if integer centroids are passed in.
    centroids = np.array(centroids, dtype=float)
    iterations = 0
    old_centroids = None

    # step 5
    while not should_stop(old_centroids, centroids, iterations):
        for i in range(n):
            # step 2
            distances = [get_dist(dataset[i], c) for c in centroids]
            clusters[i] = np.argmin(distances)

        old_centroids = deepcopy(centroids)

        # step 3-4
        for j in range(k):
            points = dataset[clusters == j]
            # An empty cluster keeps its previous centroid (the mean of no points is NaN).
            if len(points) > 0:
                centroids[j] = points.mean(axis=0)
        iterations += 1

    return centroids, clusters


### Let's Make Some Data

In [None]:
num_of_centroids = 4
n_features = 2
dataset, _ = make_blobs(n_samples=5000, centers=num_of_centroids, n_features=n_features, random_state=195)

In [None]:
plt.scatter(dataset[:,0], dataset[:, 1])
plt.show()

In [None]:
c_positions = get_random_centroids(dataset, num_of_centroids) 

In [None]:
plt.scatter(dataset[:, 0], dataset[:, 1])
plt.scatter(c_positions[:, 0], c_positions[:, 1], marker='*', s=300, c='orange')
plt.show()

## Apply K-Means

In [None]:
new_clusters, clusters_points = kmeans(dataset, c_positions, num_of_centroids)

In [None]:
plt.scatter(dataset[:, 0], dataset[:, 1])
plt.scatter(new_clusters[:, 0], new_clusters[:, 1], marker='*', s=300, c='r')
plt.show()

In [None]:
def plot_in_col(x, number_of_clusters, clusters_points, new_clusters):
    for i in range(number_of_clusters):
        col_points = np.array([x[n] for n in range(len(x)) if clusters_points[n] == i])
        plt.scatter(col_points[:, 0], col_points[:, 1], s=10)
    plt.scatter(new_clusters[:, 0], new_clusters[:, 1], marker='*', s=300, c='w')
    plt.show()

In [None]:
plot_in_col(dataset, num_of_centroids, clusters_points, new_clusters)
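
Besides plotting, one way to sanity-check a result is the within-cluster sum of squared distances, which is the quantity that the assignment and mean-update steps never increase. The helper name `within_cluster_ss` and the toy arrays below are hypothetical; this is a sketch, not part of the implementation above.

```python
import numpy as np

def within_cluster_ss(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    labels = np.asarray(labels, dtype=int)  # accept float-valued labels too
    diffs = X - np.asarray(centroids, dtype=float)[labels]
    return float((diffs ** 2).sum())

# Hypothetical toy data: two groups of two points each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
rough = np.array([[1.0, 1.0], [8.0, 1.0]])     # centroids before the mean update
refined = np.array([[0.0, 1.0], [10.0, 1.0]])  # centroids at the cluster means

print(within_cluster_ss(X, rough, labels), within_cluster_ss(X, refined, labels))
```

With the variables from this notebook, the same check would be `within_cluster_ss(dataset, new_clusters, clusters_points)`.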

## Expectation Maximization (EM)
### K-Means is a hard EM version of Gaussian Naive Bayes with unit variance

We consider a Naive Bayes model with:  
- Class variable $C \in \{1, 2, \dots, k\}$  
- Feature variables $f_i \in \mathbb{R}$  with $i\in\{1,2,..., d\}$

Instead of a discrete table, assume $f_i$ is Gaussian with unit variance given $C$, i.e. $f_i \mid C=c \sim \mathcal{N}(\mu_{c,i},1)$, with density:  
$$
P(f_i = x \mid C=c) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{(x-\mu_{c,i})^2}{2}\Big)
$$

The goal is to estimate $P(C \mid \mathbf{f})$. Using Bayes' rule and assuming the features are conditionally independent given the class:  
$$
P(C=c \mid \mathbf{f}) \propto P(C=c) \prod_i P(f_i \mid C=c)
$$
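
The posterior above can be sketched numerically as follows. The function name `posterior` and the toy means and prior are hypothetical. The computation is done in log space, which is a common numerical choice rather than something the derivation requires.

```python
import numpy as np

def posterior(x, mu, prior):
    """P(C=c | f=x) for unit-variance Gaussian features.

    x has shape (d,), mu has shape (k, d), prior has shape (k,).
    """
    # log P(C=c) + sum_i log N(x_i; mu_{c,i}, 1); the shared -d/2 * log(2*pi)
    # term is dropped because it cancels in the normalization.
    log_joint = np.log(prior) - 0.5 * ((x - mu) ** 2).sum(axis=1)
    log_joint -= log_joint.max()  # subtract the max for numerical stability
    unnorm = np.exp(log_joint)
    return unnorm / unnorm.sum()

# Hypothetical toy model: two classes in two dimensions, uniform prior.
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
prior = np.array([0.5, 0.5])

p_near_first = posterior(np.array([1.0, 1.0]), mu, prior)
p_midpoint = posterior(np.array([2.0, 2.0]), mu, prior)
print(p_near_first, p_midpoint)
```

A point close to the first mean gets almost all of its posterior mass on the first class, while the midpoint between the two means is split evenly.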

### Maximum Likelihood Estimation (with hidden variables)

Suppose we have $n$ observations $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ but unknown classes $C_1, \dots, C_n \in \{1, 2, \dots, k\}$.  
The complete-data likelihood, dropping the class prior factor because it does not depend on $\mu$, is:  
$$
L(\mu) = \prod_{j=1}^n \prod_{c=1}^k \prod_{i=1}^d P(f_i = x_{j,i} \mid C=c)^{\mathbf{1}\{C_j=c\}}
$$  
where $\mathbf{1}\{C_j=c\}$ is 1 if observation $j$ belongs to class $c$, 0 otherwise.

Taking log-likelihood:  
$$
\ell(\mu) = \sum_{j=1}^n \sum_{c=1}^k \mathbf{1}\{C_j=c\} \sum_{i=1}^d \log P(f_i = x_{j,i} \mid C=c)
$$

Plug in Gaussian form:  
$$
\ell(\mu) = \sum_{j=1}^n \sum_{c=1}^k \mathbf{1}\{C_j=c\} \sum_{i=1}^d \Big[-\frac{1}{2}(x_{j,i}-\mu_{c,i})^2 + \text{const}\Big]
$$
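
As a small worked step using only the definitions above, the inner sum over features is a squared Euclidean norm, so the log-likelihood can be written per observation:

$$
\sum_{i=1}^d (x_{j,i}-\mu_{c,i})^2 = \Vert \mathbf{x}_j - \mu_c \Vert^2
\quad\Longrightarrow\quad
\ell(\mu) = -\frac{1}{2} \sum_{j=1}^n \sum_{c=1}^k \mathbf{1}\{C_j=c\}\, \Vert \mathbf{x}_j - \mu_c \Vert^2 + \text{const}
$$

Here $\mu_c = (\mu_{c,1}, \dots, \mu_{c,d})$ and the constant collects the terms that do not depend on $\mu$. For fixed class labels, maximizing $\ell(\mu)$ is therefore the same as minimizing the total squared distance from each point to the mean of its class.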

### EM Algorithm

Since $C_j$ is unknown, EM iteratively estimates it:

**E-step:** Compute expected value of $\mathbf{1}\{C_j=c\}$ given current $\mu^{(t)}$:  
$$
\gamma_{j,c} = \mathbb{E}[\mathbf{1}\{C_j=c\} \mid x_j, \mu^{(t)}] 
= \frac{P(C_j=c) \prod_i P(x_{j,i} \mid C_j = c, \mu^{(t)})}{\sum_{c'=1}^k P(C_j=c') \prod_i P(x_{j,i} \mid C_j = c', \mu^{(t)})}
$$

**M-step:** Maximize expected log-likelihood w.r.t. $\mu$:  
$$
\mu^{(t+1)}_{c,i} = \frac{\sum_{j=1}^n \gamma_{j,c} x_{j,i}}{\sum_{j=1}^n \gamma_{j,c}}
$$

This is the weighted mean of all points, where each point $x_j$ is weighted by its responsibility $\gamma_{j,c}$ for class $c$.

Repeat E and M until convergence.  
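
The two steps can be sketched in NumPy as below. This is a minimal illustration under some assumptions: the class prior is held fixed (only $\mu$ is updated, as in the M-step above), a fixed number of iterations stands in for a convergence test, and the function name `em_unit_variance` and the synthetic data are hypothetical.

```python
import numpy as np

def em_unit_variance(X, mu, prior, n_iter=50):
    """Soft EM for the unit-variance model (assumes n_iter >= 1).

    X has shape (n, d), mu has shape (k, d), prior has shape (k,).
    """
    mu = np.array(mu, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[j, c], computed in log space.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        log_w = np.log(prior)[None, :] - 0.5 * sq
        log_w -= log_w.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted mean for each class.
        mu = (gamma.T @ X) / gamma.sum(axis=0)[:, None]
    return mu, gamma

# Hypothetical synthetic data: two unit-variance blobs around (0, 0) and (6, 6).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 6.0])
mu_init = np.array([[1.0, 1.0], [5.0, 5.0]])
prior = np.array([0.5, 0.5])

mu_hat, gamma = em_unit_variance(X, mu_init, prior)
print(mu_hat)
```

Each row of `gamma` sums to 1, and the estimated means should land close to the two blob centers used to generate the data.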

### Special Case: K-Means

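If $P(C=c) = 1/k$ and we make a hard assignment, setting $\gamma_{j,c} = 1$ for the class $c$ with the highest posterior and $\gamma_{j,c} = 0$ for all others, then:  

- E-step: assign $x_j$ to the nearest $\mu_c$ (Euclidean distance), because with a uniform prior and unit variance the posterior is largest where $\Vert x_j - \mu_c \Vert^2$ is smallest  
- M-step: $\mu_c$ = mean of the points assigned to class $c$  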

Hence K-Means is a hard EM version of Gaussian Naive Bayes with unit variance.
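
The equivalence of the hard E-step and the nearest-centroid rule can be checked numerically. The sketch below uses random, purely illustrative data and compares the arg-max of the unnormalized log posterior under a uniform prior with the arg-min of the squared distance.

```python
import numpy as np

# Hypothetical random data and centroids, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
mu = rng.normal(size=(4, 3))
k = len(mu)

# Squared distance from every point to every centroid, shape (200, 4).
sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

# Unnormalized log posterior with uniform prior 1/k and unit variance.
log_post = np.log(1.0 / k) - 0.5 * sq

hard_e_step = log_post.argmax(axis=1)  # hard EM assignment
nearest = sq.argmin(axis=1)            # K-Means assignment

print(np.array_equal(hard_e_step, nearest))
```

Because the log posterior is a decreasing function of the squared distance, the two assignments pick the same class for every point.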
