# K Means From Scratch

## The Basic Idea

Suppose we are given a dataset in which each observed example has a set of features but
**no** labels. What can we do with it?

One task we can perform on unlabeled data is to find groups of examples that are similar to
one another. These groups are called clusters.

K Means is a clustering algorithm. It stores k centroids that it uses to define clusters. A point
is considered to be in a particular cluster if it is closer to that cluster's centroid than to any
other centroid.
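
As a small illustration of this membership rule, the sketch below assigns a point to the nearest of two centroids using Euclidean distance. The function name `nearest_centroid` and the toy coordinates are made up for this example.

```python
import numpy as np

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    point = np.asarray(point, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # One distance per centroid: the norm of each row of (centroids - point).
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances))

# Two hypothetical centroids in the plane.
centroids = [[0.0, 0.0], [10.0, 10.0]]
cluster_a = nearest_centroid([1.0, 2.0], centroids)  # closer to (0, 0)
cluster_b = nearest_centroid([8.0, 9.0], centroids)  # closer to (10, 10)
```

Here `cluster_a` is 0 and `cluster_b` is 1, so the two points fall into different clusters.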

## The Algorithm

We are given a training set $\{ x^1, x^2,..., x^m \}$ and want to group the data into a few
cohesive "clusters". Each data point is a feature vector $x^i \in\mathbb{R}^n$ as usual, but
no labels are provided. Our goal is to find k centroids **and** a cluster label $c^i$ for each
data point.

1. Initialize **cluster centroids** $\mu_1, \mu_2,...,\mu_k\in\mathbb{R}^n$ randomly.
2. Repeat until convergence:

$$
\text{Set } c^i := \arg\min_{j} \Vert x^i - \mu_j\Vert ^2 \text{ }\text{  for every i}
$$

$$
\text{Set } \mu_j := \frac{\sum_{i=1}^m \chi_{\{c^i = j\}}x^i}{\sum_{i=1}^m \chi_{\{c^i = j\}}} \text{ }\text{  for every j}
$$
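
The two updates can be traced by hand on a tiny dataset. The sketch below runs a single iteration: it labels each point with the index of its nearest centroid, then moves each centroid to the mean of its points. The helper names `assign_labels` and `update_centroids`, the four points, and the starting centroids are all illustrative choices, and the update assumes no cluster ends up empty.

```python
import numpy as np

def assign_labels(X, mu):
    # Squared Euclidean distance between every point and every centroid: shape (m, k).
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # c^i is the index j of the closest centroid.
    return sq_dist.argmin(axis=1)

def update_centroids(X, c, k):
    # mu_j is the mean of the points with c^i = j (assumes no cluster is empty).
    return np.array([X[c == j].mean(axis=0) for j in range(k)])

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
mu = np.array([[1.0, 1.0], [8.0, 1.0]])  # hand-picked starting centroids

c = assign_labels(X, mu)          # labels: [0 0 1 1]
mu = update_centroids(X, c, k=2)  # centroids move to (0, 1) and (10, 1)
print(c)
print(mu)
```

Running the two steps again leaves both the labels and the centroids unchanged, which is what convergence means for this toy example.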


## Implementation

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('clustering.csv')

In [None]:
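def get_centroids(dataset, labels, k):
    # Mean of the points assigned to each cluster (assumes no cluster is empty).
    points = np.asarray(dataset, dtype=float)
    return np.array([points[np.asarray(labels) == j].mean(axis=0) for j in range(k)])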

In [None]:
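def get_labels(dataset, centroids):
    # Index of the nearest centroid (squared Euclidean distance) for each point.
    diffs = np.asarray(dataset, dtype=float)[:, None, :] - np.asarray(centroids)[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)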

In [None]:
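MAX_ITERATIONS = 100  # assumed cap on iterations; tune for your data

def should_stop(old_centroids, centroids, iterations):
    # Stop after too many iterations, or once the centroids stop changing.
    if iterations > MAX_ITERATIONS:
        return True
    if old_centroids is None:
        return False  # first pass: nothing to compare against yet
    return np.array_equal(old_centroids, centroids)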
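
Some implementations prefer to stop once the centroids have moved by less than a small tolerance, which can save a few iterations on large datasets. The variant below is one possible sketch of that idea, assuming the centroids are NumPy arrays. The name `should_stop_with_tolerance` and the defaults for `max_iterations` and `tol` are hypothetical and not part of the notebook.

```python
import numpy as np

def should_stop_with_tolerance(old_centroids, centroids, iterations,
                               max_iterations=100, tol=1e-6):
    # Hypothetical variant: max_iterations and tol are illustrative defaults.
    if iterations > max_iterations:
        return True
    if old_centroids is None:
        return False  # first pass: nothing to compare against yet
    # Converged when no coordinate of any centroid moved by more than tol.
    return bool(np.allclose(old_centroids, centroids, rtol=0.0, atol=tol))
```

Setting `tol=0.0` makes this check equivalent to requiring that the centroids are exactly unchanged.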

In [None]:
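def kmean(dataset, k):
    # Work on a plain float array so a DataFrame or a list of rows both work.
    points = np.asarray(dataset, dtype=float)
    # Start from k distinct data points chosen at random
    # (one common way to initialize; an assumption here).
    rng = np.random.default_rng()
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    iterations = 0
    old_centroids = None

    while not should_stop(old_centroids, centroids, iterations):
        old_centroids = centroids
        iterations += 1

        # Assignment step: label each point with its nearest centroid.
        labels = get_labels(points, centroids)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = get_centroids(points, labels, k)

    return centroids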


## Expectation Maximization

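Note that K Means can be viewed as a hard-assignment variant of the Expectation Maximization
algorithm applied to a particular Naive Bayes model: assigning labels plays the role of the
E-step, and recomputing the centroids plays the role of the M-step.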
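
To make the connection a little more concrete, the sketch below assumes one specific model: equal-weight spherical Gaussians that share a standard deviation `sigma`. Under that assumption the E-step gives each point a soft weight for every cluster, and as `sigma` shrinks those weights approach the one-hot labels that K Means uses. The function name `soft_assignments` and the toy numbers are illustrative only.

```python
import numpy as np

def soft_assignments(X, mu, sigma):
    # E-step responsibilities for equal-weight spherical Gaussians with shared
    # standard deviation sigma (an assumed model, chosen for illustration).
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

X = np.array([[1.0, 0.0], [4.0, 0.0]])
mu = np.array([[0.0, 0.0], [6.0, 0.0]])

wide = soft_assignments(X, mu, sigma=10.0)   # nearly uniform responsibilities
narrow = soft_assignments(X, mu, sigma=0.1)  # nearly one-hot responsibilities
hard = narrow.argmax(axis=1)                 # same labels as the nearest-centroid rule
```

With a large `sigma` both points are shared almost evenly between the two clusters, while with a small `sigma` each point belongs almost entirely to its nearest centroid.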