# K-Means and K-Medoids clustering from "Scratch"

The k-means clustering algorithm is one of the most basic forms of unsupervised
machine learning, and is foundational to a number of other ML methods.
K-medoids is a related variant with a very similar algorithm.

In this notebook I'll go over the general algorithms and the mathematical
intuitions behind them, then implement simple "from scratch" functions in
Python (i.e., a clear translation of the algorithm and math, but inefficient
compared to packaged implementations).


## The $k$-means clustering algorithm

The algorithm is really very straightforward:  
1. Initialize either with $k$ randomly selected points in the $n$-dimensional
   feature space or $k$ randomly selected data points
2. Assign all data points to the nearest of those randomly selected points
3. Find the centroid (mean) of each of the $k$ resulting clusters
4. Reassign all data points to the closest of those new $k$ centroids
5. Repeat steps 3 and 4 until the assignments and/or centroids stop changing.

The goal is simple: find an assignment of all points such that every data
point is closer to its own cluster's centroid than it is to any other
cluster's centroid.
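
One common way to write this goal down, assuming Euclidean distance, is as the
within-cluster sum of squares. The symbols $C_j$ (the set of points assigned to
cluster $j$), $\mu_j$ (its centroid), and $J$ are notation introduced here for
illustration and are not used elsewhere in the notebook:

$$
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

Under that assumption, reassigning each point to its nearest centroid cannot
increase $J$, and neither can recomputing each centroid as the mean of its
cluster, which is why alternating the two steps settles down.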

So, we basically have five tasks to code:
1. Initialize
2. Distance
3. Assignment
4. Find centers
5. Stopping criteria


### Setup

The only package we'll be using is `numpy`.

In [1]:
import numpy as np

### Initialize $k$ centers

For this example, we'll select $k$ of our data points at random to facilitate
convergence. Alternatively, one could assign $k$ arbitrary points somewhere
within the overall range of the data.

To help randomize the selection, we'll draw the $k$ points with
[`numpy.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)
from a "shuffled" data set produced with
[`numpy.random.shuffle`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html).
Note that `shuffle` reorders the rows in place and returns `None`.


In [5]:
def diy_kmeans_initialize(X, k_clust):

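    # Shuffle a copy of the data matrix. np.random.shuffle works in place
    # and returns None, so its result must not be assigned; copying also
    # leaves the caller's X untouched.

    X_shuffled = X.copy()
    np.random.shuffle(X_shuffled)

    # Choose k distinct random rows from the shuffled matrix

    k_centers = X_shuffled[np.random.choice(
        X_shuffled.shape[0], k_clust, replace=False), :]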

    return k_centers
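
The alternative mentioned earlier, placing $k$ arbitrary points somewhere within
the overall range of the data, could be sketched roughly as follows. The
function name `diy_kmeans_initialize_range` and the `seed` parameter are
hypothetical, and sampling each feature uniformly between its minimum and
maximum is just one reasonable reading of "within the range".

```python
import numpy as np


def diy_kmeans_initialize_range(X, k_clust, seed=None):
    # Hypothetical helper: not part of the original notebook.
    rng = np.random.default_rng(seed)

    # Per-feature lower and upper bounds of the data
    lo = X.min(axis=0)
    hi = X.max(axis=0)

    # Draw k points uniformly inside the bounding box of the data
    return rng.uniform(lo, hi, size=(k_clust, X.shape[1]))
```

Centers drawn this way can land far from any data, so a cluster may start out
empty; picking actual data points avoids that.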


### Distance
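
A minimal sketch of the distance step, assuming plain Euclidean distance
between every data point and every center. The name `diy_kmeans_distance` is
hypothetical; it just follows the naming used for the initializer.

```python
import numpy as np


def diy_kmeans_distance(X, k_centers):
    # Hypothetical helper: not part of the original notebook.
    # Broadcast to shape (n_points, k, n_features): one difference
    # vector per (point, center) pair
    diffs = X[:, np.newaxis, :] - k_centers[np.newaxis, :, :]

    # Euclidean norm over the feature axis gives an (n_points, k) matrix
    return np.sqrt((diffs ** 2).sum(axis=2))
```

Broadcasting keeps the code short, but it builds an intermediate array of size
`n_points * k * n_features`, which is fine for a demonstration and wasteful on
large data.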

### Assignment
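
Assignment then amounts to picking, for each point, the index of its nearest
center. The sketch below is self-contained, so it recomputes squared Euclidean
distances inline (the square root does not change which center is nearest).
The name `diy_kmeans_assign` is hypothetical.

```python
import numpy as np


def diy_kmeans_assign(X, k_centers):
    # Hypothetical helper: not part of the original notebook.
    # Squared Euclidean distance from every point to every center
    diffs = X[:, np.newaxis, :] - k_centers[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)

    # Label each point with the index of its closest center;
    # ties go to the lowest index
    return np.argmin(sq_dists, axis=1)
```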

### Find $k$ centers
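
For $k$-means, each new center is the mean of the points currently assigned to
that cluster. A rough sketch follows; the name `diy_kmeans_centers` is
hypothetical, and leaving an empty cluster's center where it was is an
assumption made here, not something the text above specifies.

```python
import numpy as np


def diy_kmeans_centers(X, labels, k_centers):
    # Hypothetical helper: not part of the original notebook.
    # Start from a float copy of the previous centers
    new_centers = k_centers.astype(float)

    for j in range(k_centers.shape[0]):
        members = X[labels == j]

        # Assumption: a cluster with no members keeps its old center
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)

    return new_centers
```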



### Stopping criteria
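
Following step 5 of the algorithm, one possible stopping check is to stop when
either the assignments are unchanged or no center has moved by more than a
small tolerance. The name `diy_kmeans_converged` and the `tol` default are
hypothetical choices for this sketch.

```python
import numpy as np


def diy_kmeans_converged(old_labels, new_labels,
                         old_centers, new_centers, tol=1e-8):
    # Hypothetical helper: not part of the original notebook.
    # Criterion 1: no point changed cluster
    same_labels = np.array_equal(old_labels, new_labels)

    # Criterion 2: the largest center movement is within tolerance
    shifts = np.linalg.norm(new_centers - old_centers, axis=1)
    small_shift = shifts.max() <= tol

    return bool(same_labels or small_shift)
```

In practice a cap on the number of iterations is usually added as well, so the
loop ends even if neither criterion is met.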