# K-Means

For this problem, you will be implementing the K-Means algorithm. This is an unsupervised learning algorithm for clustering problems. That is, its objective is to produce a partitioning over a dataset without (explicit) external supervision of which group each datapoint should belong to.

Your initial implementation should be a standard K-means algorithm with the Euclidean distance metric. A concise description can be found in [Andrew Ng's lecture notes on K-Means](http://cs229.stanford.edu/notes2020spring/cs229-notes7a.pdf). The first part of Chapter 9 of _Pattern Recognition and Machine Learning_ by Christopher M. Bishop also gives a good overview of the algorithm, as well as its connection to the Expectation Maximization (EM) algorithm.

We have provided some skeleton code for the classifier, along with a couple of utility functions in the [k_means.py](./k_means.py) module. Please fill out the functions marked with `TODO` and feel free to add extra constructor arguments as you see fit (just make sure the default constructor solves the first dataset).
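
For orientation, the assignment/update loop of standard K-means can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the skeleton in `k_means.py`: the function name `kmeans_sketch` and its parameters (`k`, `n_iters`, `seed`) are hypothetical, and centroids are simply initialized at randomly chosen data points.

```python
import numpy as np

def kmeans_sketch(X, k=2, n_iters=100, seed=0):
    """Plain Lloyd iterations; names and defaults are illustrative only."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids at k distinct, randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[z == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment, consistent with the returned centroids
    z = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return centroids, z
```

The loop stops early once the centroids no longer move, which is the usual convergence criterion; a fixed iteration cap guards against slow convergence.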

In [None]:
%load_ext autoreload

We begin by loading necessary packages. Below follows a short description of the imported modules:

- `numpy` is the de facto Python package for numerical calculation. Most other numerical libraries (including pandas) are based on numpy.
- `pandas` is a widely used package for manipulating (mostly) tabular data
- `matplotlib` is the most used plotting library for python
- `seaborn` contains several convenience functions for matplotlib and integrates very well with pandas
- `k_means` refers to the module in this folder that should be further implemented by you.

Note: The `%autoreload` statement is an [IPython magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that automatically reloads the newest version of all imported modules within the cell. This means that you can edit the `k_means.py` file and just rerun this cell to get the updated version.

In [None]:
%autoreload

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import k_means as km # <-- Your implementation

sns.set_style('darkgrid')

## [1] First Dataset

The first dataset is a simple problem that is well suited for K-Means. It consists of 500 datapoints ($x_0, x_1 \in \mathbb{R}$) that should be partitioned into two clusters.

### [1.1] Load Data

We begin by loading data from a .csv file located in the same folder as this notebook.

In [None]:
data_1 = pd.read_csv('data_1.csv')
data_1.describe().T

### [1.2] Visualize

Since the data is 2-dimensional, it lends itself nicely to visualization with a scatter plot. From this, it should be evident what a sensible clustering should look like.

In [None]:
plt.figure(figsize=(5, 5))
sns.scatterplot(x='x0', y='x1', data=data_1)
plt.xlim(0, 1); plt.ylim(0, 1)

### [1.3] Fit and Evaluate

Next we fit and evaluate your K-Means clustering model over the dataset. It should work with the default constructor, but it is perfectly fine if you make the default constructor configure the algorithm for two centroids. 

We will quantitatively evaluate the solution according to _distortion_ and the _silhouette score_ (both assuming a Euclidean distance metric).

- The distortion measure is equal to the sum of squared distances between each point and the centroid it is assigned to. It favors cohesive clusters, i.e. clusters where all points are close to their centroids, and is used as a minimization objective by K-Means.

- The [silhouette score](https://en.wikipedia.org/wiki/Silhouette_(clustering)) measures both cluster cohesion and separation, i.e. it also accounts for the degree to which each cluster is isolated from other clusters. It takes on values in the range [-1, 1] and is subject to maximization.
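
To make the two measures concrete, here is a small NumPy sketch of both. It is an illustration only, not the code behind `km.euclidean_distortion` or `km.euclidean_silhouette`: the function names are hypothetical, each centroid is assumed to be the mean of its cluster, and at least two clusters are assumed for the silhouette.

```python
import numpy as np

def distortion_sketch(X, z):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for label in np.unique(z):
        members = X[z == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def silhouette_sketch(X, z):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    labels = np.unique(z)
    # Pairwise Euclidean distances, shape (n, n)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = z == z[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: silhouette is defined as 0
        # a: mean distance to the other members of the own cluster (D[i, i] = 0)
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, z == other].mean() for other in labels if other != z[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Two tight, well-separated pairs of points
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
z_demo = np.array([0, 0, 1, 1])
print(distortion_sketch(X_demo, z_demo))  # 4.0: every point is 1 unit from its mean
print(silhouette_sketch(X_demo, z_demo))  # about 0.80
```

The pairwise distance matrix costs O(n²) memory, which is fine for datasets of a few hundred points such as the ones in this notebook.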

Note that `.fit`, `.predict`, and `.get_centroids` will crash until you implement these methods in [k_means.py](./k_means.py). The `.get_centroids` method is used to fetch the cluster centroids, which are visualized as stars in the figure.

Assuming a standard implementation of K-means, you should expect to get a silhouette score of ~0.67 and a distortion measure of ~8.8. You can also verify that everything is working as it should by inspecting the generated figure. A working algorithm should generate centroids such that all the points with $x_0 < 0.5$ (approximately) are assigned to one cluster and the remaining are assigned to the other.

In [None]:
# Fit Model 
X = data_1[['x0', 'x1']]
X = np.array(X)

model_1 = km.KMeans(version='v1') 
model_1.fit(X)

# Compute Silhouette Score 
z = model_1.predict(X)

print(f'Silhouette Score: {km.euclidean_silhouette(X, z) :.3f}')
print(f'Distortion: {km.euclidean_distortion(X, z) :.3f}')

# Plot cluster assignments
C = model_1.get_centroids()
K = len(C)

X = data_1[['x0', 'x1']]

_, ax = plt.subplots(figsize=(5, 5), dpi=100)
sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)
ax.legend().remove()

Here we plot the convergence of the model over three successive one-epoch fits.

In [None]:
plt.figure(figsize=(20, 5), dpi=100)

model_1 = km.KMeans(version='v1') 

for k in range(3):
    model_1.fit(X, nb_epochs=1)
    z = model_1.predict(X)
    C = model_1.get_centroids()
    K = len(C)

    ax = plt.subplot(1, 3, k + 1)
    X = data_1[['x0', 'x1']]
    sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
    sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)

    ax.legend().remove()

## [2] Second Dataset

The second dataset is superficially similar to the first one. The goal is still to partition a two-dimensional dataset into mutually exclusive groups, but it is designed to be a bit more challenging.

### [2.1] Load Data

This dataset can also be found in a .csv file in the same folder as this notebook.

In [None]:
data_2 = pd.read_csv('data_2.csv')
data_2.describe().T

We min-max normalize each feature of the dataset so that it lies in the range [0, 1]:

$X_{\text{normalized}} = ( X - X_{\min} ) / ( X_{\max} - X_{\min} )$


In [None]:
# Min-max scale both features to [0, 1]
cols = ['x0', 'x1']
data_2[cols] = (data_2[cols] - data_2[cols].min()) / (data_2[cols].max() - data_2[cols].min())

### [2.2] Visualize Data

As can be seen, there are substantially more clusters in this dataset. We generated a total of 8 clusters that your algorithm should be able to identify. It is ok if you pass information about the number of clusters to your model during instantiation, but it should be able to initialize itself and identify a good clustering without any external information.

In [None]:
plt.figure(figsize=(5, 5))
sns.scatterplot(x='x0', y='x1', data=data_2)

### [2.3] Fit and Evaluate

Again, we fit the model to the data, measure distortion and silhouette score, and visualize the resulting clusters. You may experience that the algorithm you implemented for the first dataset fails to find all the clusters, at least consistently. 

Feel free to add extra functionality to the algorithm and/or the data preprocessing pipeline that improves performance. It might be useful to run the algorithm for one iteration at a time and plot the resulting clustering to get a better idea of what is going on. 

As a debugging reference, it should be possible to create an implementation that finds all 8 clusters at least 9 times out of 10 with randomized initialization. 
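
One common way to make randomly initialized K-means more reliable is to run it several times and keep the run with the lowest distortion. The sketch below illustrates that idea under stated assumptions. The `loop_tries` argument used in the next cell suggests something similar, but this is not that implementation: the names `lloyd`, `best_of_restarts`, `n_tries`, and `seed` are hypothetical.

```python
import numpy as np

def lloyd(X, centroids, n_iters=50):
    """Run a fixed number of assignment/update steps from the given centroids."""
    for _ in range(n_iters):
        z = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # Empty clusters keep their previous centroid
        centroids = np.array([X[z == j].mean(axis=0) if np.any(z == j) else centroids[j]
                              for j in range(len(centroids))])
    z = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return centroids, z

def best_of_restarts(X, k, n_tries=20, seed=0):
    """Return (distortion, centroids, labels) of the best of n_tries random starts."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_tries):
        # Fresh random initialization: k distinct data points
        init = X[rng.choice(len(X), size=k, replace=False)]
        C, z = lloyd(X, init)
        distortion = ((X - C[z]) ** 2).sum()
        if best is None or distortion < best[0]:
            best = (distortion, C, z)
    return best
```

Because distortion is the quantity K-means minimizes, comparing runs by distortion is a natural selection rule; the cost is simply `n_tries` times the runtime of a single fit.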

In [None]:
# Fit Model 
X = data_2[['x0', 'x1']]
X = np.array(X)

K = 8
nb_epochs = 10

model_2 = km.KMeans(K=K, version='v2')
model_2.fit(X, loop_tries=20, nb_epochs=nb_epochs)

# Compute Silhouette Score 
z = model_2.predict(X)
print(f'Distortion: {km.euclidean_distortion(X, z) :.3f}')
print(f'Silhouette Score: {km.euclidean_silhouette(X, z) :.3f}')

# Plot cluster assignments
C = model_2.get_centroids()
K = len(C)

X = data_2[['x0', 'x1']]

_, ax = plt.subplots(figsize=(5, 5), dpi=100)
sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)
ax.legend().remove()

We plot the convergence of the model here.

In [None]:
plt.figure(figsize=(20, 5), dpi=100)

clusters = model_2.cluster_to_plot

for ep in range(4):
    C = clusters[ep]
    K = len(C)
    ax = plt.subplot(1, 4, ep + 1)
    X = data_2[['x0', 'x1']]
    sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
    sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)

    ax.legend().remove()

We plot the distortion reached by each try, i.e. for each random initialization of the centroids $\mu^{(j)}$.

In [None]:
distortions = model_2.training_distortions

plt.figure(figsize=(10, 5), dpi=100)
plt.plot(distortions)

Here we plot how the distortion of each try evolves over the epochs.

In [None]:

plt.figure(figsize=(10, 5), dpi=100)

distortions_to_plot = model_2.training_distortions_to_plot

x = range(nb_epochs + 1)

for d in distortions_to_plot:
    # Pad tries that stopped early by repeating their last distortion value
    while len(d) < len(x):
        d.append(d[-1])
    plt.plot(x, d)

## [3] Further Steps (optional)

If you're done with the assignment but want some more challenges, consider the following:

- Modify your clustering algorithm so that the user no longer has to enter the number of clusters manually.
- K-means makes hard cluster assignments. Try implementing the [EM-algorithm](https://en.wikipedia.org/wiki/Expectation–maximization_algorithm) to fit a Gaussian mixture model to the data above.
- Implement a clustering algorithm that solves the dataset below. 

In [None]:
data_bonus = pd.read_csv('data_bonus.csv')
plt.figure(figsize=(5, 5))
sns.scatterplot(x='x0', y='x1', data=data_bonus)

In [None]:
# Fit Model 
X = data_bonus[['x0', 'x1']]
X = np.array(X)

K = 5

model_bonus = km.KMeans(K=K, version='v2') 
model_bonus.fit(X, loop_tries=20)

# Compute Silhouette Score 
z = model_bonus.predict(X)
print(f'Distortion: {km.euclidean_distortion(X, z) :.3f}')
print(f'Silhouette Score: {km.euclidean_silhouette(X, z) :.3f}')

# Plot cluster assignments
C = model_bonus.get_centroids()
K = len(C)

X = data_bonus[['x0', 'x1']]

_, ax = plt.subplots(figsize=(5, 5), dpi=100)
sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)
ax.legend().remove()

We modified our clustering algorithm (v3) so that the user no longer has to enter the number of clusters manually. We chose the silhouette method to do so.
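
The idea behind the silhouette method is to cluster the data for several candidate values of K and keep the value with the highest mean silhouette score. The following is a minimal sketch of that idea, not the `v3` code in `k_means.py`. All function names are hypothetical, and a deterministic farthest-point seeding is swapped in for random initialization purely to keep the example reproducible.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic seeding (assumption): start at X[0], then repeatedly add
    the point that is farthest from all seeds chosen so far."""
    seeds = [X[0]]
    for _ in range(1, k):
        d = np.linalg.norm(X[:, None] - np.array(seeds)[None], axis=2).min(axis=1)
        seeds.append(X[d.argmax()])
    return np.array(seeds, dtype=float)

def kmeans_labels(X, k, n_iters=50):
    """Standard K-means iterations; returns only the cluster labels."""
    C = farthest_point_init(X, k)
    for _ in range(n_iters):
        z = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        C = np.array([X[z == j].mean(axis=0) if np.any(z == j) else C[j]
                      for j in range(k)])
    return np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)

def mean_silhouette(X, z):
    """Mean silhouette coefficient (singleton clusters score 0)."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = z == z[i]
        if same.sum() == 1:
            continue
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, z == c].mean() for c in np.unique(z) if c != z[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

def choose_k(X, k_values=range(2, 7)):
    """Return the candidate K with the highest mean silhouette, plus all scores."""
    X = np.asarray(X, dtype=float)
    scores = {k: mean_silhouette(X, kmeans_labels(X, k)) for k in k_values}
    return max(scores, key=scores.get), scores
```

The silhouette score tends to favor compact, well-separated groups, so when several true clusters lie close together it can prefer merging them, which yields a smaller K than the data was generated with.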

In [None]:
X = np.array(data_1[['x0', 'x1']])

model_bonus = km.KMeans(version='v3') 
model_bonus.fit(X, loop_tries=30)

# Compute Silhouette Score 
z = model_bonus.predict(X)
print(f'Distortion: {km.euclidean_distortion(X, z) :.3f}')
print(f'Silhouette Score: {km.euclidean_silhouette(X, z) :.3f}')

# Plot cluster assignments
C = model_bonus.get_centroids()
K = len(C)

X = data_1[['x0', 'x1']]

_, ax = plt.subplots(figsize=(5, 5), dpi=100)
sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)
ax.legend().remove()

Unfortunately, it does not work well on dataset 2: it finds only 6 clusters (rather than 8).

In [None]:
X = np.array(data_2[['x0', 'x1']])

model_bonus = km.KMeans(version='v3') 
model_bonus.fit(X, loop_tries=30)

# Compute Silhouette Score 
z = model_bonus.predict(X)
print(f'Distortion: {km.euclidean_distortion(X, z) :.3f}')
print(f'Silhouette Score: {km.euclidean_silhouette(X, z) :.3f}')

# Plot cluster assignments
C = model_bonus.get_centroids()
K = len(C)

X = data_2[['x0', 'x1']]

_, ax = plt.subplots(figsize=(5, 5), dpi=100)
sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)
ax.legend().remove()

We plot the distortion (left) and the silhouette score (right) obtained for each candidate number of clusters k.

In [None]:
distortions = model_bonus.training_distortions
silhouettes = model_bonus.training_silhouettes
x = np.array(range(2, model_bonus.K + 1))
plt.figure(figsize=(10, 5), dpi=100)
plt.subplot(1, 2, 1)
plt.plot(x, distortions)
plt.subplot(1, 2, 2)
plt.plot(x, silhouettes)

We tried to implement a version of the EM algorithm to fit a Gaussian mixture model to the data above, but it did not work well on either dataset and we did not have enough time to fix it.
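
For reference, a textbook EM loop for a Gaussian mixture with full covariance matrices can be sketched as follows. This is not the `version='em'` code used in the cells of this section; the function name, its parameters, and the small covariance regularizer `reg` are assumptions made for illustration. The E-step is computed in log space to avoid numerical underflow.

```python
import numpy as np

def gmm_em_sketch(X, k, n_iters=100, seed=0, reg=1e-6):
    """EM for a Gaussian mixture model; returns parameters, responsibilities,
    and the log-likelihood recorded at every iteration."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Assumption: means start at random data points, covariances at the data covariance
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    log_likelihoods = []
    for _ in range(n_iters):
        # E-step: log(weight_j * N(x_i | mean_j, cov_j)) for every point and component
        log_p = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum('ni,ij,nj->n', diff, inv, diff)
            log_p[:, j] = np.log(weights[j]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        # Normalize with log-sum-exp to obtain responsibilities (soft assignments)
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        resp = np.exp(log_p - log_norm)
        log_likelihoods.append(float(log_norm.sum()))
        # M-step: re-estimate weights, means, and covariances from the responsibilities
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j] + reg * np.eye(d)
    return weights, means, covs, resp, log_likelihoods
```

A useful debugging check is that the recorded log-likelihood should not decrease from one iteration to the next (up to numerical noise); a clearly decreasing curve usually points to a bug in the E-step or M-step. Hard labels comparable to K-means output can be obtained with `resp.argmax(axis=1)`.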

In [None]:
X = np.array(data_2[['x0', 'x1']])

model_bonus = km.KMeans(version='em', K=8) 
model_bonus.fit(X, loop_tries=30, nb_epochs=20000, tolerance=0.001)

# Compute Silhouette Score 
z = model_bonus.predict(X)
print(f'Distortion: {km.euclidean_distortion(X, z) :.3f}')
print(f'Silhouette Score: {km.euclidean_silhouette(X, z) :.3f}')

# Plot cluster assignments
C = model_bonus.get_centroids()
K = len(C)

X = data_2[['x0', 'x1']]

_, ax = plt.subplots(figsize=(5, 5), dpi=100)
sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)
ax.legend().remove()

In [None]:
x = np.array(range(1, len(model_bonus.likelihoods)))
plt.figure(figsize=(10, 5), dpi=100)
plt.plot(x, model_bonus.likelihoods[1:])

In [None]:
X = np.array(data_bonus[['x0', 'x1']])

model_bonus = km.KMeans(version='em', K=5) 
model_bonus.fit(X, nb_epochs=1000)

# Compute Silhouette Score 
z = model_bonus.predict(X)
print(f'Distortion: {km.euclidean_distortion(X, z) :.3f}')
print(f'Silhouette Score: {km.euclidean_silhouette(X, z) :.3f}')

# Plot cluster assignments
C = model_bonus.get_centroids()
K = len(C)

X = data_bonus[['x0', 'x1']]

_, ax = plt.subplots(figsize=(5, 5), dpi=100)
sns.scatterplot(x='x0', y='x1', hue=z, hue_order=range(K), palette='tab10', data=X, ax=ax)
sns.scatterplot(x=C[:,0], y=C[:,1], hue=range(K), palette='tab10', marker='*', s=250, edgecolor='black', ax=ax)
ax.legend().remove()

In [None]:
x = np.array(range(1, len(model_bonus.likelihoods)))
plt.figure(figsize=(10, 5), dpi=100)
plt.plot(x, model_bonus.likelihoods[1:])