## Experiment 1: Synthetic Data

We start by creating some synthetic data to illustrate the clustering algorithms we have seen this week. Scikit-learn provides a neat function for this, `make_blobs`, which generates synthetic data for clustering.

**You can click and drag the 3D charts to rotate them.** Enjoy!

In [None]:
!pip install plotly

In [None]:
import matplotlib.pyplot as plt
import random

import pandas as pd
import plotly.express as px

from sklearn.datasets import make_blobs

In [None]:
# Change these constants to modify the experiments below!
NUM_CLUSTERS = 7
NUM_SAMPLES = 300

X, y = make_blobs(
    n_samples=NUM_SAMPLES,
    n_features=3, # We will add just 3 features so we can plot the data in a 3D plot! :)
    centers=NUM_CLUSTERS,   # blobs or clusters (to see how well our clustering methods recognise them later)
    cluster_std=0.5,
    shuffle=True,
    random_state=0
)
# We won't use "y": this is unsupervised learning, so we don't need the labels

In [None]:
df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3'])

In [None]:
print(f''' Below you should see a 3D graph, which you can rotate with the synthetic data
spread in {NUM_CLUSTERS} clear clusters ''')
fig = px.scatter_3d(df, x='feature1', y='feature2', z='feature3')
fig.show()

In [None]:
# We import the clustering algorithms we have seen this week:
from sklearn.cluster import KMeans, MiniBatchKMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# MiniBatchKMeans is a faster implementation of KMeans which sacrifices a bit of 
# clustering performance to get faster results. Try it out if you want!

In [None]:
kmeans_random = KMeans(
    n_clusters=NUM_CLUSTERS,
    init='random',
    n_init=1, # We just want to run the algorithm once - not several times and get the best results!
    max_iter=100,
)
kmeans_random_clusters = kmeans_random.fit_predict(X)

In [None]:
df['kmeans_random_clusters'] = kmeans_random_clusters

In [None]:
fig = px.scatter_3d(df, x='feature1', y='feature2', z='feature3', 
                    color='kmeans_random_clusters', color_continuous_scale='Rainbow')
fig.show()

We can see how K-means with random initialisation got some of the clusters wrong (at least in my execution).
Let's now try with K-means++:

In [None]:
kmeans_plusplus = KMeans(
    n_clusters=NUM_CLUSTERS,
    init='k-means++',
    n_init=1, # We just want to run the algorithm once - not several times and get the best results!
    max_iter=100,
)
kmeans_plusplus_clusters = kmeans_plusplus.fit_predict(X)
df['kmeans_plusplus_clusters'] = kmeans_plusplus_clusters
fig = px.scatter_3d(df, x='feature1', y='feature2', z='feature3', 
                    color='kmeans_plusplus_clusters', color_continuous_scale='Rainbow')
fig.show()
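A single run can be lucky or unlucky, so one way to compare the two initialisations more fairly is to repeat each of them with several seeds and look at the final inertia (the within-cluster sum of squares, where lower means tighter clusters). The cell below is a minimal sketch of that idea. The names `X_demo`, `inertias_for` and the choice of 10 seeds are assumptions made for this illustration; the data is regenerated with the same `make_blobs` settings so the cell runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data settings as above, repeated so this cell is self-contained
X_demo, _ = make_blobs(n_samples=300, n_features=3, centers=7,
                       cluster_std=0.5, shuffle=True, random_state=0)


def inertias_for(init, seeds):
    """Run single-start K-means once per seed and collect the final inertia."""
    results = []
    for seed in seeds:
        model = KMeans(n_clusters=7, init=init, n_init=1,
                       max_iter=100, random_state=seed)
        model.fit(X_demo)
        results.append(model.inertia_)
    return results


seeds = list(range(10))  # hypothetical choice: 10 independent runs per strategy
random_inertias = inertias_for('random', seeds)
plusplus_inertias = inertias_for('k-means++', seeds)

for name, values in [('random', random_inertias), ('k-means++', plusplus_inertias)]:
    mean_value = sum(values) / len(values)
    print(f'{name:>10}: mean inertia = {mean_value:.1f}, worst = {max(values):.1f}')
```

If one strategy shows a much larger worst-case inertia, that usually means some of its runs merged two blobs or split one of them.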

Much better now! (**no blob is split across several colours**). GMM also seems to give good results:

In [None]:
gmm = GaussianMixture(
    n_components=NUM_CLUSTERS,
    n_init=1 # Same as before
)
gmm_clusters = gmm.fit_predict(X)
df['gmm_clusters'] = gmm_clusters
fig = px.scatter_3d(df, x='feature1', y='feature2', z='feature3', 
                    color='gmm_clusters', color_continuous_scale='Rainbow')
fig.show()
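Unlike K-means, a Gaussian mixture gives a soft assignment: each sample gets a probability of belonging to every component. The sketch below shows how to read those probabilities with `predict_proba`. The names `X_demo`, `gmm_demo` and `confidence` are assumptions for this illustration, and the data is regenerated so the cell runs on its own.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Same synthetic data settings as above, repeated so this cell is self-contained
X_demo, _ = make_blobs(n_samples=300, n_features=3, centers=7,
                       cluster_std=0.5, shuffle=True, random_state=0)

gmm_demo = GaussianMixture(n_components=7, n_init=1, random_state=0)
gmm_demo.fit(X_demo)

# One row per sample, one column per component; each row sums to 1
probabilities = gmm_demo.predict_proba(X_demo)

# How sure the model is about the most likely component of each sample
confidence = probabilities.max(axis=1)

print('Shape of the probability matrix:', probabilities.shape)
print('Lowest confidence across all samples:', round(float(confidence.min()), 3))
```

Samples with low confidence sit between two components, which is where a hard assignment is least trustworthy.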

In [None]:
# Change this to observe how Agglomerative Clustering changes its results:
DISTANCE_THRESHOLD = 5 # A smaller threshold stops merging earlier, so we get more (and smaller) clusters


agg_clustering = AgglomerativeClustering(
    n_clusters=None, # We don't need this, just the distance threshold
    linkage='single', # Try the other linkages we saw in the session: 'ward', 'complete', 'average', 'single'
    distance_threshold=DISTANCE_THRESHOLD,
)
agg_clusters = agg_clustering.fit_predict(X)
df['agg_clusters'] = agg_clusters
fig = px.scatter_3d(df, x='feature1', y='feature2', z='feature3', 
                    color='agg_clusters', color_continuous_scale='Rainbow')
fig.show()
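To get a feel for the distance threshold, it can help to sweep a few values and count how many clusters come out, using the fitted model's `n_clusters_` attribute. This is a minimal sketch; the list of thresholds and the names `X_demo` and `cluster_counts` are assumptions chosen for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Same synthetic data settings as above, repeated so this cell is self-contained
X_demo, _ = make_blobs(n_samples=300, n_features=3, centers=7,
                       cluster_std=0.5, shuffle=True, random_state=0)

thresholds = [0.5, 1, 2, 5, 10]  # hypothetical values to try
cluster_counts = {}

for threshold in thresholds:
    model = AgglomerativeClustering(
        n_clusters=None,               # let the threshold decide the number of clusters
        linkage='single',
        distance_threshold=threshold,
    )
    model.fit(X_demo)
    cluster_counts[threshold] = model.n_clusters_
    print(f'distance_threshold={threshold}: {model.n_clusters_} clusters')
```

A larger threshold allows more merges, so the number of clusters can only stay the same or go down as the threshold grows.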

## Experiment 2: Clustering Image Data

Just as we did in the Introduction to Machine Learning course, we will start with a standard dataset. In supervised classification it was the Iris dataset; here it is scikit-learn's handwritten digits dataset (`load_digits`), a small relative of MNIST.

This dataset consists of 1,797 images of hand-written digits. The digits are from 0 to 9, so our target feature has 10 classes. Unlike MNIST, it does not come with a predefined train/test split.

Each image is 8x8 pixels, in greyscale. Each pixel has an integer value between 0 and 16 that represents its intensity.

We will try to group the images that correspond to the same digit (without using the labels, just the pixels!). This time we will also use nearest neighbours, the unsupervised counterpart of KNN.

In [None]:
from sklearn.datasets import load_digits

from sklearn.neighbors import NearestNeighbors

In [None]:
digits = load_digits()
images = digits['data'] / 16 # Pixel values in load_digits go from 0 to 16, so this scales them to between 0 and 1

In [None]:
'''
Insert any number between 0 and 1796 (the dataset has 1,797 images) to
visualise one record. I put 234 as an example, but try some others:
'''
SAMPLE_RECORD_NUMBER = 234

print('This looks like a six:')
plt.gray() 
plt.matshow(digits.images[SAMPLE_RECORD_NUMBER]) 
plt.show() 

In [None]:
# Let's train our KNN model (without a target feature!)
knn = NearestNeighbors(
    n_neighbors=100, # There are some more parameters you could tweak, check documentation
)
knn.fit(images)

In [None]:
# Let's now check the nearest neighbors to our example above:
distances, neighbors = knn.kneighbors([images[SAMPLE_RECORD_NUMBER].reshape(-1)])

In [None]:
# Distances contain the Euclidean distances in an array, and neighbors contain the indices
# of the neighbor samples. Both are sorted by distance, in ascending order.

'''Let's plot the top 5 nearest neighbors of this instance. Index 0 is the query image
itself (distance 0), so we take indices 1 to 5.'''

for index in range(1, 6):
    plt.gray() 
    plt.matshow(digits.images[neighbors[0][index]]) 
    plt.show() 

### Yeah! 
Seems like they are all images of the number 6. KNN is working here!!
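Since this dataset does come with labels, we can also check the neighbours without plotting them. The sketch below looks up the true digit of the five nearest neighbours. The names `digits_demo`, `images_demo` and `query_index` are assumptions for this illustration, and the labels are used only to inspect the result, not to fit the model.

```python
from collections import Counter

from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

digits_demo = load_digits()
images_demo = digits_demo.data / 16  # scale pixel values (0 to 16) to between 0 and 1

# Ask for 6 neighbours: the closest one is normally the query image itself
nn_demo = NearestNeighbors(n_neighbors=6)
nn_demo.fit(images_demo)

query_index = 234  # same example record as above
_, neighbor_indices = nn_demo.kneighbors(images_demo[[query_index]])

# Skip position 0 (the query itself) and read the true digit of the rest
neighbor_labels = digits_demo.target[neighbor_indices[0][1:]]

print('Label of the query image:', digits_demo.target[query_index])
print('Labels of its 5 nearest neighbours:', neighbor_labels.tolist())
print('Counts per digit:', dict(Counter(neighbor_labels.tolist())))
```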

## Let's now try the MiniBatchKMeans: (faster KMeans)

In [None]:
kmeans_minibatch = MiniBatchKMeans(
    n_clusters=10, # We know the images are for the digits 0 to 9, so we should have 10 clusters!
    init='k-means++',
    n_init=10,
    max_iter=300,
)
kmeans_minibatch_clusters = kmeans_minibatch.fit_predict(images)

In [None]:
kmeans_minibatch_clusters

In [None]:
clusters = {}
for cluster_id, image in zip(kmeans_minibatch_clusters, images):
    if cluster_id in clusters:
        clusters[cluster_id].append(image)
    else:
        clusters[cluster_id] = [image]

In [None]:
# Select one cluster to display some samples in it!
# The cluster_id has nothing to do with the digits' value, but in each cluster,
# very similar images should be together: (select any number from 0 to 9)
CLUSTER_ID_TO_DISPLAY = 4

grid_size = 5

fig, axes = plt.subplots(grid_size, grid_size, sharex=True, sharey=True, figsize=(15, 15))

for row in axes:
    for i in range(grid_size):
        random_index = random.randint(0, len(clusters[CLUSTER_ID_TO_DISPLAY])-1)
        row[i].matshow(clusters[CLUSTER_ID_TO_DISPLAY][random_index].reshape(8, 8))

plt.show()
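Looking at random samples gives an impression, but a table of cluster against true digit makes it easier to see which clusters are mixed. The sketch below builds that table with `pd.crosstab` and computes a simple purity score (the share of images that carry the majority digit of their cluster). The names `composition` and `purity`, and the fixed `random_state`, are assumptions for this illustration; the labels are used only for evaluation.

```python
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_digits

digits_demo = load_digits()
images_demo = digits_demo.data / 16  # scale pixel values (0 to 16) to between 0 and 1

model = MiniBatchKMeans(n_clusters=10, init='k-means++', n_init=10,
                        max_iter=300, random_state=0)
cluster_ids = model.fit_predict(images_demo)

# Rows are cluster ids, columns are the true digits, cells are image counts
composition = pd.crosstab(cluster_ids, digits_demo.target,
                          rownames=['cluster'], colnames=['digit'])

# Purity: fraction of images that belong to the majority digit of their cluster
purity = composition.max(axis=1).sum() / len(cluster_ids)

print(composition)
print(f'Purity: {purity:.2f}')
```

A row dominated by a single column is a clean cluster; a row spread over two or three digits shows which digits the algorithm confuses.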

## Learning exercises:

* Can you apply more clustering algorithms to the digit images to improve performance?

* Can you tune the agglomerative clustering algorithm to work better?

* There are ways to plot a dendrogram with the scipy library and also with the plotly library. Can you plot one to see how your agglomerative clustering merges the samples?

* And as always, play with all of the constant parameters I specified in capitals to see and understand how the outputs change.