# Manifold Learning

Today we are going to work with another example dataset based on images, called Olivetti Faces. This dataset contains 400 images of some people's faces. Each image is 64 x 64 pixels (4096 in total).

That means that we have 4096 dimensions in the raw data!!
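
To make the "4096 dimensions" concrete, here is a minimal sketch of how a 64 x 64 pixel grid becomes one row of 4096 input features. It uses a random array as a stand-in for a face (an assumption for illustration only, not the real dataset), so it runs without downloading anything.

```python
import numpy as np

# Stand-in for one 64 x 64 grayscale image (random values, not a real face).
rng = np.random.default_rng(0)
image = rng.random((64, 64))

# Flattening turns the 2D pixel grid into one row of 4096 input features.
flat = image.reshape(-1)
print(flat.shape)  # (4096,)

# The operation is lossless: reshaping back recovers the original grid.
restored = flat.reshape(64, 64)
print(np.array_equal(image, restored))  # True
```

Each of those 4096 values is one coordinate of the image in the raw feature space.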

### How can we visually analyse 400 images in one chart? 
Well, that's precisely what Manifold Learning can help us with. But couldn't classic Dimensionality Reduction techniques help us too? Let's compare the two families of techniques and check why Manifold Learning is more suitable for this task. In theory it should do better for two reasons: similarities between images are nonlinear, and with 4096 dimensions a couple of linear components can only keep part of the information.
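
To illustrate what "nonlinear" means here, the sketch below builds a toy spiral (a hypothetical example, unrelated to the face data) and compares the straight-line distance between its two ends with the distance travelled along the curve. Manifold methods such as Isomap try to respect the second kind of distance, while linear methods only see the first.

```python
import numpy as np

# Points along a 2D spiral: a 1D curve ("manifold") rolled up in 2D space.
t = np.linspace(0.0, 4.0 * np.pi, 2000)
points = np.column_stack((t * np.cos(t), t * np.sin(t)))

# Straight-line (Euclidean) distance between the two ends of the spiral.
euclidean = np.linalg.norm(points[-1] - points[0])

# Distance travelled along the curve, approximated by summing small segments.
segments = np.linalg.norm(np.diff(points, axis=0), axis=1)
along_curve = segments.sum()

print(f"Straight line:   {euclidean:.1f}")
print(f"Along the curve: {along_curve:.1f}")
```

The two ends look close in a straight line, but they are far apart if you have to walk along the spiral.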

### In the 400 pictures, how many unique persons are represented?
There are 400 pictures, but we know (because the data is labelled) that there are 40 different people, with 10 pictures each. Therefore many pictures show the same person's face with slight differences. Could we also apply a clustering algorithm to group those faces and discover how many people there are (without using the true labels)? You could imagine that this is a security camera system and we want to automate the task of recognising unique persons.

Or another example, your phone camera probably puts a square around people's faces ([like this](https://i1.wp.com/revoseek.com/wp-content/uploads/2012/03/Hitachis-Camera-Recognize-a-Person.jpg?resize=600%2C375)) before taking the picture: we could take the images inside those squares over a period of time and analyse how many different persons you are taking pictures of.

This is what Facebook has been doing for years when it suggests that you tag someone in a picture, probably guessing correctly who the person in the photograph is.

In [None]:
import matplotlib.pyplot as plt
import random

import numpy as np
import pandas as pd

from sklearn.cluster import MiniBatchKMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, MDS, LocallyLinearEmbedding, TSNE # There are more. Check them out

from sklearn.datasets import fetch_olivetti_faces



In [None]:
'''
If the next line of code doesn't work, run this:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
'''

faces = fetch_olivetti_faces()

# faces is a dictionary-like object (a scikit-learn Bunch)

`faces` is a dictionary-like object with 4 keys:
* images: The actual 400 black and white images of 64x64 pixels of people.
* data: Same data as images, but as 400 samples of 4096 columns (the pixels flattened into input features)
* target: An array of 400 integers, representing the class (each class represents a different person)
* DESCR: A description of the dataset. Shown below.

In [None]:
print(faces.DESCR)

In [None]:
# 400 images, 64 by 64 pixels.
faces['images'].shape

In [None]:
'''
Insert any number between 0 and 399 to visualise one training data
record. I put 0 for example. BUT TRY SOME OTHER:
'''
SAMPLE_RECORD_NUMBER = 0

plt.gray() 
plt.matshow(faces.images[SAMPLE_RECORD_NUMBER]) 
plt.show() 

Let's check some consecutive faces. You can change the parameter `grid_size` to show more or fewer faces, and the parameter `j` to show other parts of the dataset (by default I am showing from index 100 onwards).

In [None]:
grid_size = 5

fig, axes = plt.subplots(grid_size, grid_size, sharex=True, sharey=True,figsize=(15, 15))
j=100

for row in axes:
    for i in range(grid_size):
        row[i].matshow(faces.images[j])
        j+=1

plt.show()

We humans can easily identify which pictures show the same person.

And in supervised learning we could do some feature selection and dimensionality reduction to create a classifier that could potentially perform well. We could also train an Artificial Neural Network, which would probably work very well.

But without any class labels, in **purely unsupervised learning**, how can we separate the pictures of each person?

To simplify this we will select the first 70 images. With the default loading order the pictures are grouped by person (10 each), so these should be the first 7 people. If what we do with just 70 pictures works, we could then expand it to all of the data (with the same parameters) and check if the results are still good.

In [None]:
N_IMAGES = 70

df = pd.DataFrame(faces['data'][:N_IMAGES])

In [None]:
# As we expected, 4096 input features...
df.shape

# Manifold Plots

Let's use Manifold Learning to plot the images...

In [None]:
cols = ['x', 'y']

# I left some other parameters as default. You can check sklearn's docs
isomap = Isomap(
    n_neighbors=5, #num of neighbors for KNN 
    n_components=2, # 2D data as a result
    n_jobs=-1
)
isomap_result = isomap.fit_transform(df)
isomap_result = pd.DataFrame(isomap_result)
isomap_result.columns = cols

plt.scatter(isomap_result['x'], isomap_result['y'], color='b')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Isomap: We can already see that some data points are quite similar and others quite different')
plt.show()

In [None]:
lle = LocallyLinearEmbedding(
    n_neighbors=5, 
    n_components=2, 
    reg=0.001, 
    method='standard', # Could be {‘standard’, ‘hessian’, ‘modified’, ‘ltsa’} for all of the versions of LLE we saw
    random_state=0, 
    n_jobs=-1
)
lle_result = lle.fit_transform(df)
lle_result = pd.DataFrame(lle_result)
lle_result.columns = cols
plt.scatter(lle_result['x'], lle_result['y'], color='b')
plt.xlabel('x')
plt.ylabel('y')
plt.title('LLE: Some small clusters of points are quite separated from the rest - what is happening?')
plt.show()

In [None]:
# Multi-dimensional scaling algorithms:
mds = MDS(
    n_components=2, 
    metric=True, # To select metric or non-metric MDS
    n_init=4, 
    max_iter=300, 
    verbose=0, 
    eps=0.001, 
    n_jobs=-1, 
    random_state=0, 
    dissimilarity='euclidean'
)
mds_result = mds.fit_transform(df)
mds_result = pd.DataFrame(mds_result)
mds_result.columns = cols
plt.scatter(mds_result['x'], mds_result['y'], color='b')
plt.xlabel('x')
plt.ylabel('y')
plt.title('MDS: And now all images look quite spread?')
plt.show()

In [None]:
# Finally, let's try the big boy... As always: CHANGE PARAMETERS TO SEE HOW IT WORKS!!
t_sne = TSNE(
    n_components=2,
    # I had to play with perplexity quite a bit: WHAT HAPPENS IF YOU INCREASE/DECREASE IT?
    perplexity=5.6, 
    metric='euclidean', 
    # Can be random or pca. Try both!
    init='pca', 
    verbose=0, 
    random_state=0, 
    # You can also put below 'exact', but you will have to approx double the perplexity factor
    method='barnes_hut',
    n_jobs=-1
)
t_sne_result = t_sne.fit_transform(df)
t_sne_result = pd.DataFrame(t_sne_result)
t_sne_result.columns = cols
plt.scatter(t_sne_result['x'], t_sne_result['y'], color='b')
plt.xlabel('x')
plt.ylabel('y')
plt.title('t-SNE: There are some clear clusters there now')
plt.show()
print('Are there around 7-10 clusters there, or is it just my imagination?')
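
The four plots above look very different, so it helps to have a number for how well a 2D embedding preserves the neighbourhoods of the original space. One option (a suggestion, not something this notebook already does) is scikit-learn's `trustworthiness` score. The sketch below uses synthetic blobs and a PCA embedding as stand-ins, both of which are assumptions made only to keep the example fast and self-contained.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

# Synthetic stand-in for the face data: 70 samples, 50 features, 7 groups.
X, _ = make_blobs(n_samples=70, n_features=50, centers=7, random_state=0)

# Any 2D embedding works here; PCA is used only because it is fast.
embedding = PCA(n_components=2, random_state=0).fit_transform(X)

# Score in [0, 1]: close to 1 means that points which are neighbours in 2D
# were also neighbours in the original high-dimensional space.
score = trustworthiness(X, embedding, n_neighbors=5)
print(f"Trustworthiness: {score:.3f}")
```

In this notebook the same call could be applied to the pixel data and to the `x` and `y` columns of each result (for example `t_sne_result[cols]`) to compare the four methods.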

# Clustering

Let's cluster that data that came out of t-SNE and see if we can more or less guess the number of groups/clusters (classes, **different people**) we have in there...

In [None]:
# One of the t-SNE clusters seems to have only 5 samples, so min_samples should be 5 at most.

# I was playing with eps and min_samples for a while (ending up with min_samples=3) UNTIL:
#    1. I got as many samples as possible belonging to a cluster
#    2. I got a reasonable number of clusters: no 1 single cluster, and no >10 clusters 
dbscan = DBSCAN(
    eps=50,
    min_samples=3,
)
dbscan_clusters = dbscan.fit_predict(t_sne_result)
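
The comments in the cell above describe tuning `eps` by trial and error. A common heuristic that can narrow the search (not what was done here, just a hedged alternative) is the k-distance curve: sort every point's distance to its k-th nearest neighbour and look for the "elbow". The sketch below uses synthetic 2D blobs as a stand-in for the t-SNE output, and `MIN_SAMPLES` is a name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic 2D points standing in for a t-SNE output (not the face data).
X, _ = make_blobs(n_samples=70, centers=7, cluster_std=0.5, random_state=0)

MIN_SAMPLES = 3  # hypothetical name; mirrors min_samples in the DBSCAN call

# When the query points are the training points, each point is returned as
# its own nearest neighbour, which matches DBSCAN counting the point itself.
nn = NearestNeighbors(n_neighbors=MIN_SAMPLES).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from every point to its MIN_SAMPLES-th neighbour, sorted ascending.
k_distances = np.sort(distances[:, -1])

# Plotting k_distances and reading the value at the "elbow" suggests an eps.
print(k_distances[[0, len(k_distances) // 2, -1]])
```

With the notebook's data you would pass `t_sne_result[cols]` instead of `X` and plot `k_distances` with `plt.plot`. The elbow is only a starting point, so some manual tuning is usually still needed.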

In [None]:
print('Num. samples without cluster:', (dbscan_clusters == -1).sum())
# Each label other than -1 is one "person" according to DBSCAN:
print('Clusters: ', set(dbscan_clusters))

# The -1 cluster holds the samples that are considered noise (they don't belong to any cluster)
# Now we will add these clusters to the different Manifold Learning results, and plot them:

In [None]:
results = [isomap_result, lle_result, mds_result, t_sne_result]
algos = ['Isomap', 'LLE', 'MDS', 't-SNE']

palette = [
        '#aaaaaa', '#ff6666', '#66ccff', '#bbff88', '#3346ff', '#6666ff', '#e566ff', '#66ff99', 
        '#7cb9e8', '#b0bf1a', '#5d8aa8', '#efdecd', '#3b7a57', '#967117', '#cce6ff',
        '#4da6ff', '#e60073', '#2200cc', '#0088cc', '#19ffff', '#1eb300', '#805500',
]

clusters = set(dbscan_clusters)

for r in results:
    r['dbscan'] = dbscan_clusters


fig, axes = plt.subplots(2, 2, figsize=(15, 15))

# One subplot per algorithm, one colour per DBSCAN cluster
for ax, res, algo in zip(axes.ravel(), results, algos):
    for c in clusters:
        cluster_data = res[res['dbscan'] == c]
        # palette[0] (grey) is used for the noise label -1
        ax.scatter(cluster_data['x'], cluster_data['y'], color=palette[c + 1])
    ax.set_title(algo)

plt.show()

Only in t-SNE are the images separated more or less correctly, although in all of the methods the images identified as "being the same person" by our `t-SNE + DBSCAN` approach are located near each other (even in LLE).

The t-SNE clusters are all more or less coloured correctly. Let's assign those cluster values to the original data and **check if the images inside each cluster are of the same person**:

In [None]:
df['clusters'] = dbscan_clusters

In [None]:
# Choose one of the labels printed in 'Clusters' above (-1 shows the noise samples)
CLUSTER_TO_DISPLAY = 2

cluster_images = df[df['clusters'] == CLUSTER_TO_DISPLAY]

print('This cluster has', len(cluster_images), 'photos')

# IF THE CLUSTER HAS FEWER THAN 9 IMAGES, THEY WILL BE REPEATED BELOW:
grid_size = 3
fig, axes = plt.subplots(grid_size, grid_size, sharex=True, sharey=True, figsize=(15, 15))

img_count = 0

for row in axes:
    for i in range(grid_size):
        # The first 4096 columns are the pixels; anything after them is a cluster label
        pixels = cluster_images.iloc[img_count % len(cluster_images), :4096]
        row[i].matshow(pixels.to_numpy(dtype=float).reshape(64, 64))
        img_count += 1

plt.show()

## Kind of works...

The best clusters for me were the young man from the very first picture we showed above in this notebook, and all of the gentlemen with glasses.

t-SNE's low-dimensional visualisation kind of gave us the impression that there were between 5 and 10 clusters of data - which allowed us to tune DBSCAN to achieve an OK result to group the 70 pictures.

What if we had used PCA for this?

In [None]:
vanilla_pca = PCA(n_components=2, svd_solver='auto')

# Fit on the 4096 pixel columns only: df now also holds the 'clusters' labels
v_pca = vanilla_pca.fit_transform(df.iloc[:, :4096])
print('Shape of the PCA-transformed data: ', v_pca.shape)
print('Those are', v_pca.shape[1], 'components')
print('Variance explained: ', round(100*sum(vanilla_pca.explained_variance_ratio_), 2), '% of the total')

In [None]:
v_pca_result = pd.DataFrame(v_pca)
v_pca_result.columns = cols
plt.scatter(v_pca_result['x'], v_pca_result['y'], color='b')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Classic PCA, reduced to 2 components')
plt.show()

The same seems to happen as with MDS: the data is spread everywhere.

##### Let's plot our OK-ish t-SNE-based clusters in that PCA 2D plot:

In [None]:
v_pca_result['dbscan'] = dbscan_clusters

for c in clusters:
    cluster_data = v_pca_result[v_pca_result['dbscan'] == c]
    plt.scatter(cluster_data['x'], cluster_data['y'], color=palette[c+1])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Classic PCA (2 components), coloured by the t-SNE + DBSCAN clusters')
plt.show()

That is actually not too bad! Some of the clusters we discovered using t-SNE (non-linear Manifold Learning) also sit close together, without overlapping, in the PCA plot (a linear Dimensionality Reduction technique).

This means that there are _some linear relationships_ in our data that can be exploited by traditional Dimensionality Reduction techniques, even though PCA's 2 components are only able to explain around 40% of the original variance.

Maybe we could have done our clustering directly on the original data? Or in a PCA (or other Dimensionality Reduction) with some higher number of components?
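
For the "higher number of components" idea, `PCA` accepts a fraction as `n_components` and then keeps as many components as are needed to explain that share of the variance. The sketch below shows this on synthetic data built from 10 hidden factors (a made-up stand-in, not the face data); the 0.90 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the pixel matrix: 70 samples x 200 features built
# from 10 hidden factors plus a little noise (not the face data).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(70, 10))
mixing = rng.normal(size=(10, 200))
X = hidden @ mixing + 0.1 * rng.normal(size=(70, 200))

# A float n_components asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("Components kept:", pca.n_components_)
print("Variance explained:", round(100 * pca.explained_variance_ratio_.sum(), 2), "%")
```

On the faces you would fit this on the pixel columns only (for example `df.iloc[:, :4096]`) and could then run a clustering algorithm on `X_reduced` instead of on all 4096 pixels.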

In [None]:
# I had to play with these two parameters a lot to obtain a result similar
# to what we had after t-SNE!! 
# Without the knowledge from t-SNE and/or the target variable provided in 
# the dataset (faces['target']), guessing the correct number of clusters would
# have been a shot in the dark... or maybe not: what could we have done here?
dbscan = DBSCAN(
    eps=8.5,
    min_samples=2,
)
# Cluster on the 4096 pixel columns only: df also holds the 'clusters' labels
# added earlier, which must not leak into the distances (re-tune eps if needed)
dbscan_clusters2 = dbscan.fit_predict(df.iloc[:, :4096])

In [None]:
print('Num. samples without cluster:', (dbscan_clusters2 == -1).sum())
print('Clusters: ', set(dbscan_clusters2))

In [None]:
df['clusters2'] = dbscan_clusters2

In [None]:
# Choose one of the labels printed in 'Clusters' above (-1 shows the noise samples)
CLUSTER_TO_DISPLAY = 3

cluster_images = df[df['clusters2'] == CLUSTER_TO_DISPLAY]

print('Num. Images in Cluster:', len(cluster_images))

# IF THE CLUSTER HAS FEWER THAN 9 IMAGES, THEY WILL BE REPEATED BELOW:
grid_size = 3
fig, axes = plt.subplots(grid_size, grid_size, sharex=True, sharey=True, figsize=(15, 15))

img_count = 0

for row in axes:
    for i in range(grid_size):
        # The first 4096 columns are the pixels; anything after them is a cluster label
        pixels = cluster_images.iloc[img_count % len(cluster_images), :4096]
        row[i].matshow(pixels.to_numpy(dtype=float).reshape(64, 64))
        img_count += 1

plt.show()
print('Seems like the clusters work - and contain the same person in all of them!')

If you checked the clusters, you'll notice that we are even able to differentiate the 2 gentlemen wearing the same kind of glasses.

**Note:** The first time we clustered images, we used only 2 dimensions (those provided by t-SNE), and got fair-enough results. The second time, we used the 4096 original dimensions (all the pixels) to do the clustering!! In a very large dataset that would have taken ages!! So, while the second results are maybe slightly better from what I observed in the clusters, t-SNE helped us analyse our data to estimate an ok-ish number of clusters, and then quickly apply any clustering algorithm there.

Anyway... Congratulations! You now have the power of face recognition software in your hands!

# Learning Exercises:

* What happens when you select more than 70 images? Does t-SNE still make sense?
* Considering that we found out that _some_ linear relationships are present in the data, what else could we have done to discover how many classes we have?
* How could we verify that our clusters of photos are correct? Can you apply those metrics and plots?
