<img src="https://drive.google.com/uc?id=1-hPP-XPm9_5M3orUgmompcVleQ5xvPST" style="Width:1000px">

In [None]:
from nbta.utils import download_data
download_data(id='1Jlq8kHOlsp563-x15b6Mx55st4XIKAjs')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('raw_data/penguins.csv')
data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = data.drop(columns=['species'])
y = data.species
                   
X_train, X_test, y_train_raw, y_test_raw = train_test_split(X, y, train_size=0.8, random_state=12)

label_encoder = LabelEncoder().fit(y_train_raw)

y_train = pd.Series(data=label_encoder.transform(y_train_raw),name='species')
y_test = pd.Series(data=label_encoder.transform(y_test_raw), name='species')

#### Visualizing the relationship between features

Create a pairplot (`seaborn.pairplot`) to visualize the features. Use the `species` as a hue, to be able to distinguish different species.

In [None]:
plot_data = X_train.copy()
plot_data['species'] = y_train_raw.values
sns.pairplot(data=plot_data, hue='species');

# Creating a PCA projection

Create an `X_proj_pca` dataset that is the projected version of your dataset, using a `PCA` with `random_state=5`. I selected 5 because it leads to a nice visualization. Save your `X_proj_pca` as a dataframe.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(random_state=5)
pca.fit(X_train)

In [None]:
X_proj_pca = pca.transform(X_train)
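# Name the columns PC1, PC2, ... according to the number of components actually produced
X_proj_pca = pd.DataFrame(X_proj_pca, columns=[f'PC{i}' for i in range(1, X_proj_pca.shape[1] + 1)])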
X_proj_pca
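
To judge how much information each principal component carries, it can help to look at the explained variance ratio of the fitted `PCA`. The sketch below is only an illustration: `X_demo` is synthetic data standing in for the penguin measurements (an assumption, not the real dataset), and `pca_demo`, `ratios` and `cumulative` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for four numeric features (assumption: not the real penguin data)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(100, 4)) * np.array([10.0, 5.0, 2.0, 0.5])

pca_demo = PCA(random_state=5).fit(X_demo)

# Fraction of the total variance carried by each component, in decreasing order
ratios = pca_demo.explained_variance_ratio_
# Running total: how much variance the first k components keep together
cumulative = np.cumsum(ratios)

print(ratios.round(3))
print(cumulative.round(3))
```

On the real data, the same two attributes on `pca` would indicate how many components are worth plotting.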

# Dimensionality reduction and projection using graph algorithms (t-SNE and UMAP)

* Manifold learning based on graph algorithms
* Very complex: only the general principles are explained here
* More about graph algorithms during the Deep Learning week

#### In a nutshell:

<img src="figures/umap-only.png" style="width:1000">

## Main Differences between t-SNE and UMAP

#### t-SNE (T-distributed Stochastic Neighbor Embedding)
* Older algorithm
* Relatively simple mathematically: empirical approach
* Based on Gaussian probabilities and the Student's t-distribution
* Uses exclusively Euclidean distance between points
* t-SNE applies distance normalization

#### UMAP (Uniform Manifold Approximation & Projection)
* More recent algorithm
* Anchored in theoretical mathematical approach
* In order to construct the initial high-dimensional graph, builds a "fuzzy simplicial complex"
* UMAP is often better at preserving global structure in the final projection than t-SNE
* **UMAP is orders of magnitude faster than t-SNE for complex datasets**

**A good first read** if you want to understand the mathematical differences between these two algorithms is this blog post by <a href="https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668">Oskolkov, 2019</a>

## t-SNE (T-distributed Stochastic Neighbor Embedding)

A great resource to read about <code>t-SNE</code> is the dedicated blog post <a href="https://distill.pub/2016/misread-tsne/">from Google Brain</a>. Most of these slides are inspired by it.

**Key hyperparameters:**
* <code>Perplexity</code> (loosely, a guess at the number of close neighbors each point has; it sets the scale at which points are considered to be linked)

* <code>Perplexity</code> is typically set between 5 and 50, and must be less than the number of datapoints

<img src="figures/t-sne-hyperparameters.png" style="width:1200">
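
The effect of perplexity described above can be explored with a small sketch like the following. It assumes scikit-learn's `TSNE` and uses synthetic two-cluster data (`X_demo`, a hypothetical stand-in for the real dataset); the perplexity values are just examples taken from the usual 5 to 50 range.

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic clusters of 40 points each in 4 dimensions (assumption: toy data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(8, 1, (40, 4))])

embeddings = {}
for perplexity in (5, 30, 50):
    # Perplexity must stay below the number of samples (80 here)
    tsne = TSNE(perplexity=perplexity, random_state=11)
    embeddings[perplexity] = tsne.fit_transform(X_demo)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Plotting each entry of `embeddings` side by side is a simple way to see how much the picture changes with this one hyperparameter.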

**Key hyperparameters:**
* <code>Step</code> (number of iterative computation steps)

<img src="figures/t-sne-steps.png" style="width:1200">

* You need enough <code>Steps</code> for the algorithm to converge!
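
One rough way to check convergence is to compare the final KL divergence (the quantity t-SNE minimizes) for different numbers of steps. This is a minimal sketch on synthetic data (`X_demo` is an assumption). In scikit-learn the step count is passed as `max_iter` in recent releases and as `n_iter` in older ones, so the sketch picks whichever name the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.manifold import TSNE

# Two synthetic clusters of 30 points each (assumption: toy data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(6, 1, (30, 4))])

# Use the iteration argument supported by the installed scikit-learn version
tsne_params = inspect.signature(TSNE).parameters
iter_arg = "max_iter" if "max_iter" in tsne_params else "n_iter"

kl_by_steps = {}
for steps in (250, 1000):
    tsne = TSNE(perplexity=10, random_state=0, **{iter_arg: steps})
    tsne.fit(X_demo)
    # Final value of the cost function after optimization
    kl_by_steps[steps] = tsne.kl_divergence_

print(kl_by_steps)
```

If the divergence is still dropping noticeably between the two settings, the shorter run has probably not converged yet.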

### What do cluster size and distance 'mean'?

<img src="figures/t-sne-distance.png" style="width:1200">

* Cluster sizes in a t-SNE plot mean nothing (because t-SNE preserves local structure over global structure: large clusters of data tend to be expanded)
* Distances between clusters might not mean anything
* Random noise doesn’t always look random

<img src="figures/t-sne-topology.png" style="width:1200">

* For topology, you may need more than one plot
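
The point about cluster sizes can be checked numerically. The sketch below builds one tight and one wide synthetic cluster (toy data, an assumption) and compares their relative spread before and after t-SNE; `spread` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
tight = rng.normal(0.0, 0.5, size=(50, 4))   # small spread
wide = rng.normal(20.0, 5.0, size=(50, 4))   # ten times the standard deviation
X_demo = np.vstack([tight, wide])

emb = TSNE(perplexity=20, random_state=11).fit_transform(X_demo)


def spread(points):
    """Mean distance of the points to their own centroid."""
    return np.linalg.norm(points - points.mean(axis=0), axis=1).mean()


# Rows 0-49 are the tight cluster, rows 50-99 the wide one
ratio_original = spread(wide) / spread(tight)
ratio_embedded = spread(emb[50:]) / spread(emb[:50])
print(f"wide/tight spread ratio: original={ratio_original:.1f}, t-SNE={ratio_embedded:.1f}")
```

Comparing the two printed ratios shows how much of the original size difference survives in the embedding; the result will vary with `perplexity` and `random_state`.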

# Comparing t-SNE and PCA projections

Create a plot that will contain three subplots:
1. A plot of the original data with `Culmen Depth (mm)` vs `Culmen Length (mm)`
2. A plot that contains a `PCA` projection of the data, plotting `PC1` vs `PC2`
3. A plot that contains a `t-SNE` projection of the data, plotting `TSNE 1` vs `TSNE 2`. Use `random_state=11` for your `t-SNE` to obtain a nice projection (though feel free to play with this parameter to see how unstable `t-SNE` projections can be).

In all three cases, use color to show the species that each datapoint belongs to. Which projection does a better job at separating the species?

In [None]:
from sklearn.manifold import TSNE

TSNE_embedded = TSNE(random_state=11).fit_transform(X_train)
TSNE_embedded.shape


In [None]:
from utils import compare_embedding
X_proj_PC = X_proj_pca.values
compare_embedding({'Before PCA (initial space)': {'x': X_train.iloc[:, 0], 'y': X_train.iloc[:, 1], 'xlabel': 'Culmen Length (mm)', 'ylabel': 'Culmen Depth (mm)'},
                   'PCA Projection': {'x': X_proj_PC[:, 0], 'y': X_proj_PC[:, 1], 'xlabel': 'PC_1', 'ylabel': 'PC_2'},
                   'T-SNE Projection': {'x': TSNE_embedded[:, 0], 'y': TSNE_embedded[:, 1], 'xlabel': 'TSNE_1', 'ylabel': 'TSNE_2'}},
                  labels_data=y_train_raw, marker_size=35)
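
Besides eyeballing the plots, one possible way to put a number on "which projection separates the species better" is the silhouette score of each 2D embedding with respect to the known labels. The sketch below is an assumption-laden illustration: it uses `make_blobs` toy data instead of the penguins, and the score is a rough guide only, since distances in a t-SNE plot may not be meaningful.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Toy data: 3 labeled groups in 4 dimensions (assumption: stand-in for the species)
X_demo, labels_demo = make_blobs(n_samples=150, centers=3, n_features=4, random_state=12)

pca_2d = PCA(n_components=2, random_state=5).fit_transform(X_demo)
tsne_2d = TSNE(random_state=11).fit_transform(X_demo)

# Silhouette ranges from -1 (groups mixed) to 1 (groups compact and well separated)
scores = {
    "PCA": silhouette_score(pca_2d, labels_demo),
    "t-SNE": silhouette_score(tsne_2d, labels_demo),
}
print(scores)
```

The same two calls to `silhouette_score` could be applied to `X_proj_PC[:, :2]` and `TSNE_embedded` with `y_train_raw` as labels.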

## UMAP: Uniform Manifold Approximation & Projection

UMAP has some advantages over t-SNE, notably that it better preserves the global geometry of the dataset. However, as with t-SNE, hyperparameter selection really matters (and the values are not easy to choose), and the same caveats apply: cluster size, distance, and the structure of random noise might not be meaningful.

## Install UMAP

We first need to install UMAP:


In [None]:
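# -y answers the confirmation prompt, which cannot be answered interactively from a notebook cell
!conda install -y -c conda-forge umap-learn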

**Key hyperparameters:**
* <code>n_neighbors</code>: The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation

* <code>min_dist</code>: The effective minimum distance between embedded points.
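
A quick way to get a feel for these two hyperparameters is to fit UMAP with a few combinations and compare the plots. The sketch below assumes the `umap-learn` package (where the arguments are named `n_neighbors` and `min_dist`) and synthetic two-cluster data (`X_demo`, a toy stand-in); it skips the fits if the package is not installed. The values tried are arbitrary examples.

```python
import numpy as np

try:
    from umap import UMAP  # provided by the umap-learn package
except ImportError:
    UMAP = None  # the fits below are skipped when umap-learn is missing

# Two synthetic clusters of 45 points each (assumption: toy data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (45, 4)), rng.normal(6, 1, (45, 4))])

umap_embeddings = {}
if UMAP is not None:
    for n_neighbors, min_dist in [(5, 0.0), (5, 0.5), (30, 0.0), (30, 0.5)]:
        # Small n_neighbors focuses on local structure, large values on the broader picture;
        # small min_dist lets points pack tightly, large values spread them out
        reducer = UMAP(n_neighbors=n_neighbors, min_dist=min_dist, random_state=42)
        umap_embeddings[(n_neighbors, min_dist)] = reducer.fit_transform(X_demo)

for params, emb in umap_embeddings.items():
    print(params, emb.shape)
```

Each entry of `umap_embeddings` is a 2D array that can be scattered side by side to compare settings.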

### Example of projection from 3D to 2D

<img src="figures/UMAP-projection.png" style="width:1500px">
<a href="https://pair-code.github.io/understanding-umap/">Google Brain</a>

### UMAP preserves global context better than t-SNE 

<img src="figures/t-sne-vs-UMAP2.png" style="width:1500px">
<a href="https://pair-code.github.io/understanding-umap/">Google Brain</a>

### On some datasets t-SNE works better than UMAP!

<img src="figures/UMAP-failings.png" style="width:1000">
<a href="https://pair-code.github.io/understanding-umap/">Google Brain</a>

# Comparing UMAP and PCA projections

Create a plot that will contain three subplots:
1. A plot of the original data with `Culmen Depth (mm)` vs `Culmen Length (mm)`
2. A plot that contains a `PCA` projection of the data, plotting `PC1` vs `PC2`
3. A plot that contains a `UMAP` projection of the data, plotting `UMAP 1` vs `UMAP 2`

In all three cases, use color to show the species that each datapoint belongs to. Which projection does a better job at separating the species?

In [None]:
from umap import UMAP 
umap_trans = UMAP()
X_umap = umap_trans.fit_transform(X_train)

In [None]:
compare_embedding({'Before PCA (initial space)': {'x': X_train.iloc[:, 0], 'y': X_train.iloc[:, 1], 'xlabel': 'Culmen Length (mm)', 'ylabel': 'Culmen Depth (mm)'},
                   'PCA Projection': {'x': X_proj_PC[:, 0], 'y': X_proj_PC[:, 1], 'xlabel': 'PC_1', 'ylabel': 'PC_2'},
                   'Unsupervised UMAP Projection': {'x': X_umap[:, 0], 'y': X_umap[:, 1], 'xlabel': 'UMAP_1', 'ylabel': 'UMAP_2'}},
                  labels_data=y_train_raw, loc=(-.15, .87), marker_size=35)

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>. And this was the last exercise of the module. I hope you enjoyed the two weeks spent together!