<img src="materials/images/introduction-to-statistics-II-cover.png"/>


# 👋 Welcome, before you start
<br>

### 📚 Module overview

We will go through eleven lessons with you:
    
- [**Lesson 1: Z-score**](Lesson_1_Z-score.ipynb)

- [**Lesson 2: P-value**](Lesson_2_P-value.ipynb)

- [**Lesson 3: Welch's T-test**](Lesson_3_Welchs_T-test.ipynb)

- [**Lesson 4: Log2 Fold Change**](Lesson_4_Log2_Fold_Change.ipynb)

- [**Lesson 5: Pearson Correlation**](Lesson_5_Pearson_Correlation.ipynb)

- [**Lesson 6: Spearman Correlation**](Lesson_6_Spearman_Correlation.ipynb)

- [**Lesson 7: False Discovery Rate**](Lesson_7_False_Discovery_Rate.ipynb)

- [**Lesson 8: Benjamini Hochberg**](Lesson_8_Benjamini_Hochberg.ipynb)

- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- <font color=#E98300>**Lesson 11: UMAP**</font>    `📍You are here.`
<br>



<div class="alert alert-block alert-info">
<h3>⌨️ Keyboard shortcut</h3>

These common shortcuts can save you time as you work through this notebook:
- Run the current cell: **`Shift + Enter`**.
- Add a cell above the current cell: Press **`A`**.
- Add a cell below the current cell: Press **`B`**.
- Change a code cell to markdown cell: Select the cell, and then press **`M`**.
- Delete a cell: Press **`D`** twice.

Need more help with keyboard shortcuts? Press **`H`** to look them up.
</div>



---



# Lesson 11: UMAP

`🕒 This module should take about 15 minutes to complete.`

`✍️ This notebook is written using Python.`

<mark>**Uniform Manifold Approximation and Projection (UMAP)**</mark> is an increasingly popular dimensionality reduction technique used to visualize high-dimensional data, such as genomic or proteomic data sets, in a lower-dimensional space. The technique can identify critical features in high-dimensional space (e.g., hundreds or thousands of variables) and preserve them in a lower-dimensional embedding (e.g., 2D or 3D) from which they can be visualized. One of the most widely used techniques for visualizing high-dimensional data is t-SNE, but its performance suffers with large datasets. **UMAP** offers a number of advantages over t-SNE, most notably increased speed and better preservation of the data's global structure. 

UMAP is primarily used as a dimensionality-reduction technique to visualize high-dimensional data rather than for quantitative analysis.












<img src="materials/images/images_umap/umap_viz.png"/>

---

### ✅ `Run` each of the cells below:

# Import a sample high-dimensional dataset


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

### Preview the dimensionality of the dataset
The following sample dataset has 64 dimensions (1797 rows x 64 columns). Each row is a flattened 8 x 8 image of a handwritten digit.

In [None]:
# Load the sample high-dimensional dataset as arrays:
# digits holds the features, labels holds the digit classes (0-9)
digits, labels = load_digits(return_X_y=True)

# Display the dimensionality of the dataset
digits.shape

### Preview the sample dataset

In [None]:
# Show the first rows of the feature array as a table
pd.DataFrame(digits).head()

---

# Apply UMAP to the dataset

In [None]:
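# Install the package first if needed: pip install umap-learn
from umap import UMAP

# Reduce the 64 input dimensions to 2 (the UMAP default for n_components)
embedding = UMAP(n_neighbors=5,
                 min_dist=0.3,
                 metric='correlation').fit_transform(digits)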

### View the dimensionality of the dataset following the UMAP procedure
The sample dataset has been reduced to a 2D representation (1797 rows with two variables/columns each).

In [None]:
embedding.shape

### Visualize the low-dimensional representation of the data as clusters

In [None]:
# Color each point by its digit label (0-9)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='tab10', s=5)
plt.colorbar(label='Digit')
plt.show()

## Interpretation
The points within the individual clusters are very similar to each other and are less similar to points in other clusters. A similar pattern is likely present in the original, high-dimensional dataset. 

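It's important to note that the sizes of the clusters, relative to each other, have little meaning. UMAP builds its representation of the high-dimensional data from local distances, so the size of a cluster in the embedding does not reflect how spread out that group is in the original space.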

Similarly, the distances between clusters are likely to have little meaning. Although UMAP preserves the global positions of clusters better than t-SNE, it still builds the representation from local distances, so the gaps between clusters should not be read as measurements.
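
One way to check how well local similarity survives is to count how many of each point's nearest neighbors in the original space are still among its nearest neighbors in the embedding. The sketch below is a minimal NumPy illustration and not part of UMAP itself. The helper names `knn_indices` and `neighbor_overlap` are hypothetical, Euclidean distance is assumed, and the data are synthetic stand-ins rather than the digits dataset.

```python
import numpy as np

def knn_indices(X, k):
    """Return the indices of each row's k nearest rows (Euclidean), excluding itself."""
    sq_norms = (X ** 2).sum(axis=1)
    # Squared pairwise distances: |x|^2 + |y|^2 - 2 * x.y
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(sq_dist, np.inf)  # a point is not its own neighbor
    return np.argsort(sq_dist, axis=1)[:, :k]

def neighbor_overlap(X_high, X_low, k=5):
    """Average fraction of k nearest neighbors shared between the two spaces."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    shared = [len(set(h) & set(l)) for h, l in zip(high, low)]
    return float(np.mean(shared)) / k

# Synthetic stand-in data (an assumption): two separated groups in 10 dimensions
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(30, 10))
group_b = rng.normal(8.0, 1.0, size=(30, 10))
X_high = np.vstack([group_a, group_b])

# Stand-in "embedding" (an assumption): keep only the first two coordinates
X_low = X_high[:, :2]

overlap_self = neighbor_overlap(X_high, X_high, k=5)  # identical spaces
overlap_proj = neighbor_overlap(X_high, X_low, k=5)   # after dropping dimensions
print("overlap with itself:", overlap_self)
print("overlap after projection:", overlap_proj)
```

In this notebook, `neighbor_overlap(digits, embedding, k=5)` would give a rough score between 0 and 1. Keep in mind that it compares Euclidean neighbors, whereas the embedding above was built with the correlation metric, so treat the score as a loose sanity check only.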

## NOTES:

- UMAP has a parameter called **n_neighbors**. This determines the number of neighboring points used in local approximations of structure. Larger values preserve more global structure at the expense of detailed local structure. This parameter is usually set in the range 5 to 50, with 10 to 15 being a sensible default.
 


- UMAP has a parameter called **min_dist**. This controls how tightly the embedding is allowed to compress points together. Larger values ensure embedded points are more evenly distributed while smaller values allow the algorithm to optimize more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.


- UMAP has a parameter called **metric**. This determines the metric used to measure distance in the input space. The example above uses `'correlation'`.
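
To make **metric** and **n_neighbors** more concrete, the sketch below computes correlation distance (one minus the Pearson correlation between two rows) and then looks up each row's nearest neighbors under that distance. It is an illustration in plain NumPy on a tiny made-up array. The helper names are hypothetical, and UMAP's own neighbor search is implemented differently.

```python
import numpy as np

def correlation_distance_matrix(X):
    """Pairwise correlation distance between rows: 1 - Pearson correlation."""
    return 1.0 - np.corrcoef(X)

def nearest_neighbors(dist, k):
    """Indices of the k closest rows for every row, excluding the row itself."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # a row is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

# Made-up data (an assumption): four rows measured on four variables
X = np.array([
    [1.0, 2.0, 3.0, 4.0],      # baseline profile
    [5.0, 7.0, 9.0, 11.0],     # 2 * baseline + 3: same shape, different scale
    [-1.0, -2.0, -3.0, -4.0],  # mirrored baseline
    [1.0, 3.0, 2.0, 4.0],      # baseline with two values swapped
])

D = correlation_distance_matrix(X)
neighbors = nearest_neighbors(D, k=2)  # k plays the role of n_neighbors

print(np.round(D, 2))
print(neighbors)
```

Rows 0 and 1 have a distance of 0 even though their values differ, because correlation distance ignores scale and offset. Under Euclidean distance the same two rows would be far apart, so the choice of metric changes which points count as neighbors.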

<div class="alert alert-block alert-warning">
<b>Tip: </b>UMAP is typically much faster than t-SNE and scales well in terms of both dataset size and dimensionality.</div>

---

# 🌟 You are done!
<br>
Review previous lessons often to consolidate what you have learned.

    
- [**Lesson 1: Z-score**](Lesson_1_Z-score.ipynb)

- [**Lesson 2: P-value**](Lesson_2_P-value.ipynb)

- [**Lesson 3: Welch's T-test**](Lesson_3_Welchs_T-test.ipynb)

- [**Lesson 4: Log2 Fold Change**](Lesson_4_Log2_Fold_Change.ipynb)

- [**Lesson 5: Pearson Correlation**](Lesson_5_Pearson_Correlation.ipynb)

- [**Lesson 6: Spearman Correlation**](Lesson_6_Spearman_Correlation.ipynb)

- [**Lesson 7: False Discovery Rate**](Lesson_7_False_Discovery_Rate.ipynb)

- [**Lesson 8: Benjamini Hochberg**](Lesson_8_Benjamini_Hochberg.ipynb)

- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)



---


# Contributions & acknowledgment

Thanks to Antony Ross for contributing the content for this notebook.

---

Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.