# Assignment

In this assignment, you continue working with the Fashion MNIST dataset. For comparability, use the same sample of 10,000 images that you used in the previous checkpoint. To complete the assignment, submit a link to a Jupyter notebook containing your solutions to the tasks below, and plan on discussing them with your mentor. You can also take a look at these example solutions.

1. Load the dataset and preprocess it, for example by normalizing the data.

2. Apply UMAP to the data.

3. Using the two-dimensional UMAP representation, draw a graph of the data by coloring and labeling the data points as we did in the checkpoint.

4. Do you think the UMAP solution is satisfactory? Can you easily distinguish between the different classes? Which did a better job: UMAP, or the other methods (t-SNE or PCA) that you applied in the assignments of the previous checkpoints?

5. Now, experiment with different UMAP hyperparameter values and apply UMAP for each combination. Which combination gives the clearest two-dimensional representation?

In [None]:
%reload_ext nb_black

In [None]:
import numpy as np

import time

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_openml
import umap

import matplotlib.pyplot as plt

### 1. Load the dataset and preprocess it, for example by normalizing the data.

In [None]:
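# Load the Fashion-MNIST dataset from OpenML.
# as_frame=False returns NumPy arrays, so the integer-array indexing below works.
mnist = fetch_openml("Fashion-MNIST", version=1, cache=True, as_frame=False)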

In [None]:
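# Fix the seed so the same 10,000-image sample is drawn on every run
np.random.seed(123)

# Draw 10,000 row indices out of the 70,000 images
indices = np.random.choice(70000, 10000)
# Scale pixel intensities from [0, 255] to [0, 1]
X = mnist.data[indices] / 255.0
y = mnist.target[indices]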

print(X.shape, y.shape)
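
One detail worth checking in the sampling step: `np.random.choice` draws with replacement unless told otherwise, so the 10,000 indices above may include repeats. The sketch below is only an illustration. It counts indices without touching the image data, and the variable names are made up for this example. Switching to `replace=False` would change the sample, so it only makes sense if comparability with the previous checkpoint is not required.

```python
import numpy as np

np.random.seed(123)

# Default behaviour: sampling WITH replacement, so some indices can repeat
with_repl = np.random.choice(70000, 10000)

# Alternative: sampling WITHOUT replacement, every index is distinct
without_repl = np.random.choice(70000, 10000, replace=False)

n_unique_with = np.unique(with_repl).size
n_unique_without = np.unique(without_repl).size

print("unique indices (with replacement):   ", n_unique_with)
print("unique indices (without replacement):", n_unique_without)
```

If the first count is below 10,000, the sample contains duplicated images, which slightly reduces the number of distinct points in the later plots.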

### 2. Apply UMAP to the data.

In [None]:
time_start = time.time()

umap_results = umap.UMAP(
    n_neighbors=5, min_dist=0.3, metric="correlation"
).fit_transform(X)

print("UMAP done! Time elapsed: {} seconds".format(time.time() - time_start))

### 3. Using the two-dimensional UMAP representation, draw a graph of the data by coloring and labeling the data points as we did in the checkpoint.

In [None]:
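plt.figure(figsize=(10, 5))
# One distinct colour per class (10 classes)
colours = ["r", "b", "g", "c", "m", "y", "k", "orange", "burlywood", "chartreuse"]

# plt.text does not autoscale the axes, so take the limits from the embedding
plt.xlim(umap_results[:, 0].min(), umap_results[:, 0].max())
plt.ylim(umap_results[:, 1].min(), umap_results[:, 1].max())

# Draw each point as its class label, coloured by class
for i in range(umap_results.shape[0]):
    plt.text(
        umap_results[i, 0],
        umap_results[i, 1],
        y[i],
        color=colours[int(y[i])],
        # Small font so that 10,000 labels stay legible
        fontdict={"weight": "bold", "size": 9},
    )

plt.xticks([])
plt.yticks([])
plt.axis("off")
plt.show()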
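
Drawing 10,000 individual text labels can be slow and crowded. As an alternative, here is a minimal sketch that draws one `scatter` call per class with a legend. The helper name `plot_embedding` and the synthetic `demo_*` arrays are assumptions made for this illustration and are not part of the checkpoint code.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_embedding(embedding, labels, ax=None):
    """Scatter-plot a 2-D embedding with one colour and legend entry per class."""
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 5))
    labels = np.asarray(labels)
    cmap = plt.get_cmap("tab10")  # 10 visually distinct colours
    for k, cls in enumerate(np.unique(labels)):
        mask = labels == cls
        ax.scatter(
            embedding[mask, 0],
            embedding[mask, 1],
            s=3,
            color=cmap(k % 10),
            label=str(cls),
        )
    ax.set_xticks([])
    ax.set_yticks([])
    ax.legend(markerscale=4, title="class")
    return ax


# Synthetic stand-in for a real embedding: 10 classes, 50 points each
rng = np.random.default_rng(0)
demo_classes = np.repeat(np.arange(10), 50)
demo_labels = demo_classes.astype(str)  # string labels, like the OpenML targets
demo_embedding = rng.normal(size=(500, 2)) + demo_classes[:, None]

demo_ax = plot_embedding(demo_embedding, demo_labels)
```

In the notebook, the same helper could be called as `plot_embedding(umap_results, y)` followed by `plt.show()`.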

### 4. Do you think the UMAP solution is satisfactory? Can you easily distinguish between the different classes? Which did a better job: UMAP, or the other methods (t-SNE or PCA) that you applied in the assignments of the previous checkpoints?

Again, some of the classes are easily separable. Those classes are separated even more clearly than they were in the t-SNE plot, and the runtime was dramatically lower. However, several classes still overlap and cannot be told apart.

### 5. Now, experiment with different UMAP hyperparameter values and apply UMAP for each combination. Which combination gives the clearest two-dimensional representation?

In [None]:
n_neighbors = [5, 10, 50, 100, 200]
min_dist = [0.1, 0.25, 0.5, 0.75, 0.99]

In [None]:
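for neighbor in n_neighbors:
    for dist in min_dist:
        print(neighbor, dist)
        time_start = time.time()

        # Pass the loop variables so each run uses its own hyperparameter pair
        umap_results = umap.UMAP(
            n_neighbors=neighbor, min_dist=dist, metric="correlation"
        ).fit_transform(X)

        print("UMAP done! Time elapsed: {} seconds".format(time.time() - time_start))

        plt.figure(figsize=(10, 5))
        # One distinct colour per class (10 classes)
        colours = ["r", "b", "g", "c", "m", "y", "k", "orange", "burlywood", "chartreuse"]

        # plt.text does not autoscale the axes, so take the limits from the embedding
        plt.xlim(umap_results[:, 0].min(), umap_results[:, 0].max())
        plt.ylim(umap_results[:, 1].min(), umap_results[:, 1].max())

        for i in range(umap_results.shape[0]):
            plt.text(
                umap_results[i, 0],
                umap_results[i, 1],
                y[i],
                color=colours[int(y[i])],
                # Small font so that 10,000 labels stay legible
                fontdict={"weight": "bold", "size": 9},
            )

        plt.title("n_neighbors={}, min_dist={}".format(neighbor, dist))
        plt.xticks([])
        plt.yticks([])
        plt.axis("off")
        plt.show()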

The results do not appear dramatically different from plot to plot as the min_dist and n_neighbors parameters are varied. It is hard to evaluate objectively, from the plots alone, how well the classes are separated.
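
One possible way to make the comparison less subjective is to compute a score on each two-dimensional embedding using the known labels. The sketch below uses scikit-learn's `silhouette_score` on synthetic blobs, not on the Fashion MNIST embeddings. The `tight` and `loose` datasets are assumptions chosen only to show that the score is higher when classes are better separated.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three class centres in 2-D, standing in for a low-dimensional embedding
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

# Well-separated classes: small spread around each centre
tight, tight_labels = make_blobs(
    n_samples=600, centers=centers, cluster_std=0.5, random_state=0
)

# Heavily overlapping classes: large spread around the same centres
loose, loose_labels = make_blobs(
    n_samples=600, centers=centers, cluster_std=6.0, random_state=0
)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated classes
score_tight = silhouette_score(tight, tight_labels)
score_loose = silhouette_score(loose, loose_labels)

print("silhouette (well separated):", score_tight)
print("silhouette (overlapping):   ", score_loose)
```

In the notebook, something like `silhouette_score(umap_results, y)` could be recorded for each `(n_neighbors, min_dist)` pair and compared across runs. This is only one of several reasonable measures. It favours compact, roughly round clusters, so it should be read alongside the plots rather than instead of them.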