Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# CSE204 - Introduction to Machine Learning - Lab Session 10: Dimensionality Reduction with Principal Components Analysis (PCA)

<img src="https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/logo.jpg" style="float: left; width: 15%" />

[CSE204-2021](https://moodle.polytechnique.fr/course/view.php?id=12838) Lab session #10

J.B. Scoggins - Adrien Ehrhardt

## Introduction

In this lab, you will get hands-on experience with dimension reduction using Principal Component Analysis. The goal of dimension reduction is to find a suitable transformation which converts a high-dimensional space into a smaller feature space, such that the important information is not lost, but the visualization and interpretability are easier.

In [None]:
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Model
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets

## Step 1: The original MNIST Dataset

We will use the MNIST digits dataset throughout this lab session. The original MNIST dataset provides 60000 28x28 pixel grayscale training images of hand-written digits 0-9. The images are labeled with integer values 0-9. The training set has become the _de facto_ image classification example due to its small size.

In this exercise, we are not interested in classifying images of digits. Instead, we will think of the images as defining a 28x28 = 784 element feature space. In this context, we are interested in transforming the 784 parameters into a smaller set of transformed coordinates.  

**Exercise 1.1:** Before continuing to the next section, use the keras datasets module to load the MNIST dataset, normalize it, and get to know how it is structured.

In [None]:
# x_train, ..., = mnist...  # <- TO UNCOMMENT AND COMPLETE
# x_train = x_train / ...   # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()

- Inspect the dataset. What is the shape of x_train and y_train?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- Plot a few images using `matplotlib.pyplot` to see what they look like.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Step 2: Principal Component Analysis (PCA)

The goal of PCA is to perform an orthogonal transformation which converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called _principal components_. This can be thought of as fitting a $p$-dimensional ellipsoid to the observations.  

Let's consider a dataset $X\in R^{n\times p}$, where $n$ is the number of observations and $p$ the number of variables.  PCA transforms $X$ into a new coordinate system (new variable set), such that the greatest variance in the data is captured in the first coordinate, the next greatest in the second, and so on.  More specifically, the transformed coordinates $T \in R^{n\times p}$ are written as linear combinations of the original variables,

$$ T = X W, $$

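where $W \in R^{p\times p}$ is the transformation matrix. The first column of $W$, denoted as $w_1$, is constructed to maximize the variance of the first transformed coordinate $t_1 = X w_1$. Assuming the columns of $X$ have been centered to zero mean, this variance is proportional to the sum of squares of $t_1$: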

$$ w_1 = \underset{\|w\|=1}{\operatorname{argmax}} \sum_{i=1}^{n} (t_1)_i^2 = \underset{\|w\|=1}{\operatorname{argmax}} \| X w \|_2^2 = \underset{\|w\|=1}{\operatorname{argmax}} \frac{w^T X^T X w}{w^T w} $$

The ratio in the last term is known as the _Rayleigh quotient_. It is well known that for the positive semidefinite matrix $X^T X$, the largest value of the Rayleigh quotient is the largest eigenvalue of the matrix, attained when $w$ is the eigenvector associated with that eigenvalue.

The remaining columns of $W$ can be found by finding the next orthogonal linear combination which maximizes the variance of the data, once the contributions of the previously found components have been subtracted.

$$ w_k = \underset{\|w\|=1}{\operatorname{argmax}} \| (X - \sum_{s=1}^{k-1} X w_s w_s^T) w \|^2_2 $$

Practically, the columns of $W$ are typically computed as the eigenvectors of $X^T X$ ordered by their corresponding eigenvalues in descending order.
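
The eigenvector construction can be checked numerically. The following is a minimal sketch on hypothetical toy data (the matrix `X`, the random seed, and the helper `rayleigh` are illustrative assumptions, not part of the lab): it builds $W$ from the eigenvectors of $X^T X$ and verifies that the Rayleigh quotient at $w_1$ equals the largest eigenvalue and is not exceeded by random directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n = 200 observations, p = 3 correlated variables.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])
X = X - X.mean(axis=0)  # center each column so sums of squares measure variance

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]  # indices for descending order
eigenvalues = eigenvalues[order]
W = eigenvectors[:, order]             # columns are w_1, w_2, w_3


def rayleigh(w):
    """Rayleigh quotient of X^T X at the vector w."""
    return (w @ X.T @ X @ w) / (w @ w)


# The quotient at w_1 equals the largest eigenvalue ...
print(rayleigh(W[:, 0]), eigenvalues[0])

# ... and no random direction does better.
random_directions = rng.normal(size=(1000, 3))
best_random = max(rayleigh(w) for w in random_directions)
print(best_random <= eigenvalues[0] + 1e-9)
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because $X^T X$ is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.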

### Singular Value Decomposition

The Singular Value Decomposition of a matrix $X \in R^{n\times p}$ is given as

$$ X = U \Sigma W^T, $$

where $\Sigma \in R^{n\times p}$ is a rectangular diagonal matrix of non-negative values known as the singular values of $X$, $\sigma(X)$, and $U \in R^{n\times n}$ and $W \in R^{p\times p}$ are orthogonal matrices, whose columns are the left and right (respectively) singular vectors of the matrix $X$. Using this decomposition, we can easily see that

$$ X^T X = W \hat{\Sigma} W^T, $$

where $\hat{\Sigma} = \Sigma^T \Sigma$ is a square diagonal matrix of the squared singular values of $X$. Comparing this to the eigenvalue decomposition of $X^T X = Q \Lambda Q^T$, we see that the singular values of $X$ are the square roots of the eigenvalues of $X^T X$, and the right singular vectors of $X$ are simply the eigenvectors of $X^T X$.  Therefore, we can perform PCA on a data matrix $X$ by computing its right singular vector matrix, $W$.
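
As a quick sanity check of this correspondence, here is a small sketch on a hypothetical random matrix (the data and seed are assumptions made for illustration). It compares `numpy.linalg.svd` with `numpy.linalg.eigh` applied to $X^T X$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))  # hypothetical toy data, n = 50, p = 4

# By default numpy returns the full U (n x n), the singular values as a
# 1-D array in descending order, and W^T (p x p).
U, sigma, Wt = np.linalg.svd(X)
print(U.shape, sigma.shape, Wt.shape)

# Eigendecomposition of X^T X, flipped to descending eigenvalue order.
eigenvalues, Q = np.linalg.eigh(X.T @ X)
eigenvalues, Q = eigenvalues[::-1], Q[:, ::-1]

# Squared singular values match the eigenvalues.
print(np.allclose(sigma**2, eigenvalues))

# Right singular vectors match the eigenvectors up to a sign per column.
print(np.allclose(np.abs(Wt.T), np.abs(Q)))
```

The comparison uses absolute values because an eigenvector is only defined up to its sign, so the two routines may return $w_k$ or $-w_k$.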

### Dimensionality Reduction

We can reduce the dimensionality of our data by truncating the transformed variables to include only a subset of those variables with the highest variance. For example, if we keep the first $L \leq p$ variables, the reduced transformation reads

$$ T_L = X W_L, $$

where $W_L \in R^{p\times L}$ is the eigenvector matrix as before, but taking only the first $L$ columns, so that $T_L \in R^{n\times L}$.  This technique has been widely used to reduce the dimension of high-dimensional datasets by keeping the directions of largest variance in the data, while neglecting the other directions. It can also be used to remove noise from a dataset, under the assumption that the noise accounts for a small share of the variance compared to the true underlying parameterization. Finally, using PCA to find the two directions of highest variance also allows us to visualize a high-dimensional dataset.
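
The truncation step can be sketched as follows, again on hypothetical toy data (the latent-factor construction, the seed, and the choice $L = 2$ are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data: 5 variables driven mostly by 2 latent factors.
latent = rng.normal(size=(300, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 5))
X = X - X.mean(axis=0)  # center the columns

_, sigma, Wt = np.linalg.svd(X)
W = Wt.T

L = 2
W_L = W[:, :L]  # shape (p, L): first L columns of W
T_L = X @ W_L   # shape (n, L): data in the reduced coordinates
print(W_L.shape, T_L.shape)

# The columns of T = X W are uncorrelated: T^T T is diagonal,
# with the squared singular values on the diagonal.
T = X @ W
print(np.allclose(T.T @ T, np.diag(sigma**2), atol=1e-6))

# Fraction of the total variance kept by the first L components.
kept = (sigma[:L] ** 2).sum() / (sigma**2).sum()
print(round(kept, 3))
```

Because the toy data has only two strong latent factors, almost all of the variance survives the reduction from 5 to 2 coordinates. Real datasets rarely have such a clean cut.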

### An example with `sklearn` to get you started: the `iris` dataset

Loading the data

In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target
print("Shape of X:", X.shape, "\n")
print("The 4 features are:", iris.feature_names, "\n")
smpl = np.random.randint(0, X.shape[0], size=10)
print("10 random rows of X:")
print(X[smpl, :], "\n")
print("Their associated label:")
print(y[smpl], "\n")
print("The label names associated to 0, 1 and 2 resp.")
print(iris.target_names)

Let's visualize the first two coordinates and the labels.

Would we be able to directly predict the labels, *i.e.* draw linear decision boundaries for each class?

In [None]:
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.scatter(X[:, 0], 
            X[:, 1], 
            c=y, 
            cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()

... No, not really; let's try in 3D:

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three coordinates")
ax.set_xlabel(iris.feature_names[0])
ax.xaxis.set_ticklabels([])
ax.set_ylabel(iris.feature_names[1])
ax.yaxis.set_ticklabels([])
ax.set_zlabel(iris.feature_names[2])
ax.zaxis.set_ticklabels([])
for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean(),
              X[y == label, 2].mean(),
              name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

plt.show()

The setosa samples are well separated, but the other two classes are not. Unfortunately, it would be hard to draw a 4D graph... So that's where PCA comes in handy!

Let's perform PCA with `sklearn` and represent the data projected onto the first 3 principal axes.

In [None]:
fig = plt.figure()
plt.clf()
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)

plt.cla()
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X_transformed = pca.transform(X)

ax.set_title("First three principal components")
ax.set_xlabel("First principal component")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("Second principal component")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("Third principal component")
ax.zaxis.set_ticklabels([])

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X_transformed[y == label, 0].mean(),
              X_transformed[y == label, 1].mean(),
              X_transformed[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

ax.scatter(X_transformed[:, 0], X_transformed[:, 1], X_transformed[:, 2], 
           c=y, cmap=plt.cm.Set1,
           edgecolor='k')

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()

The 3 types of flowers seem well separated. Is it also the case in 2D now?

In [None]:
x_min, x_max = X_transformed[:, 0].min() - .5, X_transformed[:, 0].max() + .5
y_min, y_max = X_transformed[:, 1].min() - .5, X_transformed[:, 1].max() + .5

plt.scatter(X_transformed[:, 0], 
            X_transformed[:, 1], 
            c=y, 
            cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()

Perfect! It seems we were able to "compress" enough information from the 4 original features into 2 linear combinations of these features: the first two principal components.
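
One way to quantify this impression is the `explained_variance_ratio_` attribute of a fitted `sklearn` PCA object. The sketch below uses the hypothetical names `iris_X` and `pca_full` so that it does not overwrite the variables used in the cells above.

```python
from sklearn import datasets, decomposition

iris_X = datasets.load_iris().data

# Fit PCA with all 4 components to see how the variance is distributed.
pca_full = decomposition.PCA(n_components=4)
pca_full.fit(iris_X)

ratios = pca_full.explained_variance_ratio_
for k, r in enumerate(ratios, start=1):
    print(f"component {k}: {r:.1%} of the variance")
print(f"first two together: {ratios[:2].sum():.1%}")
```

Note that `sklearn` centers the data before computing the decomposition, so these ratios refer to variance around the mean of each feature.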

### The real deal: the `MNIST` dataset

**Exercise 2.1:** Visualize the MNIST dataset in 2 dimensions.

- Reshape the array `x_train` to 2D with n = 60000 and p = 28 x 28 = 784.

In [None]:
# x_train_reshaped = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()

- Use numpy to [compute the SVD](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) of the MNIST images. Recall the notation of $U$, $\Sigma$ and $W^T$ [introduced above](#Step-2:-Principal-Component-Analysis-(PCA)). (Yes, this might take some time... If you get a `MemoryError`, read the documentation carefully).

In [None]:
# u, sigma, w_transpose = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()

- Compute the first two principal components by truncating the eigenvector matrix (*i.e.* the first two columns of $W$ in the [notation introduced above](#Step-2:-Principal-Component-Analysis-(PCA))).
- Multiply the original data by the previous result (*i.e.* the first two principal components): you obtain the data projected onto the new PCA space, in other words $T_L$ for $L=2$ in the [notation introduced above](#Step-2:-Principal-Component-Analysis-(PCA)).

In [None]:
# T_2 = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()

- Plot the original data projected onto the first two principal components using a scatter plot with [`matplotlib.pyplot.scatter`](https://matplotlib.org/3.5.0/api/_as_gen/matplotlib.pyplot.scatter.html), and use the image labels to color the markers.

(*Hint*: the plot might be easier to interpret if you represent fewer points)

In [None]:
# plt.scatter(..., cmap='rainbow')  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()
plt.colorbar();

- What do you notice about how the data is presented in the plot?

YOUR ANSWER HERE

- Which images form a tight cluster in the reduced space?

YOUR ANSWER HERE

### Scree Plot

It is not always clear how many principal components are necessary to accurately represent the high-dimensional space.  There are two widely used methods to help us get a sense of the number of variables required.  The first is called a Scree plot, which plots the eigenvalues of $X^T X$ in descending order.  Since the eigenvalues represent the degree of variance in the corresponding principal components, such a plot can tell us how many components are needed before we reach diminishing returns.
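
To illustrate what "diminishing returns" looks like, here is a minimal sketch on hypothetical toy data in which only 3 directions carry signal (the data, the seed, and the text-based rendering are assumptions for illustration). With `matplotlib`, the same `eigenvalues` array could be passed to `plt.plot` instead.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy data: 20 variables, of which only 3 directions carry signal.
signal = rng.normal(size=(500, 3)) * np.array([10.0, 6.0, 3.0])
X = signal @ rng.normal(size=(3, 20)) + 0.2 * rng.normal(size=(500, 20))

# compute_uv=False returns only the singular values, in descending order.
sigma = np.linalg.svd(X, compute_uv=False)
eigenvalues = sigma**2  # eigenvalues of X^T X

# Text version of a scree plot: bar length proportional to the eigenvalue.
for k, lam in enumerate(eigenvalues[:6], start=1):
    bar = "#" * int(round(40 * lam / eigenvalues[0]))
    print(f"{k:2d} {lam:12.1f} {bar}")
```

In this constructed example the eigenvalues drop sharply after the third component, which is the "elbow" one looks for in a scree plot.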

**Exercise 2.2:** Plot the Scree plot for the MNIST data. _Hint_: use the `sigma` array of singular values of $X$ computed above.
- How many principal components are needed to represent most of the variance?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

### Total Variance Explained

Another method is called _Total Variance Explained_.  In this method, we plot the cumulative sum of the eigenvalues and choose the number of components which gives us a certain percentage of the total variance.
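
The selection rule can be sketched on hypothetical toy data with a decaying spectrum (the data, the seed, and the names `explained` and `n_components` are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical toy data: variable j has standard deviation 0.7**j.
scales = 0.7 ** np.arange(15)
X = rng.normal(size=(2000, 15)) * scales
X = X - X.mean(axis=0)

eigenvalues = np.linalg.svd(X, compute_uv=False) ** 2

# explained[k-1] is the fraction of total variance captured by the first k components.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()

threshold = 0.95
# searchsorted gives the first index where explained >= threshold;
# adding 1 converts that index into a number of components.
n_components = int(np.searchsorted(explained, threshold) + 1)
print(n_components, explained[n_components - 1])
```

`np.searchsorted` works here because the cumulative fraction is non-decreasing. An explicit loop over `explained` would give the same answer.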

**Exercise 2.3:** Plot the cumulative sum of the eigenvalues.
- Plot a horizontal line at 95% of the total sum.
- Based on this, how many components are needed to capture 95% of the variance?
- How does this compare to the Scree plot result?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

### Reconstruct Images

Now that we have an idea of how many principal components are necessary, let's use them to encode the images in a smaller set of features, which we can then decode to reconstruct the images from the lower-dimensional space.  Recall that based on the PCA transformation, we can compute the reconstructed images with

$$ \hat{X} = (X W_L) W_L^T $$
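
A minimal sketch of this encode/decode round trip on hypothetical toy data follows (the data, the seed, and the helper `reconstruct` are assumptions for illustration). It also checks that the Frobenius norm of the reconstruction error equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical toy data: 8 variables with decreasing spread.
X = rng.normal(size=(100, 8)) * np.array([8, 6, 4, 3, 2, 1, 0.5, 0.1])

_, sigma, Wt = np.linalg.svd(X)
W = Wt.T


def reconstruct(X, W, L):
    """Project X onto the first L principal directions and map back."""
    W_L = W[:, :L]
    return (X @ W_L) @ W_L.T


errors = []
for L in (1, 2, 4, 8):
    error = np.linalg.norm(X - reconstruct(X, W, L))
    # Error predicted from the discarded singular values.
    expected = np.sqrt((sigma[L:] ** 2).sum())
    errors.append((error, expected))
    print(L, round(error, 4), round(expected, 4))
```

With $L = p$ nothing is discarded, so the reconstruction reproduces $X$ up to floating-point error.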

**Exercise 2.4:** Plot original and reconstructed images.
- Create a grid of images using `pyplot.subplots` and `imshow`.
  - In the first row, plot the first 5 images of the dataset.
  - In the next 4 rows, plot reconstructions of the images using the first 5, 10, 30, and 100 principal component vectors.
- How do the reconstructed images compare with the originals as you increase the size of the reduced space?

Note that once we have computed the transformation matrix $W$, we essentially have a compression scheme for our images: each image is stored as $L$ coefficients instead of 784 pixel values.  From this perspective, using the first 5, 10, 30, and 100 principal components is equivalent to compressing the data at ratios of roughly 157:1, 78:1, 26:1, and 8:1, respectively.  By contrast, JPEG image compression can obtain compression ratios of 23:1 with reasonable image quality, surpassing the quality of reconstructions with PCA.  For that reason, PCA is not really used for image compression, but it has been used in a number of other fields, particularly in physics and engineering.
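
The compression ratios quoted here follow from dividing the 784 pixel values of an image by the number $L$ of coefficients kept. A short sketch of that arithmetic (the variable names are illustrative):

```python
p = 28 * 28  # number of pixel values per MNIST image

# Ratio between the original and the compressed number of values per image.
ratios = {L: p / L for L in (5, 10, 30, 100)}

for L, ratio in ratios.items():
    print(f"L = {L:3d}: {p} values -> {L} values, ratio {ratio:.1f}:1")
```

This count covers only the per-image coefficients. It ignores the one-off cost of storing $W_L$ itself, which is shared by all images.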


In [None]:
# YOUR CODE HERE
raise NotImplementedError()