Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# CSE204 - Introduction to Machine Learning - Lab Session 11: Dimensionality Reduction with AutoEncoders

<img src="https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/logo.jpg" style="float: left; width: 15%" />

[CSE204-2021](https://moodle.polytechnique.fr/course/view.php?id=12838) Lab session #11

J.B. Scoggins - Adrien Ehrhardt

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Model
import matplotlib.pyplot as plt
import keras

## Introduction

In this lab, you will get hands-on experience with dimension reduction using Undercomplete Autoencoders. The goal of dimension reduction is to find a suitable transformation which converts a high-dimensional space into a smaller feature space, such that the important information is not lost, but the visualization and interpretability are easier.

## Recall the original MNIST Dataset and PCA decomposition

We will reuse the MNIST digits dataset throughout this exercise. Recall that the original MNIST dataset provides $60{,}000$ grayscale training images of hand-written digits 0-9, each 28x28 pixels. The images are labeled with integer values 0-9. The dataset has become the _de facto_ image classification example due to its small size.

In this exercise, we are not interested in classifying images of digits. Instead, we will think of the images as defining a 28x28 = 784-element feature space. In this context, we are interested in transforming the 784 pixel values into a smaller set of transformed coordinates.

Recall from [lab_session_10](https://adimajo.github.io/CSE204-2021/lab_session_10/lab_session_10.html) that we used the keras datasets module to load the MNIST dataset, normalize it, and get to know how it is structured.

In [None]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0
x_train_reshaped = x_train.reshape(60000, 784)

The shapes of `x_train`, `x_train_reshaped` and `y_train` correspond to what we expect:

In [None]:
print(x_train.shape)  # 28 by 28 pixels
print(x_train_reshaped.shape)  # flattened 28 by 28 = 784 "pixels"
print(y_train.shape)  # one column of labels {0, ..., 9}

We can plot a few images using `matplotlib.pyplot`:

In [None]:
plt.imshow(x_train[0, :, :])
plt.axis('off');

## Principal Component Analysis (PCA) - taken from [lab_session_10](https://adimajo.github.io/CSE204-2021/lab_session_10/lab_session_10.html)

The goal of PCA is to perform an orthogonal transformation which converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called _principal components_. This can be thought of as fitting a $p$-dimensional ellipsoid to the observations.  

Let's consider a dataset $X\in R^{n\times p}$, where $n$ is the number of observations and $p$ the number of variables.  PCA transforms $X$ into a new coordinate system (new variable set), such that the greatest variance in the data is captured in the first coordinate, and then the second, and so on.  More specifically, the transformed coordinates $T \in R^{n\times p}$ are written as a linear combination of the original dataset,

$$ T = X W, $$

where $W \in R^{p\times p}$ is the transformation matrix.  The first column of $W$, denoted as $w_1$, is constructed to maximize the variance of the transformed coordinates.

$$ w_1 = \underset{\|w\|=1}{\operatorname{argmax}} \sum_{i=1}^{n} (t_1)_i^2 = \underset{\|w\|=1}{\operatorname{argmax}} \| X w \|_2^2 = \underset{\|w\|=1}{\operatorname{argmax}} \frac{w^T X^T X w}{w^T w} $$

The ratio in the last term is known as the _Rayleigh quotient_.  It is well known that for the positive semidefinite matrix $X^T X$, the largest value of the Rayleigh quotient is the largest eigenvalue of the matrix, attained when $w$ is the eigenvector associated with that eigenvalue.

Each remaining column of $W$ is found as the next orthogonal direction which maximizes the variance of the data, once the components along the previously found directions have been subtracted.

$$ w_k = \underset{\|w\|=1}{\operatorname{argmax}} \| (X - \sum_{s=1}^{k-1} X w_s w_s^T) w \|^2_2 $$

Practically, the columns of $W$ are typically computed as the eigenvectors of $X^T X$ ordered by their corresponding eigenvalues in descending order.
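As a small illustration of the statement above, the following sketch computes $W$ from the eigenvectors of $X^T X$ on a toy dataset and checks the Rayleigh quotient property numerically. The toy data, the mixing matrix `A` and the helper `rayleigh` are made up for this example, and the columns of $X$ are mean-centered first, which the formulas above leave implicit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative only): n = 200 observations of p = 3 correlated variables
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.1]])
X = rng.normal(size=(200, 3)) @ A
X = X - X.mean(axis=0)  # assumption: PCA is applied to mean-centered columns

# Eigen-decomposition of the symmetric matrix X^T X
eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # reorder to descending
eigvals, W = eigvals[order], eigvecs[:, order]

T = X @ W  # transformed coordinates, one column per principal component

# The sum of squares of each column of T equals the corresponding eigenvalue
col_energy = (T ** 2).sum(axis=0)


def rayleigh(w):
    """Rayleigh quotient of X^T X for a direction w (hypothetical helper)."""
    return (w @ X.T @ X @ w) / (w @ w)


# No random direction should exceed the quotient reached by the first eigenvector
random_quotients = [rayleigh(rng.normal(size=3)) for _ in range(100)]
print("largest eigenvalue:     ", eigvals[0])
print("best random direction:  ", max(random_quotients))
```

The transformed columns are uncorrelated, which is why `T.T @ T` comes out diagonal up to rounding error.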

### Singular Value Decomposition

The Singular Value Decomposition of a matrix $X \in R^{n\times p}$ is given as

$$ X = U \Sigma W^T, $$

where $\Sigma \in R^{n\times p}$ is a rectangular diagonal matrix of non-negative values known as the singular values of $X$, $\sigma(X)$, and $U \in R^{n\times n}$ and $W \in R^{p\times p}$ are orthogonal matrices, whose columns are the left and right (respectively) singular vectors of the matrix $X$.  Using this decomposition, we can easily see that

$$ X^T X = W \hat{\Sigma} W^T, $$

where $\hat{\Sigma} = \Sigma^T \Sigma$ is a square diagonal matrix of the squared singular values of $X$.  Comparing this to the eigenvalue decomposition $X^T X = Q \Lambda Q^T$, we see that the singular values of $X$ are the square roots of the eigenvalues of $X^T X$, and the right singular vectors of $X$ are simply the eigenvectors of $X^T X$.  Therefore, we can perform PCA on a data matrix $X$ by computing its right singular vector matrix, $W$.
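The relation between the two decompositions can be checked numerically. The sketch below uses a small random matrix (an assumption made purely for illustration) and compares `np.linalg.svd` with `np.linalg.eigh`. Singular vectors and eigenvectors are only defined up to sign, so the comparison uses the absolute value of their inner products.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data matrix (illustrative only), mean-centered as is usual for PCA
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

# Thin SVD: s holds the singular values in descending order, Wt is W^T
U, s, Wt = np.linalg.svd(X, full_matrices=False)

# Eigen-decomposition of X^T X, reordered to descending eigenvalues
eigvals, Q = np.linalg.eigh(X.T @ X)
eigvals, Q = eigvals[::-1], Q[:, ::-1]

# |<w_k, q_k>| should be 1 for every k when the eigenvalues are distinct
alignment = np.abs(np.sum(Wt.T * Q, axis=0))

print("squared singular values:", s ** 2)
print("eigenvalues of X^T X:   ", eigvals)
print("alignment of vectors:   ", alignment)
```

In practice the SVD route is often preferred because it avoids forming $X^T X$ explicitly.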

### Dimensionality Reduction

We can reduce the dimensionality of our data by truncating the transformed variables to include only a subset of those variables with the highest variance.  For example, if we keep the first $L \leq p$ variables, the reduced transformation reads

$$ T_L = X W_L, $$

where $W_L \in R^{p\times L}$ is the eigenvector matrix as before, but taking only the first $L$ columns.  This technique has been widely used to reduce the dimension of high-dimensional datasets by accounting for the directions of largest variance in the data, while neglecting the other directions.  In addition, this can also be used to remove noise from a dataset, in which case it is assumed that the noise accounts for a small share of the variance compared to the true underlying parameterization.  Finally, using PCA to find the 2 directions of highest variance also allows us to visualize a high-dimensional dataset.
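A minimal sketch of the truncation, on synthetic data built so that two directions dominate the variance (the data, the sizes `n`, `p`, `L` and the noise level are all assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, L = 300, 10, 2

# Toy data: two strong latent directions embedded in p dimensions, plus small noise
latent = rng.normal(size=(n, 2)) * np.array([5.0, 3.0])
mixing = rng.normal(size=(2, p))
X = latent @ mixing + 0.1 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)  # assumption: mean-centered columns

# Right singular vectors of X are the principal directions
_, s, Wt = np.linalg.svd(X, full_matrices=False)

W_L = Wt[:L].T   # first L columns of W, shape (p, L)
T_L = X @ W_L    # reduced coordinates, shape (n, L)

# Share of the total variance captured by the first L components
explained = (s[:L] ** 2).sum() / (s ** 2).sum()

print("W_L shape:", W_L.shape)
print("T_L shape:", T_L.shape)
print("explained variance share:", explained)
```

Here almost all of the variance sits in the first two components, so `T_L` could be scatter-plotted directly to visualize the dataset.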

### Reconstruct Images

Now that we have an idea of how many principal components are necessary, let's use them to encode the images in a smaller set of features, which we can then decode to reconstruct the images from the lower-dimensional space.  Recall that based on the PCA transformation, we can compute the reconstructed images with

$$ \hat{X} = (X W_L) W_L^T $$

Note that once we have computed the transformation matrix $W$, we essentially have a compression scheme to convert our images into a compressed format. From this perspective, using the first 5, 10, 30, and 100 principal components is equivalent to compressing the data at a rate of roughly 157:1, 78:1, 26:1, and 8:1, respectively (784 divided by the number of components kept).  By contrast, JPEG image compression can obtain compression ratios of 23:1 with reasonable image quality, surpassing the quality of reconstructions with PCA. For that reason, PCA is not really used for image compression, but it has been used in a number of other fields, particularly in physics and engineering.
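The reconstruction formula and the compression ratios can be sketched as follows. The ratios are simply $784 / L$. The reconstruction part runs on a small random matrix rather than on MNIST, and the helper `reconstruct` is a hypothetical name introduced for this example.

```python
import numpy as np

# Compression ratios: number of pixels divided by the number of components kept
p = 28 * 28
ratios = {L: p / L for L in (5, 10, 30, 100)}
for L, ratio in ratios.items():
    print(f"L = {L:3d}: about {ratio:.1f}:1")

# Reconstruction error as a function of L on toy data (illustrative only)
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20)) @ rng.normal(size=(20, 20))
X = X - X.mean(axis=0)  # assumption: mean-centered columns
_, _, Wt = np.linalg.svd(X, full_matrices=False)


def reconstruct(X, Wt, L):
    """Encode X with the first L principal directions, then decode it back."""
    W_L = Wt[:L].T
    return (X @ W_L) @ W_L.T


kept = (1, 5, 10, 20)
errors = [np.mean((X - reconstruct(X, Wt, L)) ** 2) for L in kept]
for L, err in zip(kept, errors):
    print(f"L = {L:2d}: mean squared reconstruction error = {err:.3e}")
```

The error shrinks as more components are kept and vanishes (up to rounding) when $L = p$, since $W W^T$ is then the identity.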

## Autoencoders

In [lab_session_10](https://adimajo.github.io/CSE204-2021/lab_session_10/lab_session_10.html), we have seen that PCA was effective for producing a 2D or 3D representation of the points using "interpretable" principal components (linear combinations of the original features - see the `iris` dataset), but not very effective for image compression: we needed at least 100 coordinates (an 8:1 compression) to have credible digits.

In this lab, we will devise a compression strategy using another method, autoencoders. **Autoencoders are neural networks which are trained to output their input** in such a way that they learn a reduced dimensional space of the input distribution. They are generally composed of two distinct "layers" (or two parts possibly composed of several layers). The first encodes the input space (encoder) and the second decodes the encoded space back to the original feature space (decoder). There are 3 basic types of autoencoders:
1. __Undercomplete__ autoencoders work by constructing a network that has a hidden code layer that has fewer nodes than the input and output layers. After training, the smaller hidden layer will represent an encoding of the input onto a lower dimensional space.
2. __Regularized__ autoencoders use various regularization terms in the loss function during training to constrain the learned representation. For example, sparse autoencoders add a sparsity regularization term in the loss to force as many nodes as possible in the hidden layers to be zero.
3. __Variational__ autoencoders work slightly differently than the previous two. In this case, the autoencoder learns parameters that model the distribution of the input data in the encoder. The decoder is then used to reconstruct the output based on a random sample from this distribution. Their use is mainly directed towards data generation (*e.g.* generating credible faces).

In this exercise, we will construct two **undercomplete** autoencoders and train them on the MNIST data as before with PCA.

### First undercomplete autoencoder: dense *linear* decoder

Recall that a dense linear neural network in the context of regression is similar to linear regression. Similarly, it is well known that an autoencoder with a linear decoder layer and a mean-squared-error loss function will learn the same feature space as PCA. Let's check this by creating a simple linear autoencoder.
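Before training anything, the claim can be made plausible with plain NumPy. The sketch below is not a trained autoencoder: it treats a linear encoder as a matrix $E$ and a linear decoder as a matrix $D$, and checks two things on toy data. First, no randomly chosen linear encoder, even paired with its least-squares optimal decoder, reconstructs better than PCA. Second, mixing the PCA code with an invertible matrix `A` (chosen arbitrarily here) gives exactly the same reconstruction, which is why a linear autoencoder can only be expected to recover the PCA subspace and not the PCA axes themselves.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, L = 200, 8, 2

# Toy data (illustrative only), mean-centered
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)


def mse(X, X_hat):
    """Mean squared reconstruction error."""
    return np.mean((X - X_hat) ** 2)


# PCA as an encoder/decoder pair: encode with W_L, decode with W_L^T
_, _, Wt = np.linalg.svd(X, full_matrices=False)
W_L = Wt[:L].T
pca_error = mse(X, X @ W_L @ W_L.T)

# Random linear encoders, each with its least-squares optimal linear decoder
random_errors = []
for _ in range(20):
    E = rng.normal(size=(p, L))                # arbitrary encoder
    Z = X @ E                                  # codes
    D, *_ = np.linalg.lstsq(Z, X, rcond=None)  # best decoder for these codes
    random_errors.append(mse(X, Z @ D))

# Same subspace, different coordinates: mix the PCA code with an invertible A
A = np.array([[2.0, 1.0],
              [0.5, -1.5]])
E_mixed = W_L @ A
D_mixed = np.linalg.inv(A) @ W_L.T
mixed_error = mse(X, X @ E_mixed @ D_mixed)

print("PCA error:                ", pca_error)
print("best random encoder error:", min(random_errors))
print("mixed PCA code error:     ", mixed_error)
```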

**Exercise 1:** Create a simple linear autoencoder.

- Write a function which takes the `input_size` (denoted by $p$ previously) and the `code_size` (denoted by $L$ previously) and returns an autoencoder model using Keras.
  - The autoencoder should be comprised of
    - A dense **encoder** layer taking `input_size` inputs with `code_size` nodes and no activation function (implies an identity / linear activation function).
    - An identity / linear **decoder** layer with `input_size` nodes.
  - Compile the autoencoder using the Adam optimizer and MSE loss
  - In addition to the **autoencoder** (the encoder + decoder), return the **encoder** as well.
  
*Hint*: use another `Model` which just takes the input and returns the output of the encoder layer (the encoder).  See the [functional API documentation](https://keras.io/getting-started/functional-api-guide/).

In [None]:
def linear_autoencoder(input_size, code_size: int):
    """
    Instantiates and compiles an autoencoder, returns both the autoencoder and just the encoder
    (i.e. autoencoder except last layer)

    :param int or tuple input_size: shape of the input samples
    :param int code_size: dimension on which to project the original data
    :return: autoencoder, encoder
    """
    # autoencoder = ...
    # encoder = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return autoencoder, encoder

**Exercise 2:** Train the linear autoencoder.

- Using your function, create a linear autoencoder with the correct `input_size` and a `code_size` of 2 (similar to what we did with PCA).
- Train the model using the MNIST data as input and output for at least 7 epochs.
- Plot the history of the loss versus the epoch number to make sure training is basically complete.

*Hint*: the history of the loss is silently returned by the call to [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit).

In [None]:
# linear_ae, linear_encoder = linear_autoencoder(input_size, code_size)
# history = ...
# plt.plot(history...)
# YOUR CODE HERE
raise NotImplementedError()

**Exercise 3:** Use the trained encoder to project the MNIST data to 2 features.
- Plot the two components in a scatter similar to the one we produced in [lab_session_10](https://adimajo.github.io/CSE204-2021/lab_session_10/lab_session_10.html) for PCA.
- How does the scatter plot compare to the one we made with PCA?

Recall that this autoencoder should learn the same vector space as PCA, though it will not learn the exact same transformation (could be rotated, scaled, etc.). Similarly to [lab_session_10](https://adimajo.github.io/CSE204-2021/lab_session_10/lab_session_10.html), you might want to draw fewer points to be able to see something.

In [None]:
# ... linear_encoder...(...)  # <- TO UNCOMMENT AND COMPLETE
# plt.scatter(...)  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()
plt.colorbar();

YOUR ANSWER HERE

### Second undercomplete autoencoder: *nonlinear* decoder

We saw in the previous section that linear decoders and MSE loss produce the same result as PCA.  Therefore, we can see nonlinear decoders as a nonlinear generalization of PCA. By allowing nonlinear transformations, we should be able to increase the "expressiveness", or the quality of the representation / separation of our reduced variables.  

**Exercise 4:** Create a nonlinear autoencoder.

- Copy your linear autoencoder function into `nonlinear_autoencoder`.
- Add a dense hidden layer in between the encoder output and decoder output layers. Give the hidden layer `input_size // 2` nodes (the number of nodes must be an integer) and use ReLU activations for both the encoder and the additional hidden layer.
- Keep the same optimizer and loss.
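To make the intended shapes concrete, here is a NumPy-only sketch of one forward pass through such an architecture. It is not the Keras model asked for in the exercise: the weights are random and untrained, biases are omitted for brevity, and the names `W_enc`, `W_hid` and `W_out` are hypothetical.

```python
import numpy as np


def relu(x):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(x, 0.0)


input_size, code_size = 784, 2
hidden_size = input_size // 2  # integer division, so the layer has 392 nodes

rng = np.random.default_rng(5)

# Random, untrained weights: only the shapes matter in this sketch
W_enc = rng.normal(scale=0.05, size=(input_size, code_size))
W_hid = rng.normal(scale=0.05, size=(code_size, hidden_size))
W_out = rng.normal(scale=0.05, size=(hidden_size, input_size))

x = rng.random(size=(4, input_size))  # batch of 4 fake flattened images

code = relu(x @ W_enc)       # encoder output, shape (4, code_size)
hidden = relu(code @ W_hid)  # extra hidden layer, shape (4, hidden_size)
x_hat = hidden @ W_out       # linear output layer, shape (4, input_size)

print(code.shape, hidden.shape, x_hat.shape)
```

Note that with a ReLU on the encoder, the code values are non-negative, so the scatter plot of the two code dimensions will live in the positive quadrant.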

In [None]:
def nonlinear_autoencoder(input_size, code_size: int):
    """
    Instantiates and compiles an autoencoder, returns both the autoencoder and just the encoder

    :param int or tuple input_size: shape of the input samples
    :param int code_size: dimension on which to project the original data
    :return: autoencoder, encoder
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return autoencoder, encoder

**Exercise 5:** Train the nonlinear autoencoder.
- Create the nonlinear AE using 2 representation features as with the linear model.
- Train as before for at least 7 epochs and plot the loss history.

In [None]:
# nonlinear_ae, nonlinear_encoder = nonlinear_autoencoder(input_size, code_size)
# history = ...
# plt.plot(...)
# YOUR CODE HERE
raise NotImplementedError()
plt.show()

**Exercise 6:** Plot the scatter plot of the reduced variables (similar to Exercise 3).

- What can you say about grouping of points using the nonlinear model?  Does it seem to cluster the digits better than with the linear one?

In [None]:
# ... nonlinear_encoder...(...)
# plt.scatter(...)  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()
plt.colorbar();

YOUR ANSWER HERE

### Reconstruct Images

**Exercise 7:** Use the autoencoders to produce reconstructed images from the MNIST data as we did with PCA.
- Train linear and nonlinear autoencoders on the MNIST data using a `code_size` of 15.
- Compare the loss histories of the training for both models on the same plot. What does this tell you about the expressiveness, i.e. the quality of the reconstruction of the two models?

In [None]:
# Train autoencoder models, get history of losses

# Plot the losses

# YOUR CODE HERE
raise NotImplementedError()
plt.legend(['linear', 'nonlinear']);

It should be clear now that you were able to achieve a lower loss, all else equal, using the nonlinear autoencoder. Let's see what this will yield in terms of reconstruction quality / compression capability.

**Exercise 8:** Use the two AEs to produce reconstructed images (see [lab_session_10](https://adimajo.github.io/CSE204-2021/lab_session_10/lab_session_10.html)).
- Generate a grid of images
  - The first row should contain the first 5 images in the MNIST set as before.
  - The second row should contain their reconstruction using the linear model.
  - The third row should contain the reconstructions using the nonlinear model.
- How well does each of the models reproduce the images?
- How do they compare to the PCA reconstructions?

Recall from Exercise 7 that we used a `code_size` of 15, achieving a 784:15 $\approx$ 52:1 compression! 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Beyond this lab

* See the Wikipedia entries for [JPEG](https://en.wikipedia.org/wiki/JPEG#Encoding) ([discrete cosine basis](https://en.wikipedia.org/wiki/Discrete_cosine_transform)) and [JPEG2000](https://en.wikipedia.org/wiki/JPEG_2000) ([wavelet transform basis](https://en.wikipedia.org/wiki/Wavelet_transform)).

* Copy / paste the `nonlinear_autoencoder` function above and make the model more complex (*e.g.* more layers, different activation functions, `Dropout`) and rerun Exercises 7-8: were you able to achieve even better image reconstruction quality?