##### Symposium "Recent Advances in Deep Learning Systems", Reisensburg/UUlm, 05.11.2019 - 07.11.2019
##### Christian Jarvers, Heinke Hihn, Institute for Neural Information Processing, UUlm

# Unsupervised Learning with Autoencoders

**Autoencoders** \[1\] are a useful method for unsupervised learning (learning without ground-truth labels) and can also be used for semi-supervised learning by simultaneously discovering structure in unlabeled data and training a classifier on the discovered features. In general, autoencoders try to **reconstruct** their input data, essentially implementing an identity mapping. This may not be interesting in itself, but the goal is to learn hidden representations that summarize key aspects of the data (similar to [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), for example, but more powerful due to the non-linearities involved).

A recent extension, the **variational autoencoder** \[2,3\], broadens this objective. The goal of a variational autoencoder is not only to reconstruct samples, but also to **generate** new examples that could have been taken from the dataset (formally, that come from the same distribution as the original data).

\[1\] [Hinton & Salakhutdinov (2006). Reducing the dimensionality of data with neural networks.](https://doi.org/10.1126/science.1127647)

\[2\] [Kingma & Welling (2014). Auto-Encoding Variational Bayes.](https://arxiv.org/abs/1312.6114)

\[3\] [Kingma & Welling (2019). An Introduction to Variational Autoencoders.](https://arxiv.org/abs/1906.02691)

In [None]:
import tensorflow as tf
from tensorflow.keras import Sequential, Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Input

from matplotlib import pyplot as plt

We will start with the [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, but feel free to try out other datasets as well.

In [None]:
(train_imgs, train_lbls), (val_imgs, val_lbls) = tf.keras.datasets.fashion_mnist.load_data()
train_imgs = train_imgs.reshape(-1, 28, 28, 1) / 255.0
val_imgs = val_imgs.reshape(-1, 28, 28, 1) / 255.0

## Trivial Reconstruction

With autoencoders, we frame an unsupervised learning problem (discovering structure in the data) as a supervised task (reconstructing the input). It is important not to lose sight of the actual goal of unsupervised learning by caring only about reconstruction accuracy: if we loosen the constraint that the network should learn a useful hidden representation, good reconstruction can become trivial.

In order to see this, try the following: train a network with a single hidden layer of 1000 units to reconstruct the input. Do not use any non-linearities. 

A useful design pattern for writing autoencoders is to instantiate one model that implements the encoder, a separate model that implements the decoder, and a combined model. We then train the combined model, but can later use the encoder and decoder independently if we wish.

In [None]:
trivial_encoder = Sequential([
    Input(shape=(28, 28, 1)),
    Flatten(),
    Dense(1000)
])

trivial_decoder = Sequential([
    Input(shape=(1000,)),
    Dense(784)
])


trivial_ae = Sequential([trivial_encoder, trivial_decoder])

trivial_ae.compile(optimizer='Adam', loss='mse')

trivial_ae.fit(train_imgs, train_imgs.reshape(-1, 784), batch_size=64, epochs=10, shuffle=True)

We will use the following function to visualize the results of our autoencoder. The left column shows the inputs, the right column shows the outputs.

In [None]:
def show_ae(model, imgs, num_imgs=5):
    imgs = tf.random.shuffle(imgs)[:num_imgs, :, :]
    p = model.predict(imgs)
    plt.rcParams["figure.figsize"] = (10, 2)

    for i in range(num_imgs):
        plt.subplot(121)
        plt.imshow(imgs[i].numpy().squeeze(), cmap='gray')
        plt.subplot(122)
        plt.imshow(p[i].reshape(28,28), cmap='gray')
        plt.show()

In [None]:
show_ae(trivial_ae, train_imgs)

Chances are that this simple model will achieve a near perfect reconstruction. However, it did not actually learn anything useful about the data. In order to see this, it is useful to visualize the **latent space** of our model.

## Visualizing Latent Space

In order to see how our autoencoder represents the data, we plot the activations and color them according to their class. If samples from the same class are close together, but samples from different classes are far apart, the representation seems to mirror the structure of the dataset.

Since the activation vectors in our hidden layer are high-dimensional, we cannot plot them directly. Instead, we project them into 2D space using **principal component analysis**.

In [None]:
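def plot_pca(imgs, lbls, num_samples=100):
    # Draw a random subset of samples to keep the scatter plot readable.
    indices = tf.random.shuffle(range(imgs.shape[0]))[:num_samples]
    imgs = imgs[indices, :]
    lbls = lbls[indices]
    # Center the data; the left singular vectors of the transposed data
    # matrix are the principal directions in feature space.
    imgs_centered = imgs - tf.reduce_mean(imgs, axis=0)
    s, U, V = tf.linalg.svd(tf.transpose(imgs_centered))
    # Keep only the first two principal directions and project onto them.
    projection = U[:, :2]
    projected = tf.matmul(imgs_centered, projection)
    plt.rcParams["figure.figsize"] = (10, 10)
    plt.scatter(projected[:, 0], projected[:, 1], c=lbls)
    plt.show()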

As a baseline, we can visualize the training images themselves.

In [None]:
plot_pca(train_imgs.reshape(-1, 784), train_lbls, num_samples=100)

As we can see, the different classes are mixed quite strongly.

Does our autoencoder do better?

In [None]:
hidden = trivial_encoder(train_imgs)

In [None]:
plot_pca(hidden.numpy(), train_lbls, num_samples=100)

Evidently, our trivial autoencoder did not learn anything useful about the data. Instead, the output layer simply learns to approximate the inverse of the hidden layer. In order to make sure that our network discovers useful structure in the data, we need to impose some **restrictions**.

## Compression, Sparsity, Denoising

The reason why the trivial autoencoder does not learn anything is that it is simultaneously too simple and too powerful: the network is linear and can therefore only discover a limited amount of structure in the data (at best, its principal components), but it is powerful enough to implement an **identity mapping**, which solves the reconstruction problem perfectly. In general, whenever a network can implement the identity mapping, it is likely to converge to this trivial solution.
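To make the "too powerful" part concrete, here is a minimal NumPy sketch (independent of Keras). It assumes a random, untrained linear encoder with 1000 hidden units and uses random values as a stand-in for flattened images. The names `w_enc` and `w_dec` are illustrative only. Because the hidden layer is wider than the input, a linear decoder that inverts the encoder exists (here obtained with the pseudoinverse), so reconstruction can be essentially exact without the encoder having learned anything.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a batch of flattened 28x28 images (random values, not real data).
x = rng.random((16, 784))

# A random, untrained linear encoder with more hidden units than inputs.
w_enc = rng.normal(size=(784, 1000))

# The pseudoinverse is a right inverse of w_enc, because w_enc has full row rank.
w_dec = np.linalg.pinv(w_enc)

hidden = x @ w_enc               # "encode" without any non-linearity
reconstruction = hidden @ w_dec  # "decode" with the inverting linear map

max_error = np.max(np.abs(reconstruction - x))
print(f"largest absolute reconstruction error: {max_error:.2e}")
```

The error is at the level of floating-point round-off, even though the encoder weights are pure noise.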

Thus, we need a more powerful network (using several non-linear layers), but at the same time need to ensure that it cannot simply learn an identity function. Three common ways to achieve this are:

- **Compression**: use a much smaller number of units in the output layer of the encoder than in the input. This forces the autoencoder to perform dimensionality reduction. Thus, in order to minimize the reconstruction loss, it needs to discard irrelevant information about the stimulus and keep relevant dimensions.
- **Sparsity**: add a loss term that penalizes the average activation of the encoder output for deviating from a certain target value (a common choice is 0.2). This is a similar principle to compression, but gives the network more freedom to choose the number of units it needs to represent a specific sample. If the sample is a very typical specimen, only a few neurons may be required, but the network can use more neurons to reconstruct an unusual input, even though this incurs a higher penalty.
  
  A common choice for the sparsity loss is the Kullback-Leibler divergence between the average hidden activation $\bar{\mathbf{h}}$ and the goal value $\kappa$.
  
$$ D_{KL}(\kappa \| \bar{\mathbf{h}}) = \kappa \cdot \log \frac{\kappa}{\bar{\mathbf{h}}} + (1-\kappa) \cdot \log \frac{1 - \kappa}{1 - \bar{\mathbf{h}}}$$

- **Denoising**: corrupt the input with random noise and have the autoencoder reconstruct the noise-free image. In order to tell signal from noise, the network has to learn relevant structure from the data.
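The sparsity penalty can be sketched in plain NumPy as follows. This is a minimal illustration, not a drop-in Keras loss. It assumes hidden activations in the range (0, 1), for example from a sigmoid layer, and it sums the per-unit divergences over all hidden units. The function name `sparsity_penalty` and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np

def sparsity_penalty(hidden, kappa=0.2, eps=1e-7):
    """KL divergence between the target kappa and each unit's mean activation.

    hidden: array of shape (batch, units) with values in (0, 1).
    Returns the divergence summed over all hidden units.
    """
    # Average activation of every hidden unit over the batch,
    # clipped so that the logarithms stay finite.
    h_bar = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    kl = (kappa * np.log(kappa / h_bar)
          + (1.0 - kappa) * np.log((1.0 - kappa) / (1.0 - h_bar)))
    return kl.sum()

# Units that are active at exactly the target rate incur (almost) no penalty ...
on_target = np.full((8, 5), 0.2)
# ... while units that are active nearly all the time are penalized.
too_active = np.full((8, 5), 0.9)

print(sparsity_penalty(on_target), sparsity_penalty(too_active))
```

In training, a term like this would be added to the reconstruction loss, usually scaled by a weighting factor that has to be tuned.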

Try out some of these approaches. In each case, visualize the latent space. Try to find an architecture that yields good reconstructions, while simultaneously separating the classes in latent space. Note that you can also combine approaches.
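As one possible starting point for the denoising variant, the sketch below corrupts a batch of images with additive Gaussian noise and, optionally, randomly zeroed pixels. The helper name `corrupt` and the default values of `noise_std` and `drop_prob` are assumptions chosen for illustration. Suitable noise levels depend on the dataset.

```python
import numpy as np

def corrupt(images, noise_std=0.3, drop_prob=0.0, seed=None):
    """Return a noisy copy of `images`, with pixel values kept in [0, 1]."""
    rng = np.random.default_rng(seed)
    # Additive Gaussian noise on every pixel.
    noisy = images + rng.normal(scale=noise_std, size=images.shape)
    if drop_prob > 0.0:
        # Optionally set a random subset of pixels to zero (masking noise).
        keep = rng.random(images.shape) >= drop_prob
        noisy = noisy * keep
    return np.clip(noisy, 0.0, 1.0)

# Random values as a stand-in for a small batch of normalized images.
clean = np.random.default_rng(0).random((4, 28, 28, 1))
noisy = corrupt(clean, noise_std=0.3, drop_prob=0.1, seed=1)
print(noisy.shape, float(noisy.min()), float(noisy.max()))
```

Training would then pair noisy inputs with clean targets, for instance `ae.fit(corrupt(train_imgs), train_imgs.reshape(-1, 784), batch_size=64, epochs=10)`.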

In [None]:
encoder = Sequential([
    Input(shape=(28, 28, 1)),
    Flatten(),
    Dense(1000, activation='relu'),
    Dense(500, activation='relu'),
    Dense(30, activation='relu')
])

decoder = Sequential([
    Input(shape=(30,)),
    Dense(500, activation='relu'),
    Dense(1000, activation='relu'),
    Dense(784)
])

ae = Sequential([encoder, decoder])

In [None]:
ae.compile(optimizer='Adam', loss='mse')

In [None]:
ae.fit(train_imgs, train_imgs.reshape(-1, 784), batch_size=64, epochs=10, shuffle=True)

In [None]:
show_ae(ae, train_imgs)

In [None]:
hidden = encoder(train_imgs)
plot_pca(hidden.numpy(), train_lbls, num_samples=200)

## Semi-supervised learning

Once you have a working autoencoder, use it for semi-supervised learning. Use your encoder as the basis for a classifier by adding a single output layer on top of it. Select a small portion of your training set (e.g., 1%) to train this classifier (_Hint_: you may get better results by setting the `trainable` property of the encoder layers to `False` initially).

For comparison, train the same architecture (encoder layers + output layer) from scratch (i.e., with different, randomly initialized weights) using the same amount of data. Compare the generalization performance of both networks.
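Selecting the small labeled portion can be done in several ways. The sketch below draws the same fraction from every class, so that no class is missing by chance. This stratification is a design choice of this example, not a requirement of the exercise, and the helper name `sample_labeled_subset` is hypothetical.

```python
import numpy as np

def sample_labeled_subset(images, labels, fraction=0.01, seed=0):
    """Pick roughly `fraction` of the samples, the same share from each class."""
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(labels):
        cls_indices = np.flatnonzero(labels == cls)
        # Keep at least one example per class.
        n = max(1, int(round(fraction * len(cls_indices))))
        chosen.append(rng.choice(cls_indices, size=n, replace=False))
    # Shuffle so the subset is not ordered by class.
    indices = rng.permutation(np.concatenate(chosen))
    return images[indices], labels[indices]

# Stand-in data: 1000 blank "images" with 10 evenly distributed classes.
images = np.zeros((1000, 28, 28, 1))
labels = np.arange(1000) % 10

small_imgs, small_lbls = sample_labeled_subset(images, labels, fraction=0.01)
print(small_imgs.shape, np.bincount(small_lbls))
```

With the data loaded earlier, the call would be `sample_labeled_subset(train_imgs, train_lbls)`. The same subset should be used for both the pretrained and the randomly initialized classifier so that the comparison is fair.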

## Variational Autoencoders

Autoencoders using compression, sparsity or denoising are good at **representing** key features of input data and at **reconstructing** it. However, they tend to be quite bad at generating new data. Try this out by generating a random vector (for example, normally distributed around 0) of the right size for your decoder and decoding it. What does the decoded output look like?

The reason why this tends to work badly is that normal autoencoders can have a very irregular latent space. Some regions of it may not be covered by trained input data at all, while other regions may house many different data points, such that small changes in latent state lead to big changes in the reconstructed output.

A more regular latent space is desirable for several reasons, for example because it makes relationships between examples more interpretable: examples that are close to each other in latent space should really be similar in the real world, independent of where in latent space they are. In addition, being able to sample from untrained points in our latent space gives us the possibility to generate new, **synthetic data**.

Ensuring this property is the idea behind **variational autoencoders**. In this approach, the encoder does not simply output a point in latent space, but a probability distribution (typically a normal distribution). The decoder then reconstructs a sample from this distribution. Essentially, this amounts to **adding noise** to the latent state of each sample. Note however that this is different from a denoising autoencoder: we apply the noise to the output of the encoder, not to the input, and we let the encoder specify the shape of the noise distribution.

Begin by writing the encoder. It can look like any encoder you used before, except for the output: the variational encoder has two linear output layers, which you may call `mu` and `log_var`, both of the same size (the number of dimensions of your latent space). `mu` is the predicted mean value of the distribution and `log_var` is the log of the variance (we use the log so it can cover the whole number line, not just the positive half).

In [None]:
from tensorflow.keras.layers import Lambda
hidden_size = 30

encoder_in = Input(shape=(28, 28, 1))
x = Flatten()(encoder_in)
x = Dense(1000, activation='relu')(x)
x = Dense(500, activation='relu')(x)
mu = Dense(hidden_size, name="mu")(x)
log_var = Dense(hidden_size, name="log_var")(x)

Now implement the sampling: generate a vector with normally distributed values that have mean `mu` and variance `exp(log_var)`. A `Lambda` layer may be useful for this purpose.

In [None]:
from tensorflow.keras.layers import Concatenate

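def sampling(distribution):
    # distribution stacks mu and log_var along the first axis.
    mu = distribution[0, :, :]
    log_var = distribution[1, :, :]
    # Standard deviation is exp(log_var / 2), so the noise has variance exp(log_var).
    noise = tf.random.normal(shape=tf.shape(mu), mean=0., stddev=tf.exp(log_var / 2.))
    return mu + noise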

distribution = Lambda(tf.stack, name="distribution")([mu, log_var])
encoder_out = Lambda(sampling, name="encoder_out")(distribution)

vae_encoder = Model(encoder_in, encoder_out)

Your decoder can have the same shape as before. It takes the sampled values as input.

In [None]:
from tensorflow.keras.layers import Reshape

decoder_in = Input(shape=(hidden_size,))
x = Dense(500, activation='relu')(decoder_in)
x = Dense(1000, activation='relu')(x)
x = Dense(784)(x)
decoder_out = Reshape(target_shape=(28, 28, 1))(x)

vae_decoder = Model(decoder_in, decoder_out)

We train the variational autoencoder with a reconstruction loss (the mean squared error between input and reconstruction) as before. However, we add an additional loss function that constrains the hidden representation. This **distribution loss** is the Kullback-Leibler divergence between the distribution predicted by the encoder and a standard normal distribution (mean 0, variance 1), which simplifies to:

$$ D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = -\frac{1}{2} \sum (1 + \log(\sigma^2) - \mu^2 - \sigma^2) $$
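As a sanity check on the closed form, the following NumPy sketch compares it with a sampling-based estimate of the same divergence. The function names and the example values of `mu` and `log_var` are made up for illustration. The sum runs over the latent dimensions of a single sample.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, exp(log_var)) || N(0, 1)), summed over dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def kl_monte_carlo(mu, log_var, num_samples=200_000, seed=0):
    """Sampling-based estimate of the same quantity, for comparison only."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)
    # Draw z ~ N(mu, sigma^2) and average log q(z) - log p(z).
    z = mu + sigma * rng.standard_normal((num_samples, mu.shape[-1]))
    log_q = -0.5 * (np.log(2 * np.pi) + log_var + ((z - mu) / sigma) ** 2)
    log_p = -0.5 * (np.log(2 * np.pi) + z**2)
    return np.mean(np.sum(log_q - log_p, axis=-1))

mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.7])

exact = kl_to_standard_normal(mu, log_var)
estimate = kl_monte_carlo(mu, log_var)
print(f"closed form: {exact:.4f}, Monte Carlo estimate: {estimate:.4f}")
```

The two values should agree up to sampling noise, and the divergence is zero only when `mu` is 0 and `log_var` is 0 in every dimension.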

Implementing this loss function requires two tricks. First, Keras expects each loss function to combine a ground truth with a prediction. However, our distribution loss only depends on the distribution, so no ground truth is required. Thus, your loss function should ignore its first argument. The second trick is that we need both `mu` and `log_var`, but we can only provide one prediction input. The solution to this is to use `tf.stack` to combine the two tensors before passing them to the function.

In [None]:
whole_model = vae_decoder(encoder_out)
vae = Model(encoder_in, [whole_model, distribution])

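def kld_loss(ground_truth, distribution):
    # ground_truth is ignored; Keras passes it because every loss takes two arguments.
    mu = distribution[0, :, :]
    log_var = distribution[1, :, :]
    # KL divergence between N(mu, exp(log_var)) and a standard normal distribution.
    loss = -0.5 * tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var))
    return loss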

Train the variational autoencoder with both losses. Again, visualize some reconstructions as well as the latent space.

In [None]:
vae.compile(optimizer='Adam', loss=['mse', kld_loss])

In [None]:
# The second target is a dummy: kld_loss ignores its ground-truth argument.
vae.fit(train_imgs, [train_imgs, tf.zeros(train_imgs.shape[0])], epochs=10, batch_size=64)

In [None]:
encoded = vae_encoder(train_imgs)

In [None]:
plot_pca(encoded.numpy(), train_lbls, num_samples=100)

Since the variational autoencoder is a generative model, you can use it to generate new samples. To do this, generate a vector of normally distributed values with the dimension of your latent space and pass it to your decoder. Once this works, you may want to try it with other datasets as well.

In [None]:
hidden = tf.random.normal(shape=(1, hidden_size))

In [None]:
plt.imshow(vae_decoder(hidden).numpy().squeeze(), cmap='gray')
plt.show()