<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />


# Generating images with Variational Autoencoders<a id="top"></a>

<i><small>Author: Alberto Díaz Álvarez<br>Last updated: 2023-05-01</small></i></div>
                                                  

***

## Introduction

Variational Autoencoders (VAEs) are a class of generative models that have become very popular in recent years, thanks to their ability to generate high-quality images and other types of data.

| <img src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2021/07/01/ML1533-image003.jpg" alt="Variational Autoencoder" width="50%"> | 
|:--:| 
| *Architecture diagram of a variational autoencoder (VAE). Source: [Deploy variational autoencoders for anomaly detection with TensorFlow Serving on Amazon SageMaker](https://aws.amazon.com/es/blogs/machine-learning/deploying-variational-autoencoders-for-anomaly-detection-with-tensorflow-serving-on-amazon-sagemaker/) (last visited May 01, 2023).* |

Unlike traditional autoencoders, which are mainly used for dimensionality reduction and data compression, VAEs learn a probability distribution over the latent space, so new instances of data can be generated by sampling from that space and decoding the result.

## Goals

We will implement a VAE in Keras for the specific example of generating images of clothing items using the `fashion_mnist` dataset. Our goal is to train a VAE on this dataset so that we can use the model to generate new images that resemble the original ones.

## Libraries and configurations

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import itertools
import math
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import tensorflow as tf

We will also configure some parameters to adapt the graphical presentation.

In [None]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams["axes.grid"] = False
plt.rcParams.update({'figure.figsize': (20, 6),'figure.dpi': 64})

***

## Dataset

Of course, we will continue to work with the `fashion_mnist` dataset. By now we know it quite well.

In [None]:
(x_train, _), (x_test, _) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255, x_test / 255

print(f'Training shape: {x_train.shape} input')
print(f'Test shape:     {x_test.shape} input')

## Implementation of the variational autoencoder

We are going to perform a similar implementation to the vanilla autoencoder. However, while in a basic autoencoder the latent space is simply a compressed representation of the input data, in a VAE, this space is also used to generate new data samples, i.e., data that are not found in the original data set.

This is achieved by adding two layers that output the mean and the (log-)variance of the latent distribution for each input. We also need a KL divergence term, which measures how far that distribution is from a standard normal distribution. Adding it to the loss pushes it towards 0, the value it takes when both distributions coincide.
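To make the KL term concrete, here is a minimal NumPy sketch of its closed form for a diagonal Gaussian against a standard normal, $-\frac{1}{2}\,(1 + \log\sigma^2 - \mu^2 - \sigma^2)$ per latent dimension. The helper name `kl_to_standard_normal` is hypothetical, and averaging over all entries (instead of summing over dimensions) is an assumption chosen to mirror the `tf.reduce_mean` used in the class below.

```python
import numpy as np


def kl_to_standard_normal(z_mean, z_log_var):
    """KL(N(z_mean, exp(z_log_var)) || N(0, I)), averaged over all entries."""
    z_mean = np.asarray(z_mean, dtype=float)
    z_log_var = np.asarray(z_log_var, dtype=float)
    per_entry = 1.0 + z_log_var - np.square(z_mean) - np.exp(z_log_var)
    return -0.5 * np.mean(per_entry)


# A latent distribution that already equals N(0, I) incurs no penalty
print(kl_to_standard_normal(np.zeros((2, 3)), np.zeros((2, 3))))
# Shifting the mean away from 0 makes the term positive
print(kl_to_standard_normal(np.ones((2, 3)), np.zeros((2, 3))))
```

The term only vanishes when the mean is 0 and the variance is 1, which is why minimizing it keeps the encodings close to the distribution we will later sample from.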

In [None]:
class VariationalAutoencoder(tf.keras.models.Model):

    def __init__(self, input_dim, latent_dim):
        super().__init__()
        
        flatten_dim = None
        if isinstance(input_dim, (list, tuple)):
            flatten_dim = math.prod(input_dim)
        elif isinstance(input_dim, int):
            flatten_dim = input_dim
            input_dim = (input_dim,)
        else:
            raise ValueError('Argument input_dim must be a tuple or an int')

        # Encoder
        encoder_input = tf.keras.layers.Input(shape=input_dim)
        encoder_flatten = tf.keras.layers.Flatten()(encoder_input)
        
        z_mean = tf.keras.layers.Dense(latent_dim)(encoder_flatten)
        z_log_sigma = tf.keras.layers.Dense(latent_dim)(encoder_flatten)
        
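        def sampling(args):
            # Reparameterization trick: z = mean + sigma * epsilon, epsilon ~ N(0, I).
            # z_log_sigma is treated as a log-variance, as in the KL term in call().
            z_mean, z_log_sigma = args
            epsilon = tf.random.normal(shape=(tf.shape(z_mean)[0], latent_dim), mean=0., stddev=1.)
            return z_mean + tf.math.exp(0.5 * z_log_sigma) * epsilon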

        z = tf.keras.layers.Lambda(sampling)([z_mean, z_log_sigma])
        
        self.encoder = tf.keras.models.Model(encoder_input, [z_mean, z_log_sigma, z], name='encoder')
        
        # Decoder
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(flatten_dim, activation='sigmoid'),
            tf.keras.layers.Reshape(input_dim)
        ])

    def call(self, inputs):
        z_mean, z_log_sigma, z = self.encoder(inputs)
        reconstructed_input = self.decoder(z)
        
        # We add the KL divergence as loss
        kl_loss = -tf.reduce_mean(z_log_sigma - tf.square(z_mean) - tf.exp(z_log_sigma) + 1) / 2
        self.add_loss(kl_loss)

        return reconstructed_input
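
The `sampling` function in the class is based on the reparameterization trick: the randomness is moved into an auxiliary noise variable so that gradients can flow through the mean and the spread. The following is a minimal NumPy sketch of that idea, not the notebook's implementation. It assumes the second encoder output is interpreted as a log-variance, and the name `sample_latent` and the example values are hypothetical.

```python
import numpy as np


def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mean + sigma * epsilon, with epsilon ~ N(0, I)."""
    epsilon = rng.standard_normal(size=z_mean.shape)
    sigma = np.exp(0.5 * z_log_var)  # log-variance -> standard deviation
    return z_mean + sigma * epsilon


rng = np.random.default_rng(seed=0)
z_mean = np.full((10_000, 2), 3.0)               # hypothetical encoder means
z_log_var = np.full((10_000, 2), np.log(0.25))   # variance 0.25, i.e. sigma 0.5
z = sample_latent(z_mean, z_log_var, rng)

# The empirical statistics should be close to mean 3 and standard deviation 0.5
print(z.mean(axis=0), z.std(axis=0))
```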

Now we will create our VAE.

In [None]:
LATENT_DIM = 256
IMG_SIZE = (28, 28)

vae = VariationalAutoencoder(IMG_SIZE, LATENT_DIM)
vae.compile(loss='binary_crossentropy', optimizer='adam')

And we train it:

In [None]:
history = vae.fit(x_train, x_train, epochs=10)

Let's see how the loss progresses during training:

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

This architecture is a bit more complex, which results in longer training. Let's see how it reconstructs our training images:

In [None]:
n = 4
images = np.array(random.sample(list(x_train), n))

encoded, _, _ = vae.encoder(images)
decoded = vae.decoder(encoded).numpy()
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(images[i])
    plt.title('Original')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded[i])
    plt.title('Reconstructed')

Now let's move on to the test dataset:

In [None]:
images = np.array(random.sample(list(x_test), n))

encoded, _, _ = vae.encoder(images)
decoded = vae.decoder(encoded).numpy()
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(images[i])
    plt.title('Original')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded[i])
    plt.title('Reconstructed')

This kind of network has the advantage that its latent space is continuous: points lying between the encodings of two elements decode to images that share characteristics of both.
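As a rough illustration of what moving between two encodings means, the sketch below builds evenly spaced points on the straight line between two latent vectors. The helper `interpolate_latent` and the two-dimensional example vectors are hypothetical; in this notebook the vectors would have `LATENT_DIM` components.

```python
import numpy as np


def interpolate_latent(z_start, z_end, num_steps):
    """Return num_steps latent vectors evenly spaced from z_start to z_end."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    # Column of blending weights going from 0 (z_start) to 1 (z_end)
    weights = np.linspace(0.0, 1.0, num_steps)[:, np.newaxis]
    return (1.0 - weights) * z_start + weights * z_end


path = interpolate_latent([0.0, 2.0], [4.0, 0.0], num_steps=5)
print(path)
```

Each row of `path` is a latent vector, so with a trained model the whole array could presumably be passed to `vae.decoder` to obtain the intermediate images.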

In [None]:
# 100 images (10 x 10 matrix) of 28 x 28
num_elements = 10
size=28
figure = np.zeros((IMG_SIZE[0] * num_elements, IMG_SIZE[1] * num_elements))

# We walk through the latent space between the boundaries of these values
images = np.array(random.sample(list(x_test), 2))
_, _, encoded_test = vae.encoder(images)
min_z, max_z = np.min(encoded_test), np.max(encoded_test)
grid_x = np.linspace(min_z, max_z, num_elements)
grid_y = np.linspace(min_z, max_z, num_elements)
# We plot the images that correspond to that space
for i, yi in enumerate(grid_x):
    for j, xi in enumerate(grid_y):
        # Even latent dimensions take yi and odd ones take xi
        z_sample = [(xi if k % 2 == 1 else yi) for k in range(LATENT_DIM)]
        x_decoded = vae.decoder.predict(np.array([z_sample]), verbose=0)
        tile = x_decoded[0].reshape(size, size)
        figure[i * size: (i + 1) * size, j * size: (j + 1) * size] = tile

plt.figure(figsize=(12, 12))
plt.imshow(figure)
plt.show()

However, these networks are not well suited to denoising. For example, let's see what happens when we add a small amount of noise to our images:

In [None]:
images = np.array(random.sample(list(x_test), n))
noise_factor = 0.1
noisy_images = images + noise_factor * tf.random.normal(shape=images.shape)
noisy_images = tf.clip_by_value(noisy_images, clip_value_min=0, clip_value_max=1)

_, _, encoded = vae.encoder(noisy_images)
decoded = vae.decoder(encoded).numpy()

plt.figure(figsize=(12,12)) 
for i in range(n):
    ax = plt.subplot(3, n, i + 1)
    plt.imshow(images[i])
    plt.title('Original')

    ax = plt.subplot(3, n, i + 1 + n)
    plt.imshow(noisy_images[i])
    plt.title('Noisy')

    ax = plt.subplot(3, n, i + 1 + 2 * n)
    plt.imshow(decoded[i])
    plt.title('Reconstructed')
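
The visual comparison can be complemented with a number. The following is a minimal sketch, assuming arrays shaped like `images` and `decoded` above; the helper `per_image_mse` is hypothetical and the random arrays are only stand-ins for real images.

```python
import numpy as np


def per_image_mse(originals, reconstructions):
    """Mean squared error of each image, averaged over its pixels."""
    originals = np.asarray(originals, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    pixel_axes = tuple(range(1, originals.ndim))
    return np.mean(np.square(originals - reconstructions), axis=pixel_axes)


rng = np.random.default_rng(seed=0)
clean = rng.random((4, 28, 28))  # stand-in for `images`
noisy = np.clip(clean + 0.1 * rng.standard_normal(clean.shape), 0.0, 1.0)

print(per_image_mse(clean, clean))  # identical images give zero error
print(per_image_mse(clean, noisy))  # error introduced by the noise alone
```

Comparing `per_image_mse(images, decoded)` for clean and noisy inputs would show how much the reconstruction degrades when noise is added.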

## Conclusion

We have implemented a VAE in Keras that is able to encode and decode images from the `fashion_mnist` dataset. As was the case with vanilla autoencoders, the encoding and subsequent decoding has been almost lossless, with the added advantage that latent codes lying between those of several samples decode to images that blend the characteristics of those samples.

However, we have seen that they can be deficient in some cases, for example in denoising.

This implementation can serve as a basis for exploring more complex architectures in other problems, so do not hesitate to use it and experiment.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>