# 8. Generative Deep Learning

Sampling from a latent space of images to create entirely new images or edit existing ones is currently the most popular and successful application of creative AI. Here, we review some concepts pertaining to image generation using <b>variational autoencoders (VAEs)</b> and <b>generative adversarial networks (GANs)</b>. These techniques are not limited to images; they can also be applied to sound, music, or text.

### 8.4 - Generating Images with Variational Autoencoders (VAEs)

#### Sampling from Latent Spaces of Images
The key idea of image generation is to develop a low-dimensional <i>latent space</i> of representations where any point can be mapped to a realistic-looking image. The module capable of taking an input in latent space and outputting a generated sample is a generator (in GANs) or a decoder (in VAEs). Once a latent space is developed, we can sample points from it, either deliberately or at random, and by mapping them to an image space, generate samples that have never been seen before.

<img src="img84a.png" width="800">

GANs and VAEs are two different strategies for learning such latent spaces of image representations, each with its own characteristics. 
- VAEs are great for learning latent spaces that are well structured, where specific directions encode a meaningful axis of variation in the data. 
- GANs can generate highly realistic images, but the latent space they come from may not have much structure or continuity.

#### Concept Vectors for Image Editing

<b>Concept vectors</b> were introduced when discussing word embeddings. Given a latent space of representations, or an embedding space, certain directions in the space may encode interesting axes of variation in the original data. In language, with a latent space of words with semantic meaning, we might have a concept vector $g$ that maps a male entity to its female counterpart, e.g. `king --> queen`. In computer vision, given a latent space of images of faces, there may be a smile vector $s$: if a latent point $z$ is the embedded representation of a face, then $s(z) = z + s$ is the embedded representation of the same face, smiling. Once we have identified such a vector, we can edit images by projecting them into the latent space, moving their representation in a meaningful way, then decoding back to image space. There are concept vectors for essentially any independent dimension of variation in image space.
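
The arithmetic $s(z) = z + s$ can be sketched with plain NumPy vectors. This is only an illustration: the function `apply_concept`, the `strength` scaling factor, and the toy 4-dimensional values are hypothetical. In practice $z$ would come from a trained encoder and the shifted point would be passed to a decoder.

```python
import numpy as np

def apply_concept(z, concept, strength=1.0):
    """Shift latent point z along a concept direction: z + strength * concept."""
    z = np.asarray(z, dtype=float)
    concept = np.asarray(concept, dtype=float)
    return z + strength * concept

# Toy latent point and a hypothetical "smile" direction (values are made up).
z = np.array([0.5, -1.0, 0.2, 0.0])
smile = np.array([0.0, 0.3, 0.0, -0.1])

z_smiling = apply_concept(z, smile)                # s(z) = z + s
z_half = apply_concept(z, smile, strength=0.5)     # a weaker version of the edit
```

The `strength` argument is an assumption added here to show that the same direction can be applied more or less strongly; with `strength=1.0` the function reduces to the formula in the text.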

#### Variational Autoencoders or VAEs
VAEs were proposed in 2013. They are a kind of generative model that is appropriate for image editing via concept vectors. They are a modern take on autoencoders - a type of network that aims to encode an input to a low-dimensional latent space and then decode it back - that mixes ideas from deep learning with Bayesian inference.

A classical image autoencoder takes an image, maps it to a latent vector space via an encoder module, and then decodes it back to an output with the same dimensions as the original image using a decoder module.

<img src="img84b.png" width="500">

A VAE turns the image into the <u>parameters of a statistical distribution</u>: a mean and variance. Essentially, this means we assume the input image is generated by a statistical process, and that the randomness of the process should be taken into account during encoding and decoding. The VAE then uses the distribution parameters to <u>randomly sample one element of the distribution</u> and decodes that element back to the original input. The stochasticity of this process improves robustness and forces the latent space to encode meaningful representations everywhere: every point sampled in the latent space is decoded to a valid output.

<img src="img8bi.png" width="500">

The steps of a VAE are as follows:
1. An encoder turns the input samples into two parameters in a latent space of representations, $\mu_z$ and $\log \sigma_z$
2. From the latent probability distribution, randomly sample a point that is assumed to generate the input image via $z' = \mu_z + \exp (\log \sigma_z) \cdot \epsilon = \mu_z + \sigma_z \cdot \epsilon$, where $\epsilon$ is a random tensor of small values, typically drawn from a standard normal distribution.
3. A decoder maps this point $z'$ in the latent space back to the original input image.
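
The sampling in step 2 can be sketched in NumPy using the usual reparameterised form $z' = \mu_z + \sigma_z \cdot \epsilon$. This is a minimal sketch under assumptions: $\epsilon$ is drawn from a standard normal distribution, and the function name `sample_latent` and the toy values are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_sigma, rng):
    """Draw z = mu + exp(log_sigma) * eps, with eps ~ N(0, I) (assumed)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(np.asarray(log_sigma, dtype=float))
    eps = rng.standard_normal(mu.shape)    # one noise value per latent dimension
    return mu + sigma * eps

rng = np.random.default_rng(0)             # fixed seed, only for repeatability
mu = np.array([1.0, -2.0])
log_sigma = np.array([-1.0, 0.0])          # i.e. sigma = [exp(-1), 1]

# Many draws for the same input: they scatter around mu with spread sigma.
samples = np.stack([sample_latent(mu, log_sigma, rng) for _ in range(20000)])
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)
```

Because the noise is scaled by $\sigma_z$ rather than added to it, the encoder controls both where the samples are centred and how widely they spread.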

Because $\epsilon$ is random, the process ensures that every point close to the latent location where you encoded the input can be decoded to something similar to the input. Any two close points in the latent space will decode to highly similar images. This keeps the latent space structured and makes it well suited to manipulation using concept vectors.

The parameters of a VAE are trained via two loss functions: a <b>reconstruction loss</b> that forces the decoded samples to match the inputs and a <b>regularisation loss</b> that helps learn well-formed latent spaces and reduces overfitting to the training data.

<hr>

From https://www.tensorflow.org/tutorials/generative/cvae:

In the image generation case, let $x$ and $z$ denote the image we observe and the latent variable respectively.

<b>Encoder</b> - The encoder network approximates the posterior distribution $q(z|x)$: given an input image $x$, it outputs a set of parameters of $q$ that define the conditional distribution of the latent representation $z$. In this example, we use a multivariate Gaussian for $q$, and the network outputs the mean $\mu_q$ and log variance $\log \sigma_q^2$ of a factorized Gaussian. The log variance is used instead of the variance for numerical stability.

<b>Decoder</b> - The decoder network defines the conditional distribution of the observation $p(x|z)$: it takes a sample $z$ in latent space and outputs the parameters of that distribution.

During training, we iterate through the dataset. In each iteration, we pass the image to the encoder to obtain the mean $\mu_q$ and log variance $\log \sigma_q^2$ of the posterior distribution $q$. Then, we apply the reparameterisation trick to sample from $q$, and finally pass the reparameterised samples to the decoder to obtain the logits of the generative distribution $p(x|z)$.

During sampling, we first draw a set of latent vectors from the prior distribution $p(z)$. The decoder (acting as the generator) then converts each latent sample $z$ to logits of the observation, giving a distribution $p(x|z)$.
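
The training-time and sampling-time passes described above can be sketched with toy NumPy functions. This is not the tutorial's code: the single linear layers, the 784-pixel input size, the unit Gaussian prior, and the sigmoid applied to the logits (a Bernoulli reading of the pixels) are all assumptions chosen to keep the example short, and the weights are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 784, 2              # e.g. flattened 28x28 images (assumed)

# Untrained toy weights; a real VAE would learn these.
W_enc = rng.normal(scale=0.01, size=(input_dim, 2 * latent_dim))
W_dec = rng.normal(scale=0.01, size=(latent_dim, input_dim))

def encode(x):
    """Return the mean and log variance of q(z|x), one row per image."""
    mean, logvar = np.split(x @ W_enc, 2, axis=1)
    return mean, logvar

def reparameterize(mean, logvar):
    """Sample z ~ q(z|x) as mean + sigma * eps, with sigma = exp(logvar / 2)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * logvar) * eps

def decode(z):
    """Return logits that parameterise p(x|z), one per pixel."""
    return z @ W_dec

# Training-time forward pass for a batch of 8 fake images.
x = rng.random((8, input_dim))
mean, logvar = encode(x)
logits = decode(reparameterize(mean, logvar))

# Generation: sample z from a unit Gaussian prior (assumed) and decode it.
z_prior = rng.standard_normal((4, latent_dim))
pixel_probs = 1.0 / (1.0 + np.exp(-decode(z_prior)))   # sigmoid of the logits
```

Note that the encoder is only used during training; generating new images needs just the prior sample and the decoder.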

<hr>

From HandsML:

The figure below shows the structure of a VAE: an encoder followed by a decoder.
<img src="img84c.png" width="600">

Instead of directly producing a coding for a given input, the encoder produces a <b>mean coding</b> $\mu$ and a <b>standard deviation</b> $\sigma$. The actual coding is then sampled from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, and the decoder decodes the sampled coding as usual.

The right part of the diagram shows a training instance going through the autoencoder. First, the encoder produces $\mu$ and $\sigma$, then a coding is sampled randomly, and finally this coding is decoded to obtain an output sample resembling the training instance.

The cost function has two parts. The first is the <b>reconstruction loss</b>, which pushes the autoencoder to reproduce its inputs. The second is the <b>latent loss</b>, which pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution. It is the KL divergence between the target distribution (the Gaussian) and the actual distribution of the codings. The latent loss can be computed as follows:

$$\mathcal L = -\frac 12 \sum_{i=1}^N \left[ 1+\log(\sigma_i^2) - \sigma_i^2 - \mu_i^2 \right]$$

where $\mathcal L$ is the latent loss, $N$ is the dimensionality of the latent space, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $i$th component of the latent space. The vectors $\vec \mu$ and $\vec \sigma$ are output by the encoder, as shown on the left side of the diagram above (Fig. 17-12 in HandsML).

Commonly, the encoder outputs $\gamma_i = \log (\sigma_i^2)$ rather than $\sigma_i$, and the latent loss becomes:

$$\mathcal L = -\frac 12 \sum_{i=1}^N \left[ 1+\gamma_i - \exp (\gamma_i) - \mu_i^2 \right]$$
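
As a quick check that the two forms of the latent loss agree, here is a minimal NumPy sketch. The function names and the toy values of $\mu$ and $\sigma$ are hypothetical; a real model would compute this per training instance from the encoder outputs.

```python
import numpy as np

def latent_loss_from_sigma(mu, sigma):
    """-1/2 * sum(1 + log(sigma^2) - sigma^2 - mu^2)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    return -0.5 * np.sum(1.0 + np.log(sigma**2) - sigma**2 - mu**2)

def latent_loss_from_gamma(mu, gamma):
    """Same loss written in terms of gamma = log(sigma^2)."""
    mu, gamma = np.asarray(mu, float), np.asarray(gamma, float)
    return -0.5 * np.sum(1.0 + gamma - np.exp(gamma) - mu**2)

mu = np.array([0.5, -1.0, 0.0])
sigma = np.array([1.0, 0.5, 2.0])
gamma = np.log(sigma**2)

loss_sigma = latent_loss_from_sigma(mu, sigma)    # 1.75 for these toy values
loss_gamma = latent_loss_from_gamma(mu, gamma)    # same value as loss_sigma

# With mu = 0 and sigma = 1 (gamma = 0) every term in the sum vanishes.
loss_standard = latent_loss_from_gamma(np.zeros(3), np.zeros(3))
```

The loss is zero exactly when the codings already match a standard Gaussian, and grows as $\mu_i$ moves away from 0 or $\sigma_i$ moves away from 1. Working with $\gamma_i$ also avoids taking the logarithm of a value that could be close to zero.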

<hr>