## Latent Diffusion

### Quick Recap on Diffusion Models

Diffusion models have a forward process, in which a small amount of noise is added to an image $\mathbf{x}_0$ over multiple steps. This process is written as:

$$q(\mathbf{x_{1:T}}|\mathbf{x_0}) = \prod_{t=1}^T q(\mathbf{x_{t}}|\mathbf{x_{t-1}})$$

To explain this process simply, say you start off with a picture of an aircraft, $\mathbf{x_0}$, at $\mathbf{t=0}$. In the forward process, noise sampled from a Gaussian distribution is added to $\mathbf{x_0}$ to form $\mathbf{x_1}$ at $\mathbf{t=1}$. Over a large number of iterations, the noise accumulates until the image is indistinguishable from a sample drawn from an [isotropic Gaussian distribution](https://math.stackexchange.com/questions/1991961/gaussian-distribution-is-isotropic). A variance schedule, $\beta_t$, decides how much noise is added at each step.
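To make the step-by-step noising concrete, here is a minimal NumPy sketch. It assumes the DDPM form of a single step, $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$, and a linear schedule from `1e-4` to `0.02` over 1000 steps. The function names and the schedule values are illustrative choices, not something fixed by the text above.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances beta_1..beta_T (assumed linear schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)

x = np.ones((3, 32, 32))  # stand-in for an image x_0 (channels, height, width)
for beta_t in betas:      # t = 1, ..., T
    x = forward_step(x, beta_t, rng)

# After all T steps, x should look like a standard Gaussian sample.
print(x.mean(), x.std())
```

With this schedule almost none of the original signal survives after the last step, so the printed mean should be close to 0 and the standard deviation close to 1.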

The backward (reverse) process is then defined as follows:

$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^T p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \quad p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t,t), \mathbf{\Sigma}_\theta(\mathbf{x}_t,t))$$

Here a model is trained to reverse the forward process, i.e., given a noisy image at step $\mathbf{t}$, predict the image that occurred at step $\mathbf{t-1}$. Once the model has converged, we can generate an image by starting from random noise and denoising it step by step.
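The denoising loop can be sketched as below. This is a sketch under assumptions rather than a definitive sampler: it uses the DDPM parameterization in which the model predicts the added noise, and it fixes the variance to $\sigma_t^2 = \beta_t$ instead of learning $\mathbf{\Sigma}_\theta$. `predict_noise` is a hypothetical callable standing in for the trained network.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Ancestral sampling: start from x_T ~ N(0, I) and denoise down to x_0.

    Assumes a fixed variance sigma_t^2 = beta_t rather than a learned Sigma_theta.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):  # t = T, T-1, ..., 1
        beta_t = betas[t - 1]
        eps_pred = predict_noise(x, t)
        # Mean of p_theta(x_{t-1} | x_t), written in terms of the predicted noise.
        mean = (x - beta_t / np.sqrt(1.0 - alpha_bars[t - 1]) * eps_pred) / np.sqrt(alphas[t - 1])
        # No fresh noise is added on the final step (t = 1).
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(beta_t) * z
    return x

# Sanity check with a hypothetical "oracle" that already knows the clean image.
betas = np.linspace(1e-4, 0.02, 200)
alpha_bars = np.cumprod(1.0 - betas)
x0_known = np.full((1, 4, 4), 0.5)

def oracle_noise(x_t, t):
    """Return the exact noise that would map x0_known to x_t."""
    abar = alpha_bars[t - 1]
    return (x_t - np.sqrt(abar) * x0_known) / np.sqrt(1.0 - abar)

sample = ddpm_sample(oracle_noise, x0_known.shape, betas, np.random.default_rng(0))
```

With the oracle in place of a network, the last step recovers `x0_known` up to floating-point error, which is a handy way to check that the update rule is wired correctly before plugging in a real model.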

### Training

To train the model, we take an image $\mathbf{x}_0$, a random noise sample $\epsilon$, and a time step $\mathbf{t}$. Using the properties of the forward process, we can compute the result of the $\mathbf{t}^{th}$ iteration directly, without simulating the intermediate steps:

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\, \mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \qquad \bar\alpha_t = \prod_{i=1}^{t} \alpha_i, \qquad \alpha_i = 1 - \beta_i$$
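The closed-form jump to step $t$ can be written in a few lines of NumPy. The sketch assumes $\alpha_i = 1 - \beta_i$ and a linear schedule for $\beta$; `q_sample` is an illustrative name, and the random array is only a stand-in for a real image.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, noise):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with t counted from 1."""
    abar_t = alpha_bars[t - 1]
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * noise

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # abar_t = alpha_1 * ... * alpha_t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in image scaled to [-1, 1]
eps = rng.standard_normal(x0.shape)

x_500 = q_sample(x0, 500, alpha_bars, eps)     # noisy image at t = 500 in one shot
```

Because `alpha_bars` is a cumulative product, it shrinks towards 0 as `t` grows, so early steps keep most of the image and late steps are almost pure noise.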

The model, a U-Net with the time step $\mathbf{t}$ embedded at several of its layers, is then trained so that, given a noisy image $\mathbf{x}_t$, it predicts the noise $\epsilon$ that was added to the original image. The predicted noise is denoted $\epsilon_\theta$, and the model is trained using the MSE $||\epsilon - \epsilon_\theta||_{2}^{2}$ as the loss function.
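One training-loss evaluation can be sketched as follows. `predict_noise` is a hypothetical callable standing in for the U-Net, the linear schedule is an assumption, and the sketch only computes the loss value (there is no gradient step, since NumPy has no autodiff). It averages the squared error over pixels, which differs from the squared L2 norm only by a constant factor.

```python
import numpy as np

def diffusion_loss(predict_noise, x0, alpha_bars, rng):
    """Noise-prediction MSE for one image at one randomly drawn time step."""
    num_steps = len(alpha_bars)
    t = int(rng.integers(1, num_steps + 1))  # uniform time step in 1..T
    eps = rng.standard_normal(x0.shape)      # the noise the model must recover
    abar_t = alpha_bars[t - 1]
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    eps_pred = predict_noise(x_t, t)
    return float(np.mean((eps - eps_pred) ** 2))

def zero_predictor(x_t, t):
    """Untrained stand-in model that always predicts 'no noise'."""
    return np.zeros_like(x_t)

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))

loss = diffusion_loss(zero_predictor, x0, alpha_bars, rng)
```

For the zero predictor the loss is just the mean of the squared noise, so it should come out close to 1. A trained model should score well below that baseline.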

## Why do we need Latent Diffusion Models?

DDPMs work well for small images, but the compute needed for training grows rapidly as the image size increases. For example, it took **8 V100 GPUs** over 11 hours to train the DDPM on the CIFAR dataset. Something more efficient is needed if models for higher-resolution images are to be trained on a single GPU. [Rombach et al.](https://github.com/CompVis/latent-diffusion) proposed Latent Diffusion Models (LDMs), in which the diffusion model is trained in the latent space of a pre-trained autoencoder instead of in pixel space, as is done in DDPMs. This helped them increase the quality of the images while reducing training costs.

Essentially, LDMs first train an autoencoder. The encoder takes an image and produces a latent-space representation of it, which the decoder then converts back to an image at the original resolution. The entire autoencoder is trained using an L1 or L2 loss.

Then a diffusion model is trained to take noise and produce latent-space representations, which the trained decoder then converts to an image at the original resolution.
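The two-stage flow, and the reason it is cheaper, can be sketched as follows. The numbers are illustrative assumptions (a 256x256 RGB image, a downsampling factor of 8, and 4 latent channels), and `identity_sampler` and `toy_decode` are hypothetical stand-ins for the trained diffusion model and decoder.

```python
import numpy as np

def latent_shape(image_shape, downsample_factor, latent_channels):
    """Shape of the latent for a (channels, height, width) image."""
    _, height, width = image_shape
    return (latent_channels, height // downsample_factor, width // downsample_factor)

def generate(sample_latent, decode, z_shape, rng):
    """Diffusion runs entirely in latent space; the decoder maps back to pixels."""
    z_T = rng.standard_normal(z_shape)  # start from latent-space noise
    z_0 = sample_latent(z_T)            # reverse diffusion on the latent
    return decode(z_0)                  # decoder restores the original resolution

image_shape = (3, 256, 256)
z_shape = latent_shape(image_shape, downsample_factor=8, latent_channels=4)
ratio = np.prod(image_shape) / np.prod(z_shape)  # how many times fewer values to denoise

def identity_sampler(z):
    """Stand-in for the trained latent diffusion sampler."""
    return z

def toy_decode(z, factor=8):
    """Stand-in decoder: nearest-neighbour upsampling of the first 3 channels."""
    return z[:3].repeat(factor, axis=1).repeat(factor, axis=2)

image = generate(identity_sampler, toy_decode, z_shape, np.random.default_rng(0))
print(z_shape, ratio, image.shape)  # (4, 32, 32) 48.0 (3, 256, 256)
```

Under these assumed sizes the diffusion model works on 48 times fewer values per image than it would in pixel space, which is where the training savings come from.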

In a VAE, the encoder, given an input image, generates the mean and variance of the encoder distribution of the input image in the latent space. The decoder then receives a sample from this distribution and is responsible for reconstructing the original image. The entire VAE is trained using a combination of a reconstruction loss and a KL divergence loss, which keeps the encoder distribution as close as possible to the standard normal prior so that transitions in the latent space are smooth.
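The VAE objective can be sketched in NumPy as below. It assumes a diagonal Gaussian encoder that outputs a mean and a log-variance (a common parameterization, not something stated above), a standard normal prior, and a squared-error reconstruction term. `kl_weight` is a hypothetical knob for balancing the two terms.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps so that sampling stays differentiable in mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

def vae_loss(x, x_recon, mu, log_var, kl_weight=1.0):
    """Reconstruction error plus a weighted KL penalty."""
    recon = float(np.sum((x - x_recon) ** 2))
    return recon + kl_weight * kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)  # encoder output that matches the prior exactly
z = reparameterize(mu, log_var, rng)    # the sample handed to the decoder
```

When the encoder output matches the prior (`mu = 0`, `log_var = 0`) the KL term is zero. It grows as the encoder distribution drifts away from the prior, which is what keeps the latent space well behaved.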

When a VAE is trained using only an L1/L2 loss, the reconstruction matches the content of the original image but is blurry and of lower quality, because high-frequency information is lost during reconstruction. Even so, the MSE between the two images is very low, since their content is otherwise the same and the only difference is the blur. To address the inability of L2 (MSE) to capture the perceptual difference between images, a **Perceptual Loss** term had to be added.
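The shape of such a combined loss can be sketched as follows. In practice the features come from a pretrained network. Here `gradient_features` is only a toy stand-in (finite-difference image gradients) so that the example runs on its own, and `perceptual_weight` and `blur_horizontal` are likewise hypothetical.

```python
import numpy as np

def gradient_features(img):
    """Toy stand-in for a pretrained feature extractor: horizontal and vertical differences."""
    dx = img[:, :, 1:] - img[:, :, :-1]
    dy = img[:, 1:, :] - img[:, :-1, :]
    return [dx, dy]

def perceptual_loss(x, x_recon, extract_features=gradient_features):
    """Mean squared distance between the feature maps of the two images."""
    feats_x = extract_features(x)
    feats_r = extract_features(x_recon)
    return float(sum(np.mean((a - b) ** 2) for a, b in zip(feats_x, feats_r)))

def reconstruction_loss(x, x_recon, perceptual_weight=1.0):
    """Pixel-space MSE plus a weighted perceptual term."""
    pixel_mse = float(np.mean((x - x_recon) ** 2))
    return pixel_mse + perceptual_weight * perceptual_loss(x, x_recon)

def blur_horizontal(img):
    """3-tap horizontal mean filter with edge padding, used to fake a blurry reconstruction."""
    padded = np.pad(img, ((0, 0), (0, 0), (1, 1)), mode="edge")
    return (padded[:, :, :-2] + padded[:, :, 1:-1] + padded[:, :, 2:]) / 3.0

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(3, 16, 16))
x_blurry = blur_horizontal(x)

print(np.mean((x - x_blurry) ** 2), perceptual_loss(x, x_blurry))
```

The perceptual term compares images in feature space rather than pixel by pixel, so it penalizes the loss of edges and texture that a plain MSE tends to tolerate.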