# A gentle introduction to Stable Diffusion: Part 1 - Introduction to Latent Diffusion Models

Hello and welcome to this explainer for Stable Diffusion, aimed at a non-technical audience. This will be a quick rundown of how *Latent Diffusion Models* (LDMs) work, focusing on Stable Diffusion v1.4, as it is probably the most intuitive version to explain.

This explainer will be split into multiple parts - first, we will give a high level overview of the model cascade itself, then we'll go into each component of the cascade, and finally, we'll put the model cascade together at the end!

This is the first section out of five: a high level explanation of the model cascade. The other sections are accessible [here](https://research.qut.edu.au/genailab/projects/unboxing-genai/).

1. Introduction to Latent Diffusion Models
2. The CLIP text embedding model
3. Variational Autoencoders for image compression
4. Convolutional UNet denoiser
5. Conclusion - putting it all together



## Some (brief) history

<font color='red'>AJS: Describe the big 3-5 moments in the history of LDMs, leading up to and after SD 1.4. Mention the key papers and which labs produced the results. E.g. Stable diffusion was a model first released in XXX... by YYY... Link to the latest version of Stable Diffusion, or some of the public and popular products and services that are now built on this tech.</font>

## Some terminology

So what is a Latent Diffusion Model, anyway?

<font color='red'>Latent Diffusion models are a type of...</font>

<font color='red'>Diffusion models refer to systems that... The term 'diffusion' comes from the field of thermodynamics in physics, where this type of statistical model was first developed.</font>

<font color='red'>The 'latent' in 'Latent Diffusion Model' refers to the fact that the diffusion process works with a compressed or 'latent' version of the image, instead of the raw image pixels. This is largely done for XXX reasons.</font>

Putting this together, an LDM system like Stable Diffusion works, at a high level, by starting with a completely random 'noise' image and then incrementally generating slightly more visually coherent images from that noise. This is done by 'denoising' and then 're-noising' the image many times in sequence. Each time the image is denoised, the result keeps a little more detail than it did last time, and a little less noise is added back before the next loop around.
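
For readers who like to see the loop written down, here is a minimal toy sketch in Python. It is not how Stable Diffusion is implemented: the "image" is just a short list of numbers, and `toy_denoise` is a hypothetical stand-in that cheats by peeking at the finished picture (`target`), whereas a real denoiser has to rely on patterns it learned during training. The function names and the noise schedule are assumptions made up for this illustration.

```python
import random

def toy_denoise(image, target, strength=0.5):
    """Stand-in denoiser: moves every value part of the way toward the target.
    (A real model never sees the target; it predicts what to remove.)"""
    return [v + strength * (t - v) for v, t in zip(image, target)]

def add_noise(image, amount, rng):
    """Re-noise the image with Gaussian noise of the given size."""
    return [v + rng.gauss(0.0, amount) for v in image]

def generate(target, steps=10, seed=0):
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    snapshots = [image]
    for step in range(steps):
        image = toy_denoise(image, target)
        # Re-noise a little less each loop; the final loop adds no noise at all.
        noise_amount = 0.3 * (1 - (step + 1) / steps)
        image = add_noise(image, noise_amount, rng)
        snapshots.append(image)
    return snapshots

def mean_abs_error(image, target):
    """Average distance between the current image and the finished picture."""
    return sum(abs(v - t) for v, t in zip(image, target)) / len(target)

target = [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25]
snapshots = generate(target)
for step, image in enumerate(snapshots):
    print(step, round(mean_abs_error(image, target), 3))
```

Running this prints how far the image is from the finished picture after each loop. The distance does not have to fall on every single step, because noise is added back each time, but it ends up far smaller than where it started.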

*INSERT A SEQUENCE OF DENOISED AND RENOISED IMAGES*

Text-to-image models build on this approach by augmenting the denoising model so that the denoising process can be "guided" by the text instructions, leaving behind specific details in the structure once the noise is removed.

In a sense, you can think of LDMs as sculptors more so than illustrators - they take a noisy block of raw stone, and carve out their image, rather than taking a blank canvas and painting an image from scratch.


## Part 1: intro to the architecture

### The model cascade

Latent Diffusion Models are actually a "cascade" of models.

<font color='red'>'Cascade' here is a technical term that refers to the fact that...</font>

What that means is that they are actually made up of several smaller models that work together to create one big model. For latent diffusion models, the high level logic of the model cascade is as follows:

1. **Text embedder:** this model takes the text prompt and turns it into a series of numbers that the computer can understand, which are passed to the other models down the line.
   
2. **Image compressor:** this is a deep learning model which specialises in compressing data into a smaller size, while preserving as much information as possible. You can imagine this to be like an image compression algorithm such as JPEG, but far more efficient at storing certain kinds of data (specifically, the kinds of images that are used in the training process). This step is important because the compressed image representation is smaller than the raw image, so all of the other steps in the model cascade can be smaller too, and the whole system needs much less computational power.

3. **Denoising model:** this is the core of the model - essentially, it takes the compressed image from step 2, and denoises it while leaving behind structure that resembles the text input.
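
To make the flow of data between the three steps concrete, here is a rough Python sketch. Every function in it (`embed_text`, `denoise`, `decode`, `text_to_image`) is a hypothetical toy stand-in invented for this illustration, not part of any real library, and none of them does anything clever. It also assumes that, when generating a new picture, the starting point is a random compressed image, and that the compressor is then used in reverse to expand the denoised result back into pixels.

```python
import random

def embed_text(prompt):
    """Toy text embedder: one number between 0 and 1 per word."""
    return [sum(ord(c) for c in word) % 100 / 100 for word in prompt.lower().split()]

def denoise(latent, text_embedding, steps=20):
    """Toy denoiser: repeatedly pulls the compressed image toward a value
    derived from the text, standing in for text 'guidance'."""
    guide = sum(text_embedding) / len(text_embedding)
    for _ in range(steps):
        latent = [v + 0.3 * (guide - v) for v in latent]
    return latent

def decode(latent, scale=8):
    """Toy decompressor: expands each compressed value into `scale` pixels."""
    return [v for v in latent for _ in range(scale)]

def text_to_image(prompt, latent_size=4, seed=0):
    rng = random.Random(seed)
    embedding = embed_text(prompt)                              # 1. text -> numbers
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]  # random compressed image
    latent = denoise(latent, embedding)                         # 3. denoise, guided by text
    return decode(latent)                                       # 2. (in reverse) expand to pixels

pixels = text_to_image("a photo of a dog")
print(len(pixels))
```

The point to notice is that the denoiser only ever works on the small compressed list (4 values here), while the final output is larger (32 values).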

This high-level logic broadly applies to many text-to-image models, and systems such as DALL-E and Imagen use related ideas. What often differs is the models themselves, and how they denoise, compress, or embed their data. This logic also generalises to other applications of generative AI: text-to-audio, text-to-video, even some text-to-text generation models! Hopefully, then, this set of explainers will help you understand the high level logic behind a lot of generative AI models as well.

In the case of Stable Diffusion v1.4, the model cascade is as follows (don't worry if it's a little hard to follow at the moment - we'll go into each model in their own section):

<font color='red'>Link the models here - e.g. to the paper that first introduced that specific model</font>
1. **Text Embedder: CLIP ViT-L/14 -** This is an open-source text-embedder and image-embedder model, which was trained to match images with the text that describes them. That makes it well suited to searching for images with text inputs. Essentially, if you type the word "dog" into a CLIP-powered search engine, this model lets it return a picture of a dog.

2. **Image compressor: Variational Auto-Encoder (VAE) -** In Stable Diffusion v1.4, this model takes 512x512 resolution images and compresses them into 64x64 "images". Each side becomes 8x shorter, so you can think of this model as making images **64x** smaller (counted by pixel positions), without losing much visual quality.

3. **Denoising model: UNet with Conditional Generation -** This is a type of model called a 'UNet' (the name comes from the shape of the model architecture - it looks like the letter 'U'!), with a few layers added in for "guidance" from the text embedding. Some newer Stable Diffusion models use more sophisticated architectures to denoise their compressed images, but this is probably the easiest to intuitively understand and explain!
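
The size saving from the image compressor can be checked with a little arithmetic. The sketch below assumes a standard colour image with 3 channels (red, green, blue) and a compressed "image" with 4 channels, which is how the released Stable Diffusion v1 models are configured. The channel counts are not stated above, so treat them as an assumption.

```python
# Sizes for Stable Diffusion v1.4 (channel counts are an assumption, see above).
image_height, image_width, image_channels = 512, 512, 3
latent_height, latent_width, latent_channels = 64, 64, 4

# How many times fewer pixel positions the compressed image has.
spatial_ratio = (image_height * image_width) // (latent_height * latent_width)

# How many times fewer numbers need to be stored in total.
image_values = image_height * image_width * image_channels
latent_values = latent_height * latent_width * latent_channels
value_ratio = image_values / latent_values

print(spatial_ratio)  # 64
print(value_ratio)    # 48.0
```

So the compressed image has 64x fewer pixel positions, and, once the extra channel is counted, 48x fewer numbers overall for the denoiser to work with.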

**INSERT PICTURE OF SD 1.4 MODEL CASCADE WITH EACH COMPONENT POINTED OUT**

Now that we've done a high level overview, let's start with our first deep dive into one of the models - the text embedder. *make this a link*