# Lab: Diffusion Models with 🧨 Diffusers

**Author:**
- baptiste.engel@cea.fr

If you have questions or suggestions, contact us and we will gladly answer and take into account your remarks.

**Acknowledgement**  
This lab is partially inspired by the 🧨 Diffusers [stable_diffusion.ipynb](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb#scrollTo=-xMJ6LaET6dT) notebook. Feel free to explore that notebook as well!

## 0 - Introduction


Stable Diffusion is an open-source diffusion model trained on [LAION-5B](https://laion.ai/blog/laion-5b/), a dataset of 5 billion image-text pairs.

This lab guides you through the implementation of your own local Stable Diffusion using the 🧨 Diffusers library from 🤗 Hugging Face. You will first learn how to set up a complete pipeline to perform inference on a pre-trained Stable Diffusion model, then fine-tune it on your own data with Low-Rank Adaptation (LoRA).

![baguette_cat](https://raw.githubusercontent.com/engelba/generative/main/tp/assets/baguette_cat.jpg)  
*The baguette cat from LAION*


You are encouraged to experiment: play with the parameters of the provided functions and explore the possibilities of diffusion models to better understand how they work.  

Note that training a Stable Diffusion model *from scratch* takes around 25 days of compute on 256 A100 GPUs (~20k € per unit). The market cost of such a training run is around 600k €, so you will not do it yourself in this lab.
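
As a quick sanity check of these orders of magnitude, the figures quoted above can be turned into GPU-hours and an implied hourly price. This is only back-of-the-envelope arithmetic on the numbers from the text, not an actual cloud quote.

```python
# Back-of-the-envelope check using the figures quoted in the text.
days = 25
n_gpus = 256
total_cost_eur = 600_000  # market cost quoted in the text

gpu_hours = days * 24 * n_gpus
cost_per_gpu_hour = total_cost_eur / gpu_hours

print(f"GPU-hours: {gpu_hours:,}")
print(f"Implied price: {cost_per_gpu_hour:.2f} EUR per GPU-hour")
```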

### Free Choice

Stable Diffusion is only one of the many models that come with 🧨 Diffusers. For example, if you would rather implement a text-to-music model, you can explore [MusicLDM](https://huggingface.co/docs/diffusers/api/pipelines/musicldm).

### Setup

This lab is easier to run in a Colab environment. Ensure that a GPU runtime is active (indicated by 'T4' near RAM and Disk at the top right of the notebook).

CUDA is an interface that enables general-purpose computing on NVIDIA GPUs. Check that it is available by running the next cell.

In [None]:
import torch

torch.cuda.is_available()

Let's define a helper function that displays a grid of images. It will be useful later.

In [None]:
from PIL import Image

def image_grid(imgs, rows, cols):
    """
    From https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb
    """
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
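
To see how `image_grid` places each image, the `box` arithmetic can be isolated in a few lines of plain Python. The helper name `grid_boxes` and the 512x512 image size are hypothetical and only serve the illustration: the column index is `i % cols` and the row index is `i // cols`.

```python
def grid_boxes(n_images, cols, w, h):
    """Top-left corner of each image, in the order image_grid pastes them."""
    return [((i % cols) * w, (i // cols) * h) for i in range(n_images)]

# Hypothetical example: 6 images of 512x512 pixels laid out on 2 rows x 3 columns
boxes = grid_boxes(n_images=6, cols=3, w=512, h=512)
for i, box in enumerate(boxes):
    print(i, box)
```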

Now, install the required packages by running the next cell:


In [None]:
!pip install diffusers==0.23.0
!pip install diffusers[training]==0.23.0
!pip install transformers scipy ftfy accelerate

### Stable Diffusion

Stable Diffusion (SD) is a text-to-image diffusion model: given a textual prompt, it generates an image that matches it.

The architecture of Stable Diffusion is described in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752). Specifically, SD is a latent diffusion model: the denoising diffusion process is performed in the latent space of an autoencoder. For noise estimation, the model uses an improved [UNet](https://arxiv.org/abs/1505.04597).


![Stable Diffusion](https://miro.medium.com/v2/resize:fit:4800/format:webp/0*rW_y1kjruoT9BSO0.png)

SD is thus composed of 3 essential components:

**1) The autoencoder (VAE)**

The autoencoder is used to encode the image in a latent space, so that the denoising diffusion process requires less compute, even for large image generation. It is trained beforehand, and frozen during training of the diffusion model.

**2) The UNet**

The UNet in SD is composed of ResNet blocks. Shortcut connections are also made between the encoder and the decoder blocks, to minimize the information loss.

Furthermore, to condition the denoising process, cross-attention blocks are added in both the encoder and the decoder, usually after the ResNet blocks.

**3) The text encoder**

The role of the text encoder is to transform the prompt into embeddings that can be processed by the UNet.

Text embedding is a complex task, and SD delegates it to a frozen [CLIP](https://github.com/openai/CLIP) ViT-L/14.
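
To make the compute saving from the autoencoder more concrete, here is a small arithmetic sketch. It assumes a spatial downsampling factor of 8 and 4 latent channels, which are the values commonly reported for Stable Diffusion v1 but are not stated in this lab, so treat them as assumptions.

```python
def n_values(height, width, channels):
    """Number of scalar values in a (channels, height, width) tensor."""
    return height * width * channels

image_size = 512
vae_scale_factor = 8   # assumption: spatial downsampling factor of the SD v1 VAE
latent_channels = 4    # assumption: number of latent channels in SD v1

pixels = n_values(image_size, image_size, 3)
latent_size = image_size // vae_scale_factor
latents = n_values(latent_size, latent_size, latent_channels)
ratio = pixels / latents

print(f"Image tensor:  3 x {image_size} x {image_size} = {pixels} values")
print(f"Latent tensor: {latent_channels} x {latent_size} x {latent_size} = {latents} values")
print(f"The UNet processes {ratio:.0f}x fewer values in latent space")
```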



### 🧨 Diffusers


🧨 [Diffusers](https://huggingface.co/docs/diffusers/index) is a modular library for performing inference with, or even training, state-of-the-art diffusion models. It comes with out-of-the-box diffusion models but also lets you split the models into their smaller components.

It implements many diffusion models, and can easily be used to perform inference on open-source diffusion models. It also allows you to implement and test your own research.


The documentation can be found [here](https://huggingface.co/docs/diffusers/index).

## 1 - Stable Diffusion Implementation with diffusers


We will begin by experimenting with the complete Stable Diffusion 1.5 pipeline. You can start by running the following cell:

In [None]:
from diffusers import StableDiffusionPipeline

repo_id = "runwayml/stable-diffusion-v1-5"
pipeline = StableDiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)

The `pipeline` object bundles all the components of the diffusion model you just loaded, and handles most of the work for you.

Use the [reference documentation](https://huggingface.co/docs/diffusers/v0.23.0/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) to display the different components of the pipeline.

In [None]:
# CODE: Print the VAE configuration:

# CODE: Print the UNet configuration

# CODE: Print the text encoder configuration:


Now that your model is loaded, you can use it to generate images. Remember to move your pipeline to the CUDA device to benefit from hardware acceleration!

When running the freshly downloaded pipeline, you will notice that **only 50 steps** of denoising are performed. This is because StableDiffusionPipeline uses the [PNDMScheduler](https://huggingface.co/docs/diffusers/v0.23.1/en/api/schedulers/pndm#diffusers.PNDMScheduler), a more efficient sampling method than standard DDPM sampling, which typically needs on the order of 1000 steps.

If you want inspiration for prompts, you can check https://stablediffusion.fr/prompts

In [None]:
# CODE: Move the pipe to the CUDA device

# CODE: Choose a prompt and run the pipeline

# CODE: Display the generated image


You can also experiment with various parameters of the pipeline:
- `height` and `width` allow you to control the size of the generated image.
- `num_inference_steps` is the number of denoising steps to generate the image. Try varying it between 1 and 1000. What trade-offs do you observe?
- `guidance_scale` controls how strongly the prompt should be followed (classifier-free guidance). Experiment with a guidance scale of 0, and then with a high value. What happens?
- `negative_prompt` is a prompt that instructs the model on what not to do. Try generating a scene with people in it. Then, using the same prompt and seed, alter the scene with negative prompting.


You can find all the parameters in the StableDiffusionPipeline [documentation](https://huggingface.co/docs/diffusers/v0.23.1/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline).
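
To build intuition for `guidance_scale`, here is a minimal sketch of the classifier-free guidance combination on plain Python lists. The function name `apply_guidance` and the toy values are hypothetical; the real pipeline applies the same kind of formula to the UNet noise predictions obtained with and without the prompt.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    towards (and beyond) the text-conditioned prediction."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

# Toy noise predictions (hypothetical values, three "pixels" only)
noise_uncond = [0.10, -0.20, 0.30]
noise_text = [0.30, -0.10, 0.00]

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

With a scale of 0 the formula returns the unconditional prediction, with 1 the text-conditioned one, and larger values extrapolate further in the direction of the prompt. Note that the pipeline appears to skip guidance entirely when `guidance_scale` is 1 or less, so treat the small-scale rows as an illustration of the formula only.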

Tip: to compare your results, pass a torch `Generator` to your pipeline (`generator` argument). Initialize it with `torch.Generator("cuda").manual_seed([your_seed])`. Save the initial state with `generator.get_state()`, and restore this initial state before each generation with the `set_state` method.

In [None]:
# Create a generator and save the initial state
generator = torch.Generator("cuda").manual_seed(1923936260)
state = generator.get_state()

# After each generation, reinitialize the generator to its initial state
generator.set_state(state)

Experiment with height and width:

In [None]:
# CODE: Create a prompt


# CODE: Experiment with the height and width parameters.



Experiment with num_inference_steps:

In [None]:
# CODE: Create a prompt


# CODE: Experiment with the num_inference_steps parameter.



Experiment with guidance_scale. Generate several images for each value of the guidance scale, with different seeds. What happens?

In [None]:
# CODE: Create a prompt


# Experiment with the guidance_scale parameter:
# CODE: Create two images with a low guidance scale without restarting the generator

# CODE: Display both images using image_grid


# CODE: Create two images with a strong guidance scale without restarting the generator


# CODE: Display both images using image_grid



Experiment with negative_prompt:


In [None]:
# CODE: Create a prompt


# Experiment with the negative_prompt parameter:
# CODE: Create a first image without any negative prompt


# CODE: Create a negative prompt, and generate an image with your original prompt and the negative prompt


# CODE: Display both images using image_grid



## 2 - Implement your own diffusion model





So far, you have only experimented with the pretrained weights of Stable Diffusion. For real-world applications, however, you may prefer a model trained on your own data.

Several possibilities exist:
- You can fine-tune Stable Diffusion 1.5 weights using [Low-Rank Adaptation](https://arxiv.org/abs/2106.09685) (Have a look [here](https://www.youtube.com/watch?v=70H03cv57-o) if you want to try that)
- You can use the [ControlNet](https://arxiv.org/abs/2302.05543) architecture to control the generation with other modalities than just text or images. ControlNet prevents the [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference) that may happen when fine-tuning a deep learning model.
- Or you can train a diffusion model from scratch! This is not really recommended, as diffusion models need a lot of data and require expensive compute. However, it is interesting to see how it can be done in practice, so we'll go for it.

You have seen previously how to perform inference on a latent diffusion model. Here, you will implement a more standard diffusion model, with no conditioning. The purpose of this model is to generate new samples that resemble your target dataset.


### Dataset

Diffusion requires data. You can choose a dataset of your choice, but ensure that there are at least 1000 samples.

You can check https://huggingface.co/huggan to take a dataset of your choice, e.g. [Smithsonian Butterflies Dataset](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) or [Pokémon Image Dataset](https://huggingface.co/datasets/huggan/pokemon).

First, start by loading your dataset using 🤗 [Datasets](https://huggingface.co/docs/datasets):

In [None]:
from datasets import load_dataset

# Choose the dataset you want to use (swap the commented line to change it)
dataset_name = "huggan/smithsonian_butterflies_subset"
# dataset_name = "huggan/pokemon"

# Load the dataset
dataset = load_dataset(dataset_name, split="train")

When working with data, it is crucial to take a look at your data. Use the provided snippet to display a grid.

In [None]:
import matplotlib.pyplot as plt

random_indices = torch.randint(0, len(dataset), (4,))

def show_random_image(dataset):
    fig, axs = plt.subplots(1, 4, figsize=(16, 4))
    for i, image in enumerate(dataset[random_indices]["image"]):
        axs[i].imshow(image)
        axs[i].set_axis_off()
    fig.show()

show_random_image(dataset)

You may notice that the images have different sizes. This is a common problem, but it can easily be solved by applying transformations to your data. You will also add a conventional data augmentation method to artificially enlarge your dataset.

Using the `Compose` class from `torchvision.transforms` package, create a composition with the transformations:
- Resize images to size 128x128
- Apply a RandomHorizontalFlip
- Convert the images to tensors
- Normalize the images to be in the range [-1, 1]
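
The last step can be checked by hand. Assuming the tensor conversion yields values in [0, 1] and the normalization uses a mean and standard deviation of 0.5 per channel (one common choice, labeled here as an assumption), the mapping and its inverse look like this in plain Python. The inverse is the `*0.5+0.5` used in the visualization cells of this lab.

```python
def normalize(x, mean=0.5, std=0.5):
    """Map a value from [0, 1] to [-1, 1] (assumed mean/std of 0.5)."""
    return (x - mean) / std

def denormalize(y, mean=0.5, std=0.5):
    """Inverse mapping, used before displaying an image."""
    return y * std + mean

for x in (0.0, 0.5, 1.0):
    y = normalize(x)
    print(x, "->", y, "->", denormalize(y))
```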

In [None]:
# CODE: Import the transform module from torchvision


# CODE: Create an object named `preprocess` of class torchvision.transforms.Compose with the proper transformations.



In [None]:
# Here, we define a function that the dataset object will use to transform the images
def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

In [None]:
# CODE: Use the `dataset`'s method `set_transform` to apply the preprocessing function to the images when needed (`on the fly`)



In [None]:
# Visualize how the images have been modified
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for i, image in enumerate(dataset[random_indices]["images"]):
    axs[i].imshow(image.cpu().detach().permute(1,2,0)*0.5+0.5)
    axs[i].set_axis_off()
fig.show()

Finally, create your torch DataLoader for efficient training. You can use a batch size of 8 and activate shuffling.

In [None]:
# CODE: Initialize the batch_size to 8


# CODE: Create a train_dataloader with batch size 8 and shuffling activated.



### Create the model

For this (small) diffusion model, you are going to use a UNet similar to the one in Stable Diffusion, but at a smaller scale. The Diffusers library provides a UNet implementation with the class [UNet2DModel](https://huggingface.co/docs/diffusers/api/models/unet2d).

The goal of the UNet is to estimate the noise in an image. Since you are working directly in image (pixel) space, it takes a noisy image as input and outputs a tensor of the same dimensions containing the estimated noise value for each channel of each pixel.

The structure of a UNet is as follows:

![UNet](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/unet-model.png)

Initialize a UNet2DModel with the following configuration:
- Set the sample size to match the size of your dataset's images.
- Configure 3 input channels and 3 output channels, as we are working with RGB images.
- Use 2 layers per block, resulting in 2 ResNet layers for each UNet block.
- Set the block out channels as (128, 128, 256, 256, 512, 512).
- Use 5 "DownBlock2D", the standard ResNet downsampling block. Insert an "AttnDownBlock2D" before the last "DownBlock2D".
- Implement the symmetrical structure for the up part, using "UpBlock2D" and "AttnUpBlock2D".
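
Before writing the configuration, it can help to lay out the down path on paper. The sketch below is plain Python and does not call Diffusers. It assumes that every down block except the last one halves the spatial resolution, which is our understanding of how `UNet2DModel` adds downsamplers; the helper name `down_path` is hypothetical.

```python
sample_size = 128
block_out_channels = (128, 128, 256, 256, 512, 512)
down_block_types = (
    "DownBlock2D", "DownBlock2D", "DownBlock2D", "DownBlock2D",
    "AttnDownBlock2D",  # attention block inserted before the last DownBlock2D
    "DownBlock2D",
)
# Symmetrical up path: reverse the order and swap "Down" for "Up"
up_block_types = tuple(b.replace("Down", "Up") for b in reversed(down_block_types))

def down_path(sample_size, block_out_channels, down_block_types):
    """Return (block, channels, input size, output size) for each down block.
    Assumption: every block except the last halves the resolution."""
    rows, size = [], sample_size
    last = len(down_block_types) - 1
    for i, (block, channels) in enumerate(zip(down_block_types, block_out_channels)):
        out_size = size if i == last else size // 2
        rows.append((block, channels, size, out_size))
        size = out_size
    return rows

rows = down_path(sample_size, block_out_channels, down_block_types)
for block, channels, size_in, size_out in rows:
    print(f"{block:16s} channels={channels:3d}  {size_in:3d} -> {size_out:3d}")
print(up_block_types)
```

Under this assumption, the attention block works on small feature maps, where self-attention is much cheaper than at full resolution.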

In [None]:
# CODE: Import UNet2DModel from the diffusers library


# CODE: Initialize your model according to the guidelines.



Before executing the complete training loop, ensure that the input and output sizes of your model are the same:

In [None]:
sample_image = dataset[0]["images"].unsqueeze(0)
print("Input shape:", sample_image.shape)

print("Output shape:", model(sample_image, timestep=0).sample.shape)

### Initialize your scheduler

The [noise scheduler](https://huggingface.co/docs/diffusers/v0.23.1/en/api/schedulers/ddpm#diffusers.DDPMScheduler) handles the addition of noise during training, and the generation process during inference.

Create a DDPMScheduler with 1000 timesteps.

In [None]:
# CODE: Import the DDPMScheduler


# CODE: Initialize your noise scheduler



You can run the next cell to visualize how the noise degrades the data:

In [None]:
def show_noise_impact(dataset, noise_scheduler):

    random_idx = torch.randint(0, len(dataset), (1,))

    # Sample an image from the dataset
    sample_image = dataset[random_idx]["images"][0]

    # Sample random noise
    noise = torch.randn(sample_image.shape)

    # Select timesteps
    timesteps = [50, 100, 500, 999]

    fig = plt.figure(figsize=(16, 4))
    for i, t in enumerate(timesteps):
        # Add noise to the image using the scheduler
        noisy_image = noise_scheduler.add_noise(sample_image, noise, torch.tensor(t))
        plt.subplot(1, len(timesteps), i+1)
        plt.title(f"t={t}")
        plt.imshow(noisy_image.permute(1,2,0)*0.5+0.5)
        plt.axis("off")
    plt.show()
show_noise_impact(dataset, noise_scheduler)
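
To understand why the image is almost pure noise at t=999, recall the DDPM forward process: a noisy sample is `sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise`, where `abar_t` is the cumulative product of `1 - beta`. The sketch below recomputes these weights in plain Python, assuming a linear beta schedule from 1e-4 to 0.02 over 1000 steps. These are believed to be the `DDPMScheduler` defaults, but treat them as an assumption and check your scheduler's configuration.

```python
import math

def alphas_cumprod(num_train_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed values)."""
    step = (beta_end - beta_start) / (num_train_timesteps - 1)
    values, prod = [], 1.0
    for i in range(num_train_timesteps):
        beta = beta_start + i * step
        prod *= 1.0 - beta
        values.append(prod)
    return values

abar = alphas_cumprod()
for t in (50, 100, 500, 999):
    signal_weight = math.sqrt(abar[t])       # weight of the clean image
    noise_weight = math.sqrt(1.0 - abar[t])  # weight of the Gaussian noise
    print(f"t={t:4d}  signal={signal_weight:.3f}  noise={noise_weight:.3f}")
```

The signal weight shrinks steadily while the noise weight approaches 1, which matches what the grid of noisy images shows.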

### Train your model!

You are now only a few steps away from training your own diffusion model.

Remember, the steps to prepare the training of a model with PyTorch are almost always the same:

- Choose your training parameters. You can use
- Define your loss function. Diffusion models are trained to estimate the noise added to an image. Thus, a simple mean squared error loss can be used!
- Define your optimizer. You can use AdamW, with a learning rate of 1e-4.
- You can also define a learning rate scheduler. We recommend a cosine schedule with warmup (`get_cosine_schedule_with_warmup` from diffusers; PyTorch's [CosineAnnealingWarmRestarts](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.html) is a related but different scheduler), with 500 learning rate warmup steps and `len(train_dataloader) * n_epochs` as `num_training_steps`.
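
To visualize what a cosine schedule with warmup does to the learning rate, here is a plain-Python sketch of the multiplier it applies at each step: a linear ramp from 0 to 1 during warmup, then half a cosine period down to 0. It mirrors our understanding of `get_cosine_schedule_with_warmup`, and the total of 5000 training steps is a hypothetical value chosen for the example.

```python
import math

def cosine_with_warmup(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Learning rate multiplier: linear warmup, then cosine decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

base_lr = 1e-4
num_warmup_steps = 500
num_training_steps = 5000  # hypothetical: stands for len(train_dataloader) * n_epochs

for step in (0, 250, 500, 2750, 5000):
    multiplier = cosine_with_warmup(step, num_warmup_steps, num_training_steps)
    print(f"step={step:5d}  lr={base_lr * multiplier:.2e}")
```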

In [None]:
# CODE: Import everything you need: AdamW and MSELoss from PyTorch, and get_cosine_schedule_with_warmup from diffusers.


# CODE: Define the useful variables (e.g. n_epochs, learning rate, warmup steps)


# CODE: Instantiate the loss function


# CODE: Define your optimizer


# CODE: Create the learning rate scheduler



You can then write your training loop. Iterate over your dataset `n_epochs` times. At each step:
- Start by clearing all gradients using the `zero_grad` method of your optimizer
- Sample random noise of the same shape as your batch (use `torch.randn`)
- Sample a random timestep for each image (use `torch.randint`)
- Add the sampled noise to your images using the noise scheduler
- Feed the noisy batch to your model, and get the estimated noise (forward pass)
- Compute the MSE between the noise you sampled and the model's estimate
- Run the backward pass
- Run the optimizer step
- Run the learning rate scheduler step

After each epoch, use the provided `evaluate` function to generate images.

In [None]:
from diffusers.utils import make_image_grid
import os

def evaluate(epoch, pipeline, batch_size=4):
    # Sample some images from random noise (this is the reverse diffusion process).
    # The default pipeline output type is `List[PIL.Image]`
    images = pipeline(
        batch_size=batch_size,
    ).images

    # Make a grid out of the images (2 x 2 matches the default batch_size of 4).
    # The result is named `grid` to avoid shadowing the `image_grid` helper defined earlier.
    grid = make_image_grid(images, rows=2, cols=2)

    # `display` is provided by the notebook environment
    display(grid)

In [None]:
%matplotlib inline

from tqdm import tqdm
from diffusers import DDPMPipeline

device = "cuda"

model.to(device)
model.train()

evaluate_every_n_epochs = 1

losses = []
for epoch in range(n_epochs):
    epoch_loss = 0

    for i, batch in enumerate(tqdm(train_dataloader)):
        clean_images = batch["images"].to(device)

        # CODE: Clear all gradients

        # CODE: Sample random noise

        # CODE: Sample a random timestep for each image


        # CODE: Use the noise_scheduler to add noise to your image

        # CODE: Run the forward pass. Note: pass return_dict=False to your model.

        # CODE: Evaluate the loss

        # CODE: Run the backward pass

        # CODE: Perform the optimizer and scheduler steps

        # Aggregate the losses
        epoch_loss += loss.item()

    print(f"Epoch {epoch} / Epoch Loss {epoch_loss / len(train_dataloader)}")
    losses.append(epoch_loss / len(train_dataloader))

    # We now create a pipeline and evaluate the model!
    pipeline = DDPMPipeline(unet=model, scheduler=noise_scheduler)
    if (epoch + 1) % evaluate_every_n_epochs == 0:
        evaluate(epoch, pipeline)

# Plot the final results
plt.plot(range(len(losses)), losses)