This notebook is a fork of the FastAI class's stable diffusion deep dive.

# Setup and Import the Stable Diffusion models

In [3]:
import sys
print(sys.executable)

/Library/Developer/CommandLineTools/usr/bin/python3


In [1]:
try:
    import accelerate
    print(f"accelerate is installed: version {accelerate.__version__}")
except ImportError:
    print("accelerate is NOT installed in this environment")



accelerate is installed: version 1.10.1




In [2]:
import torch
from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
from huggingface_hub import notebook_login

# For video display:
from IPython.display import HTML
from matplotlib import pyplot as plt
from pathlib import Path
from PIL import Image
from torch import autocast
from torchvision import transforms as tfms
from tqdm.auto import tqdm
from transformers import CLIPTextModel, CLIPTokenizer, logging
import os

torch.manual_seed(100)
if not (Path.home()/'.cache/huggingface'/'token').exists(): notebook_login()

# Suppress some unnecessary warnings when loading the CLIPTextModel
logging.set_verbosity_error()


In [4]:
# get the device
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

## Load the models from HuggingFace
First, see what's in the local Hugging Face cache folder.

In [5]:
!ls ~/.cache/huggingface

hub           stored_tokens token         xet


In [6]:
!ls ~/.cache/huggingface/xet

https___cas_serv-tGqkUaZf_CBPHQ6h


In [7]:
# these are the models I already downloaded from Hugging Face in another notebook
!ls ~/.cache/huggingface/hub

models--CompVis--stable-diffusion-v1-4 models--openai--clip-vit-large-patch14


**VAE**: variational autoencoder. This module encodes images into latents and decodes latents back into images. Stable Diffusion runs the diffusion process on the latents produced by the encoder instead of on raw pixels. Why do we need the latents? Memory (and compute) savings! A latent is a compressed version of the original image with far fewer values than the pixel grid. The latents serve as the "image" input of the UNet.
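
To put a number on the saving, here is a small arithmetic sketch. It assumes the usual Stable Diffusion v1 VAE layout (each spatial side downsampled by 8, 4 latent channels), which is not stated in the cells above; `num_values` is just a helper written for this note.

```python
def num_values(shape):
    """Number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

# RGB image at the resolution used later in this notebook.
image_shape = (3, 512, 512)

# Assumption: the VAE shrinks each side by a factor of 8 and emits 4 channels.
latent_shape = (4, 512 // 8, 512 // 8)

ratio = num_values(image_shape) / num_values(latent_shape)
print(f"image values:  {num_values(image_shape)}")
print(f"latent values: {num_values(latent_shape)}")
print(f"compression:   {ratio:.0f}x")
```

Under that assumption the UNet works on a tensor dozens of times smaller than the image, which is where the memory saving comes from.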

Note that the encoder and decoder of the VAE are both neural nets: the encoder is built from downsampling convolutional blocks and the decoder from upsampling convolutional blocks. So don't forget to move the VAE to the GPU.

KL: the KL divergence measures how far the encoder's latent distribution (a diagonal Gaussian) is from a standard normal prior. It is added to the training loss as a regularizer, which keeps the latent space smooth and close to that prior so the decoder can turn any reasonable latent back into an image.
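
For intuition, the KL term between a diagonal Gaussian and a standard normal has a closed form. The sketch below is plain Python for illustration only (it is not the `AutoencoderKL` code); the function name `kl_to_standard_normal` is made up for this note.

```python
import math

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over the dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mean, logvar)
    )

# A latent distribution that already equals the prior costs nothing.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the mean of one dimension away from 0 is penalized.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

print(kl_at_prior, kl_shifted)
```

The penalty grows as the encoder's means drift from 0 or its variances drift from 1, which is what pulls the latents toward the prior.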

In [8]:
# vae
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to(device)

In [9]:
type(vae)

diffusers.models.autoencoders.autoencoder_kl.AutoencoderKL

**CLIP**: Contrastive Language-Image Pre-training. This is the training procedure that learns embeddings which map an image (e.g. a cat picture) and its caption (e.g. "cat") to nearby vectors in a shared embedding space. ~~Note that CLIP train 2 tokenizers: image-to-embedding and text-to-embedding image encoder and text encoder. For stable diffusion, we are only gonna use the text-to-embedding part,~~
Note that CLIP trains 2 encoders (an image encoder and a text encoder). Through contrastive learning, CLIP aligns the outputs of the 2 encoders into one unified embedding space, which enables cross-modal similarity.
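
A toy version of the contrastive objective may make "align the two encoders" more concrete. This is a pure-Python sketch on hand-made 2-D vectors, not CLIP's actual training code; the function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair k of images should match pair k of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    n = len(sim)
    # Image -> text direction uses rows, text -> image direction uses columns.
    loss_img = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    loss_txt = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (loss_img + loss_txt)

images = [[1.0, 0.1], [0.1, 1.0]]            # pretend image-encoder outputs
captions_matched = [[0.9, 0.0], [0.0, 0.9]]  # caption k points the same way as image k
captions_swapped = [captions_matched[1], captions_matched[0]]

aligned = clip_style_loss(images, captions_matched)
misaligned = clip_style_loss(images, captions_swapped)
print(aligned < misaligned)
```

The loss is small when each image is closest to its own caption and large when the pairs are swapped, so minimizing it pushes matching pairs together in the shared space.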

We are only gonna use the text encoder. It serves as a frozen language feature extractor (aka the machine that converts the caption/prompt into vectors) whose output is fed to the UNet, both as the conditioning signal during training and as the prompt embedding during inference.

Are the embeddings the same as CLIP’s original ones?
Not exactly — and this is an important nuance.

- If you use the exact same CLIP checkpoint (say, OpenAI’s ViT-L/14), then encoding "dog" with that model yields the same vector that CLIP would produce for "dog".

- But other Stable Diffusion versions may use a different text encoder checkpoint (e.g., OpenCLIP ViT-H/14 trained on LAION). The architecture is CLIP-like, but the weights differ. Therefore, the vector for "dog" will not be numerically identical, though it is semantically similar. The v1-4 setup in this notebook loads `openai/clip-vit-large-patch14`, so the first case applies here.

Also, whether the text encoder stays fully frozen can depend on the model version or the fine-tuning recipe. In the setup described here it is frozen.

🧩 Conceptually the same type of embedding, but not literally the same numbers as the original CLIP model trained jointly with an image encoder.

Note that the UNet does not use CLIP's image encoder or the single unified (pooled) embedding. It uses the per-token hidden states of the text encoder, one vector per token position.

Tips:
1. Rule of thumb for `.to(device)`: only the PyTorch neural-net models need to be moved to the GPU (cuda/mps) for acceleration. Things like tokenizers and schedulers are not neural nets with millions of parameters, so they are considered "lightweight" and can stay on the CPU.
2. I forgot the tokenizer part at the very beginning: captions like "a brown cat on the street" must first be tokenized and then fed into a text encoder such as CLIP's.

In [10]:
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

In [11]:
type(tokenizer), type(text_encoder)

(transformers.models.clip.tokenization_clip.CLIPTokenizer,
 transformers.models.clip.modeling_clip.CLIPTextModel)

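**U-Net**: the neural network at the core of the diffusion model. It takes a noisy image (here, noisy latents), the timestep, and the caption (text embeddings), and predicts the noise contained in that input. During training, noise is added to the image over a series of steps until it becomes pure noise (the forward process), and the U-Net learns to predict that noise. At inference we run the process in reverse: start from pure noise and remove the predicted noise step by step until an image appears (not one of the original training images, but a new sample consistent with the prompt). The "U" refers to the shape of the architecture: a downsampling path followed by an upsampling path, joined by skip connections, which looks like a U when drawn.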

In [12]:
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet").to(device)
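
A toy sketch of the noising step and of why predicting the noise is enough to get back to the image. It uses the standard closed-form mix `x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise` on a 3-element list; the helper names, the value `alpha_bar=0.3`, and the "perfect" noise prediction are all assumptions for illustration, not calls into `unet`.

```python
import math
import random

def add_noise(x0, noise, alpha_bar):
    """Forward process in one shot: blend the clean sample with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]

def estimate_x0(x_t, predicted_noise, alpha_bar):
    """Invert the blend, given a prediction of the noise that was added."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * n) / a for x, n in zip(x_t, predicted_noise)]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                      # stand-in for clean latents
noise = [rng.gauss(0.0, 1.0) for _ in x0]  # the noise a U-Net would be trained to predict

x_t = add_noise(x0, noise, alpha_bar=0.3)

# Pretend the network predicted the noise perfectly.
recovered = estimate_x0(x_t, noise, alpha_bar=0.3)
print(recovered)
```

A real U-Net only approximates the noise, which is one reason sampling removes it gradually over many steps instead of in a single jump.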

@TODO: what is a scheduler? Is it the component that manages how much noise is added to the image during UNet training?

In [14]:
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                                 beta_schedule="scaled_linear", num_train_timesteps=1000)
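
To see what these arguments describe, here is a pure-Python sketch of a "scaled linear" noise schedule. It reflects my understanding of how such a schedule is usually derived (betas linear in square-root space, then converted to a noise level sigma per training timestep); treat the helper names and the exact formulas as assumptions, not as the `LMSDiscreteScheduler` source.

```python
import math

def scaled_linear_betas(beta_start, beta_end, num_steps):
    """Betas spaced linearly between sqrt(beta_start) and sqrt(beta_end), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def sigmas_from_betas(betas):
    """Noise level per timestep: sigma_t = sqrt((1 - alpha_bar_t) / alpha_bar_t)."""
    sigmas, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta  # running product of (1 - beta)
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas

# Same numbers as the scheduler arguments above.
betas = scaled_linear_betas(0.00085, 0.012, 1000)
sigmas = sigmas_from_betas(betas)

print(f"first sigma: {sigmas[0]:.4f}, last sigma: {sigmas[-1]:.2f}")
```

So the scheduler is the bookkeeping part: it knows how much noise belongs to each timestep, and at inference it decides which timesteps to visit and how to step from one noise level to the next.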

# A diffusion loop

@TODO:
1. Remind me: what is classifier-free guidance? Is it just a guidance/caption describing what the image is about (instead of a typical classifier label)?
2. Why are we starting with one prompt for inference? I thought we were going to inspect the training of a UNet.

In [27]:
torch.cuda.empty_cache()

In [29]:
# some settings
prompt = ["a white and grey rag doll cat"]
height = 512
width = 512
num_inference_steps = 30 # denoising steps
guidance_scale = 7.5 # scale for classifier-free guidance
generator = torch.manual_seed(32) # seed generator to create the initial latent noise
batch_size = 1 # one prompt in the batch, so one image is generated

## tokenize the prompt and then use the CLIP text encoder to turn it into embeddings

### First let's generate the embedding of the actual prompt (conditional)

In [33]:
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
# return_tensors="pt" makes the tokenizer return PyTorch tensors

In [24]:
text_input.input_ids

[49406,
 320,
 1579,
 537,
 5046,
 20687,
 9286,
 2368,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407,
 49407]

What are input_ids and attention_mask?

Why does input_ids contain so many 49407s? I thought tokenization was like a dictionary lookup against the vocabulary in the tokenizer model.
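
A likely answer: the lookup only covers the real words. With `padding="max_length"` the tokenizer then wraps the prompt in a start token (49406, `<|startoftext|>`) and an end token (49407, `<|endoftext|>`) and pads up to 77 positions, reusing 49407 as the pad token. `attention_mask` marks which positions are real (1) and which are padding (0). The sketch below imitates that behavior in plain Python; `pad_like_clip` is a made-up helper, and the word ids are copied from the output above.

```python
BOS_ID = 49406  # <|startoftext|>
EOS_ID = 49407  # <|endoftext|>, also used here as the padding id

def pad_like_clip(token_ids, max_length=77):
    """Wrap ids in BOS/EOS, truncate if needed, then pad to max_length with EOS."""
    ids = [BOS_ID] + list(token_ids)[: max_length - 2] + [EOS_ID]
    num_pad = max_length - len(ids)
    return {
        "input_ids": ids + [EOS_ID] * num_pad,
        "attention_mask": [1] * len(ids) + [0] * num_pad,
    }

# Word-piece ids of "a white and grey rag doll cat", taken from the cell output above.
prompt_ids = [320, 1579, 537, 5046, 20687, 9286, 2368]

encoded = pad_like_clip(prompt_ids)
empty = pad_like_clip([])  # what an empty prompt "" turns into

print(encoded["input_ids"][:10])
print(sum(encoded["attention_mask"]), "real positions out of", len(encoded["input_ids"]))
```

Fixed-length padding is what lets prompts of different lengths be stacked into one tensor.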

In [39]:
max_length = text_input.input_ids.shape[-1]

❓ Why torch.no_grad here? Have we initialized some gradients in text_input, and are we gonna update the coefficients?

No. It's an optimization for speed and memory, not a guard against unintended training: inside `torch.no_grad()` PyTorch does not record the computation graph, so intermediate activations don't have to be kept around.
Rule of thumb: for inference, always use `torch.no_grad()`. With gradient tracking the text encoder eats up about 1.5GB of VRAM; without it, about 500MB.

In [35]:
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]

In [37]:
type(text_embeddings)

torch.Tensor

In [38]:
text_embeddings.ndim

3

### Then generate an empty prompt embedding (the unconditional). A key to classifier-free guidance

In [41]:
uncond_input = tokenizer(
    [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)

In [42]:
uncond_input

{'input_ids': tensor([[49406, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]]), 'attention_mask': tensor([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]])}

In [44]:
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]

### Concat the conditional and unconditional embeddings into 1 tensor

In [45]:
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
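
Why stack the two embeddings? With both in one batch, a single UNet call can return a noise prediction for the empty prompt and one for the real prompt, and classifier-free guidance then extrapolates from the first toward the second using `guidance_scale`. The sketch below shows only that combination step on made-up numbers; the function name and the toy values are assumptions, and real predictions are tensors, not 2-element lists.

```python
def classifier_free_guidance(noise_uncond, noise_text, guidance_scale):
    """Move from the unconditional prediction toward (and past) the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

# Made-up noise predictions for the same noisy latent.
noise_uncond = [0.10, -0.20]  # prediction with the empty prompt
noise_text = [0.30, 0.00]     # prediction with the actual prompt

guided = classifier_free_guidance(noise_uncond, noise_text, guidance_scale=7.5)
no_guidance = classifier_free_guidance(noise_uncond, noise_text, guidance_scale=1.0)

print(guided)
print(no_guidance)
```

A scale of 1.0 just returns the text-conditioned prediction, while larger values exaggerate the difference the prompt makes, which is why there is no separate classifier involved.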

## Prepare Scheduler and Latents