<a href="https://colab.research.google.com/github/free-syllabus/aml/blob/main/on-session/10_Stable_Diffusion/stable_diffusion_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Diffusion

Credits go to **Pedro Cuenca, Patrick von Platen, Suraj Patil, Jeremy Howard** for creating this notebook. Their full course is available at https://course.fast.ai/Lessons/lesson9.html

In [None]:
!pip install -Uq diffusers transformers fastcore fastdownload

## Using Stable Diffusion

You need to create a Hugging Face token and log in to continue.

(or you can use mine: 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 )

In [None]:
import logging
from pathlib import Path

import matplotlib.pyplot as plt
import torch
from diffusers import StableDiffusionPipeline
from fastcore.all import concat
from huggingface_hub import notebook_login
from PIL import Image

logging.disable(logging.WARNING)

torch.manual_seed(1)
if not (Path.home()/'.cache/huggingface'/'token').exists(): notebook_login()

### Stable Diffusion Pipeline

[`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion#diffusers.StableDiffusionPipeline) is an end-to-end [diffusion inference pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion) that allows you to start generating images with just a few lines of code.

In [None]:
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16).to("cuda")

Let's use the pipeline to generate our first image:

In [None]:
prompt = "Bear in a bed"

In [None]:
pipe(prompt).images[0]

In [None]:
torch.manual_seed(1024)
pipe(prompt).images[0]

Stable Diffusion is based on a progressive denoising algorithm that can create a convincing image starting from pure random noise. To get a feel for how the image develops, we can run the pipeline with only a few denoising steps and look at the rougher results:

In [None]:
torch.manual_seed(1024)
pipe(prompt, num_inference_steps=5).images[0]

In [None]:
torch.manual_seed(1024)
pipe(prompt, num_inference_steps=10).images[0]

### Classifier-Free Guidance

In [None]:
def image_grid(imgs, rows, cols):
    w,h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    for i, img in enumerate(imgs): grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

_Classifier-Free Guidance_ is a method to increase the adherence of the output to the conditioning signal we used (the text).

Roughly speaking, the larger the guidance the more the model tries to represent the text prompt. However, large values tend to produce less diversity. The default is `7.5`, which represents a good compromise between variety and fidelity.
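
To make the effect of the guidance value concrete, here is a minimal sketch on toy numbers. The two short lists stand in for the model's noise predictions without and with the prompt; the helper name `guide` and the values are made up for illustration and are not part of `diffusers`.

```python
def guide(pred_uncond, pred_text, guidance_scale):
    """Blend the unconditional and text-conditioned predictions element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(pred_uncond, pred_text)]

pred_uncond = [0.0, 1.0]  # toy prediction without a prompt
pred_text = [1.0, 3.0]    # toy prediction with the prompt

for g in (0.0, 1.0, 7.5):
    print(g, guide(pred_uncond, pred_text, g))
```

With a scale of `0` the text is ignored, with `1` we get exactly the text-conditioned prediction, and larger values extrapolate past it in the direction of the prompt.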

We can generate multiple images for the same prompt by simply passing a list of prompts instead of a string.

In [None]:
num_rows,num_cols = 4,4
prompts = [prompt] * num_cols

In [None]:
images = concat(pipe(prompts, guidance_scale=g).images for g in [1.1,3,7,14])

In [None]:
image_grid(images, rows=num_rows, cols=num_cols)

### Negative prompts

_Negative prompting_ refers to the use of another prompt (instead of a completely unconditioned generation), and scaling the difference between generations of that prompt and the conditioned generation.

In [None]:
torch.manual_seed(1000)
prompt = "Bear in the style of Vermeer"
pipe(prompt).images[0]

In [None]:
torch.manual_seed(1000)
pipe(prompt, negative_prompt="blue").images[0]

By using the negative prompt we move more towards the direction of the positive prompt, effectively reducing the importance of the negative prompt in our composition.
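
The same idea can be illustrated on toy numbers. This is only a sketch: the three lists are invented stand-ins for model predictions, and `guide` and `distance` are hypothetical helpers. It compares guidance that starts from the empty-prompt prediction with guidance that starts from the negative-prompt prediction.

```python
def guide(base, pred_text, guidance_scale):
    """Start at `base` and move toward `pred_text`, scaled by guidance_scale."""
    return [b + guidance_scale * (t - b) for b, t in zip(base, pred_text)]

def distance(a, b):
    """Euclidean distance between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

pred_text = [1.0, 3.0]    # toy prediction for the positive prompt
pred_uncond = [0.0, 1.0]  # toy prediction for the empty prompt
pred_neg = [2.0, 0.0]     # toy prediction for the negative prompt

plain = guide(pred_uncond, pred_text, 7.5)
with_negative = guide(pred_neg, pred_text, 7.5)

print("distance from negative, plain guidance:   ", distance(plain, pred_neg))
print("distance from negative, negative guidance:", distance(with_negative, pred_neg))
```

In this toy setup the result guided from the negative prediction ends up farther from it than the result guided from the empty prompt.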

### Image to Image

Even though Stable Diffusion was trained to generate images, and optionally drive the generation using text conditioning, we can use the raw image diffusion process for other tasks.

For example, instead of starting from pure noise, we can start from an image and add a certain amount of noise to it. We are replacing the initial steps of the denoising and pretending our image is what the algorithm came up with. Then we continue the diffusion process from that state as usual.

This usually preserves the composition although details may change a lot.

These operations (provide an initial image, add some noise to it and run diffusion from there) can be automatically performed by a special image to image pipeline: `StableDiffusionImg2ImgPipeline`.
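
How much of the denoising is replaced is controlled by the `strength` argument. The sketch below shows one plausible way to map `strength` to step counts; it mirrors our understanding of what the pipeline does internally, but the function `img2img_steps` is a hypothetical helper written for this example, so treat the exact rounding as an assumption.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (skipped, run) step counts for a strength in [0, 1]."""
    run = min(int(num_inference_steps * strength), num_inference_steps)
    skipped = num_inference_steps - run
    return skipped, run

print(img2img_steps(50, 0.8))  # (10, 40)
print(img2img_steps(70, 1.0))  # (0, 70)
```

A low strength skips most steps and stays close to the initial image, while a strength of `1` runs every step and leaves little of the original.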

In [None]:
from diffusers import StableDiffusionImg2ImgPipeline
from fastdownload import FastDownload

In [None]:
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
).to("cuda")

We'll use as an example an image from Wikipedia.

In [None]:
p = FastDownload().download('https://upload.wikimedia.org/wikipedia/commons/f/f5/Howlsnow.jpg')
init_image = Image.open(p).convert("RGB").resize((400, 400))
init_image

In [None]:
torch.manual_seed(42)
prompt = "Wolf howling at the moon, fantasy style"
images = pipe(prompt=prompt, num_images_per_prompt=3, image=init_image, strength=0.8, num_inference_steps=50).images
image_grid(images, rows=1, cols=3)

When we get a composition we like, we can use it as the initial image for another prompt and change the results further. For example, let's take the second image above and try to use it to generate something in the style of Van Gogh.

In [None]:
init_image = images[1]

In [None]:
torch.manual_seed(100)
prompt = "Oil painting of wolf howling at the moon by van Gogh"
images = pipe(prompt=prompt, num_images_per_prompt=3, image=init_image, strength=1, num_inference_steps=70).images
image_grid(images, rows=1, cols=3)

## Looking inside the pipeline

The inference pipeline is just a small piece of code that plugs the components together and performs the inference loop.

We'll go through the process of loading and plugging the pieces to see how we could have written it ourselves. We'll start by loading all the modules that we need from their pretrained weights.

First, we need the text encoder and the tokenizer. These come from the text portion of a standard CLIP model, so we'll use the weights released by OpenAI.

In [None]:
from transformers import CLIPTextModel, CLIPTokenizer

In [None]:
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")  # the tokenizer has no weights, so no dtype is needed
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")

Next we'll load the `autoencoder` and the `unet`.

In [None]:
from diffusers import AutoencoderKL, UNet2DConditionModel

In [None]:
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema", torch_dtype=torch.float16).to("cuda")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")

In a real diffusion process, noise is not added to the training samples arbitrarily. The amount added at each timestep is determined by a `scheduler`.

A higher timestep means that more noise has been added, and β (beta) specifies how much noise is added at each individual timestep.
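
To see how the per-step β values accumulate, here is a small pure-Python sketch. It assumes the "scaled linear" schedule (linear in √β, then squared) with the same start and end values used in this notebook, and it assumes the noise level is summarised as `sqrt((1 - ᾱ) / ᾱ)`, where ᾱ is the running product of `1 - β`. The function name is made up for the example.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, n=1000):
    """Interpolate linearly between sqrt(beta_start) and sqrt(beta_end), then square."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]

betas = scaled_linear_betas()

# Fraction of the original signal that survives up to each timestep.
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

# Noise-to-signal ratio at each timestep (assumed definition of sigma).
sigmas = [math.sqrt((1.0 - a) / a) for a in alphas_cumprod]

print(f"first beta {betas[0]:.5f}, last beta {betas[-1]:.5f}")
print(f"signal left at the last timestep: {alphas_cumprod[-1]:.4f}")
print(f"sigma at the last timestep: {sigmas[-1]:.2f}")
```

Each β is small, but their effect compounds: by the last timestep almost none of the original signal remains.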

In [None]:
beta_start,beta_end = 0.00085,0.012
plt.plot(torch.linspace(beta_start**0.5, beta_end**0.5, 1000) ** 2)
plt.xlabel('Timestep')
plt.ylabel('β');

In [None]:
from diffusers import LMSDiscreteScheduler

In [None]:
scheduler = LMSDiscreteScheduler(beta_start=beta_start, beta_end=beta_end, beta_schedule="scaled_linear", num_train_timesteps=1000)

We now define the parameters we'll use for generation.

In contrast with the previous examples, we set `num_inference_steps` to 70 to get an even more defined image.

In [None]:
prompt = ["Bear in a bed"]

height = 512
width = 512
num_inference_steps = 70
guidance_scale = 7.5
batch_size = 1

We tokenize the prompt. The model requires the same number of tokens for every prompt, so padding is used to ensure we meet the required length.

In [None]:
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
text_input['input_ids']

In [None]:
tokenizer.decode(49407)

The attention mask uses zero to represent tokens we are not interested in. These are all of the padding tokens.
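
The padding and the mask can be reproduced by hand. The sketch below is a toy stand-in for the tokenizer: `49406` and `49407` are the start and end tokens of the CLIP vocabulary, we assume the end token doubles as the padding token and that the maximum length is 77, and the four word ids are invented for illustration.

```python
BOS, EOS = 49406, 49407  # CLIP start/end ids; EOS is assumed to be the padding id too
MAX_LEN = 77             # assumed value of tokenizer.model_max_length

def pad_prompt(word_ids, max_len=MAX_LEN):
    """Wrap the ids in start/end tokens, then pad ids and mask to max_len."""
    ids = [BOS] + list(word_ids)[: max_len - 2] + [EOS]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [EOS] * n_pad, mask + [0] * n_pad

ids, mask = pad_prompt([101, 102, 103, 104])  # hypothetical ids for a four-token prompt
print(ids[:8])
print(mask[:8])
print(len(ids), sum(mask))
```

Only the start token, the prompt tokens, and the first end token are marked with `1` in the mask; everything after that is padding.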

In [None]:
text_input['attention_mask']

The text encoder gives us the embeddings for the text prompt we used.

In [None]:
text_embeddings = text_encoder(text_input.input_ids.to("cuda"))[0].half()
text_embeddings.shape

We also get the embeddings required to perform unconditional generation, which is achieved with an empty string: the model is free to go in whichever direction it wants as long as it results in a reasonable-looking image. These embeddings will be used to apply classifier-free guidance.

In [None]:
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
    [""], padding="max_length", max_length=max_length, return_tensors="pt"
)
uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0].half()
uncond_embeddings.shape

For classifier-free guidance, we need to do two forward passes. One with the conditioned input (`text_embeddings`), and another with the unconditional embeddings (`uncond_embeddings`). In practice, we can concatenate both into a single batch to avoid doing two forward passes.

In [None]:
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

To start the denoising process, we start from pure Gaussian (normal) noise. These are our initial latents.

In [None]:
torch.manual_seed(100)
latents = torch.randn((batch_size, unet.config.in_channels, height // 8, width // 8))  # read the channel count from the model config
latents = latents.to("cuda").half() # half precision to match the fp16 models and run faster
latents.shape

`4×64×64` is the input shape. The decoder will later transform this latent representation into a `3×512×512` image after the denoising process is complete.
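
The shapes are easy to check by hand. In this small sketch the helper names are made up, and the 4 latent channels and the downscale factor of 8 are taken from the shapes used in this notebook.

```python
def latent_shape(batch_size, height, width, latent_channels=4, downscale=8):
    """Shape of the latent tensor for an image of the given pixel size."""
    return (batch_size, latent_channels, height // downscale, width // downscale)

def n_values(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

latents_shape = latent_shape(1, 512, 512)
image_shape = (1, 3, 512, 512)

print(latents_shape)                                    # (1, 4, 64, 64)
print(n_values(image_shape) // n_values(latents_shape)) # 48
```

The denoising loop therefore works on 48 times fewer values than it would in pixel space.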

Next, we initialize the scheduler with our chosen `num_inference_steps`. This will prepare the internal state to be used during denoising.

In [None]:
scheduler.set_timesteps(num_inference_steps)

We scale the initial noise by the standard deviation required by the scheduler. This value will depend on the particular scheduler we use.

In [None]:
latents = latents * scheduler.init_noise_sigma # scale unit-variance noise to the noise level the scheduler starts from

We are ready to write the denoising loop. The timesteps go from `999` to `0` (1000 steps that were used during training) following a particular schedule.
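
As a rough sketch of what such a schedule can look like, the function below picks evenly spaced, possibly fractional, timesteps between `999` and `0`. Even spacing is an assumption made for this example; the actual scheduler may space or round its timesteps differently, and `evenly_spaced_timesteps` is a hypothetical helper.

```python
def evenly_spaced_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced timesteps from the noisiest (999) down to 0."""
    last = num_train_timesteps - 1
    ascending = [last * i / (num_inference_steps - 1) for i in range(num_inference_steps)]
    return ascending[::-1]

timesteps = evenly_spaced_timesteps(70)
print(len(timesteps), timesteps[0], timesteps[-1])
print([round(t, 2) for t in timesteps[:3]])
```

With 70 inference steps we only visit 70 of the 1000 training timesteps, roughly 14.5 apart.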

In [None]:
scheduler.timesteps

In [None]:
scheduler.sigmas # noise level (standard deviation) at each inference step

In [None]:
plt.plot(scheduler.timesteps, scheduler.sigmas[:-1]);

In [None]:
from tqdm.auto import tqdm

In [None]:
for t in tqdm(scheduler.timesteps):
    # duplicate the latents so one batch covers the unconditional and the text-conditioned pass
    inp = torch.cat([latents] * 2)
    inp = scheduler.scale_model_input(inp, t)

    # predict the noise residual
    with torch.no_grad(): pred = unet(inp, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    pred_uncond, pred_text = pred.chunk(2)
    pred = pred_uncond + guidance_scale * (pred_text - pred_uncond)

    # compute the "previous" noisy sample
    latents = scheduler.step(pred, t, latents).prev_sample

After this process completes, our `latents` contain the denoised representation of the image. We use the `vae` decoder to convert it back to pixel space.

In [None]:
with torch.no_grad(): image = vae.decode(1 / 0.18215 * latents).sample # scaling based on the original paper

And finally, let's convert the image to PIL so we can display it.

In [None]:
image = (image / 2 + 0.5).clamp(0, 1)
image = image[0].detach().cpu().permute(1, 2, 0).numpy()
image = (image * 255).round().astype("uint8")
Image.fromarray(image)
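
The arithmetic of this conversion can be checked on single values. The sketch assumes the decoder output lies roughly in `[-1, 1]`; `to_uint8` is a hypothetical scalar version of the tensor operations above.

```python
def to_uint8(x):
    """Map a decoder output value to an 8-bit pixel value, clamping out-of-range inputs."""
    x = min(max(x / 2 + 0.5, 0.0), 1.0)  # shift [-1, 1] to [0, 1] and clamp
    return int(round(x * 255))

for value in (-2.0, -1.0, 0.0, 1.0, 3.0):
    print(value, to_uint8(value))
```

Values outside the expected range are clamped to pure black or pure white rather than wrapping around.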



---



## TODO

Adjust the following code so that image generation takes a *negative prompt* into account in addition to the guidance prompt, as described in the 'Negative prompts' section. Instead of scaling between the *guided image* and the *unconditioned image*, the model should scale between the *guided image* and the *negatively guided image*.

The following code is the same as the code we've just gone through, but wrapped in functions to make it easier to read.

In [None]:
from IPython.display import display

In [None]:
prompts = [
    'Bear in a bed',
    'Oil painting of a bear in a bed'
]

neg_prompts = [
    'brown',
    'brown'
]

In [None]:
def text_enc(prompts, maxlen=None):
    if maxlen is None: maxlen = tokenizer.model_max_length
    inp = tokenizer(prompts, padding="max_length", max_length=maxlen, truncation=True, return_tensors="pt")
    return text_encoder(inp.input_ids.to("cuda"))[0].half()

def mk_img(t):
    image = (t/2+0.5).clamp(0,1).detach().cpu().permute(1, 2, 0).numpy()
    return Image.fromarray((image*255).round().astype("uint8"))

In [None]:
def mk_samples(prompts, g=7.5, seed=100, steps=70):
    bs = len(prompts)
    text = text_enc(prompts)
    uncond = text_enc([""] * bs, text.shape[1])
    emb = torch.cat([uncond, text])
    if seed is not None: torch.manual_seed(seed)  # also honour seed=0

    latents = torch.randn((bs, unet.config.in_channels, height//8, width//8))
    scheduler.set_timesteps(steps)
    latents = latents.to("cuda").half() * scheduler.init_noise_sigma

    for i,ts in enumerate(tqdm(scheduler.timesteps)):
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)
        with torch.no_grad(): u,t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)
        pred = u + g*(t-u)
        latents = scheduler.step(pred, ts, latents).prev_sample

    with torch.no_grad(): return vae.decode(1 / 0.18215 * latents).sample

In [None]:
images = mk_samples(prompts)

In [None]:
for img in images: display(mk_img(img))