# Generative AI with 🧨 diffusers

**What is 🧨 diffusers**?

> 🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you’re looking for a simple inference solution or want to train your own diffusion model, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions.

\- from the [docs](https://huggingface.co/docs/diffusers).

**Disclaimer**: The materials presented in this notebook are purely intended for educational purposes. Please refer to [our ethical charter](https://huggingface.co/docs/diffusers/main/en/conceptual/ethical_guidelines) for responsibly using the tools shown in this notebook. Also, this notebook is meant to be a hands-on tutorial on Diffusion models with 🧨 diffusers. For an overview, please refer to [this resource](https://jalammar.github.io/illustrated-stable-diffusion/). 


## Table of contents 🍽

* Installation and setup
* Text-to-image generation in five LoC
* Doing better with better prompts 
* Taking control of the generations
  * Editing images
  * Adding multiple conditions for generation
  * Controlling semantic properties
* Going beyond images -- text-to-video generation
* Conclusion

We believe these use cases will help you explore different creative applications and extend them further. 

Let's start diffusing 🧨

## Installation and setup

Since Colab comes with many pre-installed libraries already, we just need a few additional libraries to start our exploration:

* [🧨 diffusers](https://huggingface.co/docs/diffusers)
* [🤗 accelerate](https://huggingface.co/docs/accelerate/)
* [🤗 transformers](https://huggingface.co/docs/transformers/) 

In [None]:
!pip install -q diffusers accelerate transformers 

Check that we have access to a GPU (very important!).

In [None]:
!nvidia-smi

Log in to your Hugging Face account as we will need to be authorized to access one of the models we'll use today:

(If you don't have one, please [create one](https://huggingface.co/join); it's free 🤗)

In [None]:
!huggingface-cli login

And that's it for the setup part! We're now ready to start generating images 🎇

## Text-to-image generation in five LoC 🖼

We start by importing the [`DiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline), a class that encapsulates all the logic behind a text-conditioned diffusion system for image generation. We initialize the pipeline from a pre-trained checkpoint: [`runwayml/stable-diffusion-v1-5`](https://hf.co/runwayml/stable-diffusion-v1-5).

After importing the pipeline and initializing it, we call it on an input "prompt", which is an informal natural-language description of what we want the generated image to look like. We can also use an [unconditional diffusion model](https://huggingface.co/docs/diffusers/training/unconditional_training) that doesn't need a prompt, but that takes away user control.

Generating images from a prompt is as easy as:

In [None]:
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
prompt = "sailing ship in storm by Leonardo da Vinci"
images = pipeline(prompt=prompt).images

images[0]

And there we go! Within just five lines of code (LoC), we have generated an image from natural language input! 

> 💡 **Note**: By default, `DiffusionPipeline` classes have a `safety_checker` enabled to filter NSFW content out of the generated images.

As a good first next step, we can try changing the "seed" to see how that impacts the generated image. 

In [None]:
import torch 

seed = 1554 #@param {type:"integer"}
images = pipeline(prompt=prompt, generator=torch.manual_seed(seed)).images

images[0]

Feel free to experiment with the seeds and notice how that impacts the quality of the generated image. 
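
Why does fixing the seed matter? The starting noise comes from a pseudo-random generator, and the same seed replays the same draws. The sketch below illustrates the idea with Python's standard-library `random` module as a stand-in for `torch.manual_seed`. The helper `sample_noise` is hypothetical and not part of Diffusers.

```python
import random

def sample_noise(seed, n=4):
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, leaves global state alone
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

same_a = sample_noise(1554)
same_b = sample_noise(1554)
other = sample_noise(1555)

print(same_a == same_b)  # same seed: the noise is replayed exactly
print(same_a == other)   # different seed: different noise
```

This is why two calls with the same prompt, seed, and settings are expected to start from the same noise, while a new seed gives a new starting point.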

We can also generate multiple images from the same prompt by specifying the `num_images_per_prompt` parameter:

In [None]:
import PIL.Image  # import the submodule explicitly so `PIL.Image` is always available

def make_grid(images, rows, cols, resize_to=128):
    images = [image.resize((resize_to, resize_to)) for image in images]
    w, h = images[0].size
    grid = PIL.Image.new("RGB", size=(cols*w, rows*h))
    for i, image in enumerate(images):
        grid.paste(image, box=(i%cols*w, i//cols*h))
    return grid

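num_images_per_prompt = 4 #@param {type:"integer"}
images = pipeline(prompt=prompt, num_images_per_prompt=num_images_per_prompt).images
# Lay the images out in a single row, one column per generated image.
make_grid(images, rows=1, cols=num_images_per_prompt)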
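
The index arithmetic inside `make_grid` is easier to follow in isolation. Here is a minimal sketch of the same row-major layout rule, with a hypothetical helper `grid_boxes` that only computes the paste positions:

```python
def grid_boxes(n_images, cols, w, h):
    """Return the (left, top) paste position of each image in a row-major grid."""
    return [((i % cols) * w, (i // cols) * h) for i in range(n_images)]

boxes = grid_boxes(n_images=4, cols=2, w=256, h=256)
print(boxes)  # [(0, 0), (256, 0), (0, 256), (256, 256)]
```

`i % cols` picks the column and `i // cols` picks the row, so images fill the grid left to right and then top to bottom.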

So far, we have been performing all our computations in FP32 (32-bit floating point). Diffusers [supports](https://huggingface.co/docs/diffusers/main/en/optimization/fp16) FP16 computations, which can speed up generation, usually without noticeable degradation in quality. FP16 also roughly halves the memory needed for the model weights, which helps prevent out-of-memory (OOM) errors. FP16 computations are especially advantageous on [GPUs with tensor cores](https://www.nvidia.com/en-in/data-center/tensor-cores/).
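
To get a feel for the memory side of this, here is a back-of-the-envelope sketch. The parameter count is a hypothetical round number and not a measurement of any checkpoint. The helpers `weights_size_gib` and `to_fp16` are illustrative names as well.

```python
import struct

def weights_size_gib(num_params, bytes_per_param):
    """Memory needed to store `num_params` weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

num_params = 1_000_000_000  # hypothetical round number, not a measured model size
fp32_gib = weights_size_gib(num_params, 4)  # FP32: 4 bytes per weight
fp16_gib = weights_size_gib(num_params, 2)  # FP16: 2 bytes per weight
print(f"FP32: {fp32_gib:.2f} GiB, FP16: {fp16_gib:.2f} GiB")

# FP16 keeps fewer significant digits, so values are rounded more coarsely.
print(to_fp16(0.1))
```

The weights take half the space, at the cost of coarser rounding for each value.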

Let's see how this is beneficial. 

We start by writing a utility function to benchmark the inference speed of our pipeline. 

In [None]:
import time 

def benchmark(pipeline, prompt="a dog"):
    # warmup.
    _ = pipeline(prompt=prompt) 

    # run for five iterations.
    tic = time.time_ns()
    for _ in range(5):
        _ = pipeline(prompt=prompt)
    tok = time.time_ns() 
    print(f"Total execution time for 5 runs -- {(tok - tic) / 1e6:.1f} ms\n")

We delete the previous `pipeline` and initialize two new pipelines:

1. FP32 mode
2. FP16 mode 

In [None]:
del pipeline 
fp32_pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
fp16_pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 # switching to FP16 mode is just this line of code :)
).to("cuda")

Now, let's benchmark them ⏰

In [None]:
print("*** FP32 Pipeline ***")
benchmark(fp32_pipeline)

print("\n*** FP16 Pipeline ***")
benchmark(fp16_pipeline)

The speedup is clear ⚡️

The speedup is more prominent when there are more images to generate. The numbers will also vary depending on the GPU we're using. 
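
A per-call average and an explicit speedup ratio can make the comparison easier to read. Below is a minimal sketch with two hypothetical helpers, `time_per_call_ms` and `speedup`. The `time.sleep` workloads are stand-ins so the snippet runs without a GPU.

```python
import time

def time_per_call_ms(fn, warmup=1, iters=5):
    """Average wall-clock time of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

# Stand-in workloads; in this notebook they could be replaced with
# `lambda: fp32_pipeline(prompt="a dog")` and `lambda: fp16_pipeline(prompt="a dog")`.
slow_ms = time_per_call_ms(lambda: time.sleep(0.02))
fast_ms = time_per_call_ms(lambda: time.sleep(0.01))
print(f"{slow_ms:.1f} ms vs {fast_ms:.1f} ms, speedup x{speedup(slow_ms, fast_ms):.2f}")
```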

Going forward in this notebook, we'll default to FP16 computations to save memory and speed up generation. 

> 💡 **Note**: Diffusers supports many optimizations that can help you run the pipelines as efficiently as possible with little to no code changes. These optimizations include attention slicing, CPU offloading, use of flash attention, etc. We encourage you to explore these options [here](https://huggingface.co/docs/diffusers/main/en/optimization/opt_overview).  

So far, we have been able to generate images, but their quality (which is a subjective measurement) hasn't been super exciting. There are various things we can do to improve that, and one particularly easy option is ***better prompting***. 

Let's explore this in the next section. 

## Doing better with better prompts ✍️

Let's start with the following prompt:

> "portrait photo of a old warrior chief"

In [None]:
prompt = "portrait photo of a old warrior chief"
images = fp16_pipeline(prompt=prompt, num_images_per_prompt=4).images
make_grid(images, rows=2, cols=2, resize_to=256)

Not bad but let's explore how we can do better. 

The text prompt you use to generate an image is super important, so much so that crafting it has its own name: ***prompt engineering***. Some considerations to keep in mind during prompt engineering are:

* How is the image I want to generate, or images similar to it, stored on the internet?
* What additional detail can I give that steers the model towards the style I want?

With this in mind, let’s improve the prompt to include color and higher-quality details:

In [None]:
prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta"

In [None]:
images = fp16_pipeline(prompt=prompt, num_images_per_prompt=4).images
make_grid(images, rows=2, cols=2, resize_to=256)

Pretty impressive! 
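
One practical caveat: extending the prompt with `+=` appends the details again every time the cell is re-run. A minimal sketch of an alternative is to rebuild the prompt from its parts each time. The helper `build_prompt` and the variable `prompt_v2` are hypothetical names. For simplicity, this sketch leaves out the trailing `--beta`-style flags used above.

```python
def build_prompt(subject, modifiers):
    """Join a subject and a list of style modifiers into one comma-separated prompt."""
    parts = [subject] + [m.strip() for m in modifiers if m.strip()]
    return ", ".join(parts)

base = "portrait photo of a old warrior chief"
details = [
    "tribal panther make up",
    "blue on red",
    "side profile",
    "looking away",
    "serious eyes",
    "50mm portrait photography",
    "hard rim lighting photography",
]
prompt_v2 = build_prompt(base, details)
print(prompt_v2)
```

Because the prompt is rebuilt from `base` and `details`, running the cell twice gives the same string, and individual modifiers are easy to add or remove.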

We encourage you to explore how much further you can push the needle with better prompts (and other simple techniques) by referring to [this tutorial](https://huggingface.co/docs/diffusers/main/en/stable_diffusion). 

Up until this point, we've been experimenting with the "vanilla" text-to-image generation where we "conditioned" the diffusion model on text prompts. Let's now see how we can better control the images these models generate. 

## Taking control of the generations 🎛

In this section, we'll explore:

* how to edit images from natural language inputs
* how to use multiple conditionings for generating images
* how to control the semantic properties of the generated images

Let's get controlling!



### Editing images

We'll use [InstructPix2Pix](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/pix2pix) to edit images using text prompts. Along with an "edit instruction" (think of it as the prompt), we also need to supply the image that we wish to edit. 

We'll start by initializing the pipeline and loading the image we want to edit. 

In [None]:
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

model_id = "timbrooks/instruct-pix2pix"
pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
original_image = load_image("https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Tower_Bridge_from_Shad_Thames.jpg/1200px-Tower_Bridge_from_Shad_Thames.jpg")
original_image

In [None]:
edit_instruction = "Make it a picasso painting"
edited_image = pipeline(edit_instruction, image=original_image).images[0]
edited_image

Not bad. 

This pipeline exposes two important arguments:

1. `image_guidance_scale` that lets us preserve the input image structure.
2. `guidance_scale` that lets us control how much the edit instruction reflects in the edited image.
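
As a rough sketch of how the two scales interact: the InstructPix2Pix paper combines three noise predictions, one unconditional, one conditioned on the image only, and one conditioned on both the image and the text. It weights the differences between them with $s_i$ and $s_t$. The function below applies that combination to plain Python lists. The name `combine_guidance` and the toy numbers are made up for illustration and are not Diffusers internals.

```python
def combine_guidance(eps_uncond, eps_image, eps_full, s_i, s_t):
    """Two-scale classifier-free guidance, applied elementwise.

    eps_uncond: prediction with neither image nor text conditioning
    eps_image:  prediction with image conditioning only
    eps_full:   prediction with image and text conditioning
    """
    return [
        u + s_i * (im - u) + s_t * (f - im)
        for u, im, f in zip(eps_uncond, eps_image, eps_full)
    ]

# Toy three-element "predictions", made up for illustration.
u, im, f = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]

print(combine_guidance(u, im, f, s_i=1.0, s_t=1.0))  # both scales at 1 give back eps_full
print(combine_guidance(u, im, f, s_i=1.5, s_t=7.0))
```

Raising `s_t` amplifies the part of the prediction that is due to the edit instruction, while raising `s_i` amplifies the part that is due to the input image.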

Let's run the same pipeline across a few different `image_guidance_scale` and `guidance_scale` values. 

In [None]:
img_cfg_scales = [1., 1.5, 2.]
text_cfg_scales = [7., 10., 12.]
edited_images = []

for img_cfg in img_cfg_scales:
    for text_cfg in text_cfg_scales:
        image = pipeline(
            edit_instruction,
            image=original_image,
            image_guidance_scale=img_cfg,
            guidance_scale=text_cfg,
            num_inference_steps=20,
        ).images[0]
        edited_images.append(image)

Let's now plot these edited images to see the impact of different `image_guidance_scale` (denoted as $s_i$) and `guidance_scale` (denoted as $s_t$) values. 

In [None]:
import matplotlib.pyplot as plt 
import numpy as np


plt.figure(figsize=(15, 8))
for i, image in enumerate(edited_images):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(np.array(image))
    plt.title(
        f"$s_i$: {img_cfg_scales[i // 3]}, $s_t$: {text_cfg_scales[i % 3]}",
        fontsize=14,
        fontweight="bold",
    )
    plt.axis("off")
plt.tight_layout()

It seems that an `image_guidance_scale` of 1.5 is what lets the edit instruction come through properly here. 

Editing images seems practical. But what if we wanted to generate a new image from a geometric layout (a [canny edge map](https://en.wikipedia.org/wiki/Canny_edge_detector), for example) and a text prompt? Let's explore that now. 

### Adding multiple conditions for generation

For this part, we'll leverage [ControlNets](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/controlnet). We'll start by loading an image from which we'll extract its canny edge map. 

In [None]:
from diffusers.utils import load_image

# Let's load the popular vermeer image
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
image

Extract the canny edge map from the image: 

In [None]:
import cv2
import numpy as np

image = np.array(image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = PIL.Image.fromarray(image)
canny_image
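
The two thresholds passed to `cv2.Canny` drive its hysteresis step. Gradients above the high threshold count as edges and gradients below the low threshold are discarded. Anything in between is kept only if it connects to a strong edge. The sketch below covers just that labeling rule, not the full Canny algorithm, and `classify_gradient` is a hypothetical helper.

```python
def classify_gradient(magnitude, low, high):
    """Label a pixel by its gradient magnitude, as hysteresis thresholding does."""
    if magnitude > high:
        return "strong"    # kept as an edge
    if magnitude < low:
        return "rejected"  # discarded
    return "weak"          # kept only if connected to a strong edge

low_threshold, high_threshold = 100, 200
for magnitude in (50, 150, 250):
    print(magnitude, classify_gradient(magnitude, low_threshold, high_threshold))
```

Lowering both thresholds tends to give denser edge maps, and raising them keeps only the most pronounced contours.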

Now, we can use this edge map as an additional conditioning with a text prompt to generate an image. But this time, let's be a little more creative. 

Instead of using the pre-trained [`runwayml/stable-diffusion-v1-5`](https://hf.co/runwayml/stable-diffusion-v1-5) checkpoint, let's use [one](https://huggingface.co/sd-dreambooth-library/mr-potato-head) that was trained to generate photos of [Mr. Potato Head](https://en.wikipedia.org/wiki/Mr._Potato_Head). 

In the next code cell, we initialize the [ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) for handling canny edge maps as well as our [`StableDiffusionControlNetPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/controlnet#diffusers.StableDiffusionControlNetPipeline) (with the [`sd-dreambooth-library/mr-potato-head`](https://huggingface.co/sd-dreambooth-library/mr-potato-head) checkpoint).

In [None]:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
).to("cuda")
pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "sd-dreambooth-library/mr-potato-head", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

And then generate: 

In [None]:
prompt = "a photo of sks mr potato head, best quality, extremely detailed"
output = pipeline(prompt, canny_image)
output.images[0]

And voila! We have generated a Mr. Potato Head posing like the [famous Vermeer painting](https://en.wikipedia.org/wiki/Girl_with_a_Pearl_Earring) 😂

It's possible to combine multiple conditionings (e.g., combining canny edge map and pose) for generation. Refer to the [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/controlnet#combining-multiple-conditionings) to know more! 

Text-to-image generation is fun! It can help unlock different creative avenues that may have seemed impossible to explore before. But what if we wanted more control over the semantic properties of the generated images, such as how much a human face in the image smiles?

### Controlling semantic properties

We will control the semantic properties of the generated images through something called "semantic guidance" via the [`SemanticStableDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/semantic_stable_diffusion). To understand this better, let's first generate an image with the following prompt -- "a photo of the face of a woman".



In [None]:
prompt = "a photo of the face of a woman"
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# We're fixing the seed to have a more deterministic way of
# comparing the outputs here. Any diffusion pipeline in Diffusers
# comes with a `generator` argument in its __call__ method to
# help fix the seed. 
pipeline(prompt=prompt, generator=torch.manual_seed(0)).images[0]

Now, let's leverage the `SemanticStableDiffusionPipeline` to semantically control the output image by specifying which attributes we want to control. 

In [None]:
from diffusers import SemanticStableDiffusionPipeline

pipeline = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")

out = pipeline(
    prompt="a photo of the face of a woman",
    num_images_per_prompt=1,
    guidance_scale=7,
    editing_prompt=[
        # Concepts to apply
        "smiling, smile",  
        "glasses, wearing glasses",
    ],
    reverse_editing_direction=[True, False],  # Direction of guidance: True reverses a concept (less smiling), False applies it (add glasses)
    edit_warmup_steps=[10, 10],  # Warmup period for each concept
    edit_guidance_scale=[4, 5],  # Guidance scale for each concept
    edit_threshold=[
        0.99,
        0.975,
    ],  # Threshold for each concept. Threshold equals the percentile of the latent space that will be discarded. I.e. threshold=0.99 uses 1% of the latent dimensions
    edit_momentum_scale=0.3,  # Momentum scale that will be added to the latent guidance
    edit_mom_beta=0.6,  # Momentum beta
    edit_weights=[1, 1],  # Weights of the individual concepts against each other
    generator=torch.manual_seed(0)
)
out.images[0]

Pretty amazing! Now, we can control the semantic properties of the generated images as we did here: 

* Decrease the smile quotient a bit
* Make the subject wear glasses

It's okay if we don't understand all the arguments used in the above pipeline. The most important ones are:

* `editing_prompt`
* `edit_guidance_scale`
* `edit_weights`
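
The `edit_threshold` comment in the cell above says that a threshold of 0.99 uses only 1% of the latent dimensions. A simplified sketch of that idea on a plain list is to keep the largest-magnitude entries and zero out the rest. The helper `keep_top_fraction` is hypothetical and is not the pipeline's actual implementation.

```python
def keep_top_fraction(values, threshold):
    """Zero out all but the largest-magnitude (1 - threshold) fraction of entries."""
    magnitudes = sorted(abs(v) for v in values)
    k = max(1, round(len(values) * (1 - threshold)))  # how many entries to keep
    cutoff = magnitudes[-k]
    return [v if abs(v) >= cutoff else 0.0 for v in values]

values = [float(i) for i in range(-50, 50)]  # 100 toy "latent" entries
kept = keep_top_fraction(values, threshold=0.99)
print(sum(1 for v in kept if v != 0.0))  # 1
```

A higher threshold therefore confines the edit to fewer, more strongly affected dimensions.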

If this type of controllable generation fascinates you, we welcome you to check the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/semantic_stable_diffusion) and the [SEGA paper](https://arxiv.org/abs/2301.12247). 

With this, we conclude our exploration of controllable image generation. To learn more about other similar techniques, check out [our documentation](https://huggingface.co/docs/diffusers/main/en/using-diffusers/controlling_generation). 

In the final leg, we'll take Diffusers to go beyond images and generate videos from text prompts 🤯

## Going beyond images -- text-to-video generation 🎥

It turns out that it's possible to extend a text-to-image diffusion system such that it generates temporally coherent video content. To the best of our knowledge, it was first explored in [Text-to-Video Zero](https://huggingface.co/docs/diffusers/main/en/api/pipelines/text_to_video_zero). 

With the [`TextToVideoZeroPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/text_to_video_zero#diffusers.TextToVideoZeroPipeline), the interface for generating a video from a prompt remains almost the same as our text-to-image generation pipeline 🤗

Let's first clear some memory as video generation is a memory-heavy task.



In [None]:
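# Delete the heavy objects we no longer need, including the
# two benchmark pipelines created earlier in the notebook.
del pipeline, controlnet, fp32_pipeline, fp16_pipeline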

# Let's free some memory. 
import gc
gc.collect()
torch.cuda.empty_cache()

And now, let's generate a video!

In [None]:
import imageio
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda playing guitar in the mountains"
result = pipe(prompt=prompt).images
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("video.mp4", result, fps=2)

In [None]:
#@title ##### Show video
from IPython.display import HTML
from base64 import b64encode

mp4 = open("video.mp4", "rb").read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

That's cute! 

Just like we controlled different aspects of the generated images, we can do the same for videos too! For example, we can condition the video generation process on certain poses along with the prompt. 

And the good news is that we can leverage the components we have already seen for doing this:

* `ControlNet`
* `StableDiffusionControlNetPipeline`

To make the generated video frames temporally coherent, we'll leverage the special [`CrossFrameAttnProcessor`](https://github.com/huggingface/diffusers/blob/7a32b6beeb0cfdefed645253dce23d9b0a78597f/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py#L39) class. Diffusers supports many ["attention processor" classes](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py) for different things. A discussion on that is out of the scope of this session. 
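
For intuition only, here is a toy sketch of the cross-frame idea in plain Python. Each frame keeps its own queries but attends to the keys and values of the first frame, which ties the frames' appearance together. The functions and the tiny vectors below are hypothetical and greatly simplified. They are not the `CrossFrameAttnProcessor` implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def cross_frame_attention(frames):
    """Every frame uses its own queries with the first frame's keys and values."""
    first = frames[0]
    return [attention(f["q"], first["k"], first["v"]) for f in frames]

frames = [
    {"q": [[1.0, 0.0]], "k": [[1.0, 0.0], [0.0, 1.0]], "v": [[10.0, 0.0], [0.0, 10.0]]},
    {"q": [[0.0, 1.0]], "k": [[5.0, 5.0], [5.0, 5.0]], "v": [[-1.0, -1.0], [-1.0, -1.0]]},
]
outputs = cross_frame_attention(frames)
print(outputs[1])  # built only from frame 0's values, never from frame 1's
```

In this toy setup the second frame's own keys and values are ignored entirely, so its output is a blend of the first frame's values.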

We first download a video whose frames show different poses yet remain temporally coherent:

In [None]:
from huggingface_hub import hf_hub_download

filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4"
repo_id = "PAIR/Text2Video-Zero"
video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)

In [None]:
#@title ##### Show video
from IPython.display import HTML
from base64 import b64encode

mp4 = open(video_path, "rb").read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

Then we extract the individual frames from the video sequentially: 

In [None]:
from PIL import Image
import imageio

reader = imageio.get_reader(video_path, "ffmpeg")
frame_count = 8
pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]

In the next cell, we initialize the `ControlNet` model for poses and the `StableDiffusionControlNetPipeline`. We then assign `CrossFrameAttnProcessor` to be the attention processor of `StableDiffusionControlNetPipeline`. 

In [None]:
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
).to("cuda")
pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Set the attention processor.
pipeline.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipeline.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))

And now we can make Darth Vader dance in the poses we want:

In [None]:
# Fix latents for all frames. This is to avoid unexpected randomness in the process.
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)

prompt = "Darth Vader dancing in a desert"
result = pipeline(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
imageio.mimsave("video.mp4", result, fps=4)
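
The latent shape `(1, 4, 64, 64)` used above is not arbitrary. Assuming Stable Diffusion v1's 4 latent channels and its 8x spatial downsampling in the VAE, it corresponds to a single 512x512 image. The hypothetical helper below makes that relationship explicit, with both numbers exposed as assumptions through default arguments.

```python
def latent_shape(height, width, batch_size=1, latent_channels=4, vae_scale_factor=8):
    """Shape of the latent tensor for a given output image size.

    The defaults (4 channels, scale factor 8) are assumptions matching
    Stable Diffusion v1-style checkpoints.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be divisible by the VAE scale factor")
    return (batch_size, latent_channels, height // vae_scale_factor, width // vae_scale_factor)

print(latent_shape(512, 512))                # (1, 4, 64, 64)
print(latent_shape(512, 512, batch_size=8))  # (8, 4, 64, 64)
```

Repeating the single latent across the batch, as the cell above does, gives all eight frames the same starting noise.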

In [None]:
#@title ##### Show video
from IPython.display import HTML
from base64 import b64encode

mp4 = open("video.mp4", "rb").read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

And that's it for all the code cavalry for today's session. Let's quickly recap what we learned and how we can explore Diffusers further to unlock the massive creative potential lying ahead.

## Conclusion

We covered quite a lot of things today, starting from text-to-image generation all the way up to text-to-video generation. In between, we learned how to improve the quality of the generated images with better prompts. We also learned to control different aspects of the generated images by exploring different techniques. 

With that said, we barely scratched the surface of Diffusers today. There's a lot more we can do with it. Below are some pointers to explore further if this area of work interests you:

* [Running](https://huggingface.co/blog/if) heavy pipelines on consumer GPUs
* [Training](https://huggingface.co/docs/diffusers/main/en/training/overview) your own diffusion models easily 
* [Models](https://huggingface.co/docs/diffusers/main/en/api/pipelines/if) that are particularly good at generating text content in images
* [Audio generation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audio_diffusion) with Diffusers

Ciao!

<div align="center">
<figure>
<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/datacamp_bunny.png" width=500/>
<figcaption>Image generated with <a href="https://huggingface.co/spaces/DeepFloyd/IF">IF</a>. Prompt: high quality dslr photo of a happy bunny holding a sign saying "DataCamp"</figcaption>
</figure>
</div>