# Stable Diffusion with Diffusers

This notebook demonstrates how to use the [Diffusers package](https://huggingface.co/docs/diffusers/index) from Hugging Face to run Stable Diffusion. [Hugging Face](https://huggingface.co) is a community-driven platform that provides a comprehensive suite of open-source libraries and tools for machine learning. Think of it as a kind of "GitHub for machine learning". The Diffusers package provides a simple API and a number of pretrained models that make it easy to run and experiment with diffusion models.

For extra references, you can look into the [official Hugging Face inference notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb#scrollTo=yW14FA-tDQ5n) and this [fastAI inference example](https://github.com/fastai/diffusion-nbs/blob/master/stable_diffusion.ipynb).

To use this notebook you will need to install diffusers (with your conda environment active) with:
```
pip install --upgrade diffusers\[torch\]
pip install transformers
```

> **_NOTE:_**  ONLY IF USING COLAB, run this:

In [None]:
!pip install --upgrade diffusers\[torch\]
!pip install transformers

Then mount your Google Drive so you have access to your data

In [None]:
import os
import sys

# Only mount the drive and change directory when running on Colab
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive/')
    os.chdir('drive/My Drive/DMLAP-2025/python') # change to your directory accordingly

And make sure that you have the "images" directory from this week's "09-diffusion" copied inside the folder of your choice. Alternatively, you may manually upload the directory to Colab, but it will be gone once you close the notebook!

> **_NOTE:_**  END COLAB NOTE

## Running Stable Diffusion
To run Stable Diffusion, you will need to distinguish whether you are running this on a Mac M1/M2/... or on Linux/Windows with an NVIDIA GPU.

On a Mac, diffusion will need a "warmup" phase to work properly (see [this link](https://huggingface.co/docs/diffusers/optimization/mps)).

In [3]:
import torch
import numpy as np
import matplotlib.pyplot as plt

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
device

'mps'

The next cell loads the pretrained model, downloading it the first time the cell is run (which might take a while).

You can set the model version by modifying the `sd_version` variable. Read [this document](https://huggingface.co/docs/diffusers/en/stable_diffusion) for performance recommendations and tricks.

In [None]:
from diffusers import StableDiffusionPipeline
from diffusers import DPMSolverMultistepScheduler

sd_version = '2.1'

if sd_version == '2.1':
    model_key = "stabilityai/stable-diffusion-2-1-base"
elif sd_version == '2.0':
    model_key = "stabilityai/stable-diffusion-2-base"
elif sd_version == '1.5':
    model_key = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(
    model_key, 
    torch_dtype=torch.float16,  # half-precision floats reduce memory usage and speed up inference
    use_safetensors=True)  # load safetensors weights, which cannot run arbitrary code the way pickle files can

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) # faster scheduler for the denoising process

pipe = pipe.to(device)

if device=='mps':
    # Recommended if your computer has < 64 GB of RAM
    pipe.enable_attention_slicing()
if device=='cuda':
    pipe.enable_sequential_cpu_offload()
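
The loading cell above leaves `model_key` undefined if `sd_version` is mistyped, and it requests `float16` even when the device is the CPU, where half precision is often slow or poorly supported. Below is a minimal sketch of one way to make both choices explicit. `resolve_model_key` and `dtype_name_for` are hypothetical helpers (not part of diffusers), and falling back to `float32` on CPU is an assumption rather than something the notebook prescribes. The model ids are the ones used in this notebook.

```python
# Hypothetical helpers (not part of diffusers) that make the configuration explicit.
MODEL_KEYS = {
    "2.1": "stabilityai/stable-diffusion-2-1-base",
    "2.0": "stabilityai/stable-diffusion-2-base",
    "1.5": "runwayml/stable-diffusion-v1-5",
}

def resolve_model_key(sd_version):
    """Return the Hub model id for a version string, failing loudly on typos."""
    try:
        return MODEL_KEYS[sd_version]
    except KeyError:
        raise ValueError(
            f"Unknown sd_version {sd_version!r}; expected one of {sorted(MODEL_KEYS)}"
        ) from None

def dtype_name_for(device):
    """Assumption: use half precision only on GPU-like devices, float32 on CPU."""
    return "float16" if device in ("cuda", "mps") else "float32"

print(resolve_model_key("2.1"), dtype_name_for("cpu"))
```

In the notebook, `getattr(torch, dtype_name_for(device))` would turn the returned name into the matching `torch` dtype for `torch_dtype`.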


#### Generate the image 

Observe the parameters that we define.

Defining a `seed` ensures that the model starts the denoising process from the same initial noise each time, which means you'll get the same output whenever you run the same cell with the same parameters. This is useful for reproducibility. If you skip defining the `generator`, the model will use a random seed each time, which results in different outputs for each run, even with the same prompt.

The `guidance_scale` defines the level of influence from the text prompt (positive or negative). Higher values (7-10) mean the model will closely follow the prompt and try to match your description as precisely as possible. Lower values (2-5) give the model more creative freedom and allow for more variation in the generated image, with less strict adherence to the prompt.
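
To make the effect of `guidance_scale` more concrete, here is a small sketch of classifier-free guidance, the formula behind this parameter. At every denoising step the model predicts the noise twice, once without the prompt and once with it, and the two predictions are combined. The toy lists below stand in for those predictions; the real pipeline does the equivalent on tensors, and `apply_guidance` is a name chosen for this illustration only.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction and
    move towards (and, for scales above 1, past) the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

# Toy numbers standing in for the two noise predictions of the model
noise_uncond = [0.0, 1.0, -1.0]
noise_text = [1.0, 1.0, 0.0]

for scale in (1.0, 3.0, 7.5):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

With a scale of 1 the result is simply the text-conditioned prediction. Larger scales exaggerate the difference between the two predictions, which is why high values follow the prompt more strictly.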

Increasing `num_inference_steps` generally improves quality, up to a point, but makes the denoising process slower. If you don't specify the number of steps, the pipeline uses 50 by default.

In [None]:
generator = torch.manual_seed(99)

prompt = "A cubist painting of the Star Trek character Spock, high quality, trending on artstation"
image = pipe(
    prompt, 
    guidance_scale=7.5, 
    generator=generator, 
    num_inference_steps=20).images[0]

plt.imshow(np.array(image))
plt.title(prompt)
plt.axis('off')
plt.show()

## Text guided Image-to-Image Generation

We can also use Stable Diffusion with a *seed image*. 

Load an image of your choice and experiment below.   

In [None]:
from PIL import Image 

seed_image = Image.open("images/spock.jpg")

seed_image

Now let's load the [`StableDiffusionImg2ImgPipeline`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/img2img) model, similarly to what we did earlier with the normal stable diffusion pipeline:

In [None]:
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_key, 
    torch_dtype=torch.float16, 
    use_safetensors=True)

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) # Faster scheduler

pipe = pipe.to(device)

if device=='mps':
    # Recommended if your computer has < 64 GB of RAM
    pipe.enable_attention_slicing()
if device=='cuda':
    pipe.enable_sequential_cpu_offload()

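The `StableDiffusionImg2ImgPipeline` takes an additional `image` parameter that conditions the diffusion process so that the result resembles the given image. The `strength` parameter (between 0 and 1) controls how much noise is added to the seed image: values near 0 keep the result close to the input, while values near 1 largely ignore it.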

In [None]:
prompt = "A punk with a mohawk"

generator = torch.manual_seed(32) 

image = pipe(prompt, 
             image=seed_image, # the seed image; along with `strength`, this is what differs from the previous example
             guidance_scale=7.5, 
             strength=0.7, 
             generator=generator, 
             num_inference_steps=20).images[0]

plt.imshow(np.array(image))
plt.title(prompt)
plt.axis('off')
plt.show()
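
One consequence of `strength` in the call above is easy to miss: in image-to-image mode the pipeline skips the earliest part of the noise schedule, so fewer denoising steps are actually run than `num_inference_steps` suggests. The sketch below mirrors how the diffusers img2img pipeline computes this at the time of writing; `img2img_steps` is a hypothetical helper written for this illustration.

```python
def img2img_steps(num_inference_steps, strength):
    """Number of denoising steps actually run for a given strength (0 to 1)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)

for strength in (0.25, 0.5, 0.75, 1.0):
    print(strength, img2img_steps(20, strength))
```

If a low `strength` gives rough results, raising `num_inference_steps` compensates for the steps that are skipped.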

The diffusers API provides a number of [other pipelines](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview) that can be useful for other tasks, such as image inpainting or depth-to-image.

## Conditioning Stable Diffusion with ControlNet

[ControlNet](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_Adding_Conditional_Control_to_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf) is a recent advancement in image generation with Stable Diffusion. It allows conditioning the Stable Diffusion generation pipeline on an image input (similarly to pix2pix). It works by attaching a smaller neural network to a pre-trained Stable Diffusion model. The weights of the Stable Diffusion model are frozen, and the combined model is trained to steer Stable Diffusion towards producing images consistent with the provided input conditions.

The Hugging Face Diffusers API comes with ControlNet and a number of pre-trained models that can be used for a number of tasks, such as guiding Stable Diffusion with edges, poses or depth maps (see the [ControlNet Hugging Face page](https://huggingface.co/lllyasviel/sd-controlnet-canny) for some available models).

Here we will use the ["sd-controlnet-canny"](https://huggingface.co/lllyasviel/sd-controlnet-canny) model, which guides stable diffusion with edge images. Let's start by using skimage to create edges from an input image, as we did with pix2pix:

In [None]:
import numpy as np
import cv2
import PIL.Image as Image
from skimage import io, feature

def apply_canny_skimage(img, sigma=1.5, invert=False):
    """Return a 3-channel edge image (white edges on black) from an RGB array."""
    grayimg = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    # feature.canny returns a boolean mask; scale it to 0-255
    edges = (feature.canny(grayimg, sigma=sigma)*255).astype(np.uint8)
    if invert:
        edges = cv2.bitwise_not(edges)
    return cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)

img = io.imread("images/spock.jpg")
edges =  apply_canny_skimage(img)

# ControlNet expects a PIL Image as input
edges_image = Image.fromarray(edges)

plt.figure(figsize=(8,4))
plt.subplot(121)
plt.imshow(img)
plt.subplot(122)
plt.imshow(edges)
plt.show()
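
Stable Diffusion works on image sizes that are multiples of 8, and version 1.5 was trained on 512x512 images, so large photos are usually resized before being used as a condition image. Here is a small sketch of a helper that computes such a size while keeping the aspect ratio. `fit_size` is a hypothetical function, and the defaults (`target=512`, `multiple=8`) are assumptions you may want to change.

```python
def fit_size(width, height, target=512, multiple=8):
    """Scale (width, height) so the longer side is at most `target`,
    rounding both sides down to a multiple of `multiple`."""
    longest = max(width, height)

    def scaled(side):
        value = side * target // longest  # integer math avoids float rounding
        return max(multiple, value // multiple * multiple)

    return scaled(width), scaled(height)

print(fit_size(1920, 1080))
print(fit_size(600, 800))
```

The resulting tuple could be passed to PIL's `Image.resize` before running the edge detection.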

Now we set up the ControlNet model. Note that the ControlNet checkpoints used here were trained for Stable Diffusion 1.5, so we load that version as the base model:

In [None]:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import PIL.Image as Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", 
    torch_dtype=torch.float16)
    
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", # Controlnet is currently working only with SD 1.5
    torch_dtype=torch.float16,
    controlnet=controlnet, 
    safety_checker=None)

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) # Faster scheduler

pipe = pipe.to(device)

if device=='mps':
    pipe.enable_attention_slicing()
if device=='cuda':
    pipe.enable_sequential_cpu_offload()

And generate an image, by passing the edge image as an input to the `image` argument. We will use 15 inference steps here to save a little bit of time:

In [None]:
prompt = "A sculpture made of clay on a white background"

generator = torch.manual_seed(0) # Removing manual_seed will always generate different images 

image = pipe(
    prompt,
    image=edges_image,
    generator=generator,
    num_inference_steps=15,
).images[0]

plt.imshow(np.array(image))
plt.title(prompt)
plt.axis('off')
plt.show()

## Procedural Input for Image Generation

In the previous example, we used edge detection to produce the ControlNet condition image. However, you can also create interesting visual results by procedurally generating an *edge-like* image. We will use the [py5canvas](https://github.com/colormotor/py5canvas) API by Daniel Berio to make things easier (as long as you are able to install it correctly). Here are some updated instructions that should work, in case you have not installed it yet. **Skip these steps if you have already installed and used py5canvas**.

With your environment active:
```
pip install face_recognition
pip install pyglet
```
Then:
```
conda install conda-forge::cairo
```
Followed by:
```
conda install conda-forge::pycairo
```
Finally install py5canvas with: 
```
pip install git+https://github.com/colormotor/py5canvas.git
```

Now we can create some graphics:


> **_NOTE:_**  ONLY IF USING COLAB, run this first:

In [None]:
!sudo apt install libcairo2-dev pkg-config python3-dev
!pip install git+https://github.com/colormotor/py5canvas.git

> **_NOTE:_**  END COLAB NOTE

In [None]:
from py5canvas import canvas
c = canvas.Canvas(512, 512)

np.random.seed(123)
c.background(0)
c.stroke_weight(2)
c.stroke(255)
c.no_fill()
#c.set_rect_mode('center')
for i in range(40):
    c.rectangle(np.random.uniform(0, c.width, 2), np.random.uniform(10, 90, 2))

# Retrieve the canvas image and convert it from a numpy array to a PIL Image. 
img = c.get_image()
edges_image = Image.fromarray(img)
# show it
edges_image

For this case, we may get better results by using the ["sd-controlnet-scribble"](https://huggingface.co/lllyasviel/sd-controlnet-scribble) model, which is similar to the Canny edge detection model, but is trained on scribble-like images instead of edge maps. Note that both the *scribble* and *canny* ControlNet models are trained on images that have a specific line width, so the stroke width of the generated condition image will affect the quality of the results. Try adjusting the line width with the `stroke_weight` function until the generated results match your preference.

In [None]:
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", 
    torch_dtype=torch.float16)
    
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", # Controlnet is currently working only with SD 1.5
    torch_dtype=torch.float16,
    controlnet=controlnet, 
    safety_checker=None)

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) # Faster scheduler

pipe = pipe.to(device)

if device=='mps':
    pipe.enable_attention_slicing()
if device=='cuda':
    pipe.enable_sequential_cpu_offload()

This time, we will also add a `negative_prompt`, which is a caption describing outputs we wish to avoid.

In [None]:
prompt = "A white wall with many portraits of Star Trek character Spock"

generator = torch.manual_seed(420) # Removing the generator will always result in different images 

image = pipe(
    prompt,
    negative_prompt="blurry, user interface, captions",
    image=edges_image,
    generator=generator,
    controlnet_conditioning_scale=0.9,
    guidance_scale=7.5, 
    guess_mode=False, # Guess mode will try to "guess" the contents of the input, so it can be used also without a prompt
    num_inference_steps=10,
).images[0]

plt.imshow(np.array(image))
plt.title(prompt)
plt.axis('off')
plt.show()

## Procedural Input for Animations

We can also create a procedural animation using a similar technique together with ffmpeg. 

Let's create an animation of waves, saving each frame of the animation in an *input_animation_frames* directory (for later visualisation) and storing the frames in a `frame_images` list. To keep the running time down, we will create only 10 frames for the moment. This will result in a slightly choppy animation.

In [None]:
import os
from py5canvas import canvas

# Create a folder for input animation 
output_folder = 'input_animation_frames'
if not os.path.isdir(output_folder):
    os.mkdir(output_folder)

# Modify this function for different animations
def animation_frame(c, frame_index, n_frames):
    c.background(0)
    c.stroke_weight(1) 
    c.stroke(255)
    c.no_fill()

    phase = (np.pi*2)/n_frames * frame_index
    for y in np.linspace(0, 1, 13)[1:-1]: # Loop over 0-1, skipping the first and last rows
        c.begin_shape()
        for x in np.linspace(0, 1, 100):
            c.vertex(x*c.width, y*c.height +
                     np.sin(x*np.pi*4 + phase + y*6)*np.cos(y*np.pi*6 + x*5.2)*20) # Bit arbitrary moving waves creating a loop
        c.end_shape()
    c.save_image(os.path.join(output_folder, 'frame_%02d.png'%(frame_index+1)))
    return Image.fromarray(c.get_image())

# Create the canvas once
c = canvas.Canvas(512, 512)

# Generate all the frames of the animation
n_frames = 10
frame_images = []
for i in range(n_frames):
    frame_images.append( animation_frame(c, i, n_frames) )
# preview one frame
frame_images[0]
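
The animation above loops seamlessly because the phase advances by `2*pi/n_frames` per frame, so the frame that would follow the last one is identical to the first. A minimal sketch that checks this with plain Python is shown below. `wave_phase` and `wave_offset` are hypothetical helpers that copy the formula from `animation_frame`, and the sample point `(0.25, 0.25)` is arbitrary.

```python
import math

def wave_phase(frame_index, n_frames):
    """Phase offset for a frame; one full turn (2*pi) is spread over n_frames."""
    return (2 * math.pi) / n_frames * frame_index

def wave_offset(x, y, phase, amplitude=20):
    """Vertical displacement of a wave point, same formula as animation_frame."""
    return (math.sin(x * math.pi * 4 + phase + y * 6)
            * math.cos(y * math.pi * 6 + x * 5.2) * amplitude)

n_frames = 10
first = wave_offset(0.25, 0.25, wave_phase(0, n_frames))
wrapped = wave_offset(0.25, 0.25, wave_phase(n_frames, n_frames))

# The frame after the last one matches the first, so the loop has no visible jump
print(math.isclose(first, wrapped, abs_tol=1e-9))
```

Increasing `n_frames` makes the motion smoother without breaking the loop, at the cost of more ControlNet generations later on.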

The code above generates the frames and previews the first one. Now let's use ffmpeg to convert the saved frames to an animated GIF that we can view in the notebook. We will use a low frame rate (10 frames per second), so the 10-frame animation will last one second.

In [None]:
!ffmpeg -y -f image2 -framerate 10 -i 'input_animation_frames/frame_%02d.png' -loop 0 input_animation.gif

The command above, like many command-line commands, is quite cryptic to read. Consult the ffmpeg documentation if you need clarification.
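
As a reading aid, here is the same ffmpeg call written as a Python argument list with one comment per option. This is a sketch: `gif_command` is a hypothetical helper, and the call only runs if ffmpeg is installed and the frames directory exists.

```python
import os
import shutil
import subprocess

def gif_command(frames_dir, output_gif, framerate=10):
    """Build the ffmpeg argument list used above, one option per element."""
    return [
        "ffmpeg",
        "-y",                                   # overwrite the output file without asking
        "-f", "image2",                         # read the input as an image sequence
        "-framerate", str(framerate),           # input frames per second
        "-i", f"{frames_dir}/frame_%02d.png",   # frame_01.png, frame_02.png, ...
        "-loop", "0",                           # 0 = loop the GIF forever
        output_gif,
    ]

cmd = gif_command("input_animation_frames", "input_animation.gif")
print(" ".join(cmd))

# Only run the command if ffmpeg is installed and the frames directory exists
if shutil.which("ffmpeg") and os.path.isdir("input_animation_frames"):
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and changing `framerate` changes how long the GIF lasts: the number of frames divided by the frame rate.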

Now to visualise the animation we can use the `IPython` module.

In [None]:
import IPython
IPython.display.Image('input_animation.gif')

Finally, let's take the image for each frame and use ControlNet to transform it.

Note that we force the seed to be the same for each iteration of the loop. This will give some form of temporal coherence to the animation. Nevertheless, the animation will probably flicker.

In [None]:
# Create a folder for the output animation
output_folder = 'output_animation_frames'
if not os.path.isdir(output_folder):
    os.mkdir(output_folder)

prompt = "Colorful 3d ribbons, Cinema 4d, Maxon renderer"
preview = -1 # Set to 0 or greater to just preview a frame, -1 for saving video
generated_images = []
for i, frame in enumerate(frame_images):
    print('Frame %d of %d'%(i+1, len(frame_images)))
    generator = torch.manual_seed(677) # Removing the generator will always result in different images 
    if preview > -1:
        frame = frame_images[preview]

    image = pipe(
        prompt,
        negative_prompt="blurry, user interface, captions",
        image=frame,
        generator=generator,
        controlnet_conditioning_scale=0.7,
        guidance_scale=6.5, 
        guess_mode=False, # Guess mode will try to "guess" the contents of the input, so it can be used also without a prompt
        num_inference_steps=10, # 10 inference steps is faster and seems to be enough
    ).images[0]
    if preview > -1:
        display(image)
        break
    generated_images.append(image)
    image.save(os.path.join(output_folder, 'frame_%02d.png'%(i+1)))
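
The loop above re-creates the generator with the same seed for every frame. The sketch below shows why that matters, using Python's `random` module as a stand-in for `torch.manual_seed` (an assumption made so the example runs without a GPU or model). `fake_initial_noise` is a hypothetical function that plays the role of the initial latent noise.

```python
import random

def fake_initial_noise(rng, size=4):
    """Stand-in for the initial latent noise drawn at the start of each generation."""
    return [round(rng.gauss(0.0, 1.0), 3) for _ in range(size)]

# Re-seeding inside the loop: every frame starts from identical noise
reseeded = [fake_initial_noise(random.Random(677)) for _ in range(3)]

# Seeding once before the loop: each frame draws fresh noise
rng = random.Random(677)
seeded_once = [fake_initial_noise(rng) for _ in range(3)]

print(reseeded[0] == reseeded[1] == reseeded[2])
print(seeded_once[0] == seeded_once[1])
```

Starting every frame from the same noise means that only the condition image changes between frames, which is what gives the animation its partial temporal coherence.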


Again let's create a gif, this time from the *output_animation_frames* directory.

In [None]:
!ffmpeg -y -f image2 -framerate 10 -i 'output_animation_frames/frame_%02d.png' -loop 0 output_animation.gif

And visualise it:

In [None]:
IPython.display.Image('output_animation.gif')

## Possible Things to Experiment with:

- Run the full notebook and understand how the different approaches work

- Test out the notebook with your own prompts, search for prompting tips and tricks, play around with negative prompting, etc.

- Test out the notebook with different parameters to direct the generated outcome to your liking

- Introduce computational thinking to your making process, e.g. build your own prompt generator and use it with SD, export generated images in each inference step and use these as seed_images for the next generation, etc.

- Search for different models on Hugging Face and use them in the same way. E.g., for upscaling images, look into [this notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/latent_diffusion_upscaler.ipynb). The model was originally released in the [Latent Diffusion repo](https://github.com/CompVis/latent-diffusion). It's a simple 4x super-resolution diffusion model, and it is not conditioned on text.

    `from diffusers import LDMSuperResolutionPipeline`

    `pipe = LDMSuperResolutionPipeline.from_pretrained("CompVis/ldm-super-resolution-4x-openimages").to(device)`

    `...`
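
As a starting point for the prompt generator idea above, here is a minimal sketch that assembles prompts from word lists. The lists and the `make_prompt` function are placeholders invented for this illustration, so replace them with your own vocabulary and structure.

```python
import random

# Placeholder vocabularies; swap in your own
SUBJECTS = ["a lighthouse", "a robot gardener", "a city street at night"]
STYLES = ["cubist painting", "watercolor sketch", "35mm photograph"]
MODIFIERS = ["high quality", "dramatic lighting", "muted colors"]

def make_prompt(rng):
    """Combine one subject, one style and two distinct modifiers into a prompt."""
    subject = rng.choice(SUBJECTS)
    style = rng.choice(STYLES)
    extras = rng.sample(MODIFIERS, 2)
    return f"A {style} of {subject}, " + ", ".join(extras)

rng = random.Random(7)  # fixed seed so the same prompts come back on every run
for _ in range(3):
    print(make_prompt(rng))
```

Each generated string can be passed as `prompt` to any of the pipelines in this notebook.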