# ControlNet in Practice

**Module 7.1, Lesson 2** | CourseAI

You know the architecture: trainable encoder copy, zero convolutions, additive features at skip connections. This notebook is where the architecture becomes a tool. Real preprocessors, real images, real control.

**What you will do:**
- Extract Canny edges from a photograph with different threshold settings and observe how preprocessing quality affects generation
- Use three different preprocessors (Canny, depth, OpenPose) and verify the pipeline API is identical for all of them
- Sweep the conditioning scale from 0.3 to 2.0 and discover the control-creativity tradeoff firsthand
- Stack two ControlNets (Canny + depth) and see how complementary spatial constraints compose
- Choose your own source image, preprocessor(s), and settings to create a controlled composition

**For each exercise, PREDICT the output before running the cell.**

Every concept in this notebook comes from the lesson: preprocessor types, conditioning scale as a volume knob, multi-ControlNet stacking by additive composition. No new theory, just hands-on practice with real models.

**Estimated time:** 40–60 minutes (model downloads may take several minutes on first run).

**VRAM requirements:** This notebook is designed for a T4 GPU (16 GB). It carefully manages GPU memory by never loading two full pipelines simultaneously. Follow the cleanup cells between exercises.

---

## Setup

Run this cell to install dependencies, import everything, and configure the environment.

**Important:** Set the runtime to GPU before running. In Colab: Runtime → Change runtime type → T4 GPU.

The first run will download model weights (~5 GB for SD v1.5 + ~1.5 GB per ControlNet checkpoint). Subsequent runs use cached weights.

In [None]:
!pip install -q diffusers transformers accelerate safetensors controlnet_aux opencv-python-headless

import torch
import torch.nn as nn
import numpy as np
import cv2
import matplotlib.pyplot as plt
import gc
from PIL import Image
from diffusers import (
    StableDiffusionControlNetPipeline,
    StableDiffusionPipeline,
    ControlNetModel,
    UniPCMultistepScheduler,
)

# Reproducible results
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
else:
    print('WARNING: No GPU detected. This notebook requires a GPU for image generation.')
    print('In Colab: Runtime → Change runtime type → T4 GPU')

print('\nSetup complete.')

## Shared Helpers

Utility functions used across multiple exercises.

In [None]:
def download_sample_image(url, filename="sample.jpg"):
    """Download an image from a URL and return it as a PIL Image."""
    import urllib.request
    urllib.request.urlretrieve(url, filename)
    return Image.open(filename).convert("RGB")


def show_images(images, titles, figsize=None, suptitle=None):
    """Display a row of images with titles."""
    n = len(images)
    if figsize is None:
        figsize = (5 * n, 5)
    fig, axes = plt.subplots(1, n, figsize=figsize)
    if n == 1:
        axes = [axes]
    for ax, img, title in zip(axes, images, titles):
        ax.imshow(img)
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    if suptitle:
        plt.suptitle(suptitle, fontsize=13, y=1.02)
    plt.tight_layout()
    plt.show()


def cleanup_pipeline(pipe):
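    """Drop this function's reference to a pipeline and empty the CUDA cache.

    The caller must also `del` its own variable (or rebind it) before the
    weights can actually be freed.
    """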
    del pipe
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        allocated = torch.cuda.memory_allocated() / 1e9
        print(f"GPU memory after cleanup: {allocated:.2f} GB allocated")


def make_generator(seed):
    """Create a torch Generator with the given seed for reproducible results."""
    return torch.Generator(device=device).manual_seed(seed)


# Download a sample image that works well for all exercises.
# This is a Creative Commons photo of a person standing in front of architecture,
# giving us edges (building), depth (foreground/background), and pose (person).
SOURCE_URL = "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
source_image = download_sample_image(SOURCE_URL, "source.jpg")
source_image = source_image.resize((512, 512))

show_images([source_image], ["Source Image (512x512)"])
print("This source image will be used throughout the notebook.")
print("It has clear edges, depth layering, and a visible person, which makes it suitable for all three preprocessors.")

---

## Exercise 1: Canny Edge Preprocessing and Generation `[Guided]`

From the lesson: preprocessing quality is the **most impactful practical decision** you will make with ControlNet. The spatial map is what ControlNet follows: garbage in, garbage out.

Canny edge detection applies two thresholds (low, high) to each pixel's gradient magnitude:
- Pixels below the low threshold are discarded
- Pixels above the high threshold are kept as strong edges
- Pixels in between are kept only if they connect to a strong edge
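
The three rules above are Canny's hysteresis step. Here is a minimal pure-Python sketch on a single row of gradient magnitudes, assuming simple left/right adjacency. `hysteresis_1d` and the sample values are hypothetical, made up for illustration; `cv2.Canny` itself works on 2D gradients and also performs non-maximum suppression, which this toy version skips.

```python
def hysteresis_1d(magnitudes, low, high):
    """Classify each position as edge (True) or not (False).

    Toy 1D version of the double-threshold rule:
      - magnitude >= high  -> strong, always kept
      - magnitude <  low   -> discarded
      - in between (weak)  -> kept only if linked to a strong position
                              through a chain of adjacent weak positions
    """
    n = len(magnitudes)
    keep = [m >= high for m in magnitudes]
    weak = [low <= m < high for m in magnitudes]

    # Grow outward from strong positions through adjacent weak positions.
    stack = [i for i in range(n) if keep[i]]
    while stack:
        i = stack.pop()
        for j in (i - 1, i + 1):
            if 0 <= j < n and weak[j] and not keep[j]:
                keep[j] = True
                stack.append(j)
    return keep


row = [10, 120, 150, 250, 130, 40, 160, 30]  # made-up gradient magnitudes
for low, high in [(50, 100), (100, 200), (200, 300)]:
    print((low, high), hysteresis_1d(row, low, high))
```

With this row, the isolated value 160 survives at (50, 100), is dropped at (100, 200) because it is weak and not connected to anything strong, and at (200, 300) nothing survives at all.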

We will extract Canny edges at three different threshold pairs, then generate an image with ControlNet for each. The goal: see how preprocessing quality directly affects output quality.

**Before running, predict:**
- Which threshold setting, (50, 100), (100, 200), or (200, 300), will produce the best-controlled generation?
- What goes wrong with too-low thresholds (too many edges)?
- What goes wrong with too-high thresholds (too few edges)?

In [None]:
# Step 1: Extract Canny edges at three threshold settings

source_np = np.array(source_image)

# Three threshold pairs: too many edges, good edges, too few edges
threshold_pairs = [
    (50, 100),   # Low thresholds: picks up lots of texture/noise edges
    (100, 200),  # Moderate thresholds: clean structural edges
    (200, 300),  # High thresholds: only the strongest edges survive
]

canny_maps = {}
for low, high in threshold_pairs:
    edges = cv2.Canny(source_np, low, high)
    # Convert to 3-channel RGB PIL Image (ControlNet expects RGB)
    edges_rgb = np.stack([edges] * 3, axis=-1)
    canny_maps[(low, high)] = Image.fromarray(edges_rgb)

# Display the three edge maps side by side
show_images(
    [source_image] + list(canny_maps.values()),
    ["Source"] + [f"Canny ({lo}, {hi})" for lo, hi in threshold_pairs],
    suptitle="Same photo, three Canny threshold settings",
)

print("Low thresholds (50, 100): many edges, including texture and noise.")
print("Moderate thresholds (100, 200): clean structural edges, good detail.")
print("High thresholds (200, 300): only strongest edges, missing structure.")

In [None]:
# Step 2: Generate with ControlNet using each edge map
#
# VRAM note: We load one pipeline, generate all three images, then clean up.
# All models use float16 to fit within T4's 16 GB VRAM.

controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet_canny,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "a beautiful watercolor painting of a woman, masterpiece, high quality"
seed = 42
num_steps = 20

generated_images = {}
for (low, high), edge_map in canny_maps.items():
    generator = make_generator(seed)
    result = pipe(
        prompt,
        image=edge_map,
        num_inference_steps=num_steps,
        generator=generator,
    ).images[0]
    generated_images[(low, high)] = result

print("Generation complete for all three threshold settings.")

In [None]:
# Step 3: Compare the results

fig, axes = plt.subplots(2, 4, figsize=(20, 10))

# Top row: edge maps
axes[0][0].imshow(source_image)
axes[0][0].set_title("Source Image", fontsize=11)
axes[0][0].axis('off')

for i, ((low, high), edge_map) in enumerate(canny_maps.items()):
    axes[0][i + 1].imshow(edge_map)
    axes[0][i + 1].set_title(f"Canny ({low}, {high})", fontsize=11)
    axes[0][i + 1].axis('off')

# Bottom row: generated images
axes[1][0].axis('off')  # empty cell
axes[1][0].set_title("(source, no generation)", fontsize=9, color='gray')

labels = ["Too many edges", "Good edges ✓", "Too few edges"]
for i, ((low, high), gen_img) in enumerate(generated_images.items()):
    axes[1][i + 1].imshow(gen_img)
    axes[1][i + 1].set_title(f"Generated ({low}, {high})\n{labels[i]}", fontsize=11)
    axes[1][i + 1].axis('off')

plt.suptitle("Preprocessing quality directly affects ControlNet output quality", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("Observations:")
print("- (50, 100): Too many edges from texture. Model over-constrains to noise.")
print("- (100, 200): Clean structural edges. Model follows composition precisely, natural output.")
print("- (200, 300): Too few edges. Model has insufficient guidance, loose structure.")
print("")
print("Key insight: preprocessing is the most impactful practical decision.")
print("ControlNet faithfully follows whatever spatial map you give it.")
print("Garbage in, garbage out.")

In [None]:
# Save the best edge map for later exercises.
# Exercise 3 (conditioning scale sweep) will reuse this.
best_canny_map = canny_maps[(100, 200)]

# VRAM cleanup: keep the pipeline loaded. Exercise 2 will use the same Canny ControlNet
# before switching to other preprocessors.
print("Best Canny edge map saved for Exercise 3.")
print(f"Pipeline still loaded. GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

### What Just Happened

You extracted Canny edges from the same photograph at three different threshold settings and generated an image with each. The results demonstrate the lesson's central practical point:

- **Too many edges (50, 100):** Texture details, noise, and grain all become edges. ControlNet tries to follow every spurious contour, producing over-constrained, artifacted output.
- **Good edges (100, 200):** Clean structural edges capture object boundaries and composition. ControlNet follows the composition precisely while the text prompt fills in natural-looking details.
- **Too few edges (200, 300):** Only the strongest gradients survive. The model lacks sufficient spatial guidance, so the result is loosely controlled, with the composition only roughly matching.

**The threshold tuning was the most impactful decision.** The same ControlNet checkpoint, the same prompt, the same seed: only the preprocessing changed, and the output quality varied dramatically.
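
One rough way to put a number on "too many" versus "too few" edges is the fraction of pixels marked as edges. The sketch below is only an illustration: `edge_density` is a hypothetical helper, the tiny maps are made up, and no particular density is claimed to be the right target for this or any other image.

```python
def edge_density(edge_map):
    """Fraction of pixels that are edges (nonzero) in a 2D edge map.

    Accepts any 2D sequence of rows, for example a nested list or a
    single-channel array such as the one returned by cv2.Canny.
    """
    total = 0
    edge_pixels = 0
    for row in edge_map:
        for value in row:
            total += 1
            if value > 0:
                edge_pixels += 1
    return edge_pixels / total if total else 0.0


# Made-up 2x4 edge maps standing in for the three threshold settings.
toy_maps = {
    "busy": [[255, 255, 0, 255], [255, 0, 255, 255]],
    "clean": [[0, 255, 0, 0], [0, 255, 0, 0]],
    "sparse": [[0, 0, 0, 0], [0, 255, 0, 0]],
}
for name, edge_map in toy_maps.items():
    print(f"{name}: {edge_density(edge_map):.3f}")
```

In this notebook the same function could be applied to `np.array(canny_maps[(low, high)])[..., 0]` for each threshold pair, which takes one channel of the RGB edge image.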

---

## Exercise 2: Three Preprocessors, One Pipeline `[Guided]`

From the lesson: the ControlNet pipeline is **identical for all spatial map types**. Only two things change: (1) which preprocessor extracts the map, and (2) which ControlNet checkpoint you load. The architecture is genuinely map-agnostic.

We will extract three types of spatial maps from the same source image (Canny edges, MiDaS depth, and OpenPose skeleton) and generate with each. Pay attention to what **stays the same** in the API code across all three.

**Before running, predict:**
- Will the generated images look similar to each other, or qualitatively different?
- What kind of structural control does each map type provide? (edges = ?, depth = ?, pose = ?)
- How much of the pipeline code will change when switching from one map type to another?

In [None]:
# Step 1: We already have the Canny edge map from Exercise 1.
# Now extract depth and pose maps.

# --- Depth map via DPT (MiDaS-based) ---
from transformers import pipeline as hf_pipeline

depth_estimator = hf_pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
depth_result = depth_estimator(source_image)
depth_map = depth_result["depth"]  # PIL Image, grayscale
depth_map = depth_map.resize((512, 512))

# Clean up the depth model (we only need the map)
del depth_estimator
gc.collect()
torch.cuda.empty_cache()

print("Depth map extracted.")

# --- Pose map via OpenPose ---
from controlnet_aux import OpenposeDetector

openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose_map = openpose(source_image)
pose_map = pose_map.resize((512, 512))

# Clean up the pose model
del openpose
gc.collect()
torch.cuda.empty_cache()

print("Pose map extracted.")

# Display all three spatial maps
show_images(
    [source_image, best_canny_map, depth_map, pose_map],
    ["Source", "Canny Edges", "Depth Map (MiDaS)", "OpenPose Skeleton"],
    suptitle="Same source image, three preprocessors, three types of spatial information",
)

print("Each preprocessor captures different spatial information:")
print("  Canny: 2D contours and silhouettes")
print("  Depth: 3D structure, perspective, foreground/background layering")
print("  Pose:  Body joint positions and limb angles")

In [None]:
# Step 2: Generate with each preprocessor and its corresponding ControlNet checkpoint.
#
# VRAM management: We load ONE ControlNet pipeline at a time.
# After generating with Canny, we delete the pipeline, then load depth, etc.
# This keeps us well within T4's 16 GB VRAM budget.

prompt = "a beautiful watercolor painting of a woman, masterpiece, high quality"
seed = 42
num_steps = 20

# --- Generate with Canny (pipeline already loaded from Exercise 1) ---
generator = make_generator(seed)
img_canny = pipe(
    prompt,
    image=best_canny_map,
    num_inference_steps=num_steps,
    generator=generator,
).images[0]
print("Canny generation complete.")

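# Clean up Canny pipeline before loading depth.
# cleanup_pipeline cannot delete notebook-level names, so drop them here
# and collect again; otherwise the weights stay on the GPU.
cleanup_pipeline(pipe)
del pipe, controlnet_canny
gc.collect()
torch.cuda.empty_cache()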

In [None]:
# --- Generate with Depth ---
controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet_depth,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

generator = make_generator(seed)
img_depth = pipe(
    prompt,
    image=depth_map,
    num_inference_steps=num_steps,
    generator=generator,
).images[0]
print("Depth generation complete.")

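# The depth map is reused in Exercise 4 (stacking); the depth ControlNet is reloaded there.
# Free the pipeline now to make room for OpenPose, including the notebook-level references.
cleanup_pipeline(pipe)
del pipe, controlnet_depth
gc.collect()
torch.cuda.empty_cache()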

In [None]:
# --- Generate with OpenPose ---
controlnet_pose = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet_pose,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

generator = make_generator(seed)
img_pose = pipe(
    prompt,
    image=pose_map,
    num_inference_steps=num_steps,
    generator=generator,
).images[0]
print("OpenPose generation complete.")

# Clean up the pose pipeline, including the notebook-level references
cleanup_pipeline(pipe)
del pipe, controlnet_pose
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Step 3: Compare all three side by side

fig, axes = plt.subplots(2, 4, figsize=(20, 10))

# Top row: spatial maps
axes[0][0].imshow(source_image)
axes[0][0].set_title("Source Image", fontsize=11)
axes[0][0].axis('off')

maps_and_titles = [
    (best_canny_map, "Canny Edges"),
    (depth_map, "Depth Map"),
    (pose_map, "OpenPose Skeleton"),
]
for i, (m, t) in enumerate(maps_and_titles):
    axes[0][i + 1].imshow(m)
    axes[0][i + 1].set_title(t, fontsize=11)
    axes[0][i + 1].axis('off')

# Bottom row: generated images
axes[1][0].axis('off')
axes[1][0].set_title("(source)", fontsize=9, color='gray')

gen_and_titles = [
    (img_canny, "Canny → Contours"),
    (img_depth, "Depth → 3D Structure"),
    (img_pose, "Pose → Body Position"),
]
for i, (img, t) in enumerate(gen_and_titles):
    axes[1][i + 1].imshow(img)
    axes[1][i + 1].set_title(t, fontsize=11)
    axes[1][i + 1].axis('off')

plt.suptitle("Same prompt, same seed: different preprocessor, qualitatively different control", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("Notice what stayed the SAME in the code across all three:")
print("  - StableDiffusionControlNetPipeline.from_pretrained(...)")
print("  - pipe(prompt, image=spatial_map, num_inference_steps=..., generator=...)")
print("")
print("Only TWO things changed:")
print("  1. Which preprocessor extracted the map (cv2.Canny vs depth_estimator vs openpose)")
print("  2. Which ControlNet checkpoint was loaded (sd-controlnet-canny vs -depth vs -openpose)")
print("")
print("The pipeline does not know or care what kind of spatial map it receives.")
print("The architecture is genuinely map-agnostic.")

### What Just Happened

You used three different preprocessors on the same source image and generated with each using the corresponding ControlNet checkpoint:

- **Canny edges** controlled 2D contours and silhouettesâ€”the model preserved object boundaries and composition.
- **MiDaS depth** controlled 3D structure and layeringâ€”the model preserved foreground/background arrangement and perspective.
- **OpenPose skeleton** controlled body poseâ€”the model generated a figure matching the skeleton's joint positions.

The three outputs look qualitatively different because each spatial map captures a **different kind** of structural information. Yet the pipeline code was identical. The modularity from lesson 1 ("four translators, one pipeline") is real. Swap the preprocessor and the checkpoint; the socket does not change.
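
The "swap the checkpoint, keep the socket" idea can be written down as a small lookup table. This is a sketch, not part of diffusers: `CONTROLNET_CHECKPOINTS`, `generation_kwargs`, and `describe_run` are hypothetical names, and the placeholder strings stand in for real spatial maps. The checkpoint ids are the ones loaded earlier in this exercise.

```python
# The only map-specific choice on the generation side: which checkpoint to load.
CONTROLNET_CHECKPOINTS = {
    "canny": "lllyasviel/sd-controlnet-canny",
    "depth": "lllyasviel/sd-controlnet-depth",
    "openpose": "lllyasviel/sd-controlnet-openpose",
}


def generation_kwargs(spatial_map, num_steps=20, generator=None):
    """Keyword arguments for the pipeline call; the same for every map type."""
    return {
        "image": spatial_map,
        "num_inference_steps": num_steps,
        "generator": generator,
    }


def describe_run(map_type, spatial_map):
    """Pair the map-specific checkpoint with the shared call arguments."""
    return CONTROLNET_CHECKPOINTS[map_type], generation_kwargs(spatial_map)


for map_type in CONTROLNET_CHECKPOINTS:
    checkpoint, kwargs = describe_run(map_type, spatial_map=f"<{map_type} map>")
    print(checkpoint, sorted(kwargs))
```

In the notebook, `checkpoint` would go to `ControlNetModel.from_pretrained(checkpoint, torch_dtype=torch.float16)` and the call would be `pipe(prompt, **kwargs)`, exactly as in the cells above.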

---

## Exercise 3: Conditioning Scale Sweep `[Supported]`

From the lesson: conditioning scale is a **volume knob for spatial control**. Low scale means the spatial map is a faint suggestion and the model generates freely from text. High scale means the model rigidly follows the spatial map, losing natural variation. The sweet spot is typically 0.7–1.0.

This exercise uses the best Canny edge map from Exercise 1. Your task: write the generation loop that sweeps across conditioning scale values and displays the results as a comparison grid.

After the sweep, you will also verify that **text conditioning remains active** even at scale=1.0 by generating with two different prompts on the same edge map.

**Before running, predict:**
- At scale=0.3, will the output follow the edge map at all?
- At scale=2.0, what will the output look like? Sharper? More detailed? Or something else?
- At scale=1.0 with two different prompts, will the images look identical?

In [None]:
# Load the Canny ControlNet pipeline for this exercise

controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet_canny,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

print("Canny ControlNet pipeline loaded.")
print(f"GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

In [None]:
# YOUR TASK: Sweep the conditioning scale from 0.3 to 2.0.
#
# The scale values and display code are provided.
# You write the generation loop that fills in `sweep_results`.

scales = [0.3, 0.5, 0.7, 1.0, 1.5, 2.0]
prompt = "a beautiful watercolor painting of a woman, masterpiece, high quality"
seed = 42
num_steps = 20

sweep_results = {}  # Maps scale -> generated PIL Image

for scale in scales:
    # TODO: Generate an image using the pipeline with this conditioning scale.
    #
    # Use: pipe(prompt, image=best_canny_map, num_inference_steps=num_steps,
    #          generator=make_generator(seed),
    #          controlnet_conditioning_scale=???)
    #
    # Store the result in sweep_results[scale]
    # (the pipeline returns .images[0] for the first image)
    pass

print(f"Generated {len(sweep_results)} images across conditioning scales.")

In [None]:
# Display the conditioning scale sweep as a comparison grid

if not sweep_results:
    print("sweep_results is empty. Go back to the previous cell and fill in the TODO.")
    print("You need to generate an image for each scale value and store it in sweep_results[scale].")
    print("Check the solution below the TODO cell if you get stuck.")
else:
    fig, axes = plt.subplots(1, len(scales) + 1, figsize=(4 * (len(scales) + 1), 4))

    # First column: the edge map
    axes[0].imshow(best_canny_map)
    axes[0].set_title("Edge Map\n(input)", fontsize=10)
    axes[0].axis('off')

    # One column per scale value
    scale_labels = {
        0.3: "0.3\n(faint suggestion)",
        0.5: "0.5\n(visible influence)",
        0.7: "0.7\n(good balance)",
        1.0: "1.0\n(strong control)",
        1.5: "1.5\n(over-constraining)",
        2.0: "2.0\n(heavily rigid)",
    }

    for i, scale in enumerate(scales):
        axes[i + 1].imshow(sweep_results[scale])
        axes[i + 1].set_title(f"Scale {scale_labels[scale]}", fontsize=10)
        axes[i + 1].axis('off')

    plt.suptitle("Conditioning Scale Sweep: same edges, same prompt, varying spatial strength", fontsize=13, y=1.02)
    plt.tight_layout()
    plt.show()

    print("The volume knob for spatial control:")
    print("  Low (0.3-0.5): spatial map is a suggestion, model generates freely")
    print("  Sweet spot (0.7-1.0): clear structural adherence, natural textures")
    print("  High (1.5-2.0): over-constrained, rigid textures, artifacts")

In [None]:
# Step 2: Verify that text conditioning remains active at scale=1.0.
#
# Same edge map, same scale, same seed, TWO different prompts.
# If spatial conditioning disabled text conditioning, the images would be identical.

prompt_a = "a beautiful watercolor painting of a woman, masterpiece"
prompt_b = "a cyberpunk android with neon tattoos, digital art"

generator = make_generator(seed)
img_prompt_a = pipe(
    prompt_a,
    image=best_canny_map,
    num_inference_steps=num_steps,
    generator=generator,
    controlnet_conditioning_scale=1.0,
).images[0]

generator = make_generator(seed)
img_prompt_b = pipe(
    prompt_b,
    image=best_canny_map,
    num_inference_steps=num_steps,
    generator=generator,
    controlnet_conditioning_scale=1.0,
).images[0]

show_images(
    [best_canny_map, img_prompt_a, img_prompt_b],
    ["Edge Map", f'Prompt A:\n"{prompt_a[:40]}..."', f'Prompt B:\n"{prompt_b[:40]}..."'],
    suptitle="Same edges, scale=1.0, different prompts → same structure, different content",
)

print("The structure is identical (both follow the same edges).")
print("The content is completely different (watercolor vs cyberpunk).")
print("")
print("Conditioning scale 1.0 does NOT disable text conditioning.")
print("Remember WHEN/WHAT/WHERE: the scale controls how loud the WHERE signal is,")
print("not whether the WHAT signal is on.")

In [None]:
# Clean up the pipeline before Exercise 4
cleanup_pipeline(pipe)
del pipe, controlnet_canny  # also drop the notebook-level pipeline reference so the weights can be freed
gc.collect()
torch.cuda.empty_cache()
print("Pipeline cleaned up. Ready for Exercise 4.")

<details>
<summary>💡 Solution</summary>

The key insight is that `controlnet_conditioning_scale` is a single parameter in the pipeline call: your manual volume knob for spatial control. The sweep loop is just calling the pipeline with a different scale value each time.

```python
for scale in scales:
    generator = make_generator(seed)
    result = pipe(
        prompt,
        image=best_canny_map,
        num_inference_steps=num_steps,
        generator=generator,
        controlnet_conditioning_scale=scale,
    ).images[0]
    sweep_results[scale] = result
```

**Why this works:** The conditioning scale multiplies the ControlNet's output features before they are added to the frozen encoder's skip connections. At scale=0, the ControlNet's contribution is zeroed out (muted). At scale=1.0, it is at full trained strength. Above 1.0, you are amplifying the signal beyond what training optimized for, which is why over-constraining occurs.

**Common mistake:** Forgetting to recreate the generator for each scale value. Without resetting the seed, each generation starts from different random noise, making the comparison unfair: you would not be isolating the effect of the scale parameter.
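
The reseeding point can be illustrated without a GPU. This sketch uses Python's `random.Random` as a stand-in for `torch.Generator`, an assumption made only so the example runs anywhere. `fake_generate` is a hypothetical placeholder for the pipeline call, not a diffusers function.

```python
import random


def fake_generate(rng, scale):
    """Stand-in for pipe(...): draw 'initial noise' from rng and return it with the scale."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
    return noise, scale


scales = [0.3, 1.0, 2.0]
seed = 42

# Fair comparison: a fresh generator per scale, so the starting noise is identical.
fresh = [fake_generate(random.Random(seed), s)[0] for s in scales]

# Unfair comparison: one shared generator keeps advancing between calls.
shared_rng = random.Random(seed)
shared = [fake_generate(shared_rng, s)[0] for s in scales]

print("fresh generator, same noise each time:", fresh[0] == fresh[1] == fresh[2])
print("shared generator, same noise each time:", shared[0] == shared[1] == shared[2])
```

Only the first call with the shared generator matches the reseeded runs; every later call starts from noise that depends on how many draws came before it.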

</details>

### What Just Happened

You swept the conditioning scale from 0.3 to 2.0 and discovered the control-creativity tradeoff firsthand:

- **Low scales (0.3–0.5):** The spatial map is a faint suggestion. The model generates mostly from the text prompt, with rough composition loosely following the edges.
- **Sweet spot (0.7–1.0):** Clear structural adherence with natural textures. The model follows the composition precisely while maintaining the visual quality you expect from SD.
- **High scales (1.5–2.0):** Over-constrained. Textures flatten, details stiffen, the image looks mechanical. The model is trying too hard to match every edge pixel.

You also verified that **text conditioning remains active** at scale=1.0. Two different prompts with the same edge map produced the same structure but completely different content. The WHEN/WHAT/WHERE channels coexist: conditioning scale turns up the WHERE volume; it does not mute the WHAT channel.
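
As a toy picture of the volume knob: the solution describes the scale as a multiplier on the ControlNet's output before it is added to the frozen encoder's skip features. Here is a minimal sketch of that arithmetic with made-up numbers. `apply_control` and the feature values are hypothetical and not taken from any real model.

```python
def apply_control(encoder_features, control_features, scale):
    """Skip-connection features after adding a scaled ControlNet residual."""
    return [e + scale * z for e, z in zip(encoder_features, control_features)]


e = [1.0, -2.0, 0.5]   # frozen encoder features at one skip connection (made up)
z = [0.5, 1.0, -1.0]   # ControlNet residual at the same positions (made up)

for scale in [0.0, 0.3, 1.0, 2.0]:
    print(scale, apply_control(e, z, scale))
```

At scale 0.0 the result equals the encoder features, so the ControlNet is muted. At 2.0 the residual counts twice as much as it did during training, which is one way to read the over-constrained look at high scales.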

---

## Exercise 4: Multi-ControlNet Stacking `[Supported]`

From the lesson: each ControlNet independently contributes additive features to the skip connections. They compose by summation: `e_i + z_i_canny + z_i_depth`. Stacking is not doubling control strength; it is providing two **complementary** types of structural constraint.
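
The summation can be sketched with plain lists. The residual values are made up so that the two ControlNets touch different positions, and the per-ControlNet scale is included on the assumption that it multiplies each residual before the sum. `stack_controls` is a hypothetical helper, not a diffusers function.

```python
def stack_controls(encoder_features, residuals, scales):
    """Compute e_i + sum over k of scale_k * z_i_k at one skip connection."""
    if len(residuals) != len(scales):
        raise ValueError("need one scale per ControlNet residual")
    combined = list(encoder_features)
    for z, s in zip(residuals, scales):
        combined = [c + s * v for c, v in zip(combined, z)]
    return combined


e = [1.0, 1.0, 1.0, 1.0]
z_canny = [2.0, 0.0, -2.0, 0.0]   # made-up residual from the Canny ControlNet
z_depth = [0.0, 4.0, 0.0, -4.0]   # made-up residual from the depth ControlNet

print(stack_controls(e, [z_canny, z_depth], [1.0, 1.0]))
print(stack_controls(e, [z_canny, z_depth], [0.5, 0.5]))
```

Each residual changes positions the other leaves alone, which is the "complementary, not doubled" picture. Lowering the scales shrinks both contributions without removing either.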

Your task: load both the Canny and depth ControlNet checkpoints, extract both maps from the same source image (you already have these from earlier exercises), and generate three comparisons:
1. Canny edges only
2. Depth only
3. Both stacked together

The model loading is done for you. You write the pipeline construction and generation calls.

**Before running, predict:**
- Will the stacked result look like "Canny result + depth result" blended together?
- Will the stacked result be more precisely controlled than either alone, or more artifacted?
- What conditioning scales should you use for the stacked version?

In [None]:
# Step 1: Generate with Canny only
#
# Load a single-ControlNet pipeline for Canny, generate, clean up.

prompt = "a beautiful watercolor painting of a woman, masterpiece, high quality"
seed = 42
num_steps = 20

controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

pipe_canny = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet_canny,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe_canny.scheduler = UniPCMultistepScheduler.from_config(pipe_canny.scheduler.config)

generator = make_generator(seed)
img_canny_only = pipe_canny(
    prompt,
    image=best_canny_map,
    num_inference_steps=num_steps,
    generator=generator,
    controlnet_conditioning_scale=0.8,
).images[0]
print("Canny-only generation complete.")

# Clean up before loading depth pipeline, including the notebook-level references
cleanup_pipeline(pipe_canny)
del pipe_canny, controlnet_canny
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Step 2: Generate with Depth only

controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth",
    torch_dtype=torch.float16,
)

pipe_depth = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet_depth,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe_depth.scheduler = UniPCMultistepScheduler.from_config(pipe_depth.scheduler.config)

generator = make_generator(seed)
img_depth_only = pipe_depth(
    prompt,
    image=depth_map,
    num_inference_steps=num_steps,
    generator=generator,
    controlnet_conditioning_scale=0.8,
).images[0]
print("Depth-only generation complete.")

# Clean up before loading the stacked pipeline, including the notebook-level references
cleanup_pipeline(pipe_depth)
del pipe_depth, controlnet_depth
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Step 3: Generate with BOTH Canny + Depth stacked.
#
# The ControlNet models are loaded for you.
# YOUR TASK: Build the stacked pipeline and generate.
#
# VRAM note: Two ControlNet models + SD base model will use ~8-9 GB.
# This fits on a T4 since we cleaned up the previous pipelines.

controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)
controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth",
    torch_dtype=torch.float16,
)

# TODO: Build a StableDiffusionControlNetPipeline with BOTH ControlNets.
#
# The key difference from single ControlNet: pass a LIST of ControlNets.
#   controlnet=[controlnet_canny, controlnet_depth]
#
# Use the same base model, torch_dtype, and safety_checker settings as before.
# Don't forget to set the scheduler and move to device.
#
# pipe_stacked = StableDiffusionControlNetPipeline.from_pretrained(
#     ...,
#     controlnet=???,
#     ...,
# ).to(device)
# pipe_stacked.scheduler = ...



# TODO: Generate with the stacked pipeline.
#
# The pipeline takes LISTS for image and controlnet_conditioning_scale:
#   image=[best_canny_map, depth_map]
#   controlnet_conditioning_scale=[0.7, 0.5]
#
# Use moderate scales (0.5-0.8) for stacking. The combined effect is stronger
# than either alone, so you need lower per-ControlNet scales.
#
# generator = make_generator(seed)
# img_stacked = pipe_stacked(
#     ...,
#     image=???,
#     controlnet_conditioning_scale=???,
#     ...,
# ).images[0]



print("Stacked generation complete.")

In [None]:
# Step 4: Display the comparison (edges only vs depth only vs both stacked)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Top row: spatial maps
axes[0][0].imshow(best_canny_map)
axes[0][0].set_title("Canny Edges", fontsize=11)
axes[0][0].axis('off')

axes[0][1].imshow(depth_map)
axes[0][1].set_title("Depth Map", fontsize=11)
axes[0][1].axis('off')

axes[0][2].imshow(source_image)
axes[0][2].set_title("Source Image", fontsize=11)
axes[0][2].axis('off')

# Bottom row: generated images
axes[1][0].imshow(img_canny_only)
axes[1][0].set_title("Canny Only\n(contour control)", fontsize=11)
axes[1][0].axis('off')

axes[1][1].imshow(img_depth_only)
axes[1][1].set_title("Depth Only\n(layering control)", fontsize=11)
axes[1][1].axis('off')

axes[1][2].imshow(img_stacked)
axes[1][2].set_title("Canny + Depth Stacked\n(contours AND layering)", fontsize=11)
axes[1][2].axis('off')

plt.suptitle("Multi-ControlNet: complementary constraints compose by additive features", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

print("Observations:")
print("  Canny only: precise contour control, but depth/perspective is model's choice.")
print("  Depth only: correct spatial layering, but specific contours are loose.")
print("  Stacked: BOTH contours AND layering are controlledâ€”more precise than either alone.")
print("")
print("The stacked result is not '2x more controlled.'")
print("It is controlled in TWO DIFFERENT WAYS simultaneously.")
print("Edges enforce contours. Depth enforces layering. Complementary, not redundant.")

In [None]:
# Clean up the stacked pipeline
cleanup_pipeline(pipe_stacked)
del pipe_stacked, controlnet_canny, controlnet_depth  # drop the pipeline reference too so the weights can be freed
gc.collect()
torch.cuda.empty_cache()
print("Pipeline cleaned up. Ready for Exercise 5.")

<details>
<summary>💡 Solution</summary>

The core insight is that stacking uses Python **lists**: a list of ControlNet models, a list of conditioning images, and a list of per-ControlNet conditioning scales. The pipeline handles the rest (running each ControlNet independently and summing their additive features at the skip connections).

**Building the stacked pipeline:**
```python
pipe_stacked = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=[controlnet_canny, controlnet_depth],
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe_stacked.scheduler = UniPCMultistepScheduler.from_config(pipe_stacked.scheduler.config)
```

**Generating with the stacked pipeline:**
```python
generator = make_generator(seed)
img_stacked = pipe_stacked(
    prompt,
    image=[best_canny_map, depth_map],
    num_inference_steps=num_steps,
    generator=generator,
    controlnet_conditioning_scale=[0.7, 0.5],
).images[0]
```

**Why moderate scales (0.7 and 0.5)?** Each ControlNet independently contributes additive features. Their combined effect at the skip connections is `e_i + z_i_canny + z_i_depth`. If both are at scale 1.0, the combined spatial constraint can be too strong, producing artifacts. Starting with moderate scales (0.5–0.8) gives each map influence while leaving room for the model to generate natural details.

**Why Canny at 0.7 and depth at 0.5?** Canny edges provide fine-grained, visually precise contour control, so you want this signal to be reasonably strong. Depth provides broad spatial layering, for which a softer influence is usually sufficient. You could tune these differently depending on your creative intent.

**Common mistake:** Passing `image=best_canny_map` (a single image) instead of `image=[best_canny_map, depth_map]` (a list). When you have multiple ControlNets, the pipeline expects a list of conditioning images, one per ControlNet.
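
A small pre-flight check can catch this mistake before the pipeline call. The sketch below is illustrative only: `check_stacked_inputs` is a hypothetical helper, not part of diffusers, and the strings stand in for real models and images. Accepting a single float as a shared scale is a choice made for this sketch.

```python
from typing import Sequence


def check_stacked_inputs(controlnets: Sequence[object], images: object, scales: object) -> None:
    """Raise ValueError unless images and scales line up with the ControlNets."""
    n = len(controlnets)

    # The conditioning images must be a list/tuple with one entry per ControlNet.
    if not isinstance(images, (list, tuple)):
        raise ValueError(
            f"Expected a list of {n} conditioning images, "
            f"got a single {type(images).__name__}."
        )
    if len(images) != n:
        raise ValueError(f"Got {len(images)} conditioning images for {n} ControlNets.")

    # In this sketch a single number is treated as one scale shared by all
    # ControlNets; a list must have exactly one scale per ControlNet.
    if isinstance(scales, (int, float)):
        return
    if len(scales) != n:
        raise ValueError(f"Got {len(scales)} conditioning scales for {n} ControlNets.")


# Placeholder strings stand in for the real models and PIL images.
check_stacked_inputs(["canny_net", "depth_net"], ["canny_map", "depth_map"], [0.7, 0.5])
print("Stacked inputs are consistent.")
```

In the notebook, the call would take the same lists that go into the pipeline, for example `check_stacked_inputs([controlnet_canny, controlnet_depth], [best_canny_map, depth_map], [0.7, 0.5])`.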

</details>

### What Just Happened

You stacked two ControlNets (Canny edges and depth) and compared the result to each individual ControlNet:

- **Canny only:** Precise contour control. Object boundaries and silhouettes match the edges. But depth and perspective are the model's choice.
- **Depth only:** Correct spatial layering and perspective. But specific contours are loosely controlled.
- **Stacked:** Both contours AND layering are controlled. The combination is more precisely controlled than either alone because the two maps provide **complementary** structural information.

The stacked result is not "2x more controlled." It is controlled in two different ways simultaneously: edges enforce contours while depth enforces layering. This is the additive composition from the architecture: `e_i + z_i_canny + z_i_depth`.
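
To make the additive composition concrete, here is a toy numeric sketch. It assumes each ControlNet's features are multiplied by its conditioning scale before being summed into the encoder features, which matches the "volume knob" description. The three-element lists stand in for real feature tensors, and `combine_skip_features` is a hypothetical name used only for illustration.

```python
def combine_skip_features(encoder_feat, controlnet_feats, scales):
    """Return e_i + sum_k(scale_k * z_i_k), computed elementwise."""
    if len(controlnet_feats) != len(scales):
        raise ValueError("Need exactly one scale per ControlNet.")

    combined = list(encoder_feat)
    for feat, scale in zip(controlnet_feats, scales):
        if len(feat) != len(combined):
            raise ValueError("All features must have the same shape.")
        # Each ControlNet adds its own scaled contribution independently.
        combined = [c + scale * z for c, z in zip(combined, feat)]
    return combined


# Toy values standing in for features at one skip connection.
e_i = [1.0, 2.0, 3.0]
z_i_canny = [1.0, 0.0, -1.0]
z_i_depth = [0.0, 2.0, 2.0]

stacked = combine_skip_features(e_i, [z_i_canny, z_i_depth], [0.7, 0.5])
unconditioned = combine_skip_features(e_i, [z_i_canny, z_i_depth], [0.0, 0.0])

print("stacked:", stacked)
print("both scales at zero:", unconditioned)
```

With both scales at zero the encoder features pass through unchanged, which is why lowering a scale turns that map into a weaker suggestion instead of removing the ControlNet from the pipeline.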

---

## Exercise 5: Your Composition `[Independent]`

You now have all the practical skills:
- **Preprocessing:** Extract Canny edges, depth maps, or pose skeletons from any image
- **Conditioning scale:** Tune the volume knob for spatial control (sweet spot: 0.7–1.0)
- **Stacking:** Combine multiple spatial constraints with per-ControlNet scales

**Your task:** Create a controlled composition from scratch.

1. Choose a source image (options provided below, or use your own)
2. Decide which preprocessor(s) to use based on your creative intent
3. Tune the preprocessing (Canny thresholds if using edges)
4. Choose your conditioning scale(s)
5. Write a text prompt that complements the spatial control
6. Generate and iterate

No scaffolding. You decide the full workflow.

**Available source images:**

In [None]:
# Source image options (pick one or upload your own)

image_options = {
    "vermeer": "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png",
    "architecture": "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_input.png",
    "person": "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png",
}

# Show what is available
option_images = []
option_titles = []
for name, url in image_options.items():
    try:
        img = download_sample_image(url, f"{name}.jpg").resize((512, 512))
        option_images.append(img)
        option_titles.append(name)
    except Exception as e:
        print(f"Could not download {name}: {e}")

show_images(option_images, option_titles, suptitle="Choose a source image (or upload your own)")

print("Pick one of these, or upload your own image to Colab.")
print("Think about what kind of spatial control you want:")
print("  - Canny edges for precise contours")
print("  - Depth for spatial layering")
print("  - Pose for body positioning")
print("  - Stacking for combined control")

In [None]:
# Your composition code goes here.
#
# Workflow:
# 1. Choose your source image
# 2. Preprocess it (Canny, depth, pose, or multiple)
# 3. Load the appropriate ControlNet checkpoint(s)
# 4. Build the pipeline
# 5. Generate with your chosen prompt and conditioning scale(s)
# 6. Display the result
#
# Remember:
# - Use torch.float16 for all models
# - Clean up preprocessor models after extracting maps
# - Start with conditioning_scale around 0.7-1.0
# - For stacking, use moderate scales (0.5-0.8 per ControlNet)
#
# Have fun!



<details>
<summary>💡 Solution (example)</summary>

There is no single correct answer; this exercise is about making practical decisions. Here is one example using the architecture image with stacked Canny + depth:

**Why this approach:** The architecture image has clear geometric edges (good for Canny) and visible depth layering (good for depth ControlNet). Stacking both gives precise contour control AND correct perspective.

```python
# Choose source image
my_image = download_sample_image(image_options["architecture"], "my_source.jpg")
my_image = my_image.resize((512, 512))

# Preprocess: Canny edges
my_edges = cv2.Canny(np.array(my_image), 100, 200)
my_canny = Image.fromarray(np.stack([my_edges] * 3, axis=-1))

# Preprocess: depth map
from transformers import pipeline as hf_pipeline
depth_est = hf_pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
my_depth = depth_est(my_image)["depth"].resize((512, 512))
del depth_est; gc.collect(); torch.cuda.empty_cache()

# Load stacked pipeline
cn_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
cn_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=[cn_canny, cn_depth],
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Generate
generator = make_generator(42)
result = pipe(
    "a fantasy castle in a magical forest, digital art, detailed",
    image=[my_canny, my_depth],
    num_inference_steps=20,
    generator=generator,
    controlnet_conditioning_scale=[0.7, 0.5],
).images[0]

# Display
show_images(
    [my_image, my_canny, my_depth, result],
    ["Source", "Canny Edges", "Depth Map", "Generated"],
    suptitle="Stacked Canny + Depth: contours AND perspective controlled",
)

# Clean up
cleanup_pipeline(pipe)
del cn_canny, cn_depth
gc.collect()
torch.cuda.empty_cache()
```

**Key decisions:**
- Canny thresholds (100, 200) because the building has clear geometric edges at moderate contrast
- Stacking because architecture benefits from both contour precision (Canny) and perspective accuracy (depth)
- Canny at 0.7 (strong contours matter for architecture), depth at 0.5 (softer spatial layering)
- The prompt describes the creative transformation while the spatial maps lock down the composition
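
One rough way to compare Canny threshold pairs is to measure what fraction of pixels end up marked as edges. This heuristic is not from the lesson, and `edge_density` is a hypothetical helper. It assumes a binary edge map in which nonzero pixels are edges, as `cv2.Canny` returns. What counts as a good density depends on the image, so treat the number as a way to compare settings, not as a target.

```python
import numpy as np


def edge_density(edges: np.ndarray) -> float:
    """Fraction of pixels marked as edges (nonzero) in a binary edge map."""
    if edges.size == 0:
        raise ValueError("Edge map is empty.")
    return float(np.count_nonzero(edges)) / edges.size


# Synthetic 8x8 edge map with one horizontal line, standing in for cv2.Canny output.
toy_edges = np.zeros((8, 8), dtype=np.uint8)
toy_edges[3, :] = 255

print(f"Edge density: {edge_density(toy_edges):.3f}")
```

In the notebook you could call it on each `cv2.Canny(np.array(my_image), low, high)` result to see how quickly the map fills with edges as the thresholds drop.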

</details>

---

## Key Takeaways

1. **Preprocessing quality is the most impactful practical decision.** ControlNet faithfully follows whatever spatial map you give it. Bad Canny thresholds produce noisy edges; noisy edges produce artifacted output. Garbage in, garbage out. Tune the preprocessing before touching anything else.

2. **The pipeline API is identical for all spatial map types.** Only the preprocessor and the ControlNet checkpoint change. `StableDiffusionControlNetPipeline` does not know or care what kind of map it receives. The architecture is genuinely map-agnostic.

3. **Conditioning scale is your volume knob for spatial control.** Low scale = spatial map is a suggestion. High scale = model rigidly follows the map. Sweet spot is typically 0.7–1.0. It is the same tradeoff as the CFG guidance scale (precision vs. creativity): two knobs on the same mixing board.

4. **Multiple ControlNets stack by summing their additive features.** Use complementary maps from the same source image, moderate scales (0.5–0.8 each), and add complexity gradually. Stacking is not "2x control"; it is two different types of control simultaneously.

5. **The practical workflow is: choose image → choose preprocessor(s) → tune preprocessing → tune conditioning scale → iterate.** This is a creative process. The tools are the same every time; the decisions are what change.
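
The "iterate" step is easier to reason about when only one knob changes at a time. The sketch below sweeps the conditioning scale while reusing the same seed, so differences between outputs come from the scale alone. `sweep_conditioning_scale` and `fake_generate` are hypothetical names; the stand-in returns a label instead of an image so the example runs without a GPU.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


def sweep_conditioning_scale(
    generate: Callable[[float, int], T],
    scales: Sequence[float],
    seed: int = 42,
) -> Dict[float, T]:
    """Call generate(scale, seed) once per scale, reusing the same seed."""
    results: Dict[float, T] = {}
    for scale in scales:
        # Same seed every time, so the scale is the only thing that varies.
        results[scale] = generate(scale, seed)
    return results


def fake_generate(scale: float, seed: int) -> str:
    # Stand-in for a real pipeline call; returns a label instead of an image.
    return f"image(scale={scale}, seed={seed})"


sweep = sweep_conditioning_scale(fake_generate, [0.5, 0.7, 1.0])
for scale, output in sweep.items():
    print(scale, output)
```

With a real pipeline, `generate` would wrap the pipeline call and build a fresh generator from the seed on every call, as the exercises do with `make_generator(seed)`.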