# ControlNet

**Module 7.1, Lesson 1** | CourseAI

You built Stable Diffusion from scratch. You know the U-Net encoder-decoder, skip connections, cross-attention, and the frozen-model training pattern. This notebook puts that knowledge to work on a real ControlNet model.

**What you will do:**
- Inspect a pre-trained ControlNet's architecture and compare its parameter counts to the frozen SD model
- Build a zero convolution from scratch and verify that it produces all-zero output at initialization
- Trace a forward pass through ControlNet, inspecting feature map shapes at every resolution level
- Generate images with and without ControlNet conditioning, and verify that text and spatial conditioning coexist

**For each exercise, PREDICT the output before running the cell.**

Every concept in this notebook comes from the lesson: the trainable encoder copy, the zero convolution mechanism, the additive connections at each resolution level, and the coexistence of text and spatial conditioning. No new theory here, just verification with real models and real tensors.

**Estimated time:** 30–45 minutes (model downloads may take a few minutes on first run).

---

## Setup

Run this cell to install dependencies, import everything, and configure the environment.

**Important:** Set the runtime to GPU before running. In Colab: Runtime → Change runtime type → T4 GPU.

The first run will download model weights (~1.5 GB for the ControlNet checkpoint + ~5 GB for the SD v1.5 model). Subsequent runs use cached weights.

In [None]:
!pip install -q diffusers transformers accelerate safetensors

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
from diffusers import (
    StableDiffusionControlNetPipeline,
    ControlNetModel,
    UniPCMultistepScheduler,
)
from diffusers.models.unets.unet_2d_condition import UNet2DConditionModel
from PIL import Image

# Reproducible results
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
else:
    print('WARNING: No GPU detected. Exercises 3 and 4 will be very slow.')
    print('In Colab: Runtime \u2192 Change runtime type \u2192 T4 GPU')

print('\nSetup complete.')

---

## Exercise 1: Inspect the ControlNet Architecture `[Guided]`

From the lesson, ControlNet adds spatial conditioning by cloning the U-Net encoder and training only the clone. The frozen SD model stays untouched. The clone's outputs are connected to the frozen decoder via zero convolutions.

Let's load a real pre-trained ControlNet (Canny edge variant) and the frozen SD v1.5 U-Net, then inspect what's inside.

**Before running, predict:**
- The frozen SD U-Net has ~860M parameters. How many parameters does the ControlNet add? (Hint: the lesson said the trainable copy is the encoder half only, about 35% of the U-Net.)
- Will any of the U-Net's parameters be marked as trainable?
- Will the ControlNet's encoder blocks have the same structure as the U-Net's encoder blocks?

In [None]:
# Load a pre-trained ControlNet (Canny edge variant) and the SD v1.5 U-Net
# This downloads model weights on first run (~1.5 GB for ControlNet, ~3.4 GB for U-Net)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

unet = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    subfolder="unet",
    torch_dtype=torch.float16,
)

print("Models loaded.")

In [None]:
# Count parameters in each model

def count_params(model):
    """Count total and trainable parameters in a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

unet_total, unet_trainable = count_params(unet)
cn_total, cn_trainable = count_params(controlnet)

print("=== Parameter Counts ===")
print(f"\nFrozen SD v1.5 U-Net:")
print(f"  Total parameters:     {unet_total:>12,}")
print(f"  Trainable parameters: {unet_trainable:>12,}")

print(f"\nControlNet (Canny):")
print(f"  Total parameters:     {cn_total:>12,}")
print(f"  Trainable parameters: {cn_trainable:>12,}")

print(f"\nControlNet as % of U-Net: {cn_total / unet_total * 100:.1f}%")
print(f"\nDuring ControlNet training:")
print(f"  U-Net params frozen:       {unet_total:>12,} (100% of U-Net)")
print(f"  ControlNet params trained:  {cn_trainable:>12,}")

In [None]:
# Inspect the top-level structure of both models
# Look for the correspondence: ControlNet's encoder blocks should mirror the U-Net's encoder blocks

print("=== U-Net Top-Level Modules ===")
for name, module in unet.named_children():
    param_count = sum(p.numel() for p in module.parameters())
    print(f"  {name:30s} {param_count:>12,} params")

print("\n=== ControlNet Top-Level Modules ===")
for name, module in controlnet.named_children():
    param_count = sum(p.numel() for p in module.parameters())
    print(f"  {name:30s} {param_count:>12,} params")

In [None]:
# The ControlNet has "down_blocks" that mirror the U-Net's "down_blocks" (encoder).
# It also has "controlnet_cond_embedding" for processing the spatial map input,
# and "controlnet_down_blocks" plus "controlnet_mid_block", the zero convolution connections.
#
# Let's verify the encoder block correspondence.

print("=== Encoder Block Correspondence ===")
print(f"{'U-Net down_blocks':40s} {'ControlNet down_blocks':40s}")
print("-" * 82)

unet_down = list(unet.down_blocks.named_children())
cn_down = list(controlnet.down_blocks.named_children())

for i in range(max(len(unet_down), len(cn_down))):
    unet_info = ""
    cn_info = ""
    if i < len(unet_down):
        name, block = unet_down[i]
        params = sum(p.numel() for p in block.parameters())
        unet_info = f"Block {name}: {type(block).__name__} ({params:,})"
    if i < len(cn_down):
        name, block = cn_down[i]
        params = sum(p.numel() for p in block.parameters())
        cn_info = f"Block {name}: {type(block).__name__} ({params:,})"
    print(f"{unet_info:40s} {cn_info:40s}")

print("\n=== Zero Convolution Layers ===")
for name, module in controlnet.controlnet_down_blocks.named_children():
    params = sum(p.numel() for p in module.parameters())
    # Each zero conv is a small 1x1 convolution
    print(f"  controlnet_down_blocks.{name}: {type(module).__name__} ({params:,} params)")

# Also the mid block zero conv
mid_zc_params = sum(p.numel() for p in controlnet.controlnet_mid_block.parameters())
print(f"  controlnet_mid_block: ({mid_zc_params:,} params)")

### What Just Happened

You loaded a real pre-trained ControlNet and inspected its architecture alongside the frozen SD v1.5 U-Net.

**Key observations:**
- The ControlNet adds roughly 35–40% of the U-Net's parameter count, not a full duplicate. It contains the encoder half and a mid-block, but no decoder.
- The ControlNet's encoder blocks (`down_blocks`) mirror the U-Net's encoder blocks in structure: same block types, same channel dimensions, same number of layers.
- The zero convolution layers (`controlnet_down_blocks`) are small 1x1 convolutions that connect the ControlNet's encoder outputs to the frozen decoder. Their parameter count is negligible compared to the encoder copy.
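- The `controlnet_cond_embedding` module processes the spatial map (Canny edges) into the same format the encoder expects.
- The trainable-parameter count printed for the U-Net equals its total: `from_pretrained` does not freeze anything. Freezing is a training-time choice, typically made by calling `requires_grad_(False)` on the U-Net before building the optimizer.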

This matches the lesson's architecture diagram: a trainable copy of the encoder, connected to the frozen decoder via zero convolutions at each resolution level.
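
To put a number on "negligible", the zero convolution parameter count can be worked out by hand. The sketch below is plain Python and assumes the SD v1.5 layout (block channels 320, 640, 1280, 1280 with two layers per block). That layout is not printed anywhere above, so compare the total against what the `controlnet_down_blocks` loop reports rather than taking it as given.

```python
# Assumed SD v1.5 layout (not stated in this notebook); verify against the
# printout from the cells above.
BLOCK_CHANNELS = (320, 640, 1280, 1280)
LAYERS_PER_BLOCK = 2


def conv1x1_params(in_ch: int, out_ch: int) -> int:
    """Weights (out_ch * in_ch * 1 * 1) plus one bias per output channel."""
    return out_ch * in_ch + out_ch


def skip_connection_channels() -> list[int]:
    """Channel width of every encoder skip connection, in forward order."""
    channels = [BLOCK_CHANNELS[0]]  # output of the input convolution
    for i, ch in enumerate(BLOCK_CHANNELS):
        channels += [ch] * LAYERS_PER_BLOCK      # one skip per layer
        if i < len(BLOCK_CHANNELS) - 1:
            channels.append(ch)                  # downsampler output
    return channels


skips = skip_connection_channels()
down_total = sum(conv1x1_params(c, c) for c in skips)
mid_total = conv1x1_params(BLOCK_CHANNELS[-1], BLOCK_CHANNELS[-1])
zero_conv_total = down_total + mid_total

print(f"Skip connections:      {len(skips)}")
print(f"Zero conv parameters:  {zero_conv_total:,}")
print(f"Share of ~860M U-Net:  {zero_conv_total / 860e6:.1%}")
```

Under these assumptions the zero convolutions come to roughly 11.5 million parameters, a little over 1% of the U-Net.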

---

## Exercise 2: Verify the Zero-Initialization Property `[Guided]`

From the lesson: a zero convolution is a **1×1 convolution with weights initialized to 0.0 and bias initialized to 0.0**. That is the complete definition. The cleverness is in the *initialization*, not the operation.

At initialization:
- The zero conv output is exactly zero for any input
- Adding zero to the frozen encoder features leaves them unchanged: $e_i + 0 = e_i$
- The frozen model's output is identical to vanilla SD

This is the **safety guarantee**: connecting a fresh ControlNet changes nothing about the original model.
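
Why "for any input"? A 1×1 convolution mixes channels independently at each pixel: output channel $o$ is $\sum_i W_{o,i}\, x_i + b_o$. With every $W_{o,i}$ and $b_o$ at zero, the input values never reach the output. A minimal plain-Python sketch of that per-pixel rule, using made-up channel values (the helper `conv1x1_pixel` is hypothetical, not part of PyTorch):

```python
def conv1x1_pixel(weight: list[list[float]], bias: list[float], pixel: list[float]) -> list[float]:
    """Apply a 1x1 conv at one spatial location: out[o] = sum_i W[o][i] * x[i] + b[o]."""
    return [
        sum(w * x for w, x in zip(row, pixel)) + b
        for row, b in zip(weight, bias)
    ]


in_channels, out_channels = 3, 2
zero_weight = [[0.0] * in_channels for _ in range(out_channels)]
zero_bias = [0.0] * out_channels

# Arbitrary made-up channel values at three different pixels.
pixels = [[1.0, -2.0, 3.5], [100.0, 0.0, -7.25], [0.001, 9.9, 42.0]]
zero_outputs = [conv1x1_pixel(zero_weight, zero_bias, p) for p in pixels]
print(zero_outputs)

# The same pixels through a non-zero layer, for contrast.
some_weight = [[0.5, 0.0, 1.0], [-1.0, 2.0, 0.0]]
some_bias = [0.1, -0.1]
nonzero_outputs = [conv1x1_pixel(some_weight, some_bias, p) for p in pixels]
print(nonzero_outputs)
```

The cells below check the same property on a real `nn.Conv2d` with 256 channels.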

**Before running, predict:**
- If you create a `nn.Conv2d(256, 256, 1)` with all weights and bias set to zero, what will the output be for a random input tensor?
- If you add that output to a tensor of frozen features, will the frozen features change at all?
- After a few gradient updates, will the zero conv still output zeros?

In [None]:
# Build a zero convolution from scratch and verify its properties

# Step 1: Create a 1x1 convolution and initialize to zero
channels = 256
zero_conv = nn.Conv2d(channels, channels, kernel_size=1, bias=True)

# Initialize ALL weights and bias to exactly zero
nn.init.zeros_(zero_conv.weight)
nn.init.zeros_(zero_conv.bias)

print("=== Zero Convolution at Initialization ===")
print(f"Weight shape: {zero_conv.weight.shape}")
print(f"Bias shape:   {zero_conv.bias.shape}")
print(f"Weight sum:   {zero_conv.weight.sum().item():.6f}  (should be 0.0)")
print(f"Bias sum:     {zero_conv.bias.sum().item():.6f}  (should be 0.0)")
print(f"Weight max:   {zero_conv.weight.abs().max().item():.6f}  (should be 0.0)")

In [None]:
# Step 2: Pass a random feature map through the zero conv
# This simulates what happens when a freshly-initialized ControlNet encoder
# produces features and passes them through a zero convolution.

# Random feature map (simulating ControlNet encoder output at some resolution)
torch.manual_seed(42)
controlnet_features = torch.randn(1, channels, 16, 16)  # batch=1, 256 channels, 16x16 spatial

print(f"ControlNet encoder output (random, simulated):")
print(f"  Shape: {controlnet_features.shape}")
print(f"  Mean:  {controlnet_features.mean().item():.4f}")
print(f"  Std:   {controlnet_features.std().item():.4f}")
print(f"  Range: [{controlnet_features.min().item():.4f}, {controlnet_features.max().item():.4f}]")

# Pass through the zero convolution
with torch.no_grad():
    zero_conv_output = zero_conv(controlnet_features)

print(f"\nAfter zero convolution:")
print(f"  Shape: {zero_conv_output.shape}")
print(f"  Mean:  {zero_conv_output.mean().item():.6f}  (should be 0.0)")
print(f"  Std:   {zero_conv_output.std().item():.6f}  (should be 0.0)")
print(f"  Max:   {zero_conv_output.abs().max().item():.6f}  (should be 0.0)")
print(f"  All zeros? {torch.all(zero_conv_output == 0).item()}")

In [None]:
# Step 3: Add the zero conv output to "frozen" encoder features.
# This is what happens at each skip connection: e_i + zero_conv(c_i)

# Simulate frozen encoder features at the same resolution
torch.manual_seed(99)
frozen_features = torch.randn(1, channels, 16, 16)

# The additive connection
enriched_features = frozen_features + zero_conv_output

# Verify: enriched features should be IDENTICAL to frozen features
difference = (enriched_features - frozen_features).abs().max().item()

print("=== Safety Guarantee: Frozen Features Unchanged ===")
print(f"Frozen encoder feature (e_i):")
print(f"  Mean: {frozen_features.mean().item():.4f}")
print(f"  Std:  {frozen_features.std().item():.4f}")
print(f"\nEnriched feature (e_i + zero_conv(c_i)):")
print(f"  Mean: {enriched_features.mean().item():.4f}")
print(f"  Std:  {enriched_features.std().item():.4f}")
print(f"\nMax absolute difference: {difference:.10f}")
print(f"Features identical? {difference == 0.0}")
print(f"\ne_i + 0 = e_i. The frozen model is unchanged. This is the safety guarantee.")

In [None]:
# Step 4: Simulate a few gradient updates to show the zero conv learns to produce
# a small, non-zero signal. The control "fades in" gradually.

# Create a simple training scenario: try to make zero conv output a target signal
target_signal = torch.randn(1, channels, 16, 16) * 0.1  # small target
optimizer = torch.optim.SGD(zero_conv.parameters(), lr=0.01)

print("=== Zero Conv Learns to Produce a Signal ===")
print(f"{'Step':>6s}  {'Output Mean':>12s}  {'Output Std':>12s}  {'Output Max':>12s}")
print("-" * 50)

for step in range(6):
    output = zero_conv(controlnet_features)
    print(f"{step:>6d}  {output.mean().item():>12.6f}  {output.std().item():>12.6f}  {output.abs().max().item():>12.6f}")

    # Simple MSE loss against the target
    loss = nn.functional.mse_loss(output, target_signal)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"\nWeight max after training: {zero_conv.weight.abs().max().item():.6f}")
print(f"Bias max after training:   {zero_conv.bias.abs().max().item():.6f}")
print(f"\nThe zero conv started silent (all zeros) and gradually learned to produce")
print(f"a small signal. Nothing at first, then a whisper, then a clear voice.")
print(f"Training gradually turns up the volume.")

### What Just Happened

You built a zero convolution from scratch and verified all three properties from the lesson:

1. **Zero output at initialization.** A 1×1 conv with all-zero weights and bias produces all-zero output, regardless of input. The ControlNet starts silent.

2. **Frozen features unchanged.** Adding the zero conv output to the frozen encoder features: $e_i + 0 = e_i$. The frozen model's behavior is identical with or without ControlNet connected.

3. **Gradual fade-in.** After a few gradient updates, the weights drift from zero, producing a small non-zero signal. The control fades in gradually as training progresses.

This is the same principle as LoRA's B=0 initialization, applied at the feature level instead of the weight level. Same safety guarantee: ensure the frozen model starts unchanged.
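
The gradual fade-in in point 3 depends on a detail that is easy to miss: zero weights do not mean zero gradients. For a 1×1 conv, the gradient with respect to a weight is the upstream error times the *input*, and neither depends on the weight's current value. The sketch below shows this for a single scalar "channel" in plain Python; `x`, `target`, and `lr` are illustrative values, not taken from the notebook.

```python
def forward(w: float, b: float, x: float) -> float:
    """A 1x1 conv on one pixel with one channel is just y = w * x + b."""
    return w * x + b


def gradients(w: float, b: float, x: float, target: float) -> tuple[float, float]:
    """Gradients of the squared error (y - target)**2 w.r.t. w and b."""
    error = forward(w, b, x) - target
    return 2 * error * x, 2 * error


x, target, lr = 1.5, 0.3, 0.1   # illustrative values
w, b = 0.0, 0.0                 # zero initialization

outputs = []
for step in range(4):
    outputs.append(forward(w, b, x))
    grad_w, grad_b = gradients(w, b, x, target)
    w -= lr * grad_w            # plain SGD update
    b -= lr * grad_b

print([round(y, 4) for y in outputs])
```

The first output is exactly zero and later ones move toward the target. The gradient passed *back through* the layer to its input is `2 * error * w`, which is zero on the very first step, so in this toy picture the layer feeding the zero conv only starts receiving gradient once the zero conv's own weights have moved.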

---

## Exercise 3: Trace the Forward Pass `[Supported]`

From the lesson, the ControlNet forward pass produces features at each resolution level that are added to the frozen encoder's skip connections:

```
d3 = decoder_block_3(cat(d4, e3 + z3))   # original + control
d2 = decoder_block_2(cat(d3, e2 + z2))
d1 = decoder_block_1(cat(d2, e1 + z1))
```

The ControlNet's outputs must have **matching shapes** at every resolution level. Let's verify this with a real model.

We will feed a Canny edge map through the ControlNet and inspect the output shapes at each level. The edge map is provided as a pre-computed tensor, so no preprocessing is needed.

**Task:** Fill in the `# TODO` markers. Each is 1–2 lines.

In [None]:
# Create a synthetic Canny edge map as our spatial conditioning input.
# In practice, you would extract edges from a real image (that is next lesson).
# Here we use a simple geometric pattern so we can focus on the architecture.

def create_synthetic_edge_map(height=512, width=512):
    """Create a simple synthetic edge map with geometric shapes.
    Returns a PIL Image with white edges on black background."""
    edge_map = np.zeros((height, width), dtype=np.uint8)

    # Draw a rectangle
    edge_map[100:400, 100] = 255
    edge_map[100:400, 400] = 255
    edge_map[100, 100:400] = 255
    edge_map[400, 100:400] = 255

    # Draw a diagonal line
    for i in range(200):
        x = 150 + i
        y = 150 + i
        if x < height and y < width:
            edge_map[x, y] = 255
            # Make the line a bit thicker
            if x + 1 < height:
                edge_map[x + 1, y] = 255

    # Draw a circle (approximate)
    cx, cy, r = 300, 300, 60
    for angle in np.linspace(0, 2 * np.pi, 360):
        x = int(cx + r * np.sin(angle))
        y = int(cy + r * np.cos(angle))
        if 0 <= x < height and 0 <= y < width:
            edge_map[x, y] = 255

    # Convert to 3-channel RGB PIL Image (ControlNet expects RGB)
    edge_rgb = np.stack([edge_map] * 3, axis=-1)
    return Image.fromarray(edge_rgb)

edge_map_pil = create_synthetic_edge_map()

plt.figure(figsize=(5, 5))
plt.imshow(edge_map_pil)
plt.title("Synthetic edge map (spatial conditioning input)", fontsize=12)
plt.axis('off')
plt.tight_layout()
plt.show()

print("This is a simple geometric edge map.")
print("In practice, you would extract these from real images using Canny edge detection.")
print("For now, we only care about the SHAPES of the feature maps, not the content.")

In [None]:
# Prepare inputs for the ControlNet forward pass.
# We need: a noisy latent (z_t), a timestep (t), a text embedding, and the edge map.

from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL

# Load the text encoder and tokenizer for creating text embeddings
tokenizer = CLIPTokenizer.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="tokenizer"
)
text_encoder = CLIPTextModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="text_encoder",
    torch_dtype=torch.float16,
).to(device)

# Move models to device
controlnet_device = controlnet.to(device)
unet_device = unet.to(device)

# Create a text embedding for a simple prompt
prompt = "a house with a garden"
text_input = tokenizer(
    prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt"
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]  # [1, 77, 768]

print(f"Text embeddings shape: {text_embeddings.shape}")
print(f"  (batch=1, sequence_length=77 tokens, embedding_dim=768)")

# Create a random noisy latent (simulating z_t at some timestep)
torch.manual_seed(42)
noisy_latent = torch.randn(1, 4, 64, 64, device=device, dtype=torch.float16)
print(f"\nNoisy latent shape: {noisy_latent.shape}")
print(f"  (batch=1, 4 latent channels, 64x64 spatial)")

# Prepare the edge map as a tensor
# ControlNet expects the conditioning image as a [B, 3, H, W] tensor in [0, 1]
edge_tensor = torch.from_numpy(np.array(edge_map_pil)).permute(2, 0, 1).unsqueeze(0)
edge_tensor = edge_tensor.float() / 255.0
edge_tensor = edge_tensor.to(device=device, dtype=torch.float16)
print(f"\nEdge map tensor shape: {edge_tensor.shape}")
print(f"  (batch=1, 3 RGB channels, 512x512 spatial)")

# Set a timestep
timestep = torch.tensor([500], device=device)  # mid-point of denoising
print(f"\nTimestep: {timestep.item()} (mid-point of denoising schedule)")

In [None]:
# Run the ControlNet forward pass and inspect the output shapes.
#
# The ControlNet returns:
#   - down_block_res_samples: features at each encoder resolution level
#   - mid_block_res_sample: the bottleneck-level feature
#
# These are the z_i values from the pseudocode:
#   z1 = zero_conv_1(c1), z2 = zero_conv_2(c2), etc.
#
# They will be ADDED to the frozen encoder's features at matching resolutions.

with torch.no_grad():
    # TODO: Call controlnet's forward pass.
    # The ControlNet takes the same inputs as the U-Net (noisy_latent, timestep,
    # text_embeddings) PLUS the conditioning image (edge_tensor).
    #
    # Call: controlnet(noisy_latent, timestep, encoder_hidden_states=text_embeddings,
    #                  controlnet_cond=edge_tensor, return_dict=False)
    # This returns a tuple: (down_block_res_samples, mid_block_res_sample)
    #
    # Hint: unpack into two variables.
    


# Print the shapes of all ControlNet outputs
print("=== ControlNet Output Shapes (z_i values) ===")
print(f"Number of down_block outputs: {len(down_block_res_samples)}")
print()
for i, sample in enumerate(down_block_res_samples):
    h, w = sample.shape[2], sample.shape[3]
    print(f"  down_block[{i}]: {str(sample.shape):30s}  (spatial: {h}x{w})")

print(f"\n  mid_block:     {str(mid_block_res_sample.shape):30s}  (spatial: {mid_block_res_sample.shape[2]}x{mid_block_res_sample.shape[3]})")
print(f"\nThese are the zero convolution outputs that get ADDED to the frozen")
print(f"encoder's skip connections at each resolution level.")

In [None]:
# Now run the U-Net forward pass WITH the ControlNet outputs.
# The U-Net accepts `down_block_additional_residuals` and `mid_block_additional_residual`
# which are added to the encoder features at each skip connection.
#
# This is the e_i + z_i operation from the pseudocode.

with torch.no_grad():
    # TODO: Call the U-Net with the ControlNet residuals.
    # Call: unet(noisy_latent, timestep, encoder_hidden_states=text_embeddings,
    #           down_block_additional_residuals=down_block_res_samples,
    #           mid_block_additional_residual=mid_block_res_sample)
    # Access the .sample attribute of the result to get the noise prediction.
    


print(f"U-Net noise prediction shape: {noise_pred_with_cn.shape}")
print(f"  (batch=1, 4 latent channels, 64x64 spatial)")
print(f"\nThis is the predicted noise epsilon_hat, same shape as the noisy latent input.")
print(f"The decoder received enriched skip connections: e_i + z_i at each resolution.")

In [None]:
# Compare: run the U-Net WITHOUT ControlNet residuals.
# This should produce a different noise prediction because the spatial conditioning is absent.

with torch.no_grad():
    noise_pred_without_cn = unet(
        noisy_latent, timestep, encoder_hidden_states=text_embeddings
    ).sample

# Compare the two predictions
diff = (noise_pred_with_cn - noise_pred_without_cn).abs()

print("=== ControlNet's Influence ===")
print(f"Noise prediction WITH ControlNet:    mean={noise_pred_with_cn.mean().item():.4f}, std={noise_pred_with_cn.std().item():.4f}")
print(f"Noise prediction WITHOUT ControlNet: mean={noise_pred_without_cn.mean().item():.4f}, std={noise_pred_without_cn.std().item():.4f}")
print(f"\nDifference (|with - without|):")
print(f"  Mean: {diff.mean().item():.4f}")
print(f"  Max:  {diff.max().item():.4f}")
print(f"  Std:  {diff.std().item():.4f}")
print(f"\nThe predictions are different because the ControlNet adds spatial control")
print(f"signals at each skip connection. The frozen model's weights are identical")
print(f"in both cases: the difference comes entirely from the additive z_i terms.")

<details>
<summary>💡 Solution</summary>

The key insight is that the ControlNet and U-Net communicate through tensor addition at matching resolution levels. The ControlNet produces residuals; the U-Net adds them to its own encoder features.

**ControlNet forward pass:**
```python
down_block_res_samples, mid_block_res_sample = controlnet(
    noisy_latent, timestep, encoder_hidden_states=text_embeddings,
    controlnet_cond=edge_tensor, return_dict=False,
)
```

The ControlNet takes the same inputs as the U-Net (noisy latent, timestep, text embeddings) **plus** the spatial conditioning image. It returns features at each encoder resolution level. These are the zero convolution outputs (z_i).

**U-Net forward pass with ControlNet residuals:**
```python
noise_pred_with_cn = unet(
    noisy_latent, timestep, encoder_hidden_states=text_embeddings,
    down_block_additional_residuals=down_block_res_samples,
    mid_block_additional_residual=mid_block_res_sample,
).sample
```

The U-Net's `down_block_additional_residuals` parameter injects the ControlNet's outputs at each skip connection. Internally, this implements `e_i + z_i`: the frozen encoder features plus the ControlNet's spatial control signals.

**Common mistake:** Forgetting `return_dict=False` on the ControlNet call. Without it, the call returns a dataclass instead of a tuple, and the two-variable unpacking above will not give you the residual tensors.

</details>

### What Just Happened

You traced a real forward pass through ControlNet and the frozen U-Net:

1. **ControlNet produces multi-resolution features.** One output per encoder resolution level, matching the U-Net's skip connection dimensions exactly. These are the $z_i$ values from the lesson's pseudocode.

2. **The U-Net adds them at each skip connection.** `down_block_additional_residuals` implements the $e_i + z_i$ operation. The frozen encoder features are enriched with spatial control signals.

3. **The ControlNet makes a measurable difference.** The noise prediction with ControlNet differs from the prediction without it. The difference comes entirely from the additive residuals; the U-Net's weights are identical in both cases.

This is the architecture from the lesson, running with real weights and real tensors.
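
The shape list printed above can be reconstructed from a few architecture constants. The sketch below is plain Python and assumes the SD v1.5 configuration (block channels 320, 640, 1280, 1280; two layers per block; a stride-2 downsampler after every block except the last) with a 64×64 latent. Those constants are assumptions here, so treat the shapes the ControlNet actually printed as the ground truth.

```python
# Assumed SD v1.5 configuration; the notebook's printed shapes are authoritative.
BLOCK_CHANNELS = (320, 640, 1280, 1280)
LAYERS_PER_BLOCK = 2
LATENT_SIZE = 64


def expected_residual_shapes(batch: int = 1) -> tuple[list[tuple[int, ...]], tuple[int, ...]]:
    """Return (down-block residual shapes, mid-block residual shape)."""
    size = LATENT_SIZE
    shapes = [(batch, BLOCK_CHANNELS[0], size, size)]    # input conv output
    for i, ch in enumerate(BLOCK_CHANNELS):
        for _ in range(LAYERS_PER_BLOCK):
            shapes.append((batch, ch, size, size))       # one skip per layer
        if i < len(BLOCK_CHANNELS) - 1:
            size //= 2                                   # stride-2 downsample
            shapes.append((batch, ch, size, size))
    mid = (batch, BLOCK_CHANNELS[-1], size, size)
    return shapes, mid


down_shapes, mid_shape = expected_residual_shapes()
for i, shape in enumerate(down_shapes):
    print(f"down_block[{i}]: {shape}")
print(f"mid_block:      {mid_shape}")
```

To compare against the real output, a check such as `[tuple(s.shape) for s in down_block_res_samples] == down_shapes` should hold if the assumed constants match the loaded checkpoint.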

---

## Exercise 4: ControlNet vs Vanilla SD Comparison `[Independent]`

The lesson's central claim: ControlNet adds a **WHERE** dimension (spatial structure) that coexists with the **WHAT** dimension (text conditioning). Same edge map, different text prompts should produce different content following the same structure.

**Your task:**
1. Build a `StableDiffusionControlNetPipeline` using the models already loaded
2. Generate an image **without** ControlNet (vanilla SD) using a prompt and a fixed seed
3. Generate an image **with** ControlNet using the same prompt, same seed, and the edge map
4. Generate a second image **with** ControlNet using a **different** prompt but the **same** edge map and seed
5. Display all three side by side

**What to verify:**
- The vanilla SD image has no spatial structure matching the edge map
- The ControlNet image follows the edge map's structure
- Changing the text prompt changes the content/style but preserves the spatial structure

**Hints:**
- Use `StableDiffusionControlNetPipeline` from diffusers (already imported)
- Use `UniPCMultistepScheduler` for fast sampling (~20 steps instead of 1000)
- Use `torch.Generator(device=device).manual_seed(seed)` for reproducible results
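- For vanilla SD (no ControlNet), you can use a `StableDiffusionPipeline` or pass a blank conditioning image. The blank image is only an approximation, because a trained ControlNet can still emit non-zero residuals for it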
- Use `num_inference_steps=20` to keep generation fast

In [None]:
# VRAM cleanup: free the standalone models from Exercises 1-3.
# The pipelines in Exercise 4 will load their own copies of these components.
# Without this cleanup, two full pipelines + the standalone models would exceed
# a T4's 16 GB VRAM budget.

import gc

del controlnet, unet, controlnet_device, unet_device, text_encoder
gc.collect()
torch.cuda.empty_cache()

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    print(f"GPU memory after cleanup: {allocated:.2f} GB allocated")
else:
    print("No GPU detected, skipping VRAM report.")

print("Standalone models deleted. Exercise 4 pipelines will load fresh copies.")

In [None]:
# Your code here.
#
# Steps:
# 1. Build a StableDiffusionControlNetPipeline
# 2. Generate without ControlNet (use a blank/zero edge map or a vanilla SD pipeline)
# 3. Generate with ControlNet + prompt A
# 4. Generate with ControlNet + prompt B (same edge map)
# 5. Display all three side by side



<details>
<summary>💡 Solution</summary>

The core insight: ControlNet controls spatial structure (WHERE), text controls content (WHAT). Keeping the edge map fixed while changing the prompt demonstrates their independence.

The solution loads one pipeline at a time to stay within T4 VRAM limits. Generate with the vanilla pipeline first, delete it, then load the ControlNet pipeline.

```python
import gc
from diffusers import StableDiffusionPipeline

# Settings
seed = 42
num_steps = 20
prompt_a = "a house with a garden, watercolor painting"
prompt_b = "a futuristic city, cyberpunk neon style"

# --- Step 1: Vanilla SD (no spatial conditioning) ---
pipe_vanilla = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe_vanilla.scheduler = UniPCMultistepScheduler.from_config(pipe_vanilla.scheduler.config)

generator = torch.Generator(device=device).manual_seed(seed)
img_vanilla = pipe_vanilla(
    prompt_a, num_inference_steps=num_steps, generator=generator,
).images[0]

# Free the vanilla pipeline before loading the ControlNet pipeline
del pipe_vanilla
gc.collect()
torch.cuda.empty_cache()

# --- Step 2: ControlNet pipeline (load ControlNet + SD together) ---
cn_model = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

pipe_cn = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=cn_model,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe_cn.scheduler = UniPCMultistepScheduler.from_config(pipe_cn.scheduler.config)

# ControlNet + prompt A
generator = torch.Generator(device=device).manual_seed(seed)
img_cn_a = pipe_cn(
    prompt_a, image=edge_map_pil, num_inference_steps=num_steps, generator=generator,
).images[0]

# ControlNet + prompt B (same edge map, same seed)
generator = torch.Generator(device=device).manual_seed(seed)
img_cn_b = pipe_cn(
    prompt_b, image=edge_map_pil, num_inference_steps=num_steps, generator=generator,
).images[0]

# Free the ControlNet pipeline
del pipe_cn, cn_model
gc.collect()
torch.cuda.empty_cache()

# --- Step 3: Display all three side by side ---
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

axes[0].imshow(edge_map_pil)
axes[0].set_title("Edge Map\n(spatial input)", fontsize=11)
axes[0].axis('off')

axes[1].imshow(img_vanilla)
axes[1].set_title(f"Vanilla SD\n\"{prompt_a}\"", fontsize=11)
axes[1].axis('off')

axes[2].imshow(img_cn_a)
axes[2].set_title(f"ControlNet + Prompt A\n\"{prompt_a}\"", fontsize=11)
axes[2].axis('off')

axes[3].imshow(img_cn_b)
axes[3].set_title(f"ControlNet + Prompt B\n\"{prompt_b}\"", fontsize=11)
axes[3].axis('off')

plt.suptitle("Spatial conditioning (WHERE) coexists with text conditioning (WHAT)", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

print("Observations:")
print("- Vanilla SD: no spatial structure matching the edge map")
print("- ControlNet + Prompt A: spatial structure follows the edges, content matches prompt A")
print("- ControlNet + Prompt B: SAME spatial structure, DIFFERENT content matches prompt B")
print("")
print("Timestep says WHEN. Text says WHAT. ControlNet says WHERE.")
print("Three conditioning signals, three mechanisms, all coexisting.")
```

**Notes:**
- We use `safety_checker=None` to avoid downloading the safety checker model (saves memory on Colab).
- `UniPCMultistepScheduler` allows fast sampling in ~20 steps instead of the 1000 steps of DDPM. Only the sampler changes; the architecture from the lesson stays the same.
- The same seed ensures the initial noise is identical across all three runs, isolating the effect of ControlNet.
- The solution loads and deletes pipelines sequentially to stay within a T4's 16 GB VRAM budget. The vanilla pipeline is deleted before the ControlNet pipeline is loaded.
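
The solution re-creates the generator before every call instead of reusing one. A generator is stateful: each draw advances it, so a second call on the same object would start from different noise. The sketch below illustrates this with Python's standard `random` module as a stand-in (it is not `torch.Generator`, though the same reasoning applies).

```python
import random


def initial_noise(rng: random.Random, n: int = 4) -> list[float]:
    """Draw n Gaussian samples, standing in for the initial latent noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


seed = 42

# Fresh generator per run: identical starting noise.
run_a = initial_noise(random.Random(seed))
run_b = initial_noise(random.Random(seed))

# One generator reused across runs: the second draw continues the stream.
shared = random.Random(seed)
run_c = initial_noise(shared)
run_d = initial_noise(shared)

print("fresh generators match: ", run_a == run_b)
print("reused generator matches:", run_c == run_d)
```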

</details>

---

## Key Takeaways

1. **ControlNet adds ~35–40% of the U-Net's parameters, not 100%.** The trainable copy is the encoder half only. The frozen decoder runs once, receiving enriched skip connections from both the frozen encoder and the ControlNet copy.

2. **Zero convolutions are standard 1×1 convs with zero initialization.** The output is exactly zero before training. Adding zero to the frozen features leaves them unchanged: $e_i + 0 = e_i$. The control signal fades in gradually as training progresses.

3. **ControlNet produces multi-resolution features that match the frozen encoder's shapes exactly.** Each output is added at the corresponding skip connection. The only change to the forward pass is $e_i + z_i$ instead of $e_i$.

4. **Text and spatial conditioning coexist.** Same edge map + different prompts = same structure, different content. Timestep says WHEN. Text says WHAT. ControlNet says WHERE. Three conditioning signals, three mechanisms, all active simultaneously.

5. **The frozen model is never modified.** Disconnect the ControlNet and the output is vanilla SD, bit-for-bit identical. This is the same safety principle as LoRA's B=0 initialization, applied at the feature level.
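
The LoRA comparison in takeaway 5 can be made concrete with a toy example. Both schemes add a term that is exactly zero at initialization: LoRA adds $BA$ to a frozen weight with $B = 0$, and ControlNet adds $z_i$ to a frozen feature while the zero conv weights are still zero. The matrices below are made-up 2×2 values for illustration only, and the 1×1 conv is written as a channel-mixing matrix product (rows are channels, columns are pixels).

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


frozen_weight: Matrix = [[0.5, -1.0], [2.0, 0.25]]   # made-up frozen W
frozen_feature: Matrix = [[1.0, 3.0], [-2.0, 0.5]]   # made-up frozen e_i

# LoRA: weight-level update W + B @ A, with B initialized to zero.
lora_a: Matrix = [[0.3, -0.7], [1.1, 0.2]]           # A is non-zero at init
lora_b: Matrix = [[0.0, 0.0], [0.0, 0.0]]
lora_weight = add(frozen_weight, matmul(lora_b, lora_a))

# ControlNet: feature-level update e_i + zero_conv(c_i), zero conv weights at zero.
control_feature: Matrix = [[4.0, -1.5], [0.8, 2.2]]  # non-zero encoder-copy output
zero_conv_weight: Matrix = [[0.0, 0.0], [0.0, 0.0]]
enriched = add(frozen_feature, matmul(zero_conv_weight, control_feature))

print("LoRA leaves W unchanged:        ", lora_weight == frozen_weight)
print("ControlNet leaves e_i unchanged:", enriched == frozen_feature)
```

In both cases the non-zero factor (`lora_a`, `control_feature`) is multiplied by a zero factor, so the frozen quantity passes through untouched until training moves the zero factor.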