# The Stable Diffusion Pipeline

**Module 6.4, Lesson 1** | CourseAI

In the lesson, you traced the complete Stable Diffusion pipeline on paper: text prompt -> CLIP tokenizer -> CLIP text encoder -> denoising loop (U-Net x2 for CFG, timestep conditioning, cross-attention) -> VAE decoder -> pixel image. Now you will do it hands-on with real pre-trained components.

**What you will do:**
- Load the three SD components separately (CLIP, VAE, U-Net) and inspect their parameter counts
- Trace the CLIP stage: tokenize a prompt, inspect token IDs and padding, verify the output embedding shape
- Trace one denoising step manually: two U-Net forward passes, CFG combination, scheduler step
- Execute the complete pipeline manually from text prompt to generated image, verifying every tensor shape

**For each exercise, PREDICT the output before running the cell.**

This is a CONSOLIDATE notebook. No new algorithms, no new math. Every concept here is something you already know from Modules 6.1–6.3. The exercises verify that you can trace the pipeline and identify components, not implement anything from scratch.

**Estimated time:** 30–45 minutes.

---

## Setup

Run this cell to install dependencies and import everything.

In [None]:
!pip install -q diffusers transformers accelerate

import torch
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import time

# Reproducible results
torch.manual_seed(42)
np.random.seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dtype = torch.float16 if device.type == 'cuda' else torch.float32
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')

print('\nSetup complete.')

---

## Exercise 1: Load and Inspect Components [Guided]

Stable Diffusion is not one big model. It is **three independently trained models** connected by tensor handoffs:

1. **CLIP text encoder** — translates text to embedding vectors
2. **VAE** — translates between pixel space and latent space
3. **U-Net** — denoises latent tensors, conditioned on text and timestep

Each was trained separately, with a different loss, on different data. They were never trained together.

In this exercise, you will load each component separately from `diffusers` and inspect their parameter counts. The lesson predicted: CLIP ~123M, VAE ~84M, U-Net ~860M, total ~1.07B.

**Before running, predict:**
- How many parameters does the U-Net have? (The lesson said ~860M. It is by far the largest component because denoising at multiple scales is a hard task.)
- What is the total parameter count across all three components? (~1.07B)
- Which component has the fewest parameters? (The VAE at ~84M. It only needs to encode/decode images, not generate them.)

In [None]:
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

model_id = 'stable-diffusion-v1-5/stable-diffusion-v1-5'

# Load each component separately -- they live in different subfolders
# of the same model repository. This mirrors the modular architecture:
# each component is independent and swappable.

print('Loading CLIP tokenizer and text encoder...')
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained(
    model_id, subfolder='text_encoder', torch_dtype=dtype
).to(device)

print('Loading VAE...')
vae = AutoencoderKL.from_pretrained(
    model_id, subfolder='vae', torch_dtype=dtype
).to(device)

print('Loading U-Net...')
unet = UNet2DConditionModel.from_pretrained(
    model_id, subfolder='unet', torch_dtype=dtype
).to(device)

# Set all to eval mode -- no training, no gradients
text_encoder.eval()
vae.eval()
unet.eval()

# Count parameters for each component
def count_params(model):
    return sum(p.numel() for p in model.parameters())

clip_params = count_params(text_encoder)
vae_params = count_params(vae)
unet_params = count_params(unet)
total_params = clip_params + vae_params + unet_params

print('\n=== Component Parameter Counts ===')
print(f'CLIP text encoder: {clip_params / 1e6:>8.1f}M parameters')
print(f'VAE:               {vae_params / 1e6:>8.1f}M parameters')
print(f'U-Net:             {unet_params / 1e6:>8.1f}M parameters')
print(f'{"-" * 42}')
print(f'Total:             {total_params / 1e6:>8.1f}M parameters')
print()
print(f'The U-Net is {unet_params / clip_params:.1f}x larger than CLIP')
print(f'The U-Net is {unet_params / vae_params:.1f}x larger than the VAE')
print(f'The U-Net accounts for {unet_params / total_params * 100:.0f}% of total parameters')
print()
print('Each component was trained independently:')
print('  CLIP:  contrastive loss on 400M text-image pairs (by OpenAI)')
print('  VAE:   perceptual + adversarial loss on image reconstruction')
print('  U-Net: MSE loss on noise prediction in latent space')
print()
print('They communicate through tensor shapes, not shared weights.')
print('That is why you can swap any component independently.')

### What Just Happened

You loaded the three components of Stable Diffusion separately and verified their parameter counts:

- **CLIP text encoder:** ~123M parameters. Translates text prompts into 768-dimensional embedding vectors.
- **VAE:** ~84M parameters. Translates between pixel space (512x512x3) and latent space (64x64x4).
- **U-Net:** ~860M parameters. The denoising workhorse. Accounts for ~80% of the total parameters.

The parameter counts confirm the lesson's predictions. The U-Net is by far the largest component because denoising at multiple scales (with cross-attention at each resolution, adaptive group normalization, and skip connections) is a complex task. But parameter count does not equal importance: without CLIP, you lose text control; without the VAE, you lose the 48x compression (512x512x3 = 786,432 pixel values down to 64x64x4 = 16,384 latent values) that makes the denoising loop affordable.
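
The arithmetic behind those two claims can be checked in a few lines. This is a sketch using the lesson's rounded parameter counts, not the exact values the cell above prints; the dictionary keys are illustrative names.

```python
# Approximate parameter counts quoted in the lesson, in millions (rounded).
params_m = {"clip_text_encoder": 123, "vae": 84, "unet": 860}
total_m = sum(params_m.values())
unet_share = params_m["unet"] / total_m

# Number of values per image in pixel space vs. latent space.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
compression = pixel_values / latent_values

print(f"Total: ~{total_m / 1000:.2f}B parameters")
print(f"U-Net share of parameters: {unet_share:.1%}")
print(f"Latent tensor is {compression:.0f}x smaller than the pixel tensor")
```

The 48x figure is a ratio of tensor sizes. How much wall-clock time it saves depends on the hardware and the architecture.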

The key insight: **three independently trained models, connected by tensor handoffs.** Each was trained with a different loss on different data. They were never trained together.

---

## Exercise 2: Trace the CLIP Stage [Guided]

The pipeline starts with text. But the U-Net cannot read text — it operates on tensors. CLIP is the translator.

The CLIP stage has two parts:
1. **Tokenizer:** splits text into subword tokens, adds special tokens (SOT/EOT), pads to a fixed length of **77 tokens**
2. **Text encoder:** a transformer that processes the 77 token IDs and outputs 77 contextual embedding vectors, each 768 dimensions

The lesson predicted: any prompt, regardless of length, produces a **[batch, 77, 768]** output tensor. The 77 is CLIP's fixed context length (shorter prompts are padded up to it, longer ones are truncated); the 768 is CLIP's embedding dimension. This [77, 768] tensor is the **only thing** the rest of the pipeline sees.

**Before running, predict:**
- How many token positions will the tokenizer output? (77 — always, regardless of prompt length. Shorter prompts are padded.)
- What shape will the CLIP text encoder output? ([1, 77, 768] — batch size 1, 77 token positions, 768 embedding dimensions.)
- If you tokenize a completely different prompt (shorter or longer), will the output shape change? (No — both will be [1, 77, 768]. The padding ensures a fixed-size output.)

In [None]:
# ---- Part 1: Tokenize the prompt ----
prompt = "a cat sitting on a beach at sunset"

# Tokenize with padding to 77 (CLIP's fixed context length)
tokens = tokenizer(
    prompt,
    return_tensors='pt',
    padding='max_length',
    max_length=77,
    truncation=True,
)

token_ids = tokens.input_ids  # (1, 77)

print(f'Prompt: "{prompt}"')
print(f'Token IDs shape: {token_ids.shape}')
print(f'Token IDs: {token_ids[0].tolist()}')
print()

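# Decode individual tokens to see what the tokenizer did
decoded_tokens = [tokenizer.decode([tid]) for tid in token_ids[0].tolist()]
# Count real tokens with the attention mask. Comparing against pad_token_id
# would miss the EOT token, because this tokenizer pads with the EOT ID.
non_pad_count = int(tokens.attention_mask[0].sum())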

print(f'First {non_pad_count} tokens (non-padding):')
for i in range(non_pad_count):
    tid = token_ids[0, i].item()
    print(f'  Position {i:2d}: ID {tid:>5d} -> "{decoded_tokens[i]}"')

print(f'\nRemaining {77 - non_pad_count} positions are padding (ID {tokenizer.pad_token_id})')
print(f'Total: {token_ids.shape[1]} token positions (always 77, regardless of prompt length)')
print()
print('Notice the special tokens:')
print(f'  Position 0: SOT (start-of-text) token ID {token_ids[0, 0].item()}')
print(f'  Position {non_pad_count - 1}: EOT (end-of-text) token ID {token_ids[0, non_pad_count - 1].item()}')

In [None]:
# ---- Part 2: Run the CLIP text encoder ----
with torch.no_grad():
    text_embeddings = text_encoder(
        token_ids.to(device)
    ).last_hidden_state  # (1, 77, 768)

print(f'Text embeddings shape: {text_embeddings.shape}')
print(f'Text embeddings dtype: {text_embeddings.dtype}')
print(f'Value range: [{text_embeddings.min():.3f}, {text_embeddings.max():.3f}]')
print()
print('This [1, 77, 768] tensor is what the U-Net receives via cross-attention.')
print('Each of the 77 positions is a 768-dimensional contextual embedding.')
print('These are NOT simple lookup embeddings -- each token\'s representation')
print('includes context from all other tokens via self-attention inside CLIP.')

In [None]:
# ---- Part 3: Verify shape is prompt-independent ----
# The lesson claimed: 77x768 regardless of prompt length. Let's verify.

short_prompt = "dog"
long_prompt = "a highly detailed oil painting of a golden retriever puppy playing in a field of wildflowers under a dramatic sky with rays of light"

for name, p in [("short", short_prompt), ("long", long_prompt)]:
    t = tokenizer(p, return_tensors='pt', padding='max_length', max_length=77, truncation=True)
    non_pad = int(t.attention_mask[0].sum())  # attention mask counts SOT + words + EOT
    with torch.no_grad():
        emb = text_encoder(t.input_ids.to(device)).last_hidden_state
    print(f'{name:5s} prompt: "{p}"')
    print(f'       Non-padding tokens: {non_pad}/77, Embedding shape: {emb.shape}')
    print()

print('Both produce [1, 77, 768] -- the shape is always the same.')
print()
print('Why? The tokenizer PADS short prompts to 77 and TRUNCATES long ones to 77.')
print('CLIP was trained with a fixed context length of 77 tokens.')
print('This fixed shape is what makes the interface between CLIP and the U-Net')
print('standardized: the U-Net always receives [batch, 77, 768] as K/V for cross-attention,')
print('regardless of what the user typed.')

### What Just Happened

You traced the complete CLIP stage:

1. **Tokenizer:** The prompt "a cat sitting on a beach at sunset" was split into subword tokens, with SOT/EOT special tokens added and padding to exactly 77 positions. This is the same BPE-style tokenization you learned in **Tokenization** (Series 4), just with CLIP's vocabulary.

2. **Text encoder:** CLIP's transformer processed the 77 token IDs and output a [1, 77, 768] tensor of contextual embeddings. These are the K and V inputs for cross-attention inside the U-Net.

3. **Shape invariance:** Different prompts (short, long) all produce the same [1, 77, 768] output. The tokenizer handles padding/truncation. This fixed interface is what makes the pipeline modular: the U-Net always expects [batch, 77, 768], no matter what.

The U-Net will never see the text "a cat sitting on a beach at sunset." It will only see a 77x768 tensor of floating-point numbers. **CLIP is the translator from human language to the geometric representation space the U-Net operates in.**
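
The padding and truncation logic can be sketched without the real tokenizer. The toy below assumes SOT ID 49406 and EOT ID 49407, and assumes padding reuses the EOT ID. The word IDs are made up, and `to_fixed_length` is a hypothetical helper, not part of `transformers`.

```python
# Toy illustration only: real CLIP tokenization uses a learned BPE vocabulary.
SOT_ID = 49406   # assumed start-of-text ID
EOT_ID = 49407   # assumed end-of-text ID
PAD_ID = EOT_ID  # assumption: padding reuses the end-of-text ID
MAX_LENGTH = 77

def to_fixed_length(word_ids, max_length=MAX_LENGTH):
    """Wrap word IDs in SOT/EOT, truncating if needed, then pad to max_length."""
    body = list(word_ids)[: max_length - 2]   # leave room for SOT and EOT
    ids = [SOT_ID] + body + [EOT_ID]
    real_count = len(ids)
    ids = ids + [PAD_ID] * (max_length - real_count)
    # 1 marks a real token (including SOT/EOT), 0 marks padding.
    mask = [1] * real_count + [0] * (max_length - real_count)
    return ids, mask

short_ids, short_mask = to_fixed_length([1929])            # one made-up word ID
long_ids, long_mask = to_fixed_length(range(1000, 1100))   # 100 made-up word IDs

print(len(short_ids), sum(short_mask))  # 77 3
print(len(long_ids), sum(long_mask))    # 77 77
```

Under this assumption the padding ID and the EOT ID are identical, so the mask is the reliable way to count real tokens. Comparing IDs against the pad ID would miss the EOT.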

---

## Exercise 3: Trace One Denoising Step [Supported]

The denoising loop is the heart of the pipeline. At each step, the U-Net runs **twice** (once unconditional, once with text), and the results are combined with the CFG formula.

Your task: trace a single denoising step manually. You will:
1. Sample random noise z_T in latent space
2. Run one U-Net forward pass with the text embeddings (conditional)
3. Run a second U-Net forward pass with empty-string embeddings (unconditional)
4. Apply the CFG formula: eps_cfg = eps_uncond + w * (eps_cond - eps_uncond)
5. Apply one scheduler step to get z_{T-1}
6. Compare z_T and z_{T-1}

**Hints:**
- z_T has shape [1, 4, 64, 64] (batch=1, 4 latent channels, 64x64 spatial)
- The U-Net expects: `unet(sample, timestep, encoder_hidden_states=...).sample`
- Empty-string embeddings: tokenize `""` and run through the text encoder, same as any other prompt
- The scheduler needs `scheduler.set_timesteps(num_steps)` before use, and `scheduler.step(noise_pred, t, z_t).prev_sample` for a step

In [None]:
from diffusers import DDPMScheduler

# Load the DDPM scheduler (the same algorithm from Module 6.2)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder='scheduler')
num_steps = 50
scheduler.set_timesteps(num_steps)

print(f'Scheduler: {scheduler.__class__.__name__}')
print(f'Number of denoising steps: {num_steps}')
print(f'Timesteps: {scheduler.timesteps[:5].tolist()} ... {scheduler.timesteps[-3:].tolist()}')
print(f'  (From high noise to low noise, just like the sampling loop in Module 6.2)')

In [None]:
# ---- Step 1: Sample z_T (pure noise in latent space) ----

# TODO: Create a random tensor of shape (1, 4, 64, 64) on the correct device and dtype.
# Hint: torch.randn(shape, device=device, dtype=dtype)
z_t = None  # TODO

if z_t is not None:
    print(f'z_T shape: {z_t.shape}')   # Should be (1, 4, 64, 64)
    print(f'z_T dtype: {z_t.dtype}')
    print(f'z_T value range: [{z_t.min():.3f}, {z_t.max():.3f}]')
    print(f'z_T mean: {z_t.mean():.4f} (should be near 0)')
    print(f'z_T std: {z_t.std():.4f} (should be near 1)')
    print()
    print('This is the starting point for generation: pure random noise in latent space.')
    print('No input image, no VAE encoder. Just N(0, I).')
else:
    print('Fill in the TODO above.')

In [None]:
# ---- Step 2: Prepare text embeddings (conditional and unconditional) ----

# The conditional embeddings are the ones we computed in Exercise 2.
# We already have `text_embeddings` from that exercise: shape (1, 77, 768).

# TODO: Create the unconditional embeddings by encoding an empty string "".
# Hint: Use the same tokenizer + text_encoder pattern from Exercise 2.
# 1. Tokenize "" with the same padding settings
# 2. Run through text_encoder to get the embeddings
uncond_tokens = None  # TODO: tokenizer("", return_tensors='pt', padding='max_length', max_length=77, truncation=True)
uncond_embeddings = None  # TODO: text_encoder(uncond_tokens.input_ids.to(device)).last_hidden_state

if uncond_embeddings is not None:
    print(f'Conditional embeddings shape:   {text_embeddings.shape}')
    print(f'Unconditional embeddings shape: {uncond_embeddings.shape}')
    print()
    print('Both are [1, 77, 768] -- same shape, different content.')
    print('The conditional embeddings encode "a cat sitting on a beach at sunset".')
    print('The unconditional embeddings encode "" (empty string).')
    print('The U-Net will run once with each, and CFG will combine them.')
else:
    print('Fill in the TODOs above.')

In [None]:
# ---- Step 3: Run two U-Net forward passes ----

# We will use the FIRST timestep from the scheduler (highest noise level).
t = scheduler.timesteps[0]
print(f'Timestep: {t}')
print()

with torch.no_grad():
    # Unconditional pass: U-Net with empty-string embeddings
    eps_uncond = unet(z_t, t, encoder_hidden_states=uncond_embeddings).sample

    # Conditional pass: U-Net with real text embeddings
    eps_cond = unet(z_t, t, encoder_hidden_states=text_embeddings).sample

print(f'eps_uncond shape: {eps_uncond.shape}')  # Should match z_t: (1, 4, 64, 64)
print(f'eps_cond shape:   {eps_cond.shape}')    # Should match z_t: (1, 4, 64, 64)
print(f'z_t shape:        {z_t.shape}')
print()
print('Both noise predictions have the SAME shape as z_t.')
print('The U-Net predicts: "this is how much noise I think is in z_t."')
print('Same architecture, same weights, same z_t, same timestep.')
print('The ONLY difference: which text embeddings cross-attention used.')

In [None]:
# ---- Step 4: Apply the CFG formula ----

guidance_scale = 7.5  # Typical value from the lesson

# TODO: Apply the CFG formula.
# eps_cfg = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
eps_cfg = None  # TODO

if eps_cfg is not None:
    print(f'eps_cfg shape: {eps_cfg.shape}')  # Same as z_t: (1, 4, 64, 64)
    print(f'guidance_scale: {guidance_scale}')
    print()
    
    # How different are the conditional and unconditional predictions?
    diff = (eps_cond - eps_uncond).float()
    print(f'Mean |eps_cond - eps_uncond|: {diff.abs().mean():.4f}')
    print(f'This is the "text direction" -- the signal that CFG amplifies by {guidance_scale}x.')
    print()
    print('The CFG formula steers TOWARD the text prompt.')
    print('At w=1.0: no amplification (just eps_cond).')
    print('At w=7.5: strong amplification of the text direction.')
    print('At w=20+: oversaturated, artifacts.')
else:
    print('Fill in the TODO above.')

In [None]:
# ---- Step 5: Apply one scheduler step ----

# TODO: Use the scheduler to compute z_{t-1} from z_t and the CFG noise prediction.
# Hint: scheduler.step(eps_cfg, t, z_t).prev_sample
z_t_minus_1 = None  # TODO

if z_t_minus_1 is not None:
    print(f'z_t shape:     {z_t.shape}')
    print(f'z_{{t-1}} shape: {z_t_minus_1.shape}')
    print()
    
    # Compare z_t and z_{t-1}
    z_diff = (z_t.float() - z_t_minus_1.float()).abs()
    print(f'Mean |z_t - z_{{t-1}}|: {z_diff.mean():.4f}')
    print(f'z_t std:     {z_t.float().std():.4f}')
    print(f'z_{{t-1}} std: {z_t_minus_1.float().std():.4f}')
    print()
    print('z_{t-1} is slightly different from z_t. One small step of denoising.')
    print('This is the reverse step formula from Sampling and Generation (Module 6.2).')
    print('Repeat this 49 more times and you get z_0 -- a clean latent.')
else:
    print('Fill in the TODO above.')

<details>
<summary>Solution</summary>

The key insight: one denoising step requires **two full U-Net forward passes** (unconditional + conditional), a CFG combination, and a scheduler step. With 50 steps, that is 100 U-Net forward passes total. CFG is not post-processing -- it is woven into every step.

**Step 1: Sample z_T**
```python
z_t = torch.randn(1, 4, 64, 64, device=device, dtype=dtype)
```

**Step 2: Unconditional embeddings**
```python
uncond_tokens = tokenizer("", return_tensors='pt', padding='max_length', max_length=77, truncation=True)
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_tokens.input_ids.to(device)).last_hidden_state
```

**Step 4: CFG formula**
```python
eps_cfg = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```
This is the exact formula from **Text Conditioning & Guidance**: amplify the direction that the text embeddings push the prediction. At w=7.5, the text direction is amplified 7.5x.
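
To see what the guidance scale does, it helps to run the formula on a few made-up numbers instead of full latent tensors. This is a sketch on plain Python lists; `cfg_combine` is a hypothetical helper that applies the same element-wise formula.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Element-wise eps_uncond + w * (eps_cond - eps_uncond) on flat lists."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, -2.0]   # made-up values standing in for U-Net outputs
eps_cond = [1.0, 1.0, -1.0]

for w in (0.0, 1.0, 7.5):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

At w=0 the result is the unconditional prediction, at w=1 it is the conditional prediction, and at w=7.5 it extrapolates past the conditional prediction. Elements where the two predictions agree (the middle one here) do not move at any scale.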

**Step 5: Scheduler step**
```python
z_t_minus_1 = scheduler.step(eps_cfg, t, z_t).prev_sample
```
This applies the reverse step formula from Module 6.2. The scheduler uses eps_cfg and the noise schedule to compute z_{t-1}.
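
For intuition, here is a scalar sketch of the textbook DDPM reverse-step mean. It is not the library's code: the real scheduler works on tensors, adds sampled noise, and may clip or otherwise differ in details. The function names and the cumulative-alpha values are made up for illustration.

```python
import math

def predict_x0(x_t, eps, abar_t):
    """Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)

def posterior_mean(x_t, eps, abar_t, abar_prev):
    """Mean of the DDPM reverse step; the sampled noise term is omitted."""
    alpha_t = abar_t / abar_prev          # effective alpha between the two timesteps
    beta_t = 1.0 - alpha_t
    x0_hat = predict_x0(x_t, eps, abar_t)
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t

# Made-up scalar "latent": build x_t from a known x0 and noise, then step back.
x0, eps = 0.8, -0.3
abar_t, abar_prev = 0.5, 0.7              # hypothetical cumulative alphas
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

print(predict_x0(x_t, eps, abar_t))
print(posterior_mean(x_t, eps, abar_t, abar_prev))
```

When the noise prediction is exact, `predict_x0` recovers the clean value. The posterior mean then blends that estimate with the current noisy value, weighted by the noise schedule.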

Common mistake: forgetting the `.prev_sample` accessor. `scheduler.step()` returns an output object that wraps the new latent, not a bare tensor.

</details>

---

## Exercise 4: Full Pipeline Trace [Independent]

You now have all the pieces. In this exercise, you will execute the **complete Stable Diffusion pipeline** manually:

1. **Tokenize** the prompt
2. **CLIP encode** the tokens (conditional + unconditional embeddings)
3. **Sample z_T** from N(0, I) in latent space
4. **Run the full denoising loop** (50 steps, each with two U-Net passes + CFG + scheduler step)
5. **VAE decode** z_0 to get a pixel image
6. **Display** the generated image

At every stage, print the tensor shape and verify it matches the lesson's predictions:
- After tokenizer: [1, 77] int IDs
- After CLIP: [1, 77, 768] float embeddings
- z_T: [1, 4, 64, 64]
- U-Net output at each step: [1, 4, 64, 64]
- After VAE decode: [1, 3, 512, 512]
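
One way to make these checks explicit is a small lookup of expected shapes. This is a sketch: `EXPECTED_SHAPES` and `check_shape` are hypothetical helpers, and the stage names are arbitrary labels.

```python
# Expected shapes from the lesson, for batch size 1 and a 512x512 output.
EXPECTED_SHAPES = {
    "token_ids": (1, 77),
    "clip_embeddings": (1, 77, 768),
    "latent": (1, 4, 64, 64),
    "unet_output": (1, 4, 64, 64),
    "decoded_image": (1, 3, 512, 512),
}

def check_shape(stage, shape):
    """Raise if a tensor shape breaks the expected interface for a stage."""
    expected = EXPECTED_SHAPES[stage]
    if tuple(shape) != expected:
        raise ValueError(f"{stage}: expected {expected}, got {tuple(shape)}")
    print(f"{stage:16s} {tuple(shape)} OK")

check_shape("latent", (1, 4, 64, 64))
```

Because `tuple(...)` accepts any sequence, you can pass a tensor's `.shape` directly, for example `check_shape("latent", z.shape)` if your latent is named `z`.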

**Time each stage** and answer the reflection question: which component took the longest to run? Why?

**Key tips:**
- Use all the components already loaded: `tokenizer`, `text_encoder`, `unet`, `vae`, `scheduler`
- Re-initialize the scheduler timesteps with `scheduler.set_timesteps(50)`
- The denoising loop iterates over `scheduler.timesteps`
- VAE decode: `vae.decode(z_0 / vae.config.scaling_factor).sample`
- To display: rescale the decoded image from [-1, 1] to [0, 1], clip, and move the channel axis last before calling `plt.imshow` (or converting to PIL)
- Wrap everything in `torch.no_grad()` for inference
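
The display tip can be sketched on a NumPy array standing in for the decoded image. This assumes the decoder output is roughly in [-1, 1] with channels first; `to_display_image` is a hypothetical helper, and the random input is a placeholder for a real decode.

```python
import numpy as np

def to_display_image(decoded):
    """Map a (3, H, W) array in roughly [-1, 1] to an (H, W, 3) array in [0, 1]."""
    hwc = np.transpose(decoded, (1, 2, 0))        # channels-first -> channels-last
    return np.clip((hwc + 1.0) / 2.0, 0.0, 1.0)   # rescale, then clip any overshoot

# Placeholder data that overshoots [-1, 1] slightly, as decoder outputs can.
fake_decoded = np.random.default_rng(0).uniform(-1.2, 1.2, size=(3, 8, 8))
img = to_display_image(fake_decoded)
print(img.shape, float(img.min()), float(img.max()))
```

With a real tensor you would first drop the batch dimension and move the data to a float32 NumPy array on the CPU.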

In [None]:
# YOUR CODE HERE
#
# Execute the complete Stable Diffusion pipeline manually:
#   1. Tokenize the prompt "a cat sitting on a beach at sunset"
#   2. Encode with CLIP (conditional + unconditional embeddings)
#   3. Sample z_T from N(0, I)
#   4. Denoising loop: for each timestep, two U-Net passes + CFG + scheduler step
#   5. VAE decode z_0 to pixel image
#   6. Display the result
#
# Print tensor shapes at every stage.
# Time each stage (CLIP, denoising loop, VAE decode).
#
# Reflection: Which component took the longest? Why?


<details>
<summary>Solution</summary>

The key insight: you are manually executing the exact pipeline that the lesson traced on paper. Every tensor shape matches the predictions. The denoising loop dominates compute time because it runs 100 U-Net forward passes (50 steps x 2 for CFG), while CLIP and the VAE decoder each run only once.

```python
prompt = "a cat sitting on a beach at sunset"
num_steps = 50
guidance_scale = 7.5

# Re-seed for reproducibility
torch.manual_seed(42)

# Re-initialize scheduler
scheduler.set_timesteps(num_steps)

with torch.no_grad():
    # ---- Stage 1: CLIP encoding ----
    t0 = time.time()
    
    # Tokenize
    tokens = tokenizer(
        prompt, return_tensors='pt',
        padding='max_length', max_length=77, truncation=True
    )
    print(f'Token IDs shape: {tokens.input_ids.shape}')  # [1, 77]
    
    # Conditional embeddings
    cond_emb = text_encoder(tokens.input_ids.to(device)).last_hidden_state
    print(f'Conditional embeddings shape: {cond_emb.shape}')  # [1, 77, 768]
    
    # Unconditional embeddings
    uncond_tok = tokenizer(
        "", return_tensors='pt',
        padding='max_length', max_length=77, truncation=True
    )
    uncond_emb = text_encoder(uncond_tok.input_ids.to(device)).last_hidden_state
    print(f'Unconditional embeddings shape: {uncond_emb.shape}')  # [1, 77, 768]
    
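    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    clip_time = time.time() - t0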
    print(f'\nCLIP stage time: {clip_time:.2f}s')
    
    # ---- Stage 2: Sample z_T ----
    z = torch.randn(1, 4, 64, 64, device=device, dtype=dtype)
    print(f'\nz_T shape: {z.shape}')  # [1, 4, 64, 64]
    
    # ---- Stage 3: Denoising loop ----
    t0 = time.time()
    
    for i, t in enumerate(scheduler.timesteps):
        # Two U-Net passes for CFG
        eps_uncond = unet(z, t, encoder_hidden_states=uncond_emb).sample
        eps_cond = unet(z, t, encoder_hidden_states=cond_emb).sample
        
        # CFG combine
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        
        # Scheduler step
        z = scheduler.step(eps, t, z).prev_sample
        
        if i == 0:
            print(f'\nStep {i}: U-Net output shape: {eps_cond.shape}, z shape: {z.shape}')
    
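    if device.type == 'cuda':
        torch.cuda.synchronize()  # GPU kernels run asynchronously; wait before timing
    denoise_time = time.time() - t0
    print(f'\nDenoising loop time ({num_steps} steps, {num_steps * 2} U-Net passes): {denoise_time:.2f}s')
    print(f'z_0 shape: {z.shape}')  # [1, 4, 64, 64]
    
    # ---- Stage 4: VAE decode ----
    t0 = time.time()
    
    image = vae.decode(z / vae.config.scaling_factor).sample
    
    if device.type == 'cuda':
        torch.cuda.synchronize()  # make sure the decode has finished
    vae_time = time.time() - t0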
    print(f'\nVAE decode time: {vae_time:.2f}s')
    print(f'Image shape: {image.shape}')  # [1, 3, 512, 512]

# ---- Display ----
img = image.detach().cpu().float().squeeze(0).permute(1, 2, 0).numpy()
img = np.clip((img + 1.0) / 2.0, 0.0, 1.0)

plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.title(f'"{prompt}"\n{num_steps} steps, guidance_scale={guidance_scale}', fontsize=11)
plt.axis('off')
plt.tight_layout()
plt.show()

# ---- Timing summary ----
total_time = clip_time + denoise_time + vae_time
print(f'\n=== Timing Summary ===')
print(f'CLIP encoding:   {clip_time:>6.2f}s  ({clip_time/total_time*100:>4.1f}%)')
print(f'Denoising loop:  {denoise_time:>6.2f}s  ({denoise_time/total_time*100:>4.1f}%)')
print(f'VAE decode:      {vae_time:>6.2f}s  ({vae_time/total_time*100:>4.1f}%)')
print(f'Total:           {total_time:>6.2f}s')
print()
print(f'The denoising loop took {denoise_time/total_time*100:.0f}% of total time.')
print(f'Why? It runs {num_steps * 2} U-Net forward passes ({num_steps} steps x 2 for CFG).')
print(f'The U-Net has ~860M parameters. Each forward pass processes a [1, 4, 64, 64]')
print(f'tensor through the full encoder-decoder architecture with cross-attention.')
print(f'CLIP (once) and VAE decode (once) are negligible in comparison.')
```

**Shape verification summary:**
| Stage | Shape | Matches lesson? |
|-------|-------|-----------------|
| Token IDs | [1, 77] | Yes |
| CLIP embeddings | [1, 77, 768] | Yes |
| z_T | [1, 4, 64, 64] | Yes |
| U-Net output | [1, 4, 64, 64] | Yes |
| z_0 | [1, 4, 64, 64] | Yes |
| Decoded image | [1, 3, 512, 512] | Yes |

**Reflection:** The denoising loop dominates because it runs 100 U-Net forward passes through a ~860M parameter network. CLIP runs once (~123M params). The VAE decoder runs once, and it is only the decoder half of the ~84M-parameter VAE. The U-Net is both the largest component AND runs the most times. This is why faster samplers (next lesson) are so valuable -- reducing 50 steps to 20 steps cuts the U-Net passes from 100 to 40.
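
The pass count is simple enough to put in a helper. This is a sketch: `unet_forward_passes` is a hypothetical function. It counts passes only, which is a rough proxy for cost rather than a timing measurement.

```python
def unet_forward_passes(num_steps, use_cfg=True):
    """Count U-Net forward passes for one image: two per step with CFG, else one."""
    return num_steps * (2 if use_cfg else 1)

for steps in (50, 20):
    print(f"{steps} steps -> {unet_forward_passes(steps)} U-Net passes with CFG")
```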

</details>

---

## Key Takeaways

1. **Stable Diffusion is three independently trained models connected by tensor handoffs.** CLIP (~123M params) translates text to [77, 768] embeddings. The U-Net (~860M params) denoises [4, 64, 64] latents using cross-attention for text, adaptive group norm for timestep, and CFG for amplification. The VAE (~84M params) decoder translates the final [4, 64, 64] latent to a [3, 512, 512] pixel image.

2. **Every tensor shape matches the lesson's predictions.** Token IDs: [77]. CLIP embeddings: [77, 768]. Latent tensors: [4, 64, 64]. U-Net output: [4, 64, 64]. Decoded image: [3, 512, 512]. The shapes are the interface contract between components.

3. **CFG requires two U-Net forward passes per step.** 50 steps means 100 U-Net forward passes. This is why the denoising loop dominates compute time. CFG is not post-processing -- it is woven into every step.

4. **CLIP always produces [77, 768] regardless of prompt length.** Short prompts are padded, long prompts are truncated. This fixed interface is what makes the pipeline modular.

5. **Nothing in this pipeline is new to you.** The tokenizer is from Series 4. CLIP is from Module 6.3. The U-Net architecture, timestep conditioning, cross-attention, and CFG are from Modules 6.2-6.3. The VAE is from Module 6.1. The latent diffusion concept is from Module 6.3. Every piece is something you built or deeply studied.