# SDXL

**Module 7.4, Lesson 1** | CourseAI

SDXL is the U-Net pushed to its practical ceiling. Every improvement is about what goes IN to the U-Net (dual text encoders), what goes AROUND it (refiner model), or what goes ALONGSIDE it (micro-conditioning). The architecture itself is the same species as SD v1.5.

**What you will do:**
- Load the SDXL base pipeline and inspect its dual text encoders—verify the tensor shapes [77, 768] and [77, 1280] that concatenate to [77, 2048]
- Generate with SDXL at 1024×1024 at several guidance scales and compare against the SD v1.5 512×512 results from the pipeline lesson
- Explore micro-conditioning by varying `original_size` and `crops_coords_top_left`, and observe how training metadata steers generation quality and composition
- Build the base + refiner two-stage pipeline and compare base-only vs base+refiner output at different handoff points

**For each exercise, PREDICT the output before running the cell.**

Every concept in this notebook comes from the lesson. Dual text encoders, micro-conditioning, the refiner as img2img with a specialist. No new theory—just hands-on verification of what you just read.

**Estimated time:** 40–60 minutes. All exercises use pre-trained models (no training). Requires a GPU runtime with sufficient VRAM (~7 GB for SDXL base in float16).
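
The ~7 GB figure can be sanity-checked with back-of-the-envelope arithmetic: float16 stores each parameter in 2 bytes. The sketch below is an estimate of weight memory only (activations and the CUDA context add more). The text encoder counts are the approximate values you will measure in Exercise 1; the U-Net and VAE counts are assumptions used for illustration.

```python
# Rough weight-memory estimate for SDXL base.
# Parameter counts are approximate; the U-Net and VAE figures are assumptions.
PARAM_COUNTS = {
    "unet": 2_600_000_000,          # assumed, roughly 2.6B
    "text_encoder": 123_000_000,    # CLIP ViT-L, roughly 123M
    "text_encoder_2": 695_000_000,  # OpenCLIP ViT-bigG, roughly 695M
    "vae": 84_000_000,              # assumed, roughly 84M
}


def weight_memory_gb(param_counts, bytes_per_param=2):
    """Return the memory used by weights alone, in GB (1 GB = 1e9 bytes)."""
    total_params = sum(param_counts.values())
    return total_params * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(PARAM_COUNTS, bytes_per_param=2)
fp32_gb = weight_memory_gb(PARAM_COUNTS, bytes_per_param=4)
print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"float32 weights: ~{fp32_gb:.1f} GB")
```

Under these assumptions, float32 would need about twice the memory, which is why the notebook loads the `fp16` variant.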

## Setup

Run this cell to install dependencies and configure the environment.

**Important:** Switch to a GPU runtime in Colab (Runtime > Change runtime type > T4 GPU). SDXL requires a GPU with at least 7 GB VRAM in float16.

In [None]:
!pip install -q diffusers transformers accelerate safetensors

In [None]:
import torch
import time
import gc
import matplotlib.pyplot as plt
from diffusers import (
    StableDiffusionXLPipeline,
    StableDiffusionXLImg2ImgPipeline,
    DPMSolverMultistepScheduler,
)
from IPython.display import display

# Reproducible results
SEED = 42

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [14, 5]
plt.rcParams['figure.dpi'] = 100

print(f'Device: {device}')
print(f'Dtype: {dtype}')
if device.type == 'cpu':
    print('WARNING: No GPU detected. SDXL will be extremely slow on CPU.')
    print('Switch to a GPU runtime: Runtime > Change runtime type > T4 GPU')
print()
print('Setup complete.')

## Shared Helpers

Utility functions for timing generation and displaying image comparisons. Run this cell now—these are used across all four exercises.

In [None]:
def generate_sdxl(
    pipe,
    prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    seed=SEED,
    original_size=None,
    crops_coords_top_left=None,
    target_size=None,
    output_type="pil",
    denoising_end=None,
):
    """Generate an image with SDXL and return (image, elapsed_seconds).

    Supports micro-conditioning kwargs and denoising_end for refiner handoff.
    """
    generator = torch.Generator(device=device).manual_seed(seed)
    kwargs = dict(
        prompt=prompt,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=generator,
        output_type=output_type,
    )
    if original_size is not None:
        kwargs["original_size"] = original_size
    if crops_coords_top_left is not None:
        kwargs["crops_coords_top_left"] = crops_coords_top_left
    if target_size is not None:
        kwargs["target_size"] = target_size
    if denoising_end is not None:
        kwargs["denoising_end"] = denoising_end

    start = time.time()
    result = pipe(**kwargs)
    elapsed = time.time() - start
    return result.images[0], elapsed


def show_image_row(images, titles, suptitle=None, figsize=None):
    """Display a row of PIL images with titles."""
    n = len(images)
    fig_w = figsize[0] if figsize else max(5 * n, 12)
    fig_h = figsize[1] if figsize else 5
    fig, axes = plt.subplots(1, n, figsize=(fig_w, fig_h))
    if n == 1:
        axes = [axes]
    for ax, img, title in zip(axes, images, titles):
        ax.imshow(img)
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    if suptitle:
        plt.suptitle(suptitle, fontsize=13, y=1.02)
    plt.tight_layout()
    plt.show()


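def free_memory(*pipelines):
    """Run garbage collection and release cached GPU memory.

    Note: a function cannot delete the caller's variables. Passing a pipeline
    here only drops this function's local reference, so `del` your own
    variable (e.g. `del pipe_sdxl`) before calling, or the VRAM stays in use.
    """
    del pipelines  # drops only the local tuple of references
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print("Memory freed.")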


print('Helpers defined: generate_sdxl, show_image_row, free_memory')
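
One caveat about helpers like `free_memory`: in Python, `del` inside a function removes only that function's local name. The object survives as long as the caller still holds a reference. The sketch below demonstrates this with a hypothetical `FakePipeline` class and a `weakref`, so it runs without a GPU; `drop_local_only` is an illustrative name, not part of the notebook.

```python
import gc
import weakref


class FakePipeline:
    """Hypothetical stand-in for a large pipeline object."""


def drop_local_only(obj):
    del obj  # removes only this function's local name


pipe = FakePipeline()
alive = weakref.ref(pipe)  # lets us check whether the object still exists

drop_local_only(pipe)
gc.collect()
still_alive_after_helper = alive() is not None

del pipe  # drop the caller's own reference
gc.collect()
alive_after_del = alive() is not None

print(still_alive_after_helper, alive_after_del)
```

The practical consequence: to release a pipeline's VRAM, delete your own variable first and then run garbage collection and `torch.cuda.empty_cache()`.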

---

## Exercise 1: SDXL Pipeline Inspection `[Guided]`

The lesson taught that SDXL uses two text encoders instead of one:
- **CLIP ViT-L/14** (SD v1.5's original encoder): produces [77, 768] embeddings
- **OpenCLIP ViT-bigG/14** (SDXL's addition): produces [77, 1280] embeddings

The two outputs are concatenated along the embedding dimension: [77, 768] + [77, 1280] = [77, 2048]. This combined tensor becomes the K/V source for cross-attention. One cross-attention path, wider input.
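
The shape arithmetic above can be written out as a small check. This is a minimal sketch in plain Python that works on shape tuples rather than real tensors; `concat_last_dim` is a hypothetical helper that mimics what concatenation along the last axis does to shapes.

```python
def concat_last_dim(shape_a, shape_b):
    """Shape that results from concatenating two tensors along the last axis."""
    if shape_a[:-1] != shape_b[:-1]:
        # Concatenation requires every other dimension to match
        raise ValueError(f"leading dims differ: {shape_a} vs {shape_b}")
    return shape_a[:-1] + (shape_a[-1] + shape_b[-1],)


clip_l = (77, 768)           # CLIP ViT-L/14 per-token embeddings
openclip_bigg = (77, 1280)   # OpenCLIP ViT-bigG/14 per-token embeddings

combined_shape = concat_last_dim(clip_l, openclip_bigg)
print(combined_shape)
```

The token dimension (77) must match between the two encoders; only the embedding width grows. That is why a single cross-attention path can consume the result.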

Let's verify this by loading the SDXL pipeline and inspecting the encoders directly.

**Before running, predict:**
- SDXL has two text encoders. What class will each one be? (Hint: one is the standard CLIP text model, the other is an OpenCLIP variant.)
- How many parameters will the larger encoder have compared to the smaller one?
- What will the combined embedding shape be after concatenation?

In [None]:
# ============================================================
# Exercise 1: Load SDXL and inspect the dual text encoders
# ============================================================

# --- Step 1: Load the SDXL base pipeline ---
print("Loading SDXL base pipeline...")
print("(This downloads ~6.5 GB on first run. Subsequent runs use the cache.)")
print()

pipe_sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=dtype,
    variant="fp16",
    use_safetensors=True,
).to(device)

print("SDXL base loaded.")

In [None]:
# --- Step 2: Inspect the two text encoders ---

text_encoder_1 = pipe_sdxl.text_encoder
text_encoder_2 = pipe_sdxl.text_encoder_2

# Class names
print("=== Text Encoder 1 (CLIP ViT-L) ===")
print(f"  Class: {text_encoder_1.__class__.__name__}")
params_1 = sum(p.numel() for p in text_encoder_1.parameters())
print(f"  Parameters: {params_1:,} ({params_1 / 1e6:.1f}M)")
print()

print("=== Text Encoder 2 (OpenCLIP ViT-bigG) ===")
print(f"  Class: {text_encoder_2.__class__.__name__}")
params_2 = sum(p.numel() for p in text_encoder_2.parameters())
print(f"  Parameters: {params_2:,} ({params_2 / 1e6:.1f}M)")
print()

print(f"Encoder 2 is {params_2 / params_1:.1f}x larger than Encoder 1.")

In [None]:
# --- Step 3: Run a prompt through both encoders and check output shapes ---

prompt = "a cat sitting on a beach at sunset"

# Tokenize and encode with both encoders
tokenizer_1 = pipe_sdxl.tokenizer
tokenizer_2 = pipe_sdxl.tokenizer_2

# Encoder 1: CLIP ViT-L
tokens_1 = tokenizer_1(
    prompt,
    padding="max_length",
    max_length=tokenizer_1.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids.to(device)

with torch.no_grad():
    # SDXL conditions on the penultimate hidden layer of both encoders
    output_1 = text_encoder_1(tokens_1, output_hidden_states=True)
    hidden_1 = output_1.hidden_states[-2]  # per-token embeddings

# Encoder 2: OpenCLIP ViT-bigG
tokens_2 = tokenizer_2(
    prompt,
    padding="max_length",
    max_length=tokenizer_2.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids.to(device)

with torch.no_grad():
    # output_hidden_states=True is needed to access the penultimate layer
    output_2 = text_encoder_2(tokens_2, output_hidden_states=True)
    hidden_2 = output_2.hidden_states[-2]  # penultimate layer (SDXL uses this)
    pooled = output_2.text_embeds  # pooled output for global conditioning

print(f'Prompt: "{prompt}"')
print()
print(f"Encoder 1 (CLIP ViT-L) output shape:       {list(hidden_1.shape)}")
print(f"Encoder 2 (OpenCLIP ViT-bigG) output shape: {list(hidden_2.shape)}")
print()

# Concatenate along the embedding dimension (dim=2)
combined = torch.cat([hidden_1, hidden_2], dim=-1)
print(f"Concatenated embedding shape:                {list(combined.shape)}")
print()
print(f"Pooled embedding shape (global conditioning): {list(pooled.shape)}")

### What Just Happened

You loaded the SDXL base pipeline and inspected its dual text encoders. Here is what you should have observed:

- **Text Encoder 1** is `CLIPTextModel`—the same CLIP ViT-L/14 from SD v1.5. It produces **[1, 77, 768]** embeddings. ~123M parameters. This is the encoder you already know.

- **Text Encoder 2** is `CLIPTextModelWithProjection`—an OpenCLIP ViT-bigG/14 variant. It produces **[1, 77, 1280]** embeddings. ~695M parameters (about 5.6x larger). This is SDXL's addition.

- **The concatenated embedding is [1, 77, 2048].** 768 + 1280 = 2048 dimensions per token. This is the K/V source for cross-attention. The cross-attention mechanism is unchanged from SD v1.5—it just reads from a wider embedding.

- **The pooled embedding is [1, 1280].** A single vector summarizing the entire prompt, from Encoder 2. It is combined with the micro-conditioning embeddings, projected, and added to the timestep embedding, then injected through adaptive group normalization: the same global conditioning pathway as the timestep.

This confirms the lesson's tensor shape trace:
```
CLIP ViT-L/14:     [77, 768]   ← SD v1.5's original encoder
OpenCLIP ViT-bigG:  [77, 1280]  ← SDXL's addition
Concatenated:       [77, 2048]  ← One embedding per token, richer
```

Two encoders, one cross-attention path. Not decoupled attention (like IP-Adapter)—concatenation before the bottleneck.

---

## Exercise 2: SDXL Base Generation and Comparison `[Guided]`

The lesson showed two descriptions of the same prompt at SD v1.5 vs SDXL quality. Now you will see it for real.

We will generate "a cat sitting on a beach at sunset"—the same prompt from the SD v1.5 pipeline lesson—with SDXL at 1024×1024. Then we will vary `guidance_scale` to explore the richer conditioning.

From the lesson: SDXL's dual text encoders provide better text understanding. The hypothesis is that with richer conditioning, you need *less* amplification from classifier-free guidance. SD v1.5 typically uses guidance_scale ~7.5. SDXL should work well at lower values.

**Before running, predict:**
- Will the optimal guidance_scale for SDXL be higher or lower than SD v1.5's typical 7.5? Why? (Think about what guidance_scale does: it amplifies the difference between conditioned and unconditioned predictions. If the conditioned prediction is already better...)
- At guidance_scale=10, will SDXL show oversaturation artifacts like SD v1.5 often does?

In [None]:
# ============================================================
# Exercise 2: Generate with SDXL and compare guidance scales
# ============================================================

PROMPT = "a cat sitting on a beach at sunset"

# Use DPM-Solver++ for faster generation (Level 1 acceleration—free speedup)
pipe_sdxl.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe_sdxl.scheduler.config
)

# --- Generate at three guidance scales ---
guidance_scales = [5.0, 7.5, 10.0]
images = []
times = []

for gs in guidance_scales:
    print(f"Generating: guidance_scale={gs}...")
    img, elapsed = generate_sdxl(
        pipe_sdxl,
        PROMPT,
        num_inference_steps=30,
        guidance_scale=gs,
    )
    images.append(img)
    times.append(elapsed)
    print(f"  Done in {elapsed:.2f}s")

In [None]:
# --- Compare all three side by side ---
show_image_row(
    images,
    [
        f"guidance_scale=5.0\n{times[0]:.1f}s",
        f"guidance_scale=7.5\n{times[1]:.1f}s",
        f"guidance_scale=10.0\n{times[2]:.1f}s",
    ],
    suptitle=f'SDXL at 1024\u00d71024: "{PROMPT}"',
    figsize=(18, 6),
)

print("Compare across guidance scales:")
print("  guidance_scale=5.0:  Clean, well-balanced. SDXL's typical sweet spot.")
print("  guidance_scale=7.5:  Slightly more vivid. SD v1.5's default.")
print("  guidance_scale=10.0: May show oversaturation or high-contrast artifacts.")
print()
print("The dual text encoders provide richer conditioning,")
print("so less CFG amplification is needed. SDXL typically works")
print("best at guidance_scale ~5-7, lower than SD v1.5's ~7.5.")

### What Just Happened

You generated the same prompt that you traced through SD v1.5 in the pipeline lesson, now with SDXL at 1024×1024. Here is what to observe:

- **The quality jump is dramatic.** Compared with the SD v1.5 outputs from the pipeline lesson, SDXL produces sharper details, closer prompt adherence, and more coherent composition. This comes from the combination of dual text encoders, higher resolution, and the larger U-Net.

- **guidance_scale=5.0 often looks the best for SDXL.** The dual text encoders provide richer conditioning—the conditioned prediction is already more accurate. CFG amplifies the *difference* between conditioned and unconditioned. When the conditioned prediction is better, less amplification is needed to steer generation. SD v1.5 needed guidance_scale ~7.5 to compensate for its single encoder's limitations.

- **guidance_scale=10.0 may show oversaturation.** Higher guidance pushes the prediction further from the unconditioned baseline. With SDXL's already-strong conditioning, this can overshoot—producing overly vivid colors, extreme contrast, or unnatural lighting. The same effect happens with SD v1.5 at high guidance, but SDXL reaches the oversaturation threshold sooner.

- **This confirms the lesson's insight.** Better text encoders → better conditioned predictions → less need for CFG amplification. The connection between conditioning quality and optimal guidance scale is not just theoretical—you can see it.
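
To make "CFG amplifies the difference" concrete, here is a minimal sketch of the classifier-free guidance combination on toy vectors. The numbers are made up for illustration and stand in for noise predictions; `apply_cfg` is a hypothetical helper, not a diffusers function.

```python
def apply_cfg(uncond, cond, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy "noise predictions" (illustrative values only)
uncond = [0.10, -0.20, 0.05]
cond = [0.30, -0.10, 0.00]

for scale in (1.0, 5.0, 7.5, 10.0):
    guided = apply_cfg(uncond, cond, scale)
    print(scale, [round(v, 3) for v in guided])
```

At `guidance_scale=1.0` the result is simply the conditioned prediction. Every increase pushes the output further along the direction `cond - uncond`, so a more accurate conditioned prediction needs a smaller push to reach the same effect.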

---

## Exercise 3: Micro-Conditioning Exploration `[Supported]`

From the lesson: micro-conditioning tells the model about each training image's context. Three additional conditioning inputs, each a pair of integers, are fed alongside the timestep:
- **original_size**: the resolution of the training image before any resizing
- **crops_coords_top_left**: where the crop was taken from
- **target_size**: the resolution the model should generate

At inference, setting `original_size=(1024, 1024)` and `crops_coords_top_left=(0, 0)` tells the model: "generate as if the original image was high-resolution and well-centered."

But what happens if you change these values? The model learned what low-resolution originals and off-center crops look like during training. We can exploit this to see micro-conditioning in action.
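
It can help to see how little data micro-conditioning actually carries. The sketch below flattens the three pairs into six integers, which is our understanding of how the diffusers SDXL pipeline packs them before embedding. The helper name `pack_micro_conditioning` is hypothetical, and the 256 and 1280 dimensions are assumptions that you can check against `pipe_sdxl.unet.config`.

```python
def pack_micro_conditioning(original_size, crops_coords_top_left, target_size):
    """Flatten the three (a, b) pairs into one list of six numbers."""
    return list(original_size + crops_coords_top_left + target_size)


default_ids = pack_micro_conditioning((1024, 1024), (0, 0), (1024, 1024))
lowres_ids = pack_micro_conditioning((256, 256), (0, 0), (1024, 1024))
print(default_ids)
print(lowres_ids)

# Assumed SDXL base dimensions: each of the six values becomes a 256-dim
# embedding, and the pooled text embedding contributes 1280 dims.
EMBED_DIM_PER_VALUE = 256
POOLED_TEXT_DIM = 1280
added_cond_dim = len(default_ids) * EMBED_DIM_PER_VALUE + POOLED_TEXT_DIM
print(added_cond_dim)
```

Changing `original_size` from (1024, 1024) to (256, 256) changes only two of the six numbers, yet the model was trained to associate those numbers with image quality.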

Your task: generate the same prompt with three different micro-conditioning configurations and compare the outputs.

Fill in the TODO markers to complete the experiment.

In [None]:
# ============================================================
# Exercise 3: Micro-conditioning exploration
# ============================================================

PROMPT = "a cat sitting on a beach at sunset"

# --- Configuration 1: Default (best quality) ---
# original_size=(1024, 1024), crops_coords_top_left=(0, 0)
# This is the default: "generate as if the original was high-res and well-framed."
print("Generating: default micro-conditioning (high-quality)...")
img_default, time_default = generate_sdxl(
    pipe_sdxl,
    PROMPT,
    num_inference_steps=30,
    guidance_scale=5.0,
    original_size=(1024, 1024),
    crops_coords_top_left=(0, 0),
    target_size=(1024, 1024),
)
print(f"  Done in {time_default:.2f}s")

In [None]:
# --- Configuration 2: Simulate a low-resolution original ---
# Setting original_size=(256, 256) tells the model:
# "generate as if the training image was a 256x256 thumbnail."
# The model learned that 256x256 originals tend to be softer and less detailed.

# TODO: Generate with original_size=(256, 256), crops_coords_top_left=(0, 0),
#       target_size=(1024, 1024). Use the same prompt, steps, and guidance_scale
#       as Configuration 1.
# Hint: Call generate_sdxl with the appropriate micro-conditioning kwargs.
raise NotImplementedError(
    "TODO: Generate with original_size=(256, 256). See the hint above."
)

In [None]:
# --- Configuration 3: Simulate an off-center crop ---
# Setting crops_coords_top_left=(512, 512) tells the model:
# "generate as if this crop's top-left corner sat 512 px down and 512 px right
# of the original image's top-left corner."
# The model learned that such off-center crops often have unusual compositions.

# TODO: Generate with original_size=(1024, 1024), crops_coords_top_left=(512, 512),
#       target_size=(1024, 1024). Same prompt, steps, and guidance_scale.
raise NotImplementedError(
    "TODO: Generate with crops_coords_top_left=(512, 512). See the hint above."
)

In [None]:
# --- Compare all three side by side ---
show_image_row(
    [img_default, img_lowres, img_crop],
    [
        "Default\norig=(1024,1024)\ncrop=(0,0)",
        "Low-res original\norig=(256,256)\ncrop=(0,0)",
        "Off-center crop\norig=(1024,1024)\ncrop=(512,512)",
    ],
    suptitle=f'Micro-Conditioning Comparison: "{PROMPT}"',
    figsize=(18, 6),
)

print("Compare the three outputs:")
print("  Default (left):    Best quality. High-res original, well-centered.")
print("  Low-res (middle):  Should appear softer or less detailed.")
print("                     The model generates 'as if' the original was 256x256.")
print("  Off-center (right): May show shifted composition or cropping artifacts.")
print("                     The model generates 'as if' the crop started at (512, 512).")
print()
print("Micro-conditioning is not just metadata—it actively steers generation.")
print("The model learned to separate content from quality during training.")

<details>
<summary>Solution</summary>

The key insight is that micro-conditioning values are not suggestions. They are conditioning inputs that the model learned to respond to during training. Setting `original_size=(256, 256)` does not resize anything. It tells the model "generate with the characteristics you learned from 256x256 training images" (typically softer, less detailed). Setting `crops_coords_top_left=(512, 512)` tells the model "generate with the characteristics you learned from training crops whose top-left corner was offset by 512 px" (potentially shifted composition).

```python
# Configuration 2: Low-resolution original
print("Generating: low-resolution original micro-conditioning...")
img_lowres, time_lowres = generate_sdxl(
    pipe_sdxl,
    PROMPT,
    num_inference_steps=30,
    guidance_scale=5.0,
    original_size=(256, 256),
    crops_coords_top_left=(0, 0),
    target_size=(1024, 1024),
)
print(f"  Done in {time_lowres:.2f}s")

# Configuration 3: Off-center crop
print("Generating: off-center crop micro-conditioning...")
img_crop, time_crop = generate_sdxl(
    pipe_sdxl,
    PROMPT,
    num_inference_steps=30,
    guidance_scale=5.0,
    original_size=(1024, 1024),
    crops_coords_top_left=(512, 512),
    target_size=(1024, 1024),
)
print(f"  Done in {time_crop:.2f}s")
```

**Common mistakes:**
- Confusing `original_size` with actually resizing the output. The output is always 1024x1024. `original_size` is a conditioning signal, not a resize parameter.
- Forgetting `target_size=(1024, 1024)`. Without it, the pipeline uses the default, which is usually correct—but being explicit ensures the experiment is controlled.
- Expecting dramatic visual differences. The effect can be subtle, especially for `crops_coords_top_left`. The model learned statistical tendencies, not absolute rules.

</details>

### What Just Happened

You explored micro-conditioning by changing the metadata values the model uses as conditioning inputs:

- **Default (original_size=1024, crop=(0,0))** produces the highest quality. This tells the model: "the original training image was high-resolution and well-framed." The model generates accordingly.

- **Low-resolution original (original_size=256)** may produce softer or less detailed output. The model learned during training that 256x256 images tend to have less fine detail. At inference, you are asking it to generate with those characteristics—not literally at 256x256, but with the quality profile it associates with small originals.

- **Off-center crop (crop=(512,512))** may shift the composition. The model learned that training crops taken from off-center positions have different statistical properties (subjects at edges, asymmetric compositions). The effect varies by prompt—some prompts are more sensitive than others.

- **The effects may be subtle.** Micro-conditioning represents statistical tendencies learned from millions of training images, not deterministic rules. The difference between `original_size=(1024, 1024)` and `original_size=(256, 256)` is a shift in the quality distribution, not a binary switch. This is exactly how adaptive norm conditioning works—continuous influence, not discrete control.

- **This confirms the lesson's insight.** Without micro-conditioning, SDXL would have to either throw away low-resolution training data or accept that the model learns to produce artifacts. Micro-conditioning separates content from quality, letting the model train on diverse data while generating at the quality level you request.

---

## Exercise 4: Base + Refiner Pipeline `[Independent]`

From the lesson: the SDXL refiner is a second U-Net, fine-tuned for the low-noise timesteps where fine detail matters. The two-model pipeline works like this:

1. **Base model** denoises from t=T (pure noise) to t=t_switch—handles composition, structure, color
2. **Refiner model** denoises from t=t_switch to t=0—handles fine detail, texture, sharpness

This is the img2img mechanism you already know, applied with a specialized second model. The base model produces a partially denoised latent at t_switch. The refiner takes that latent and completes the denoising.
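
The handoff point determines how the step budget is divided. The sketch below approximates the split by rounding; `split_steps` is a hypothetical helper. The real pipelines derive the split from the scheduler's timesteps, so the actual counts may differ by a step.

```python
def split_steps(num_inference_steps, handoff):
    """Approximate (base_steps, refiner_steps) for a given handoff fraction."""
    if not 0.0 < handoff < 1.0:
        raise ValueError("handoff must be strictly between 0 and 1")
    base_steps = round(num_inference_steps * handoff)
    return base_steps, num_inference_steps - base_steps


for handoff in (0.9, 0.8, 0.6):
    base_steps, refiner_steps = split_steps(40, handoff)
    print(f"handoff={handoff}: base {base_steps} steps, refiner {refiner_steps} steps")
```

Note that the total stays at 40 steps in every case: moving the handoff earlier shifts work from the base to the refiner rather than adding steps.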

### Your Task

1. **Free the base-only pipeline** from the previous exercises to make VRAM for the refiner
2. **Load both the base pipeline and the refiner pipeline** (`StableDiffusionXLImg2ImgPipeline` for the refiner)
3. **Generate base-only** at 40 steps as the comparison baseline
4. **Generate base+refiner** with the base using `denoising_end` to stop early and the refiner using `denoising_start` to continue from there
5. **Vary the handoff point (t_switch):** try giving the refiner 10%, 20%, and 40% of the denoising steps
6. **Compare** base-only vs base+refiner outputs. Look for differences in fine detail, texture, and sharpness.
7. **Bonus:** time the base-only and base+refiner pipelines to quantify the compute tradeoff

### Hints

- The refiner model is at `"stabilityai/stable-diffusion-xl-refiner-1.0"`
- Use `StableDiffusionXLImg2ImgPipeline.from_pretrained(...)` for the refiner
- The base pipeline's `denoising_end` parameter controls where it stops (e.g., `denoising_end=0.8` means stop at 80% completion)
- The refiner's `denoising_start` parameter controls where it picks up (e.g., `denoising_start=0.8` means start at 80% completion)
- The base must output latents, not images: use `output_type="latent"` for the base when handing off to the refiner
- Pass the base's latent output as the `image` argument to the refiner
- Both pipelines should use the same prompt and seed for a fair comparison
- On T4 GPUs, you may need to load the refiner with `torch_dtype=torch.float16` and `variant="fp16"` to fit in memory. Consider using `.enable_model_cpu_offload()` instead of `.to(device)` for the refiner to save VRAM.

In [None]:
# ============================================================
# Exercise 4: Base + Refiner two-stage pipeline
# ============================================================
#
# Free the pipeline from previous exercises first.
# Then load both base and refiner, generate, and compare.
#
# Your code here:



In [None]:
# --- Your comparison visualization ---
#
# Use show_image_row() to display base-only vs base+refiner results.
# Try multiple handoff points (denoising_end/denoising_start of 0.9, 0.8, 0.6)
# and compare the fine detail differences.
#
# Your code here:



<details>
<summary>Solution</summary>

The refiner is img2img with a specialized model. The base model generates the composition and structure (high-noise timesteps). The refiner polishes fine detail (low-noise timesteps). The `denoising_end` and `denoising_start` parameters control where the handoff happens.

```python
# --- Step 1: Free memory from previous exercises ---
# Drop our own reference first: a helper cannot delete the caller's variable.
del pipe_sdxl
free_memory()

# --- Step 2: Load both base and refiner ---
print("Loading SDXL base...")
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
base.enable_model_cpu_offload()

print("Loading SDXL refiner...")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
refiner.enable_model_cpu_offload()
print("Both models loaded.")

PROMPT = "a cat sitting on a beach at sunset"
N_STEPS = 40

# --- Step 3: Base-only generation (all 40 steps) ---
print("\nGenerating: base-only (40 steps)...")
generator = torch.Generator(device="cpu").manual_seed(SEED)
start = time.time()
img_base_only = base(
    prompt=PROMPT,
    num_inference_steps=N_STEPS,
    guidance_scale=5.0,
    generator=generator,
).images[0]
time_base_only = time.time() - start
print(f"  Done in {time_base_only:.2f}s")

# --- Step 4: Base + Refiner at different handoff points ---
handoff_points = [0.9, 0.8, 0.6]  # refiner gets 10%, 20%, 40% of steps
refiner_images = []
refiner_times = []

for handoff in handoff_points:
    # round(), not int(): (1 - 0.9) * 100 is 9.999... in floating point
    pct = round((1 - handoff) * 100)
    print(f"\nGenerating: base+refiner (refiner gets {pct}% of steps)...")

    # Base: denoise from noise to the handoff point
    generator = torch.Generator(device="cpu").manual_seed(SEED)
    start = time.time()
    latent = base(
        prompt=PROMPT,
        num_inference_steps=N_STEPS,
        guidance_scale=5.0,
        denoising_end=handoff,
        output_type="latent",
        generator=generator,
    ).images

    # Refiner: pick up from the handoff point and denoise to completion
    generator = torch.Generator(device="cpu").manual_seed(SEED)
    img_refined = refiner(
        prompt=PROMPT,
        num_inference_steps=N_STEPS,
        guidance_scale=5.0,
        denoising_start=handoff,
        image=latent,
        generator=generator,
    ).images[0]
    elapsed = time.time() - start

    refiner_images.append(img_refined)
    refiner_times.append(elapsed)
    print(f"  Done in {elapsed:.2f}s")

# --- Step 5: Compare all results ---
all_images = [img_base_only] + refiner_images
all_titles = [
    f"Base only\n{N_STEPS} steps | {time_base_only:.1f}s",
] + [
    f"Base+Refiner\nrefiner gets {round((1 - h) * 100)}% | {t:.1f}s"
    for h, t in zip(handoff_points, refiner_times)
]

show_image_row(
    all_images,
    all_titles,
    suptitle=f'Base vs Base+Refiner: "{PROMPT}"',
    figsize=(22, 6),
)

print("Timing comparison:")
print(f"  Base only:              {time_base_only:.2f}s")
for h, t in zip(handoff_points, refiner_times):
    pct = round((1 - h) * 100)
    print(f"  Base+Refiner ({pct}%):     {t:.2f}s")
```

**Key observations:**
- The refiner adds polish, not fundamentally different content. Look at textures (fur, sand) and edges (cat outline, wave crests). The composition should be the same.
- Giving the refiner too many steps (e.g., 40%) can sometimes *hurt* quality, because the refiner was trained for the low-noise regime and may struggle with higher-noise inputs.
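- The total number of denoising steps stays at `N_STEPS`: the base runs the first `handoff` fraction and the refiner runs the rest. The extra time comes mainly from running and swapping two models (especially with CPU offload), not from additional steps.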
- `enable_model_cpu_offload()` keeps models in CPU memory until needed, then loads them to GPU on demand. This trades speed for VRAM—necessary on T4 GPUs where loading both models simultaneously would exceed memory.

**Common mistakes:**
- Forgetting `output_type="latent"` on the base when handing off to the refiner. Without this, the base decodes its still-noisy latent to pixels, and the refiner no longer receives the latent that `denoising_start` expects.
- Using different seeds for base and refiner. The seeds should match for reproducible comparison.
- Not passing the prompt to the refiner. The refiner uses text conditioning during its denoising pass—without it, the detail polishing is semantically unguided.
- Setting `denoising_start` to a different value than `denoising_end`. These must match for a clean handoff.

</details>

---

## Key Takeaways

1. **Dual text encoders are real and inspectable.** CLIP ViT-L produces [77, 768]. OpenCLIP ViT-bigG produces [77, 1280]. Concatenated: [77, 2048]. One cross-attention path, wider K/V source. The mechanism is unchanged from SD v1.5—the input is just richer.

2. **Richer conditioning means lower optimal guidance.** SDXL works best at guidance_scale ~5–7, lower than SD v1.5's ~7.5. The dual encoders provide better conditioned predictions, so less CFG amplification is needed.

3. **Micro-conditioning is not metadata—it is an active conditioning signal.** Setting `original_size=(256, 256)` tells the model to generate with the quality characteristics it learned from low-resolution training images. The model separates content from quality.

4. **The refiner is img2img with a specialist.** Same mechanism you know from Img2img & Inpainting: take a partially denoised latent, complete the denoising with a specialized model. The base handles composition; the refiner polishes details. Optional but effective.

5. **Every SDXL improvement is about what goes IN, AROUND, or ALONGSIDE the U-Net.** Dual encoders (in), refiner (around), micro-conditioning (alongside). The U-Net backbone itself is the same species as SD v1.5, just larger. The next lesson asks: what if you replaced it entirely?