# IP-Adapter

**Module 7.1, Lesson 3** | CourseAI

You know the architecture: decoupled cross-attention adds a parallel K/V pathway for CLIP image embeddings alongside the existing text K/V projections. This notebook is where the architecture becomes a tool: reference images, scale tuning, text-image coexistence, and composability with ControlNet.

**What you will do:**
- Load IP-Adapter and generate with a reference image, comparing output with and without image conditioning (scale=0 vs scale=0.6)
- Sweep the IP-Adapter scale parameter across five values and observe the transition from text-dominant to image-dominant
- Test text-image coexistence by pairing the same reference image with three different text prompts
- Combine IP-Adapter with ControlNet: reference image for visual style, edge map for spatial structure

**For each exercise, PREDICT the output before running the cell.**

Every concept in this notebook comes from the lesson: decoupled cross-attention, the scale parameter as a volume knob, text-image coexistence via parallel K/V paths, and composability with ControlNet. No new theory, just hands-on practice with real models.

**Estimated time:** 40–60 minutes (model downloads may take several minutes on first run).

**VRAM requirements:** This notebook is designed for a T4 GPU (16 GB). It carefully manages GPU memory by clearing pipelines between heavy operations. Follow the cleanup cells between exercises.

---

## Setup

Run this cell to install dependencies, import everything, and configure the environment.

**Important:** Set the runtime to GPU before running. In Colab: Runtime → Change runtime type → T4 GPU.

The first run will download model weights (~5 GB for SD v1.5 + IP-Adapter weights + ControlNet checkpoint). Subsequent runs use cached weights.

In [None]:
!pip install -q diffusers transformers accelerate safetensors controlnet_aux opencv-python-headless

import torch
import numpy as np
import cv2
import matplotlib.pyplot as plt
import gc
from PIL import Image
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionControlNetPipeline,
    ControlNetModel,
    UniPCMultistepScheduler,
)

# Reproducible results
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
else:
    print('WARNING: No GPU detected. This notebook requires a GPU for image generation.')
    print('In Colab: Runtime → Change runtime type → T4 GPU')

print('\nSetup complete.')

## Shared Helpers

Utility functions used across multiple exercises.

In [None]:
def download_sample_image(url, filename="sample.jpg"):
    """Download an image from a URL and return it as a PIL Image."""
    import urllib.request
    urllib.request.urlretrieve(url, filename)
    return Image.open(filename).convert("RGB")


def show_images(images, titles, figsize=None, suptitle=None):
    """Display a row of images with titles."""
    n = len(images)
    if figsize is None:
        figsize = (5 * n, 5)
    fig, axes = plt.subplots(1, n, figsize=figsize)
    if n == 1:
        axes = [axes]
    for ax, img, title in zip(axes, images, titles):
        ax.imshow(img)
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    if suptitle:
        plt.suptitle(suptitle, fontsize=13, y=1.02)
    plt.tight_layout()
    plt.show()


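def cleanup_pipeline(pipe):
    """Move a pipeline off the GPU and release cached GPU memory.

    Note: `del pipe` below only drops this function's local name. The caller's
    variable still references the pipeline, so the weights are moved to the CPU
    first; run `del pipe` in the calling cell to free that copy as well.
    """
    pipe.to("cpu")
    del pipe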
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        allocated = torch.cuda.memory_allocated() / 1e9
        print(f"GPU memory after cleanup: {allocated:.2f} GB allocated")


def make_generator(seed):
    """Create a torch Generator with the given seed for reproducible results."""
    return torch.Generator(device=device).manual_seed(seed)


# Download a reference image for IP-Adapter exercises.
# This is a photograph of a golden retriever, a subject with distinctive
# visual identity (coat color, fur texture, facial features) that is
# easy to verify in generated outputs.
REFERENCE_URL = "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png"
reference_image = download_sample_image(REFERENCE_URL, "reference.jpg")
reference_image = reference_image.resize((512, 512))

show_images([reference_image], ["Reference Image (512x512)"])
print("This reference image will be used throughout the notebook.")
print("IP-Adapter will extract its visual character via CLIP image embeddings.")
print("The image pixels never enter the denoising loop; only the CLIP embedding does, via cross-attention.")

---

## Exercise 1: Load IP-Adapter and Compare With/Without `[Guided]`

From the lesson: IP-Adapter adds a parallel K/V pathway for CLIP image embeddings in cross-attention. The existing text K/V path is **completely untouched**. Setting the IP-Adapter scale to 0 should produce output identical to vanilla SD with text only.

We will load IP-Adapter, generate with the same prompt and seed at two scale values:
- **scale=0.0:** Image branch contributes nothing (`text_out + 0 × image_out = text_out`)
- **scale=0.6:** Image branch provides moderate visual influence

This directly tests the lesson's core claim: the text path is untouched, and the image path is purely additive.
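
The additive claim can be checked on a toy single-head attention layer before touching the real pipeline. The sketch below is a minimal NumPy illustration, not the diffusers implementation: the dimensions, the random weights, and the names `attention` and `decoupled_cross_attention` are all assumptions made for the example.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(q, text_tokens, image_tokens, w, scale):
    """Return (text_out + scale * image_out, text_out) using separate K/V projections."""
    text_out = attention(q, text_tokens @ w["k_text"], text_tokens @ w["v_text"])
    image_out = attention(q, image_tokens @ w["k_image"], image_tokens @ w["v_image"])
    return text_out + scale * image_out, text_out

rng = np.random.default_rng(0)
d = 8  # toy embedding width (hypothetical)
w = {name: rng.normal(size=(d, d)) for name in ("k_text", "v_text", "k_image", "v_image")}
q = rng.normal(size=(4, d))             # queries from 4 spatial positions
text_tokens = rng.normal(size=(77, d))  # stand-in for CLIP text embeddings
image_tokens = rng.normal(size=(4, d))  # stand-in for projected CLIP image tokens

out_0, text_out = decoupled_cross_attention(q, text_tokens, image_tokens, w, scale=0.0)
out_06, _ = decoupled_cross_attention(q, text_tokens, image_tokens, w, scale=0.6)

print("scale=0.0 equals text-only output:", np.array_equal(out_0, text_out))
print("scale=0.6 equals text-only output:", np.allclose(out_06, text_out))
```

The text branch is computed the same way at every scale; only the weight on the image branch changes.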

**Before running, predict:**
- At scale=0.0, will the reference image have any effect on the output?
- At scale=0.6, what aspects of the reference image will transfer? (color palette? subject identity? exact pixel layout?)
- Will the text prompt still control the scene composition at scale=0.6?

In [None]:
# Step 1: Load SD v1.5 pipeline with IP-Adapter
#
# IP-Adapter is loaded as an add-on to an existing SD pipeline.
# The h94/IP-Adapter repository provides pre-trained adapter weights
# for SD v1.5 that work with any text prompt.

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Load the IP-Adapter weights into the pipeline.
# This adds the pre-trained W_K_image and W_V_image projections at every
# cross-attention layer in the U-Net. The frozen text K/V path is untouched.
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="models",
    weight_name="ip-adapter_sd15.bin",
)

print("SD v1.5 pipeline loaded with IP-Adapter.")
print(f"GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

In [None]:
# Step 2: Generate at scale=0.0 (image branch disabled) and scale=0.6 (moderate influence)

prompt = "a painting of a dog in a garden, beautiful, detailed"
seed = 42
num_steps = 20

# Scale=0.0: image_out is multiplied by 0, so the reference image has no effect.
# output = text_out + 0 × image_out = text_out
pipe.set_ip_adapter_scale(0.0)
generator = make_generator(seed)
img_scale_0 = pipe(
    prompt,
    ip_adapter_image=reference_image,
    num_inference_steps=num_steps,
    generator=generator,
).images[0]
print("Generated at scale=0.0 (no image influence).")

# Scale=0.6: moderate image influence.
# output = text_out + 0.6 × image_out
pipe.set_ip_adapter_scale(0.6)
generator = make_generator(seed)
img_scale_06 = pipe(
    prompt,
    ip_adapter_image=reference_image,
    num_inference_steps=num_steps,
    generator=generator,
).images[0]
print("Generated at scale=0.6 (moderate image influence).")

In [None]:
# Step 3: Compare the results

show_images(
    [reference_image, img_scale_0, img_scale_06],
    [
        "Reference Image\n(CLIP encodes this)",
        "Scale=0.0\n(image branch disabled)",
        "Scale=0.6\n(moderate image influence)",
    ],
    suptitle="Same prompt, same seed: IP-Adapter scale controls image influence",
)

print("Observations:")
print("- Scale=0.0: The reference image has NO effect. This is vanilla SD with text only.")
print("  The model generates a generic dog in a garden based purely on the text prompt.")
print("")
print("- Scale=0.6: The reference image's visual character transfers. The generated dog")
print("  takes on qualities from the reference: color palette, visual style, mood.")
print("  But the text prompt STILL controls the scene (garden setting, composition).")
print("")
print("Key insight: IP-Adapter is purely additive. At scale=0, the image branch")
print("contributes nothing. The text path is untouched at every scale value.")
print("This is decoupled cross-attention in action.")

### What Just Happened

You loaded IP-Adapter into an SD v1.5 pipeline and generated with the same prompt and seed at two scale values:

- **Scale=0.0:** The image branch contributes nothing (`text_out + 0 × image_out = text_out`). The output is identical to vanilla SD with text only. The reference image is encoded by CLIP but its influence is zeroed out.
- **Scale=0.6:** The reference image's visual character transfers: color palette, visual style, mood. But the text prompt still controls the scene composition. The garden setting, the painting style, the dog's pose all come from the text. The reference image adds its visual identity on top.

This confirms the lesson's core claim: the text K/V path is completely untouched. IP-Adapter adds a new information source via a parallel K/V pathway. It is addition, not replacement. This is the same principle as ControlNet at conditioning_scale=0: the adapter contributes nothing, and the frozen model is unchanged.

---

## Exercise 2: Scale Parameter Sweep `[Guided]`

From the lesson: the scale parameter is a **volume knob for image influence**, directly paralleling ControlNet's conditioning scale. At scale=0.0, the image is silent. At scale=1.0, the image is at full volume. The transition from text-dominant to image-dominant should be smooth and progressive.

We will generate with the same reference image and prompt at five scale values: 0.0, 0.3, 0.5, 0.7, and 1.0. Watch how the visual character of the reference image gradually emerges.

**Before running, predict:**
- At what scale value will you first notice the reference image's influence?
- At scale=1.0, will the text prompt still have any visible effect?
- Is the transition gradual or sudden? Does image influence "snap on" at a threshold?

In [None]:
# Generate at five scale values using the same pipeline from Exercise 1

scales = [0.0, 0.3, 0.5, 0.7, 1.0]
prompt = "a painting of a dog in a garden, beautiful, detailed"
seed = 42
num_steps = 20

sweep_results = {}
for scale in scales:
    pipe.set_ip_adapter_scale(scale)
    generator = make_generator(seed)
    result = pipe(
        prompt,
        ip_adapter_image=reference_image,
        num_inference_steps=num_steps,
        generator=generator,
    ).images[0]
    sweep_results[scale] = result
    print(f"Generated at scale={scale}")

print(f"\nGenerated {len(sweep_results)} images across IP-Adapter scales.")

In [None]:
# Display the scale sweep as a comparison grid

fig, axes = plt.subplots(1, len(scales) + 1, figsize=(4 * (len(scales) + 1), 4))

# First column: the reference image
axes[0].imshow(reference_image)
axes[0].set_title("Reference Image\n(input to CLIP)", fontsize=10)
axes[0].axis('off')

# One column per scale value
scale_labels = {
    0.0: "0.0\n(text only)",
    0.3: "0.3\n(subtle influence)",
    0.5: "0.5\n(balanced)",
    0.7: "0.7\n(image-leaning)",
    1.0: "1.0\n(strong image)",
}

for i, scale in enumerate(scales):
    axes[i + 1].imshow(sweep_results[scale])
    axes[i + 1].set_title(f"Scale {scale_labels[scale]}", fontsize=10)
    axes[i + 1].axis('off')

plt.suptitle(
    "IP-Adapter Scale Sweep: same reference, same prompt, varying image influence",
    fontsize=13, y=1.02,
)
plt.tight_layout()
plt.show()

print("The volume knob for image influence:")
print("  Scale 0.0: Pure text conditioning. Generic dog in a garden.")
print("  Scale 0.3: Subtle shift. Color palette begins to warm toward the reference.")
print("  Scale 0.5: Clear influence. Visual character from the reference is visible.")
print("  Scale 0.7: Strong influence. The reference's visual identity dominates.")
print("  Scale 1.0: Full image conditioning. Very strong reference image influence.")
print("")
print("The transition is gradual, not sudden. This is the same 'volume knob' pattern")
print("from ControlNet's conditioning_scale: a continuous dial, not an on/off switch.")

### What Just Happened

You swept the IP-Adapter scale from 0.0 to 1.0 and observed the gradual transition from text-dominant to image-dominant generation:

- **Scale 0.0:** Pure text conditioning. The model generates a generic dog in a garden. The reference image has zero influence.
- **Scale 0.3:** Subtle shift. The color palette begins to warm, visual details start to echo the reference. You might need to look carefully to notice the influence.
- **Scale 0.5:** Clearly visible influence. The generated dog takes on visual qualities from the reference: color, texture, mood. Text still controls the scene.
- **Scale 0.7:** Strong image influence. The reference's visual identity is prominent. The text prompt contributes compositional structure but the visual character is clearly from the reference.
- **Scale 1.0:** Full image conditioning. The reference image's visual character dominates. The text prompt still provides some scene guidance, but the visual identity is strongly controlled by the reference.

The transition is gradual because the scale parameter is a linear multiplier on the image attention output: `output = text_out + scale × image_out`. There is no threshold or discontinuity. This is the same "volume knob" pattern from ControlNet's conditioning scale: familiar control, new conditioning dimension.
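
A small numeric check makes the linearity concrete. This is a sketch under stated assumptions: `text_out` and `image_out` are random stand-ins for the two attention outputs at one layer, not values taken from the pipeline. It shows only that the deviation from the text-only output grows in proportion to `scale` within that layer; the final image passes through many nonlinear layers and denoising steps, so the perceived change need not be exactly proportional.

```python
import numpy as np

rng = np.random.default_rng(0)
text_out = rng.normal(size=(4, 8))   # stand-in for the text attention output
image_out = rng.normal(size=(4, 8))  # stand-in for the image attention output

scales = [0.0, 0.3, 0.5, 0.7, 1.0]
full_volume = np.linalg.norm(image_out)  # deviation at scale=1.0

ratios = []
for scale in scales:
    output = text_out + scale * image_out
    deviation = np.linalg.norm(output - text_out)
    ratios.append(deviation / full_volume)
    print(f"scale={scale:.1f}  deviation={deviation:.3f}  fraction of full volume={ratios[-1]:.2f}")
```

The printed fraction tracks the scale value itself, which is what "linear multiplier" means here.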

---

## Exercise 3: Text-Image Coexistence `[Supported]`

From the lesson: IP-Adapter is **addition, not replacement**. The text K/V path is untouched. Different text prompts with the same reference image should produce different outputs: the image provides visual character while text controls content and composition.

This exercise directly tests whether you understand the decoupling. If IP-Adapter replaced the text prompt, all three outputs below would look identical. They should not.

Your task: generate with the same reference image and three different text prompts. The pipeline is already loaded. You write the generation loop.

**Before running, predict:**
- Will all three outputs look the same (because the reference image dominates)?
- Will all three outputs look completely different (because the text prompt dominates)?
- Or something in between? If so, what aspect does the image control vs the text?

In [None]:
# Three different text prompts, same reference image, same scale.
#
# The prompts describe different scenes, styles, and compositions.
# If IP-Adapter is truly decoupled (addition, not replacement),
# the text should control the scene while the reference image
# provides visual character.

prompts = [
    "a painting of a dog in a garden, beautiful, detailed",
    "a dog running on a beach at sunset, photorealistic",
    "a dog sitting in a snowy forest, winter landscape, serene",
]

seed = 42
num_steps = 20
scale = 0.6  # Moderate image influence, enough to see the reference's effect

pipe.set_ip_adapter_scale(scale)

coexistence_results = {}  # Maps prompt -> generated PIL Image

for p in prompts:
    # TODO: Generate an image using the pipeline with this prompt.
    #
    # Use: pipe(p, ip_adapter_image=reference_image,
    #          num_inference_steps=num_steps, generator=make_generator(seed))
    #
    # Store the result (.images[0]) in coexistence_results[p]
    pass

print(f"Generated {len(coexistence_results)} images with different prompts.")

In [None]:
# Display the results

if not coexistence_results:
    print("coexistence_results is empty. Go back and fill in the TODO.")
    print("You need to generate an image for each prompt and store it in coexistence_results[p].")
    print("Check the solution below if you get stuck.")
else:
    fig, axes = plt.subplots(1, 4, figsize=(20, 5))

    # First column: reference image
    axes[0].imshow(reference_image)
    axes[0].set_title("Reference Image\n(same for all three)", fontsize=10)
    axes[0].axis('off')

    # One column per prompt
    short_labels = [
        "\"...dog in a garden\"",
        "\"...dog on a beach\"",
        "\"...dog in snowy forest\"",
    ]

    for i, (p, label) in enumerate(zip(prompts, short_labels)):
        axes[i + 1].imshow(coexistence_results[p])
        axes[i + 1].set_title(f"Prompt: {label}", fontsize=10)
        axes[i + 1].axis('off')

    plt.suptitle(
        f"Same reference image, scale={scale}: three different text prompts",
        fontsize=13, y=1.02,
    )
    plt.tight_layout()
    plt.show()

    print("Observations:")
    print("- The three outputs are clearly DIFFERENT. Text controls the scene:")
    print("  garden vs beach vs snowy forest. Composition varies per prompt.")
    print("")
    print("- But the visual character is SHARED across all three. The reference")
    print("  image's influence (color palette, visual style, mood) carries through.")
    print("")
    print("This is decoupled cross-attention in action:")
    print("  Text K/V path → controls WHAT (scene, composition, subject)")
    print("  Image K/V path → controls WHAT-IT-LOOKS-LIKE (visual character)")
    print("  Both run in parallel. Neither replaces the other.")

In [None]:
# Clean up the pipeline before Exercise 4
cleanup_pipeline(pipe)
del pipe  # drop the notebook-level reference so the weights can actually be freed
gc.collect()
torch.cuda.empty_cache()
print("Pipeline cleaned up. Ready for Exercise 4.")

<details>
<summary>💡 Solution</summary>

The key insight is that the generation call is identical for each prompt; only the text string changes. The IP-Adapter scale and reference image stay fixed. If IP-Adapter replaced text conditioning, all three outputs would be identical regardless of the prompt. The fact that they differ proves decoupling.

```python
for p in prompts:
    generator = make_generator(seed)
    result = pipe(
        p,
        ip_adapter_image=reference_image,
        num_inference_steps=num_steps,
        generator=generator,
    ).images[0]
    coexistence_results[p] = result
```

**Why the outputs differ:** The text K/V path (`W_K_text`, `W_V_text`) processes each prompt independently. Different prompts produce different K_text and V_text tensors, which produce different text attention outputs. The image K/V path (`W_K_image`, `W_V_image`) produces the same K_image and V_image tensors for all three (same reference image), so it pulls every generation toward the same visual character. The final output is `text_out + scale × image_out`: different text contributions, same image keys and values. Hence: different scenes, shared visual character.

**Common mistake:** Forgetting to recreate the generator for each prompt. Without resetting the seed, each generation starts from different random noise, making it harder to isolate the effect of the text prompt from random variation.
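
The seeding behavior can be seen in isolation with a NumPy analogue. This is a sketch, not the pipeline's sampling code: `initial_noise` is a hypothetical stand-in for drawing the starting latents, and `np.random.default_rng` plays the role that the `torch.Generator` returned by `make_generator(seed)` plays in the notebook.

```python
import numpy as np

def initial_noise(rng, shape=(4, 8, 8)):
    """Stand-in for sampling the starting latent noise."""
    return rng.standard_normal(shape)

seed = 42

# Mistake: one generator shared across prompts. Each draw advances its state,
# so every prompt starts from different noise.
shared = np.random.default_rng(seed)
noise_shared = [initial_noise(shared) for _ in range(3)]

# Fix: a fresh generator per prompt, so every prompt starts from the same noise.
noise_fresh = [initial_noise(np.random.default_rng(seed)) for _ in range(3)]

print(np.array_equal(noise_shared[0], noise_shared[1]))  # False
print(np.array_equal(noise_fresh[0], noise_fresh[1]))    # True
```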

</details>

### What Just Happened

You generated three images with the same reference image but three different text prompts:

- **Dog in a garden:** Garden setting, painterly style. The reference image's visual character blends with the garden scene.
- **Dog on a beach at sunset:** Beach and sunset composition. A completely different scene, but the reference image's visual identity carries through.
- **Dog in a snowy forest:** Winter landscape. Different content again, yet the visual character from the reference is consistent.

The outputs are clearly different because the text K/V path is untouched. Each prompt produces different K_text and V_text tensors, creating different text attention outputs. The image K/V path adds the same image influence to all three. This is the practical proof of the lesson's central claim: **image prompting is addition, not replacement.** The two K/V paths run in parallel, and the text prompt retains full control over scene content and composition.
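
The same argument can be run on toy tensors. The sketch below assumes random weights, a fixed query, and three random "prompt" embeddings (all hypothetical names and sizes). Holding the query fixed is a simplification: in the real U-Net the queries come from the latents, which diverge across prompts after the first denoising step, so only K_image and V_image are strictly shared.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d, scale = 8, 0.6
w_k_text, w_v_text, w_k_image, w_v_image = (rng.normal(size=(d, d)) for _ in range(4))
q = rng.normal(size=(4, d))             # fixed queries (simplifying assumption)
image_tokens = rng.normal(size=(4, d))  # one reference image, shared by all prompts
prompt_tokens = {name: rng.normal(size=(77, d)) for name in ("garden", "beach", "snow")}

# The image branch is computed once: same reference, same K_image and V_image.
image_out = attention(q, image_tokens @ w_k_image, image_tokens @ w_v_image)

# The text branch is computed per prompt: different K_text and V_text each time.
text_outs = {name: attention(q, t @ w_k_text, t @ w_v_text) for name, t in prompt_tokens.items()}
outputs = {name: text_out + scale * image_out for name, text_out in text_outs.items()}

print("garden vs beach outputs equal:", np.allclose(outputs["garden"], outputs["beach"]))
image_parts = [outputs[name] - text_outs[name] for name in prompt_tokens]
print("image contribution shared:", all(np.allclose(p, image_parts[0]) for p in image_parts))
```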

---

## Exercise 4: Combine IP-Adapter with ControlNet `[Independent]`

From the lesson: IP-Adapter provides **WHAT-IT-LOOKS-LIKE** (visual identity via decoupled cross-attention). ControlNet provides **WHERE** (spatial structure via additive encoder features). Text provides **WHAT** (semantic content via the original cross-attention). All three are additive and composable: they target different parts of the U-Net.
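
As a mental model, the composition can be sketched with toy arrays. Everything here is an illustrative assumption: the arrays are random stand-ins, `compose` is a hypothetical helper, and a single layer is shown in isolation. In the real U-Net the two branches interact downstream, so this only illustrates where each scale enters, not how the final image responds.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (4, 8)
skip_features = rng.normal(size=shape)     # frozen U-Net encoder features
control_residual = rng.normal(size=shape)  # hypothetical ControlNet residual for the edge map
text_out = rng.normal(size=shape)          # text cross-attention output
image_out = rng.normal(size=shape)         # IP-Adapter image cross-attention output

def compose(controlnet_scale, ip_adapter_scale):
    """Apply both adapters additively, each to its own part of a toy layer."""
    features = skip_features + controlnet_scale * control_residual  # WHERE
    attended = text_out + ip_adapter_scale * image_out              # WHAT + WHAT-IT-LOOKS-LIKE
    return features, attended

features, attended = compose(0.7, 0.5)
only_ip_features, only_ip_attended = compose(0.0, 0.5)

# Turning ControlNet off leaves the attention branch exactly as it was,
# and restores the frozen skip features.
print(np.array_equal(only_ip_features, skip_features))
print(np.array_equal(only_ip_attended, attended))
```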

**Your task:** Combine IP-Adapter with ControlNet in a single generation.

1. Download a **structure source image** (different from the reference image) and extract Canny edges from it
2. Load a pipeline with both IP-Adapter and ControlNet
3. Generate with:
   - The **reference image** providing visual style/identity (via IP-Adapter)
   - The **edge map** providing spatial structure (via ControlNet)
   - A **text prompt** providing semantic content
4. Display: structure source, edge map, reference image, and generated output
5. Interpret the results: which aspects came from which conditioning source?

No scaffolding. You decide the prompt, the scales, and how to interpret the results.

**Hints:**
- Use `StableDiffusionControlNetPipeline` with a ControlNet, then call `pipe.load_ip_adapter(...)` to add IP-Adapter on top
- The ControlNet conditioning image goes in `image=edge_map`
- The IP-Adapter reference image goes in `ip_adapter_image=reference_image`
- Use `pipe.set_ip_adapter_scale(...)` for IP-Adapter scale and `controlnet_conditioning_scale=...` for ControlNet scale
- Start with moderate scales (IP-Adapter ~0.5, ControlNet ~0.7) and adjust

**Available structure source images:**

In [None]:
# Structure source image options (for extracting edges)
# These are DIFFERENT from the IP-Adapter reference image. That is the point.
# The reference image provides visual STYLE. The structure source provides spatial LAYOUT.

structure_options = {
    "vermeer": "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png",
    "architecture": "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_input.png",
}

# Show what is available
option_images = []
option_titles = []
for name, url in structure_options.items():
    try:
        img = download_sample_image(url, f"{name}.jpg").resize((512, 512))
        option_images.append(img)
        option_titles.append(name)
    except Exception as e:
        print(f"Could not download {name}: {e}")

show_images(
    [reference_image] + option_images,
    ["Reference Image\n(IP-Adapter: visual style)"] + [f"{t}\n(ControlNet: spatial structure)" for t in option_titles],
    suptitle="Reference image for STYLE (left) vs structure source options for LAYOUT (right)",
)

print("Pick a structure source image. You will extract edges from it (ControlNet)")
print("while using the reference image for visual style (IP-Adapter).")
print("The two images serve different conditioning roles.")

In [None]:
# Your IP-Adapter + ControlNet composition code goes here.
#
# Workflow:
# 1. Choose a structure source image and extract Canny edges
# 2. Load ControlNet (Canny) + SD pipeline
# 3. Load IP-Adapter into the pipeline (pipe.load_ip_adapter(...))
# 4. Set IP-Adapter scale (pipe.set_ip_adapter_scale(...))
# 5. Generate with:
#    - image=edge_map (ControlNet spatial conditioning)
#    - ip_adapter_image=reference_image (IP-Adapter visual conditioning)
#    - controlnet_conditioning_scale=... (ControlNet strength)
#    - A text prompt of your choice
# 6. Display: structure source, edge map, reference image, and generated output
#
# Remember:
# - Use torch.float16 for all models
# - Canny thresholds around (100, 200) work well for clean edges
# - Start with IP-Adapter scale ~0.5, ControlNet scale ~0.7
# - Use make_generator(seed) for reproducible results



<details>
<summary>💡 Solution</summary>

The core insight is that IP-Adapter and ControlNet target different parts of the U-Net and are fully composable. ControlNet adds structural features at the skip connections (WHERE). IP-Adapter adds image features via decoupled cross-attention K/V projections (WHAT-IT-LOOKS-LIKE). Text provides semantic content via the original cross-attention K/V (WHAT). All three are additive.

```python
# 1. Choose structure source and extract edges
structure_url = structure_options["vermeer"]
structure_image = download_sample_image(structure_url, "structure.jpg").resize((512, 512))

edges = cv2.Canny(np.array(structure_image), 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 2. Load ControlNet + SD pipeline
controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet_canny,
    torch_dtype=torch.float16,
    safety_checker=None,
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# 3. Load IP-Adapter on top
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="models",
    weight_name="ip-adapter_sd15.bin",
)

# 4. Set scales
pipe.set_ip_adapter_scale(0.5)

# 5. Generate
generator = make_generator(42)
result = pipe(
    "a beautiful portrait, detailed, masterpiece",
    image=edge_map,
    ip_adapter_image=reference_image,
    num_inference_steps=20,
    generator=generator,
    controlnet_conditioning_scale=0.7,
).images[0]

# 6. Display
show_images(
    [structure_image, edge_map, reference_image, result],
    [
        "Structure Source\n(layout comes from here)",
        "Canny Edges\n(ControlNet: WHERE)",
        "Reference Image\n(IP-Adapter: WHAT-IT-LOOKS-LIKE)",
        "Generated\n(all three combined)",
    ],
    suptitle="ControlNet (structure) + IP-Adapter (style) + text (content)",
)

print("Three conditioning channels in one generation:")
print("  ControlNet (edges): spatial structure/layout from the structure source")
print("  IP-Adapter (reference): visual style/mood from the reference image")
print("  Text: semantic content and rendering style")

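# Clean up: drop every notebook-level reference, then release cached memory
cleanup_pipeline(pipe)
del pipe, controlnet_canny
gc.collect()
torch.cuda.empty_cache()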
```

**Key decisions:**
- IP-Adapter at 0.5 (moderate visual influence: we want to see the reference's character without overpowering the other signals)
- ControlNet at 0.7 (strong structural adherence: the edges should clearly control composition)
- The structure source image and reference image are intentionally DIFFERENT. The structure provides WHERE. The reference provides WHAT-IT-LOOKS-LIKE. If they were the same image, you would not be able to tell which conditioning channel contributed what.

**Common mistake:** Using the same image for both IP-Adapter and ControlNet. This defeats the purpose of the exercise: you cannot observe composability if both signals come from the same source.

</details>

---

## Key Takeaways

1. **IP-Adapter is purely additive.** At scale=0, the image branch contributes nothing and the output is identical to vanilla SD. The text K/V path is completely untouched at every scale value. Remove IP-Adapter and the frozen model is bit-for-bit identical.

2. **The scale parameter is a volume knob for image influence.** The transition from text-dominant (0.0) to image-dominant (1.0) is gradual and continuous. Same "volume knob" pattern as ControlNet's conditioning scale: familiar control, new conditioning dimension.

3. **Image prompting is addition, not replacement.** The same reference image with three different text prompts produces three different outputs. Text controls content and composition (WHAT). The reference image controls visual character (WHAT-IT-LOOKS-LIKE). Both K/V paths run in parallel.

4. **IP-Adapter and ControlNet are composable.** ControlNet provides spatial structure (WHERE) via additive encoder features. IP-Adapter provides visual identity (WHAT-IT-LOOKS-LIKE) via decoupled cross-attention K/V. Text provides semantic content (WHAT). All three are additive, target different parts of the U-Net, and work together in a single generation.

5. **The practical workflow is: choose reference image → set scale → compose with other conditioning → iterate.** The tools are the same every time; the creative decisions are what change.