# Img2Img and Inpainting

**Module 6.5, Lesson 2** | CourseAI

You know the forward process closed-form formula, the denoising loop, and the alpha-bar curve. You know the full Stable Diffusion pipeline from text prompt to pixel image. This notebook reconfigures the inference process you already understand: starting from a noised real image instead of pure noise (img2img), and applying a spatial mask at each denoising step (inpainting).

**What you will do:**
- Run img2img at strengths 0.1 through 0.9 on the same image to see the full strength spectrum and connect it to the alpha-bar curve
- Implement img2img from scratch: VAE encode, forward process noise, manual denoising loop, with no pipeline abstraction
- Create binary masks and run inpainting, observing how mask sizing affects boundary blending and the U-Net receptive field
- Combine img2img and inpainting in a multi-step creative editing workflow

**For each exercise, PREDICT the output before running the cell.**

**Estimated time:** 30–45 minutes.

---

## Setup

Run this cell to install dependencies and import everything. This notebook requires a GPU for reasonable inference times.

In [None]:
!pip install -q diffusers transformers accelerate

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image, ImageDraw
import gc
import requests
from io import BytesIO

# Reproducible results
torch.manual_seed(42)
np.random.seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dtype = torch.float16 if device.type == 'cuda' else torch.float32
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')

print('\nSetup complete.')

## Shared Helpers

Display helpers, model ID, and the source image used across exercises. Each exercise loads only the pipeline it needs and cleans up afterward to stay within free-tier Colab VRAM (~16 GB on a T4).

> **VRAM tip:** If you encounter an out-of-memory error, go to Runtime → Restart runtime and rerun from Setup. Exercise 2 uses the most VRAM because it loads individual components rather than a single pipeline.

In [None]:
model_id = 'stable-diffusion-v1-5/stable-diffusion-v1-5'


def show_images(images, titles, figsize=None):
    """Display a list of PIL images side by side."""
    n = len(images)
    if figsize is None:
        figsize = (5 * n, 5)
    fig, axes = plt.subplots(1, n, figsize=figsize)
    if n == 1:
        axes = [axes]
    for ax, img, title in zip(axes, images, titles):
        ax.imshow(np.array(img))
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    plt.tight_layout()
    plt.show()


def show_image_grid(images, titles, nrows, ncols, figsize=None, suptitle=None):
    """Display images in a grid with the given number of rows and columns."""
    if figsize is None:
        figsize = (4 * ncols, 4 * nrows)
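    # squeeze=False always returns a 2D array of axes; flatten it to a list
    # so the leftover axes can be hidden reliably after plotting.
    fig, axes = plt.subplots(nrows, ncols, figsize=figsize, squeeze=False)
    axes_flat = list(axes.ravel())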
    for ax, img, title in zip(axes_flat, images, titles):
        ax.imshow(np.array(img))
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    for ax in list(axes_flat)[len(images):]:
        ax.axis('off')
    if suptitle:
        plt.suptitle(suptitle, fontsize=13)
    plt.tight_layout()
    plt.show()


def load_sample_image(url, size=(512, 512)):
    """Download an image from a URL and resize it."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly if the download did not succeed
    img = Image.open(BytesIO(response.content)).convert('RGB')
    return img.resize(size, Image.LANCZOS)


def cleanup():
    """Free GPU memory."""
    gc.collect()
    if device.type == 'cuda':
        torch.cuda.empty_cache()


# Load the source image used across all exercises.
# This is a landscape photo hosted on Unsplash.
source_url = 'https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=512&h=512&fit=crop'
source_image = load_sample_image(source_url)

plt.figure(figsize=(5, 5))
plt.imshow(np.array(source_image))
plt.title('Source Image (used across all exercises)', fontsize=11)
plt.axis('off')
plt.tight_layout()
plt.show()

print(f'Source image size: {source_image.size}')
print('Helpers defined.')

---

## Exercise 1: Img2img Strength Exploration [Guided]

The lesson explained that the strength parameter determines where on the noise schedule the denoising loop starts. Low strength means starting where alpha-bar is high (mostly signal), so only fine details change. High strength means starting where alpha-bar is low (mostly noise), so the model can reimagine the entire composition.

This exercise makes that tangible. You will run img2img on the same input image at five different strength values with the same prompt and seed. The resulting grid directly maps the strength parameter onto the visual changes you see.

**Before running, predict:**
- Which strength value will preserve the most original structure? (Think: which strength starts the denoising loop closest to the end, where only detail-refinement steps run?)
- At strength=0.9, will the output look anything like the original landscape? (Think: strength=0.9 means 90% of the denoising process runs. Where on the alpha-bar curve is the starting point?)
- At what strength value would you expect the output to be completely unrelated to the input? (Hint: the boundary case from the lesson.)

In [None]:
from diffusers import StableDiffusionImg2ImgPipeline, DDIMScheduler

# Load the img2img pipeline.
# We use DDIMScheduler throughout this notebook because its deterministic
# stepping makes comparison between the pipeline (Exercise 1) and our
# manual implementation (Exercise 2) straightforward.
print('Loading StableDiffusionImg2ImgPipeline...')
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    safety_checker=None,
    requires_safety_checker=False,
).to(device)
img2img_pipe.scheduler = DDIMScheduler.from_config(
    img2img_pipe.scheduler.config
)
print('Pipeline loaded.')

# Run img2img at five strength values with the same prompt and seed.
prompt = 'a watercolor painting of mountains at sunset'
strengths = [0.1, 0.3, 0.5, 0.7, 0.9]
seed = 42

results = []
for strength in strengths:
    generator = torch.Generator(device=device).manual_seed(seed)
    result = img2img_pipe(
        prompt=prompt,
        image=source_image,
        strength=strength,
        guidance_scale=7.5,
        num_inference_steps=30,
        generator=generator,
    ).images[0]
    results.append(result)
    print(f'  strength={strength}: done')

# Display the original image alongside all strength results.
all_images = [source_image] + results
all_titles = ['Original'] + [f'strength={s}' for s in strengths]

show_image_grid(
    all_images, all_titles,
    nrows=2, ncols=3,
    figsize=(18, 12),
    suptitle=f'Img2img Strength Spectrum: "{prompt}"',
)

print(f'\nPrompt: "{prompt}"')
print(f'Seed: {seed}')
print(f'All images used the same input, prompt, and seed.')
print(f'The ONLY variable is the strength parameter.')

In [None]:
# Visualize the alpha-bar connection.
# Map each strength value to its actual position on the alpha-bar curve
# using the scheduler's 30-step timestep schedule (the same schedule used
# in the exercises above).

alphas_cumprod = img2img_pipe.scheduler.alphas_cumprod.cpu().numpy()
total_timesteps = len(alphas_cumprod)

# Get the actual 30-step schedule the scheduler uses during inference.
num_inference_steps = 30
img2img_pipe.scheduler.set_timesteps(num_inference_steps)
actual_timesteps = img2img_pipe.scheduler.timesteps.cpu().numpy()

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(range(total_timesteps), alphas_cumprod, color='cyan', linewidth=2,
        label=r'$\bar{\alpha}_t$ (cumulative signal fraction)')

# Mark each strength value on the curve using the actual scheduler timesteps.
colors = ['#22c55e', '#3b82f6', '#a855f7', '#f59e0b', '#ef4444']
for strength, color in zip(strengths, colors):
    # strength determines how many of the 30 steps run.
    num_steps_to_run = int(num_inference_steps * strength)
    start_step_index = num_inference_steps - num_steps_to_run
    # Clamp to valid range.
    start_step_index = min(start_step_index, len(actual_timesteps) - 1)
    t_start = actual_timesteps[start_step_index]
    t_start_int = int(t_start)
    alpha_bar_at_start = alphas_cumprod[t_start_int]

    ax.axvline(x=t_start_int, color=color, linestyle='--', alpha=0.7)
    ax.scatter([t_start_int], [alpha_bar_at_start], color=color, s=100, zorder=5)
    ax.annotate(
        f's={strength}\n$\\bar{{\\alpha}}$={alpha_bar_at_start:.3f}',
        (t_start_int, alpha_bar_at_start),
        textcoords='offset points', xytext=(10, 10),
        fontsize=9, color=color,
    )

ax.set_xlabel('Timestep t', fontsize=11)
ax.set_ylabel(r'$\bar{\alpha}_t$ (signal fraction)', fontsize=11)
ax.set_title('Strength Parameter Mapped onto the Alpha-Bar Curve (30-step DDIM schedule)', fontsize=12)
ax.legend(fontsize=10)
plt.tight_layout()
plt.show()

print('Each dot shows where the denoising loop STARTS for that strength value.')
print(f'Positions use the actual {num_inference_steps}-step DDIM schedule, not a linear mapping.')
print('Low strength = start near the left (small t, high alpha-bar, mostly signal).')
print('High strength = start near the right (large t, low alpha-bar, mostly noise).')
print('\nThe curve is nonlinear. That is why the effect of strength is nonlinear.')
print('The jump from 0.3 to 0.5 is qualitatively different from 0.7 to 0.9,')
print('because different denoising phases control different aspects of the image.')

In [None]:
# Clean up the img2img pipeline before Exercise 2.
del img2img_pipe
cleanup()
print('Img2img pipeline freed from VRAM.')

### What Just Happened

You saw the full strength spectrum on a single input image. Key observations:

1. **strength=0.1** preserves almost everything. Only the finest textures and color tones shift toward the watercolor prompt. The model ran only the detail-refinement steps, so structure is locked in.

2. **strength=0.5** is the editing sweet spot. The broad composition (mountains, sky, horizon) is preserved, but the rendering style changes significantly. The model ran both structure-setting and detail steps.

3. **strength=0.9** reimagines the image almost entirely. Only the vaguest spatial hints of the original survive. The model ran nearly the entire denoising process, starting from nearly pure noise.

4. **The alpha-bar curve explains everything.** At strength=0.1, alpha-bar is high (above 0.9 on this schedule), so the noised image is almost clean. At strength=0.9, alpha-bar is low (~0.02), so the noised image is almost pure noise. The nonlinear curve explains why going from 0.3 to 0.5 changes detail quality, while going from 0.7 to 0.9 changes global structure.

5. **At strength=1.0, the input image would have zero influence.** The image would be noised to pure random noise, and the output would be standard text-to-image. This is the boundary case from the lesson.
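The mapping from strength to the loop's starting point is plain integer arithmetic. A minimal sketch, mirroring the calculation in the plotting cell above; the function name `img2img_start` is made up for illustration and is not part of diffusers:

```python
def img2img_start(num_inference_steps: int, strength: float) -> tuple[int, int]:
    """Return (steps_to_run, start_step_index) for a given strength.

    Hypothetical helper for illustration; it mirrors the arithmetic used in
    this notebook's cells and is not a library API.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Number of denoising steps that actually run.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    # Index into the scheduler's descending timestep list where the loop starts.
    start_step_index = num_inference_steps - steps_to_run
    return steps_to_run, start_step_index


for s in (0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    run, start = img2img_start(30, s)
    print(f"strength={s}: run {run} of 30 steps, start at index {start}")
```

At strength=1.0 the start index is 0, so the loop begins at the noisiest scheduled timestep and the whole schedule runs.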

---

## Exercise 2: Implement Img2img From the Denoising Loop [Guided]

The lesson showed that img2img is a 3-line change to the standard text-to-image pipeline:
1. **VAE encode** the input image to latent space
2. **Noise** the latent to a starting timestep using the forward process formula
3. **Change the loop start** from T (pure noise) to t_start (partially noised image)

This exercise implements those three changes by hand: no `StableDiffusionImg2ImgPipeline`, just the raw components. You will encode, noise, and denoise manually. Both exercises use the same DDIMScheduler, same seed, same prompt, and same strength, so the manual output should closely resemble Exercise 1's pipeline output at strength=0.7. Small differences are possible because the pipeline may draw its random numbers in a different order than the manual code does.
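Before touching the real components, it can help to see the noising step in isolation. The following is a toy NumPy sketch of the closed-form forward process; the random array stands in for a VAE latent, and the alpha-bar values are illustrative choices, not values read from a scheduler:

```python
import numpy as np


def forward_noise(z0: np.ndarray, alpha_bar_t: float, eps: np.ndarray) -> np.ndarray:
    """Closed-form forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps


rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))  # toy stand-in for an encoded image latent
eps = rng.standard_normal(z0.shape)    # Gaussian noise with the same shape

for a in (1.0, 0.5, 0.0):              # illustrative alpha-bar values
    zt = forward_noise(z0, a, eps)
    corr = np.corrcoef(zt.ravel(), z0.ravel())[0, 1]
    print(f"alpha_bar={a}: correlation with z0 = {corr:.2f}")
```

With alpha-bar at 1 the latent is returned unchanged, and with alpha-bar at 0 it is replaced by the noise entirely; every strength setting lands somewhere between those two extremes.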

**Before running, predict:**
- After VAE encoding, what shape will the latent z_0 have? (512/8 = 64, 4 latent channels.)
- With 30 DDIM steps and strength=0.7, how many denoising steps will actually run? (70% of 30 = 21 steps.)
- Will the manually constructed output match the pipeline output from Exercise 1? (Same scheduler, same seed, same strength, same prompt.)

In [None]:
from diffusers import UNet2DConditionModel, AutoencoderKL, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

# Load individual components. We use the same DDIMScheduler as Exercise 1
# so that our manual output can be compared directly to the pipeline output.
print('Loading individual pipeline components...')
vae = AutoencoderKL.from_pretrained(model_id, subfolder='vae', torch_dtype=dtype).to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=dtype).to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained(
    model_id, subfolder='text_encoder', torch_dtype=dtype
).to(device)
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder='scheduler')

# Freeze everything: we are doing inference, not training.
vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.requires_grad_(False)

print('Components loaded.')

In [None]:
# ============================================================
# STEP 1: Encode the input image with the VAE encoder.
# ============================================================
# Remember from the lesson: "We said the VAE encoder is not used during
# text-to-image inference. That was correct. Img2img IS different: it is
# image-to-image, and the encoder converts your input to latent space."

from torchvision import transforms

# Preprocess the source image to a tensor normalized to [-1, 1].
image_transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

image_tensor = image_transform(source_image).unsqueeze(0).to(device, dtype=dtype)
print(f'Input image tensor shape: {list(image_tensor.shape)}')
print(f'Value range: [{image_tensor.min().item():.2f}, {image_tensor.max().item():.2f}]')

# VAE encode: image space -> latent space.
# This is the VAE encoder that is NOT used in text-to-image but IS used in img2img.
# We use .mode() (the distribution mean) rather than .sample() for deterministic
# encoding. The VAE's learned variance is small enough that the mean is an
# excellent approximation. The pipeline in Exercise 1 may sample from the
# distribution instead, so tiny differences from its output are possible.
with torch.no_grad():
    encoder_output = vae.encode(image_tensor)
    z_0 = encoder_output.latent_dist.mode() * vae.config.scaling_factor

print(f'\nLatent z_0 shape: {list(z_0.shape)}')
print(f'VAE scaling factor: {vae.config.scaling_factor}')
print(f'\n8x spatial compression: 512 -> 64. Four latent channels.')
print('This is the same encoding from "From Pixels to Latents."')
print('In text-to-image, this encoder is skipped (start from pure noise).')
print('In img2img, the encoder is NEEDED to bring the input into latent space.')

In [None]:
# ============================================================
# STEP 2: Add noise to z_0 using the forward process formula.
# ============================================================
# This is the formula from "The Forward Process":
#   z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * epsilon
#
# The strength parameter determines HOW MUCH noise to add, i.e., which
# timestep t_start to noise to.

strength = 0.7
num_inference_steps = 30

# Set up the scheduler's timestep schedule.
scheduler.set_timesteps(num_inference_steps, device=device)
all_timesteps = scheduler.timesteps  # descending: noisiest timestep first, least noisy last

# Compute which timestep to start from.
# strength=0.7 means 70% of the denoising steps run.
num_steps_to_run = int(num_inference_steps * strength)  # 21 steps
start_step_index = num_inference_steps - num_steps_to_run  # step index 9
t_start = all_timesteps[start_step_index]  # the actual timestep value

print(f'Total inference steps:  {num_inference_steps}')
print(f'Strength:              {strength}')
print(f'Steps to run:          {num_steps_to_run}')
print(f'Start step index:      {start_step_index}')
print(f'Starting timestep t:   {t_start.item()}')

# Noise z_0 to z_{t_start} using the forward process formula.
# The seeded generator makes the noise reproducible from run to run.
generator = torch.Generator(device=device).manual_seed(42)
noise = torch.randn(z_0.shape, generator=generator, device=device, dtype=dtype)

# Forward process: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * noise
alpha_bar_t = scheduler.alphas_cumprod[t_start.long().cpu()]
z_t_start = (
    (alpha_bar_t ** 0.5) * z_0
    + ((1 - alpha_bar_t) ** 0.5) * noise
)

print(f'\nalpha_bar at t={t_start.item()}: {alpha_bar_t.item():.4f}')
print(f'Signal fraction:       {alpha_bar_t.item():.4f} ({alpha_bar_t.item()*100:.1f}%)')
print(f'Noise fraction:        {(1 - alpha_bar_t.item()):.4f} ({(1 - alpha_bar_t.item())*100:.1f}%)')
print(f'\nNoised latent z_t_start shape: {list(z_t_start.shape)}')
print(f'\nThis is the forward process formula in action. The same formula')
print(f'you derived in "The Forward Process", used in the capstone, and')
print(f'used in the LoRA training loop. Now it appears in a third context:')
print(f'inference-time image editing.')

In [None]:
# ============================================================
# STEP 3: Encode the text prompt with frozen CLIP.
# ============================================================
# This is identical to standard text-to-image. The text conditioning
# mechanism is unchanged in img2img.

prompt = 'a watercolor painting of mountains at sunset'
negative_prompt = ''

# Encode prompt.
text_tokens = tokenizer(
    prompt, padding='max_length', max_length=tokenizer.model_max_length,
    truncation=True, return_tensors='pt',
)
with torch.no_grad():
    text_embeddings = text_encoder(text_tokens.input_ids.to(device))[0]

# Encode the unconditional (empty) prompt for classifier-free guidance.
uncond_tokens = tokenizer(
    negative_prompt, padding='max_length', max_length=tokenizer.model_max_length,
    truncation=True, return_tensors='pt',
)
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_tokens.input_ids.to(device))[0]

# Concatenate for CFG: [unconditional, conditional].
# The U-Net processes both in a single batch, then CFG combines them.
text_emb = torch.cat([uncond_embeddings, text_embeddings])

print(f'Prompt: "{prompt}"')
print(f'Text embeddings shape: {list(text_emb.shape)}')
print(f'  [0] = unconditional (empty prompt)')
print(f'  [1] = conditional ("{prompt}")')

In [None]:
# ============================================================
# STEP 4: The denoising loop, starting from t_start, not from T.
# ============================================================
# This is the ONE change from standard text-to-image:
# - Text-to-image: loop from step 0 (t=T, pure noise) to the last step (t~0)
# - Img2img: loop from start_step_index (t=t_start, partially noised image)
#
# The denoising loop itself is IDENTICAL. Same U-Net, same CFG, same sampler.

guidance_scale = 7.5
z = z_t_start.clone()

# Only iterate over the timesteps from t_start onward.
denoising_timesteps = all_timesteps[start_step_index:]
print(f'Denoising from step index {start_step_index} ({denoising_timesteps[0].item()}) '
      f'to step index {num_inference_steps - 1} ({denoising_timesteps[-1].item()})')
print(f'Running {len(denoising_timesteps)} denoising steps (out of {num_inference_steps} total)\n')

with torch.no_grad():
    for i, t in enumerate(denoising_timesteps):
        # Duplicate the latent for CFG (unconditional + conditional).
        z_input = torch.cat([z] * 2)

        # U-Net predicts noise.
        noise_pred = unet(z_input, t, encoder_hidden_states=text_emb).sample

        # Classifier-free guidance: amplify the text direction.
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (
            noise_pred_text - noise_pred_uncond
        )

        # Scheduler step (DDIM update).
        z = scheduler.step(noise_pred, t, z).prev_sample

        if i % 5 == 0:
            print(f'  Step {i:>2d}/{len(denoising_timesteps)}: t = {t.item()}')

print(f'\nDenoising complete. Final latent shape: {list(z.shape)}')

In [None]:
# ============================================================
# STEP 5: VAE decode back to pixel space.
# ============================================================
# This is the same decode step as standard text-to-image.

with torch.no_grad():
    decoded = vae.decode(z / vae.config.scaling_factor).sample

# Convert tensor to PIL image.
decoded_np = decoded[0].cpu().float().numpy()
decoded_np = ((decoded_np + 1) / 2).clip(0, 1)  # [-1, 1] -> [0, 1]
decoded_np = (decoded_np.transpose(1, 2, 0) * 255).astype(np.uint8)
manual_result = Image.fromarray(decoded_np)

# Display the manual result alongside the original.
show_images(
    [source_image, manual_result],
    ['Original', f'Manual img2img (strength={strength})'],
    figsize=(12, 6),
)

print(f'\nYou just implemented img2img from scratch:')
print(f'  1. VAE encode: [3, 512, 512] -> [4, 64, 64]')
print(f'  2. Forward process noise to t={t_start.item()} (alpha_bar={alpha_bar_t.item():.4f})')
print(f'  3. Denoise from t={t_start.item()} to t~0 ({num_steps_to_run} DDIM steps)')
print(f'  4. VAE decode: [4, 64, 64] -> [3, 512, 512]')
print(f'\nNo StableDiffusionImg2ImgPipeline. Just the raw components.')
print(f'Img2img IS the forward-process-then-denoise mechanism, not a black box.')

In [None]:
# Clean up individual components before Exercise 3.
del vae, unet, text_encoder, z, z_0, z_t_start, noise, text_emb
cleanup()
print('Components freed from VRAM.')

### What Just Happened

You implemented img2img by hand using only the individual pipeline components, with no `StableDiffusionImg2ImgPipeline` abstraction. The core mechanism was three changes:

1. **VAE encode** the input image (the encoder that is NOT used in text-to-image but IS needed for img2img).
2. **Forward process noise** to the starting timestep using the closed-form formula you derived in Module 6.2.
3. **Start the denoising loop from t_start** instead of from T.

The denoising loop itself was unchanged: same U-Net, same CFG (unconditional and conditional predictions from one batched forward pass, then amplify the text direction), same DDIM scheduler step. The only difference from standard text-to-image was the starting point.

This confirms the lesson's core claim: **img2img is not a new algorithm. It is the same denoising process with a different starting point.**
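The CFG step inside that loop is a single linear combination of two noise predictions. A toy NumPy sketch with random arrays standing in for the U-Net outputs; the function name `apply_cfg` is illustrative, not a diffusers API:

```python
import numpy as np


def apply_cfg(noise_uncond: np.ndarray, noise_text: np.ndarray,
              guidance_scale: float) -> np.ndarray:
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the text-conditioned one."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)


rng = np.random.default_rng(0)
noise_uncond = rng.standard_normal((1, 4, 64, 64))  # toy unconditional prediction
noise_text = rng.standard_normal((1, 4, 64, 64))    # toy text-conditioned prediction

guided = apply_cfg(noise_uncond, noise_text, guidance_scale=7.5)
print("guided prediction shape:", guided.shape)
```

A scale of 1.0 reproduces the text-conditioned prediction exactly, and a scale of 0.0 reproduces the unconditional one; values above 1.0 extrapolate along the text direction.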

---

## Exercise 3: Inpainting with Mask Design [Supported]

The lesson explained that inpainting adds one operation to the denoising loop: at each step, a spatial mask controls which regions the model can change and which are preserved from the original. The mask formula:

$$z_t^{\text{combined}} = m \cdot z_t^{\text{denoised}} + (1 - m) \cdot \text{forward}(z_0^{\text{original}}, t)$$

This exercise uses `StableDiffusionInpaintPipeline` to inpaint a region of the source image. You will create masks of different sizes to observe how mask sizing affects boundary quality, connecting to the lesson's point about the U-Net's receptive field and why boundaries blend seamlessly.
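The mask formula above can be tried on toy arrays before running the pipeline. This is a minimal NumPy sketch of one blending step, assuming random stand-ins for the latents and an illustrative alpha-bar value; the name `blend_step` is hypothetical and this is not the pipeline's actual code:

```python
import numpy as np


def blend_step(z_denoised, z0_original, mask, alpha_bar_t, eps):
    """m * z_denoised + (1 - m) * forward(z0_original, t)."""
    # Re-noise the original latent to the current noise level.
    z_orig_noised = (np.sqrt(alpha_bar_t) * z0_original
                     + np.sqrt(1.0 - alpha_bar_t) * eps)
    # Keep the model's output where mask == 1, the re-noised original elsewhere.
    return mask * z_denoised + (1.0 - mask) * z_orig_noised


rng = np.random.default_rng(0)
z0_original = rng.standard_normal((4, 64, 64))  # toy latent of the source image
z_denoised = rng.standard_normal((4, 64, 64))   # toy stand-in for the sampler output at step t
eps = rng.standard_normal((4, 64, 64))          # noise used to re-noise the original

mask = np.zeros((1, 64, 64))
mask[:, :25, :] = 1.0  # edit the top 25 latent rows, preserve the rest

z_combined = blend_step(z_denoised, z0_original, mask, alpha_bar_t=0.5, eps=eps)
print("edited rows come from the model:", np.allclose(z_combined[:, :25], z_denoised[:, :25]))
```

The mask broadcasts over the four latent channels, so one spatial mask controls all of them.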

**Your tasks:**
- Create a binary mask covering a region of the source image
- Run inpainting with a descriptive prompt
- Compare tight vs generous mask sizing
- Observe boundary blending quality

In [None]:
from diffusers import StableDiffusionInpaintPipeline, DPMSolverMultistepScheduler

# Load the inpainting pipeline.
print('Loading StableDiffusionInpaintPipeline...')
inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    safety_checker=None,
    requires_safety_checker=False,
).to(device)
inpaint_pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    inpaint_pipe.scheduler.config
)
print('Pipeline loaded.')

In [None]:
# Create TWO binary masks for the source image.
# A binary mask is a PIL Image (mode 'L') where:
#   - White (255) = edit this region (the model denoises here)
#   - Black (0) = preserve the original (re-noised original latent used here)
#
# We provide a TIGHT mask and you create a GENEROUS mask covering roughly
# the same area (the sky/upper portion of the landscape image).

# Pre-filled: the tight mask covers the sky region (top 200 pixels).
# This is intentionally snug against the sky-mountain boundary.
tight_mask = Image.new('L', (512, 512), 0)  # Start with all black (preserve)
tight_draw = ImageDraw.Draw(tight_mask)
tight_draw.rectangle([0, 0, 512, 200], fill=255)

# TODO 1: Create the generous mask.
# The generous mask should cover the same sky region but extend ~40 pixels
# further downward, giving the model more room to blend at the boundary.
# Use the same pattern as the tight mask above, but with a larger rectangle.
#
# generous_mask = Image.new('L', (512, 512), 0)
# generous_draw = ImageDraw.Draw(generous_mask)
# generous_draw.rectangle([left, top, right, bottom], fill=255)
generous_mask = None  # Replace this line with the generous mask

# Display the masks.
if generous_mask is not None:
    show_images(
        [source_image, tight_mask, generous_mask],
        ['Source Image', 'Tight Mask', 'Generous Mask'],
        figsize=(15, 5),
    )
else:
    print('Create the generous mask above to continue.')

In [None]:
# TODO 2: Run inpainting with both masks and compare.
# Use the inpaint_pipe with:
#   - prompt: 'dramatic thunderstorm clouds, dark sky'
#   - image: source_image
#   - mask_image: the mask (tight_mask or generous_mask)
#   - guidance_scale: 7.5
#   - num_inference_steps: 30
#   - generator: seeded for reproducibility
#
# Run once with tight_mask and once with generous_mask.

inpaint_prompt = 'dramatic thunderstorm clouds, dark sky'
seed = 42

# Inpaint with tight mask.
# TODO: Call inpaint_pipe(...) with tight_mask.
# generator = torch.Generator(device=device).manual_seed(seed)
# tight_result = inpaint_pipe(...).images[0]
tight_result = None  # Replace this line

# Inpaint with generous mask.
# TODO: Call inpaint_pipe(...) with generous_mask.
# generator = torch.Generator(device=device).manual_seed(seed)
# generous_result = inpaint_pipe(...).images[0]
generous_result = None  # Replace this line

if tight_result is not None and generous_result is not None:
    show_image_grid(
        [source_image, tight_mask, tight_result, source_image, generous_mask, generous_result],
        ['Original', 'Tight Mask', 'Tight Result', 'Original', 'Generous Mask', 'Generous Result'],
        nrows=2, ncols=3,
        figsize=(18, 12),
        suptitle=f'Inpainting: "{inpaint_prompt}"',
    )
    print(f'\nCompare the boundary between the inpainted sky and the preserved mountains.')
    print(f'The generous mask gives the model more room to blend. Look for smoother')
    print(f'transitions at the sky-mountain boundary.')
    print(f'\nWhy do boundaries blend at all? The U-Net sees the FULL latent at every')
    print(f'denoising step. At the 8x8 bottleneck resolution, each position has a global')
    print(f'receptive field. The model\'s predictions for the sky account for the')
    print(f'mountain context. This is fundamentally different from cut-and-paste.')
else:
    print('Fill in TODO 2 to run inpainting.')

<details>
<summary>💡 Solution</summary>

The key insight is that inpainting is the standard denoising loop plus a per-step mask. The mask tells the model which regions to denoise and which to preserve. Mask sizing matters because the model needs spatial context around the edited region to produce coherent boundaries.

**TODO 1 (create the generous mask):**
```python
generous_mask = Image.new('L', (512, 512), 0)
generous_draw = ImageDraw.Draw(generous_mask)
generous_draw.rectangle([0, 0, 512, 240], fill=255)  # 40px more than tight mask
```
Why extend by 40 pixels? The lesson noted that slightly oversized masks produce better results because the model has room to create smooth transitions. The tight mask cuts exactly at the sky-mountain boundary, leaving no room for blending. The generous mask extends into the mountains slightly, letting the model handle the transition.

**TODO 2 (run inpainting):**
```python
generator = torch.Generator(device=device).manual_seed(seed)
tight_result = inpaint_pipe(
    prompt=inpaint_prompt,
    image=source_image,
    mask_image=tight_mask,
    guidance_scale=7.5,
    num_inference_steps=30,
    generator=generator,
).images[0]

generator = torch.Generator(device=device).manual_seed(seed)
generous_result = inpaint_pipe(
    prompt=inpaint_prompt,
    image=source_image,
    mask_image=generous_mask,
    guidance_scale=7.5,
    num_inference_steps=30,
    generator=generator,
).images[0]
```
Note the re-seeded generator for each call. Same seed ensures the random noise is identical, so differences between tight and generous results come only from the mask size.

**Common mistakes:**
- Forgetting to re-create the generator between pipeline calls. The first call consumes the random state.
- Using `mode='RGB'` instead of `mode='L'` for the mask. The pipeline expects a single-channel mask.
- Inverting the mask convention: white (255) means "edit here", black (0) means "preserve." Some implementations use the opposite convention.

</details>

In [None]:
# Bonus: demonstrate the boundary case from the lesson.
# What happens when the mask covers the ENTIRE image?

full_mask = Image.new('L', (512, 512), 255)  # All white = edit everything

generator = torch.Generator(device=device).manual_seed(seed)
full_mask_result = inpaint_pipe(
    prompt=inpaint_prompt,
    image=source_image,
    mask_image=full_mask,
    guidance_scale=7.5,
    num_inference_steps=30,
    generator=generator,
).images[0]

show_images(
    [source_image, full_mask_result],
    ['Original', 'Full-Image Mask (entire image denoised)'],
    figsize=(12, 6),
)

print('When mask=1 everywhere, EVERY region is denoised and NONE are preserved.')
print('This collapses to standard img2img / text-to-image.')
print('The mask IS the entire difference between inpainting and standard denoising.')

In [None]:
# Clean up before Exercise 4.
del inpaint_pipe
cleanup()
print('Inpainting pipeline freed from VRAM.')

### What Just Happened

You created binary masks and ran inpainting, observing three key phenomena:

1. **Mask sizing affects boundary quality.** The generous mask with extra padding produces smoother transitions at the sky-mountain boundary. The tight mask can produce more abrupt transitions because the model has less room to blend. The lesson's advice: "err on the side of slightly too large."

2. **Boundaries blend naturally.** Even with the tight mask, the boundary is far smoother than you would get from cut-and-paste. This is because the U-Net sees the full latent at every denoising step. At the 8x8 bottleneck, each position has a global receptive field, so the model's predictions for the sky region near the boundary account for the mountain context.

3. **Full-image mask = standard denoising.** When the mask covers everything, no regions are preserved. Inpainting collapses to standard generation. The mask is the entire mechanism.
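Because the blending happens in latent space, it helps to look at the masks at latent resolution. The sketch below downsamples a 512x512 binary mask to 64x64 by block-averaging and thresholding; that resizing method is an assumption chosen for illustration (the pipeline's own interpolation may differ), and both helper names are hypothetical:

```python
import numpy as np


def latent_mask(pixel_mask: np.ndarray, factor: int = 8) -> np.ndarray:
    """Downsample a binary pixel mask by averaging factor x factor blocks,
    then thresholding at 0.5."""
    h, w = pixel_mask.shape
    blocks = pixel_mask.reshape(h // factor, factor, w // factor, factor)
    return (blocks.mean(axis=(1, 3)) > 0.5).astype(np.float32)


def sky_mask(bottom: int, size: int = 512) -> np.ndarray:
    """Binary mask with pixel rows 0 through `bottom` set to 1 (edit region)."""
    m = np.zeros((size, size), dtype=np.float32)
    m[: bottom + 1, :] = 1.0
    return m


tight = latent_mask(sky_mask(200))
generous = latent_mask(sky_mask(240))
print("tight mask latent rows:   ", int(tight[:, 0].sum()))
print("generous mask latent rows:", int(generous[:, 0].sum()))
```

Under this assumption the 40-pixel extension amounts to five extra latent rows, which is the room the model gets for blending at the boundary.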

---

## Exercise 4: Creative Multi-Step Workflow [Independent]

The lesson positioned img2img and inpainting as complementary tools: img2img transforms the whole image, inpainting edits specific regions. The most powerful workflows combine them.

Your task: create a multi-step editing workflow that combines both techniques.

**Workflow specification:**
1. **Create a simple sketch** (or use the source image)--a rough starting point
2. **Img2img** at high strength (0.6â€“0.8) to transform it into a coherent scene with a descriptive prompt
3. **Inpainting** to selectively modify one element of the result (e.g., change the sky, add an object, modify a region)
4. **Display the full progression:** original/sketch → img2img result → inpainted result

**Key APIs you will need:**
- `StableDiffusionImg2ImgPipeline`--for the initial transformation
- `StableDiffusionInpaintPipeline`--for the selective edit
- `Image.new('L', (512, 512), 0)` + `ImageDraw.Draw(mask).rectangle(...)`--for mask creation

**VRAM constraint:** Load only one pipeline at a time. Delete and call `cleanup()` between pipelines.

**What to observe:**
- How does the img2img strength affect how much of the original survives?
- Does the inpainted region blend naturally with the img2img result?
- Could you achieve this result with either technique alone?

In [None]:
# YOUR CREATIVE WORKFLOW
#
# Step 1: Create a simple sketch (or use source_image as your starting point).
#
# To create a simple sketch programmatically:
#   sketch = Image.new('RGB', (512, 512), (200, 220, 255))  # light blue sky
#   draw = ImageDraw.Draw(sketch)
#   draw.polygon([(0, 350), (150, 200), (300, 300), (512, 250), (512, 512), (0, 512)],
#                fill=(50, 120, 50))  # green mountains
#   draw.rectangle([0, 400, 512, 512], fill=(80, 60, 40))  # brown ground
#
# Or simply use: starting_image = source_image
#
# Step 2: Load StableDiffusionImg2ImgPipeline and transform the sketch/image.
#   pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
#       model_id, torch_dtype=dtype, safety_checker=None,
#       requires_safety_checker=False).to(device)
#   pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
#   img2img_result = pipe(prompt=..., image=..., strength=..., ...).images[0]
#   del pipe; cleanup()
#
# Step 3: Load StableDiffusionInpaintPipeline and edit a specific region.
#   Create a mask, run inpainting on the img2img result.
#   pipe = StableDiffusionInpaintPipeline.from_pretrained(
#       model_id, torch_dtype=dtype, safety_checker=None,
#       requires_safety_checker=False).to(device)
#   pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
#   final_result = pipe(prompt=..., image=..., mask_image=..., ...).images[0]
#   del pipe; cleanup()
#
# Step 4: Display the full progression.
#   show_images(
#       [starting_image, img2img_result, mask, final_result],
#       ['Starting Image', 'After Img2img', 'Inpainting Mask', 'Final Result'],
#   )

print('Build your creative workflow here.')
print('Remember: load one pipeline at a time, cleanup between them.')

<details>
<summary>💡 Solution</summary>

**Experimental design reasoning:** The multi-step workflow demonstrates that img2img and inpainting are complementary. Img2img transforms the global composition (you cannot do this with inpainting alone without masking everything). Inpainting edits specific regions (you cannot do this with img2img alone without affecting the whole image). Together, they form a practical editing pipeline: rough composition → global transformation → selective refinement.

```python
from diffusers import (StableDiffusionImg2ImgPipeline,
                       StableDiffusionInpaintPipeline,
                       DPMSolverMultistepScheduler)

# Step 1: Create a simple sketch.
sketch = Image.new('RGB', (512, 512), (180, 210, 240))  # light blue sky
draw = ImageDraw.Draw(sketch)
# Simple mountain shapes
draw.polygon([(0, 350), (100, 180), (200, 280), (300, 200), (400, 250),
              (512, 300), (512, 512), (0, 512)], fill=(60, 100, 50))
# A simple lake/river in the foreground
draw.ellipse([100, 380, 400, 480], fill=(100, 150, 200))

# Step 2: Img2img the sketch into a realistic scene.
i2i_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id, torch_dtype=dtype, safety_checker=None,
    requires_safety_checker=False).to(device)
i2i_pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    i2i_pipe.scheduler.config)

generator = torch.Generator(device=device).manual_seed(42)
img2img_result = i2i_pipe(
    prompt='a beautiful mountain landscape with a lake, photorealistic, golden hour',
    image=sketch,
    strength=0.75,  # High strength: the sketch is rough, give the model creative freedom
    guidance_scale=7.5,
    num_inference_steps=30,
    generator=generator,
).images[0]

del i2i_pipe
cleanup()

# Step 3: Inpaint the sky to add dramatic clouds.
sky_mask = Image.new('L', (512, 512), 0)
sky_draw = ImageDraw.Draw(sky_mask)
sky_draw.rectangle([0, 0, 512, 220], fill=255)  # Generous sky mask

inp_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    model_id, torch_dtype=dtype, safety_checker=None,
    requires_safety_checker=False).to(device)
inp_pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    inp_pipe.scheduler.config)

generator = torch.Generator(device=device).manual_seed(42)
final_result = inp_pipe(
    prompt='dramatic sunset sky with orange and purple clouds',
    image=img2img_result,
    mask_image=sky_mask,
    guidance_scale=7.5,
    num_inference_steps=30,
    generator=generator,
).images[0]

del inp_pipe
cleanup()

# Step 4: Display the full progression.
show_image_grid(
    [sketch, img2img_result, sky_mask, final_result],
    ['Sketch', 'After Img2img (strength=0.75)',
     'Inpainting Mask', 'After Inpainting (sky replaced)'],
    nrows=1, ncols=4, figsize=(20, 5),
    suptitle='Multi-Step Creative Workflow: Sketch -> Img2img -> Inpainting',
)

print('The workflow progression:')
print('  1. Rough sketch provides spatial composition (where mountains are, where the lake is).')
print('  2. Img2img at strength=0.75 transforms the sketch into a photorealistic scene.')
print('     The model has enough creative freedom to add realistic detail, but the')
print('     sketch\'s composition (mountain shapes, lake position) guides the structure.')
print('  3. Inpainting replaces ONLY the sky with dramatic clouds. The mountains and')
print('     lake are preserved exactly as img2img generated them. The boundary blends')
print('     seamlessly because the U-Net sees the full image at every step.')
print('\nNeither technique alone could do this:')
print('  - Img2img alone cannot selectively edit the sky while preserving the landscape.')
print('  - Inpainting alone cannot transform a sketch into a photorealistic scene.')
```

**Why strength=0.75 for the sketch?** The sketch is very rough: it needs substantial creative reinterpretation. Low strength (0.3) would preserve the sketch's flat colors and hard edges. High strength (0.75) lets the model reimagine the details while keeping the broad spatial composition. This connects to the coarse-to-fine mental model: at strength=0.75, the model runs the structure-setting steps, but the sketch's spatial layout still provides enough guidance to anchor the composition.
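
To make "runs the structure-setting steps" concrete, here is a minimal sketch of the step arithmetic. It assumes the pipeline truncates `num_inference_steps * strength` to an integer and skips the remaining early steps, which is how diffusers' img2img pipelines are understood to choose their starting point. `img2img_schedule` is a hypothetical helper, not a library function.

```python
def img2img_schedule(strength, num_inference_steps):
    """Return (steps_skipped, steps_run) for a given img2img strength."""
    # Assumed rule: run the last int(steps * strength) steps of the schedule.
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = max(num_inference_steps - steps_run, 0)
    return steps_skipped, steps_run

for strength in (0.5, 0.75, 1.0):
    skipped, run = img2img_schedule(strength, num_inference_steps=30)
    print(f'strength={strength}: skip {skipped} early steps, run {run} of 30')
```

Under this assumption, the solution's `strength=0.75` with 30 steps skips only the noisiest handful of steps, so most of the coarse structure is still decided by the model rather than inherited from the sketch.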

**Why a generous sky mask?** The lesson noted that slightly oversized masks produce better results. Extending the mask 20-40 pixels below the visible sky-mountain boundary gives the model room to create a smooth, coherent transition between the new clouds and the preserved mountains.
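
As a rough illustration of the padding advice, the sketch below grows a rectangle by a fixed margin and converts that margin into latent cells. The helper names, the example boundary at y=190, and the 30-pixel margin are all hypothetical. The factor of 8 is the VAE downsampling of Stable Diffusion v1-style models.

```python
def pad_mask_box(box, pad, size=512):
    """Grow (left, top, right, bottom) by `pad` pixels, clamped to the image."""
    left, top, right, bottom = box
    return (max(left - pad, 0), max(top - pad, 0),
            min(right + pad, size), min(bottom + pad, size))

def pixels_to_latent_cells(pixels, vae_factor=8):
    """Whole latent cells covered by a span of `pixels` image pixels."""
    return pixels // vae_factor

tight_sky = (0, 0, 512, 190)                  # hypothetical: visible sky ends at y=190
generous_sky = pad_mask_box(tight_sky, pad=30)

print('generous box:', generous_sky)
print('margin in latent cells:', pixels_to_latent_cells(30))
```

The padded tuple can be passed to `ImageDraw.Draw(mask).rectangle(list(generous_sky), fill=255)`. A margin of a few dozen pixels is only a few latent cells, which is why a slightly oversized mask costs little of the preserved image.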

</details>

---

## Key Takeaways

1. **Img2img starts the denoising loop from a noised real image, not pure noise.** The forward process formula you derived in Module 6.2 noises the input to a specific timestep. The strength parameter maps onto the alpha-bar curve: low strength preserves structure (high alpha-bar, mostly signal), high strength allows creative reinterpretation (low alpha-bar, mostly noise). The effect is nonlinear because the alpha-bar curve is nonlinear.

2. **Inpainting adds a per-step spatial mask to the denoising loop.** At each step, masked regions use the model's prediction while unmasked regions are replaced with the re-noised original. Boundaries blend seamlessly because the U-Net sees the full image at every step: its global receptive field at the bottleneck ensures predictions account for surrounding context.

3. **Neither technique requires training.** No new math, no new architecture, no training loop. Just two clever reconfigurations of the denoising process you already know. The same U-Net, VAE, and CLIP from text-to-image work unchanged.

4. **Img2img and inpainting are complementary.** Img2img transforms global composition. Inpainting edits specific regions. Together they form a practical editing pipeline: composition → global transformation → selective refinement.

5. **Same denoising process, different starting point (img2img) or selective application (inpainting).** The pipeline you traced across 17 lessons is unchanged. Img2img moves the starting line. Inpainting adds a spatial filter. Both use the forward process formula, now in its third and fourth applications.
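
The nonlinear strength effect in takeaway 1 can be checked numerically. The sketch below rests on two assumptions: a "scaled linear" beta schedule from 0.00085 to 0.012 over 1000 training steps (the defaults commonly used with Stable Diffusion v1), and a starting timestep of roughly `strength * 1000`. The exact timestep depends on the scheduler's spacing, so read the printed numbers as approximate.

```python
import math

def alpha_bar_schedule(num_train_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear beta schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alpha_bars, running = [], 1.0
    for i in range(num_train_steps):
        # Interpolate linearly in sqrt(beta), then square (assumed schedule).
        beta = (lo + (hi - lo) * i / (num_train_steps - 1)) ** 2
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

alpha_bars = alpha_bar_schedule()

for strength in (0.1, 0.3, 0.5, 0.7, 0.9):
    t = round(strength * len(alpha_bars)) - 1   # approximate starting timestep
    a = alpha_bars[t]
    print(f'strength={strength:.1f}  t={t:3d}  alpha_bar={a:.3f}  '
          f'signal={math.sqrt(a):.3f}  noise={math.sqrt(1 - a):.3f}')
```

The `signal` and `noise` columns are the two coefficients of the forward process formula, so each row shows how much of the input image survives at the point where img2img begins denoising.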