[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)

# 13. Image Generation: Diffusion from Scratch

---

## What You'll Learn

1. **Forward diffusion** - How noise is progressively added to destroy an image
2. **Reverse diffusion** - How a neural network learns to remove that noise
3. **The full pipeline** - Text encoder -> Denoiser (UNet) -> VAE decoder
4. **Latent space vs pixel space** - Why we work in a compressed representation
5. **Effect of denoising steps** - Quality vs speed tradeoff (1, 2, 4, 8, 15, 25 steps)
6. **Guidance scale** - How it controls prompt adherence
7. **Compute profiling** - Where the time goes, and why each step is compute-bound, not memory-bound

---

### The Big Idea

Diffusion models learn to generate images by learning to **reverse** a noise-adding process:

**Training**: Take a real image -> Add noise gradually -> Train a network to predict/remove the noise at each step

**Inference**: Start from pure noise -> Run the denoiser repeatedly -> Get a clean image
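
The training recipe above can be sketched as a loss computation. This is a minimal, standard-library-only illustration rather than real training code: scalars stand in for images, the linear beta range (1e-4 to 0.02 over 1000 steps) is an assumption, and `make_alpha_bar`, `noise_prediction_loss`, `oracle_predictor`, and `zero_predictor` are hypothetical names. The oracle cheats by looking at the clean value, which a real network never sees; it is only there to show what a perfect noise prediction would score.

```python
import math
import random

def make_alpha_bar(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention alpha_bar_t for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (num_timesteps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def noise_prediction_loss(predict_noise, data, alpha_bar, rng):
    """Mean squared error between the true noise and the predicted noise."""
    total = 0.0
    for x0 in data:
        t = rng.randrange(len(alpha_bar))   # pick a random timestep
        eps = rng.gauss(0.0, 1.0)           # the noise we add
        x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1 - alpha_bar[t]) * eps
        total += (predict_noise(x_t, t, x0) - eps) ** 2
    return total / len(data)

def oracle_predictor(x_t, t, x0):
    # Cheats by using the clean x0 to solve for the exact noise.
    return (x_t - math.sqrt(alpha_bar[t]) * x0) / math.sqrt(1 - alpha_bar[t])

def zero_predictor(x_t, t, x0):
    # An untrained baseline that always predicts "no noise".
    return 0.0

alpha_bar = make_alpha_bar()
data_rng = random.Random(0)
data = [data_rng.uniform(-1.0, 1.0) for _ in range(2000)]  # toy "images" in [-1, 1]

oracle_loss = noise_prediction_loss(oracle_predictor, data, alpha_bar, random.Random(1))
zero_loss = noise_prediction_loss(zero_predictor, data, alpha_bar, random.Random(1))
print(f"oracle loss: {oracle_loss:.2e}")
print(f"zero-predictor loss: {zero_loss:.3f}")
```

Training a real denoiser amounts to moving a network from the zero-predictor end of this scale toward the oracle end, using only the noisy input and the timestep.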

In [None]:
!pip install torch torchvision diffusers transformers accelerate matplotlib numpy Pillow -q

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import time
import requests
from io import BytesIO

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

## Part 1: The Forward Diffusion Process (Adding Noise)

Forward diffusion gradually destroys an image by adding Gaussian noise at each timestep. The noise schedule determines how quickly the image becomes pure noise.

At timestep $t$, the noisy image $x_t$ is:

$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t$ decreases from 1 (clean) to ~0 (pure noise).
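
The closed form above is what you get by chaining single noising steps $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_t$. As a quick sanity check, the sketch below tracks the coefficient on $x_0$ and the accumulated noise variance step by step and compares them with $\sqrt{\bar{\alpha}_t}$ and $1 - \bar{\alpha}_t$. The helper names are hypothetical and the linear schedule values are assumed.

```python
import math

def linear_betas(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas (assumed schedule for this sketch)."""
    step = (beta_end - beta_start) / (num_timesteps - 1)
    return [beta_start + step * i for i in range(num_timesteps)]

def compose_single_steps(betas, t):
    """Apply steps 0..t one at a time, tracking signal coefficient and noise variance."""
    signal_coeff, noise_var = 1.0, 0.0
    for beta in betas[: t + 1]:
        alpha = 1.0 - beta
        signal_coeff *= math.sqrt(alpha)        # x_0 is scaled by sqrt(alpha) each step
        noise_var = alpha * noise_var + beta    # old noise shrinks, fresh noise is added
    return signal_coeff, noise_var

def closed_form(betas, t):
    """Jump straight to timestep t using alpha_bar_t."""
    alpha_bar = math.prod(1.0 - b for b in betas[: t + 1])
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

betas = linear_betas()
for t in (0, 250, 999):
    step_signal, step_var = compose_single_steps(betas, t)
    cf_signal, cf_var = closed_form(betas, t)
    print(f"t={t:>3}  signal {step_signal:.6f} vs {cf_signal:.6f}  "
          f"noise var {step_var:.6f} vs {cf_var:.6f}")
```

Because the two agree, training can sample any timestep directly instead of simulating every step before it.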

In [None]:
# Load a sample image
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Treat HTTP errors as failures so the fallback kicks in
    img = Image.open(BytesIO(response.content)).convert('RGB').resize((256, 256))
except Exception:
    # Fallback: create a colorful test image
    img_array = np.zeros((256, 256, 3), dtype=np.uint8)
    # Create a gradient pattern
    for i in range(256):
        for j in range(256):
            img_array[i, j] = [i, j, (i+j)//2]
    # Add a circle
    y, x = np.ogrid[-128:128, -128:128]
    mask = x**2 + y**2 < 80**2
    img_array[mask] = [255, 100, 50]
    img = Image.fromarray(img_array)

# Convert to tensor (normalize to [-1, 1])
img_tensor = torch.tensor(np.array(img), dtype=torch.float32).permute(2, 0, 1) / 127.5 - 1.0
print(f"Image tensor shape: {img_tensor.shape}")
print(f"Value range: [{img_tensor.min():.2f}, {img_tensor.max():.2f}]")

plt.figure(figsize=(4, 4))
plt.imshow((img_tensor.permute(1, 2, 0).numpy() + 1) / 2)
plt.title('Original Image')
plt.axis('off')
plt.show()

In [None]:
def create_noise_schedule(num_timesteps=1000, beta_start=0.0001, beta_end=0.02):
    """Create a linear noise schedule (betas) and compute alphas."""
    betas = torch.linspace(beta_start, beta_end, num_timesteps)
    alphas = 1.0 - betas
    alpha_cumprod = torch.cumprod(alphas, dim=0)  # This is alpha_bar_t
    return betas, alphas, alpha_cumprod

def forward_diffusion(x_0, t, alpha_cumprod):
    """Add noise to image x_0 at timestep t.
    
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    """
    noise = torch.randn_like(x_0)
    sqrt_alpha_bar = torch.sqrt(alpha_cumprod[t])
    sqrt_one_minus_alpha_bar = torch.sqrt(1 - alpha_cumprod[t])
    
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
    return x_t, noise

# Create schedule
num_timesteps = 1000
betas, alphas, alpha_cumprod = create_noise_schedule(num_timesteps)

# Visualize the noise schedule
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

axes[0].plot(betas.numpy(), linewidth=2)
axes[0].set_title('Noise Schedule (beta_t)', fontweight='bold')
axes[0].set_xlabel('Timestep t')
axes[0].set_ylabel('beta_t (noise added at each step)')

axes[1].plot(alpha_cumprod.numpy(), linewidth=2, color='green')
axes[1].set_title('Cumulative Signal Retention (alpha_bar_t)', fontweight='bold')
axes[1].set_xlabel('Timestep t')
axes[1].set_ylabel('alpha_bar_t (signal remaining)')
axes[1].axhline(y=0, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print(f"At t=0:   {alpha_cumprod[0]:.4f} signal remaining (99.99% clean)")
print(f"At t=250: {alpha_cumprod[250]:.4f} signal remaining")
print(f"At t=500: {alpha_cumprod[500]:.4f} signal remaining")
print(f"At t=999: {alpha_cumprod[999]:.4f} signal remaining (almost pure noise)")

In [None]:
# Visualize forward diffusion step by step
timesteps_to_show = [0, 50, 100, 200, 400, 600, 800, 999]

fig, axes = plt.subplots(2, 4, figsize=(18, 9))
axes = axes.flatten()

torch.manual_seed(42)

for idx, t in enumerate(timesteps_to_show):
    x_t, noise = forward_diffusion(img_tensor, t, alpha_cumprod)
    
    # Convert back to displayable image
    display = (x_t.permute(1, 2, 0).numpy() + 1) / 2
    display = np.clip(display, 0, 1)
    
    axes[idx].imshow(display)
    signal_pct = alpha_cumprod[t].item() * 100
    axes[idx].set_title(f't={t}\n{signal_pct:.1f}% signal', fontweight='bold')
    axes[idx].axis('off')

plt.suptitle('Forward Diffusion: Gradually Adding Noise\n(t=0 is clean, t=999 is pure noise)', 
             fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## Part 2: Simple Denoising (Reverse Diffusion)

The reverse process tries to remove noise. In practice, a neural network (UNet) predicts the noise $\epsilon$ at each step. Let's demonstrate the concept with a simple denoiser.
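
A single DDPM reverse step can be written in two ways: directly in terms of the predicted noise, or through the predicted clean image $x_0$. The sketch below checks on scalars that both give the same mean. The function names and numeric values are made up for illustration, and the check relies on $\bar{\alpha}_t = \alpha_t \cdot \bar{\alpha}_{t-1}$.

```python
import math

def mean_from_noise(x_t, eps, beta_t, alpha_bar_t):
    """Noise form: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)."""
    alpha_t = 1.0 - beta_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)

def mean_from_x0(x_t, eps, beta_t, alpha_bar_t, alpha_bar_prev):
    """Posterior-mean form: a weighted blend of the predicted x_0 and x_t."""
    alpha_t = 1.0 - beta_t
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    coeff_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coeff_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coeff_x0 * x0_pred + coeff_xt * x_t

# Hypothetical values for one timestep
beta_t = 0.01
alpha_bar_prev = 0.5
alpha_bar_t = (1.0 - beta_t) * alpha_bar_prev
x_t, eps = 0.3, -1.2

noise_form = mean_from_noise(x_t, eps, beta_t, alpha_bar_t)
x0_form = mean_from_x0(x_t, eps, beta_t, alpha_bar_t, alpha_bar_prev)
print(f"noise form: {noise_form:.6f}")
print(f"x0 form:    {x0_form:.6f}")
```

The two forms stop matching exactly once the predicted $x_0$ is clipped to $[-1, 1]$, which is a common stability trick in the $x_0$ form.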

In [None]:
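def simple_denoise_step(x_t, predicted_noise, t, alpha_cumprod, betas):
    """One step of reverse diffusion (simplified DDPM).
    
    Textbook form:
    x_{t-1} = (1/sqrt(alpha_t)) * (x_t - beta_t/sqrt(1-alpha_bar_t) * predicted_noise) + sigma_t * z
    
    The code below computes the same mean through the predicted x_0
    (posterior-mean form), which lets us clip x_0 for stability.
    """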
    alpha_t = 1 - betas[t]
    alpha_bar_t = alpha_cumprod[t]
    
    # Predicted x_0 from the noise prediction
    x_0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
    x_0_pred = torch.clamp(x_0_pred, -1, 1)  # Clip for stability
    
    if t > 0:
        alpha_bar_prev = alpha_cumprod[t-1]
        # Posterior mean
        coeff1 = torch.sqrt(alpha_bar_prev) * betas[t] / (1 - alpha_bar_t)
        coeff2 = torch.sqrt(alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)
        mean = coeff1 * x_0_pred + coeff2 * x_t
        
        # Add noise (except at last step)
        sigma = torch.sqrt(betas[t])
        noise = torch.randn_like(x_t)
        x_prev = mean + sigma * noise
    else:
        x_prev = x_0_pred
    
    return x_prev, x_0_pred

# Demonstrate: if we know the exact noise, we can denoise perfectly
t = 300
torch.manual_seed(42)
x_t, true_noise = forward_diffusion(img_tensor, t, alpha_cumprod)

# Use the TRUE noise (oracle denoiser)
x_denoised, x_0_pred = simple_denoise_step(x_t, true_noise, t, alpha_cumprod, betas)

fig, axes = plt.subplots(1, 4, figsize=(18, 4))
images = [img_tensor, x_t, x_0_pred, img_tensor - x_0_pred]
titles = ['Original', f'Noisy (t={t})', 'Predicted x_0\n(using true noise)', 'Error (magnified 5x)']

for ax, img_t, title in zip(axes, images, titles):
    if 'Error' in title:
        disp = (img_t.permute(1, 2, 0).numpy() * 5 + 0.5)  # Magnify error
    else:
        disp = (img_t.permute(1, 2, 0).numpy() + 1) / 2
    disp = np.clip(disp, 0, 1)
    ax.imshow(disp)
    ax.set_title(title, fontweight='bold')
    ax.axis('off')

plt.suptitle('Reverse Diffusion: If we predict noise correctly, we can reconstruct the image!', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## Part 3: The Full Diffusion Pipeline

A real text-to-image diffusion model has three main components:

1. **Text Encoder** (CLIP): Converts text prompt to embeddings
2. **UNet Denoiser**: Predicts and removes noise, conditioned on text
3. **VAE Decoder**: Converts latent representation to pixel image

```
"A cat on a sofa" --> [Text Encoder] --> text_embeddings
                                              |
Random noise --> [UNet x N steps] <-----------|  
                       |
                 denoised latent
                       |
                 [VAE Decoder]
                       |
                  Final Image (512x512)
```
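
To make the data flow concrete, here is a small shape-bookkeeping sketch of the diagram above. It runs no model. The 77x768 text embedding and the 4-channel latent at 1/8 resolution are the Stable Diffusion-style values used elsewhere in this notebook and are treated as assumptions; `latent_shape` and `pipeline_shapes` are hypothetical helper names.

```python
from math import prod

VAE_DOWNSAMPLE = 8    # assumed spatial reduction of the VAE (512 -> 64)
LATENT_CHANNELS = 4   # assumed number of latent channels

def latent_shape(height, width):
    """Shape of the latent the UNet denoises for a given output resolution."""
    if height % VAE_DOWNSAMPLE or width % VAE_DOWNSAMPLE:
        raise ValueError("height and width must be multiples of 8")
    return (LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)

def pipeline_shapes(height=512, width=512, tokens=77, embed_dim=768):
    """Tensor shapes at each stage of the text-to-image pipeline."""
    return {
        "text_embeddings": (tokens, embed_dim),           # output of the text encoder
        "initial_noise": latent_shape(height, width),     # input to the first UNet step
        "denoised_latent": latent_shape(height, width),   # output of the last UNet step
        "final_image": (3, height, width),                # output of the VAE decoder
    }

shapes = pipeline_shapes()
for name, shape in shapes.items():
    print(f"{name:<16} {shape}  ({prod(shape):,} values)")
```

Note that the UNet never changes the latent's shape: every denoising step maps a 4x64x64 tensor to another 4x64x64 tensor.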

In [None]:
# Visualize the pipeline architecture
fig, ax = plt.subplots(figsize=(18, 6))
ax.set_xlim(0, 10)
ax.set_ylim(0, 4)
ax.axis('off')

# Boxes for each component
import matplotlib.patches as mpatches

components = [
    (0.5, 2.5, 1.8, 1.0, '#3498db', 'Text Encoder\n(CLIP)\n77x768'),
    (3.0, 2.5, 2.0, 1.0, '#e74c3c', 'UNet\n(Denoiser)\nN iterations'),
    (6.0, 2.5, 1.8, 1.0, '#27ae60', 'VAE\nDecoder'),
    (0.5, 0.5, 1.8, 1.0, '#9b59b6', 'Noise\nScheduler'),
    (8.5, 2.5, 1.2, 1.0, '#f39c12', 'Output\n512x512\nImage'),
]

for x, y, w, h, color, label in components:
    rect = mpatches.FancyBboxPatch((x, y), w, h, 
                                    boxstyle="round,pad=0.1",
                                    facecolor=color, alpha=0.3,
                                    edgecolor=color, linewidth=2)
    ax.add_patch(rect)
    ax.text(x + w/2, y + h/2, label, ha='center', va='center', 
           fontsize=11, fontweight='bold')

# Arrows
arrow_props = dict(arrowstyle='->', linewidth=2, color='black')
ax.annotate('', xy=(3.0, 3.0), xytext=(2.3, 3.0), arrowprops=arrow_props)
ax.annotate('', xy=(6.0, 3.0), xytext=(5.0, 3.0), arrowprops=arrow_props)
ax.annotate('', xy=(8.5, 3.0), xytext=(7.8, 3.0), arrowprops=arrow_props)
ax.annotate('', xy=(3.0, 2.5), xytext=(2.3, 1.2), arrowprops=arrow_props)

# Labels
ax.text(0.2, 3.8, '"A cat on a sofa"', fontsize=13, fontweight='bold', style='italic')
ax.annotate('', xy=(0.5, 3.2), xytext=(1.0, 3.7), arrowprops=arrow_props)

ax.text(2.3, 3.3, 'text\nembeddings', fontsize=8, ha='center')
ax.text(5.3, 3.3, 'denoised\nlatent', fontsize=8, ha='center')

# Loop arrow for UNet
ax.annotate('', xy=(4.8, 2.5), xytext=(4.8, 2.0),
           arrowprops=dict(arrowstyle='->', linewidth=1.5, color='red',
                          connectionstyle='arc3,rad=-0.5'))
ax.text(5.5, 2.1, 'Repeat N\ntimes', fontsize=9, color='red', fontweight='bold')

ax.set_title('Stable Diffusion Pipeline Architecture', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 4: Latent Space vs Pixel Space

A key insight: we don't denoise in pixel space! The **VAE** compresses images to a much smaller latent space first.

- Pixel space: 512 x 512 x 3 = **786,432** values
- Latent space: 64 x 64 x 4 = **16,384** values (**48x smaller!**)

This is why it's called **Latent Diffusion** (Stable Diffusion's key innovation).

In [None]:
# Visualize the dimensionality reduction
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Pixel space representation
pixel_dims = (512, 512, 3)
latent_dims = (64, 64, 4)

pixel_total = np.prod(pixel_dims)
latent_total = np.prod(latent_dims)
compression = pixel_total / latent_total

# Bar comparison
axes[0].bar(['Pixel Space\n(512x512x3)', 'Latent Space\n(64x64x4)'], 
           [pixel_total, latent_total],
           color=['#e74c3c', '#27ae60'], edgecolor='black')
axes[0].set_ylabel('Number of Values')
axes[0].set_title(f'Dimensionality Comparison\n({compression:.0f}x compression!)', fontweight='bold')
for i, v in enumerate([pixel_total, latent_total]):
    axes[0].text(i, v + 10000, f'{v:,}', ha='center', fontweight='bold')

# Compute cost comparison (quadratic in attention)
pixel_seq = 512 * 512  # 262,144 tokens in pixel space
latent_seq = 64 * 64   # 4,096 tokens in latent space

pixel_attn = pixel_seq ** 2
latent_attn = latent_seq ** 2
attn_savings = pixel_attn / latent_attn

axes[1].bar(['Pixel Space', 'Latent Space'], 
           [pixel_attn / 1e9, latent_attn / 1e9],
           color=['#e74c3c', '#27ae60'], edgecolor='black')
axes[1].set_ylabel('Attention Cost (billions of ops)')
axes[1].set_title(f'Self-Attention Cost\n({attn_savings:.0f}x cheaper in latent space!)', fontweight='bold')

# Time comparison (illustrative numbers only, not measurements)
pixel_time = 100  # Arbitrary units
latent_time = pixel_time / compression * 1.2  # Assumes cost scales with the number of values, plus 20% slack
vae_time = 5  # Assumed small fixed cost for VAE encode/decode

axes[2].bar(['Pixel Diffusion\n(50 steps)', 'Latent Diffusion\n(50 steps + VAE)'],
           [pixel_time, latent_time + vae_time],
           color=['#e74c3c', '#27ae60'], edgecolor='black')
axes[2].set_ylabel('Relative Time')
axes[2].set_title('Speed Comparison\n(Latent diffusion is dramatically faster)', fontweight='bold')

plt.tight_layout()
plt.show()

print(f"Pixel space:  {pixel_dims} = {pixel_total:,} values")
print(f"Latent space: {latent_dims} = {latent_total:,} values")
print(f"Compression ratio: {compression:.1f}x")
print(f"Attention cost ratio: {attn_savings:.0f}x")

## Part 5: Loading a Real Diffusion Model

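Let's load SDXL-Turbo (a fast, distilled version of SDXL) and generate real images. It is not a small model, as the parameter counts printed below show; its speed comes from distillation, which lets it work with very few denoising steps.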

In [None]:
from diffusers import AutoPipelineForText2Image

# Load SDXL-Turbo (distilled for few-step generation)
print("Loading SDXL-Turbo pipeline...")
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    variant="fp16" if device == "cuda" else None,
)
pipe = pipe.to(device)

print("\nPipeline components:")
print(f"  Text Encoder 1: {type(pipe.text_encoder).__name__}")
print(f"  Text Encoder 2: {type(pipe.text_encoder_2).__name__}")
print(f"  UNet:           {type(pipe.unet).__name__}")
print(f"  VAE:            {type(pipe.vae).__name__}")
print(f"  Scheduler:      {type(pipe.scheduler).__name__}")

# Count parameters
unet_params = sum(p.numel() for p in pipe.unet.parameters()) / 1e6
vae_params = sum(p.numel() for p in pipe.vae.parameters()) / 1e6
te1_params = sum(p.numel() for p in pipe.text_encoder.parameters()) / 1e6
te2_params = sum(p.numel() for p in pipe.text_encoder_2.parameters()) / 1e6

print(f"\nParameter counts:")
print(f"  Text Encoder 1: {te1_params:.0f}M")
print(f"  Text Encoder 2: {te2_params:.0f}M")
print(f"  UNet:           {unet_params:.0f}M")
print(f"  VAE:            {vae_params:.0f}M")
print(f"  Total:          {te1_params + te2_params + unet_params + vae_params:.0f}M")

In [None]:
# Generate a test image
prompt = "A golden retriever puppy playing in autumn leaves, photograph, detailed"

# Warm up
_ = pipe(prompt, num_inference_steps=1, guidance_scale=0.0, 
         width=512, height=512)

# Generate with timing
start = time.time()
result = pipe(prompt, num_inference_steps=4, guidance_scale=0.0,
              width=512, height=512)
elapsed = time.time() - start

plt.figure(figsize=(6, 6))
plt.imshow(result.images[0])
plt.title(f'Generated in {elapsed:.2f}s (4 steps)\n"{prompt[:50]}..."', fontweight='bold')
plt.axis('off')
plt.show()

## Part 6: Effect of Denoising Steps on Quality

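For a standard diffusion model, more denoising steps generally means better quality but slower generation. SDXL-Turbo is distilled for 1-4 steps, so expect the quality gains to flatten out beyond that. Let's compare directly.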
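
Running fewer steps means the sampler visits only a subset of the 1000 training timesteps. One simple way to choose that subset is to space the timesteps evenly, starting from the noisiest one. The sketch below does that; `spaced_timesteps` is a hypothetical helper, and real schedulers differ in their exact spacing and offsets.

```python
def spaced_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced timesteps, from the noisiest (T-1) toward the cleanest."""
    if not 1 <= num_inference_steps <= num_train_timesteps:
        raise ValueError("num_inference_steps must be between 1 and num_train_timesteps")
    stride = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * stride) - 1
            for i in range(num_inference_steps)]

for steps in (1, 2, 4, 8):
    print(f"{steps} steps -> timesteps {spaced_timesteps(steps)}")
```

With a single step, the model has to jump from pure noise to a clean image in one pass, which is what distilled models are trained to do.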

In [None]:
prompt = "A serene Japanese garden with cherry blossoms, watercolor painting style"
step_counts = [1, 2, 4, 8, 15, 25]

images = []
times = []

generator = torch.Generator(device=device).manual_seed(42)

for steps in step_counts:
    generator = torch.Generator(device=device).manual_seed(42)
    
    start = time.time()
    result = pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=0.0,  # SDXL-Turbo works without guidance
        width=512, height=512,
        generator=generator
    )
    elapsed = time.time() - start
    
    images.append(result.images[0])
    times.append(elapsed)
    print(f"Steps: {steps:>3} | Time: {elapsed:.2f}s")

# Display comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for idx, (img, steps, t) in enumerate(zip(images, step_counts, times)):
    axes[idx].imshow(img)
    axes[idx].set_title(f'{steps} steps | {t:.2f}s', fontsize=14, fontweight='bold')
    axes[idx].axis('off')

plt.suptitle(f'Effect of Denoising Steps on Image Quality\n"{prompt}"', 
             fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Plot time vs steps
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(step_counts, times, 'bo-', linewidth=2, markersize=10)
axes[0].set_xlabel('Number of Denoising Steps', fontsize=13)
axes[0].set_ylabel('Generation Time (seconds)', fontsize=13)
axes[0].set_title('Generation Time vs Steps\n(Linear relationship - each step costs the same)', 
                   fontweight='bold')

# Time per step (total time / steps, so fixed text-encoding and VAE overhead inflates the low step counts)
time_per_step = [t/s for t, s in zip(times, step_counts)]
axes[1].bar([str(s) for s in step_counts], time_per_step, 
           color='steelblue', edgecolor='black')
axes[1].set_xlabel('Number of Steps', fontsize=13)
axes[1].set_ylabel('Time per Step (seconds)', fontsize=13)
axes[1].set_title('Time per Denoising Step\n(Roughly constant - each UNet pass takes the same time)', 
                   fontweight='bold')

for i, v in enumerate(time_per_step):
    axes[1].text(i, v + 0.005, f'{v*1000:.0f}ms', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

## Part 7: Effect of Guidance Scale

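**Classifier-Free Guidance (CFG)** controls how strongly the image follows the prompt:
- guidance_scale <= 1: No guidance. In `diffusers`, CFG only switches on above 1, so the UNet runs once per step with the prompt and nothing pushes the image further toward it
- guidance_scale = 7-8: Standard for non-distilled models (good balance)
- guidance_scale > 12: Very strong (but can oversaturate)

Note: SDXL-Turbo is designed for guidance_scale=0, but we can still demonstrate the concept. Expect the 0.0, 0.5, and 1.0 runs below to look the same, since CFG is inactive for all three.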
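
The guidance arithmetic itself is a one-line extrapolation from the unconditional noise prediction toward the prompt-conditioned one. The sketch below applies it to made-up three-element "predictions"; `apply_guidance` is a hypothetical helper. In this bare formula a scale of 0 returns the unconditional prediction and 1 returns the conditional one. Library pipelines may special-case small scales, so treat this as the math only.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

# Made-up noise predictions from the two UNet passes
noise_uncond = [0.10, -0.20, 0.30]   # pass without the prompt
noise_cond = [0.30, -0.10, 0.00]     # pass with the prompt

for scale in (0.0, 1.0, 7.5):
    guided = apply_guidance(noise_uncond, noise_cond, scale)
    print(f"scale={scale:>4}: {[round(g, 3) for g in guided]}")
```

Scales above 1 push past the conditional prediction, which is where both the stronger prompt adherence and the oversaturation risk come from.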

In [None]:
# For models that use guidance (not turbo), guidance doubles the compute per step
# because you run the UNet twice: once with prompt, once without

prompt = "A futuristic cityscape at sunset, digital art, vibrant colors"
guidance_scales = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]

images_guidance = []
times_guidance = []

for gs in guidance_scales:
    generator = torch.Generator(device=device).manual_seed(42)
    
    start = time.time()
    result = pipe(
        prompt,
        num_inference_steps=4,
        guidance_scale=gs,
        width=512, height=512,
        generator=generator
    )
    elapsed = time.time() - start
    
    images_guidance.append(result.images[0])
    times_guidance.append(elapsed)
    print(f"Guidance: {gs:>5.1f} | Time: {elapsed:.2f}s")

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for idx, (img, gs, t) in enumerate(zip(images_guidance, guidance_scales, times_guidance)):
    axes[idx].imshow(img)
    axes[idx].set_title(f'guidance_scale={gs} | {t:.2f}s', fontsize=13, fontweight='bold')
    axes[idx].axis('off')

plt.suptitle(f'Effect of Guidance Scale\n"{prompt}"', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## Part 8: Timing Each Pipeline Component

Let's measure exactly where the time goes in the diffusion pipeline.

In [None]:
import torch
from diffusers import AutoPipelineForText2Image

prompt = "A beautiful mountain landscape at sunrise, photograph"
num_steps = 4

# Manual pipeline execution with timing
timings = {}

# 1. Text Encoding (tokenization + both text encoders)
if device == 'cuda':
    torch.cuda.synchronize()
start = time.time()

# encode_prompt runs the tokenizers and the text encoders.
# CFG is disabled here to match guidance_scale=0.0 in the generation below.
with torch.no_grad():
    _ = pipe.encode_prompt(
        prompt, device=device, num_images_per_prompt=1,
        do_classifier_free_guidance=False
    )

if device == 'cuda':
    torch.cuda.synchronize()
timings['Text Encoding'] = time.time() - start

# 2. Full generation with per-step timing
step_times = []

# Use a callback to time each step
class StepTimer:
    def __init__(self):
        self.step_times = []
        self.last_time = None
    
    def __call__(self, pipe, step_index, timestep, callback_kwargs):
        if device == 'cuda':
            torch.cuda.synchronize()
        now = time.time()
        if self.last_time is not None:
            self.step_times.append(now - self.last_time)
        self.last_time = now
        return callback_kwargs

timer = StepTimer()

generator = torch.Generator(device=device).manual_seed(42)
start = time.time()
result = pipe(
    prompt,
    num_inference_steps=num_steps,
    guidance_scale=0.0,
    width=512, height=512,
    generator=generator,
    callback_on_step_end=timer
)
if device == 'cuda':
    torch.cuda.synchronize()
total_time = time.time() - start

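# Approximate component times. The callback yields N-1 intervals for N steps
# (the first step has no earlier timestamp), so scale the mean interval to N steps.
unet_total = np.mean(timer.step_times) * num_steps if timer.step_times else total_time * 0.85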
vae_time_approx = total_time - unet_total - timings['Text Encoding']

timings['UNet (denoising)'] = unet_total
timings['VAE (decode)'] = max(vae_time_approx, 0.01)
timings['Total'] = total_time

# Display results
print(f"Pipeline Timing Breakdown ({num_steps} steps):")
print("=" * 50)
for component, t in timings.items():
    pct = t / total_time * 100
    bar = '#' * int(pct / 2)
    print(f"  {component:<20}: {t*1000:>8.1f}ms  ({pct:>5.1f}%)  {bar}")

if timer.step_times:
    print(f"\nPer-step UNet times (step 1 has no earlier timestamp, so it is not listed):")
    for i, st in enumerate(timer.step_times):
        print(f"  Step {i+2}: {st*1000:.1f}ms")

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))

components = ['Text Encoding', 'UNet (denoising)', 'VAE (decode)']
values = [timings[c] * 1000 for c in components]
colors = ['#3498db', '#e74c3c', '#27ae60']

bars = ax.barh(components, values, color=colors, edgecolor='black', height=0.5)

for bar, v in zip(bars, values):
    ax.text(bar.get_width() + 2, bar.get_y() + bar.get_height()/2, 
           f'{v:.1f}ms ({v/total_time/10:.0f}%)', va='center', fontweight='bold')

ax.set_xlabel('Time (ms)', fontsize=13)
ax.set_title(f'Where Does Time Go in Diffusion Inference? ({num_steps} steps)\n'
             f'UNet denoising dominates!', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## Part 9: Visualizing the Denoising Process

Let's capture intermediate latents during generation to see how the image forms.

In [None]:
# Capture intermediate latents
intermediate_images = []

class LatentCapture:
    def __init__(self, pipe):
        self.pipe = pipe
        self.latents = []
    
    def __call__(self, pipe, step_index, timestep, callback_kwargs):
        latents = callback_kwargs['latents']
        self.latents.append(latents.detach().clone())
        return callback_kwargs

# Use more steps to see the progression
capture = LatentCapture(pipe)
generator = torch.Generator(device=device).manual_seed(42)

prompt = "A majestic lion in the savanna, golden hour, nature photography"
result = pipe(
    prompt,
    num_inference_steps=8,
    guidance_scale=0.0,
    width=512, height=512,
    generator=generator,
    callback_on_step_end=capture,
    output_type="latent"
)

# Decode intermediate latents to images
print(f"Captured {len(capture.latents)} intermediate latents")

decoded_intermediates = []
for lat in capture.latents:
    with torch.no_grad():
        # Scale latents
        scaled = lat / pipe.vae.config.scaling_factor
        image = pipe.vae.decode(scaled, return_dict=False)[0]
        image = (image / 2 + 0.5).clamp(0, 1)
        image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
        decoded_intermediates.append(image)

# Also decode the final result
with torch.no_grad():
    final_scaled = result.images / pipe.vae.config.scaling_factor
    final_image = pipe.vae.decode(final_scaled, return_dict=False)[0]
    final_image = (final_image / 2 + 0.5).clamp(0, 1)
    final_image = final_image.cpu().permute(0, 2, 3, 1).numpy()[0]
decoded_intermediates.append(final_image)

# Display
n_show = min(8, len(decoded_intermediates))
indices = np.linspace(0, len(decoded_intermediates)-1, n_show, dtype=int)

fig, axes = plt.subplots(2, 4, figsize=(18, 9))
axes = axes.flatten()

for idx, i in enumerate(indices):
    if idx < len(axes):
        axes[idx].imshow(np.clip(decoded_intermediates[i], 0, 1))
        axes[idx].set_title(f'Step {i+1}/{len(decoded_intermediates)}', 
                           fontweight='bold', fontsize=12)
        axes[idx].axis('off')

plt.suptitle(f'Denoising Process: From Noise to Image\n"{prompt}"', 
             fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## Part 10: Compute Profile of Each Step

In [None]:
# Profile multiple generations to get reliable step timings
num_runs = 3
steps_to_test = [4, 8, 15, 25]

profile_data = {}

for steps in steps_to_test:
    all_step_times = []
    all_total_times = []
    
    for run in range(num_runs):
        timer = StepTimer()
        generator = torch.Generator(device=device).manual_seed(42 + run)
        
        start = time.time()
        _ = pipe(
            "A test prompt for profiling",
            num_inference_steps=steps,
            guidance_scale=0.0,
            width=512, height=512,
            generator=generator,
            callback_on_step_end=timer
        )
        total = time.time() - start
        
        if timer.step_times:
            all_step_times.extend(timer.step_times)
        all_total_times.append(total)
    
    profile_data[steps] = {
        'avg_step_time': np.mean(all_step_times) * 1000 if all_step_times else 0,
        'std_step_time': np.std(all_step_times) * 1000 if all_step_times else 0,
        'avg_total_time': np.mean(all_total_times) * 1000,
    }

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Total time vs steps
steps_list = list(profile_data.keys())
total_times = [profile_data[s]['avg_total_time'] for s in steps_list]
step_times = [profile_data[s]['avg_step_time'] for s in steps_list]

axes[0].plot(steps_list, total_times, 'ro-', linewidth=2, markersize=10)
axes[0].set_xlabel('Number of Denoising Steps', fontsize=13)
axes[0].set_ylabel('Total Time (ms)', fontsize=13)
axes[0].set_title('Total Generation Time\n(Linear in number of steps)', fontweight='bold')

# Per-step time
axes[1].bar([str(s) for s in steps_list], step_times, 
           color='steelblue', edgecolor='black')
axes[1].set_xlabel('Number of Steps', fontsize=13)
axes[1].set_ylabel('Average Time per Step (ms)', fontsize=13)
axes[1].set_title('Per-Step Time\n(Each UNet forward pass costs the same)', fontweight='bold')

for i, v in enumerate(step_times):
    axes[1].text(i, v + 1, f'{v:.0f}ms', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nKey insight: Generation time grows LINEARLY with the number of steps, plus a small fixed overhead.")
print("Each UNet forward pass (denoising step) is the same compute cost.")
print("This is why reducing steps (via distillation like Turbo) is so valuable!")

## Part 11: Different Prompts, Same Seed

A fixed seed means identical starting noise, so the seed largely determines the "composition" while the prompt determines the "content".
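
The reason a seed can pin down composition is that it fixes the starting noise exactly. The sketch below shows this with Python's standard-library generator standing in for `torch.Generator`; `initial_noise` is a hypothetical helper, and a real pipeline would draw a full latent tensor rather than eight numbers.

```python
import random

def initial_noise(seed, num_values=8):
    """Draw a tiny 'latent' of Gaussian noise from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_values)]

same_a = initial_noise(42)
same_b = initial_noise(42)
other = initial_noise(43)

print("seed 42 twice identical:", same_a == same_b)
print("seed 42 vs 43 identical:", same_a == other)
```

Every prompt in the loop below starts from the same noise, so only the text conditioning differs between runs.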

In [None]:
prompts = [
    "A cat sitting on a windowsill, photograph",
    "A dog sitting on a windowsill, photograph",
    "A robot sitting on a windowsill, photograph",
    "A child sitting on a windowsill, painting",
]

fig, axes = plt.subplots(1, 4, figsize=(20, 5))

for idx, prompt in enumerate(prompts):
    generator = torch.Generator(device=device).manual_seed(42)  # Same seed!
    result = pipe(prompt, num_inference_steps=4, guidance_scale=0.0,
                  width=512, height=512, generator=generator)
    
    axes[idx].imshow(result.images[0])
    axes[idx].set_title(f'"{prompt[:35]}..."', fontsize=10, fontweight='bold')
    axes[idx].axis('off')

plt.suptitle('Same Seed, Different Prompts\n(Notice similar composition but different content)', 
             fontsize=15, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()

---

## Key Takeaways

### 1. Diffusion = Learn to Reverse Noise
- **Forward process**: Gradually add noise until the image is pure noise (this is math, no learning)
- **Reverse process**: Train a UNet to predict/remove noise at each step (this is the learned part)
- At inference: start from random noise and denoise step by step

### 2. Latent Diffusion is the Key Innovation
- Working in **latent space** (64x64x4) instead of pixel space (512x512x3) means **48x fewer values** to denoise
- The VAE encoder/decoder handles the conversion
- This is why the approach is called a **Latent Diffusion Model**; Stable Diffusion is one

### 3. The Pipeline Has Three Parts
- **Text Encoder** (CLIP): Convert prompt to embeddings (~fast)
- **UNet Denoiser**: N forward passes, each removing some noise (~dominates compute)
- **VAE Decoder**: Convert latent to pixels (~fast, runs once)

### 4. Steps = Quality vs Speed
- Each denoising step costs the same (one UNet forward pass)
- Generation time is **linear** in the number of steps
- Distilled models (Turbo, Lightning) achieve good quality in 1-4 steps instead of 20-50
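
The linear relationship can be made explicit by fitting `time = fixed_overhead + per_step * steps`. The sketch below uses made-up timings generated from a 0.15 s fixed cost and 0.05 s per step, not measurements from this notebook, and `fit_line` is a hypothetical helper. It also shows why dividing total time by the step count overstates the per-step cost at low step counts.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical timings in seconds: 0.15 s fixed overhead + 0.05 s per step
step_counts = [1, 2, 4, 8, 15, 25]
times = [0.15 + 0.05 * s for s in step_counts]

per_step, fixed = fit_line(step_counts, times)
naive_per_step = [t / s for t, s in zip(times, step_counts)]

print(f"fitted per-step cost: {per_step * 1000:.0f} ms")
print(f"fitted fixed overhead: {fixed * 1000:.0f} ms")
print("naive total/steps (ms):", [round(v * 1000) for v in naive_per_step])
```

The fitted slope is the cost of one UNet pass, while the intercept collects everything that runs once per image, such as text encoding and VAE decoding.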

### 5. Guidance Scale Controls Prompt Adherence
- Higher guidance = stronger prompt following but potential oversaturation
- With CFG, each step requires **2 UNet passes** (with and without prompt)
- Turbo models are distilled to work without guidance (guidance_scale=0)

### 6. This is Compute-Bound, Not Memory-Bound
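- Each step is a full neural network forward pass over thousands of latent positions at once
- Unlike LLM inference (which is memory-bandwidth bound in decode), diffusion is typically compute-bound
- This means more GPU compute throughput (FLOPS) helps more than more memory bandwidth or capacity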
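
A rough way to see the difference is arithmetic intensity: FLOPs performed per byte moved. The sketch below compares one matrix multiply applied to 4,096 latent positions at once with the same kind of multiply applied to a single token, as in LLM decoding. The layer sizes are hypothetical, the byte count ignores caching, and `matmul_arithmetic_intensity` is a made-up helper name.

```python
def matmul_arithmetic_intensity(rows, inner, cols, bytes_per_value=2):
    """FLOPs per byte for a (rows x inner) @ (inner x cols) matmul in fp16."""
    flops = 2 * rows * inner * cols   # one multiply and one add per term
    bytes_moved = bytes_per_value * (rows * inner + inner * cols + rows * cols)
    return flops / bytes_moved

# Hypothetical layer sizes, chosen only for illustration
diffusion = matmul_arithmetic_intensity(rows=4096, inner=1280, cols=1280)  # 64x64 latent positions
decode = matmul_arithmetic_intensity(rows=1, inner=4096, cols=4096)        # one new token

print(f"diffusion step: {diffusion:.0f} FLOPs/byte")
print(f"LLM decode:     {decode:.2f} FLOPs/byte")
```

When the intensity is high, the weights are reused across many positions and the GPU's math units become the limit; when it is near 1, the GPU mostly waits on memory.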