# LoRA Fine-Tuning for Diffusion Models

**Module 6.5, Lesson 1** | CourseAI

You know LoRA from Module 4.4 (highway-and-detour bypass, B=0 initialization, merge at inference). You know the Stable Diffusion pipeline from Module 6.4 (text → CLIP → U-Net denoising loop → VAE decode). This notebook connects them: applying LoRA to the diffusion U-Net for style and subject customization.

**What you will do:**
- Inspect a real SD U-Net to find cross-attention projection layers and compute LoRA parameter counts for different ranks
- Execute one LoRA training step by hand: VAE encode, noise, U-Net forward, MSE loss, verify gradient flow
- Train a style LoRA end-to-end using diffusers + PEFT on a small synthetic style dataset
- Load and compose two pre-trained LoRA adapters, experimenting with alpha scaling to control the blend

**For each exercise, PREDICT the output before running the cell.**

**Estimated time:** 45–60 minutes (Exercises 3–4 involve training and download time).

---

## Setup

Run this cell to install dependencies and import everything. This notebook requires a GPU for reasonable training times.

In [None]:
!pip install -q diffusers transformers accelerate peft datasets

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from collections import defaultdict

# Reproducible results
torch.manual_seed(42)
np.random.seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dtype = torch.float16 if device.type == 'cuda' else torch.float32
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')

print('\nSetup complete.')

## Shared Helpers

Define display helpers and the model ID. Each exercise loads only the components it needs and cleans up afterward to stay within free-tier Colab VRAM (~16 GB on a T4).

> **VRAM tip:** If you encounter an out-of-memory error, go to Runtime → Restart runtime and rerun from Setup. Exercise 3 uses the most VRAM.

In [None]:
model_id = 'stable-diffusion-v1-5/stable-diffusion-v1-5'


def show_images(images, titles, figsize=None):
    """Display a list of PIL images side by side."""
    n = len(images)
    if figsize is None:
        figsize = (5 * n, 5)
    fig, axes = plt.subplots(1, n, figsize=figsize)
    if n == 1:
        axes = [axes]
    for ax, img, title in zip(axes, images, titles):
        ax.imshow(np.array(img))
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    plt.tight_layout()
    plt.show()


def show_image_grid(images, titles, nrows, ncols, figsize=None, suptitle=None):
    """Display images in a grid with the given number of rows and columns."""
    if figsize is None:
        figsize = (4 * ncols, 4 * nrows)
    fig, axes = plt.subplots(nrows, ncols, figsize=figsize)
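    # Materialize as a list: axes.flat is a one-shot iterator, so reusing it after zip() would skip axes.
    axes_flat = list(axes.flat) if nrows > 1 or ncols > 1 else [axes]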
    for ax, img, title in zip(axes_flat, images, titles):
        ax.imshow(np.array(img))
        ax.set_title(title, fontsize=10)
        ax.axis('off')
    for ax in list(axes_flat)[len(images):]:
        ax.axis('off')
    if suptitle:
        plt.suptitle(suptitle, fontsize=13)
    plt.tight_layout()
    plt.show()


print('Helpers defined.')

---

## Exercise 1: Inspect LoRA Target Layers [Guided]

The lesson explained that LoRA for diffusion targets the **cross-attention projection matrices** (W_Q, W_K, W_V, W_out) because that is where text meaning meets spatial features. But which layers are those, concretely? How many parameters would a rank-4 LoRA add versus rank-16?

This exercise inspects the real U-Net to ground the architectural understanding in actual layer names and parameter counts.

**Before running, predict:**
- How many cross-attention blocks does the SD v1.5 U-Net have? (Hint: for a 512×512 image the latent is 64×64; cross-attention occurs at latent resolutions 64×64, 32×32, and 16×16 in both the downsampling and upsampling paths, plus once in the 8×8 middle block.)
- For a projection matrix of shape [320, 768], how many LoRA parameters does rank-4 add? (Think: B is [320, 4], A is [4, 768]. Total = ?)
- What fraction of total U-Net parameters will a rank-4 LoRA represent?

In [None]:
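from diffusers import UNet2DConditionModel

# Load only the U-Net (kept on CPU; we only inspect names and shapes here).
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=dtype)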

# Step 1: Find all cross-attention projection layers.
# In the SD U-Net, cross-attention layers have names containing 'attn2'
# (self-attention is 'attn1'). The projection matrices are:
#   to_q, to_k, to_v  (input projections)
#   to_out.0           (output projection)

cross_attn_layers = {}
for name, param in unet.named_parameters():
    if 'attn2' in name and any(proj in name for proj in ['to_q', 'to_k', 'to_v', 'to_out.0.weight']):
        cross_attn_layers[name] = param.shape

print(f'Cross-attention projection layers found: {len(cross_attn_layers)}')
print(f'\n{"Layer Name":<75} {"Shape":>15}')
print('-' * 92)
for name, shape in sorted(cross_attn_layers.items()):
    print(f'{name:<75} {str(list(shape)):>15}')

# Step 2: Count LoRA parameters for different ranks.
# For each weight matrix W of shape [out_features, in_features],
# LoRA adds B [out_features, r] and A [r, in_features].
# Total LoRA params per layer = r * (out_features + in_features)

print('\n' + '=' * 60)
print('LoRA Parameter Counts by Rank')
print('=' * 60)

for rank in [4, 8, 16]:
    total_lora_params = 0
    for name, shape in cross_attn_layers.items():
        out_features, in_features = shape[0], shape[1]
        lora_params = rank * (out_features + in_features)
        total_lora_params += lora_params
    print(f'\nRank {rank}:')
    print(f'  Total LoRA params:  {total_lora_params:>12,}')
    print(f'  In MB (fp16):       {total_lora_params * 2 / 1024**2:>12.2f} MB')

# Step 3: Compare to total U-Net parameters.
total_unet_params = sum(p.numel() for p in unet.parameters())
cross_attn_params = sum(shape[0] * shape[1] for shape in cross_attn_layers.values())

print(f'\n{"=" * 60}')
print('U-Net Parameter Breakdown')
print(f'{"=" * 60}')
print(f'  Total U-Net params:         {total_unet_params:>12,}')
print(f'  Cross-attn projection params:{cross_attn_params:>12,} ({100*cross_attn_params/total_unet_params:.1f}%)')
print(f'  Everything else (frozen):    {total_unet_params - cross_attn_params:>12,} ({100*(total_unet_params - cross_attn_params)/total_unet_params:.1f}%)')

# LoRA as fraction of total U-Net
for rank in [4, 8, 16]:
    total_lora = sum(
        rank * (s[0] + s[1]) for s in cross_attn_layers.values()
    )
    pct = 100 * total_lora / total_unet_params
    print(f'  Rank-{rank} LoRA params:       {total_lora:>12,} ({pct:.3f}% of U-Net)')

### What Just Happened

You inspected the real SD v1.5 U-Net and found its cross-attention projection layers by name. Key observations:

1. **Cross-attention layers are named `attn2`** in the diffusers implementation (self-attention is `attn1`). The projections are `to_q`, `to_k`, `to_v`, and `to_out.0`. This maps directly to the W_Q, W_K, W_V, and W_out from the lesson.

2. **The projection shapes vary by resolution.** At lower resolutions (deeper in the U-Net), the channel dimension is larger (640, 1280), so the projection matrices are bigger. At the highest latent resolution (64×64), the channel dimension is smallest (320).

3. **LoRA adds a tiny fraction of the total parameters.** A rank-4 LoRA on all cross-attention projections adds roughly 0.05% of the U-Net's total parameters (about 0.4M out of roughly 860M; the exact figures are printed above). That is the "surgical" nature of LoRA: a small detour on a massive highway. This is why LoRA files are typically just 2–50 MB.

4. **Rank scales linearly.** Rank-16 has exactly 4× the parameters of rank-4 (four times as many columns in B and rows in A). The lesson noted that rank 4–8 is the sweet spot for diffusion: enough capacity for style adaptation without overfitting on small datasets.
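
The per-layer arithmetic behind these counts can be reproduced without loading the model. The following is a minimal sketch: the four shapes are assumptions chosen to mirror one cross-attention block at the 320-channel level (`[320, 320]` for `to_q` and `to_out.0`, `[320, 768]` for `to_k` and `to_v`), and `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(out_features: int, in_features: int, rank: int) -> int:
    """LoRA adds B [out_features, rank] and A [rank, in_features]."""
    return rank * (out_features + in_features)


# Illustrative (assumed) shapes for one cross-attention block at the 320-channel level.
example_shapes = {
    "to_q": (320, 320),
    "to_k": (320, 768),
    "to_v": (320, 768),
    "to_out.0": (320, 320),
}

# Full-weight parameter count for the same four projections, for comparison.
full_params = sum(o * i for o, i in example_shapes.values())

for rank in (4, 8, 16):
    lora_params = sum(lora_param_count(o, i, rank) for o, i in example_shapes.values())
    print(f"rank {rank:>2}: {lora_params:>6,} LoRA params vs {full_params:,} full params")
```

Because the count is `rank * (out_features + in_features)`, doubling the rank doubles the adapter size regardless of the layer shape.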

---

## Exercise 2: One LoRA Training Step by Hand [Guided]

The lesson traced one training step in pseudocode: load image → VAE encode → sample timestep → add noise → U-Net predicts noise (only LoRA params have gradients) → MSE loss → backprop into LoRA params only.

Now you will do it for real. This exercise verifies the training loop from the lesson with actual tensors and real gradient flow.

**Before running, predict:**
- After VAE encoding, what shape will the latent z_0 have for a 512×512 image? (Think: 512/8 = 64, and the VAE's latent space has 4 channels.)
- After the backward pass, should the U-Net's **base** weight parameters have gradients? Should the LoRA adapter parameters have gradients?
- What is the loss function? What are its two arguments? (The model predicts ε̂; the target is the actual noise ε.)

In [None]:
from peft import LoraConfig, get_peft_model

# We will work in float32 for this exercise to get clean gradients.
# Reload the U-Net in float32 for the training step.
from diffusers import UNet2DConditionModel, AutoencoderKL, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

print('Loading components in float32 for training step...')
unet_f32 = UNet2DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float32).to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder='vae', torch_dtype=torch.float32).to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder='text_encoder', torch_dtype=torch.float32).to(device)
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder='scheduler')

# Freeze the VAE and text encoder. The U-Net base weights are frozen by get_peft_model() in the next cell.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

print('\nComponents loaded.')

In [None]:
# Step 1: Apply LoRA to the U-Net.
# We target the cross-attention projections (attn2) with rank 4.
# PEFT uses target_modules to specify which layers get LoRA adapters.

lora_config = LoraConfig(
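    r=4,                           # Rank 4: the small detour
    lora_alpha=4,                  # Alpha = rank, so scaling is 1.0
    # A regex restricts LoRA to cross-attention (attn2). A plain list such as
    # ['to_q', 'to_k', 'to_v', 'to_out.0'] would also match self-attention (attn1).
    target_modules=r'.*\.attn2\.(to_q|to_k|to_v|to_out\.0)',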
    lora_dropout=0.0,
)

unet_lora = get_peft_model(unet_f32, lora_config)

# Print trainable vs total parameters.
trainable_params = sum(p.numel() for p in unet_lora.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in unet_lora.parameters())
print(f'Trainable LoRA params: {trainable_params:,}')
print(f'Total U-Net params:    {total_params:,}')
print(f'Trainable fraction:    {100 * trainable_params / total_params:.3f}%')
print(f'\nThis matches Exercise 1\'s rank-4 count. PEFT applied LoRA to exactly')
print(f'the cross-attention projections we identified.')

In [None]:
# Step 2: Create a synthetic training sample.
# In a real training run, this would be a (image, caption) pair from your dataset.
# Here we use a random tensor to focus on the mechanics of the training step.

# "Training image": a random 512x512 tensor (standing in for a real image scaled to [-1, 1])
train_image = torch.randn(1, 3, 512, 512, device=device)
caption = "a watercolor painting of a village"

# Step 3: VAE encode to compress the image to latent space.
# The VAE is frozen, so we encode under torch.no_grad() and scale by the VAE's scaling factor.
with torch.no_grad():
    encoder_output = vae.encode(train_image)
    z_0 = encoder_output.latent_dist.sample() * vae.config.scaling_factor

print(f'Input image shape:  {list(train_image.shape)}')
print(f'Latent z_0 shape:   {list(z_0.shape)}')
print(f'VAE scaling factor: {vae.config.scaling_factor}')
print(f'\n8x spatial compression: 512/8 = 64. Four latent channels.')
print(f'This is the same VAE encoding from "From Pixels to Latents."')

In [None]:
# Step 4: Pick a timestep (fixed here; sampled at random in real training) and add noise.
# This is the forward process from "The Forward Process":
#   z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * epsilon

timestep = torch.tensor([500], device=device)  # Middle of the schedule

# Sample the noise that we will try to predict
epsilon = torch.randn_like(z_0)

# Add noise using the scheduler's closed-form formula
z_t = noise_scheduler.add_noise(z_0, epsilon, timestep)

print(f'Timestep:       t = {timestep.item()}')
print(f'Noise shape:    {list(epsilon.shape)}')
print(f'Noised latent:  {list(z_t.shape)}')
print(f'\nThe noise epsilon is the TARGET. The U-Net will try to predict it.')

In [None]:
# Step 5: Encode the caption with frozen CLIP.

tokens = tokenizer(
    caption,
    padding='max_length',
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors='pt',
)

with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids.to(device))[0]

print(f'Caption: "{caption}"')
print(f'Token IDs shape:      {list(tokens.input_ids.shape)}')
print(f'Text embeddings shape: {list(text_embeddings.shape)}')
print(f'\n77 tokens (padded), 768-dim CLIP embeddings.')
print(f'These become the K and V inputs to cross-attention.')

In [None]:
# Step 6: U-Net forward pass to predict the noise.
# Only LoRA adapter parameters have gradients. Base weights are frozen.

epsilon_hat = unet_lora(z_t, timestep, encoder_hidden_states=text_embeddings).sample

print(f'Predicted noise shape: {list(epsilon_hat.shape)}')
print(f'Target noise shape:    {list(epsilon.shape)}')
print(f'Shapes match: {epsilon_hat.shape == epsilon.shape}')

In [None]:
# Step 7: Compute MSE loss, the same loss used in DDPM training.
loss = F.mse_loss(epsilon_hat, epsilon)

print(f'MSE Loss: {loss.item():.4f}')
print(f'\nThis is the same loss from "Learning to Denoise":')
print(f'  L = MSE(epsilon, epsilon_hat)')
print(f'The model predicted the noise. The loss measures how close it was.')

In [None]:
# Step 8: Backprop and verify gradient flow.
# This is the critical verification: gradients should flow ONLY through
# LoRA adapter parameters, NOT through base U-Net weights.

loss.backward()

# Check LoRA parameters: they SHOULD have gradients.
lora_with_grad = 0
lora_without_grad = 0
for name, param in unet_lora.named_parameters():
    if 'lora_' in name and param.requires_grad:
        if param.grad is not None and param.grad.abs().sum() > 0:
            lora_with_grad += 1
        else:
            lora_without_grad += 1

# Check base parameters: they should NOT have gradients.
base_with_grad = 0
base_without_grad = 0
for name, param in unet_lora.named_parameters():
    if 'lora_' not in name:
        if param.grad is not None and param.grad.abs().sum() > 0:
            base_with_grad += 1
        else:
            base_without_grad += 1

print('=== Gradient Flow Verification ===')
print(f'\nLoRA adapter parameters:')
print(f'  With gradients:    {lora_with_grad}')
print(f'  Without gradients: {lora_without_grad}')
print(f'\nBase U-Net parameters:')
print(f'  With gradients:    {base_with_grad}  (should be 0)')
print(f'  Without gradients: {base_without_grad}')

if base_with_grad == 0 and lora_with_grad > 0:
    print(f'\n*** Gradient flow is correct. ***')
    print(f'Only LoRA adapters receive gradients. The highway is frozen.')
    print(f'The detour is learning.')
else:
    print(f'\nSomething unexpected happened. Check the setup.')

# Show a sample LoRA gradient to prove it is non-trivial
for name, param in unet_lora.named_parameters():
    if 'lora_' in name and param.grad is not None and param.grad.abs().sum() > 0:
        print(f'\nSample gradient: {name}')
        print(f'  Param shape:     {list(param.shape)}')
        print(f'  Grad norm:       {param.grad.norm().item():.6f}')
        print(f'  Grad mean:       {param.grad.mean().item():.8f}')
        print(f'  Grad std:        {param.grad.std().item():.8f}')
        break

### What Just Happened

You executed one complete diffusion LoRA training step, matching the pseudocode from the lesson:

1. **VAE encode** the training image: [3, 512, 512] → [4, 64, 64]. The VAE is frozen; it just compresses the image to latent space.
2. **Choose timestep t=500** (fixed here; sampled at random in real training) and add noise using the forward process closed-form formula. The noise ε is the target.
3. **CLIP encode** the caption: "a watercolor painting of a village" → [1, 77, 768]. The text encoder is frozen.
4. **U-Net forward pass**: the noised latent z_t, timestep t, and text embeddings go in. The predicted noise ε̂ comes out.
5. **MSE loss**: L = MSE(ε, ε̂). Same loss from DDPM training in "Learning to Denoise."
6. **Backprop**: gradients flow only through LoRA adapter parameters. Base U-Net weights receive **no gradients**. On this very first step, roughly half of the LoRA tensors are reported under "Without gradients": B is initialized to zero, and the gradient for A passes through B, so it is exactly zero until B has been updated once.

The critical verification: `base_with_grad == 0`. The highway is frozen. Only the detour learns. This is exactly the LoRA training pattern from Module 4.4, applied to a different highway.
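
One detail of the first step is worth checking in isolation: with B initialized to zero, the bypass contributes nothing to the output, B receives a non-zero gradient, and A receives exactly zero. The sketch below works this out by hand with NumPy for a single linear layer. The dimensions and the squared-error loss are assumptions for illustration, and the gradients are written out manually rather than taken from PEFT or autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2

W = rng.normal(size=(d_out, d_in))   # frozen base weight (no gradient computed for it)
A = rng.normal(size=(r, d_in))       # LoRA A: random init
B = np.zeros((d_out, r))             # LoRA B: zero init, so B @ A = 0 at step 0

x = rng.normal(size=d_in)
target = rng.normal(size=d_out)

# Forward: y = (W + B A) x, loss = 0.5 * ||y - target||^2
y = W @ x + B @ (A @ x)
grad_y = y - target

# Manual backward through the bypass only.
grad_B = np.outer(grad_y, A @ x)     # dL/dB = grad_y (A x)^T
grad_A = np.outer(B.T @ grad_y, x)   # dL/dA = (B^T grad_y) x^T

print("output equals base output:", np.allclose(y, W @ x))
print("||grad_B|| =", np.linalg.norm(grad_B))
print("||grad_A|| =", np.linalg.norm(grad_A))
```

After one optimizer step B is no longer zero, so from the second step onward both A and B receive non-zero gradients.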

---

In [None]:
# Clean up the float32 models to free VRAM before Exercise 3.
del unet_lora, unet_f32, vae, text_encoder
torch.cuda.empty_cache() if device.type == 'cuda' else None
print('Cleaned up float32 models. VRAM freed for Exercise 3.')

---

## Exercise 3: Train a Style LoRA [Supported]

Now you will train a LoRA end-to-end. Instead of gathering a real dataset of 50–200 style images, we will generate a small synthetic dataset of warm-toned images. The goal is to see the **workflow and the effect**, not to produce a gallery-quality LoRA.

You will:
1. Create a small synthetic "style" dataset
2. Set up the LoRA training loop using diffusers + PEFT
3. Train for a few hundred steps (200 by default)
4. Generate images with and without the LoRA to see the style effect
5. Optionally, rerun with rank 8 and compare against rank 4

The training loop is the same step you did by hand in Exercise 2, repeated many times over the training images.

**Hints:**
- Each TODO is 1–3 lines of code.
- The patterns are identical to Exercise 2 (VAE encode, noise, U-Net forward, MSE loss).
- If you get stuck, the solution is below.

In [None]:
from peft import LoraConfig, get_peft_model
from diffusers import UNet2DConditionModel, AutoencoderKL, DDPMScheduler, StableDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer
from torchvision import transforms
import gc

# Load the pipeline components fresh for training.
print('Loading pipeline components for training...')
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float32).to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder='vae', torch_dtype=torch.float32).to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder='text_encoder', torch_dtype=torch.float32).to(device)
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder='scheduler')

# Freeze VAE and text encoder; they do not train.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

print('Components loaded.')

In [None]:
# Create a small synthetic "style dataset."
# In a real workflow, you would use actual images in your target style.
# Here we create simple synthetic images with a consistent visual pattern
# (warm-toned gradients) so we can verify the LoRA learns *something*
# without needing to download a large dataset.

num_train_images = 20
train_captions = [
    "a warm sunset painting",
    "a golden landscape with warm colors",
    "an orange and red abstract composition",
    "a warm-toned watercolor scene",
    "a painting with amber and crimson hues",
] * 4  # Repeat to get 20 samples

# Create synthetic images with a warm color bias.
# These are not art, but they have a consistent "warm" signal.
image_transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # Scale to [-1, 1]
])

train_images = []
for i in range(num_train_images):
    # Create a warm-biased random image (more red/yellow, less blue)
    img = np.random.rand(512, 512, 3).astype(np.float32)
    img[:, :, 0] = np.clip(img[:, :, 0] + 0.3, 0, 1)  # Boost red
    img[:, :, 1] = np.clip(img[:, :, 1] + 0.1, 0, 1)  # Slight green boost
    img[:, :, 2] = np.clip(img[:, :, 2] - 0.2, 0, 1)  # Reduce blue
    pil_img = Image.fromarray((img * 255).astype(np.uint8))
    train_images.append(image_transform(pil_img))

print(f'Created {len(train_images)} synthetic training images.')
print(f'Image tensor shape: {list(train_images[0].shape)}')
print(f'Value range: [{train_images[0].min():.2f}, {train_images[0].max():.2f}]')

In [None]:
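# Apply LoRA to the U-Net: rank 4 on the cross-attention (attn2) projections.

lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    # Regex restricts LoRA to attn2; a plain list of names would also match attn1.
    target_modules=r'.*\.attn2\.(to_q|to_k|to_v|to_out\.0)',
    lora_dropout=0.0,
)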

unet_lora = get_peft_model(unet, lora_config)
unet_lora.print_trainable_parameters()

# Set up the optimizer: only LoRA parameters are trainable.
optimizer = torch.optim.AdamW(
    [p for p in unet_lora.parameters() if p.requires_grad],
    lr=1e-4,
    weight_decay=1e-2,
)

print(f'\nOptimizer targets {sum(1 for p in unet_lora.parameters() if p.requires_grad)} trainable parameter tensors.')

In [None]:
# Training loop: this is Exercise 2's single step, repeated.
# Fill in the TODOs. Each is 1-3 lines.

num_steps = 200
losses = []

unet_lora.train()
print(f'Training for {num_steps} steps...')

for step in range(num_steps):
    # Cycle through the training samples in order
    idx = step % len(train_images)
    pixel_values = train_images[idx].unsqueeze(0).to(device)  # [1, 3, 512, 512]
    caption = train_captions[idx]

    # TODO 1: Encode the image with the frozen VAE.
    # Use vae.encode(pixel_values), get the latent distribution,
    # sample from it, and multiply by vae.config.scaling_factor.
    # Wrap in torch.no_grad() since the VAE is frozen.
    # (This is identical to Exercise 2, Step 3.)
    z_0 = None  # Replace this line

    # TODO 2: Sample a random timestep and add noise.
    # Sample a random timestep between 0 and num_train_timesteps.
    # Generate random noise with torch.randn_like(z_0).
    # Use noise_scheduler.add_noise(z_0, noise, timesteps) to create z_t.
    noise = None     # Replace this line
    timesteps = None  # Replace this line
    z_t = None       # Replace this line

    # Encode the caption with frozen CLIP.
    tokens = tokenizer(
        caption,
        padding='max_length',
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors='pt',
    )
    with torch.no_grad():
        text_emb = text_encoder(tokens.input_ids.to(device))[0]

    # TODO 3: U-Net forward pass and MSE loss.
    # Pass z_t, timesteps, and text_emb through the U-Net.
    # Compute MSE loss between the predicted noise and the actual noise.
    # (This is identical to Exercise 2, Steps 6-7.)
    noise_pred = None  # Replace this line
    loss = None        # Replace this line

    # Backprop and optimizer step
    if loss is not None:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

        if step % 50 == 0 or step == num_steps - 1:
            print(f'  Step {step:>4d}/{num_steps}: loss = {loss.item():.4f}')
    else:
        print('Fill in the TODOs above to start training.')
        break

if losses:
    print(f'\nTraining complete. Final loss: {losses[-1]:.4f}')

<details>
<summary>💡 Solution</summary>

The training loop is Exercise 2's single step, repeated over the synthetic training set. Every TODO maps to something you already did.

**TODO 1: VAE encode**
```python
    with torch.no_grad():
        z_0 = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
```
Why `torch.no_grad()`? The VAE is frozen, so we do not need gradients through it. The `latent_dist.sample()` draws from the learned Gaussian, and the scaling factor normalizes the latent space.

**TODO 2: Sample timestep and add noise**
```python
    noise = torch.randn_like(z_0)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,), device=device)
    z_t = noise_scheduler.add_noise(z_0, noise, timesteps)
```
The timestep is sampled uniformly from the schedule. `add_noise` applies the closed-form forward process: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * noise.
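
To see how the two coefficients in that formula trade off across the schedule, here is a minimal standard-library sketch. The schedule values (scaled-linear betas from 0.00085 to 0.012 over 1000 steps) are an assumption believed to match the SD v1.5 scheduler config; verify them against `noise_scheduler.config`. `add_noise_scalar` is a hypothetical scalar stand-in for the tensor version, not the diffusers implementation.

```python
import math
import random

# Assumed schedule values; check noise_scheduler.config for the real ones.
T, beta_start, beta_end = 1000, 0.00085, 0.012


def scaled_linear_betas(T, beta_start, beta_end):
    """Betas that are linear in sqrt-space, then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (T - 1)) ** 2 for i in range(T)]


def alpha_bar_schedule(betas):
    """Cumulative product of (1 - beta_t)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


alpha_bar = alpha_bar_schedule(scaled_linear_betas(T, beta_start, beta_end))


def add_noise_scalar(z0, eps, t):
    """z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps for one value."""
    return math.sqrt(alpha_bar[t]) * z0 + math.sqrt(1.0 - alpha_bar[t]) * eps


random.seed(0)
z0, eps = 0.8, random.gauss(0.0, 1.0)
for t in (0, 500, 999):
    signal, noise_coeff = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    z_t = add_noise_scalar(z0, eps, t)
    print(f"t={t:>3}: signal coeff={signal:.3f}, noise coeff={noise_coeff:.3f}, z_t={z_t:+.3f}")
```

The signal coefficient shrinks and the noise coefficient grows as t increases, while their squares always sum to 1.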

**TODO 3: Forward pass and loss**
```python
    noise_pred = unet_lora(z_t, timesteps, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(noise_pred, noise)
```
The U-Net predicts the noise that was added. MSE measures how close the prediction is to the actual noise. Only LoRA params have gradients, so backprop only updates the detour.

**Common mistakes:**
- Forgetting `torch.no_grad()` around the VAE encode. This wastes memory tracking gradients through a frozen model and can cause OOM errors.
- Using `timesteps` as an integer instead of a tensor. The scheduler expects a tensor.
- Confusing the loss direction: `F.mse_loss(noise_pred, noise)`, not `F.mse_loss(noise, noise_pred)`. Both give the same MSE, but the convention is (predicted, target).

</details>

In [None]:
# Plot the training loss curve.
if losses:
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(losses, alpha=0.3, color='cyan', label='Per-step loss')
    # Smoothed version
    window = min(20, len(losses))
    if len(losses) >= window:
        smoothed = np.convolve(losses, np.ones(window)/window, mode='valid')
        ax.plot(range(window-1, len(losses)), smoothed, color='cyan', linewidth=2, label=f'Smoothed ({window}-step avg)')
    ax.set_xlabel('Training Step')
    ax.set_ylabel('MSE Loss')
    ax.set_title('LoRA Training Loss')
    ax.legend()
    plt.tight_layout()
    plt.show()
    
    print(f'Starting loss: {losses[0]:.4f}')
    print(f'Final loss:    {losses[-1]:.4f}')
    print(f'\nThe loss should decrease, showing the LoRA adapters are learning.')
    print(f'On synthetic data, the effect will be subtle but measurable.')
else:
    print('No losses recorded. Fill in the TODOs in the training loop.')

In [None]:
# Generate images WITH the LoRA to see the effect.
if losses:
    from diffusers import DPMSolverMultistepScheduler

    # Build a pipeline using our LoRA-adapted U-Net.
    unet_lora.eval()

    # Merge the LoRA weights into the base model for inference.
    # This is the merge-at-inference pattern from Module 4.4:
    #   W_merged = W + BA * (alpha/r)
    # After merging, there is zero inference overhead.
    unet_lora.merge_adapter()

    # Create a fresh pipeline with the LoRA-merged U-Net.
    lora_pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        unet=unet_lora,
        torch_dtype=torch.float32,
        safety_checker=None,
        requires_safety_checker=False,
    ).to(device)
    lora_pipe.scheduler = DPMSolverMultistepScheduler.from_config(lora_pipe.scheduler.config)

    # Also load a clean pipeline without LoRA for comparison.
    clean_pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=dtype,
        safety_checker=None,
        requires_safety_checker=False,
    ).to(device)
    clean_pipe.scheduler = DPMSolverMultistepScheduler.from_config(clean_pipe.scheduler.config)

    # Generate with the same seed, same prompt.
    prompt = "a landscape with mountains and a river"
    seed = 42

    gen = torch.Generator(device=device).manual_seed(seed)
    img_without = clean_pipe(
        prompt, guidance_scale=7.5, num_inference_steps=25,
        generator=gen, height=512, width=512,
    ).images[0]

    gen = torch.Generator(device=device).manual_seed(seed)
    img_with = lora_pipe(
        prompt, guidance_scale=7.5, num_inference_steps=25,
        generator=gen, height=512, width=512,
    ).images[0]

    show_images(
        [img_without, img_with],
        ['Without LoRA (base model)', 'With LoRA (trained on warm tones)'],
        figsize=(12, 6),
    )

    print('The LoRA was trained on a small synthetic dataset with warm-toned images.')
    print('The effect may be subtle (synthetic data, short training), but it')
    print('demonstrates the full workflow: train LoRA -> merge -> generate.')
    print('\nWith a real style dataset (50-200 images of actual artwork),')
    print('the style transfer would be dramatically more visible.')
else:
    print('Complete the training loop first.')
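
The merge used in the cell above, `W_merged = W + BA * (alpha/r)`, can be checked numerically on a toy layer. This is a minimal NumPy sketch with random stand-ins for trained LoRA matrices; the dimensions are assumptions, and it illustrates the algebra rather than PEFT's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 4, 4

W = rng.normal(size=(d_out, d_in))   # frozen base weight
B = rng.normal(size=(d_out, r))      # stands in for a trained LoRA B
A = rng.normal(size=(r, d_in))       # stands in for a trained LoRA A
x = rng.normal(size=d_in)

scale = alpha / r

# Unmerged: base path plus the scaled low-rank bypass (two extra matmuls per call).
y_bypass = W @ x + scale * (B @ (A @ x))

# Merged: fold the bypass into the weight once, then a single matmul per call.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print("max abs difference:", np.abs(y_bypass - y_merged).max())
```

The two outputs agree up to floating-point rounding, which is why a merged adapter adds no inference cost.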

In [None]:
# Clean up training artifacts before Exercise 4.
del unet_lora, unet
if 'lora_pipe' in dir():
    del lora_pipe
if 'clean_pipe' in dir():
    del clean_pipe
gc.collect()
torch.cuda.empty_cache() if device.type == 'cuda' else None
print('Cleaned up. Ready for Exercise 4.')

---

## Exercise 4: LoRA Composition Experiment [Independent]

The lesson explained that multiple LoRA adapters can be applied simultaneously:

$$W_{\text{combined}} = W + \frac{\alpha_1}{r_1} B_{\text{style}} A_{\text{style}} + \frac{\alpha_2}{r_2} B_{\text{subject}} A_{\text{subject}}$$
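
The additive structure of this formula can be explored on a toy weight matrix before touching real adapters. The sketch below uses random stand-ins for two trained adapters. The `combine` helper and the per-adapter `weight` factor are hypothetical: `weight` is assumed to multiply each adapter's `alpha / r` scaling, in the spirit of the adapter weights used in the task below.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 32, 24

W = rng.normal(size=(d_out, d_in))
# Random stand-ins for two trained adapters with different ranks.
B_style, A_style = rng.normal(size=(d_out, 4)), rng.normal(size=(4, d_in))
B_subj, A_subj = rng.normal(size=(d_out, 8)), rng.normal(size=(8, d_in))


def combine(W, adapters):
    """adapters: list of (B, A, alpha, weight). Returns W plus the sum of scaled updates."""
    W_combined = W.copy()
    for B, A, alpha, weight in adapters:
        rank = A.shape[0]
        W_combined += weight * (alpha / rank) * (B @ A)
    return W_combined


full = combine(W, [(B_style, A_style, 4, 1.0), (B_subj, A_subj, 8, 1.0)])
half = combine(W, [(B_style, A_style, 4, 0.5), (B_subj, A_subj, 8, 0.5)])
style_only = combine(W, [(B_style, A_style, 4, 1.0), (B_subj, A_subj, 8, 0.0)])

# Halving both weights halves the total update.
print(np.allclose(half - W, 0.5 * (full - W)))
# A zero weight removes that adapter entirely.
print(np.allclose(style_only, W + B_style @ A_style))
```

The weight-space sum is exactly linear, but the images are not: the U-Net is nonlinear, so two adapters that modify the same projections can still interfere visually.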

Your task:
1. Load two **pre-trained community LoRA adapters** from Hugging Face (suggestions below)
2. Apply each adapter **individually** and generate an image
3. Apply **both adapters together** and generate an image
4. **Experiment with alpha scaling**: try different adapter weights (e.g., 0.5 and 1.0, or 0.3 and 0.7) and observe how the blend changes
5. Display all results in a grid and write observations

**Suggested LoRA adapters from Hugging Face:**
- Use any two community LoRA adapters compatible with SD v1.5. Search [huggingface.co/models](https://huggingface.co/models?pipeline_tag=text-to-image&sort=downloads) filtered by "lora" and "stable-diffusion".
- Check the base model before loading. For example, `nerijs/pixel-art-xl` (pixel art style) targets SDXL and is not compatible with SD v1.5. Search for "lora sd-1.5" on the Hub to find compatible adapters.

**Key diffusers APIs you will need:**
- `pipe.load_lora_weights("hf-repo-id", adapter_name="name")`: loads a LoRA from Hugging Face
- `pipe.set_adapters(["name1", "name2"], adapter_weights=[w1, w2])`: activates multiple adapters with weights
- `pipe.set_adapters(["name1"], adapter_weights=[1.0])`: activates only one adapter
- `pipe.unload_lora_weights()`: removes all adapters

**What to observe:**
- Do the two LoRAs compose cleanly, or do they interfere?
- How does adjusting the adapter weights change the visual output?
- Is the composition a smooth blend, or does one LoRA dominate?

This is your experiment. There is no single correct answer.

In [None]:
# YOUR EXPERIMENT
#
# 1. Load the base SD v1.5 pipeline
# 2. Load two LoRA adapters with pipe.load_lora_weights()
# 3. Generate with each adapter individually
# 4. Generate with both adapters composed at different weight ratios
# 5. Display results and write observations
#
# Starter code:
#   pipe = StableDiffusionPipeline.from_pretrained(
#       model_id, torch_dtype=dtype, safety_checker=None,
#       requires_safety_checker=False,
#   ).to(device)
#
#   pipe.load_lora_weights("some-hf-repo", adapter_name="style_a")
#   pipe.load_lora_weights("another-hf-repo", adapter_name="style_b")
#
#   # Generate with only style_a:
#   pipe.set_adapters(["style_a"], adapter_weights=[1.0])
#   img_a = pipe(prompt, ...).images[0]
#
#   # Generate with only style_b:
#   pipe.set_adapters(["style_b"], adapter_weights=[1.0])
#   img_b = pipe(prompt, ...).images[0]
#
#   # Generate with both:
#   pipe.set_adapters(["style_a", "style_b"], adapter_weights=[0.7, 0.7])
#   img_both = pipe(prompt, ...).images[0]
#
#   show_images([img_a, img_b, img_both], ["Style A", "Style B", "Both"])

print('Observations:')
print('  (Write your observations here after running the experiment.)')

<details>
<summary>💡 Solution</summary>

The key insight is that LoRA composition is additive: the two bypass outputs are summed with their respective alpha scaling. Whether they compose cleanly depends on whether they modify the same projections in compatible directions.

```python
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Load the base pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    safety_checker=None,
    requires_safety_checker=False,
).to(device)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Load two LoRA adapters.
# Replace these with actual LoRA repos from Hugging Face.
# Example repos (verify availability and SD v1.5 compatibility):
lora_a_repo = "YOUR_FIRST_LORA_REPO"   # e.g., a style LoRA
lora_b_repo = "YOUR_SECOND_LORA_REPO"  # e.g., another style LoRA

pipe.load_lora_weights(lora_a_repo, adapter_name="style_a")
pipe.load_lora_weights(lora_b_repo, adapter_name="style_b")

prompt = "a cat sitting in a garden"
seed = 42

# Generate: no LoRA (baseline)
# disable_lora() switches off all loaded adapters without unloading them
pipe.disable_lora()
gen = torch.Generator(device=device).manual_seed(seed)
img_base = pipe(prompt, guidance_scale=7.5, num_inference_steps=25,
                generator=gen, height=512, width=512).images[0]
pipe.enable_lora()  # re-enable the adapters for the runs below

# Generate: style_a only
pipe.set_adapters(["style_a"], adapter_weights=[1.0])
gen = torch.Generator(device=device).manual_seed(seed)
img_a = pipe(prompt, guidance_scale=7.5, num_inference_steps=25,
             generator=gen, height=512, width=512).images[0]

# Generate: style_b only
pipe.set_adapters(["style_b"], adapter_weights=[1.0])
gen = torch.Generator(device=device).manual_seed(seed)
img_b = pipe(prompt, guidance_scale=7.5, num_inference_steps=25,
             generator=gen, height=512, width=512).images[0]

# Generate: both at full weight
pipe.set_adapters(["style_a", "style_b"], adapter_weights=[1.0, 1.0])
gen = torch.Generator(device=device).manual_seed(seed)
img_both_full = pipe(prompt, guidance_scale=7.5, num_inference_steps=25,
                     generator=gen, height=512, width=512).images[0]

# Generate: both at half weight
pipe.set_adapters(["style_a", "style_b"], adapter_weights=[0.5, 0.5])
gen = torch.Generator(device=device).manual_seed(seed)
img_both_half = pipe(prompt, guidance_scale=7.5, num_inference_steps=25,
                     generator=gen, height=512, width=512).images[0]

# Generate: weighted blend (favoring style_a)
pipe.set_adapters(["style_a", "style_b"], adapter_weights=[0.8, 0.3])
gen = torch.Generator(device=device).manual_seed(seed)
img_blend = pipe(prompt, guidance_scale=7.5, num_inference_steps=25,
                 generator=gen, height=512, width=512).images[0]

# Display results
show_image_grid(
    [img_base, img_a, img_b, img_both_full, img_both_half, img_blend],
    ['Base (no LoRA)', 'Style A only', 'Style B only',
     'Both (1.0, 1.0)', 'Both (0.5, 0.5)', 'Blend (0.8, 0.3)'],
    nrows=2, ncols=3, figsize=(18, 12),
    suptitle='LoRA Composition Experiment',
)

print('Observations:')
print('- Each LoRA individually produces a consistent style shift from the base.')
print('- Both at full weight (1.0, 1.0) may produce artifacts if the styles conflict.')
print('- Both at half weight (0.5, 0.5) often blends more smoothly.')
print('- Weighted blend lets you favor one style over the other.')
print('- Composition is not always clean: if both LoRAs push cross-attention')
print('  projections in conflicting directions, the result may be incoherent.')
```

**Why adapter weights matter:** Each adapter's contribution is scaled by its weight. At (1.0, 1.0), both LoRAs apply at full strength, which can overshoot: the combined delta may be too large. At (0.5, 0.5), each contributes half its learned delta, producing a gentler blend. This is the alpha scaling from the lesson: you are controlling how much each detour influences the highway.
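
To make the scaling concrete, here is a minimal sketch on a single toy weight matrix. It is not the diffusers/PEFT implementation: the sizes, the random "trained" matrices, and the helper name `composed_weight` are assumptions for illustration, and each adapter's own alpha/rank factor is omitted.

```python
import torch

torch.manual_seed(0)

d_out, d_in, rank = 8, 6, 2           # toy sizes, chosen only for illustration
W = torch.randn(d_out, d_in)          # frozen base weight (the "highway")

# Two hypothetical trained adapters, each a low-rank delta B @ A
B_a, A_a = torch.randn(d_out, rank), torch.randn(rank, d_in)
B_b, A_b = torch.randn(d_out, rank), torch.randn(rank, d_in)
delta_a = B_a @ A_a
delta_b = B_b @ A_b


def composed_weight(w_a: float, w_b: float) -> torch.Tensor:
    """Effective weight when both detours are summed with per-adapter weights."""
    return W + w_a * delta_a + w_b * delta_b


for w_a, w_b in [(1.0, 1.0), (0.5, 0.5), (0.8, 0.3)]:
    shift = (composed_weight(w_a, w_b) - W).norm().item()
    print(f"weights=({w_a}, {w_b})  ||combined delta|| = {shift:.3f}")
```

Because the sum is linear in the weights, halving both weights halves the norm of the combined delta. Whether the blended image looks coherent still depends on how the two deltas interact, which this toy example does not model.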

**Common mistakes:**
1. Forgetting to re-seed the generator for each run. Without the same z_T, comparisons are meaningless.
2. Using LoRAs trained for different SD versions (e.g., SDXL LoRA on SD v1.5). The projection dimensions will not match.
3. Expecting perfect composition. LoRA composition is a sum of independent adaptations, so there is no guarantee they are compatible.
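
The first mistake is easy to reproduce without a pipeline. The sketch below draws toy starting noise on the CPU with the SD v1.5 latent shape for a 512×512 image; the helper name `initial_latents` is an assumption for illustration, not part of diffusers.

```python
import torch


def initial_latents(seed: int) -> torch.Tensor:
    """Draw a toy z_T from a freshly seeded generator."""
    gen = torch.Generator(device="cpu").manual_seed(seed)
    return torch.randn(1, 4, 64, 64, generator=gen)


# Fresh generator per run: both runs start from identical noise
z_run1 = initial_latents(42)
z_run2 = initial_latents(42)

# One generator reused across runs: the second draw continues the random stream
shared = torch.Generator(device="cpu").manual_seed(42)
z_first = torch.randn(1, 4, 64, 64, generator=shared)
z_second = torch.randn(1, 4, 64, 64, generator=shared)

print("re-seeded runs match:    ", torch.equal(z_run1, z_run2))
print("shared generator matches:", torch.equal(z_first, z_second))
```

The re-seeded runs match and the shared-generator draws do not, which is why the solution creates a new `torch.Generator` before every `pipe(...)` call.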

</details>

---

## Key Takeaways

1. **Cross-attention projections are the LoRA targets.** The SD v1.5 U-Net's cross-attention layers (`attn2`) contain `to_q`, `to_k`, `to_v`, and `to_out` projections. These are where text meaning meets spatial features. A rank-4 LoRA on these layers adds roughly 0.1–0.3% of the total U-Net parameters.

2. **The diffusion LoRA training step is DDPM training with frozen base weights.** VAE encode → sample timestep → add noise → U-Net predicts noise → MSE loss → backprop into LoRA params only. Every piece was familiar from prior lessons.

3. **Gradient flow verification is the critical check.** Base U-Net parameters should have zero gradients. LoRA adapter parameters should have non-zero gradients. If this is wrong, the training is wrong.

4. **LoRA composition is additive but not always clean.** Two LoRA adapters sum their bypass outputs. Adapter weights control the blend. Conflicting adaptations can interfere, so scale the weights to find a balance.

5. **Same detour, different highway.** The LoRA mechanism from Module 4.4 transferred directly. What changed was the context: which layers to target (cross-attention projections), what the training data looks like (images + captions), and what the loss measures (noise prediction MSE).
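
Takeaways 2 and 3 can be condensed into a toy sketch that runs on CPU in well under a second. Everything here is a stand-in assumed for illustration: `LoRALinear` is a hand-written layer rather than the PEFT class, the two-layer `denoiser` replaces the U-Net, random tensors replace VAE-encoded latents, and the linear beta schedule is one common DDPM choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank bypass (B starts at zero)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the highway
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)


# Toy stand-in for the U-Net: flattened noisy latents in, predicted noise out
latent_dim = 4 * 8 * 8
denoiser = nn.Sequential(
    LoRALinear(nn.Linear(latent_dim, 64)),
    nn.SiLU(),
    LoRALinear(nn.Linear(64, latent_dim)),
)

# DDPM forward process with a linear beta schedule (assumed)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

latents = torch.randn(2, latent_dim)       # stand-in for VAE-encoded images
noise = torch.randn_like(latents)
t = torch.randint(0, 1000, (2,))           # one random timestep per sample
a_bar = alphas_cumprod[t].unsqueeze(1)
noisy = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise

loss = F.mse_loss(denoiser(noisy), noise)  # noise-prediction MSE
loss.backward()

# Gradient flow check: frozen base params get no gradient, LoRA params do
for name, p in denoiser.named_parameters():
    grad_sum = None if p.grad is None else p.grad.abs().sum().item()
    print(f"{name:16s} requires_grad={p.requires_grad}  grad_abs_sum={grad_sum}")
```

In this sketch the frozen base parameters report `None` for their gradient because `requires_grad` is off. With B initialized to zero, only `lora_B` receives a non-zero gradient on the very first step; the gradient for `lora_A` is exactly zero until B has moved away from zero, so a check on a fresh adapter should look for non-zero gradients on the B matrices.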