# Learning to Denoise

**Module 6.2, Lesson 3** | CourseAI

In this notebook you will use the closed-form formula from The Forward Process to create noisy images, compute the DDPM training loss by hand, write the training algorithm in code, and reason about which noise levels are hardest for the model.

**What you'll do:**
- Create noisy versions of Fashion-MNIST images at different timesteps using the closed-form formula
- Compute MSE loss between "predicted" noise and actual noise, both manually and with `nn.MSELoss`
- Fill in the DDPM training loop skeleton: the data preparation and loss computation steps
- Reason about the loss landscape: which timesteps are hardest for the model, and why

**For each exercise, PREDICT the output before running the cell.**

**Estimated time:** 20-30 minutes.

---

## Setup

Run this cell to import everything and configure the environment.

In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import math

# Reproducible results
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

print('Setup complete.')

## Shared Setup: Data and Noise Schedule

We load Fashion-MNIST and define the noise schedule. These are the tools you will use in every exercise.

In [None]:
# Load Fashion-MNIST (same dataset from the web lesson examples)
transform = transforms.Compose([
    transforms.ToTensor(),
    # Normalize to [-1, 1], the standard range for diffusion models
    transforms.Normalize((0.5,), (0.5,))
])

dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transform
)

# Class names for labels
class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

print(f'Dataset size: {len(dataset)}')
print(f'Image shape: {dataset[0][0].shape}')
print(f'Pixel range: [{dataset[0][0].min():.1f}, {dataset[0][0].max():.1f}]')

In [None]:
# Define the noise schedule
# We use a cosine schedule (Nichol & Dhariwal 2021), which preserves
# more signal at early timesteps than a linear schedule.

T = 1000  # Total timesteps

def cosine_alpha_bar_schedule(T, s=0.008):
    """Compute alpha_bar for each timestep using the cosine schedule.

    alpha_bar_t represents how much of the original signal remains at step t.
    At t=0, alpha_bar ~ 1 (all signal). At t=T, alpha_bar ~ 0 (all noise).
    """
    steps = torch.arange(T + 1, dtype=torch.float32)
    f = torch.cos(((steps / T) + s) / (1 + s) * (math.pi / 2)) ** 2
    alpha_bar = f / f[0]
    return alpha_bar

# alpha_bar[t] = cumulative signal fraction at timestep t
# Index 0 = clean, index T = nearly pure noise
alpha_bar = cosine_alpha_bar_schedule(T)

print(f'alpha_bar shape: {alpha_bar.shape}')
print(f'alpha_bar[0]   = {alpha_bar[0]:.4f}  (clean image: all signal)')
print(f'alpha_bar[500] = {alpha_bar[500]:.4f}  (midpoint: signal-to-noise dial about halfway)')
print(f'alpha_bar[999] = {alpha_bar[999]:.6f}  (nearly pure noise)')

---

## Exercise 1: Create Noisy Images (Guided)

This is the **data preparation** step of the DDPM training algorithm. Given a clean image $x_0$, a timestep $t$, and sampled noise $\epsilon$, you create the noisy image using the closed-form formula:

$$x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon$$

You derived this formula in The Forward Process. Now you use it as a tool.
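To make the formula concrete, here is a one-pixel worked example in plain Python. The values `alpha_bar_t = 0.36`, `x_0 = 0.5`, and `epsilon = -1.0` are made up for easy arithmetic and are not taken from the schedule above; the helper name `q_sample_pixel` is hypothetical.

```python
import math

def q_sample_pixel(x_0, epsilon, alpha_bar_t):
    """Closed-form forward process applied to a single pixel value."""
    signal_coeff = math.sqrt(alpha_bar_t)
    noise_coeff = math.sqrt(1 - alpha_bar_t)
    return signal_coeff * x_0 + noise_coeff * epsilon

# Illustrative values chosen so the square roots are round numbers:
# sqrt(0.36) = 0.6 and sqrt(0.64) = 0.8
alpha_bar_t = 0.36
x_0 = 0.5
epsilon = -1.0

x_t = q_sample_pixel(x_0, epsilon, alpha_bar_t)  # 0.6 * 0.5 + 0.8 * (-1.0)

# If x_0 is known, the same formula can be rearranged to recover the noise
recovered = (x_t - math.sqrt(alpha_bar_t) * x_0) / math.sqrt(1 - alpha_bar_t)

print(f"x_t = {x_t:.4f}")
print(f"recovered epsilon = {recovered:.4f}")
```

The squared coefficients add up to 1 (0.36 + 0.64), which is what keeps the overall scale of $x_t$ stable as the signal is traded for noise.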

**Before running, predict:**
- At $t = 100$, most of the signal remains. What will the image look like? Will you clearly see the clothing item?
- At $t = 500$, the midpoint. What do you expect?
- At $t = 900$, almost all signal is gone. Can you still recognize anything?

In [None]:
# Grab a T-shirt image (index 0 in Fashion-MNIST is a T-shirt/top)
x_0, label = dataset[0]
print(f'Image: {class_names[label]}')
print(f'Shape: {x_0.shape}')  # [1, 28, 28]

# Sample noise (same shape as the image)
torch.manual_seed(42)
epsilon = torch.randn_like(x_0)

# Create noisy images at different timesteps
timesteps = [0, 100, 300, 500, 700, 900]

fig, axes = plt.subplots(1, len(timesteps), figsize=(15, 3))

for i, t in enumerate(timesteps):
    # The closed-form formula: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    signal_coeff = torch.sqrt(alpha_bar[t])
    noise_coeff = torch.sqrt(1 - alpha_bar[t])
    x_t = signal_coeff * x_0 + noise_coeff * epsilon

    # Display
    axes[i].imshow(x_t.squeeze(), cmap='gray', vmin=-2, vmax=2)
    axes[i].set_title(f't={t}\n\u03b1\u0305={alpha_bar[t]:.3f}', fontsize=10)
    axes[i].axis('off')

    # Print the coefficients so you can see the signal/noise balance
    print(f't={t:>3d} | signal coeff: {signal_coeff:.3f} | noise coeff: {noise_coeff:.3f} | \u03b1\u0305_t: {alpha_bar[t]:.4f}')

plt.suptitle('The same T-shirt at different noise levels', y=1.05, fontsize=13)
plt.tight_layout()
plt.show()

### What Just Happened

You used the closed-form formula as a tool, the same formula you derived in The Forward Process. Each timestep $t$ dials the signal-to-noise ratio:

- **t=0:** $\bar{\alpha}_0 = 1$. All signal, no noise. The T-shirt is crystal clear.
- **t=100:** $\bar{\alpha}_{100}$ is still high. A light dusting of static; the T-shirt is easily recognizable.
- **t=500:** Around the midpoint. Signal and noise are mixing. You can see a vague shape.
- **t=900:** $\bar{\alpha}_{900}$ is nearly zero. Almost pure noise. The T-shirt is barely visible (if at all).

During training, the network sees noisy images across this entire range. At low $t$ it must detect subtle perturbations. At high $t$ it must hallucinate plausible structure from near-pure static. **Same algorithm, same loss, vastly different difficulty.**

Notice: we jumped directly to each timestep. No iterating through intermediate steps. That is the power of the closed-form formula: it lets training "teleport" to any noise level.
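The "teleport" claim can be checked numerically. The sketch below, in plain Python, walks the chain one step at a time and compares the result with the closed form. It assumes the usual per-step form $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t$ with $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t-1}$; the helper `cosine_alpha_bar` is a scalar re-implementation of the schedule written for this example.

```python
import math

T = 1000

def cosine_alpha_bar(t, T=T, s=0.008):
    """Cosine schedule value at step t, normalized so alpha_bar(0) = 1."""
    def f(u):
        return math.cos(((u / T) + s) / (1 + s) * (math.pi / 2)) ** 2
    return f(t) / f(0)

target_t = 500

# Walk the chain step by step, tracking the coefficient on x_0
# and the variance of the accumulated noise.
signal_coeff = 1.0  # x_0 starts fully intact
noise_var = 0.0     # no noise yet
for t in range(1, target_t + 1):
    alpha_t = cosine_alpha_bar(t) / cosine_alpha_bar(t - 1)  # per-step retention
    signal_coeff *= math.sqrt(alpha_t)
    noise_var = alpha_t * noise_var + (1 - alpha_t)

# The closed form jumps straight to the same numbers
closed_signal = math.sqrt(cosine_alpha_bar(target_t))
closed_noise_var = 1 - cosine_alpha_bar(target_t)

print(f"iterated:    signal {signal_coeff:.6f}, noise variance {noise_var:.6f}")
print(f"closed form: signal {closed_signal:.6f}, noise variance {closed_noise_var:.6f}")
```

Both routes agree up to floating-point rounding, but the closed form needs one evaluation where the loop needs 500 steps.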

---

## Exercise 2: Compute the Loss by Hand (Guided)

The DDPM training loss is:

$$L = \frac{1}{n} \| \epsilon - \epsilon_\theta(x_t, t) \|^2 = \frac{1}{n} \sum_{i=1}^{n} (\epsilon_i - \hat{\epsilon}_i)^2, \qquad \hat{\epsilon} = \epsilon_\theta(x_t, t)$$

That is MSE loss. The same formula from Series 1 (Loss Functions), the same `nn.MSELoss` from Series 2 (Training Loop). The only difference is what goes in: the actual noise $\epsilon$ and the network's predicted noise $\hat{\epsilon}$.
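As a quick arithmetic warm-up, here is the same mean-of-squared-differences computed on four made-up values in plain Python. The numbers are purely illustrative and do not come from any image or network.

```python
# Four made-up noise values and four made-up predictions
epsilon = [0.5, -1.0, 2.0, 0.0]
epsilon_hat = [0.0, -1.0, 1.0, 1.0]

# Square each element-wise difference, then average
squared_errors = [(e - e_hat) ** 2 for e, e_hat in zip(epsilon, epsilon_hat)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [0.25, 0.0, 1.0, 1.0]
print(mse)             # 0.5625
```

The exercise below does exactly this, only with 784 values per image instead of four.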

We do not have a trained denoising network yet (that is the capstone lesson). Instead, we will use **random noise as the prediction**, a stand-in for an untrained network that guesses randomly.

**Before running, predict:**
- If the "predicted" noise is random (no relation to the actual noise), will the MSE be close to 0, close to 1, or close to 2?
- Hint: both $\epsilon$ and $\hat{\epsilon}$ are drawn from $\mathcal{N}(0, 1)$. The variance of their difference is the sum of their variances.

In [None]:
# Use the T-shirt image from Exercise 1
x_0, label = dataset[0]

# Step 1: Pick a timestep
t = 500

# Step 2: Sample the actual noise (the "answer key")
torch.manual_seed(123)
epsilon = torch.randn_like(x_0)

# Step 3: Create the noisy image using the closed-form formula
x_t = torch.sqrt(alpha_bar[t]) * x_0 + torch.sqrt(1 - alpha_bar[t]) * epsilon

# Step 4: Simulate a network prediction
# A real network would take x_t and t as inputs and output its noise prediction.
# We do not have a trained network, so we use random noise as a stand-in
# (an untrained network's output is effectively random).
torch.manual_seed(456)
epsilon_hat = torch.randn_like(x_0)  # "predicted" noise (random guess)

# Step 5: Compute MSE loss MANUALLY
# The formula: L = (1/n) * sum((epsilon_i - epsilon_hat_i)^2)
n = epsilon.numel()  # number of elements (1 * 28 * 28 = 784)
squared_errors = (epsilon - epsilon_hat) ** 2
manual_mse = squared_errors.sum() / n

print(f'Number of elements: {n}')
print(f'Manual MSE:         {manual_mse:.4f}')

# Step 6: Verify with nn.MSELoss, the same PyTorch loss from Series 2
criterion = nn.MSELoss()
pytorch_mse = criterion(epsilon_hat, epsilon)

print(f'nn.MSELoss:         {pytorch_mse.item():.4f}')
print(f'Match: {torch.allclose(manual_mse, pytorch_mse)}')

In [None]:
# Why is the MSE approximately 2?
# Both epsilon and epsilon_hat are drawn from N(0, 1).
# Their difference (epsilon - epsilon_hat) has:
#   mean = 0 - 0 = 0
#   variance = 1 + 1 = 2   (variances add for independent variables)
# MSE = E[(epsilon - epsilon_hat)^2] = Var(epsilon - epsilon_hat) = 2
#
# So MSE ~ 2 is what you get from a random guess.
# A trained model should push this MUCH lower.

print('--- Why MSE ~ 2 for random predictions ---')
print(f'Var(epsilon):                 {epsilon.var():.4f}   (should be ~1)')
print(f'Var(epsilon_hat):             {epsilon_hat.var():.4f}   (should be ~1)')
print(f'Var(epsilon - epsilon_hat):   {(epsilon - epsilon_hat).var():.4f}   (should be ~2)')
print(f'MSE (= mean of squared diff): {manual_mse:.4f}   (should be ~2)')
print()
print('A trained network would predict epsilon_hat close to the actual epsilon,')
print('driving MSE far below 2. That is the entire learning objective.')

### What Just Happened

You computed the DDPM training loss by hand and verified it matches `nn.MSELoss`. The formula is identical to every other MSE computation you have done in this course:

| Context | Prediction | Target |
|---------|-----------|--------|
| Linear regression (Series 1) | $\hat{y}$ (predicted price) | $y$ (actual price) |
| Autoencoder (Module 6.1) | $\hat{x}$ (reconstructed image) | $x$ (input image) |
| **DDPM (this lesson)** | $\hat{\epsilon}$ (predicted noise) | $\epsilon$ (actual noise) |

Same formula. Same `nn.MSELoss()`. Same gradients. Different question.

The random-guess MSE of ~2 is your baseline. A trained model should do much better because it learns the relationship between the noisy image $x_t$ and the noise $\epsilon$ that produced it.
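The baseline can be sanity-checked with a larger sample than one 784-pixel image. This plain-Python sketch draws 100,000 standard normal values; the seed and sample size are arbitrary choices. It also includes a second reference point, a predictor that always outputs zero, whose expected MSE is $\text{Var}(\epsilon) = 1$.

```python
import random

random.seed(0)  # arbitrary seed, only for repeatability
n = 100_000

epsilon = [random.gauss(0, 1) for _ in range(n)]       # actual noise
random_guess = [random.gauss(0, 1) for _ in range(n)]  # independent "prediction"

def mse(prediction, target):
    """Mean of squared element-wise differences."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

mse_random = mse(random_guess, epsilon)  # expected value: 1 + 1 = 2
mse_zero = mse([0.0] * n, epsilon)       # expected value: Var(epsilon) = 1

print(f"random guess: {mse_random:.3f}")
print(f"always zero:  {mse_zero:.3f}")
```

Always predicting zero already beats an independent random guess, so a useful denoiser would need to get below roughly 1, not just below 2.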

---

## Exercise 3: Write the DDPM Training Loop (Supported)

Now you will write the core of the DDPM training algorithm in code. We will not train a real model (there is no denoising network yet; that is the capstone lesson). Instead, you will write the **data preparation and loss computation** steps as a function.

**Note:** This exercise builds on the code patterns from Exercises 1 and 2. If you skipped them, make sure to run the Shared Setup cells first and review how the closed-form formula is applied in Exercise 1 and how MSE is computed in Exercise 2.

The DDPM training algorithm has seven steps:
1. **Sample** a training image $x_0$ from the dataset
2. **Sample** a random timestep $t \sim \text{Uniform}(1, T)$
3. **Sample** noise $\epsilon \sim \mathcal{N}(0, I)$
4. **Create** the noisy image: $x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon$
5. **Predict**: $\hat{\epsilon} = \text{network}(x_t, t)$
6. **Compute loss**: $L = \text{MSE}(\hat{\epsilon}, \epsilon)$
7. **Backpropagate** and update weights

Steps 1-4 are the diffusion-specific data preparation. Steps 5-7 are the standard training loop you already know.

**Task:** Fill in the three TODO markers below. Each is 1-2 lines of code.

<details>
<summary>💡 Solution</summary>

The key insight: the DDPM training loop is the standard training loop with diffusion-specific data preparation. The three TODOs are the three diffusion-specific steps (sample a timestep, sample noise, build the noisy image); everything else is the familiar forward-loss-backward-update pattern.

```python
# TODO 1: Sample a random timestep for each image in the batch
t = torch.randint(1, T + 1, (batch_size,))  # upper bound is exclusive, so T + 1 includes t = T

# TODO 2: Sample fresh noise (the "answer key")
epsilon = torch.randn_like(x_0)

# TODO 3: Create the noisy image using the closed-form formula
# Gather the right alpha_bar for each image's timestep, reshape for broadcasting
alpha_bar_t = alpha_bar[t].view(batch_size, 1, 1, 1)
x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * epsilon
```

Note the `.view(batch_size, 1, 1, 1)`: this reshapes `alpha_bar[t]` from `[batch_size]` to `[batch_size, 1, 1, 1]` so it broadcasts correctly against the image tensor `[batch_size, 1, 28, 28]`. Each image in the batch gets its own alpha_bar value based on its randomly sampled timestep.

</details>

In [None]:
def ddpm_training_step(model, x_0, alpha_bar, T, criterion):
    """One step of the DDPM training algorithm.

    Args:
        model: the denoising network (takes x_t and t, returns predicted noise)
        x_0: batch of clean images [batch_size, 1, 28, 28]
        alpha_bar: precomputed alpha_bar schedule [T+1]
        T: total number of timesteps
        criterion: nn.MSELoss()

    Returns:
        loss: the MSE between predicted and actual noise
    """
    batch_size = x_0.shape[0]

    # Step 1: x_0 is already provided (the batch of clean images)

    # Step 2: Sample a random timestep for each image in the batch
    # Each image gets its own random t from Uniform(1, T)
    # TODO: t = ???

    # Step 3: Sample noise, the "answer key" for this training step
    # Fresh Gaussian noise, same shape as x_0
    # TODO: epsilon = ???

    # Step 4: Create the noisy image using the closed-form formula
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    # Hint: alpha_bar[t] gives a value per image. You need to reshape it
    # to [batch_size, 1, 1, 1] so it broadcasts against [batch_size, 1, 28, 28]
    # TODO: alpha_bar_t = ??? and x_t = ???

    # Step 5: Predict the noise (forward pass)
    epsilon_hat = model(x_t, t)

    # Step 6: Compute MSE loss (same loss from Series 1)
    loss = criterion(epsilon_hat, epsilon)

    return loss

print('ddpm_training_step defined.')
print('(We cannot run it yet: we have no denoising network.)')
print('But the structure is the complete DDPM training algorithm.')

In [None]:
# Verify your data preparation steps work with a dummy model.
# This "model" just returns random noise; it is not a real denoising network.
# The point is to confirm your steps 2-4 produce tensors of the right shape.

class DummyModel(nn.Module):
    """A fake denoising network that returns random noise.
    Real architecture comes in Module 6.3."""
    def forward(self, x_t, t):
        return torch.randn_like(x_t)

# Grab a small batch
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)
batch_images, batch_labels = next(iter(loader))

print(f'Batch shape: {batch_images.shape}')  # [8, 1, 28, 28]

# Run one training step
dummy_model = DummyModel()
criterion = nn.MSELoss()

loss = ddpm_training_step(dummy_model, batch_images, alpha_bar, T, criterion)
print(f'Loss: {loss.item():.4f}')
print(f'Expected: ~2.0 (random predictions vs random noise)')
print()
if 1.0 < loss.item() < 3.0:
    print('Loss is in the expected range. Your training step is correct!')
else:
    print('Loss is outside the expected range. Check your TODOs.')

In [None]:
# Now see what a full training loop looks like.
# This is the standard training pattern; the heartbeat hasn't changed.

print('--- The Full DDPM Training Loop (pseudocode in real Python) ---')
print()

# This code runs, but it WON'T train anything (we use a dummy model and the
# optimizer calls are commented out). The structure is exactly what a real
# implementation looks like. Compare to the training loops you wrote in Series 2.

dummy_model = DummyModel()
# In a real implementation:
#   model = UNet(...)  # the denoising architecture (Module 6.3)
#   optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.MSELoss()

# Simulate 3 "training steps" with our dummy model
for step in range(3):
    batch_images, _ = next(iter(loader))

    # The DDPM training step: all the diffusion-specific work is here
    loss = ddpm_training_step(dummy_model, batch_images, alpha_bar, T, criterion)

    # Standard training loop continues:
    # optimizer.zero_grad()
    # loss.backward()
    # optimizer.step()

    print(f'Step {step + 1} | Loss: {loss.item():.4f}')

print()
print('The heartbeat of training (forward -> loss -> backward -> update)')
print('is identical to every training loop since Series 2.')
print('The only new part: how you prepare the data (steps 2-4).')

### What Just Happened

You wrote the DDPM training algorithm in code. The three TODOs were the three diffusion-specific steps:

1. Sample a random timestep for each image (`torch.randint`)
2. Sample noise, the answer key (`torch.randn_like`)
3. Create the noisy image using the closed-form formula

Everything after that (the forward pass, MSE loss, backprop, optimizer step) is the standard training loop you have used since Series 2. The heartbeat has not changed.

Key detail: each image in the batch gets its **own random timestep**. In one batch, the model might see $t=42$ for one image and $t=891$ for another. No sequential order. This is why the closed-form formula matters: it lets you teleport to any noise level without iterating through intermediate steps.
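To see the per-image timestep idea without tensors, here is a plain-Python sketch of steps 2-4 on a toy batch. Everything in it is illustrative: the "images" are made-up lists of four pixel values, and `prepare_batch` and `cosine_alpha_bar` are hypothetical helpers written for this example. Note that Python's `random.randint(1, T)` includes both endpoints, whereas the upper bound of `torch.randint` is exclusive.

```python
import math
import random

T = 1000

def cosine_alpha_bar(t, T=T, s=0.008):
    """Cosine schedule value at step t, normalized so alpha_bar(0) = 1."""
    def f(u):
        return math.cos(((u / T) + s) / (1 + s) * (math.pi / 2)) ** 2
    return f(t) / f(0)

def prepare_batch(batch, rng):
    """Steps 2-4 for a batch of flattened images (lists of pixel values)."""
    prepared = []
    for x_0 in batch:
        t = rng.randint(1, T)                     # own timestep, inclusive on both ends
        epsilon = [rng.gauss(0, 1) for _ in x_0]  # fresh noise for this image
        a = cosine_alpha_bar(t)
        x_t = [math.sqrt(a) * p + math.sqrt(1 - a) * e
               for p, e in zip(x_0, epsilon)]
        prepared.append((t, x_t, epsilon))
    return prepared

rng = random.Random(0)  # arbitrary seed
toy_batch = [[-1.0, 0.2, 0.9, -0.4] for _ in range(4)]  # four identical toy "images"

for t, x_t, epsilon in prepare_batch(toy_batch, rng):
    print(t, [round(v, 2) for v in x_t])
```

The four inputs are identical, yet each row comes out different because each one receives its own timestep and its own noise. The `.view(batch_size, 1, 1, 1)` broadcast achieves the same per-image scaling in a single tensor operation.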

---

## Exercise 4: Predict the Loss Landscape (Independent)

You now understand the training algorithm. Here is a question about how a **trained** model would behave:

**Question:** Imagine a well-trained denoising model. If you compute the average MSE loss at each timestep separately, which timesteps will have the **highest** loss? Which will have the **lowest**?

Think about what the noisy image $x_t$ looks like at different noise levels, and how hard it is for the network to predict the noise $\epsilon$ from it.

Consider three regimes:
- **Low $t$ (e.g., $t = 50$):** The image is nearly clean, with a light dusting of noise.
- **Middle $t$ (e.g., $t = 500$):** Roughly equal parts signal and noise.
- **High $t$ (e.g., $t = 950$):** Nearly pure noise, almost no signal left.

Write your prediction in the cell below, then run the simulation to check.

**Note:** We use a dummy model to illustrate the *shape* of the loss landscape by measuring how much the noisy image "looks like" the noise versus the original signal. A real trained model would show a similar pattern.

In [None]:
# YOUR PREDICTION
# Before running the next cell, write your prediction here:
#
# Which timesteps will have the HIGHEST loss (hardest for the model)?
# Your answer: ...
#
# Which timesteps will have the LOWEST loss (easiest for the model)?
# Your answer: ...
#
# Why?
# Your reasoning: ...

In [None]:
# Let's measure the difficulty at each noise level.
#
# We'll use a simple heuristic as a stand-in for a trained model.
#
# Given x_t AND x_0, you could recover epsilon exactly by rearranging
# the closed-form formula (an "oracle" that has the answer key):
#   epsilon = (x_t - sqrt(alpha_bar_t) * x_0) / sqrt(1 - alpha_bar_t)
#
# But the model doesn't have x_0! It must guess from x_t alone.
# The heuristic below ignores x_0 entirely. At low noise, the signal
# dominates x_t, so the small perturbation is hard to isolate. At high
# noise, x_t is almost pure epsilon, so the noise is "obvious" but the
# signal needed to correct the prediction is buried.

torch.manual_seed(42)

# Use a batch of images
test_loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)
test_batch, _ = next(iter(test_loader))

# For each timestep, compute the "naive predictor" loss:
# Predict epsilon = 0 (the prior mean). This gives loss = ||epsilon||^2 / n = 1.0 everywhere.
# A better predictor: use the noisy image itself as a clue.
# At high t, x_t ~ epsilon, so predicting epsilon_hat = x_t / sqrt(1 - alpha_bar_t) is good.
# At low t, x_t ~ x_0, so the noise is invisible and prediction is hard.

timestep_range = torch.arange(1, T, 10)  # sample every 10th timestep
losses_at_t = []

for t_val in timestep_range:
    t = t_val.item()
    epsilon = torch.randn_like(test_batch)

    ab_t = alpha_bar[t]
    x_t = torch.sqrt(ab_t) * test_batch + torch.sqrt(1 - ab_t) * epsilon

    # Simple estimator: ignore the image content entirely (treat x_0 as 0).
    # Then x_t ~ sqrt(1 - alpha_bar_t) * epsilon, so:
    # epsilon_hat = x_t / sqrt(1 - alpha_bar_t)
    # This works well at high noise but fails at low noise where x_t ~ x_0
    noise_scale = torch.sqrt(1 - ab_t)
    if noise_scale > 1e-6:
        epsilon_hat = x_t / noise_scale
    else:
        epsilon_hat = torch.zeros_like(epsilon)

    mse = nn.functional.mse_loss(epsilon_hat, epsilon).item()
    losses_at_t.append(mse)

# Plot the loss landscape
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(timestep_range.numpy(), losses_at_t, color='#c084fc', linewidth=2)
ax.set_xlabel('Timestep t', fontsize=12)
ax.set_ylabel('MSE Loss', fontsize=12)
ax.set_title('Denoising difficulty across timesteps', fontsize=13)
ax.axhline(y=2.0, color='gray', linestyle='--', alpha=0.5, label='Random guess baseline (MSE=2)')

# Annotate the regions
ax.annotate('Low noise\n(subtle perturbations)', xy=(50, losses_at_t[4]),
            fontsize=9, ha='left', color='#86efac')
ax.annotate('High noise\n(almost pure noise)', xy=(850, losses_at_t[-15]),
            fontsize=9, ha='right', color='#fca5a5')

ax.legend(fontsize=10)
plt.tight_layout()
plt.show()

print(f'Loss at t=1:   {losses_at_t[0]:.2f}  (low noise: hard to detect subtle perturbations)')
print(f'Loss at t=501: {losses_at_t[len(losses_at_t)//2]:.2f}  (medium noise: mixed signal and noise)')
print(f'Loss at t=991: {losses_at_t[-1]:.2f}  (high noise: noise dominates, easier to estimate)')

<details>
<summary>💡 Solution</summary>

**Middle timesteps tend to have the highest loss.** Here is the reasoning for each regime:

**Low $t$ (easy in a different way):** The image is nearly clean. The noise is a tiny perturbation, like a light dusting of static on a clear photo. The network can see the image clearly, but the noise is so subtle that precisely predicting every noise value is difficult. However, the noise *magnitude* is small, so the squared errors are also small. Low MSE.

**High $t$ (somewhat predictable):** The image is almost pure noise. There is very little signal left. But this means $x_t \approx \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon$, so the noisy image *is* mostly the noise (scaled). The network can make a reasonable guess just from $x_t$ itself. Moderate MSE.

**Middle $t$ (hardest):** This is where signal and noise are roughly balanced. The network cannot ignore either component. It must simultaneously understand the image content (to subtract the signal) AND estimate the noise pattern. This ambiguity creates the most room for error. Highest MSE.

The intuition: at the extremes, one of the two components (signal or noise) dominates and the problem simplifies. In the middle, neither dominates: maximum ambiguity, maximum difficulty.

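Note: In practice, a real trained model's loss landscape is shaped by the specific architecture, dataset, noise schedule, and training dynamics, so the exact curve varies. Treat the pattern described here (middle timesteps being hardest) as a working intuition rather than a guarantee. The simulation cell uses a simple stand-in estimator, not a trained network, so its plot illustrates how difficulty changes with noise level but is not a measurement of a trained model's loss.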
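The stand-in estimator from the simulation cell, $\hat{\epsilon} = x_t / \sqrt{1 - \bar{\alpha}_t}$, can be analyzed exactly. Substituting the closed-form formula gives $\hat{\epsilon} - \epsilon = \sqrt{\bar{\alpha}_t / (1 - \bar{\alpha}_t)} \cdot x_0$, so its MSE is $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$ times the mean squared pixel value of $x_0$. The plain-Python sketch below tabulates that factor at a few timesteps; `cosine_alpha_bar` and `estimator_error_factor` are hypothetical helper names.

```python
import math

T = 1000

def cosine_alpha_bar(t, T=T, s=0.008):
    """Cosine schedule value at step t, normalized so alpha_bar(0) = 1."""
    def f(u):
        return math.cos(((u / T) + s) / (1 + s) * (math.pi / 2)) ** 2
    return f(t) / f(0)

def estimator_error_factor(t):
    """alpha_bar_t / (1 - alpha_bar_t): multiplies mean(x_0^2) in the estimator's MSE."""
    a = cosine_alpha_bar(t)
    return a / (1 - a)

for t in (1, 101, 501, 901, 991):
    print(f"t={t:>3d}  factor = {estimator_error_factor(t):.4g}")
```

The factor shrinks steadily as $t$ grows, so this particular heuristic is worst at low noise and best at high noise. Any peak at middle timesteps would have to come from how a trained network behaves, which this stand-in does not capture.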

</details>

---

## Key Takeaways

1. **The closed-form formula is a training tool.** You use $x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon$ to create noisy images at arbitrary timesteps, with no iterating through intermediate steps.

2. **The DDPM loss is MSE.** The same formula, the same `nn.MSELoss()`, the same gradients as every MSE computation since Series 1. The only difference: prediction = predicted noise, target = actual noise.

3. **The training loop is the standard loop with diffusion-specific data preparation.** Steps 1-4 (sample image, sample timestep, sample noise, create noisy image) are new. Steps 5-7 (forward pass, loss, backward, update) are the same heartbeat from Series 2.

4. **Each image in a batch gets a random timestep.** No sequential order. The closed-form formula lets training teleport to any noise level efficiently.

5. **Difficulty varies with the timestep.** Low noise = subtle perturbations. High noise = hallucinating structure. Middle noise = maximum ambiguity. Same algorithm handles all of them.

**Mental model:** Same building blocks, different question. The building blocks: MSE loss, backprop, gradient descent. The question: "what noise was added to this image?"