# Sampling and Generation

**Module 6.2, Lesson 4** | CourseAI

In this notebook you will compute the reverse step coefficients by hand, trace a single reverse step with real numbers, compare one-shot denoising to iterative denoising, and implement the full DDPM sampling loop.

**What you'll do:**
- Compute the scaling factor, noise correction, and noise injection coefficients at various timesteps
- Trace one reverse step at t=500 numerically—plug in values, compute x_{499}, and visualize the result
- Compare one-shot denoising (blurry failure) to multi-step iterative denoising (structured output)
- Implement the full sampling loop using a dummy "oracle" model and visualize the denoising trajectory

**For each exercise, PREDICT the output before running the cell.**

**Estimated time:** 20-30 minutes.

---

## Setup

Run this cell to import everything and configure the environment.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import math

# Reproducible results
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

print('Setup complete.')

## Shared Setup: Data and Noise Schedule

We load Fashion-MNIST and define the cosine noise schedule—the same setup from the Learning to Denoise notebook.

In [None]:
# Load Fashion-MNIST
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Normalize to [-1, 1]
])

dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transform
)

class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

print(f'Dataset size: {len(dataset)}')
print(f'Image shape: {dataset[0][0].shape}')
print(f'Pixel range: [{dataset[0][0].min():.1f}, {dataset[0][0].max():.1f}]')

In [None]:
# Cosine noise schedule (same as Learning to Denoise notebook)
T = 1000

def cosine_alpha_bar_schedule(T, s=0.008):
    """Compute alpha_bar for each timestep using the cosine schedule.
    
    alpha_bar_t = how much original signal remains at step t.
    At t=0, alpha_bar ~ 1 (all signal). At t=T, alpha_bar ~ 0 (all noise).
    """
    steps = torch.arange(T + 1, dtype=torch.float32)
    f = torch.cos(((steps / T) + s) / (1 + s) * (math.pi / 2)) ** 2
    alpha_bar = f / f[0]
    return alpha_bar

alpha_bar = cosine_alpha_bar_schedule(T)

# Derive the per-step quantities the reverse formula needs:
# alpha_t = alpha_bar_t / alpha_bar_{t-1}
# beta_t = 1 - alpha_t
# sigma_t = sqrt(beta_t)

# alpha_bar has T+1 entries (indices 0 through T)
# alpha_t is defined for t=1..T as: alpha_t = alpha_bar[t] / alpha_bar[t-1]
alpha = torch.zeros(T + 1)
alpha[0] = 1.0  # not used in sampling, but keeps indexing clean
alpha[1:] = alpha_bar[1:] / alpha_bar[:-1]

beta = 1.0 - alpha
sigma = torch.sqrt(beta)

print(f'alpha_bar shape: {alpha_bar.shape}')
print(f'alpha_bar[0]   = {alpha_bar[0]:.4f}   (clean image)')
print(f'alpha_bar[500] = {alpha_bar[500]:.4f}  (midpoint)')
print(f'alpha_bar[999] = {alpha_bar[999]:.6f}  (near pure noise)')
print()
print(f'beta[500]  = {beta[500]:.6f}')
print(f'sigma[500] = {sigma[500]:.6f}')
print(f'alpha[500] = {alpha[500]:.6f}')

---

## Exercise 1: Reverse Step Coefficients (Guided)

The DDPM reverse step formula is:

$$x_{t-1} = \underbrace{\frac{1}{\sqrt{\alpha_t}}}_{\text{scaling factor}} \left( x_t - \underbrace{\frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}}_{\text{noise correction}} \cdot \epsilon_\theta(x_t, t) \right) + \underbrace{\sigma_t}_{\text{noise injection}} \cdot z$$

Each of these three coefficients changes across the schedule. Let us compute them at different timesteps and see how the algorithm behaves differently at different noise levels.

**Before running, predict:**
- At t=950 (near pure noise), is the noise correction coefficient large or small? Is the model making bold changes or tiny refinements?
- At t=50 (near clean), is the noise correction large or small?
- Does the scaling factor $\frac{1}{\sqrt{\alpha_t}}$ change dramatically across timesteps, or stay close to 1?

In [None]:
# Compute the three reverse step coefficients at several timesteps
timesteps_to_check = [950, 800, 500, 200, 50]

print(f'{"t":>5} | {"1/sqrt(a_t)":>12} | {"beta_t/sqrt(1-ab_t)":>20} | {"sigma_t":>10}')
print('-' * 60)

scaling_factors = []
noise_corrections = []
noise_injections = []

for t in timesteps_to_check:
    # The three coefficients from the reverse step formula
    scaling_factor = 1.0 / torch.sqrt(alpha[t])
    noise_correction = beta[t] / torch.sqrt(1.0 - alpha_bar[t])
    noise_injection = sigma[t]
    
    scaling_factors.append(scaling_factor.item())
    noise_corrections.append(noise_correction.item())
    noise_injections.append(noise_injection.item())
    
    print(f'{t:>5} | {scaling_factor:>12.6f} | {noise_correction:>20.6f} | {noise_injection:>10.6f}')

print()
print('Observations:')
print(f'  Scaling factor range: {min(scaling_factors):.4f} to {max(scaling_factors):.4f} (always close to 1)')
print(f'  Noise correction range: {min(noise_corrections):.6f} to {max(noise_corrections):.6f}')
print(f'  Noise injection range: {min(noise_injections):.6f} to {max(noise_injections):.6f}')

In [None]:
# Visualize how the coefficients change across the full schedule
all_t = torch.arange(1, T + 1)
all_scaling = 1.0 / torch.sqrt(alpha[1:])
all_noise_corr = beta[1:] / torch.sqrt(1.0 - alpha_bar[1:])
all_noise_inj = sigma[1:]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(all_t.numpy(), all_scaling.numpy(), color='#c084fc', linewidth=2)
axes[0].set_title('Scaling factor\n$1 / \\sqrt{\\alpha_t}$', fontsize=11)
axes[0].set_xlabel('Timestep t')
axes[0].set_ylabel('Value')

axes[1].plot(all_t.numpy(), all_noise_corr.numpy(), color='#fca5a5', linewidth=2)
axes[1].set_title('Noise correction\n$\\beta_t / \\sqrt{1 - \\bar{\\alpha}_t}$', fontsize=11)
axes[1].set_xlabel('Timestep t')
axes[1].set_ylabel('Value')

axes[2].plot(all_t.numpy(), all_noise_inj.numpy(), color='#86efac', linewidth=2)
axes[2].set_title('Noise injection\n$\\sigma_t = \\sqrt{\\beta_t}$', fontsize=11)
axes[2].set_xlabel('Timestep t')
axes[2].set_ylabel('Value')

for ax in axes:
    ax.axvline(x=500, color='gray', linestyle='--', alpha=0.3)

plt.suptitle('Reverse step coefficients across the schedule', y=1.02, fontsize=13)
plt.tight_layout()
plt.show()

### What Just Happened

You computed the three terms of the reverse step formula across the noise schedule. Key observations:

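- **Scaling factor** ($1/\sqrt{\alpha_t}$): Close to 1 at every timestep checked above. It compensates for how much the signal was scaled down at each forward step, and since $\alpha_t$ is close to 1 for most of the schedule, the factor barely moves. The exception is the last few steps before $t = T$, where the cosine schedule drives $\alpha_t$ toward 0 and the factor grows sharply (the spike at the right edge of the plot).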

- **Noise correction** ($\beta_t / \sqrt{1 - \bar{\alpha}_t}$): This is the interesting one. At high $t$ (near pure noise), the correction is larger—the model is making bold structural decisions. At low $t$ (near clean), the correction is tiny—the model is polishing fine details.

- **Noise injection** ($\sigma_t = \sqrt{\beta_t}$): Fresh noise added at each step for exploration. Larger at high $t$ where there is more room for diversity, smaller at low $t$ where the image is nearly committed.

Same formula at every step. The coefficients change with the schedule, and the model adapts via the timestep embedding—but the algorithm is identical.
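
As a cross-check on the table above, here is a minimal dependency-free sketch that recomputes the three coefficients at a low and a high timestep. It assumes the same cosine schedule constants as the notebook (`T = 1000`, `s = 0.008`); the helper names `alpha_bar_at` and `reverse_coefficients` are illustrative and not part of the notebook.

```python
import math

T, S = 1000, 0.008  # assumed to match the notebook's cosine schedule


def _f(u):
    """Unnormalized cosine schedule value at (possibly fractional) step u."""
    return math.cos((u / T + S) / (1 + S) * math.pi / 2) ** 2


def alpha_bar_at(t):
    """Cosine-schedule alpha_bar_t, normalized so alpha_bar_0 = 1."""
    return _f(t) / _f(0)


def reverse_coefficients(t):
    """Return (scaling, noise_correction, sigma) for a timestep t >= 1."""
    alpha_t = alpha_bar_at(t) / alpha_bar_at(t - 1)
    beta_t = 1.0 - alpha_t
    scaling = 1.0 / math.sqrt(alpha_t)
    noise_correction = beta_t / math.sqrt(1.0 - alpha_bar_at(t))
    sigma_t = math.sqrt(beta_t)
    return scaling, noise_correction, sigma_t


for t in (50, 950):
    scaling, corr, sigma_t = reverse_coefficients(t)
    print(f"t={t:>3}: scaling={scaling:.4f}  correction={corr:.4f}  sigma={sigma_t:.4f}")
```

Both the noise correction and the noise injection should come out noticeably larger at `t=950` than at `t=50`, matching the trend in the plots.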

---

## Exercise 2: Trace One Reverse Step at t=500 (Guided)

Now let us apply the reverse step formula to an actual image. We will:
1. Start with a clean T-shirt image $x_0$
2. Use the forward process formula to create a noisy image $x_{500}$
3. Pretend we have a perfect "oracle" model that knows the exact noise $\epsilon$ (in training, this was the answer key—in real sampling, the model only has its prediction $\epsilon_\theta$)
4. Apply one reverse step to get $x_{499}$
5. See if $x_{499}$ is slightly cleaner than $x_{500}$

**Before running, predict:**
- At t=500, the image is heavily noisy. After one reverse step, will x_499 look dramatically cleaner, slightly cleaner, or about the same?
- The scaling factor is ~1 and the noise correction is small. What does that tell you about the size of each step?

In [None]:
# Start with a T-shirt image
x_0, label = dataset[0]
print(f'Image: {class_names[label]}')

# Step 1: Create x_500 using the forward process closed-form formula
t = 500
torch.manual_seed(42)
epsilon = torch.randn_like(x_0)  # the true noise (our "answer key")

x_t = torch.sqrt(alpha_bar[t]) * x_0 + torch.sqrt(1 - alpha_bar[t]) * epsilon
print(f'Created x_{t} using forward formula')
print(f'  alpha_bar[{t}] = {alpha_bar[t]:.4f}')
print(f'  signal coeff = {torch.sqrt(alpha_bar[t]):.4f}')
print(f'  noise coeff  = {torch.sqrt(1 - alpha_bar[t]):.4f}')

# Step 2: Apply one reverse step using the ORACLE (true epsilon)
# In real sampling, the model would predict epsilon_theta.
# Here we cheat and use the true noise to verify the formula works.
epsilon_theta = epsilon  # oracle: perfect prediction

# The reverse step formula:
# x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (beta_t/sqrt(1-alpha_bar_t)) * epsilon_theta) + sigma_t * z
torch.manual_seed(123)
z = torch.randn_like(x_0)  # fresh noise for stochastic sampling

scaling = 1.0 / torch.sqrt(alpha[t])
noise_corr = beta[t] / torch.sqrt(1.0 - alpha_bar[t])

x_t_minus_1 = scaling * (x_t - noise_corr * epsilon_theta) + sigma[t] * z

print(f'\nReverse step coefficients at t={t}:')
print(f'  scaling factor = {scaling:.6f}')
print(f'  noise correction = {noise_corr:.6f}')
print(f'  noise injection (sigma) = {sigma[t]:.6f}')

# Step 3: What SHOULD x_499 look like?
# If the formula is correct, x_499 should be approximately what the forward
# process would produce at t=499 from the same x_0.
x_499_forward = torch.sqrt(alpha_bar[t - 1]) * x_0 + torch.sqrt(1 - alpha_bar[t - 1]) * epsilon

# Visualize all four
fig, axes = plt.subplots(1, 4, figsize=(14, 3.5))

images = [x_0, x_t, x_t_minus_1, x_499_forward]
titles = [
    f'x_0\n(clean)',
    f'x_{t}\n(noisy input)',
    f'x_{t-1} (reverse step)\n(formula output)',
    f'x_{t-1} (forward ref)\n(what t={t-1} should look like)',
]

for i, (img, title) in enumerate(zip(images, titles)):
    axes[i].imshow(img.squeeze(), cmap='gray', vmin=-2, vmax=2)
    axes[i].set_title(title, fontsize=9)
    axes[i].axis('off')

plt.suptitle('One reverse step: x_500 -> x_499', y=1.05, fontsize=13)
plt.tight_layout()
plt.show()

# How similar is the reverse step output to what we expect?
# (Not exact because of the stochastic z term)
mse_vs_forward = torch.mean((x_t_minus_1 - x_499_forward) ** 2).item()
print(f'MSE between reverse step x_499 and forward reference x_499: {mse_vs_forward:.4f}')
print(f'(Not zero because of the stochastic noise injection sigma_t * z)')

### What Just Happened

You traced one reverse step of the DDPM sampling algorithm. Here is what each piece did:

1. **Created $x_{500}$** from the clean image using the forward process formula (same closed-form teleportation from The Forward Process).
2. **Applied the reverse step formula** using the oracle's perfect noise prediction. The formula removed a small amount of the predicted noise and added a small amount of fresh noise.
3. **Compared** the result to what the forward process would produce at $t=499$. They are similar but not identical, because the stochastic $\sigma_t \cdot z$ term adds fresh randomness.

The key insight: **one step barely changes the image**. $x_{499}$ looks almost identical to $x_{500}$. That is by design—each step makes a tiny correction. You need 1,000 of these tiny corrections to go from pure noise to a clean image.

Also notice: we used the **oracle** (true $\epsilon$) here. In real sampling, the model only has its imperfect prediction $\epsilon_\theta$. That is why iterative small steps matter—each step's error is tiny, and the model gets a fresh look at a slightly cleaner image next time.
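
To see how small one step is without any image data, the following sketch traces the deterministic part of the reverse step (with $z = 0$) for a single pixel. The pixel value `x0 = 0.5` and noise draw `eps = 1.0` are hypothetical numbers chosen for illustration, and the schedule constants are assumed to match the notebook.

```python
import math

T, S = 1000, 0.008  # assumed to match the notebook's cosine schedule


def _f(u):
    return math.cos((u / T + S) / (1 + S) * math.pi / 2) ** 2


def alpha_bar_at(t):
    """Cosine-schedule alpha_bar_t, normalized so alpha_bar_0 = 1."""
    return _f(t) / _f(0)


t = 500
x0, eps = 0.5, 1.0  # hypothetical pixel value and noise draw

ab_t = alpha_bar_at(t)
alpha_t = ab_t / alpha_bar_at(t - 1)
beta_t = 1.0 - alpha_t

# Forward process: jump straight from x_0 to x_t
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps

# Reverse step using the oracle's eps, with z = 0 (deterministic part only)
x_prev_mean = (x_t - beta_t / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alpha_t)

print(f"x_t          = {x_t:.4f}")
print(f"x_(t-1) mean = {x_prev_mean:.4f}")
print(f"change       = {x_prev_mean - x_t:+.4f}")
```

The change is a small fraction of the distance between `x_t` and `x0`, which is the single-pixel version of "one step barely changes the image."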

---

## Exercise 3: One-Shot vs Multi-Step Denoising (Supported)

In the web lesson, you learned that one-shot denoising—jumping directly from $x_t$ to $\hat{x}_0$ in a single step—fails. The result is a foggy smear because a single imperfect prediction cannot recover all the lost information.

Now you will see this firsthand. You will:
1. Implement the one-shot formula (rearranging the forward process to solve for $x_0$)
2. Implement a multi-step reverse process using the oracle
3. Compare the results

We use the forward process closed-form formula to simulate a "perfect oracle" model: given $x_0$, we can compute the true noise at any timestep. This lets us isolate the effect of iterating vs not iterating, without any model error.

**Fill in the two TODO markers.** Each is 1-2 lines.

<details>
<summary>💡 Solution</summary>

The key insight: one-shot denoising rearranges the forward formula to solve for $x_0$ directly, while multi-step denoising applies the reverse step formula repeatedly. With a perfect oracle, the one-shot approach works fine. With an imperfect model, the multi-step approach is far better because the error introduced at each step is small.

```python
# TODO 1: One-shot denoising
x_0_oneshot = (x_t - torch.sqrt(1 - alpha_bar[start_t]) * epsilon_theta) / torch.sqrt(alpha_bar[start_t])

# TODO 2: Apply the reverse step formula
x_curr = scaling * (x_curr - noise_corr * eps_pred) + sigma[t] * z
```

Note: for TODO 2, the reverse step formula is the same one from Exercise 2, just inside a loop.

</details>

In [None]:
# Start from a heavily noisy image
x_0, label = dataset[0]
start_t = 800

torch.manual_seed(42)
true_epsilon = torch.randn_like(x_0)

# Create x_t at t=start_t
x_t = torch.sqrt(alpha_bar[start_t]) * x_0 + torch.sqrt(1 - alpha_bar[start_t]) * true_epsilon


# --- Method 1: One-shot denoising ---
# Rearrange the forward formula: x_0 = (x_t - sqrt(1 - alpha_bar_t) * epsilon) / sqrt(alpha_bar_t)
# Use the TRUE epsilon (oracle) for a fair comparison.
epsilon_theta = true_epsilon

# TODO 1: Compute the one-shot estimate of x_0
# Use the rearranged forward formula with the oracle's noise prediction.
# x_0_oneshot = ???


# --- Method 2: Multi-step reverse process ---
# Apply the reverse step formula from t=start_t down to t=1.
# For each step, use the oracle to get the true noise at that timestep.
x_curr = x_t.clone()
num_steps = start_t

torch.manual_seed(999)  # fresh seed for the z noise in reverse steps

for t in range(start_t, 0, -1):
    # Oracle: recover the noise implied by x_curr, given that we know x_0.
    # Rearranging the closed form x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps
    # gives eps = (x_t - sqrt(ab_t) * x_0) / sqrt(1 - ab_t).
    # This is cheating: a real model would predict epsilon from x_curr and t alone.
    eps_pred = (x_curr - torch.sqrt(alpha_bar[t]) * x_0) / torch.sqrt(1 - alpha_bar[t])
    
    # Reverse step coefficients
    scaling = 1.0 / torch.sqrt(alpha[t])
    noise_corr = beta[t] / torch.sqrt(1.0 - alpha_bar[t])
    
    # Sample z (zero at the final step t=1)
    z = torch.randn_like(x_curr) if t > 1 else torch.zeros_like(x_curr)
    
    # TODO 2: Apply the reverse step formula to get x_{t-1}
    # x_curr = ???
    pass

x_0_multistep = x_curr


# --- Visualize the comparison ---
fig, axes = plt.subplots(1, 4, figsize=(14, 3.5))

images = [x_0, x_t, x_0_oneshot, x_0_multistep]
titles = [
    'Original x_0',
    f'Noisy x_{start_t}',
    'One-shot denoising\n(single formula)',
    f'Multi-step denoising\n({num_steps} reverse steps)',
]

for i, (img, title) in enumerate(zip(images, titles)):
    axes[i].imshow(img.squeeze().clamp(-2, 2), cmap='gray', vmin=-2, vmax=2)
    axes[i].set_title(title, fontsize=10)
    axes[i].axis('off')

plt.suptitle('One-shot vs multi-step denoising (oracle model)', y=1.05, fontsize=13)
plt.tight_layout()
plt.show()

# Quantitative comparison
mse_oneshot = torch.mean((x_0_oneshot - x_0) ** 2).item()
mse_multistep = torch.mean((x_0_multistep - x_0) ** 2).item()
print(f'MSE (one-shot vs original):    {mse_oneshot:.4f}')
print(f'MSE (multi-step vs original):  {mse_multistep:.4f}')

### What Just Happened

With a **perfect oracle**, both methods recover the image well. The one-shot method works here because the oracle knows the exact noise. But remember: in real sampling, the model does NOT know the exact noise—it only predicts $\epsilon_\theta$, which is imperfect.

Now let us see what happens with an **imperfect** predictor. We will add a small amount of noise to the oracle's prediction, simulating a model that is good but not perfect.

In [None]:
# Simulate an IMPERFECT model by adding noise to the oracle's prediction
# This is the realistic case: the model's prediction is close but not perfect.
noise_level = 0.3  # how imperfect the model is

# --- One-shot with imperfect model ---
torch.manual_seed(77)
noisy_pred = true_epsilon + noise_level * torch.randn_like(true_epsilon)
x_0_oneshot_noisy = (x_t - torch.sqrt(1 - alpha_bar[start_t]) * noisy_pred) / torch.sqrt(alpha_bar[start_t])

# --- Multi-step with imperfect model ---
x_curr = x_t.clone()
torch.manual_seed(999)

for t in range(start_t, 0, -1):
    # Oracle prediction with added noise (simulating model imperfection)
    true_eps_at_t = (x_curr - torch.sqrt(alpha_bar[t]) * x_0) / torch.sqrt(1 - alpha_bar[t])
    eps_pred = true_eps_at_t + noise_level * torch.randn_like(x_curr)
    
    scaling = 1.0 / torch.sqrt(alpha[t])
    noise_corr = beta[t] / torch.sqrt(1.0 - alpha_bar[t])
    z = torch.randn_like(x_curr) if t > 1 else torch.zeros_like(x_curr)
    
    x_curr = scaling * (x_curr - noise_corr * eps_pred) + sigma[t] * z

x_0_multistep_noisy = x_curr

# Visualize
fig, axes = plt.subplots(1, 4, figsize=(14, 3.5))

images = [x_0, x_t, x_0_oneshot_noisy, x_0_multistep_noisy]
titles = [
    'Original x_0',
    f'Noisy x_{start_t}',
    'One-shot (imperfect)\nBLURRY / BROKEN',
    f'Multi-step (imperfect)\n{start_t} reverse steps',
]
colors = ['white', 'white', '#fca5a5', '#86efac']

for i, (img, title, color) in enumerate(zip(images, titles, colors)):
    axes[i].imshow(img.squeeze().clamp(-2, 2), cmap='gray', vmin=-2, vmax=2)
    axes[i].set_title(title, fontsize=10, color=color)
    axes[i].axis('off')

plt.suptitle('Imperfect model: one-shot fails, multi-step recovers', y=1.05, fontsize=13)
plt.tight_layout()
plt.show()

mse_oneshot_noisy = torch.mean((x_0_oneshot_noisy - x_0) ** 2).item()
mse_multistep_noisy = torch.mean((x_0_multistep_noisy - x_0) ** 2).item()
print(f'MSE (one-shot, imperfect):    {mse_oneshot_noisy:.4f}')
print(f'MSE (multi-step, imperfect):  {mse_multistep_noisy:.4f}')
print()
print('With an imperfect model, one-shot denoising amplifies errors catastrophically.')
print('Multi-step denoising keeps errors small at each step, and the model gets a')
print('fresh look at a slightly cleaner image every time. This is why DDPM uses 1,000 steps.')

### What Just Happened

This is the core lesson: **one-shot denoising amplifies prediction errors catastrophically.** At $t=800$, $\bar{\alpha}_t$ is small, so dividing by $\sqrt{\bar{\alpha}_t}$ in the one-shot formula magnifies any error in the noise prediction. A small mistake in $\epsilon_\theta$ becomes a large mistake in $\hat{x}_0$.
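
The amplification can be written out explicitly. As a sketch, suppose the prediction is $\epsilon_\theta = \epsilon + \delta$, where $\delta$ is the prediction error (a symbol introduced here for illustration). Substituting into the one-shot formula and using $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$:

$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,(\epsilon + \delta)}{\sqrt{\bar{\alpha}_t}} = x_0 - \frac{\sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\,\delta$$

The error in $\hat{x}_0$ is the prediction error multiplied by $\sqrt{1 - \bar{\alpha}_t} / \sqrt{\bar{\alpha}_t}$, a factor that grows without bound as $\bar{\alpha}_t \to 0$.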

Multi-step denoising avoids this because:
1. Each step makes a **tiny** correction (the noise correction coefficient is small)
2. Errors at each step are proportionally small
3. The model gets a **fresh look** at a slightly cleaner image next time, so it can course-correct
4. The stochastic noise injection ($\sigma_t \cdot z$) prevents the process from committing too early to one interpretation

This is why DDPM uses 1,000 steps. Not for elegance—for robustness against imperfect predictions.
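
For a rough sense of scale, the sketch below compares the coefficient that multiplies a noise-prediction error in the one-shot formula with the coefficient that multiplies it in a single reverse step, both at $t = 800$. It assumes the notebook's cosine schedule constants; the function names are illustrative. It compares coefficients only and is not a full analysis of how errors accumulate over many steps.

```python
import math

T, S = 1000, 0.008  # assumed to match the notebook's cosine schedule


def _f(u):
    return math.cos((u / T + S) / (1 + S) * math.pi / 2) ** 2


def alpha_bar_at(t):
    """Cosine-schedule alpha_bar_t, normalized so alpha_bar_0 = 1."""
    return _f(t) / _f(0)


def one_shot_gain(t):
    """Multiplier on an eps_theta error in the one-shot x_0 estimate."""
    ab = alpha_bar_at(t)
    return math.sqrt(1.0 - ab) / math.sqrt(ab)


def single_step_gain(t):
    """Multiplier on an eps_theta error in the output of one reverse step."""
    alpha_t = alpha_bar_at(t) / alpha_bar_at(t - 1)
    beta_t = 1.0 - alpha_t
    return beta_t / (math.sqrt(1.0 - alpha_bar_at(t)) * math.sqrt(alpha_t))


t = 800
print(f"one-shot gain at t={t}:    {one_shot_gain(t):.3f}")
print(f"single-step gain at t={t}: {single_step_gain(t):.4f}")
```

The one-shot multiplier is greater than 1 at this timestep, while the single-step multiplier is a small fraction of 1, which is the numerical content of points 1 and 2 above.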

---

## Exercise 4: The Full Sampling Loop (Independent)

Implement the complete DDPM sampling algorithm. Start from pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoise to produce a generated image.

Since we do not have a trained neural network (that comes in Lesson 5), you will use a **dummy oracle model**: at each step, the oracle computes what the noise *would be* if the target image were the T-shirt from the dataset. This simulates a perfectly trained model that always generates one specific image.

**Your task:**
1. Write a `ddpm_sample()` function that implements the full sampling algorithm
2. The function should accept a target image `x_target` (the oracle's "knowledge")
3. Save snapshots at key timesteps to visualize the denoising trajectory
4. Return the final generated image

**Algorithm (from the web lesson):**
1. Sample $x_T \sim \mathcal{N}(0, I)$
2. For $t = T, T-1, \ldots, 1$:
   - If $t > 1$, sample $z \sim \mathcal{N}(0, I)$. If $t = 1$, set $z = 0$.
   - Predict noise: $\epsilon_\theta = \text{oracle}(x_t, t, x_{\text{target}})$
   - Compute: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_\theta \right) + \sigma_t \cdot z$
3. Return $x_0$

**Oracle model:** Given $x_t$, timestep $t$, and the target $x_{\text{target}}$, the oracle computes the noise as:
$$\epsilon_{\text{oracle}} = \frac{x_t - \sqrt{\bar{\alpha}_t} \cdot x_{\text{target}}}{\sqrt{1 - \bar{\alpha}_t}}$$

This is just rearranging the forward formula—the oracle "knows" the target and can compute what noise would have been needed.
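
A quick scalar round trip can confirm the oracle formula before you use it in the loop. This is a minimal sketch with hypothetical values (`x_target`, `eps_true`, and `alpha_bar_t` are made up for illustration), and `oracle_eps` is an illustrative helper name, not something the notebook defines.

```python
import math


def oracle_eps(x_t, alpha_bar_t, x_target):
    """Noise implied by x_t if the clean image were x_target."""
    return (x_t - math.sqrt(alpha_bar_t) * x_target) / math.sqrt(1.0 - alpha_bar_t)


# Hypothetical scalar values, chosen only for illustration
x_target, eps_true, alpha_bar_t = 0.5, -1.2, 0.3

# Forward process: build x_t from the target and a known noise draw
x_t = math.sqrt(alpha_bar_t) * x_target + math.sqrt(1.0 - alpha_bar_t) * eps_true

# The oracle recovers that noise draw from x_t, given the target
eps_recovered = oracle_eps(x_t, alpha_bar_t, x_target)
print(f"eps_true={eps_true:.4f}  eps_recovered={eps_recovered:.4f}")
```

During sampling, $x_t$ is not built from the target, so the oracle returns whatever noise would explain the current $x_t$ as a noised copy of $x_{\text{target}}$.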

<details>
<summary>💡 Solution</summary>

The sampling loop is the reverse step formula applied T times. The oracle provides a stand-in for a trained model. The key details are: (1) start from pure noise, (2) loop from T down to 1, (3) set z=0 at the final step (t=1), and (4) save snapshots at specific timesteps for visualization.

```python
def ddpm_sample(x_target, alpha_bar, alpha, beta, sigma, T, snapshot_timesteps=None):
    """Full DDPM sampling loop using an oracle model."""
    shape = x_target.shape
    snapshots = {}
    
    # Step 1: Start from pure noise
    x = torch.randn(shape)
    
    if snapshot_timesteps and T in snapshot_timesteps:
        snapshots[T] = x.clone()
    
    # Step 2: Loop from T down to 1
    for t in range(T, 0, -1):
        # Sample z (zero at final step)
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        
        # Oracle noise prediction
        eps_pred = (x - torch.sqrt(alpha_bar[t]) * x_target) / torch.sqrt(1 - alpha_bar[t])
        
        # Reverse step
        scaling = 1.0 / torch.sqrt(alpha[t])
        noise_corr = beta[t] / torch.sqrt(1 - alpha_bar[t])
        x = scaling * (x - noise_corr * eps_pred) + sigma[t] * z
        
        if snapshot_timesteps and (t - 1) in snapshot_timesteps:
            snapshots[t - 1] = x.clone()
    
    return x, snapshots
```

Common mistakes:
- Forgetting to set `z = 0` at `t = 1`—this adds noise to the final output
- Using `alpha_bar` instead of `alpha` for the scaling factor—they are different quantities
- Looping from 0 to T instead of T down to 1

</details>

In [None]:
# YOUR CODE HERE
# Implement the ddpm_sample function.
# Parameters: x_target, alpha_bar, alpha, beta, sigma, T, snapshot_timesteps (list of timesteps to save)
# Returns: (final_image, snapshots_dict)






In [None]:
# Run the sampling loop and visualize the trajectory
x_target = dataset[0][0]  # T-shirt

snapshot_steps = [1000, 900, 800, 600, 400, 200, 100, 50, 0]

torch.manual_seed(42)
generated, snapshots = ddpm_sample(
    x_target, alpha_bar, alpha, beta, sigma, T,
    snapshot_timesteps=snapshot_steps
)

# Visualize the denoising trajectory
fig, axes = plt.subplots(1, len(snapshot_steps), figsize=(18, 3))

for i, t in enumerate(snapshot_steps):
    img = snapshots[t].squeeze().clamp(-2, 2)
    axes[i].imshow(img, cmap='gray', vmin=-2, vmax=2)
    axes[i].set_title(f't={t}', fontsize=10)
    axes[i].axis('off')

plt.suptitle('The denoising trajectory: pure noise to generated image', y=1.05, fontsize=13)
plt.tight_layout()
plt.show()

# Compare final result to original
fig, axes = plt.subplots(1, 2, figsize=(6, 3))
axes[0].imshow(x_target.squeeze(), cmap='gray', vmin=-2, vmax=2)
axes[0].set_title('Original T-shirt', fontsize=11)
axes[0].axis('off')

axes[1].imshow(generated.squeeze().clamp(-2, 2), cmap='gray', vmin=-2, vmax=2)
axes[1].set_title('Generated (oracle sampling)', fontsize=11)
axes[1].axis('off')

plt.tight_layout()
plt.show()

mse = torch.mean((generated - x_target) ** 2).item()
print(f'MSE between generated and original: {mse:.4f}')
print("(Expected to be near zero: with z = 0 at t=1, the oracle's final step maps x_1 back onto the target)")
print(f'Number of "forward passes" (steps): {T}')
print(f'In real DDPM, each step is a neural network forward pass.')
print(f'That is {T} evaluations for ONE image.')

### What Just Happened

You implemented the complete DDPM sampling algorithm. The trajectory shows exactly what the web lesson described:

- **t=1000 to t=800:** Pure static begins to show the vaguest hint of structure. The oracle is making coarse, global decisions.
- **t=600 to t=400:** A recognizable shape emerges. Edges form, proportions become clear.
- **t=200 to t=50:** Fine details appear. Textures, subtle shading, the outline solidifies.
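- **t=0:** The generated image. With this oracle it matches the target almost exactly: the final step uses $z = 0$, and the oracle's noise estimate points straight back at the target, so the noise injected at earlier steps is corrected away. A trained model has no such answer key.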

**The cost is real.** 1,000 steps. In a real diffusion model, each step is a full neural network forward pass through a U-Net. At 28×28, this runs in seconds. At 512×512 (Stable Diffusion scale), each step takes much longer. That is the pain that motivates accelerated samplers, DDIM, latent diffusion, and everything that follows.

---

## Key Takeaways

1. **The reverse step coefficients change with the schedule.** The scaling factor stays close to 1 for most of the schedule, but the noise correction is larger at high $t$ (bold structural decisions) and tiny at low $t$ (fine detail polishing). Same formula, different regime.

2. **One step barely changes the image.** Each reverse step makes a tiny correction. You need 1,000 of them—and that is a strength, not a weakness. Small steps mean small errors per step.

3. **One-shot denoising fails with imperfect predictions.** Dividing by $\sqrt{\bar{\alpha}_t}$ amplifies any error in the noise prediction. Multi-step denoising is robust because each step's error is proportionally small and the model gets a fresh look at a slightly cleaner image.

4. **The full sampling algorithm is a loop: start from noise, apply the reverse step formula T times, return $x_0$.** Set $z = 0$ at the final step ($t = 1$)—the last step commits, every step before it explores.

5. **The denoising trajectory is coarse-to-fine.** Early steps create structure from nothing. Late steps refine details. Not all steps are created equal.

**Mental model:** Destruction was easy and known. Creation requires a trained guide. At each step, the guide points toward the clean image, but a small jitter keeps the path from collapsing to a single boring route. 1,000 tiny corrections compose into something that never existed.