# Flow Matching

**Module 7.2, Lesson 2** | CourseAI

You know the theory from the lesson: flow matching replaces curved diffusion trajectories with straight lines, uses velocity prediction instead of noise prediction, and produces a simpler training objective. This notebook makes that concrete.

**What you will do:**
- Implement both DDPM and flow matching interpolation in 1D, plot the paths and velocity profiles
- Apply Euler's method on curved vs straight 2D trajectories and see why straight paths need fewer steps
- Train a flow matching model on a 2D distribution (two-moons) and generate samples via Euler ODE solving
- Compare flow matching vs DDPM head-to-head: sample quality at varying step counts (1, 5, 10, 20, 50)

**For each exercise, PREDICT the output before running the cell.**

Every concept in this notebook comes from the lesson. Straight-line interpolation, constant velocity, velocity prediction, the connection to Euler's method. No new theory—just hands-on practice with the math and models.

**Estimated time:** 30–45 minutes. Exercises 1–2 are pure math (no training). Exercises 3–4 train small MLPs on 2D data (~1–2 minutes on CPU).

## Setup

Run this cell to import everything and configure the environment.

No GPU required for this notebook. Everything runs on CPU. The models in Exercises 3–4 are tiny MLPs trained on 2D point distributions.

In [None]:
!pip install -q torch numpy matplotlib scikit-learn

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Reproducible results
torch.manual_seed(42)
np.random.seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]
plt.rcParams['figure.dpi'] = 100

print('Setup complete. No GPU needed for this notebook.')

## Shared Helpers

A small MLP for velocity prediction (flow matching) and noise prediction (DDPM), plus data generation utilities. The same architecture works for both—the only difference is the training target.

Run this cell now. It defines everything needed for Exercises 3 and 4.
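
For a quick feel of the noise schedule defined in the cell below, here is a minimal pure-Python sketch of the same cosine formula. The function name `cosine_alpha_bars` is hypothetical and exists only for this illustration; the constants (offset `0.008`, clamp range) mirror `make_cosine_schedule`.

```python
import math

def cosine_alpha_bars(T=100, s=0.008):
    """Pure-Python mirror of the cosine schedule formula (illustrative only)."""
    f = [math.cos((i / T + s) / (1 + s) * math.pi / 2) ** 2 for i in range(T + 1)]
    # Normalize by f[0], then clamp to the same range the notebook helper uses
    return [min(max(v / f[0], 1e-5), 1.0 - 1e-5) for v in f[1:]]

alpha_bars = cosine_alpha_bars(T=100)
print(f"first: {alpha_bars[0]:.5f}, middle: {alpha_bars[49]:.5f}, last: {alpha_bars[-1]:.5f}")
```

The values start close to 1 (mostly signal) and fall toward the lower clamp (mostly noise), which is the behavior the DDPM exercises rely on.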

In [None]:
# ============================================================
# Shared: MLP for 2D generative models (flow matching + DDPM)
# ============================================================
# The same architecture predicts either velocity (flow matching)
# or noise (DDPM). The network takes (x_t, t) and outputs a 2D vector.
# What that vector *means* depends on the training objective.

class ToyModel(nn.Module):
    """MLP that takes (x_t, t) and outputs a 2D vector.
    
    For flow matching: output = predicted velocity v_theta(x_t, t)
    For DDPM: output = predicted noise epsilon_theta(x_t, t)
    Same architecture, different training target.
    """
    def __init__(self, hidden_dim=128):
        super().__init__()
        # Learned time embedding: a small MLP applied to the scalar t (not sinusoidal)
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Main network
        self.net = nn.Sequential(
            nn.Linear(2 + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, x_t, t):
        """Predict a 2D vector from (x_t, t).
        
        x_t: (batch, 2) -- noisy/interpolated point
        t: (batch, 1) or (batch,) -- timestep in [0, 1]
        """
        if t.dim() == 1:
            t = t.unsqueeze(-1)
        t_emb = self.time_mlp(t)
        inp = torch.cat([x_t, t_emb], dim=-1)
        return self.net(inp)


def sample_two_moons(n, noise=0.06):
    """Sample from the two-moons distribution.
    
    Returns a (n, 2) tensor of 2D points.
    The distribution has two crescent-shaped clusters.
    """
    data, _ = make_moons(n_samples=n, noise=noise)
    # Center and scale
    data = (data - data.mean(axis=0)) * 2.0
    return torch.tensor(data, dtype=torch.float32)


def make_cosine_schedule(T=100):
    """Cosine noise schedule for DDPM.
    
    Returns alpha_bars: (T,) tensor where alpha_bars[t] is the
    cumulative signal retention at timestep t.
    """
    steps = torch.arange(T + 1, dtype=torch.float32) / T
    f = torch.cos((steps + 0.008) / 1.008 * (np.pi / 2)) ** 2
    alpha_bars = f[1:] / f[0]  # Normalize so alpha_bar[0] is close to 1
    alpha_bars = alpha_bars.clamp(min=1e-5, max=1.0 - 1e-5)
    return alpha_bars


print('Shared helpers defined.')
print('- ToyModel: MLP that takes (x_t, t) -> 2D vector')
print('- sample_two_moons: 2D two-moons distribution')
print('- make_cosine_schedule: cosine noise schedule for DDPM')

---

## Exercise 1: Flow Matching vs DDPM Interpolation `[Guided]`

From the lesson: DDPM uses $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with nonlinear coefficients. Flow matching uses $x_t = (1-t)\, x_0 + t\, \epsilon$ with linear coefficients. The DDPM path curves; the flow matching path is a straight line.

We will implement both interpolation schemes in 1D:
- Start at $x_0 = 5.0$ (data point) and interpolate toward $\epsilon = -2.0$ (noise sample)
- Plot the interpolation path $x_t$ vs time for both methods
- Plot the "velocity" (derivative $dx_t/dt$) along each path

**Before running, predict:**
- Which path will be a straight line on the plot? Which will curve?
- For the flow matching path, will the velocity be constant or varying?
- For the DDPM path, at what point in time will the velocity change the most?

In [None]:
# ============================================================
# Exercise 1: Compare DDPM vs Flow Matching interpolation in 1D
# ============================================================

x_0 = 5.0   # Data point
eps = -2.0   # Noise sample

# Time grid (0 to 1)
t = torch.linspace(0, 1, 200)

# --- Flow Matching interpolation ---
# x_t = (1 - t) * x_0 + t * epsilon
# This is a straight line from x_0 (at t=0) to epsilon (at t=1)
fm_path = (1 - t) * x_0 + t * eps

# Flow matching velocity: dx/dt = epsilon - x_0 (constant!)
fm_velocity = torch.full_like(t, eps - x_0)  # -7.0 everywhere

# --- DDPM interpolation (with cosine schedule) ---
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
# The coefficients are nonlinear functions of t
T_steps = 200
alpha_bars_1d = make_cosine_schedule(T_steps)

# Map t in [0, 1] to the discrete schedule
# t=0 -> alpha_bar near 1 (clean), t=1 -> alpha_bar near 0 (noisy)
# Round before casting: truncation can map neighboring t values to the same index
indices = (t * (T_steps - 1)).round().long().clamp(0, T_steps - 1)
ab = alpha_bars_1d[indices]

ddpm_path = torch.sqrt(ab) * x_0 + torch.sqrt(1 - ab) * eps

# DDPM "velocity" (numerical derivative)
ddpm_velocity = torch.zeros_like(t)
dt = t[1] - t[0]
ddpm_velocity[1:-1] = (ddpm_path[2:] - ddpm_path[:-2]) / (2 * dt)
ddpm_velocity[0] = (ddpm_path[1] - ddpm_path[0]) / dt
ddpm_velocity[-1] = (ddpm_path[-1] - ddpm_path[-2]) / dt

print(f'Data point x_0 = {x_0}, noise sample epsilon = {eps}')
print(f'Flow matching velocity = epsilon - x_0 = {eps - x_0} (constant)')
print(f'DDPM velocity varies from {ddpm_velocity[1]:.2f} to {ddpm_velocity[-2]:.2f}')

In [None]:
# Plot: interpolation paths and velocity profiles side by side

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# --- Left panel: Interpolation paths ---
ax = axes[0]
# Raw strings keep the LaTeX backslashes intact (no invalid escape sequences)
ax.plot(t.numpy(), fm_path.numpy(), color='#34d399', linewidth=2.5, label=r'Flow Matching: $x_t = (1-t)\,x_0 + t\,\epsilon$')
ax.plot(t.numpy(), ddpm_path.numpy(), color='#f59e0b', linewidth=2.5, label=r'DDPM: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$')
ax.axhline(x_0, color='white', linestyle=':', alpha=0.3, label=f'$x_0$ = {x_0}')
ax.axhline(eps, color='white', linestyle=':', alpha=0.3, label=rf'$\epsilon$ = {eps}')
ax.set_xlabel('Time $t$', fontsize=12)
ax.set_ylabel('$x_t$', fontsize=12)
ax.set_title('Interpolation Path: Data to Noise', fontsize=13)
ax.legend(fontsize=9)
ax.grid(alpha=0.2)

# --- Right panel: Velocity profiles ---
ax = axes[1]
ax.plot(t.numpy(), fm_velocity.numpy(), color='#34d399', linewidth=2.5, label='Flow Matching velocity (constant)')
ax.plot(t.numpy(), ddpm_velocity.numpy(), color='#f59e0b', linewidth=2.5, label='DDPM velocity (varies with $t$)')
ax.axhline(0, color='white', linewidth=0.5, alpha=0.3)
ax.set_xlabel('Time $t$', fontsize=12)
ax.set_ylabel('$dx_t / dt$', fontsize=12)
ax.set_title('Velocity Along the Path', fontsize=13)
ax.legend(fontsize=9)
ax.grid(alpha=0.2)

plt.tight_layout()
plt.show()

print('Observations:')
print('- Left panel: The flow matching path (green) is a perfect straight line.')
print('  The DDPM path (amber) curves\u2014it stays near x_0 for a while, then')
print('  drops rapidly toward epsilon. This is because the cosine schedule')
print('  keeps alpha_bar near 1 (mostly signal) at early timesteps.')
print()
print('- Right panel: The flow matching velocity is constant (-7.0 everywhere).')
print('  The DDPM velocity changes with t: its magnitude is small near t=0 and')
print('  much larger later in the path. This CHANGING velocity is what creates curvature.')
print('  A network trying to predict DDPM velocity would need to learn a complex')
print('  time-dependent function. Flow matching velocity is trivial: one number.')

### What Just Happened

You implemented both interpolation schemes and saw the core difference:

- **Flow matching path is a straight line.** The linear coefficients $(1-t)$ and $t$ produce a path from $x_0$ to $\epsilon$ with zero curvature. The velocity is constant: $v = \epsilon - x_0 = -7.0$ at every timestep.

- **DDPM path curves.** The nonlinear coefficients $\sqrt{\bar\alpha_t}$ and $\sqrt{1-\bar\alpha_t}$ from the cosine schedule create a path that lingers near $x_0$ (the schedule keeps $\bar\alpha_t$ near 1 for a while), then drops rapidly toward $\epsilon$. The velocity changes dramatically with time.

- **Changing velocity = curvature.** The DDPM velocity is not constant: its magnitude is small near $t=0$ and considerably larger later in the path. This variation is exactly what creates curvature in the trajectory. An ODE solver following this path must adjust to the changing velocity at every step. On the flow matching path, the velocity never changes, so there is nothing to adjust to.

This is the interpolation difference from the lesson: same idea (interpolate between data and noise), simpler coefficients (linear instead of nonlinear), straight path instead of curved.
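
The velocity comparison can also be checked in closed form. The sketch below assumes an idealized continuous cosine schedule, $\bar\alpha_t = \cos^2(\pi t / 2)$, which drops the small offset and clamping used by `make_cosine_schedule`; under that assumption $x_t = \cos(\pi t/2)\,x_0 + \sin(\pi t/2)\,\epsilon$. The function names are hypothetical.

```python
import math

x_0, eps = 5.0, -2.0  # same data point and noise sample as Exercise 1

def fm_velocity(t):
    """Flow matching: d/dt [(1 - t) * x_0 + t * eps] = eps - x_0 for every t."""
    return eps - x_0

def ddpm_velocity(t):
    """Idealized cosine schedule (assumption): derivative of
    cos(pi*t/2) * x_0 + sin(pi*t/2) * eps with respect to t."""
    a = math.pi / 2
    return a * (-math.sin(a * t) * x_0 + math.cos(a * t) * eps)

ts = [i / 100 for i in range(101)]
speeds = [abs(ddpm_velocity(t)) for t in ts]
t_peak = ts[speeds.index(max(speeds))]

print(f"flow matching velocity at t=0.0 and t=1.0: {fm_velocity(0.0)}, {fm_velocity(1.0)}")
print(f"DDPM velocity at t=0.0 and t=1.0: {ddpm_velocity(0.0):.2f}, {ddpm_velocity(1.0):.2f}")
print(f"DDPM speed is largest near t={t_peak:.2f}")
```

The numerical derivative in the exercise should track this curve closely, apart from small differences caused by the schedule offset and the clamp at the final timestep.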

---

## Exercise 2: Euler Steps on Curved vs Straight Paths `[Guided]`

From the lesson: Euler's method computes the direction at the current point and takes a step. On a curved path, the trajectory bends after each step, causing the Euler approximation to drift off course. On a straight path, Euler's extrapolation is exact—the true trajectory IS a straight line, so linear extrapolation is perfect.

We will create two 2D ODEs:
1. A **curved** trajectory (a spiral) where the velocity field changes direction
2. A **straight** trajectory (a line) where the velocity is constant

Then apply Euler's method with $N=5$ steps and $N=1$ step to both, and compare to the true solution.

**Before running, predict:**
- With $N=5$ Euler steps on the straight path, how large will the error be?
- With $N=1$ Euler step on the straight path, will the error increase compared to $N=5$?
- On the curved path, which will have more error: $N=5$ or $N=1$?

In [None]:
# ============================================================
# Exercise 2: Euler's method on curved vs straight trajectories
# ============================================================

# --- Define two ODE systems ---
#
# Curved ODE: dx/dt = A * x, where A combines rotation with a slight contraction
# This produces a spiral trajectory (the velocity direction rotates).
#
# Straight ODE: dx/dt = v (constant velocity)
# This produces a straight-line trajectory.

# Starting point (same for both)
x_start = torch.tensor([2.0, 0.0])

# --- Curved ODE: spiral ---
# dx/dt = A * x, where A produces rotation + slight contraction
theta_rate = 3.0  # Rotation speed (radians per unit time)
contraction = -0.5  # Slight inward pull

def curved_velocity(x, t):
    """Velocity field for the curved (spiral) ODE."""
    # Rotation + contraction matrix
    A = torch.tensor([
        [contraction, -theta_rate],
        [theta_rate, contraction]
    ])
    return A @ x

# Analytical solution for the curved ODE: x(t) = exp(A*t) * x_start
# Since A = contraction*I + theta_rate*J (J = 90-degree rotation), this is
#   x(t) = exp(contraction*t) * R(theta_rate*t) * x_start
def curved_exact(t_val):
    """Exact solution of the spiral ODE at time t."""
    scale = np.exp(contraction * t_val)
    angle = theta_rate * t_val
    R = torch.tensor([
        [np.cos(angle), -np.sin(angle)],
        [np.sin(angle), np.cos(angle)]
    ], dtype=torch.float32)  # explicit dtype so R matches x_start
    return scale * (R @ x_start)

# --- Straight ODE: constant velocity ---
target_point = torch.tensor([-1.0, 2.0])
straight_v = target_point - x_start  # Constant velocity

def straight_velocity(x, t):
    """Velocity field for the straight-line ODE."""
    return straight_v

def straight_exact(t_val):
    """Exact solution of the straight-line ODE at time t."""
    return x_start + t_val * straight_v

# --- Euler's method ---
def euler_solve(velocity_fn, x_init, T_end, n_steps):
    """Solve an ODE using Euler's method.
    
    Returns the trajectory as a list of points.
    """
    dt = T_end / n_steps
    trajectory = [x_init.clone()]
    x = x_init.clone()
    
    for i in range(n_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = x + dt * v
        trajectory.append(x.clone())
    
    return trajectory


T_end = 1.0  # Integrate from t=0 to t=1

# Exact solutions (dense, for plotting)
t_dense = np.linspace(0, T_end, 200)
curved_true = torch.stack([curved_exact(t) for t in t_dense])
straight_true = torch.stack([straight_exact(t) for t in t_dense])

# Euler solutions at different step counts
step_counts = [5, 1]
curved_eulers = {n: euler_solve(curved_velocity, x_start, T_end, n) for n in step_counts}
straight_eulers = {n: euler_solve(straight_velocity, x_start, T_end, n) for n in step_counts}

# Compute endpoint errors
curved_true_end = curved_exact(T_end)
straight_true_end = straight_exact(T_end)

print('Endpoint errors:')
print(f'  True curved endpoint:   [{curved_true_end[0]:.4f}, {curved_true_end[1]:.4f}]')
print(f'  True straight endpoint: [{straight_true_end[0]:.4f}, {straight_true_end[1]:.4f}]')
print()
for n in step_counts:
    c_end = curved_eulers[n][-1]
    s_end = straight_eulers[n][-1]
    c_err = torch.norm(c_end - curved_true_end).item()
    s_err = torch.norm(s_end - straight_true_end).item()
    print(f'  N={n} steps:')
    print(f'    Curved path error:   {c_err:.6f}')
    print(f'    Straight path error: {s_err:.6f}')

In [None]:
# Plot Euler's method on both trajectories at N=5 and N=1

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for row, n_steps in enumerate(step_counts):
    # --- Curved trajectory ---
    ax = axes[row, 0]
    # True trajectory
    ax.plot(curved_true[:, 0], curved_true[:, 1], color='white', linewidth=1.5,
            alpha=0.4, label='True trajectory', linestyle='--')
    # Euler approximation
    euler_pts = torch.stack(curved_eulers[n_steps])
    ax.plot(euler_pts[:, 0], euler_pts[:, 1], 'o-', color='#f59e0b', linewidth=2,
            markersize=8, label=f'Euler ({n_steps} steps)')
    # Mark start and endpoints
    ax.plot(x_start[0], x_start[1], 's', color='#60a5fa', markersize=10, zorder=5, label='Start')
    ax.plot(curved_true_end[0], curved_true_end[1], '*', color='#34d399', markersize=15, zorder=5, label='True end')
    # Error line
    euler_end = curved_eulers[n_steps][-1]
    err = torch.norm(euler_end - curved_true_end).item()
    ax.plot([euler_end[0], curved_true_end[0]], [euler_end[1], curved_true_end[1]],
            color='#ef4444', linewidth=2, linestyle='-', label=f'Error: {err:.4f}')
    ax.set_title(f'Curved Path (Spiral)\u2014{n_steps} Euler Step{"s" if n_steps > 1 else ""}', fontsize=12)
    ax.legend(fontsize=8)
    ax.set_aspect('equal')
    ax.grid(alpha=0.2)

    # --- Straight trajectory ---
    ax = axes[row, 1]
    # True trajectory
    ax.plot(straight_true[:, 0], straight_true[:, 1], color='white', linewidth=1.5,
            alpha=0.4, label='True trajectory', linestyle='--')
    # Euler approximation
    euler_pts = torch.stack(straight_eulers[n_steps])
    ax.plot(euler_pts[:, 0], euler_pts[:, 1], 'o-', color='#34d399', linewidth=2,
            markersize=8, label=f'Euler ({n_steps} steps)')
    # Mark start and endpoints
    ax.plot(x_start[0], x_start[1], 's', color='#60a5fa', markersize=10, zorder=5, label='Start')
    ax.plot(straight_true_end[0], straight_true_end[1], '*', color='#34d399', markersize=15, zorder=5, label='True end')
    # Error line
    euler_end = straight_eulers[n_steps][-1]
    err = torch.norm(euler_end - straight_true_end).item()
    ax.plot([euler_end[0], straight_true_end[0]], [euler_end[1], straight_true_end[1]],
            color='#ef4444', linewidth=2, linestyle='-', label=f'Error: {err:.6f}')
    ax.set_title(f'Straight Path (Line)\u2014{n_steps} Euler Step{"s" if n_steps > 1 else ""}', fontsize=12)
    ax.legend(fontsize=8)
    ax.set_aspect('equal')
    ax.grid(alpha=0.2)

plt.suptitle(
    'Euler\'s Method: Curved Path (left) vs Straight Path (right)\n'
    'Top: 5 steps | Bottom: 1 step',
    fontsize=13, y=1.02
)
plt.tight_layout()
plt.show()

print('Key observations:')
print()
print('- CURVED path, 5 steps: Euler drifts off the true spiral. Each step')
print('  overshoots because the trajectory curves after the step. Visible error.')
print()
print('- CURVED path, 1 step: Euler goes way off course. A single linear')
print('  extrapolation completely misses the spiral. Large error.')
print()
print('- STRAIGHT path, 5 steps: Euler lands EXACTLY on the true endpoint.')
print('  Error is essentially zero (floating point precision only).')
print()
print('- STRAIGHT path, 1 step: STILL exact! One Euler step from start to')
print('  end is perfect because the true trajectory IS a straight line.')
print('  Euler extrapolates linearly, and the trajectory is linear. Match.')
print()
print('THIS is why flow matching needs fewer steps. The trajectory is straight,')
print('so Euler\'s method (or any ODE solver) gets it right with minimal steps.')
print('On curved diffusion trajectories, more steps are always needed to')
print('compensate for the curvature error.')

### What Just Happened

You applied Euler's method to curved and straight trajectories and verified the lesson's key claim:

- **On a straight path, Euler is exact.** Even with a single step, Euler's linear extrapolation lands exactly on the true endpoint. The error is zero (up to floating point). This is because Euler's method assumes the trajectory continues in a straight line from the current point—and it does.

- **On a curved path, Euler accumulates error.** Each step overshoots because the trajectory bends after the step. More curvature means more error. Reducing from 5 steps to 1 step makes the error dramatically worse.

- **This IS the flow matching advantage.** Diffusion ODE trajectories curve (as you saw in Exercise 1). Flow matching trajectories are straight (by construction). ODE solvers like Euler's method are exact on straight paths, meaning flow matching models can generate in far fewer steps.

The negative experiment makes this concrete: on the curved path, 1 step is catastrophically wrong. On the straight path, 1 step is perfect. The trajectory geometry determines the solver's accuracy, not the solver itself.
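
To see how the curved-path error shrinks as steps are added, here is a minimal sketch that re-expresses the same spiral with complex numbers: multiplying by `lam = contraction + 1j * theta_rate` is equivalent to applying the matrix `A` from the exercise. The function names are hypothetical, and the constants are copied from the cell above.

```python
import cmath

lam = complex(-0.5, 3.0)  # contraction + i * theta_rate (same values as the exercise)
z0 = complex(2.0, 0.0)    # start point (2, 0)
v = complex(-3.0, 2.0)    # straight-line velocity: target (-1, 2) minus start (2, 0)

def euler_spiral_error(n_steps):
    """Endpoint error of Euler's method on dz/dt = lam * z over t in [0, 1]."""
    dt = 1.0 / n_steps
    z = z0
    for _ in range(n_steps):
        z = z + dt * lam * z
    return abs(z - z0 * cmath.exp(lam))

def euler_line_error(n_steps):
    """Endpoint error of Euler's method on dz/dt = v (constant velocity)."""
    dt = 1.0 / n_steps
    z = z0
    for _ in range(n_steps):
        z = z + dt * v
    return abs(z - (z0 + v))

for n in [1, 5, 50, 500]:
    print(f"N={n:>3}: spiral error {euler_spiral_error(n):.4f}, line error {euler_line_error(n):.2e}")
```

On the spiral the error keeps falling as `N` grows, while on the line it is already at floating-point level for `N=1`.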

---

## Exercise 3: Train a Flow Matching Model on 2D Data `[Supported]`

From the lesson: the flow matching training loop is structurally identical to DDPM's, but simpler:
1. Sample data $x_0$ and noise $\epsilon \sim \mathcal{N}(0, I)$
2. Sample a random time $t \sim \text{Uniform}(0, 1)$
3. Compute $x_t = (1-t)\, x_0 + t\, \epsilon$
4. Target velocity: $v = \epsilon - x_0$ (constant, does not depend on $t$!)
5. Loss = $\text{MSE}(v_\theta(x_t, t),\; \epsilon - x_0)$

Your task: fill in the TODO markers to implement the flow matching training loop and the Euler ODE sampling.

**Before running, predict:**
- With 5 Euler steps, will the generated samples be recognizable as two-moons?
- With 50 Euler steps, will the quality be noticeably better than 5 steps? (Think about what Exercise 2 showed you.)

In [None]:
# ============================================================
# Exercise 3: Train a flow matching model on two-moons
# ============================================================
# NOTE: Fill in ALL four TODOs before running this cell.
# Running with None values will raise an error.

torch.manual_seed(42)

# Create the model
fm_model = ToyModel(hidden_dim=128)
optimizer = torch.optim.Adam(fm_model.parameters(), lr=3e-4)

n_epochs = 500
batch_size = 512
losses = []

print('Training flow matching model on two-moons...')
print('(This should take ~30 seconds on CPU)')
print()

for epoch in range(n_epochs):
    # Step 1: Sample clean data
    x_0 = sample_two_moons(batch_size)
    
    # Step 2: Sample random noise
    epsilon = torch.randn_like(x_0)
    
    # TODO: Sample random time t ~ Uniform(0, 1), shape (batch_size, 1)
    # Hint: Use torch.rand
    t = None  # <-- Replace this line
    
    # TODO: Compute the interpolated point x_t = (1 - t) * x_0 + t * epsilon
    # This is the flow matching interpolation from the lesson.
    x_t = None  # <-- Replace this line
    
    # TODO: Compute the target velocity v = epsilon - x_0
    # This is constant for each (x_0, epsilon) pair: it does not depend on t!
    target_v = None  # <-- Replace this line
    
    # Guard: make sure TODOs are filled in before proceeding
    if t is None or x_t is None or target_v is None:
        raise NotImplementedError(
            "Fill in the three TODOs above (t, x_t, target_v) before running this cell."
        )
    
    # Forward pass: network predicts velocity from (x_t, t)
    pred_v = fm_model(x_t, t)
    
    # TODO: Compute the loss = MSE between predicted and target velocity
    loss = None  # <-- Replace this line
    
    if loss is None:
        raise NotImplementedError(
            "Fill in the loss TODO above before running this cell."
        )
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    if (epoch + 1) % 100 == 0:
        print(f'  Epoch {epoch+1}/{n_epochs}, loss: {loss.item():.4f}')

# Plot training loss
fig, ax = plt.subplots(1, 1, figsize=(8, 3))
ax.plot(losses, color='#34d399', linewidth=1, alpha=0.7)
ax.set_xlabel('Epoch')
ax.set_ylabel('MSE Loss')
ax.set_title('Flow Matching Training Loss')
plt.tight_layout()
plt.show()

print(f'\nFinal loss: {losses[-1]:.4f}')

In [None]:
# ============================================================
# Generate samples using Euler ODE solving
# ============================================================
#
# Generation goes from noise (t=1) to data (t=0).
# At each step, the model predicts the velocity v_theta(x_t, t),
# and we step in the REVERSE direction (from t=1 toward t=0).
#
# Euler step: x_{t-dt} = x_t + (-dt) * v_theta(x_t, t)
#   (negative dt because we go backward in time)
# Equivalently: x_{t-dt} = x_t - dt * v_theta(x_t, t)

fm_model.eval()

@torch.no_grad()
def sample_flow_matching(model, n_samples, n_steps):
    """Generate samples by solving the flow matching ODE with Euler's method.
    
    Start from noise (t=1), step toward data (t=0).
    """
    # Start from pure noise
    x = torch.randn(n_samples, 2)
    dt = 1.0 / n_steps
    
    for i in range(n_steps):
        # TODO: Compute current time t (starts at 1.0, decreases by dt each step)
        # Hint: t = 1.0 - i * dt
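        t = None  # <-- Replace this line
        # Guard: fail with a clear message if the sampling TODOs are not filled in
        if t is None:
            raise NotImplementedError(
                "Fill in the three sampling TODOs (t, v_pred, x) before running this cell."
            )
        t_tensor = torch.full((n_samples, 1), t)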
        
        # TODO: Get the model's velocity prediction
        v_pred = None  # <-- Replace this line
        
        # TODO: Euler step in reverse time: x = x - dt * v_pred
        x = None  # <-- Replace this line
    
    return x


# Generate samples at different step counts
step_counts_to_test = [1, 5, 10, 50]

# Also get real data for comparison
torch.manual_seed(0)
real_data = sample_two_moons(500)

fig, axes = plt.subplots(1, len(step_counts_to_test) + 1, figsize=(18, 4))

# Plot real data
ax = axes[0]
ax.scatter(real_data[:, 0], real_data[:, 1], s=3, alpha=0.5, color='#60a5fa')
ax.set_title('Real Data\n(two-moons)', fontsize=11)
ax.set_xlim(-4, 4); ax.set_ylim(-4, 4)
ax.set_aspect('equal')
ax.grid(alpha=0.15)

# Generate and plot for each step count
for idx, n_steps in enumerate(step_counts_to_test):
    torch.manual_seed(0)
    samples = sample_flow_matching(fm_model, 500, n_steps)
    
    ax = axes[idx + 1]
    if samples is not None:
        ax.scatter(samples[:, 0], samples[:, 1], s=3, alpha=0.5, color='#34d399')
    ax.set_title(f'{n_steps} Euler Step{"s" if n_steps > 1 else ""}', fontsize=11)
    ax.set_xlim(-4, 4); ax.set_ylim(-4, 4)
    ax.set_aspect('equal')
    ax.grid(alpha=0.15)

plt.suptitle(
    'Flow Matching: Sample Quality vs Number of Euler Steps',
    fontsize=13, y=1.02
)
plt.tight_layout()
plt.show()

print('Observations:')
print('- Even with 5 Euler steps, the two-moons shape is recognizable.')
print('  This is the flow matching advantage: nearly-straight trajectories')
print('  mean Euler\'s method works well with very few steps.')
print()
print('- 1 step may be noisy (the aggregate learned field is not perfectly')
print('  straight\u2014remember rectified flow from the lesson).')
print()
print('- 10 and 50 steps produce very similar quality. Once the trajectories')
print('  are nearly straight, adding more steps gives diminishing returns.')

<details>
<summary>Solution</summary>

The key insight is that the flow matching training loop is almost trivially simple compared to DDPM. No noise schedule, no alpha_bar, no cumulative products. Just a linear interpolation and a subtraction for the target.

**Training TODOs:**
```python
# Sample random time
t = torch.rand(batch_size, 1)

# Interpolated point
x_t = (1 - t) * x_0 + t * epsilon

# Target velocity (constant for each pair!)
target_v = epsilon - x_0

# MSE loss
loss = nn.functional.mse_loss(pred_v, target_v)
```

**Sampling TODOs:**
```python
# Current time
t = 1.0 - i * dt

# Model prediction
v_pred = model(x, t_tensor)

# Euler step (reverse time)
x = x - dt * v_pred
```

**Why reverse time for sampling:** During training, $t=0$ is clean data and $t=1$ is pure noise. To generate, we start from noise ($t=1$) and walk backward to data ($t=0$). The velocity field $v_\theta$ was trained in the forward direction (data to noise), so to go backward we subtract: $x_{t-dt} = x_t - dt \cdot v_\theta(x_t, t)$.

**Common mistakes:**
- Forgetting that `t` needs shape `(batch_size, 1)` for broadcasting with `x_0`, which has shape `(batch_size, 2)`
- Adding `dt * v_pred` instead of subtracting (wrong direction—generating noise instead of data)
- Using integer timesteps instead of continuous $t \in [0, 1]$ (flow matching uses continuous time, not DDPM's discrete timesteps)
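
The sign convention can be checked without a trained model. The sketch below assumes an oracle that returns the exact conditional velocity $\epsilon - x_0$ for a single made-up pair (the values and the `integrate` helper are hypothetical), then integrates from $t=1$ to $t=0$ with each sign.

```python
x_0 = [1.5, -0.5]   # made-up data point
eps = [-0.3, 0.8]   # made-up noise sample
v = [e - x for e, x in zip(eps, x_0)]  # oracle velocity: eps - x_0

def integrate(sign, n_steps):
    """Start at x = eps (t=1) and take n_steps Euler steps of x += sign * dt * v."""
    dt = 1.0 / n_steps
    x = list(eps)
    for _ in range(n_steps):
        x = [xi + sign * dt * vi for xi, vi in zip(x, v)]
    return x

print("subtracting dt * v:", integrate(-1, 4))  # walks back toward x_0
print("adding dt * v:     ", integrate(+1, 4))  # walks away from the data
```

With subtraction the result lands on `x_0`; with addition it ends at `2 * eps - x_0`, further from the data than where it started.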

</details>

### What Just Happened

You trained a flow matching model end-to-end and generated samples via Euler ODE solving:

- **The training loop is simple.** No noise schedule, no alpha_bar lookup, no variance-preserving formulation. Just a linear interpolation for $x_t$, a subtraction for the target velocity, and MSE loss. Compare this to the DDPM training loop you implemented in the previous series.

- **Few steps work.** With just 5 Euler steps, the generated samples already resemble the two-moons distribution. This confirms what Exercise 2 showed: on nearly-straight trajectories, Euler's method converges quickly.

- **Diminishing returns from more steps.** Going from 10 to 50 steps shows minimal improvement. Once the trajectory is approximately straight, additional steps provide negligible accuracy gain.

- **1 step is imperfect.** The aggregate learned velocity field is not perfectly straight (remember: individual conditional paths are straight, but the average introduces some curvature). This is why rectified flow exists—to straighten the aggregate field.
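
The curvature of the aggregate field can be made concrete in a toy case. The sketch below assumes a 1D dataset with two equally likely points at $\pm 1$ (a hypothetical setup, not the two-moons data). Given $x_t = x$, each candidate $x_0$ implies a conditional velocity $(x - x_0)/t$, and the aggregate velocity is the posterior-weighted average of these.

```python
import math

DATA = [-1.0, 1.0]  # hypothetical 1D dataset: two equally likely points

def marginal_velocity(x, t):
    """E[eps - x_0 | x_t = x] for x_t = (1 - t) * x_0 + t * eps, eps ~ N(0, 1)."""
    # p(x_t | x_0) is Gaussian with mean (1 - t) * x_0 and standard deviation t
    logits = [-(x - (1 - t) * x0) ** 2 / (2 * t * t) for x0 in DATA]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    # Conditional velocity given x_0: eps - x_0 = (x - x_0) / t
    return sum(w / total * (x - x0) / t for w, x0 in zip(weights, DATA))

for t in [0.9, 0.5, 0.1]:
    print(f"x=0.9, t={t}: aggregate velocity {marginal_velocity(0.9, t):+.3f}")
```

Each conditional path has constant velocity, but the averaged velocity at a fixed position changes with $t$ as the posterior over $x_0$ sharpens, so trajectories that follow the aggregate field are not perfectly straight.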

---

## Exercise 4: Compare Flow Matching to DDPM on the Same Data `[Independent]`

You have trained a flow matching model on the two-moons distribution. Now train a DDPM model on the same data and compare them head-to-head.

**Your task:**

1. **Train a DDPM model** on two-moons using the noise prediction objective:
   - Use the cosine schedule from `make_cosine_schedule(T=100)`
   - Forward process: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$
   - Target: $\epsilon$ (the noise itself)
   - Loss: $\text{MSE}(\epsilon_\theta(x_t, t),\; \epsilon)$
   - Use the same `ToyModel` architecture and same hyperparameters (500 epochs, batch size 512, lr 3e-4)

2. **Implement DDPM sampling** (DDIM-style deterministic, for fair comparison):
   - Predict $\hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta) / \sqrt{\bar\alpha_t}$
   - Step: $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\, \epsilon_\theta$
   - To use N steps, create a sub-sequence of N evenly spaced timesteps from the full schedule

3. **Generate samples at varying step counts** (1, 5, 10, 20, 50) for both models.

4. **Plot the comparison:** for each step count, show flow matching samples vs DDPM samples side by side.

**Expected result:** Flow matching should produce recognizable samples at fewer steps than DDPM.

**Bonus:** Try one round of rectified flow—generate (noise, data) pairs using the trained flow matching model, then retrain on those pairs. Compare the rectified model's 1-step and 5-step samples to the original.

**Before running, predict:**
- At 5 steps, which model will produce better samples?
- At 50 steps, will the quality difference between the two models be large or small?
- At 1 step, will either model produce anything recognizable?

In [None]:
# Your code here.
#
# Suggested structure:
#
# 1. Train DDPM model:
#    - Create a new ToyModel for DDPM
#    - Use make_cosine_schedule(T=100) for the noise schedule
#    - Training loop: sample x_0, sample t (integer in [0, T-1]), sample epsilon,
#      compute x_t using DDPM forward process, predict epsilon, MSE loss
#    - IMPORTANT: the model takes continuous t, so normalize: t_input = t_int / T
#
# 2. Implement DDPM DDIM-style sampling:
#    def sample_ddpm(model, alpha_bars, n_samples, n_steps):
#        Create a subsequence of n_steps evenly spaced timesteps from [T-1, ..., 0]
#        For each step:
#            - Get alpha_bar at current and previous timestep
#            - Predict epsilon
#            - Compute predicted x_0
#            - Compute x_{t-1} via DDIM formula
#        Return final samples
#
# 3. Generate samples from both models at step counts [1, 5, 10, 20, 50]
#    Use the SAME random seed for starting noise so the comparison is fair.
#
# 4. Plot a grid: rows = step counts, columns = [Real Data, Flow Matching, DDPM]
#    Or: one row per step count with flow matching and DDPM side by side.
#
# Remember:
# - fm_model is already trained from Exercise 3
# - sample_flow_matching() is already defined from Exercise 3
# - Use torch.no_grad() for sampling
# - Clamp predicted x_0 to a reasonable range (e.g., -6 to 6) for stability



<details>
<summary>Solution</summary>

The core insight is that both models use the same architecture and see the same data. The only differences are the interpolation scheme, the training target, and the time parameterization. This controlled comparison isolates the effect of trajectory geometry on sample quality at low step counts.
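
Before reading the full solution, it may help to sanity-check the DDIM update in isolation. The sketch below is 1D and assumes an idealized continuous cosine schedule plus an oracle that returns the true noise; `alpha_bar` and `ddim_step` are hypothetical names. With a perfect noise prediction, one step should land exactly on the point that the same $(x_0, \epsilon)$ pair occupies at the earlier timestep.

```python
import math

def alpha_bar(t):
    """Idealized cosine schedule (assumption): alpha_bar = cos^2(pi * t / 2)."""
    return math.cos(0.5 * math.pi * t) ** 2

def ddim_step(x_t, eps_pred, ab_curr, ab_prev):
    """One deterministic DDIM step using the formulas from the task description."""
    x0_hat = (x_t - math.sqrt(1 - ab_curr) * eps_pred) / math.sqrt(ab_curr)
    return math.sqrt(ab_prev) * x0_hat + math.sqrt(1 - ab_prev) * eps_pred

x_0, eps = 5.0, -2.0
ab_curr, ab_prev = alpha_bar(0.8), alpha_bar(0.4)

x_t = math.sqrt(ab_curr) * x_0 + math.sqrt(1 - ab_curr) * eps        # forward process at t=0.8
x_prev = ddim_step(x_t, eps, ab_curr, ab_prev)                        # oracle noise prediction
expected = math.sqrt(ab_prev) * x_0 + math.sqrt(1 - ab_prev) * eps   # forward process at t=0.4

print(f"DDIM step: {x_prev:.6f}, forward process at t=0.4: {expected:.6f}")
```

With a learned model the noise prediction is imperfect and changes along the path, which is where the step-count sensitivity in the comparison comes from.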

```python
# ============================================================
# Step 1: Train a DDPM model on two-moons
# ============================================================

torch.manual_seed(42)

T = 100
alpha_bars = make_cosine_schedule(T)

ddpm_model = ToyModel(hidden_dim=128)
optimizer_ddpm = torch.optim.Adam(ddpm_model.parameters(), lr=3e-4)

ddpm_losses = []
print('Training DDPM model on two-moons...')

for epoch in range(500):
    x_0 = sample_two_moons(512)
    
    # Sample random integer timesteps
    t_int = torch.randint(0, T, (512,))
    epsilon = torch.randn_like(x_0)
    
    # DDPM forward process
    ab_t = alpha_bars[t_int].unsqueeze(-1)  # (batch, 1)
    x_t = torch.sqrt(ab_t) * x_0 + torch.sqrt(1 - ab_t) * epsilon
    
    # Normalize timestep to [0, 1] for the model
    t_input = t_int.float() / T
    
    # Predict noise
    eps_pred = ddpm_model(x_t, t_input.unsqueeze(-1))
    loss = nn.functional.mse_loss(eps_pred, epsilon)
    
    optimizer_ddpm.zero_grad()
    loss.backward()
    optimizer_ddpm.step()
    
    ddpm_losses.append(loss.item())
    if (epoch + 1) % 100 == 0:
        print(f'  Epoch {epoch+1}/500, loss: {loss.item():.4f}')

print(f'Final DDPM loss: {ddpm_losses[-1]:.4f}')


# ============================================================
# Step 2: DDPM sampling (DDIM-style deterministic)
# ============================================================

ddpm_model.eval()

@torch.no_grad()
def sample_ddpm(model, alpha_bars, n_samples, n_steps):
    """DDIM-style deterministic sampling from a DDPM model."""
    T = len(alpha_bars)
    
    # Create sub-sequence of timesteps
    # Evenly spaced from T-1 down to 0
    timesteps = torch.linspace(T - 1, 0, n_steps + 1).long()
    
    # Start from pure noise
    x = torch.randn(n_samples, 2)
    
    for i in range(len(timesteps) - 1):
        t_curr = timesteps[i]
        t_prev = timesteps[i + 1]
        
        ab_curr = alpha_bars[t_curr]
        ab_prev = alpha_bars[t_prev]  # t_prev >= 0 always: the linspace above ends at 0
        
        # Predict noise
        t_input = (t_curr.float() / T).unsqueeze(0).expand(n_samples, 1)
        eps_pred = model(x, t_input)
        
        # DDIM: predict x_0, then jump to t_prev
        pred_x0 = (x - torch.sqrt(1 - ab_curr) * eps_pred) / torch.sqrt(ab_curr)
        pred_x0 = pred_x0.clamp(-6, 6)  # Stability
        
        # Jump to t_prev
        x = torch.sqrt(ab_prev) * pred_x0 + torch.sqrt(1 - ab_prev) * eps_pred
    
    return pred_x0


# ============================================================
# Step 3 & 4: Generate and compare
# ============================================================

step_counts = [1, 5, 10, 20, 50]
n_samples = 500

# Real data for reference
torch.manual_seed(99)
real = sample_two_moons(n_samples)

fig, axes = plt.subplots(len(step_counts), 3, figsize=(12, 4 * len(step_counts)))

for row, n_steps in enumerate(step_counts):
    # Same starting noise for both
    torch.manual_seed(42)
    fm_samples = sample_flow_matching(fm_model, n_samples, n_steps)
    
    torch.manual_seed(42)
    ddpm_samples = sample_ddpm(ddpm_model, alpha_bars, n_samples, n_steps)
    
    # Real data
    axes[row, 0].scatter(real[:, 0], real[:, 1], s=3, alpha=0.5, c='#60a5fa')
    axes[row, 0].set_title('Real Data', fontsize=10)
    axes[row, 0].set_xlim(-4, 4); axes[row, 0].set_ylim(-4, 4)
    axes[row, 0].set_aspect('equal')
    axes[row, 0].set_ylabel(f'{n_steps} steps', fontsize=12, fontweight='bold')
    
    # Flow matching
    axes[row, 1].scatter(fm_samples[:, 0], fm_samples[:, 1], s=3, alpha=0.5, c='#34d399')
    axes[row, 1].set_title(f'Flow Matching ({n_steps} steps)', fontsize=10)
    axes[row, 1].set_xlim(-4, 4); axes[row, 1].set_ylim(-4, 4)
    axes[row, 1].set_aspect('equal')
    
    # DDPM
    axes[row, 2].scatter(ddpm_samples[:, 0], ddpm_samples[:, 1], s=3, alpha=0.5, c='#f59e0b')
    axes[row, 2].set_title(f'DDPM ({n_steps} steps)', fontsize=10)
    axes[row, 2].set_xlim(-4, 4); axes[row, 2].set_ylim(-4, 4)
    axes[row, 2].set_aspect('equal')

plt.suptitle(
    'Flow Matching vs DDPM: Sample Quality at Varying Step Counts\n'
    'Same architecture, same data, same number of training epochs',
    fontsize=13, y=1.01
)
plt.tight_layout()
plt.show()

print('Expected observations:')
print('- At 50 steps: both models produce good two-moons samples. Similar quality.')
print('- At 10-20 steps: flow matching still looks good. DDPM may show some degradation.')
print('- At 5 steps: flow matching is recognizable. DDPM is more distorted.')
print('- At 1 step: neither is great, but flow matching is closer to the target.')
print()
print('The advantage is not about ultimate quality (both converge with enough steps).')
print('The advantage is about efficiency: flow matching reaches good quality FASTER.')
```

**Key decisions:**
- Using DDIM-style deterministic sampling for DDPM (not stochastic DDPM) makes the comparison fair: both methods use the same ODE-solving approach, so the only difference is the trajectory shape.
- Normalizing DDPM's integer timesteps to [0, 1] for the model input keeps the architecture identical. The network is the same for DDPM and flow matching; only the interpolation, the training target, and the time parameterization differ.
- Using the same random seed for starting noise ensures both models start from the exact same noise vectors, so differences in the output are entirely due to the model, not the initialization.
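
The seeding point can be checked directly. The following is a minimal sketch, assuming each sampler draws its starting noise with a single `torch.randn(n_samples, 2)` call as its first random operation. That holds for `sample_ddpm` above; for `sample_flow_matching` from Exercise 3 it is an assumption. The helper `draw_start_noise` is hypothetical and exists only for this illustration.

```python
import torch

def draw_start_noise(seed, n_samples=500):
    """Hypothetical helper: reseed, then draw the noise a sampler would start from."""
    torch.manual_seed(seed)
    return torch.randn(n_samples, 2)

noise_a = draw_start_noise(42)
noise_b = draw_start_noise(42)  # same seed: identical noise
noise_c = draw_start_noise(43)  # different seed: different noise

print(torch.equal(noise_a, noise_b))  # True
print(torch.equal(noise_a, noise_c))  # False
```

If a sampler made any other random call before drawing its starting noise, the two models would no longer start from the same points even with the same seed.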

**Common mistakes:**
- Using stochastic DDPM sampling (adding noise at each step), which introduces randomness and makes the comparison unfair.
- Forgetting to normalize DDPM timesteps to [0, 1] for the model input, causing the model to receive out-of-range time values.
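- Not clamping the predicted $x_0$ in DDPM, which can lead to numerical instability in the first sampling steps (large $t$), where $\bar\alpha_t$ is very small and the division by $\sqrt{\bar\alpha_t}$ amplifies errors in the predicted noise.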
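
To get a feel for why the clamp matters, the sketch below computes how much an error in the predicted noise is magnified when it is converted to a predicted $x_0$. It assumes a standard cosine schedule with offset `s=0.008`; `cosine_alpha_bar` is a hypothetical stand-in, and the notebook's `make_cosine_schedule` may differ in its details.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Assumed cosine schedule (stand-in, not the notebook's make_cosine_schedule)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

T = 100
eps_error = 0.1  # illustrative error in the predicted noise

gains = {}
for t in [0, 50, 90, 99]:
    ab = cosine_alpha_bar(t, T)
    # pred_x0 = (x_t - sqrt(1 - ab) * eps_pred) / sqrt(ab), so an error in
    # eps_pred is scaled by sqrt(1 - ab) / sqrt(ab) when it reaches pred_x0.
    gains[t] = math.sqrt(1 - ab) / math.sqrt(ab)
    print(f"t={t:3d}  alpha_bar={ab:.5f}  x0 error: {gains[t] * eps_error:.3f}")
```

Under this assumed schedule the gain grows sharply as $t$ approaches $T$, so a small noise-prediction error at the start of sampling can throw the predicted $x_0$ far outside the data range unless it is clamped.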

</details>

---

## Key Takeaways

1. **Flow matching interpolation is a straight line; DDPM interpolation curves.** The linear coefficients $(1-t)$ and $t$ produce a straight path with constant velocity. The DDPM coefficients $\sqrt{\bar\alpha_t}$ and $\sqrt{1-\bar\alpha_t}$ produce a curved path with time-varying velocity. This is the fundamental geometric difference.

2. **On a straight path, Euler's method is exact in one step.** Curved paths cause Euler to overshoot at every step, requiring many small steps to stay accurate. A straight path with constant velocity has no curvature to overshoot, so a single step lands exactly at the endpoint. This is why flow matching models need fewer sampling steps.

3. **The flow matching training loop is simpler.** No noise schedule, no $\bar\alpha_t$, no cumulative products. Just $x_t = (1-t)\, x_0 + t\, \epsilon$ for the interpolation and $v = \epsilon - x_0$ for the target. Building each training pair takes only a weighted average and a subtraction.

4. **Flow matching reaches good sample quality at fewer steps than DDPM.** With the same architecture, data, and training budget, flow matching produces recognizable samples at 5–10 Euler steps where DDPM may need 20–50 steps. The advantage is efficiency, not ultimate quality: both converge with enough steps.

5. **Same architecture, different training target.** The `ToyModel` MLP was identical for both flow matching and DDPM. The only changes were the interpolation scheme, the training target (velocity vs noise), and the time parameterization (continuous uniform vs discrete schedule). This is why velocity prediction does not require a new architecture: it is a change of training objective, not of architecture.
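
Takeaway 2 can be verified numerically in 1D without any trained model. The sketch below integrates a known velocity from $t=1$ (noise) to $t=0$ (data) with Euler steps. The straight path uses the flow matching interpolation from takeaway 3. The curved path uses cosine and sine coefficients as an illustrative stand-in for the DDPM coefficients, which is an assumption and not the notebook's actual schedule. The values of `x0` and `eps` are arbitrary.

```python
import math

def euler_to_data(velocity, x_start, n_steps):
    """Integrate dx/dt = velocity(t) from t=1 (noise) to t=0 (data) with Euler steps."""
    x = x_start
    dt = -1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 + k * dt
        x = x + dt * velocity(t)
    return x

x0, eps = 1.0, 0.5  # illustrative 1D data point and noise sample
half_pi = math.pi / 2

def straight_v(t):
    # x_t = (1 - t) * x0 + t * eps, so the velocity is constant in t
    return eps - x0

def curved_v(t):
    # Stand-in curved path: x_t = cos(pi t / 2) * x0 + sin(pi t / 2) * eps
    return half_pi * (-math.sin(half_pi * t) * x0 + math.cos(half_pi * t) * eps)

errors = {}
for n in [1, 5, 10, 20, 50]:
    straight_err = abs(euler_to_data(straight_v, eps, n) - x0)
    curved_err = abs(euler_to_data(curved_v, eps, n) - x0)
    errors[n] = (straight_err, curved_err)
    print(f"{n:2d} steps  straight error={straight_err:.2e}  curved error={curved_err:.2e}")
```

The straight-path error stays at floating-point precision for every step count, while the curved-path error shrinks only as the number of steps grows. A trained model only approximates these ideal velocities, so the gap in practice is less clean than in this toy setting.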