# Diffusion Lab
Welcome to the student workbook for the diffusion lab. Each section mirrors the instructor notebook but leaves intentional gaps so you can reason about them. Use the provided hints, experiment often, and take notes about what surprised you.

> **Expectation:** You may consult AI tools, but you must understand and explain every line you accept. Write short reflections in the provided cells to solidify your learning. Some parts of the lab may be challenging; embrace the struggle as part of the learning process!

## Rules & Notes
- **Libraries:** Only use **NumPy** and **Matplotlib**.
- **Keep it simple:** Small dataset, small network, clear visuals.
- **Focus on understanding:** Read comments, run cells one-by-one, and observe the plots.
- **Workflow tip:** When you see **TODO**, stop and think before coding; if you get stuck, document what you tried before asking for help.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Reproducibility helpers
RNG_SEED = 42
rng = np.random.default_rng(RNG_SEED)

## 1) Build a small 2D toy dataset (a spiral)
We want a 2D dataset that makes it easy to **see** whether denoising works. A noisy spiral does the trick.

**TODO:** Complete `make_spiral`. Keep it simple: sample an angle `t`, map to `(x, y)`, add gentle Gaussian jitter, and normalize.

> **Hint:** The instructor version scales the radius linearly with the angle and then standardizes the data. You can mimic that or invent your own variation; just document it.
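
As one possible reading of "normalize", the sketch below standardizes a 2D point cloud so each coordinate has zero mean and unit standard deviation. The helper name `standardize` and the synthetic `raw` cloud are illustrative assumptions, not part of the instructor notebook; other normalizations (for example, rescaling into $[-1, 1]$) would also be defensible.

```python
import numpy as np

def standardize(points):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    mean = points.mean(axis=0, keepdims=True)
    std = points.std(axis=0, keepdims=True)
    return (points - mean) / std

demo_rng = np.random.default_rng(0)
# Hypothetical raw cloud: shifted and stretched differently along each axis
raw = demo_rng.normal(loc=[5.0, -2.0], scale=[3.0, 0.5], size=(500, 2))
scaled = standardize(raw)

print("mean per column:", scaled.mean(axis=0))
print("std per column: ", scaled.std(axis=0))
```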

In [None]:
def make_spiral(n_samples=1000, noise=0.02):
    """Return a normalized (n_samples, 2) array shaped like a spiral."""
    # TODO: implement using rng for randomness
    raise NotImplementedError("Fill in make_spiral to proceed.")

X = make_spiral()
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], s=5)
plt.title("Spiral dataset (normalized)")
plt.axis("equal")
plt.show()

## 2) Make a simple **noise schedule** (linear $\beta_t$)
A **noise schedule** tells us how much Gaussian noise to inject at each diffusion step. In DDPM we typically choose a simple schedule such as a linear ramp so that early steps barely disturb $x_0$ while later steps push the sample toward pure noise. The closed-form forward equation
$$ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon $$
depends on the cumulative product $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$, so carefully shaping $\beta_t$ directly controls how fast signal fades.
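
To build intuition for how the cumulative product controls signal decay, here is a minimal sketch that uses a constant $\beta$ instead of the linear ramp. The values `T_demo` and `beta_const` are arbitrary assumptions chosen for illustration. It prints the signal coefficient $\sqrt{\bar{\alpha}_t}$ and the noise coefficient $\sqrt{1-\bar{\alpha}_t}$ at a few steps.

```python
import numpy as np

T_demo = 100        # hypothetical number of diffusion steps
beta_const = 0.02   # hypothetical constant noise level per step

betas_demo = np.full(T_demo, beta_const)
alpha_bars_demo = np.cumprod(1.0 - betas_demo)   # running product of (1 - beta)

signal_coef = np.sqrt(alpha_bars_demo)        # multiplies x_0
noise_coef = np.sqrt(1.0 - alpha_bars_demo)   # multiplies epsilon

for step in (1, 25, 50, 100):
    # Steps are 1-based in the text, so index with step - 1
    print(f"t={step:3d}  signal={signal_coef[step - 1]:.3f}  noise={noise_coef[step - 1]:.3f}")
```

The two squared coefficients always sum to one, so signal that fades is replaced by an equal amount of noise variance.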

**TODOs:**
1. Implement `make_linear_beta_schedule` so it returns `betas`, `alphas`, and the cumulative product `alpha_bars`.
2. Plot the schedule to see if it matches your intuition.

> **Hint:** Remember `alpha_t = 1 - beta_t`. Use `np.cumprod` for the running product.

In [None]:
def make_linear_beta_schedule(T=100, beta_start=1e-4, beta_end=5e-3):
    """Return betas, alphas, and cumulative alpha_bars for T steps."""
    # TODO: fill in this function using np.linspace and np.cumprod
    raise NotImplementedError

T = 100
betas, alphas, alpha_bars = make_linear_beta_schedule(T=T)

plt.figure(figsize=(5,4))
plt.plot(betas)
plt.title("Linear beta schedule")
plt.xlabel("t")
plt.ylabel("beta_t")
plt.show()

plt.figure(figsize=(5,4))
plt.plot(alpha_bars)
plt.title("Cumulative product: alpha_bar_t")
plt.xlabel("t")
plt.ylabel("alpha_bar_t")
plt.show()

## 3) Forward diffusion: make $x_t$ from $x_0$
The closed-form equation lets us jump directly to any timestep.

**TODO:** Finish `q_sample` so it returns `(x_t, eps)` for either a scalar `t` or a vector of times. Recall the closed-form equation:
$$ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon. $$

> **Hint:** When `t` is a vector, you'll want to broadcast `alpha_bars[t-1]` to match the shape of `x0`.
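
The broadcasting pattern in the hint can be tried out on toy arrays first. The sketch below uses stand-in names (`points`, `coef_table`, `steps`) that are assumptions for illustration only; it shows the indexing and reshaping, not the full forward equation.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
points = demo_rng.normal(size=(4, 2))     # stand-in for x0, shape (N, 2)
coef_table = np.linspace(0.9, 0.1, 10)    # stand-in for a per-step table, shape (T,)
steps = np.array([1, 3, 7, 10])           # one 1-based timestep per sample

per_sample = coef_table[steps - 1]        # shape (N,): one coefficient per point
scaled = per_sample[:, None] * points     # (N, 1) * (N, 2) broadcasts to (N, 2)

# A scalar step can follow the same path once it is promoted to a 1-element array
scalar_coef = coef_table[np.atleast_1d(5) - 1]   # shape (1,)
scaled_scalar = scalar_coef[:, None] * points    # (1, 1) * (N, 2) broadcasts to (N, 2)

print(scaled.shape, scaled_scalar.shape)
```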

In [None]:
def q_sample(x0, t, alpha_bars):
    """Sample x_t and record the actual noise epsilon."""
    # TODO: support scalar or vector t values
    raise NotImplementedError
    
# Visualize noising at different t values
ts_to_show = [1, T//4, T//2, T]
fig, axes = plt.subplots(1, len(ts_to_show), figsize=(12,3))
for i, tt in enumerate(ts_to_show):
    xt, _ = q_sample(X, tt, alpha_bars)
    axes[i].scatter(xt[:,0], xt[:,1], s=5)
    axes[i].set_title(f"t={tt}")
    axes[i].set_aspect('equal', 'box')
plt.suptitle("Forward diffusion: adding noise over time")
plt.show()

## 4) A multilayer network to predict the noise $\hat{\epsilon}$
In DDPM we train a model to estimate the noise $\epsilon$ injected during the forward process. If a network can **recover that noise** from $x_t$ and the current timestep $t$, the reverse sampler can subtract it and gradually turn noise back into data.

Remember from Section 1 that you normalized the spiral so each feature had roughly unit scale. That choice strongly influences how large your initial weights should be. If the data is standardized, smaller ranges often keep activations in a sensible region. Keep this connection in mind while you tune the network.

The network receives three input features per sample:

- $x_t, y_t$: the two coordinates of the noised point at time $t$.
- $t/T$: the normalized scalar timestep, so the model knows how much noise to expect.
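
To see how the initialization scale interacts with roughly unit-scale inputs, the sketch below pushes random data through two ReLU layers and reports the spread of the hidden activations. The helper `hidden_std` is hypothetical, and the inputs are standard-normal stand-ins (an assumption; the real third feature $t/T$ is not Gaussian).

```python
import numpy as np

def hidden_std(init_std, n_samples=512, input_dim=3, hidden=128, seed=0):
    """Return the standard deviation of the two hidden layers for a given init scale."""
    demo_rng = np.random.default_rng(seed)
    x = demo_rng.normal(size=(n_samples, input_dim))            # unit-scale inputs
    W1 = demo_rng.normal(0.0, init_std, size=(input_dim, hidden))
    W2 = demo_rng.normal(0.0, init_std, size=(hidden, hidden))
    h1 = np.maximum(0.0, x @ W1)    # first ReLU layer (biases start at zero)
    h2 = np.maximum(0.0, h1 @ W2)   # second ReLU layer
    return h1.std(), h2.std()

for init_std in (0.01, 0.1, 1.0):
    s1, s2 = hidden_std(init_std)
    print(f"init_std={init_std:<5}  std(h1)={s1:.4f}  std(h2)={s2:.4f}")
```

Very small scales shrink the activations toward zero layer by layer, while large scales inflate them; either extreme tends to make the early training steps harder to tune.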

**TODO:** Work through the linear layers + ReLU in `forward_network` and `backward_network` until you can explain every line, then choose and justify values for `WEIGHT_INITIALIZATION_RANGE` and `DEFAULT_MOMENTUM`. Explain how these choices interact with your data normalization and whether momentum is truly required.

> **Hint:** This is a plain 3-layer MLP. Start from matrix multiplication + bias addition + ReLU, nothing fancy.

In [None]:
# TODO: set these after experimenting with normalization + training stability
WEIGHT_INITIALIZATION_RANGE = None
DEFAULT_MOMENTUM = None

def relu(x):
    return np.maximum(0.0, x)

def relu_deriv(x):
    return (x > 0).astype(np.float32)

def MultilayerNetwork(input_dim=3, hidden=128, out_dim=2, rng=rng, momentum=DEFAULT_MOMENTUM):
    if WEIGHT_INITIALIZATION_RANGE is None or momentum is None:
        raise ValueError("Set WEIGHT_INITIALIZATION_RANGE and DEFAULT_MOMENTUM after investigating their effects.")
    net = {
        "W1": rng.normal(0, WEIGHT_INITIALIZATION_RANGE, size=(input_dim, hidden)).astype(np.float32),
        "b1": np.zeros(hidden, dtype=np.float32),
        "W2": rng.normal(0, WEIGHT_INITIALIZATION_RANGE, size=(hidden, hidden)).astype(np.float32),
        "b2": np.zeros(hidden, dtype=np.float32),
        "W3": rng.normal(0, WEIGHT_INITIALIZATION_RANGE, size=(hidden, out_dim)).astype(np.float32),
        "b3": np.zeros(out_dim, dtype=np.float32),
        "momentum": momentum,
    }
    return net

def forward_network(net, x):
    z1 = x @ net["W1"] + net["b1"]
    h1 = relu(z1)
    z2 = h1 @ net["W2"] + net["b2"]
    h2 = relu(z2)
    out = h2 @ net["W3"] + net["b3"]
    cache = (x, z1, h1, z2, h2)
    return out, cache

def backward_network(net, cache, grad_out, lr=1e-3):
    x, z1, h1, z2, h2 = cache
    dW3 = h2.T @ grad_out
    db3 = grad_out.sum(axis=0)
    dh2 = grad_out @ net["W3"].T
    dz2 = dh2 * relu_deriv(z2)
    dW2 = h1.T @ dz2
    db2 = dz2.sum(axis=0)
    dh1 = dz2 @ net["W2"].T
    dz1 = dh1 * relu_deriv(z1)
    dW1 = x.T @ dz1
    db1 = dz1.sum(axis=0)
    grads = {
        "W3": dW3,
        "b3": db3,
        "W2": dW2,
        "b2": db2,
        "W1": dW1,
        "b1": db1,
    }
    for name, grad in grads.items():
        net[name] -= lr * net["momentum"] * grad

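def mse_loss(pred, target):
    """Return the mean squared error and its gradient with respect to pred."""
    diff = pred - target
    # The mean runs over every element, so the gradient is scaled by diff.size.
    return np.mean(diff**2), (2.0 / diff.size) * diff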
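
Note that `backward_network` uses `net["momentum"]` as a plain multiplier on the step, so it rescales the learning rate rather than accumulating past gradients. For comparison when you justify `DEFAULT_MOMENTUM`, here is a minimal sketch of classical (heavy-ball) momentum on a toy quadratic. The names `momentum_step` and `velocity` are hypothetical and not part of the lab code.

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=1e-2, momentum=0.9):
    """Classical momentum: v <- momentum * v - lr * grad, then p <- p + v."""
    for name in params:
        velocity[name] = momentum * velocity[name] - lr * grads[name]
        params[name] += velocity[name]

# Toy problem: minimize 0.5 * ||w||^2, whose gradient is simply w
params = {"w": np.array([1.0, -2.0])}
velocity = {"w": np.zeros_like(params["w"])}

for _ in range(200):
    grads = {"w": params["w"].copy()}
    momentum_step(params, grads, velocity)

print("final w:", params["w"])
```

The key difference is the `velocity` buffer: each update blends the previous direction with the new gradient instead of only scaling the current gradient.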

## 5) Train to predict noise
Now that you can sample $x_t$ and build a small network, stitch everything together. Each iteration should (1) use the whole dataset, (2) pick random timesteps, (3) generate $x_t$ with `q_sample`, (4) feed `[x_t, t/T]` through the network, and (5) measure the MSE between predicted and true noise. Track a smoothed loss to see trends.
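
Steps (2) and (4) can be rehearsed on stand-in data before you write the full loop. The sketch below assumes one timestep per sample drawn uniformly from $1..T$; `x_t_demo`, `t_batch`, and `T_demo` are illustrative names, and the noised points are random placeholders rather than real `q_sample` output.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
N, T_demo = 6, 100
x_t_demo = demo_rng.normal(size=(N, 2)).astype(np.float32)   # placeholder for noised points

# One random 1-based timestep per sample (upper bound is exclusive, hence T_demo + 1)
t_batch = demo_rng.integers(1, T_demo + 1, size=N)

# Normalize the timestep and make it a column so it can sit next to the coordinates
t_scaled = (t_batch / T_demo).astype(np.float32)[:, None]    # shape (N, 1)
net_in = np.concatenate([x_t_demo, t_scaled], axis=1)        # shape (N, 3)

print(net_in.shape, net_in.dtype)
```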

**TODO:** Write the entire training loop yourself; only the `train` stub is provided below. Describe and justify each design choice (learning rate, smoothing, momentum) in comments or short markdown notes. Why is each choice appropriate for this spiral dataset?
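
One common way to track a smoothed loss is an exponential moving average. The sketch below is an assumption about what "smoothed" could mean here, not the instructor's method; `ema_smooth`, `decay`, and the synthetic `raw_losses` curve are hypothetical.

```python
import numpy as np

def ema_smooth(values, decay=0.9):
    """Exponential moving average; the first value seeds the running estimate."""
    smoothed = []
    running = None
    for v in values:
        running = v if running is None else decay * running + (1.0 - decay) * v
        smoothed.append(running)
    return np.array(smoothed)

demo_rng = np.random.default_rng(0)
# Synthetic noisy loss curve: a slow downward trend plus per-step jitter
raw_losses = 1.0 / np.sqrt(np.arange(1, 201)) + 0.1 * demo_rng.normal(size=200)
smooth_losses = ema_smooth(raw_losses, decay=0.95)

print("raw step-to-step std:   ", np.std(np.diff(raw_losses)))
print("smooth step-to-step std:", np.std(np.diff(smooth_losses)))
```

A decay close to 1 gives a steadier but slower-reacting curve; a small decay follows the raw loss closely and keeps most of its jitter.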

In [None]:
network = MultilayerNetwork(input_dim=3, hidden=128, out_dim=2, rng=rng , momentum=DEFAULT_MOMENTUM)

def train(network, X, T, alpha_bars, steps, lr):
    # TODO: implement the full steepest-descent routine described above
    pass

# TODO: choose reasonable values for steps and lr before running

losses = train(network, X, T, alpha_bars, steps=None, lr=None)  # replace both None placeholders

plt.figure(figsize=(5,4))
plt.plot(losses)
plt.title("Training loss (steepest descent)")
plt.xlabel("step")
plt.ylabel("MSE")
plt.show()

### Reflection
What happens if you reduce the momentum dramatically (e.g., from 0.999 to 0.5)? Why?

In [None]:
reflection = "TODO: discuss smoothing behavior."
print(reflection)

## 6) Sampling: generate new data by reversing diffusion
The function $p_\theta(x_{t-1}\mid x_t)$ is the learned reverse transition: given a noisy sample $x_t$, we predict the noise $\hat{\epsilon}$ and use it to recover a denoised estimate $x_{t-1}$. In a DDPM this is the core step that turns pure Gaussian noise back into data, so implementing `p_sample` correctly is essential.

Now use your trained network to walk backward from pure Gaussian noise to the data manifold. Every reverse step should:
1. Build the same conditioning vector `[x_t, t/T]` you used during training.
2. Predict the noise $\hat{\epsilon}$ and plug it into the DDPM update equation.
3. Optionally add fresh Gaussian noise unless you're at $t=1$.
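
For reference, steps 2 and 3 correspond to the standard DDPM reverse update. The form below assumes the simple variance choice $\sigma_t^2 = \beta_t$, which matches `sigma = np.sqrt(beta_t)` in the `p_sample` starter code; other variance choices exist.
$$ x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I) $$
The first term is the mean; at $t=1$ the $\sigma_t z$ term is dropped.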

**TODOs (estimated time: 2 hours):**
- Fill in the missing math inside `p_sample` (mean term and the optional noise injection).
- Reuse the instructor-style `p_sample_loop` to log snapshots at the times in `ts_to_show` so you can compare with the forward plots.
- Explain in markdown how the snapshots confirm (or contradict) your expectations about the denoising path.

> **Hint:** The forward helper already computes $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$; reuse the same expressions here.

In [None]:
def p_sample(network, x_t, t, betas, alphas, alpha_bars):
    t_scaled = np.full((x_t.shape[0],1), t/len(betas), dtype=np.float32)
    net_in = np.concatenate([x_t, t_scaled], axis=1)
    eps_hat, _ = forward_network(network, net_in)

    a_t = alphas[t-1]
    ab_t = alpha_bars[t-1]
    beta_t = betas[t-1]

    # TODO: derive the DDPM mean using eps_hat (see Section 6 equation)
    mean_part = None
    raise NotImplementedError("Replace with the correct mean expression.")

    if t > 1:
        z = rng.normal(0.0, 1.0, size=x_t.shape).astype(np.float32)
        sigma = np.sqrt(beta_t)
        # TODO: combine the mean with sigma * z
        raise NotImplementedError("Combine mean and stochastic term when t>1.")
    else:
        x_prev = mean_part
    return x_prev

def p_sample_loop(network, n_samples, T, betas, alphas, alpha_bars, ts_to_show=None):
    x_t = rng.normal(0.0, 1.0, size=(n_samples, 2)).astype(np.float32)
    for t in range(T, 0, -1):
        x_t = p_sample(network, x_t, t, betas, alphas, alpha_bars)
        if ts_to_show is not None and t in ts_to_show:
            plt.figure(figsize=(4,4))
            plt.scatter(X[:,0], X[:,1], s=5, label="real")
            plt.scatter(x_t[:,0], x_t[:,1], s=5, label="generated")
            plt.title(f"Generated samples at t={t}")
            plt.axis("equal")
            plt.legend()
            plt.show()
    return x_t

ts_to_show = [1, T//4, T//2, T]
gen = p_sample_loop(network, n_samples=1000, T=T, betas=betas, alphas=alphas, alpha_bars=alpha_bars, ts_to_show=ts_to_show)

plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], s=5, label="real")
plt.scatter(gen[:,0], gen[:,1], s=5, label="generated")
plt.legend()
plt.title("Final Real vs Generated (simple DDPM)")
plt.axis("equal")
plt.show()

### Reflection
Compare the visual storytelling between the forward and reverse plots. What does each snapshot teach you about the diffusion process?

In [None]:
reflection = "TODO: describe insights from the reverse snapshots."
print(reflection)

## 7) Discussion & Experiments
Run at least two mini-investigations and jot down short answers:
1. **Noise schedule tweak:** What happens to sample quality if you halve or double the final $\beta_T$? Why?
2. **Network capacity:** Does increasing the hidden width improve denoising, or does it overfit? Provide evidence.
3. **Sampler behavior:** Try removing the stochastic term in `p_sample` for the last few steps. How does that change the visuals?
4. **Failure analysis:** Capture one failure case (e.g., mode collapse) and explain what you think caused it.
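
Before running the noise schedule experiment, it can help to check how much signal survives at the final step. The sketch below assumes the linear schedule is built with `np.linspace` between `beta_start` and `beta_end`, as the Section 2 TODO suggests; `final_alpha_bar` is a hypothetical helper.

```python
import numpy as np

def final_alpha_bar(beta_end, T=100, beta_start=1e-4):
    """Return alpha_bar_T for a linear beta ramp from beta_start to beta_end."""
    betas = np.linspace(beta_start, beta_end, T)
    return float(np.prod(1.0 - betas))

# Half, equal to, and double the default final beta used in Section 2
for beta_end in (2.5e-3, 5e-3, 1e-2):
    ab_T = final_alpha_bar(beta_end)
    print(f"beta_end={beta_end:.4f}  alpha_bar_T={ab_T:.3f}  sqrt(alpha_bar_T)={np.sqrt(ab_T):.3f}")
```

If $\bar{\alpha}_T$ is far from zero, the forward process at $t=T$ still carries a lot of signal, which is worth keeping in mind because the sampler starts from pure Gaussian noise.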

Conclude with one concrete lesson about diffusion models that you didn't appreciate before running these experiments.

In [None]:
lab_report = "TODO: summarize experiments and lessons learned."
print(lab_report)