# Module 1: Diffusion Models - Theory and Practice

**📍 Notebook 2 of 8**

## 💻 GPU Requirements
**✅ No GPU needed!** All examples run on CPU.

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand what diffusion models are and how they work
2. Grasp the forward diffusion process (adding noise)
3. Grasp the reverse diffusion process (denoising)
4. Implement a simple 1D diffusion model from scratch
5. Visualize the diffusion process in 2D
6. Understand how this applies to protein design

## 📚 Prerequisites

- Basic probability (Gaussian distributions)
- Python and NumPy
- Basic understanding of neural networks (optional for theory)

---

## 🤔 What Are Diffusion Models?

**Core Idea**: Learn to generate data by gradually removing noise.

### Analogy: The Sculptor's Approach
- Traditional GANs: Sculpt a statue from nothing in one shot ⚡
- Diffusion Models: Gradually chip away at a block of marble, step by step 🔨

### Two Processes:

**Forward Process** (Training time):
```
Real Data → Add Noise → Add More Noise → ... → Pure Noise
  x₀      →    x₁     →      x₂        → ... →    xₜ
```

**Reverse Process** (Generation time):
```
Pure Noise → Denoise → Denoise More → ... → Generated Data
    xₜ     →  xₜ₋₁   →     xₜ₋₂     → ... →      x₀
```

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("✅ Libraries loaded successfully!")

## 🎲 The Forward Diffusion Process

### Mathematical Formulation

Starting with data `x₀`, we add Gaussian noise over `T` timesteps:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

Where:
- $\beta_t$ is the noise schedule (how much noise to add at step $t$)
- As $t$ increases, $\beta_t$ typically increases
- At $t=T$, provided the schedule adds enough total noise, the data is approximately pure noise: $x_T$ is approximately distributed as $\mathcal{N}(0, I)$

### Key Insight: Closed Form

We can jump directly to any timestep without intermediate steps:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

Where $\bar{\alpha}_t = \prod_{i=1}^t (1-\beta_i)$

**This is crucial for efficient training:** we can sample $x_t$ at any random timestep in a single step instead of simulating the whole chain.
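
As a quick sanity check on the closed form, the sketch below pushes many copies of the same point through the single-step kernel and compares the empirical mean and variance at the final step with $\sqrt{\bar{\alpha}_T}\,x_0$ and $1-\bar{\alpha}_T$. The sample count, seed, and schedule endpoints are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(0.0001, 0.02, T)   # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = 5.0
n_chains = 50_000                        # illustrative sample count

# Step-by-step Markov chain: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
x = np.full(n_chains, x0)
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n_chains)

# Closed-form prediction for the same (final) timestep
closed_mean = np.sqrt(alphas_cumprod[-1]) * x0
closed_var = 1.0 - alphas_cumprod[-1]

print(f"stepwise mean {x.mean():.3f} vs closed form {closed_mean:.3f}")
print(f"stepwise var  {x.var():.3f} vs closed form {closed_var:.3f}")
```

The two pairs of numbers should agree up to sampling error, which is why training code can skip the loop and draw $x_t$ directly.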

In [None]:
# Implement forward diffusion in 1D
def forward_diffusion_1d(x0, timesteps=100):
    """
    Apply forward diffusion to 1D data.
    
    Args:
        x0: Original data point (scalar)
        timesteps: Number of diffusion steps
    
    Returns:
        x_trajectory: Array of noisy versions at each timestep
        alphas_cumprod: Cumulative product of (1-beta)
    """
    # Define noise schedule (linear)
    betas = np.linspace(0.0001, 0.02, timesteps)
    alphas = 1 - betas
    alphas_cumprod = np.cumprod(alphas)
    
    # Store trajectory
    x_trajectory = np.zeros(timesteps + 1)
    x_trajectory[0] = x0
    
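    # Draw x_t for every timestep. Note: each x_t is sampled independently
    # from q(x_t | x_0), so this is not one continuous Markov-chain path.
    for t in range(1, timesteps + 1):
        # Sample fresh standard Gaussian noise for this timestep
        noise = np.random.randn()
        
        # Jump straight to timestep t using the closed form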
        sqrt_alpha_cumprod = np.sqrt(alphas_cumprod[t-1])
        sqrt_one_minus_alpha_cumprod = np.sqrt(1 - alphas_cumprod[t-1])
        
        x_trajectory[t] = sqrt_alpha_cumprod * x0 + sqrt_one_minus_alpha_cumprod * noise
    
    return x_trajectory, alphas_cumprod

# Test with a simple value
x0 = 5.0  # Original data point
x_traj, alphas = forward_diffusion_1d(x0, timesteps=100)

# Visualize
plt.figure(figsize=(12, 4))
plt.plot(x_traj, linewidth=2)
plt.axhline(y=x0, color='r', linestyle='--', label=f'Original value: {x0}')
plt.axhline(y=0, color='g', linestyle='--', alpha=0.5, label='Pure noise mean: 0')
plt.xlabel('Timestep', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.title('Forward Diffusion: Gradually Adding Noise', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Starting value: {x0}")
print(f"After 25 steps: {x_traj[25]:.3f}")
print(f"After 50 steps: {x_traj[50]:.3f}")
print(f"After 100 steps: {x_traj[100]:.3f} (noisier; mean is still sqrt(alpha_bar) * x0 = {np.sqrt(alphas[-1]) * x0:.3f})")


## 🔙 The Reverse Diffusion Process

### The Goal

Learn to predict $x_{t-1}$ from $x_t$ (remove noise step by step).

### Reverse Distribution

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

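Where $\mu_\theta$ is the mean predicted by a neural network. The covariance $\Sigma_\theta$ can also be learned, but it is often fixed to a schedule-dependent constant (as in DDPM).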

### Training Objective

Train the network to predict the **noise** that was added:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} [\|\epsilon - \epsilon_\theta(x_t, t)\|^2]$$

Where:
- $\epsilon$ is the actual noise added
- $\epsilon_\theta$ is the network's prediction
- $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$
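
The objective above can be estimated with a few lines of NumPy. The sketch below is a minimal illustration, not a training loop: `diffusion_loss` and `zero_model` are hypothetical names, the "model" is a placeholder that always predicts zero noise, and the toy 2D batch is random data.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(0.0001, 0.02, T)      # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def diffusion_loss(eps_model, x0_batch, rng):
    """Monte Carlo estimate of E[||eps - eps_theta(x_t, t)||^2] for one batch."""
    n = x0_batch.shape[0]
    t = rng.integers(0, T, size=n)                   # random 0-based timestep per sample
    eps = rng.standard_normal(x0_batch.shape)        # the noise the model should recover
    a_bar = alphas_cumprod[t][:, None]               # shape (n, 1) for broadcasting
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    pred = eps_model(x_t, t)
    return np.mean(np.sum((eps - pred) ** 2, axis=1))

# Hypothetical stand-in for a network: always predicts zero noise
def zero_model(x_t, t):
    return np.zeros_like(x_t)

x0_batch = rng.standard_normal((4096, 2))            # toy 2D "dataset"
loss_zero = diffusion_loss(zero_model, x0_batch, rng)
print(f"loss of the zero predictor: {loss_zero:.3f}")
```

For a predictor that always returns zero, the loss should come out close to the data dimension (2 here), since $\mathbb{E}\|\epsilon\|^2$ equals the number of dimensions. A trained network should do better than this baseline.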

**Once trained, we can sample by starting from noise and iteratively denoising!**
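
A minimal sketch of that sampling loop follows, using the DDPM-style update $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$ with $\alpha_t = 1 - \beta_t$. To keep it self-contained, the "network" is replaced by an analytic noise predictor for a toy dataset containing a single value; `oracle_eps`, `sample`, and `X0_TARGET` are names made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(0.0001, 0.02, T)      # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

X0_TARGET = 5.0   # toy "dataset" containing a single value

def oracle_eps(x_t, t):
    """Optimal noise prediction for single-point data; a trained network would replace this."""
    return (x_t - np.sqrt(alphas_cumprod[t]) * X0_TARGET) / np.sqrt(1.0 - alphas_cumprod[t])

def sample(eps_model, n_samples, rng):
    x = rng.standard_normal(n_samples)               # start from N(0, 1)
    for t in range(T - 1, -1, -1):                   # 0-based index, last step first
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # sigma_t^2 = beta_t is one of the variance choices discussed in DDPM
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(n_samples)
        else:
            x = mean                                 # no noise on the final step
    return x

samples = sample(oracle_eps, n_samples=1000, rng=rng)
print(f"mean of samples: {samples.mean():.4f}, spread: {samples.std():.2e}")
```

Because the toy data distribution is a single point, every sample should land on that point; with a real dataset and a learned $\epsilon_\theta$, the same loop produces varied samples.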

In [None]:
# Simple denoising function (oracle - knows the true data)
# In practice, this would be a trained neural network
def denoise_step(xt, t, x0_true, alphas_cumprod):
    """
    Single reverse diffusion step (oracle version).
    
    In real diffusion models, we'd use a neural network to predict
    the noise instead of using the true x0.
    """
    alpha_cumprod_t = alphas_cumprod[t]
    alpha_cumprod_prev = alphas_cumprod[t-1] if t > 0 else 1.0
    
    # Predict x0 from xt (in practice, network predicts noise)
    # Here we cheat and use the true x0 for demonstration
    predicted_x0 = x0_true
    
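    # Compute mean of q(x_{t-1} | x_t, x_0) (the forward-process posterior)
    coef1 = np.sqrt(alpha_cumprod_prev) * (1 - alpha_cumprod_t / alpha_cumprod_prev)
    # The x_t coefficient uses the single-step alpha_t = alpha_bar_t / alpha_bar_{t-1}
    coef2 = np.sqrt(alpha_cumprod_t / alpha_cumprod_prev) * (1 - alpha_cumprod_prev)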
    mean = (coef1 * predicted_x0 + coef2 * xt) / (1 - alpha_cumprod_t)
    
    # Add small noise (except at t=0)
    if t > 0:
        variance = (1 - alpha_cumprod_prev) / (1 - alpha_cumprod_t) * (1 - alpha_cumprod_t / alpha_cumprod_prev)
        noise = np.random.randn() * np.sqrt(variance)
        return mean + noise
    else:
        return mean

# Perform reverse diffusion
x_reverse = np.zeros(101)
x_reverse[-1] = x_traj[-1]  # Start from noisy version

for t in range(99, -1, -1):
    x_reverse[t] = denoise_step(x_reverse[t+1], t, x0, alphas)

# Visualize forward and reverse
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Forward process
ax1.plot(x_traj, linewidth=2, color='#E63946')
ax1.axhline(y=x0, color='black', linestyle='--', alpha=0.5)
ax1.set_xlabel('Timestep', fontsize=11)
ax1.set_ylabel('Value', fontsize=11)
ax1.set_title('Forward: Data → Noise', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Reverse process
ax2.plot(x_reverse, linewidth=2, color='#457B9D')
ax2.axhline(y=x0, color='black', linestyle='--', alpha=0.5, label=f'Target: {x0}')
ax2.set_xlabel('Timestep', fontsize=11)
ax2.set_ylabel('Value', fontsize=11)
ax2.set_title('Reverse: Noise → Data', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Original value: {x0}")
print(f"Recovered value: {x_reverse[0]:.3f}")
print(f"Error: {abs(x_reverse[0] - x0):.3f}")
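
The oracle above plugs in the true $x_0$. A trained model typically outputs a noise estimate instead, which can be converted to an $x_0$ estimate by inverting $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. The following sketch (helper names are hypothetical, and `t` is a 0-based index into the schedule) checks numerically that the posterior mean written in terms of $x_0$ agrees with the equivalent expression in terms of $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(0.0001, 0.02, T)      # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def predict_x0_from_eps(x_t, eps, t):
    """Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps for x0."""
    a_bar = alphas_cumprod[t]
    return (x_t - np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a_bar)

def posterior_mean_from_x0(x_t, x0, t):
    """Mean of q(x_{t-1} | x_t, x0), with t a 0-based index into the schedule."""
    a_bar = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - a_bar)
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, t):
    """Same mean, expressed through the noise instead of x0."""
    return (x_t - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps) / np.sqrt(alphas[t])

t = 60
x0, eps = 5.0, rng.standard_normal()
x_t = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * eps

x0_hat = predict_x0_from_eps(x_t, eps, t)
mean_a = posterior_mean_from_x0(x_t, x0_hat, t)
mean_b = posterior_mean_from_eps(x_t, eps, t)
print(np.isclose(x0_hat, x0), np.isclose(mean_a, mean_b))
```

Both checks should pass, which is why predicting the noise and predicting $x_0$ are interchangeable ways to parameterize the reverse step.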

## 🎨 2D Example: Swiss Roll Dataset

Let's see diffusion in action on 2D data (more visual!)

In [None]:
# Generate Swiss Roll data
def make_swiss_roll(n_samples=1000):
    """Generate 2D Swiss Roll dataset."""
    t = 3 * np.pi * (1 + 2 * np.random.rand(n_samples))
    x = t * np.cos(t)
    y = t * np.sin(t)
    X = np.stack([x, y], axis=1)
    return X / 10.0  # Scale down

# Forward diffusion for 2D data
def forward_diffusion_2d(X, t, betas):
    """Apply forward diffusion to 2D data."""
    alphas = 1 - betas
    alphas_cumprod = np.cumprod(alphas)
    
    sqrt_alpha_cumprod = np.sqrt(alphas_cumprod[t])
    sqrt_one_minus_alpha_cumprod = np.sqrt(1 - alphas_cumprod[t])
    
    noise = np.random.randn(*X.shape)
    X_noisy = sqrt_alpha_cumprod * X + sqrt_one_minus_alpha_cumprod * noise
    
    return X_noisy

# Generate data
X_original = make_swiss_roll(500)

# Define noise schedule
T = 100
betas = np.linspace(0.0001, 0.02, T)

# Apply diffusion at different timesteps
timesteps = [0, 10, 25, 50, 100]
fig, axes = plt.subplots(1, len(timesteps), figsize=(18, 3))

for idx, t in enumerate(timesteps):
    if t == 0:
        X_t = X_original
        title = "Original Data (t=0)"
    else:
        X_t = forward_diffusion_2d(X_original, t-1, betas)
        title = f"t={t}"
    
    axes[idx].scatter(X_t[:, 0], X_t[:, 1], s=2, alpha=0.6, c=range(len(X_t)), cmap='viridis')
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].set_xlim(-4, 4)
    axes[idx].set_ylim(-4, 4)
    axes[idx].set_aspect('equal')
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Forward Diffusion: Structure → Noise', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Notice how the structured Swiss Roll gradually becomes random noise!")

## 🧬 Connection to Protein Design

Now let's connect this to RFDiffusion:

### What Gets Diffused?
Instead of 2D points, we diffuse **3D protein backbone coordinates**!

- **x₀**: Valid protein structure (backbone atom positions)
- **x_t**: Progressively noisier structure
- **x_T**: Completely random positions (no structure)

### Key Differences for Proteins:

1. **SE(3) Equivariance**: Network must respect rotations/translations
2. **Constraints**: Bond lengths, angles must be reasonable
3. **Conditioning**: Can condition on motifs, symmetry, binding partners
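
To make the first point concrete: a rigid motion (rotation plus translation) changes every coordinate, yet describes the same structure. The sketch below uses a made-up random "backbone" rather than real protein data, applies a random rotation and translation, and checks that the pairwise distance matrix is unchanged. It only illustrates the invariance of one geometric feature; it is not a description of how RFDiffusion's network is built.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(rng):
    """Random 3x3 rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))          # fix column signs so the factorization is unique
    if np.linalg.det(q) < 0:             # flip one axis to get a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def pairwise_distances(coords):
    """All-vs-all Euclidean distances for an (N, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

coords = rng.standard_normal((30, 3))    # toy stand-in for 30 backbone atoms
R = random_rotation(rng)
shift = rng.standard_normal(3)

moved = coords @ R.T + shift             # apply the same rigid motion to every atom

print("coordinates changed:", not np.allclose(coords, moved))
print("distances preserved:", np.allclose(pairwise_distances(coords), pairwise_distances(moved)))
```

A model for proteins should treat `coords` and `moved` as the same structure, which is the practical meaning of respecting rotations and translations.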

### The Process:

```
Valid Protein → Add Noise to Coordinates → Pure Random Positions
    (structured)                             (no structure)
                      ↓ TRAINING ↓
Pure Random → Neural Network Denoises → Valid Protein Structure
                (learns from real proteins)
```

In [None]:
# Simulate protein backbone diffusion (simplified 2D projection)
# In reality, proteins are 3D, but we'll visualize in 2D for clarity

def make_helix_2d(n_residues=20):
    """Generate a simple helix pattern in 2D (like looking down a helix)."""
    t = np.linspace(0, 4*np.pi, n_residues)
    r = np.linspace(0.5, 2, n_residues)  # Expanding radius
    x = r * np.cos(t)
    y = r * np.sin(t)
    return np.stack([x, y], axis=1)

# Create a simple "protein" (helix)
protein_backbone = make_helix_2d(30)

# Apply diffusion
timesteps_protein = [0, 20, 50, 100]
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for idx, t in enumerate(timesteps_protein):
    if t == 0:
        X_t = protein_backbone
        title = "Original Structure\n(t=0)"
    else:
        X_t = forward_diffusion_2d(protein_backbone, t-1, betas)
        title = f"Diffused\n(t={t})"
    
    # Top row: scatter plot
    axes[0, idx].plot(X_t[:, 0], X_t[:, 1], 'o-', markersize=6, linewidth=1.5, alpha=0.7)
    axes[0, idx].set_title(title, fontsize=11, fontweight='bold')
    axes[0, idx].set_xlim(-4, 4)
    axes[0, idx].set_ylim(-4, 4)
    axes[0, idx].set_aspect('equal')
    axes[0, idx].grid(True, alpha=0.3)
    axes[0, idx].set_xlabel('X coordinate (Å)', fontsize=9)
    axes[0, idx].set_ylabel('Y coordinate (Å)', fontsize=9)
    
    # Bottom row: pairwise distance matrix (shows structure), computed with
    # NumPy broadcasting so no import is needed inside the loop
    dist_matrix = np.linalg.norm(X_t[:, None, :] - X_t[None, :, :], axis=-1)
    im = axes[1, idx].imshow(dist_matrix, cmap='viridis', aspect='auto')
    axes[1, idx].set_title('Distance Matrix', fontsize=10)
    axes[1, idx].set_xlabel('Residue', fontsize=9)
    axes[1, idx].set_ylabel('Residue', fontsize=9)
    if idx == 3:
        plt.colorbar(im, ax=axes[1, idx], label='Distance (Å)')

plt.suptitle('Protein Backbone Diffusion (2D Visualization)', fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("📌 Key Observations:")
print("   - Top row: Backbone coordinates become random")
print("   - Bottom row: Distance matrix loses structure → becomes noise-like")
print("   - This is what RFDiffusion learns to reverse!")

## 📊 Noise Schedules

The choice of $\beta_t$ (the noise schedule) controls how quickly the signal is destroyed, which affects both training and sample quality.

### Common Schedules:

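1. **Linear**: $\beta_t = \beta_{\min} + (\beta_{\max} - \beta_{\min}) \frac{t-1}{T-1}$ for $t = 1, \dots, T$
2. **Cosine**: Defined through a squared-cosine curve for $\bar{\alpha}_t$; adds noise slowly at the beginning and faster at the end
3. **Quadratic**: Interpolates linearly in $\sqrt{\beta_t}$, so early steps add even less noise than the linear schedule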

Let's compare them:

In [None]:
# Compare different noise schedules
T = 100

# Linear schedule
betas_linear = np.linspace(0.0001, 0.02, T)

# Cosine schedule
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = np.linspace(0, timesteps, steps)
    alphas_cumprod = np.cos(((x / timesteps) + s) / (1 + s) * np.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return np.clip(betas, 0, 0.999)

betas_cosine = cosine_beta_schedule(T)

# Quadratic schedule
betas_quadratic = (np.linspace(0.0001**0.5, 0.02**0.5, T))**2

# Plot schedules
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Beta values
ax1.plot(betas_linear, label='Linear', linewidth=2)
ax1.plot(betas_cosine, label='Cosine', linewidth=2)
ax1.plot(betas_quadratic, label='Quadratic', linewidth=2)
ax1.set_xlabel('Timestep', fontsize=12)
ax1.set_ylabel('β (noise level)', fontsize=12)
ax1.set_title('Noise Schedules: β Values', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Cumulative product (signal retention)
alphas_linear = 1 - betas_linear
alphas_cosine = 1 - betas_cosine
alphas_quadratic = 1 - betas_quadratic

ax2.plot(np.cumprod(alphas_linear), label='Linear', linewidth=2)
ax2.plot(np.cumprod(alphas_cosine), label='Cosine', linewidth=2)
ax2.plot(np.cumprod(alphas_quadratic), label='Quadratic', linewidth=2)
ax2.set_xlabel('Timestep', fontsize=12)
ax2.set_ylabel('ᾱ (signal retention)', fontsize=12)
ax2.set_title('Signal Retention Over Time', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📌 Interpretation:")
print("   - Linear: β grows at a constant rate")
print("   - Cosine: Signal retention follows a smooth curve down to ~0 at t=T")
print("   - Quadratic: Slower initial corruption than linear")
print("\nRFDiffusion uses a modified schedule optimized for proteins!")
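
It can also help to look at the numbers behind the plot. The following sketch recomputes the three schedules and reports the fraction of signal variance, $\bar{\alpha}_T$, that survives at the last step. The endpoints 0.0001 and 0.02 were chosen for $T = 1000$ in the DDPM paper; with only $T = 100$ steps the linear and quadratic schedules leave a noticeable amount of signal, so $x_T$ is not yet pure noise. The helper names here are illustrative.

```python
import numpy as np

def cosine_betas(T, s=0.008):
    """Cosine schedule: betas derived from a squared-cosine alpha_bar curve."""
    x = np.linspace(0, T, T + 1)
    a_bar = np.cos(((x / T) + s) / (1 + s) * np.pi * 0.5) ** 2
    a_bar = a_bar / a_bar[0]
    return np.clip(1 - a_bar[1:] / a_bar[:-1], 0, 0.999)

def final_alpha_bar(betas):
    """Fraction of signal variance remaining after the last step."""
    return float(np.prod(1.0 - betas))

for T in (100, 1000):
    schedules = {
        "linear": np.linspace(0.0001, 0.02, T),
        "cosine": cosine_betas(T),
        "quadratic": np.linspace(0.0001 ** 0.5, 0.02 ** 0.5, T) ** 2,
    }
    for name, betas in schedules.items():
        print(f"T={T:4d}  {name:9s}  alpha_bar_T = {final_alpha_bar(betas):.2e}")
```

Comparing the two values of `T` shows that the number of steps and the schedule endpoints need to be chosen together.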

## 🎓 Key Takeaways

1. **Diffusion models** learn to gradually denoise data
2. **Forward process** is fixed (add noise according to schedule)
3. **Reverse process** is learned (neural network predicts noise to remove)
4. **Training** is simple: predict the noise that was added
5. **Sampling** starts from pure noise and iteratively denoises
6. **For proteins**: Same idea, but with 3D coordinates and geometric constraints

## ✅ Self-Check Questions

1. What are the two processes in a diffusion model?
2. Why do we need a noise schedule?
3. What does the neural network predict during training?
4. How do we generate new samples?
5. What's special about protein diffusion compared to image diffusion?

## 💡 Practice Exercise

Try modifying the code above to:
1. Use a different noise schedule
2. Change the number of timesteps
3. Apply diffusion to your own 2D dataset

## 📖 Further Reading

- [DDPM Paper](https://arxiv.org/abs/2006.11239) - Original denoising diffusion paper
- [Score-Based Models](https://yang-song.net/blog/2021/score/) - Alternative perspective
- [Diffusion Models Tutorial](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/) - Lilian Weng's excellent blog

## ⏭️ Next Notebook

**03_protein_representation.ipynb** - Learn how proteins are encoded as input to RFDiffusion

💡 **Still no GPU needed!** Next notebook covers data structures and representations.