# Algorithm 17: Diffusion Transformer (AlphaFold3)

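The Diffusion Transformer is the core structure prediction module in AlphaFold3 (AF3). It refines noisy atom coordinates by diffusion-based denoising, conditioned on the noise level and on the single and pair representations.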

## Source Code Location
- **File**: `AF3-Ref-src/alphafold3-official/src/alphafold3/model/network/diffusion.py`

## Overview

### Key Architecture Components

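1. **Noise Level Embedding**: Encodes the current noise level `t` as a sinusoidal feature vector
2. **Adaptive LayerNorm**: Normalizes the hidden features, then scales and shifts them based on the noise level
3. **Atom Transformer**: Processes atom-level features with self-attention
4. **Conditioning**: Uses the single representation as an additive feature and the pair representation as an attention bias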

### Diffusion Process
```
Forward:  x_0 → x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * noise,  noise ~ N(0, I)
Reverse:  x_t → x_0 (predicted by the neural network)
```
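
The forward step can be sketched in NumPy. This is a minimal illustration of the formula as written, assuming `alpha_t` is a scalar in [0, 1] and the noise is standard Gaussian. The function name `forward_noise` and the toy values are hypothetical and are not taken from the AF3 source.

```python
import numpy as np

def forward_noise(x0, alpha_t, rng):
    """Sample x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * noise
    return x_t, noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 3))             # toy "clean" coordinates for 5 atoms

x_clean, _ = forward_noise(x0, 1.0, rng)     # alpha_t = 1: nothing is added
x_noisy, eps = forward_noise(x0, 0.0, rng)   # alpha_t = 0: the signal is fully replaced by noise
```

The two extreme values of `alpha_t` show the role of the coefficient: it interpolates between the clean coordinates and pure Gaussian noise.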

In [None]:
import numpy as np
np.random.seed(42)

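def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learned scale or offset."""
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def swish(x):
    """Swish / SiLU: x * sigmoid(x). The clip prevents overflow in exp."""
    return x / (1 + np.exp(-np.clip(x, -500, 500)))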

In [None]:
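def noise_level_embedding(t, c=256, max_period=10000):
    """
    Sinusoidal embedding for a scalar noise level t.
    Similar to the positional encoding used in Transformers.

    Returns a vector of length 2 * (c // 2): cosine features followed
    by sine features. Use an even c to get exactly c dimensions.
    """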
    half = c // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    emb = np.concatenate([np.cos(args), np.sin(args)])
    return emb
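
To show how the embedding varies with the noise level, the same formula can be applied to several values of `t` at once. This is a sketch under the same assumptions as the cell above; the function name `noise_level_embedding_batch` and the chosen noise levels are hypothetical.

```python
import numpy as np

def noise_level_embedding_batch(ts, c=256, max_period=10000):
    """Vectorized sinusoidal embedding: one row per noise level."""
    ts = np.asarray(ts, dtype=np.float64).reshape(-1, 1)            # [T, 1]
    half = c // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)    # [half]
    args = ts * freqs                                               # [T, half]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)    # [T, 2 * half]

ts = [0.0, 0.1, 0.5, 1.0]
emb = noise_level_embedding_batch(ts, c=64)
print(emb.shape)
```

At `t = 0` every cosine feature is 1 and every sine feature is 0, and each larger `t` produces a distinct vector, which is what lets downstream layers tell noise levels apart.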

In [None]:
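def adaptive_layer_norm(x, cond, eps=1e-5):
    """
    Adaptive LayerNorm (AdaLN) conditioned on the noise level.

    Normalizes x, then applies a scale and bias derived from the
    conditioning vector: (1 + scale) * layer_norm(x) + bias.

    Args:
        x: Features [N, c]
        cond: Conditioning vector, e.g. the noise level embedding [c_t]
    """
    c = x.shape[-1]
    x_norm = layer_norm(x, eps)

    # Derive scale and bias from the conditioning.
    # W is a random stand-in for a learned projection and is re-sampled on every call.
    W = np.random.randn(cond.shape[-1], 2 * c) * (cond.shape[-1] ** -0.5)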
    params = cond @ W
    scale, bias = np.split(params, 2, axis=-1)
    
    return (1 + scale) * x_norm + bias
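
The effect of the conditioning is easier to see when the projection weights are passed in explicitly instead of being sampled inside the function. The variant below is a sketch for illustration only; `adaptive_layer_norm_with_weights` and the toy shapes are hypothetical names and values.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaptive_layer_norm_with_weights(x, cond, W, eps=1e-5):
    """Same formula as above, with the projection W supplied by the caller."""
    scale, bias = np.split(cond @ W, 2, axis=-1)
    return (1 + scale) * layer_norm(x, eps) + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # [N, c] with c = 8
cond = rng.standard_normal(16)           # [c_t] with c_t = 16

W_zero = np.zeros((16, 2 * 8))           # zero projection: scale = 0, bias = 0
W_rand = rng.standard_normal((16, 2 * 8)) * 16 ** -0.5

out_zero = adaptive_layer_norm_with_weights(x, cond, W_zero)
out_rand = adaptive_layer_norm_with_weights(x, cond, W_rand)
```

With a zero projection the layer reduces to a plain LayerNorm, which is why the `(1 + scale)` form is a convenient starting point: the conditioning only perturbs the normalized features once the weights move away from zero.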

In [None]:
def diffusion_transformer_block(x, s, z, t_emb, num_heads=8, c=64):
    """
    Single simplified Diffusion Transformer block.

    All weights are randomly initialized stand-ins for learned parameters.

    Args:
        x: Noisy atom positions [N_atoms, 3]
        s: Single representation [N_tokens, c_s]
        z: Pair representation [N_tokens, N_tokens, c_z]
        t_emb: Noise level embedding [c_t]
        num_heads: Number of attention heads (must divide c)
        c: Hidden dimension

    Returns:
        Position update [N_atoms, 3]; atoms beyond the first
        min(N_atoms, N_tokens) receive a zero update.
    """
    N_atoms = x.shape[0]
    c_s = s.shape[-1]

    print("Diffusion Transformer Block")
    print("=" * 50)
    print(f"Atoms: {N_atoms}, Single dim: {c_s}")
    
    # Embed positions
    W_x = np.random.randn(3, c) * (3 ** -0.5)
    x_emb = x @ W_x  # [N_atoms, c]
    
    # Add conditioning from the single representation.
    # Simplification: atom i is paired with token i (1-to-1), so only the
    # first N = min(N_atoms, N_tokens) atoms are processed.
    N = min(N_atoms, s.shape[0])
    W_s = np.random.randn(c_s, c) * (c_s ** -0.5)
    s_proj = s[:N] @ W_s  # [N, c]
    
    # Combine
    h = x_emb[:N] + s_proj  # [N, c]
    
    # Adaptive LayerNorm with noise conditioning
    h = adaptive_layer_norm(h, t_emb)
    
    # Self-attention with pair bias
    W_q = np.random.randn(c, num_heads, c // num_heads) * (c ** -0.5)
    W_k = np.random.randn(c, num_heads, c // num_heads) * (c ** -0.5)
    W_v = np.random.randn(c, num_heads, c // num_heads) * (c ** -0.5)
    
    q = np.einsum('ic,chd->ihd', h, W_q)
    k = np.einsum('jc,chd->jhd', h, W_k)
    v = np.einsum('jc,chd->jhd', h, W_v)
    
    # Pair bias
    c_z = z.shape[-1]
    W_b = np.random.randn(c_z, num_heads) * (c_z ** -0.5)
    b = np.einsum('ijc,ch->ijh', z[:N, :N], W_b)
    
    attn = np.einsum('ihd,jhd->ijh', q, k) / np.sqrt(c // num_heads)
    attn = attn + b
    attn = softmax(attn, axis=1)
    
    out = np.einsum('ijh,jhd->ihd', attn, v)
    
    # Output projection
    W_o = np.random.randn(num_heads, c // num_heads, c) * (c ** -0.5)
    out = np.einsum('ihd,hdc->ic', out, W_o)
    
    # Predict position update
    W_out = np.random.randn(c, 3) * (c ** -0.5)
    delta_x = out @ W_out  # [N, 3]
    
    # Zero-pad so the output covers all atoms; unprocessed atoms get no update
    full_delta = np.zeros((N_atoms, 3))
    full_delta[:N] = delta_x
    
    print(f"Position update: {full_delta.shape}")
    
    return full_delta
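
The pair bias step can be isolated in a single-head sketch. The example below is not the AF3 implementation: it assumes a precomputed `[N, N]` bias matrix and uses the hypothetical name `attention_with_pair_bias` to show how an additive bias steers the attention weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_pair_bias(q, k, v, bias):
    """Single-head attention; bias[i, j] is added to the logit of query i and key j."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias      # [N, N]
    weights = softmax(logits, axis=-1)        # normalize over keys j
    return weights @ v, weights

rng = np.random.default_rng(0)
N, d = 4, 8
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))

bias = np.zeros((N, N))
bias[:, 3] = -1e9                             # strongly discourage attending to key 3
out, w = attention_with_pair_bias(q, k, v, bias)
```

Because the bias is added before the softmax, a large negative entry drives the corresponding weight to zero while the remaining weights in each row still sum to one. In the block above the bias comes from a linear projection of `z`, with one bias matrix per head.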

In [None]:
# Test
print("Test: Diffusion Transformer Block")
print("="*60)

N_atoms = 64  # more atoms than tokens, so atoms 32..63 receive a zero update
N = 32        # number of tokens
c_s = 128
c_z = 64

x = np.random.randn(N_atoms, 3)  # Noisy positions
s = np.random.randn(N, c_s)
z = np.random.randn(N, N, c_z)
t = 0.5  # Noise level
t_emb = noise_level_embedding(t, c=64)

delta_x = diffusion_transformer_block(x, s, z, t_emb)

print(f"\nPosition update norm: {np.linalg.norm(delta_x):.4f}")
print(f"Output finite: {np.isfinite(delta_x).all()}")

## Key Insights

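1. **Noise Conditioning**: Adaptive LayerNorm (AdaLN) modulates the hidden features with a scale and bias derived from the noise level `t`
2. **Pair Bias**: The pair representation is projected to one bias per attention head and added to the attention logits
3. **Position Updates**: The block predicts a per-atom delta that is added to the current positions
4. **Iterative Refinement**: The block is applied multiple times during denoising, once per step of the reverse process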
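
The iterative refinement in point 4 can be sketched as a reverse loop over noise levels. This is a toy example, not the AF3 sampler: it uses a deterministic DDIM-style update, a hypothetical linear schedule `alphas`, and an `oracle_denoiser` that returns the true coordinates in place of a trained network.

```python
import numpy as np

def oracle_denoiser(x_t, x0_true):
    """Stand-in for the network: returns the true clean coordinates."""
    return x0_true

rng = np.random.default_rng(0)
T = 10
alphas = np.linspace(1.0, 0.01, T + 1)       # alphas[0] = 1 (clean), alphas[T] = 0.01 (noisy)

x0_true = rng.standard_normal((6, 3))        # toy target coordinates for 6 atoms
noise = rng.standard_normal((6, 3))
x = np.sqrt(alphas[T]) * x0_true + np.sqrt(1 - alphas[T]) * noise   # start from x_T

for t in range(T, 0, -1):
    x0_hat = oracle_denoiser(x, x0_true)                             # predicted clean structure
    eps_hat = (x - np.sqrt(alphas[t]) * x0_hat) / np.sqrt(1 - alphas[t])
    # Re-noise the prediction to the next, lower noise level
    x = np.sqrt(alphas[t - 1]) * x0_hat + np.sqrt(1 - alphas[t - 1]) * eps_hat

print(np.abs(x - x0_true).max())
```

With a perfect denoiser the loop recovers the target exactly. With a learned network each step only partially corrects the coordinates, which is why the reverse process is run over many steps.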