# Diffusion Models for Generative AI

Diffusion models are a class of generative models that create new data by reversing a process that gradually adds noise. They involve three main stages:

- **Forward Process (Diffusion):** Noise is incrementally added to the original data over $T$ steps, transforming it into pure noise.
- **Model Training:** The model learns to predict the noise added at each step, enabling it to understand how to reverse the process.
- **Backward Process (Denoising):** Starting from random noise, the model iteratively removes noise over $T$ steps, reconstructing a new, realistic sample.

This approach lets a diffusion model trained on an image dataset generate new, high-quality images by learning to undo the noising process step by step.

## Imports

In [39]:
import numpy as np
import torch
import torch.nn as nn
import torchvision
from torchvision.datasets import mnist

In [40]:
transform = torchvision.transforms.Compose([
    torchvision.transforms.Grayscale(num_output_channels=1),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5,), (0.5,))
])

mnist_train = mnist.MNIST(root='./data', train=True, download=True, transform=transform)
mnist_test = mnist.MNIST(root='./data', train=False, download=True, transform=transform)

### Forward Process - Noising

Let $x_0$ be the original image (at $t = 0$) and $x_t$ be the data at timestep $t$ after adding noise. A single forward step is defined as:

$$x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon_t,\quad \epsilon_t \sim \mathcal{N}(0, \mathbf{I})$$

or, equivalently, as a conditional distribution:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \quad \beta_t \mathbf{I})$$

where $\beta_t = 1 - \alpha_t$ is the variance schedule for timestep $t$. During training, $t$ is drawn uniformly from $\{1, \dots, T\}$.

The cumulative process from $x_0$ to $x_t$:

$$
q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, \quad (1 - \bar{\alpha}_t) \mathbf{I})
$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.
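
The closed form follows from unrolling the single-step rule. A short derivation, assuming the per-step noise terms are independent standard Gaussians (here $\bar{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$ denotes a merged noise term, a symbol introduced only for this sketch):

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon} \\
&\;\;\vdots \\
&= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \bar{\epsilon}
\end{aligned}
$$

The third line merges the two noise terms: a sum of independent zero-mean Gaussians is again Gaussian, with variance $\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Repeating the substitution down to $x_0$ gives the mean and variance of $q(x_t | x_0)$.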

#### Algorithm

1. **Initialize** $x_0$ as the original data sample.
2. **For** $t = 1$ to $T$:
    - Sample noise $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$
    - Compute $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon_t$
3. **Return** $x_T$ as the fully noised data.

This process gradually transforms the original data into pure noise over $T$ steps.
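
To get a feel for how quickly a constant schedule destroys the signal, the sketch below tabulates $\bar{\alpha}_t = a^t$ and the resulting coefficients on $x_0$ and $\epsilon$. The values $a = 0.9$ and $T = 30$ are assumptions chosen for illustration, and the helper name `alpha_bar_schedule` is hypothetical.

```python
import math

def alpha_bar_schedule(a: float, T: int) -> list[float]:
    """Return [alpha_bar_1, ..., alpha_bar_T] for a constant per-step alpha."""
    values = []
    running = 1.0
    for _ in range(T):
        running *= a          # alpha_bar_t = alpha_bar_{t-1} * alpha_t
        values.append(running)
    return values

# assumed illustration values, not prescribed by the method
alpha_bar = alpha_bar_schedule(a=0.9, T=30)

for t in (1, 10, 20, 30):
    signal = math.sqrt(alpha_bar[t - 1])        # coefficient on x_0
    noise = math.sqrt(1.0 - alpha_bar[t - 1])   # coefficient on epsilon
    print(f"t={t:2d}  signal={signal:.3f}  noise={noise:.3f}")
```

With these assumed values the signal coefficient at $t = T$ is small but not exactly zero, so $x_T$ is only approximately pure noise; a larger $T$ or a smaller $a$ pushes it closer.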

In [56]:
T = 30   # total timesteps
a = 0.9  # per-step alpha_t (fraction of signal kept); beta_t = 1 - a
# kept constant for simplicity, so alpha_bar_t = a**t

def Noising(images, T, a):
    # one random timestep in {1, ..., T} per image in the batch
    t = torch.randint(1, T + 1, (images.size(0),), device=images.device).float()
    noise = torch.randn_like(images)
    a_t = (a ** t).view(-1, 1, 1, 1)  # alpha_bar_t, broadcast over (C, H, W)

    # closed-form sample from q(x_t | x_0)
    noisy_images = a_t.sqrt() * images + (1 - a_t).sqrt() * noise

    return noisy_images, noise, t

### Deep Learning Model ###
A simple CNN-based model that takes the noisy image and the timestep and outputs $\epsilon_\theta$, a tensor with the same shape as the image.
$\epsilon_\theta$ is the model's prediction of the noise $\epsilon_t$ that was added, and training makes this prediction progressively more accurate.

In [57]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Diffusion(nn.Module):
    def __init__(self, img_channels=1, hidden_dim=64, num_steps=30):
        super().__init__()
        self.num_steps = num_steps  # T, used to scale timesteps to roughly [0, 1]

        self.conv1 = nn.Conv2d(img_channels, hidden_dim, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1)
        self.conv3 = nn.Conv2d(hidden_dim, img_channels, 3, padding=1)

        # maps a scalar timestep to one value per hidden channel
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, x, t):
        t = t.view(-1, 1).float() / self.num_steps  # scale down
        t_emb = self.time_mlp(t)[:, :, None, None]  # (B, hidden_dim, 1, 1)

        h = F.relu(self.conv1(x))
        h = h + t_emb  # inject the timestep into every spatial position
        h = F.relu(self.conv2(h))
        out = self.conv3(h)  # same shape as x: the predicted noise

        return out
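
The timestep embedding is added to the feature map by broadcasting. The sketch below isolates that one step with stand-in tensors; the batch size and the dummy values are assumptions made only to show the shapes.

```python
import torch

batch, channels, height, width = 4, 64, 28, 28

# stand-in for the conv1 feature map
h = torch.zeros(batch, channels, height, width)
# stand-in for the time_mlp output: one value per (sample, channel)
t_emb = torch.arange(batch * channels, dtype=torch.float32).view(batch, channels)

# two trailing singleton dims let the embedding broadcast over H and W
t_emb_4d = t_emb[:, :, None, None]   # shape (4, 64, 1, 1)
out = h + t_emb_4d                   # shape (4, 64, 28, 28)

print(t_emb_4d.shape, out.shape)
```

Every pixel in a given channel receives the same offset, so the timestep shifts whole feature maps rather than individual positions.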


### Model Training
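- The model is trained to predict $\epsilon_t$, the noise added by the forward process.
- For simplicity, the loss is the mean squared error (`nn.MSELoss`) between the true and the predicted noise.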
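
As a quick check of what this loss computes, the sketch below compares `nn.MSELoss` against the mean of squared differences written out by hand. The random tensors are stand-ins for the true and predicted noise, and the batch size of 8 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise = torch.randn(8, 1, 28, 28)        # stand-in for the true noise
noise_pred = torch.randn(8, 1, 28, 28)   # stand-in for the model output

criterion = nn.MSELoss()                 # default reduction is "mean"
loss = criterion(noise_pred, noise)

# the same quantity written out by hand: average over every element
manual = ((noise_pred - noise) ** 2).mean()

print(loss.item(), manual.item())
```

Both tensors must have the same shape; if they differ, PyTorch broadcasts them and emits a warning, which usually indicates a missing batch dimension.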

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

num_epochs = 5
device = "cpu"
batch_size = 128  # assumed value; adjust to the available memory

# batches of shape (B, 1, 28, 28); iterating the dataset directly
# would yield single unbatched images of shape (1, 28, 28)
train_loader = DataLoader(mnist_train, batch_size=batch_size, shuffle=True)

model = Diffusion().to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

T = 30
a = 0.9  # decay factor: per-step alpha_t

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for images, _ in train_loader:
        images = images.to(device)

        noisy_images, noise, t = Noising(images, T, a)

        # model predicts noise for each pixel
        noise_pred = model(noisy_images, t)

        # loss = MSE between true noise and predicted noise
        loss = criterion(noise_pred, noise)

        # backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # average of the per-batch losses
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")


**Full Update Rule:**

The full denoising update at each step is:

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z
$$

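where:
- $\alpha_t = 1 - \beta_t$: fraction of signal kept at step $t$, set by the noise schedule
- $\bar{\alpha}_t$: cumulative product of $\alpha_s$ for $s = 1, \dots, t$
- $\epsilon_\theta(x_t, t)$: noise predicted by the model
- $\sigma_t$: standard deviation of the noise added back at step $t$; the code uses the posterior variance $\sigma_t^2 = \beta_t \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}$
- $z \sim \mathcal{N}(0, \mathbf{I})$: fresh random noise, omitted at the final step

Subtracting the scaled noise prediction gives the estimated mean of $x_{t-1}$, and the $\sigma_t z$ term keeps sampling stochastic; both are scaled according to the variance schedule.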
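
To make the size of the $\sigma_t z$ term concrete, the sketch below evaluates $\sigma_t = \sqrt{\beta_t (1 - \bar{\alpha}_{t-1}) / (1 - \bar{\alpha}_t)}$ for a constant schedule. The choice of this posterior-variance form, the values $a = 0.9$ and $T = 30$, and the helper name `posterior_sigma` are assumptions for illustration.

```python
import math

def posterior_sigma(a: float, t: int) -> float:
    """sigma_t for a constant schedule alpha_t = a, using the posterior variance."""
    beta_t = 1.0 - a
    alpha_bar_t = a ** t
    alpha_bar_prev = a ** (t - 1)   # equals 1.0 when t == 1
    variance = beta_t * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return math.sqrt(variance)

# assumed illustration values: a = 0.9, T = 30
sigmas = [posterior_sigma(0.9, t) for t in range(1, 31)]

for t in (1, 2, 10, 30):
    print(f"t={t:2d}  sigma_t={sigmas[t - 1]:.4f}")
```

Under these assumptions $\sigma_1$ is zero, which is consistent with adding no noise at the last denoising step, and $\sigma_t$ stays below $\sqrt{\beta_t}$ for every $t$.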


In [None]:
import torch

@torch.no_grad()
def Denoising(model, img_size=(1, 28, 28), T=30, device="cpu", a=0.9):
    model.eval()
    # same constant schedule as in training: alpha_t = a, alpha_bar_t = a**t
    # (a sampler schedule that differs from the training one gives wrong coefficients)
    alpha_t = torch.tensor(a, device=device)
    beta_t = 1.0 - alpha_t

    # pure noise
    x_t = torch.randn(1, *img_size, device=device)

    for t in range(T, 0, -1):  # t = T, ..., 1
        t_tensor = torch.tensor([t], device=device, dtype=torch.long)

        # predict noise
        eps_theta = model(x_t, t_tensor)

        # cumulative products for steps t and t-1
        alpha_bar_t = alpha_t ** t
        alpha_bar_prev = alpha_t ** (t - 1)

        # mean term μ_θ(x_t, t)
        mean = (1.0 / torch.sqrt(alpha_t)) * (
            x_t - (beta_t / torch.sqrt(1 - alpha_bar_t)) * eps_theta
        )

        if t > 1:
            # variance term σ_t z
            posterior_var = beta_t * (1 - alpha_bar_prev) / (1 - alpha_bar_t)
            sigma_t = torch.sqrt(posterior_var)
            noise = torch.randn_like(x_t)
            x_t = mean + sigma_t * noise
        else:
            # at the final step (t = 1), just take the mean
            x_t = mean

    return x_t


if __name__ == "__main__":
    gen_image = Denoising(model, img_size=(1, 28, 28), T=30)
    import matplotlib.pyplot as plt
    plt.imshow(gen_image.squeeze().cpu(), cmap="gray")
    plt.show()