# Denoising Diffusion Probabilistic Models

[Denoising Diffusion Probabilistic Models](https://arxiv.org/pdf/2006.11239)

[Tutorial on Diffusion Models for Imaging and Vision](https://arxiv.org/pdf/2403.18103)

[Bayes theorem, the geometry of changing beliefs](https://youtu.be/HZGCoVF3YvM?si=wPw_XLl6pZFQmIws)

[Probabilities of probabilities](https://youtube.com/playlist?list=PLZHQObOWTQDOjmo3Y6ADm0ScWAlEXf-fp&si=2GUpUMdHkjuCqSPK)

[What are Diffusion Models?](https://youtu.be/fbLgFrlTnGU?si=_d1SbwC5wNQ6eVym)

[Diffusion Models | Paper Explanation | Math Explained](https://youtu.be/HoKDTa5jHvg?si=Df3zTMRsPOgloGvI)

[The Annotated Diffusion Model](https://huggingface.co/blog/annotated-diffusion)

## Main components of DDPM

#### Forward Diffusion Process

The forward diffusion process gradually adds noise to the data until it becomes (approximately) pure Gaussian noise. It is expressed as a series of transitions, each of which adds a small amount of Gaussian noise.

#### Reverse Process (Denoising)

The reverse process is more complex and involves learning how to reverse the noise addition in order to recover the original data. Here, a neural network is typically trained to predict the noise added at each step.

#### Connection to Variational Inference (KL Divergence, ELBO, etc.)

This part explains why and how we can learn the reverse process. We want the learned reverse process to approximate the true reverse (posterior) distribution, i.e., the distribution of a less noisy sample given a noisier one. This is done by minimizing the KL divergence between that true posterior and the learned reverse process, which amounts to maximizing an evidence lower bound (ELBO) on the data log-likelihood.
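
For reference, the objective can be sketched as follows, using the notation of the DDPM paper linked above. The symbols $p_\theta$ (the learned reverse process) and $\epsilon_\theta$ (the noise-prediction network) are not introduced elsewhere in this section, and $\bar{\alpha}_t$ is the cumulative product of the noise schedule defined in the forward-process part below. Each term of the variational bound compares the tractable forward-process posterior with the learned reverse step:

$$L_{t-1} = \mathbb{E}_q\left[ D_{\mathrm{KL}}\big(q(x_{t-1} | x_t, x_0) \,\|\, p_\theta(x_{t-1} | x_t)\big) \right]$$

In the paper, these terms are simplified to a mean-squared error between the injected noise and the predicted noise:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\big) \right\|^2 \right]$$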

Let's start by walking through the **forward diffusion process**, where we gradually add noise to the data.

#### Forward Process (Diffusion of Data)

In the forward process, we transform data $x_0$ (e.g., an image) into a sequence of noisy versions $x_1, x_2, \ldots, x_T$ through a gradual addition of Gaussian noise. By the end of the process (at time step $T$), the data should resemble pure noise, typically modeled as a Gaussian distribution $\mathcal{N}(0, I)$.

The forward process is defined as a Markov chain, where each state $x_t$ depends only on the previous state $x_{t-1}$. Specifically, we can express this step-wise transition as:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t} x_{t-1}, (1-\alpha_t) I)$$

Where:
- $x_t$ is the noisy data at time step $t$.
- $\alpha_t = 1 - \beta_t$, where $\beta_t \in (0, 1)$ is the variance schedule controlling the amount of noise added at step $t$.
- $\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance $\Sigma$.

Let's break this down.

#### What's happening in each step?

At each step $t$, a small amount of Gaussian noise is added to the previous state $x_{t-1}$. The amount is set by $\alpha_t$: the closer $\alpha_t$ is to 1, the less noise is injected at that step. Repeating this step moves the data from its original clean state $x_0$ towards a fully noisy state $x_T$.

- The mean $\sqrt{\alpha_t} x_{t-1}$ slightly scales down the previous state.
- The covariance $(1-\alpha_t) I$ is the variance of the Gaussian noise added at this step.

By the time you reach $x_T$, the data has been degraded into noise.
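
A single transition can be written in reparameterized form as $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The snippet below is a minimal sketch of that step using NumPy. The function name `forward_step`, the constant $\beta_t = 0.02$ and the toy array are assumptions for illustration and are not part of the notebook's own code.

```python
import numpy as np

def forward_step(x_prev, alpha_t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(alpha_t) * x_prev, (1 - alpha_t) * I)."""
    eps = rng.standard_normal(x_prev.shape)  # eps ~ N(0, I)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps

rng = np.random.default_rng(0)
x0 = np.ones((1, 32, 32))   # toy stand-in for one 32x32 image (assumed)
alpha = 1.0 - 0.02          # alpha_t = 1 - beta_t, with an assumed constant beta_t

x = x0
for _ in range(1000):       # repeated application drifts towards N(0, I)
    x = forward_step(x, alpha, rng)

print(x.mean(), x.std())    # roughly 0 and 1 once the signal has decayed
```

Starting from an all-ones array makes the decay easy to see: the mean drifts from 1 towards 0 while the standard deviation grows from 0 towards 1.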

#### How do we accumulate noise?

We want to describe the relationship between $x_0$ (the original data) and $x_t$ at an arbitrary time step $t$. Since each step involves adding noise, we can compute the cumulative noise added over all the steps up to $t$.

By recursively applying the noise-adding step from step $1$ up to step $t$, we get:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I)$$

Where:
- $\alpha_t = 1 - \beta_t$
- $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$

This equation tells us that $x_t$ is a noisy version of the original data $x_0$, where $\sqrt{\bar{\alpha}_t}$ is a decaying factor applied to the original data, and $1 - \bar{\alpha}_t$ governs the amount of noise accumulated over time.
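
To see where the closed form comes from, it helps to unroll two steps. The derivation below is a sketch. It uses the reparameterization $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t$ with independent $\epsilon_t, \epsilon_{t-1} \sim \mathcal{N}(0, I)$, and the fact that a sum of independent zero-mean Gaussians is again Gaussian, with the variances adding:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t \\
&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\, \epsilon_{t-1}\right) + \sqrt{1-\alpha_t}\, \epsilon_t \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon}, \qquad \bar{\epsilon} \sim \mathcal{N}(0, I)
\end{aligned}
$$

The last line follows because the two noise terms have total variance $\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Repeating the argument down to $x_0$ replaces $\alpha_t \alpha_{t-1}$ with the full product $\bar{\alpha}_t$.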

#### Why is this form useful?

This equation is crucial because it lets us sample $x_t$ at any time step $t$ directly from the original data $x_0$, without simulating the intermediate steps. The noisy sample $x_t$ follows a Gaussian distribution where:
- The mean is $\sqrt{\bar{\alpha}_t} x_0$, a scaled version of the original data.
- The variance is $(1 - \bar{\alpha}_t) I$, which grows over time, introducing more noise.
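
The closed form can be sketched in a few lines of NumPy. The linear schedule ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ with $T = 1000$) is the one used in the DDPM paper. The helper name `q_sample` and the toy input are assumptions for illustration, not the notebook's own implementation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear variance schedule (DDPM paper values)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # alpha_bar[t-1] = prod_{s=1}^{t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot; t is 1-indexed. Also returns the noise used."""
    a_bar = alpha_bar[t - 1]
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

rng = np.random.default_rng(0)
x0 = np.ones((1, 32, 32))           # toy stand-in for one image (assumed)
for t in (1, 250, 500, 1000):
    xt, eps = q_sample(x0, t, rng)
    # signal scale sqrt(alpha_bar_t) shrinks while the noise scale grows
    print(t, float(np.sqrt(alpha_bar[t - 1])), float(np.sqrt(1.0 - alpha_bar[t - 1])))
```

Returning `eps` alongside `x_t` is a design choice for this sketch: the injected noise is what the network is later trained to predict.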


In [None]:
# Standard library imports
import math
import os
import time
import shutil
import logging
import random
import gzip
import pickle
from collections.abc import Mapping
from pathlib import Path
from operator import attrgetter, itemgetter
from functools import partial
from copy import copy
from contextlib import contextmanager

# Third-party library imports
import fastcore.all as fc
import matplotlib as mpl
import numpy as np
import matplotlib.pyplot as plt
import torch

# torch imports
import torchvision.transforms.functional as TF
import torch.nn.functional as F
from torch import tensor, nn, optim
from torch.utils.data import DataLoader, default_collate
from torch.nn import init
from torch.optim import lr_scheduler
from torcheval.metrics import MulticlassAccuracy

# dataset imports
import datasets
from datasets import load_dataset, load_dataset_builder

# miniai imports
from miniai.datasets import *
from miniai.conv import *
from miniai.learner import *
from miniai.activations import *
from miniai.init import *
from miniai.sgd import *
from miniai.resnet import *
from miniai.augment import *

In [None]:
mpl.rcParams['image.cmap'] = 'gray'
logging.disable(logging.WARNING)

In [None]:
x,y = 'image', 'label'
dsd = load_dataset('fashion_mnist')

In [None]:
@inplace
def transformi(b):
    # Resize the 28x28 images to 32x32 to keep the model architecture simple
    b[x] = [TF.resize(TF.to_tensor(o), (32,32)) for o in b[x]]

In [None]:
set_seed(42)
bs = 128
tds = dsd.with_transform(transformi)
dls = DataLoaders.from_dd(tds, bs, num_workers=8)
dt = dls.train
xb,yb = next(iter(dt))
xb.shape,yb[:10]