Diffusion Models are generative models. This means that they are used to generate data similar to the data on which they were trained. The main idea is that the training data is destroyed by gradually adding Gaussian noise, and the model then learns how to recover the data by reversing the noising process. If the model is trained well, it can generate data from nothing more than randomly sampled noise.

A Diffusion Model is a latent variable model which maps to the latent space using a fixed Markov chain. This chain gradually adds noise to the data in order to obtain the approximate posterior $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$, built from the transitions $\{q(\mathbf{x}_T|\mathbf{x}_{T-1}), q(\mathbf{x}_{T-1}|\mathbf{x}_{T-2}),\cdots,q(\mathbf{x}_1|\mathbf{x}_{0})\}$, where $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T$ are the latent variables with the same dimensionality as $\mathbf{x}_0$.

Ultimately, the image is asymptotically transformed to pure Gaussian noise. The goal of training a diffusion model is to learn the reverse process, i.e. the transitions $p_\theta(\mathbf{x}_{t-1}| \mathbf{x}_{t})$ for every step $t$. By traversing backwards along this chain, we can generate new data.

The sampling chain transitions in the forward process can be set to conditional Gaussians when the noise level is sufficiently low. Combining this fact with the Markov assumption leads to a simple parameterization of the forward process:

$$
q(\mathbf{x}_{1:T}|\mathbf{x}_{0})=\prod_{t=1}^T q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \prod_{t=1}^T  \mathcal{N}(\mathbf{x}_t;\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\beta_t \mathbf{I})
$$

In other words, at each step in the Markov chain we simply sample from a Gaussian distribution whose mean is the previous value (i.e. image) in the chain scaled by $\sqrt{1-\beta_t}$ and whose variance is $\beta_t$: $\mathbf{x}_t=\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}+\sqrt{\beta_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. Here $\beta_t$ is a variance schedule (either learned or fixed) which, if well-behaved, ensures that $\mathbf{x}_T$ is nearly an isotropic Gaussian for sufficiently large $T$.
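
The forward step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the linear schedule and its endpoints (`beta_start`, `beta_end`), the helper names, and the constant stand-in "image" are all hypothetical choices, not something prescribed by the text.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical fixed variance schedule: beta_t rises linearly along the chain."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_process(x0, betas, rng):
    """Apply every forward transition in turn and return x_T."""
    x = x0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
x0 = np.full((64, 64), 3.0)        # stand-in "image" that is far from zero mean
x_T = forward_process(x0, betas, rng)

# After enough steps the sample statistics should be close to those of N(0, I)
print(x_T.mean(), x_T.std())
```

With this schedule the original signal is scaled down by $\prod_t \sqrt{1-\beta_t}$, which is close to zero after 1000 steps, so `x_T` carries almost no information about `x0`.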


During training, the model learns to reverse this diffusion process in order to generate new data. Starting with the pure Gaussian noise $p(\mathbf{x}_T)=\mathcal{N}(\mathbf{x}_T;\mathbf{0},\mathbf{I})$ the model learns the joint distribution $p_\theta(\mathbf{x}_{0:T})$ as:

$$
p_\theta(\mathbf{x}_{0:T})= p(\mathbf{x}_T)\prod_{t=1}^T p_\theta(\mathbf{x}_{t-1}| \mathbf{x}_{t}) =
p(\mathbf{x}_T)\prod_{t=1}^T \mathcal{N}(\mathbf{x}_{t-1};\mathbf{\mu}_\theta(\mathbf{x}_{t},t),\mathbf{\Sigma}_\theta(\mathbf{x}_{t},t))
$$
where the time-dependent parameters of the Gaussian transitions are learned. 
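
To make the factorization above concrete, here is a minimal sketch of drawing a sample by walking the reverse chain from $\mathbf{x}_T$ to $\mathbf{x}_0$. The functions `toy_mean` and `toy_var` are hypothetical, untrained stand-ins for $\mathbf{\mu}_\theta$ and a diagonal $\mathbf{\Sigma}_\theta$; in a real model both would come from a trained network.

```python
import numpy as np

def sample_reverse_chain(mean_fn, var_fn, shape, num_steps, rng):
    """Ancestral sampling: x_T ~ N(0, I), then x_{t-1} ~ N(mu(x_t, t), Sigma(x_t, t))."""
    x = rng.standard_normal(shape)              # x_T drawn from the prior p(x_T)
    for t in range(num_steps, 0, -1):           # t = T, T-1, ..., 1
        mean = mean_fn(x, t)
        var = var_fn(x, t)                      # diagonal covariance: one variance per element
        x = mean + np.sqrt(var) * rng.standard_normal(shape)
    return x                                    # x_0

# Hypothetical stand-ins for the learned, time-dependent Gaussian parameters
def toy_mean(x, t):
    return 0.9 * x

def toy_var(x, t):
    return np.full_like(x, 0.01)

rng = np.random.default_rng(0)
sample = sample_reverse_chain(toy_mean, toy_var, shape=(2, 8, 8), num_steps=50, rng=rng)
print(sample.shape)
```

The loop structure is the point here: each iteration uses only the current $\mathbf{x}_t$ and the step index $t$, which is exactly the Markov structure of the joint distribution.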

A Diffusion Model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data. In practice, training equivalently consists of minimizing the variational upper bound on the negative log likelihood.

$$
E\left[-\log p_\theta(\mathbf{x}_{0})\right] \le E_q \left[ - \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] \equiv L_{vlb}
$$
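
It may help to see how this bound splits into per-step terms. Using the Markov structure of both chains, $L_{vlb}$ can be rewritten as a sum of KL divergences, following the standard decomposition used for denoising diffusion models (stated here without the intermediate algebra):

$$
L_{vlb} = E_q \left[ D_{KL}\big(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big) + \sum_{t=2}^T D_{KL}\big(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\big) - \log p_\theta(\mathbf{x}_0|\mathbf{x}_1) \right]
$$

Every distribution inside the KL terms is Gaussian, so each term has a closed form. The first term contains no learned parameters when the variance schedule $\beta_t$ is fixed.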

In [1]:
import torch
from denoising_diffusion_pytorch import Unet, GaussianDiffusion

In [2]:
# dim: the number of feature maps before the first down-sampling
# dim_mults: multiplicands for this value at each successive down-sampling
model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)

In [3]:
# image_size: the size of the images to generate
# timesteps: the number of timesteps in the diffusion process
# loss_type: choice between the L1 and L2 norms
diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
)

In [4]:
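# random tensor standing in for a batch of 8 RGB training images of size 128x128
training_images = torch.randn(8, 3, 128, 128)
# calling the diffusion object on a batch returns the training loss
loss = diffusion(training_images)
loss.backward()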

In [5]:
sampled_images = diffusion.sample(batch_size = 4)

In [None]:
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
).cuda()

trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 2e-5,
    train_num_steps = 700000,         # total training steps
    gradient_accumulate_every = 2,    # gradient accumulation steps
    ema_decay = 0.995,                # exponential moving average decay
    amp = True                        # turn on mixed precision
)

trainer.train()