# Assignment 4: Diffusion Model

In this assignment, you will implement a diffusion model from scratch and train it on the MNIST dataset. Diffusion models are a class of generative models that learn to gradually denoise random noise to generate realistic images. This assignment will guide you through the core components and training process of diffusion models.

Useful links:
1. [What are Diffusion Models?](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)
2. [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)

Please:
* Fill out the code marked with `TODO` or `Your code here`. You may split functions or visualizations into separate files for more flexibility, as long as your output includes what we asked for.
* Reuse or modify visualization code from Assignment 2 for creating necessary visualizations.
* Submit the notebook with all original outputs. If an output is included from another file, please include that file in your folder.
* Answer the questions at the end of the notebook. Write your answers in the notebook.

**Please reserve enough time for this assignment given the potential amount of time for training.**

In [1]:
import torch

## Part 1: Implementing the U-Net (30 pt)

In this part, you will implement a U-Net style model that serves as the backbone for the diffusion process. The model takes noisy images and their corresponding timesteps as input and predicts the noise that was added to the original images.

Please fill out the code in `diffusion.DiffusionModel`, then run the following code to test it. For the time embedding, you can use just one embedding layer and concatenate its output with the features. Attention layers are not required, given the limited computational resources.
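As an illustration of the "one embedding layer, concatenated with the feature" idea, here is a minimal sketch of a single convolutional block. The class name `TimeConditionedBlock`, the `time_dim` parameter, and the layer sizes are hypothetical and are not part of the provided `diffusion` module; a full U-Net would add downsampling, upsampling, and skip connections around blocks like this.

```python
import torch
import torch.nn as nn


class TimeConditionedBlock(nn.Module):
    """Hypothetical block: image features concatenated with a timestep embedding."""

    def __init__(self, in_channels, out_channels, noise_steps=1000, time_dim=32):
        super().__init__()
        # One learnable vector per integer timestep.
        self.time_embed = nn.Embedding(noise_steps, time_dim)
        # The convolution sees the image channels plus the time channels.
        self.conv = nn.Conv2d(in_channels + time_dim, out_channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x, t):
        emb = self.time_embed(t)  # (B, time_dim)
        # Broadcast the embedding over the spatial dimensions: (B, time_dim, H, W).
        emb = emb[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        return self.act(self.conv(torch.cat([x, emb], dim=1)))


block = TimeConditionedBlock(in_channels=1, out_channels=16)
x = torch.randn(4, 1, 28, 28)
t = torch.randint(0, 1000, (4,))
out = block(x, t)
print(out.shape)
```

Concatenation keeps the conditioning explicit at the cost of extra input channels; adding the embedding to the feature map is a common alternative, but it is not what the instructions above describe.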

In [2]:
from diffusion import DiffusionModel

def check_diffusion_model(model_class):
    """Verify that the DiffusionModel class is correctly implemented."""
    try:
        channels = 1
        image_size = 28
        noise_steps = 1000
        model = model_class(image_size=image_size, channels=channels)
        
        # Test forward pass with random inputs
        batch_size = 4
        x = torch.randn(batch_size, channels, image_size, image_size)
        t = torch.randint(0, noise_steps, (batch_size,))
        
        output = model(x, t)
        
        # Check output shape
        expected_shape = (batch_size, channels, image_size, image_size)
        assert output.shape == expected_shape, f"Expected output shape {expected_shape}, got {output.shape}"
        
        print("DiffusionModel implementation is correct!")
        return True
    except Exception as e:
        print(f"DiffusionModel check failed: {str(e)}")
        return False
    
check_diffusion_model(DiffusionModel)

DiffusionModel implementation is correct!


True

## Part 2: Implementing the Diffusion Process (30 pt)

In this part, you will implement the core diffusion process, including the forward diffusion (adding noise) and the denoising process. This includes setting up the noise schedule and implementing functions for noise addition and sampling.

Please fill out the code in `diffusion.DiffusionProcess`, then run the following code to test it. Note that this test only checks the format of the outputs, so you still need to verify the actual math yourself.
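For reference, these are the quantities that noise addition and the training loss are usually built on, written in the notation of the DDPM paper linked at the top. How they map onto your `add_noise` and `train_step` is up to your implementation; this is a summary of the standard formulation, not a specification of the expected code.

$$
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
$$

$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
L_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon} \left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right]
$$

The second line is the closed-form forward process, which lets you jump from a clean image $x_0$ to any timestep $t$ in one step. The third line is the simplified objective, in which the network $\epsilon_\theta$ predicts the noise that was added.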

In [3]:
from diffusion import DiffusionProcess

def check_diffusion_process(diffusion_class):
    """Verify that the DiffusionProcess class is correctly implemented."""
    try:
        channels = 1
        image_size = 28
        noise_steps = 1000
        diffusion = diffusion_class(image_size=image_size, channels=channels, noise_steps=noise_steps)
        
        # Test add_noise function
        batch_size = 4
        x = torch.randn(batch_size, channels, image_size, image_size)
        t = torch.randint(0, noise_steps, (batch_size,))
        
        noisy_x, noise = diffusion.add_noise(x, t)
        assert noisy_x.shape == x.shape, f"Expected noisy_x shape {x.shape}, got {noisy_x.shape}"
        assert noise.shape == x.shape, f"Expected noise shape {x.shape}, got {noise.shape}"
        
        # Test train_step function
        loss = diffusion.train_step(x)
        assert isinstance(loss, float), f"Expected loss to be a float, got {type(loss)}"
        
        print("DiffusionProcess implementation is correct!")
        return True
    except Exception as e:
        print(f"DiffusionProcess check failed: {str(e)}")
        return False
    
check_diffusion_process(DiffusionProcess)

DiffusionProcess implementation is correct!


True

## Part 3: Training and Sampling (20 points)

In this part, you will implement the training loop for the diffusion model and the functions for generating and visualizing samples. Please build on the code you have written and use the `DiffusionModel` and `DiffusionProcess` classes above to write your training function. You should write your training code in a standalone Python file.

Please include the training curves and the sampled results below. You can reuse the visualization code we provided in the GAN assignment.

You can include an image like:

![image](./DDPM.png)

## Linear Beta Scheduling
![image](./analysis/linear_training_loss.png)

<!-- ![image](./analysis/linear_sample_epoch_100.png) -->
<img src="./analysis/linear_sample_epoch_100.png" width="200px">

## Cosine Beta Scheduling
![image](./analysis/cosine_training_loss.png)

<!-- ![image](./analysis/cosine_sample_epoch_100.png) -->
<img src="./analysis/cosine_sample_epoch_100.png" width="200px">

## Part 4: Analysis and Visualization (20 points)

Answer the questions with your analysis. Most of the questions are open-ended. We are looking for your own observations from the experiments you did.

1. How does the choice of noise schedule (beta values) affect the training stability and sample quality? Try at least one alternative to the linear schedule (e.g., cosine or quadratic) and compare the results.

[Answer]: The choice of noise schedule significantly impacts both training stability and sample quality in diffusion models. After experimenting with both linear and cosine beta schedules, I found that the cosine schedule demonstrated superior performance in both metrics. Examining the training curves, the cosine schedule shows smoother, more consistent convergence with minimal fluctuations after epoch 20, while the linear schedule exhibits noticeable oscillations particularly around epochs 60-80, though both ultimately converge to similar final loss values of approximately 0.03-0.05. More importantly, the sample quality differs substantially between the two schedules—the cosine schedule produces cleaner, more well-defined digits with sharper edges and better foreground-background separation, whereas the linear schedule generates recognizable but slightly fuzzier digits with more background noise. This performance difference occurs because the cosine schedule applies a more gradual and balanced noise distribution across timesteps, preserving image structure longer in early steps and avoiding overly aggressive noise addition. In contrast, the linear schedule's constant rate of noise increase can be suboptimal, being too aggressive early on and potentially insufficient at later timesteps, which explains the training instability and reduced sample quality observed in my experiments.
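To make the difference between the two schedules concrete, the sketch below computes the cumulative signal level $\bar{\alpha}_t$ for each. It assumes the commonly used defaults: a linear schedule from `1e-4` to `0.02` (as in the DDPM paper) and a cosine schedule with offset `s = 0.008` and betas clipped at `0.999` (as in the Improved DDPM paper). These values and the function names are assumptions and may differ from the settings used in the runs above.

```python
import math


def linear_betas(noise_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Betas spaced evenly between beta_start and beta_end (assumed defaults)."""
    step = (beta_end - beta_start) / (noise_steps - 1)
    return [beta_start + i * step for i in range(noise_steps)]


def cosine_betas(noise_steps=1000, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve (assumed defaults)."""
    def f(t):
        return math.cos((t / noise_steps + s) / (1 + s) * math.pi / 2) ** 2

    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped to avoid beta = 1.
    return [min(1 - f(i + 1) / f(i), max_beta) for i in range(noise_steps)]


def alpha_bars(betas):
    """Cumulative product of (1 - beta): the fraction of signal variance left."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


linear_ab = alpha_bars(linear_betas())
cosine_ab = alpha_bars(cosine_betas())

for t in (0, 249, 499, 749, 999):
    print(f"step {t + 1:4d}  linear={linear_ab[t]:.4f}  cosine={cosine_ab[t]:.4f}")
```

Under these assumed settings, the cosine schedule retains noticeably more signal at the midpoint of the process than the linear one, which matches the description of it preserving image structure longer.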

2. Based on your observations, at which timesteps (early, middle, or late in the diffusion process) does the model seem to struggle the most with accurately predicting the noise (looking into loss)? Why do you think this occurs?

[Answer]: Based on the training dynamics observed with both the linear and cosine schedules, the model appears to struggle most at the high-noise timesteps (t close to T=1000), which are the last steps of the forward process and the first steps of reverse sampling. This is suggested by the initially high loss values (~0.28 for cosine, ~0.23 for linear) at the beginning of training, when the model is learning to denoise images with timesteps sampled uniformly. The rapid initial drop in loss during the first 20 epochs suggests the model quickly learns the easier denoising tasks at the middle and low-noise timesteps, while the more gradual convergence afterward indicates continued struggle with the difficult high-noise timesteps. This difficulty occurs because at high-noise timesteps the images are almost entirely Gaussian noise with minimal signal remaining from the original digit structure, making it extremely challenging for the model to predict what noise should be removed to recover meaningful features. The signal-to-noise ratio is lowest at these timesteps, requiring the model to essentially "hallucinate" or infer structure from nearly random noise based solely on learned priors from the training data. Additionally, these timesteps have the largest beta values, which could mean that prediction errors at these steps weigh more heavily on the loss. In contrast, low-noise timesteps (t close to 0) contain mostly clean images with small amounts of noise, making the denoising task substantially easier, as the model only needs to perform minor refinements rather than major structure recovery.
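The aggregate training curve averages over uniformly sampled timesteps, so it cannot by itself attribute loss to one part of the schedule. A minimal sketch of one way to check this more directly, assuming the per-sample loss and its sampled timestep can be logged from the training step: `mean_loss_by_bucket` is a hypothetical helper, and the numbers in the example are placeholders, not measurements from these experiments.

```python
def mean_loss_by_bucket(timesteps, losses, noise_steps=1000, n_buckets=10):
    """Average per-sample losses within equal-width timestep ranges.

    Returns a list of length n_buckets; a bucket with no samples is None.
    """
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for t, loss in zip(timesteps, losses):
        # Map timestep t in [0, noise_steps) to a bucket index.
        b = min(int(t) * n_buckets // noise_steps, n_buckets - 1)
        sums[b] += float(loss)
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Placeholder values purely to show the call; replace with logged data.
example_t = [5, 50, 450, 520, 990, 999]
example_loss = [0.1, 0.3, 0.2, 0.4, 0.5, 0.7]
buckets = mean_loss_by_bucket(example_t, example_loss, n_buckets=4)
print(buckets)
```

Plotting these bucket means after training would show where the noise-prediction error concentrates across the timestep ranges discussed above.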

3. Perform interpolation between two noise vectors and analyze the resulting generated images. Is the transition smooth? What does this tell you about the model's learned latent space?

[Answer]: The interpolation results demonstrate partially smooth transitions with interesting intermediate characteristics. While the sequence shows several distinct digit identities (8 → 5 → 9 → 7 → 4 → 7), there are notable ambiguous forms in the middle positions (particularly images 3 and 6) that appear less clearly defined and could be interpreted as transitional states between digits. This suggests that the model's learned representation exhibits local smoothness in certain regions of the noise space, where nearby noise vectors can produce semantically related or hybrid outputs. However, the overall transition is not uniformly smooth—most steps still show discrete jumps between recognizable digits rather than gradual morphing. This reveals that diffusion models operate in a high-dimensional noise space that has some emergent structure but lacks the globally organized, semantically smooth latent space characteristic of VAE or GAN models. The model successfully learned to map various noise patterns to realistic digits, and in some cases, interpolated noise vectors land in regions that produce ambiguous or transitional forms. This demonstrates that while the initial noise space is fundamentally chaotic, the denoising process can create pockets of local continuity where similar noise patterns yield related visual outputs.

<img src="./analysis/interpolation.png" width="300px">
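How the intermediate noise vectors are built affects what the sampler sees. Linearly mixing two independent Gaussian samples shrinks the norm of the result toward the midpoint, so the intermediate inputs no longer look like unit-variance noise. Spherical interpolation (slerp) is a common alternative that keeps the norm roughly constant. The sketch below compares the two on flat Python lists; the function names are hypothetical, and which method produced the figure above is not stated here.

```python
import math
import random


def lerp(a, b, w):
    """Linear interpolation between two equal-length vectors."""
    return [(1 - w) * x + w * y for x, y in zip(a, b)]


def slerp(a, b, w, eps=1e-7):
    """Spherical interpolation; falls back to lerp for near-parallel inputs.

    Assumes neither input is the zero vector.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)  # angle between the two vectors
    if math.sin(omega) < eps:
        return lerp(a, b, w)
    ca = math.sin((1 - w) * omega) / math.sin(omega)
    cb = math.sin(w * omega) / math.sin(omega)
    return [ca * x + cb * y for x, y in zip(a, b)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


random.seed(0)
z0 = [random.gauss(0.0, 1.0) for _ in range(28 * 28)]
z1 = [random.gauss(0.0, 1.0) for _ in range(28 * 28)]

mid_lerp = lerp(z0, z1, 0.5)
mid_slerp = slerp(z0, z1, 0.5)
print(f"|z0|={norm(z0):.2f}  |lerp mid|={norm(mid_lerp):.2f}  |slerp mid|={norm(mid_slerp):.2f}")
```

If the interpolation was done linearly, repeating it with slerp would help separate artifacts caused by off-distribution inputs from properties of the learned model.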

4. Recall that in Assignment 2 we implemented a GAN. Compare your diffusion model with GANs in terms of:
* Training stability
* Sample quality
* Diversity of samples
* Computational requirements
* Anything else you find interesting

[Answer]: Comparing my implementations revealed surprising differences in sample quality. My GAN produced cleaner, more recognizable digits despite GANs' reputation for training difficulty. The GAN samples showed sharp, well-defined digits with clean backgrounds, particularly for simpler digits like 1 and 7, though complex digits like 8 showed slight blurriness. In contrast, my DDPM samples were noticeably messier, with several ambiguous or malformed digits, inconsistent stroke thickness, and residual background noise. This counterintuitive result can be attributed to several factors: (1) Insufficient training time - my DDPM trained for 100 epochs but likely needed 200-500+ epochs for comparable quality, whereas the GAN produced good results in ~10 epochs; (2) Error accumulation - the DDPM's 1000-step sampling process compounds small prediction errors, while the GAN generates in a single forward pass with no accumulation; (3) Model capacity - my simple 3-level U-Net without attention layers lacked the capacity to accurately denoise across all 1000 timesteps, whereas my GAN's architecture with spectral normalization was well-suited for its direct generation task; (4) Loss function differences - the adversarial loss provided rich, image-level feedback for the GAN, while the DDPM's MSE loss on predicted noise is weaker for capturing high-level structure. In terms of training stability, the DDPM was far superior with smooth, predictable convergence, while the GAN required careful hyperparameter tuning and oscillated throughout training. For computational requirements, the GAN was dramatically more efficient - generating samples in milliseconds versus the DDPM's minutes due to 1000 denoising steps. This comparison highlights an important lesson: training stability doesn't guarantee sample quality, and sometimes architecturally simpler models (GANs) can outperform theoretically superior ones (DDPMs) when the implementation is well-tuned for the specific task.

## Extra Credit: Diffusion Model on CIFAR 10 (20 pt)

In this extra credit assignment, you'll extend your Diffusion implementation to handle the more complex CIFAR-10 dataset.

You should design your own network architectures, considering factors like the increased complexity of RGB images, memory efficiency, and training stability. The basic code structure from the MNIST implementation can serve as a reference, but you'll need to modify the network dimensions and potentially add more capacity to handle the increased complexity.

Your submission should include complete implementation code (in a standalone file), training curves, generated image samples (include below), and a brief analysis (1-2 paragraphs) comparing your CIFAR-10 results with your MNIST implementation.

## Extra Credit: DDIM Sampler Implementation (20 points)

Implement the Denoising Diffusion Implicit Models (DDIM) sampling method, which allows for faster sampling, based on the paper [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502).

1. Implement the DDIM sampling algorithm based on the paper.
2. Compare DDIM sampling with the standard DDPM sampling in terms of:
* Sampling speed
* Sample quality
* Number of required steps

3. Experiment with different numbers of DDIM steps and analyze the tradeoff between speed and quality.

Your submission should include complete implementation code (can be in another Python file), generated image samples (using the same MNIST model you presented above), and a brief analysis (1-2 paragraphs) comparing DDIM and DDPM.
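As a starting point, the sketch below shows the two ingredients of deterministic DDIM sampling (the $\eta = 0$ case in the paper): choosing a shorter subsequence of timesteps, and the update that moves from one selected timestep to the next. It works on flat lists of floats to stay self-contained. The function names are hypothetical, `alpha_bar` denotes the cumulative product of $1 - \beta_t$, and a real sampler would call the trained model to obtain `eps_pred` at each step.

```python
import math


def ddim_timesteps(noise_steps=1000, num_steps=50):
    """Evenly spaced timesteps from noise_steps - 1 down to 0 (num_steps >= 2)."""
    steps = [round(i * (noise_steps - 1) / (num_steps - 1)) for i in range(num_steps)]
    return steps[::-1]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    For the final step, pass alpha_bar_prev = 1.0 to return the predicted x_0.
    """
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)
    x_prev = []
    for x, e in zip(x_t, eps_pred):
        # Estimate the clean value implied by the current noise prediction.
        x0_pred = (x - sqrt_one_minus_ab_t * e) / sqrt_ab_t
        # Re-noise that estimate to the (lower) noise level of the next timestep.
        x_prev.append(math.sqrt(alpha_bar_prev) * x0_pred
                      + math.sqrt(1.0 - alpha_bar_prev) * e)
    return x_prev


# Sanity check with a known x_0 and a "perfect" noise prediction.
x0 = [0.5, -1.0]
eps = [0.3, 0.8]
ab_t, ab_prev = 0.2, 0.6
x_t = [math.sqrt(ab_t) * a + math.sqrt(1 - ab_t) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
schedule = ddim_timesteps(1000, 50)
print(schedule[:3], schedule[-3:])
```

Because the update has no random term, the number of entries in the timestep subsequence is the main knob for the speed versus quality experiments requested above.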