# Assignment 4: Diffusion Model

In this assignment, you will implement a diffusion model from scratch and train it on the MNIST dataset. Diffusion models are a class of generative models that learn to gradually denoise random noise to generate realistic images. This assignment will guide you through the core components and training process of diffusion models.

Useful links:
1. [What are Diffusion Models?](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)
2. [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)

Please:
* Fill out the code marked with `TODO` or `Your code here`. You are allowed to split functions or visualizations into different files for more flexibility, as long as your output includes what we asked for.
* Reuse or modify visualization code from Assignment 2 for creating necessary visualizations.
* Submit the notebook with all original outputs. If an output is included from another file, please include that file in your folder.
* Answer the questions at the end of the notebook. Write your answers in the notebook.

**Please reserve enough time for this assignment given the potential amount of time for training.**

In [1]:
import torch

## Part 1: Implementing the U-Net (30 pt)

In this part, you will implement a U-Net style model that serves as the backbone for the diffusion process. The model takes noisy images and their corresponding timesteps as input and predicts the noise that was added to the original images.

Please fill out the code in `diffusion.DiffusionModel`, then run the following code to test it. For the time embedding, you can use just one embedding layer and concatenate it with the features. The attention layer is not required given the limited computation resources.
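
One possible reading of the time-embedding hint is sketched below: a single `nn.Embedding` maps each integer timestep to a vector, which is broadcast over the spatial grid and concatenated with the feature map along the channel axis. The class name `TimeConditionedBlock` and the sizes `emb_dim=32` and `out_channels=16` are illustrative assumptions, not part of the provided `diffusion` module.

```python
import torch
import torch.nn as nn


class TimeConditionedBlock(nn.Module):
    """Hypothetical conv block that concatenates a timestep embedding to its input."""

    def __init__(self, in_channels, out_channels, noise_steps=1000, emb_dim=32):
        super().__init__()
        # Single embedding layer: one learned vector per integer timestep.
        self.time_emb = nn.Embedding(noise_steps, emb_dim)
        self.conv = nn.Conv2d(in_channels + emb_dim, out_channels, kernel_size=3, padding=1)

    def forward(self, x, t):
        # (B, emb_dim) -> (B, emb_dim, H, W) so it can be stacked with the features.
        emb = self.time_emb(t)[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        return torch.relu(self.conv(torch.cat([x, emb], dim=1)))


block = TimeConditionedBlock(in_channels=1, out_channels=16)
x = torch.randn(4, 1, 28, 28)
t = torch.randint(0, 1000, (4,))
out = block(x, t)
print(out.shape)
```

A full U-Net would repeat this kind of conditioning at several resolutions; the sketch only shows the concatenation step.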

In [2]:
from diffusion import DiffusionModel

def check_diffusion_model(model_class):
    """Verify that the DiffusionModel class is correctly implemented."""
    try:
        channels = 1
        image_size = 28
        noise_steps = 1000
        model = model_class(image_size=image_size, channels=channels)
        
        # Test forward pass with random inputs
        batch_size = 4
        x = torch.randn(batch_size, channels, image_size, image_size)
        t = torch.randint(0, noise_steps, (batch_size,))
        
        output = model(x, t)
        
        # Check output shape
        expected_shape = (batch_size, channels, image_size, image_size)
        assert output.shape == expected_shape, f"Expected output shape {expected_shape}, got {output.shape}"
        
        print("DiffusionModel implementation is correct!")
        return True
    except Exception as e:
        print(f"DiffusionModel check failed: {str(e)}")
        return False
    
check_diffusion_model(DiffusionModel)

DiffusionModel implementation is correct!


True

## Part 2: Implementing the Diffusion Process (30 pt)

In this part, you will implement the core diffusion process, including the forward diffusion (adding noise) and the denoising process. This includes setting up the noise schedule and implementing functions for noise addition and sampling.

Please fill out the code in `diffusion.DiffusionProcess`, then run the following code to test it. Note that this test only checks the output format, so you need to verify the actual math yourself.
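
As a reference point for the math, the forward process has a closed form: x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 - ᾱ_t) · ε, where ᾱ_t is the cumulative product of (1 - β). The sketch below assumes a linear schedule with the DDPM paper's endpoints (1e-4 and 0.02), which the notebook itself does not specify. `add_noise_sketch` is a hypothetical stand-in, not the required `DiffusionProcess.add_noise`.

```python
import torch

noise_steps = 1000
# Assumed linear schedule endpoints (DDPM paper defaults).
betas = torch.linspace(1e-4, 0.02, noise_steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)


def add_noise_sketch(x0, t):
    """Sample x_t from q(x_t | x_0) in one shot; returns (x_t, noise)."""
    noise = torch.randn_like(x0)
    # Reshape per-sample coefficients to (B, 1, 1, 1) so they broadcast over C, H, W.
    sqrt_ab = alpha_bar[t].sqrt()[:, None, None, None]
    sqrt_one_minus_ab = (1.0 - alpha_bar[t]).sqrt()[:, None, None, None]
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise, noise


x0 = torch.randn(4, 1, 28, 28)
t = torch.tensor([0, 250, 500, 999])
x_t, noise = add_noise_sketch(x0, t)
print(x_t.shape, noise.shape)
```

The reshape to `(B, 1, 1, 1)` is the step most easily missed: without it, indexing `alpha_bar[t]` yields shape `(B,)`, which does not broadcast against `(B, C, H, W)` images.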

In [3]:
from diffusion import DiffusionProcess

def check_diffusion_process(diffusion_class):
    """Verify that the DiffusionProcess class is correctly implemented."""
    try:
        channels = 1
        image_size = 28
        noise_steps = 1000
        diffusion = diffusion_class(image_size=image_size, channels=channels, noise_steps=noise_steps)
        
        # Test add_noise function
        batch_size = 4
        x = torch.randn(batch_size, channels, image_size, image_size)
        t = torch.randint(0, noise_steps, (batch_size,))
        
        noisy_x, noise = diffusion.add_noise(x, t)
        assert noisy_x.shape == x.shape, f"Expected noisy_x shape {x.shape}, got {noisy_x.shape}"
        assert noise.shape == x.shape, f"Expected noise shape {x.shape}, got {noise.shape}"
        
        # Test train_step function
        loss = diffusion.train_step(x)
        assert isinstance(loss, float), f"Expected loss to be a float, got {type(loss)}"
        
        print("DiffusionProcess implementation is correct!")
        return True
    except Exception as e:
        print(f"DiffusionProcess check failed: {str(e)}")
        return False
    
check_diffusion_process(DiffusionProcess)

DiffusionProcess implementation is correct!


True

## Part 3: Training and Sampling (20 points)

In this part, you will implement the training loop for the diffusion model and the functions for generating and visualizing samples. Please build on the code you have written and use the `DiffusionModel` and `DiffusionProcess` above to write your training function. You should write your training code in a standalone Python file.

Please include the training curves and the sampled results below. You can reuse the visualization code we provided in the GAN assignment.

You can include an image like:

![image](./DDPM.png)

# 3 Layer Architecture Setup

## Linear Beta Scheduling
lr at 3e-4
![image](./analysis/linear_training_loss.png)

### Sample from training at epoch 100
<!-- ![image](./analysis/linear_sample_epoch_100.png) -->
<img src="./analysis/linear_sample_epoch_100.png" width="200px">

### Full 8x8 Sample Generation
<img src="./analysis/linear_final_comparison_8x8.png">

## Cosine Beta Scheduling
cosine clip at 0.95 and lr at 3e-4
![image](./analysis/cosine_training_loss.png)

### Sample from training at epoch 100
<!-- ![image](./analysis/cosine_sample_epoch_100.png) -->
<img src="./analysis/cosine_sample_epoch_100.png" width="200px">

### Full 8x8 Sample Generation
<img src="./analysis/cosine_final_comparison_8x8.png">

# 4 Layer Architecture Setup

## Linear Beta Scheduling
lr at 1e-4
![image](./analysis/linear_4layer_training_loss.png)

<!-- ![image](./analysis/linear_sample_epoch_100.png) -->
### Sample from training at epoch 100
<img src="./analysis/linear_4layer_sample_epoch_100.png" width="200px">

### Full 8x8 Sample Generation
<img src="./analysis/linear_4layer_final_comparison_8x8.png">

## Cosine Beta Scheduling
cosine clip at 0.99 and lr at 1e-4
![image](./analysis/cosine_4layer_training_loss.png)

<!-- ![image](./analysis/cosine_sample_epoch_100.png) -->
### Sample from training at epoch 100
<img src="./analysis/cosine_4layer_sample_epoch_100.png" width="200px">

### Full 8x8 Sample Generation
<img src="./analysis/cosine_4layer_final_comparison_8x8.png">


## Part 4: Analysis and Visualization (20 points)

Answer the questions with your analysis. Most of the questions are open-ended. We are looking for your own observations from the experiments you did.

1. How does the choice of noise schedule (beta values) affect the training stability and sample quality? Try at least one alternative to the linear schedule (e.g., cosine or quadratic) and compare the results.

[Answer]: The choice of noise schedule significantly impacts both training stability and sample quality in diffusion models. After systematically experimenting with both linear and cosine beta schedules across different architectures, I conducted my primary analysis using an improved 4-level U-Net [64, 128, 256, 512] with 5-6M parameters, trained with a learning rate of 1e-4 and cosine clipping at 0.99. I also tested a simpler 3-level architecture [32, 64, 128] with higher learning rate (3e-4) and aggressive cosine clipping (0.95), but the 4-level model provided more meaningful insights due to its sufficient capacity and stable training configuration with overall better results.

Comparing the two schedules on the 4-level architecture revealed nuanced differences in training dynamics and sample quality. Examining the training curves, both schedules achieved smooth convergence with the cosine schedule reaching a final loss of approximately 0.048 and the linear schedule achieving approximately 0.030-0.035, suggesting the linear schedule may optimize the MSE objective slightly better. However, training stability was comparable between both approaches—the cosine schedule showed consistent smooth convergence throughout all 100 epochs, while the linear schedule exhibited one minor spike around epoch 85 but otherwise maintained stable training. Most critically, the sample quality showed interesting trade-offs: the cosine schedule produced digits with better-defined structure and more consistent stroke patterns, resulting in highly recognizable characters across all digit classes, while the linear schedule generated samples with slightly thicker, bolder strokes but occasionally less refined details in complex digits like 8s and 3s.

This performance comparison reveals that the linear schedule's simpler, constant rate of noise increase from β_start to β_end provides straightforward optimization that achieves lower MSE loss, while the cosine schedule's more gradual noise distribution—which preserves image structure longer in early timesteps through its curved beta trajectory—produces marginally better perceptual quality despite slightly higher loss values. The cosine schedule's mathematical formulation creates a more balanced learning signal across all 1000 timesteps, preventing the model from over-focusing on either very clean or very noisy images, whereas the linear schedule's uniform progression may lead to slight imbalances in how well different noise levels are learned. Ultimately, both schedules proved viable for MNIST generation with the properly configured 4-level architecture, with the choice depending on whether one prioritizes raw loss metrics (linear) or perceptual consistency (cosine). The dramatic improvement over my initial 3-level experiments demonstrates that architectural capacity is the primary determinant of sample quality, while noise schedule selection provides secondary but meaningful refinement to the results.
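
To make the difference between the two schedules concrete, the following sketch tabulates the cumulative signal fraction ᾱ_t for each. The linear endpoints (1e-4, 0.02) and the cosine offset `s=0.008` are assumed from commonly used defaults rather than taken from my training script, and `max_beta=0.99` mirrors the clip value reported for the 4-level run. All function names are illustrative.

```python
import math


def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Betas rising linearly from beta_start to beta_end (assumed endpoints)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def cosine_betas(T=1000, s=0.008, max_beta=0.99):
    """Betas derived from a squared-cosine alpha_bar curve, clipped at max_beta."""
    def f(step):
        return math.cos((step / T + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - f(i + 1) / f(i), max_beta) for i in range(T)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product of (1 - beta_i) over all steps up to t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


linear_ab = cumulative_alpha_bar(linear_betas())
cosine_ab = cumulative_alpha_bar(cosine_betas())
for t in (100, 250, 500, 750):
    print(f"t={t:4d}  linear alpha_bar={linear_ab[t - 1]:.3f}  "
          f"cosine alpha_bar={cosine_ab[t - 1]:.3f}")
```

Under these assumptions the cosine curve keeps a larger ᾱ_t than the linear one at the same step through most of the range, which is the sense in which it preserves image structure longer.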

2. Based on your observations, at which timesteps (early, middle, or late in the diffusion process) does the model seem to struggle the most with accurately predicting the noise (looking into loss)? Why do you think this occurs?

[Answer]: Based on the loss-per-timestep analysis and the training dynamics of my 4-level models, the noise-prediction loss decreases monotonically with the timestep index: it is highest at early timesteps (t < 100) and lowest at late timesteps (t > 800). Under the standard DDPM convention, where t = 0 is close to the clean image and t = 999 is close to pure noise, early timesteps are the lowest-noise inputs, so the model struggles most with predicting the noise when very little noise has been added.

The loss-per-timestep analysis reveals a smooth, exponential-like decay pattern. At the very beginning of the range (t ≈ 0-50), the loss is highest (~0.26 for cosine, ~0.22 for linear). Here x_t is almost identical to the clean digit and the added noise is scaled by a small factor (the square root of 1 - ᾱ_t, where ᾱ_t is the cumulative product of 1 - β), so the model must recover a unit-variance noise target from a perturbation that is barely visible in its input. The loss then decreases smoothly across the timestep range, reaching near-zero values (~0.005-0.01) at late timesteps (t > 800), where x_t is dominated by the noise itself and predicting it is close to copying the input. Notably, the linear schedule shows a steeper initial drop and slightly lower loss across most timestep ranges, particularly in the early-to-middle range (t = 100-400) where the input transitions from structure-dominated to noise-dominated. The cosine schedule exhibits slightly higher loss at very early timesteps, which is consistent with it adding noise more slowly at small t, but it maintains more consistent performance across the middle range, resulting in a smoother overall curve.

The training curve is consistent with these findings through its two-phase learning pattern. The rapid initial loss drop during epochs 0-15 (from 0.21 to 0.07 for linear) plausibly reflects the model quickly mastering the late, noise-dominated timesteps (t > 600), which are the easiest prediction tasks with the lowest per-timestep loss values (< 0.02). The subsequent slower convergence from epochs 15-100 (0.07 to 0.030) indicates continued refinement of the harder early and middle timesteps, particularly the range t < 300 where per-timestep losses remain elevated (> 0.05). The minor spike observed around epoch 85 in the linear schedule likely represents a temporary instability, though the model recovered quickly and continued improving.

The fundamental challenge at early timesteps is that the quantity being predicted is nearly invisible in the input: at t < 100 the noise is a small perturbation on top of a sharp digit, so the model has to separate fine-grained noise from genuine stroke detail, and any residual error is measured against a unit-variance target. As t grows, the noise makes up a larger share of x_t and the prediction becomes progressively easier, which explains the smooth monotonic decrease in loss. A low loss at late timesteps does not by itself mean those steps are easy for generation: they set the global digit structure during sampling, and a small error in the predicted noise there corresponds to a large error in the implied clean image. The linear schedule's lower final training loss (roughly 37.5% lower: 0.030 vs 0.048) and its lower per-timestep losses across most ranges are consistent with the thicker, bolder strokes in its samples, although the two final losses are averaged over different noise-level distributions and are therefore not strictly comparable.

### Linear 4 layer loss per timestep
![image](./analysis/linear_4layer_loss_per_timestep.png)

### Cosine 4 layer loss per timestep
![image](./analysis/cosine_4layer_loss_per_timestep.png)
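
A per-timestep curve like the ones above can be produced by recording the unreduced per-sample MSE together with the sampled timestep, then averaging within timestep buckets. The helper below is a minimal sketch of that averaging step. `mean_loss_per_bucket` is a hypothetical name, and the toy inputs are placeholders to show the call, not measurements from my runs.

```python
from collections import defaultdict


def mean_loss_per_bucket(timesteps, losses, noise_steps=1000, num_buckets=10):
    """Average per-sample losses inside equal-width timestep buckets."""
    width = noise_steps // num_buckets
    sums, counts = defaultdict(float), defaultdict(int)
    for t, loss in zip(timesteps, losses):
        bucket = min(t // width, num_buckets - 1)
        sums[bucket] += loss
        counts[bucket] += 1
    # Key each bucket by its inclusive (first_t, last_t) range.
    return {(b * width, (b + 1) * width - 1): sums[b] / counts[b] for b in sorted(sums)}


# Placeholder numbers purely to show the call; they are not measurements.
toy_timesteps = [5, 40, 120, 480, 950]
toy_losses = [0.30, 0.20, 0.10, 0.04, 0.01]
result = mean_loss_per_bucket(toy_timesteps, toy_losses)
print(result)
```

In practice the two input lists would be filled over many evaluation batches so that every bucket contains enough samples for a stable average.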

3. Perform interpolation between two noise vectors and analyze the resulting generated images. Is the transition smooth? What does this tell you about the model's learned latent space?

[Answer]: The interpolation results from the linear schedule model demonstrate that the learned noise space exhibits discrete, well-separated digit representations rather than smooth semantic transitions. The interpolation sequence shows a clear progression through distinct digit identities, where each step produces a recognizable, well-defined digit with few ambiguous or hybrid forms. The transitions are characterized by abrupt changes from one digit class to another rather than gradual morphing.

This discrete transition pattern reveals fundamental properties of how diffusion models organize the noise space. Unlike VAEs or GANs, which learn explicit semantic latent spaces where interpolation produces smooth morphing (e.g., a "3" gradually reshaping into an "8"), diffusion models work directly in high-dimensional Gaussian noise space (ℝ^(1×28×28)) with no inherent semantic structure. When linearly interpolating between two random noise vectors z₁ and z₂, we create intermediate points z_interp = (1-α)z₁ + αz₂ that are simply weighted combinations of random noise patterns. The denoising process then independently maps each of these interpolated noise vectors to whichever digit structure the model has learned to associate with that particular noise pattern, with no guarantee that nearby noise vectors correspond to semantically similar outputs.
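
One property of the linear formula above is worth making explicit: the midpoint of two independent Gaussian vectors has a noticeably smaller norm than either endpoint, so it is not a typical draw from N(0, I). Spherical interpolation (slerp) is a common alternative that keeps the norm roughly constant. The sketch below compares the two on random 1×28×28 vectors. It is an illustration with hypothetical helper names, not the code used for the figures in this notebook.

```python
import math
import random


def norm(z):
    return math.sqrt(sum(v * v for v in z))


def lerp(z1, z2, alpha):
    """Linear interpolation: (1 - alpha) * z1 + alpha * z2."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z1, z2)]


def slerp(z1, z2, alpha):
    """Spherical interpolation: follows the arc between z1 and z2."""
    dot = sum(a * b for a, b in zip(z1, z2))
    cos_omega = max(-1.0, min(1.0, dot / (norm(z1) * norm(z2))))
    omega = math.acos(cos_omega)
    if omega < 1e-6:  # nearly parallel vectors: fall back to lerp
        return lerp(z1, z2, alpha)
    w1 = math.sin((1 - alpha) * omega) / math.sin(omega)
    w2 = math.sin(alpha * omega) / math.sin(omega)
    return [w1 * a + w2 * b for a, b in zip(z1, z2)]


rng = random.Random(0)
dim = 1 * 28 * 28
z1 = [rng.gauss(0, 1) for _ in range(dim)]
z2 = [rng.gauss(0, 1) for _ in range(dim)]
mid_lerp = lerp(z1, z2, 0.5)
mid_slerp = slerp(z1, z2, 0.5)
print(f"endpoints: {norm(z1):.2f}, {norm(z2):.2f}")
print(f"lerp midpoint: {norm(mid_lerp):.2f}  slerp midpoint: {norm(mid_slerp):.2f}")
```

Because two high-dimensional Gaussian vectors are nearly orthogonal, the lerp midpoint's norm shrinks by about a factor of sqrt(2), while the slerp midpoint stays close to the endpoint norms.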

The clean, discrete nature of the transitions—where each interpolation step produces a distinct, recognizable digit—indicates that the linear schedule successfully learned to partition the noise space into well-separated regions, with each region strongly associated with a specific digit class. The lack of ambiguous intermediate forms suggests sharp boundaries between these regions, meaning the model developed high confidence in its noise-to-digit mappings. This well-structured implicit organization likely contributes to the linear schedule's superior overall performance: the clear separation between digit classes in noise space translates to more confident, decisive denoising at each timestep, resulting in the thicker, bolder strokes and cleaner backgrounds observed in the final samples. The model "knows" which digit it should generate from a given noise pattern rather than hedging between multiple possibilities.

This experiment demonstrates that while diffusion models can generate high-quality samples and learn effective mappings from noise to images, they fundamentally differ from latent-variable models in how they represent data. The organization of the noise space emerges implicitly through the denoising objective rather than being explicitly structured through an encoder-decoder architecture. The discrete interpolation behavior—jumping between distinct digit classes rather than smoothly morphing between them—is not a limitation but rather a natural consequence of operating in unstructured Gaussian noise space. The fact that the linear schedule achieved such clear separation between digit classes while maintaining excellent generation quality (loss of 0.030, clean visual samples) validates that this implicit organization, when properly learned, can be highly effective for generative modeling tasks.

### Linear Interpolation
<img src="./analysis/linear_4layer_interpolation.png" width="300px">

### Cosine Interpolation
<img src="./analysis/cosine_4layer_interpolation.png" width="300px">

4. Recall that in Assignment 2 we implemented a GAN. Compare your diffusion model with GANs in terms of:
* Training stability
* Sample quality
* Diversity of samples
* Computational requirements
* Anything else you find interesting

[Answer]: Comparing my DDPM and GAN implementations on MNIST revealed nuanced trade-offs between these two generative modeling approaches, with each demonstrating distinct strengths. After systematic experimentation and architectural improvements, my final **4-level linear schedule DDPM** achieved competitive performance that narrowed the gap with my GAN implementation, though notable differences remained across multiple dimensions.

**Training Stability:** The DDPM demonstrated dramatically superior training stability compared to the GAN. The diffusion model exhibited smooth, predictable convergence from initial loss of 0.21 to final loss of 0.030 over 100 epochs, with consistent improvement and only one minor spike around epoch 85. In contrast, the GAN required careful hyperparameter tuning, spectral normalization to prevent mode collapse, and precise learning rate balancing between generator and discriminator. The GAN's training curves showed characteristic adversarial oscillations with d_loss and g_loss constantly competing, and the model remained vulnerable to training instabilities throughout the entire training process. The DDPM's stability made it significantly easier to debug, tune, and achieve reliable results—I could confidently restart training with new configurations knowing convergence was virtually guaranteed.

**Sample Quality:** This dimension showed the most interesting evolution through my experiments. My initial 3-level DDPM [32, 64, 128] produced noticeably inferior samples compared to the GAN: digits were thin, wispy, and inconsistent, with several ambiguous or malformed characters. However, upgrading to the 4-level architecture [64, 128, 256, 512] and optimizing the training configuration (linear schedule, learning rate 1e-4) substantially closed this gap. The final DDPM samples exhibited thicker, more confident strokes with better-defined structure, producing recognizable digits across all classes. The GAN samples remained slightly superior with sharper edges, cleaner backgrounds, and more consistent stroke thickness, particularly for complex digits like 8s and 3s. Critically, the GAN's single-step generation avoided error accumulation, while the DDPM's 1000-step iterative denoising could compound small prediction errors into visible artifacts. The DDPM's final noise-prediction loss of 0.030 cannot be compared directly with the GAN's adversarial losses because the objectives differ, but together with the per-timestep loss analysis it indicates that the model learned the denoising task well across noise levels despite its architectural constraints.

**Diversity of Samples:** Both models demonstrated excellent diversity without mode collapse. The GAN, despite its vulnerability to mode collapse during training, successfully generated all 10 digit classes with varied writing styles once properly configured with spectral normalization. The DDPM showed similarly strong diversity, with interpolation analysis revealing well-separated digit classes in noise space and no evidence of missing or underrepresented digits. The DDPM's training on explicit noise levels across 1000 timesteps may have provided more comprehensive coverage of the data distribution compared to the GAN's implicit learning through adversarial feedback, though both achieved the practical goal of diverse, representative generation.

**Computational Requirements:** The GAN demonstrated overwhelming superiority in computational efficiency. Generation required only a single forward pass through the generator network, producing 64 samples in milliseconds. The DDPM, by contrast, required 1000 iterative denoising steps per sample, making generation approximately 1000× slower—producing the same 64 samples took several minutes. Training time per epoch also favored the GAN (~10-12 seconds) compared to the 4-level DDPM (~18 seconds), and the GAN achieved strong results in just 10-20 epochs versus the DDPM's 100 epochs. For applications requiring real-time generation or large-scale sampling, the GAN's speed advantage is decisive.

**Implementation Complexity and Lessons Learned:** My journey with the DDPM revealed that architectural choices and training configuration significantly impact results. The progression from the 3-level to the 4-level architecture, combined with noise schedule selection (linear reaching a lower training loss than cosine on MNIST) and learning rate tuning, was essential for competitive performance. The GAN, while requiring careful adversarial balancing, ultimately proved simpler to optimize once spectral normalization was incorporated. An important insight emerged: **optimal solutions are highly task-dependent**. The cosine schedule, favored in recent literature for complex high-resolution images, did not show a clear advantage over the simpler linear schedule on MNIST's low-resolution, nearly binary digits: it ended at a higher training loss and gave only marginally more consistent strokes. Similarly, the GAN's architectural simplicity proved advantageous for MNIST's relatively straightforward generation task, while the DDPM's iterative refinement, theoretically superior for complex images, introduced unnecessary overhead for simple digit generation.

**Conclusion:** Both approaches successfully generated convincing MNIST digits, validating their respective theoretical foundations. The DDPM offered superior training stability, guaranteed convergence, and explicit control over the generation process through timestep conditioning, making it more predictable and debuggable. The GAN provided better computational efficiency, slightly higher perceptual quality, and faster iteration during development. For MNIST specifically, the GAN's advantages in speed and sample quality make it the more practical choice, though the DDPM's stability and scalability to more complex tasks (as demonstrated in recent text-to-image models like Stable Diffusion) showcase its broader potential. This comparison reinforced that **effective machine learning requires matching methodology to task requirements** rather than assuming newer or theoretically sophisticated approaches are universally superior.

## Extra Credit: Diffusion Model on CIFAR 10 (20 pt)

In this extra credit assignment, you'll extend your Diffusion implementation to handle the more complex CIFAR-10 dataset.

You should design your own network architectures, considering factors like the increased complexity of RGB images, memory efficiency, and training stability. The basic code structure from the MNIST implementation can serve as a reference, but you'll need to modify the network dimensions and potentially add more capacity to handle the increased complexity.

Your submission should include complete implementation code (in a standalone file), training curves, generated image samples (include below), and a brief analysis (1-2 paragraphs) comparing your CIFAR-10 results with your MNIST implementation.

## Extra Credit: DDIM Sampler Implementation (20 points)

Implement the Denoising Diffusion Implicit Models (DDIM) sampling method, which allows for faster sampling, based on the paper [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502).

1. Implement the DDIM sampling algorithm based on the paper.
2. Compare DDIM sampling with the standard DDPM sampling in terms of:
* Sampling speed
* Sample quality
* Number of required steps

3. Experiment with different numbers of DDIM steps and analyze the tradeoff between speed and quality.

Your submission should include complete implementation code (can be in another Python file), generated image samples (using the same MNIST model you presented above), and a brief analysis (1-2 paragraphs) comparing DDIM and DDPM.
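
As a starting point, the deterministic DDIM update (η = 0) can be written in two stages: estimate the clean image from the predicted noise, then re-noise that estimate to the earlier step's noise level. The sketch below works on flat Python lists so the arithmetic is easy to check. The function names and the evenly strided timestep choice are assumptions for illustration, and a real sampler would call the trained model to obtain `eps_pred`.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of pixel values."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)
    out = []
    for x, eps in zip(x_t, eps_pred):
        # Stage 1: clean-pixel estimate implied by the predicted noise.
        x0_pred = (x - sqrt_one_minus_ab_t * eps) / sqrt_ab_t
        # Stage 2: move to the earlier step, reusing the same noise direction.
        out.append(math.sqrt(alpha_bar_prev) * x0_pred
                   + math.sqrt(1.0 - alpha_bar_prev) * eps)
    return out


def ddim_timesteps(noise_steps=1000, num_steps=50):
    """Evenly strided subsequence of timesteps, ordered from noisiest to cleanest."""
    stride = noise_steps // num_steps
    return list(range(0, noise_steps, stride))[::-1]


# Sanity check with a known noise vector: the step should land exactly on the
# forward-process sample at the earlier noise level.
x0 = [0.5, -1.0]
eps = [0.3, -0.7]
ab_t, ab_prev = 0.2, 0.6
x_t = [math.sqrt(ab_t) * a + math.sqrt(1 - ab_t) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
steps = ddim_timesteps()
print(x_prev, len(steps))
```

With 50 strided steps instead of 1000, the number of network evaluations per sample drops by a factor of 20, which is the speed and quality tradeoff the experiments above are meant to explore.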