# VisionGen-Comparative-Study (Part 2)

_A Comparative Analysis of CGANs and Diffusion Models for Conditional Image Synthesis_

---

## Motivation

Generative models are transforming the field of computer vision, enabling AI systems to generate realistic images from simple inputs like class labels or text prompts. This project investigates:
- How do CGANs and Diffusion Models perform on classic and challenging datasets?
- What are the differences in sample quality, training stability, and controllability?
- Which model is more suitable for high-fidelity, class-conditional image synthesis?

## Approach

- **CGANs:**  
  Implemented and trained on MNIST (digits), Oxford-102 Flowers, and CUB-200 Birds datasets, using fully connected and convolutional architectures.

- **Diffusion Models:**  
  Integrated modern diffusion pipelines (e.g., DDPM, Stable Diffusion) using open-source libraries. Evaluated on the same datasets for direct comparison.

- **Comparative Analysis:**  
  Results evaluated both quantitatively (FID, Inception Score) and qualitatively (side-by-side image grids, training dynamics).

## Key Features

- End-to-end PyTorch and HuggingFace-based implementation
- Clean code, modular design, and ready-to-run Jupyter notebooks
- Direct, apples-to-apples comparison of CGANs and Diffusion Models
- Sample outputs, evaluation metrics, and visualizations included

---

## Author

**Ayushman Mishra**  
[GitHub: frMishR](https://github.com/frMishR)

---

## Research Inspiration

This project draws inspiration from foundational work in generative modeling, especially:

- **Conditional Generative Adversarial Nets (Mirza & Osindero, 2014):**  
  [arXiv:1411.1784](https://arxiv.org/abs/1411.1784)  
  The original CGAN paper provided the conceptual and technical basis for the CGAN implementations.

Recent advances in diffusion models and open-source research also guided the design of experiments and code.

---

## Project Origin

This project originated as a group submission for:

- **Course:** EEE 598: Generative AI – Theory and Practice  
- **Professor:** Dr. Lalitha Sankar (Arizona State University)  
- **Semester:** Spring 2025

### Original Contributors

- **Ayushman Mishra** (Solo Upgrade & Comparative Study, 2025)
- **Snavya Sai Munti Mudugu Badri Prasad** [GitHub: snavya0309](https://github.com/snavya0309)
- **Sushma Niresh** [GitHub: SushmaNiresh](https://github.com/SushmaNiresh)

> *Note: This repository represents a major solo upgrade and extension by Ayushman Mishra. All diffusion modeling, comparative analysis, and new documentation were developed independently by Ayushman Mishra (July–September 2025). The original CGAN implementation and dataset preparation were developed collaboratively by the above group.*

---

# VisionGen – Class-Conditional Diffusion (DDPM) on Multi-Class Datasets

> **Notebook Focus**: This notebook implements a **class-conditional Diffusion model (DDPM)** using a UNet backbone with `num_class_embeds` for label conditioning. We use the **same datasets** as Part 1 (e.g., **MNIST**, **Oxford-102 Flowers**, **CUB-200 Birds**) to enable an **apples-to-apples comparison** with cDCGAN.

---

## Objective

Move beyond adversarial training (GANs) to **score-based diffusion**.  
Diffusion models learn to **denoise images** from pure noise through a sequence of timesteps, offering **training stability** and **high-fidelity samples**. Here we make the model **class-conditional**, so generation can be guided by labels (e.g., “rose,” “eagle,” “digit 3”).

---

## Key Concepts Covered

- **Forward/Reverse Diffusion**:  
  - *Forward*: add Gaussian noise to images over T steps.  
  - *Reverse*: learn to **predict the noise** and iteratively denoise.
- **DDPM Training Objective**: Predict ε (noise) with MSE; scheduler defines β-schedule and timesteps.
- **Label Conditioning**: UNet receives **class embeddings**; sampling is controlled via class IDs.
- **Schedulers**: Linear or cosine (“`squaredcos_cap_v2`”) β-schedules; inference uses the learned reverse process.
- **Sampling**: Iterative denoising loop creates class-specific images from pure noise.
- **(Optional)** **FID/IS**: Quantitative comparison against validation sets; supports **CGAN vs. DDPM** analysis.
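
To make the forward process and the two β-schedules above more concrete, here is a minimal pure-Python sketch. It assumes a linear range of 1e-4 to 0.02 and a squared-cosine curve with offset 0.008 and a 0.999 cap; these values are illustrative assumptions, not settings confirmed by the notebook. The helper names (`linear_betas`, `cosine_betas`, `alpha_bars`, `q_sample`) are hypothetical.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas (start/end values are assumed, not from the notebook)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve, capped at max_beta."""
    def alpha_bar(u):  # u is the normalized time in [0, 1]
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1 - alpha_bar((i + 1) / T) / alpha_bar(i / T), max_beta)
            for i in range(T)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): the fraction of signal left at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, a_bars, rng):
    """Closed-form forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = a_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * v + math.sqrt(1 - a) * e for v, e in zip(x0, eps)]
    return x_t, eps  # eps is the regression target for the denoiser

T = 1000
lin_ab = alpha_bars(linear_betas(T))
cos_ab = alpha_bars(cosine_betas(T))
x_t, eps = q_sample([0.5, -0.25, 1.0], t=500, a_bars=lin_ab, rng=random.Random(0))
print(f"alpha_bar at t=500: linear={lin_ab[500]:.3f}, cosine={cos_ab[500]:.3f}")
```

Under these assumptions the cosine schedule retains more signal at mid-range timesteps than the linear one, which is the usual argument for trying it on small images.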

---

## Notebook Structure

1. **Imports & Setup** – `torch`, `diffusers` (UNet + `DDPMScheduler`), device config, seeds  
2. **Config & Hyperparameters** – `image_size`, `batch_size`, `epochs`, β-schedule, `num_train_timesteps`  
3. **Dataset Handling (same splits as Part 1)** – `ImageFolder` for Flowers/CUB, MNIST helper, transforms & normalization  
4. **Model Definition** – `UNet2DModel` with `num_class_embeds = K` and channels {1|3}  
5. **Training Loop** – random timestep per sample → add noise → predict ε → MSE loss → optimizer step  
6. **Sampling Routine** – iterative denoise with chosen labels; save class-grids each epoch  
7. **(Optional) Metrics** – generate N images per class; compute **FID** via `torch-fidelity`  
8. **Checkpointing & Reload** – save `UNet` weights + scheduler; quick reload for inference
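
Steps 5 and 6 of this outline can be sketched in plain PyTorch as follows. This is a CPU-only illustration under stated assumptions, not the notebook's code: `TinyEpsNet` is a hypothetical stand-in for the `UNet2DModel`, the schedule is hand-rolled instead of coming from `DDPMScheduler`, the optimizer choice (AdamW) is assumed, and the sampling variance is taken to be β_t.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                        # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear beta schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # alpha_bar_t

class TinyEpsNet(nn.Module):
    """Hypothetical stand-in for the UNet: predicts noise from (x_t, t, label)."""
    def __init__(self, channels=3, num_classes=10, hidden=16):
        super().__init__()
        self.time_emb = nn.Embedding(T, hidden)
        self.class_emb = nn.Embedding(num_classes, hidden)
        self.conv_in = nn.Conv2d(channels, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, t, y):
        # Timestep and class embeddings are summed and broadcast over H and W.
        cond = (self.time_emb(t) + self.class_emb(y))[:, :, None, None]
        return self.conv_out(F.silu(self.conv_in(x) + cond))

def train_step(model, optimizer, x0, y):
    t = torch.randint(0, T, (x0.shape[0],))      # random timestep per sample
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # add noise
    loss = F.mse_loss(model(x_t, t, y), noise)   # predict epsilon, MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, labels, shape):
    x = torch.randn(len(labels), *shape)         # start from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((len(labels),), t, dtype=torch.long)
        eps = model(x, t_batch, labels)
        mean = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise       # no noise is added at the final step
    return x

torch.manual_seed(0)
model = TinyEpsNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
x0 = torch.rand(8, 3, 16, 16) * 2 - 1            # fake batch already scaled to [-1, 1]
y = torch.randint(0, 10, (8,))
loss = train_step(model, optimizer, x0, y)
images = sample(model, labels=torch.tensor([3, 7]), shape=(3, 16, 16))
print(f"loss={loss:.4f}, samples={tuple(images.shape)}")
```

With an untrained stand-in the samples are meaningless; the point is only the shape of the loop. In the notebook, the scheduler and UNet from `diffusers` take the place of the hand-written schedule and `TinyEpsNet`.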

---

## Why this matters

- **Stability & Fidelity**: Diffusion training is typically more stable than GANs and often yields **cleaner textures** and **fewer mode-collapse artifacts**.  
- **Controlled Generation**: Class conditioning enables **targeted image synthesis** for multi-class datasets.  
- **Real-world Signal**: Demonstrates a working implementation of modern **score-based generative modeling**, not only adversarial setups.

---

## Comparative Angle (Part 1 vs Part 2)

| Aspect | cDCGAN (Part 1) | DDPM (Part 2) |
|---|---|---|
| Training | Adversarial (G vs D) | Denoising (predict ε) |
| Stability | Trickier (GAN dynamics) | Generally stable |
| Speed | Often faster per epoch | Slower, mainly at sampling (many denoising steps) |
| Fidelity | Good, may artifact | Often very clean |
| Conditioning | Label concat/embeds | `num_class_embeds` in UNet |
| Metrics | FID/IS | FID/IS |

> **Deliverable**: A **side-by-side table** of FID and sample grids for the same classes/datasets.

---

## Reproducibility & Usage Notes

- Fix seeds; log losses and sample grids at consistent epochs.  
- Keep **image size** and **train splits** identical to Part 1 for fair comparison.  
- For laptop GPUs: start with **64×64**, lower batch size, fewer epochs; scale up once the pipeline is verified. 

---

## Outcome

This notebook provides a **clean Diffusion baseline** to compare with your **cDCGAN** results—enabling a credible **Comparative Study** across **architecture families**, **datasets**, and **metrics**.

### Setting up GPU & Environment Check

Sanity-check your setup before training:
- Prints **Python**, **PyTorch**, **torchvision**, **diffusers**, **transformers** versions.
- Detects **CUDA** and reports GPU name + VRAM.
- Creates a `device` you’ll reuse later (`cuda` or `cpu`).

In [4]:
import sys, platform
import torch

def _ver(mod):
    try:
        return mod.__version__
    except Exception:
        return "N/A"

print(f"Python: {sys.version.split()[0]} ({platform.system()})")
print(f"Torch: {torch.__version__}")

# Optional libs
try:
    import torchvision
    print("torchvision:", _ver(torchvision))
except Exception:
    print("torchvision: not installed")

try:
    import diffusers
    print("diffusers:", _ver(diffusers))
except Exception:
    print("diffusers: not installed")

try:
    import transformers
    print("transformers:", _ver(transformers))
except Exception:
    print("transformers: not installed")

# Device info
if torch.cuda.is_available():
    dev_id = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(dev_id)
    cap = torch.cuda.get_device_capability(dev_id)
    vram_gb = props.total_memory / (1024**3)
    print(f"Device: CUDA:{dev_id} — {props.name} (CC {cap[0]}.{cap[1]}, {vram_gb:.1f} GB VRAM)")
    device = torch.device("cuda")
else:
    print("Device: CPU (CUDA not available)")
    device = torch.device("cpu")

device

Python: 3.10.0 (Windows)
Torch: 2.5.1+cu121
torchvision: 0.20.1+cu121
diffusers: 0.35.1
transformers: not installed
Device: CUDA:0 — NVIDIA GeForce RTX 4060 Laptop GPU (CC 8.9, 8.0 GB VRAM)


device(type='cuda')

### Utilities

We define helper functions for **reproducibility** and **visualization**:

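- `set_seed(seed)`: Seeds `random`, `numpy`, and `torch` (CPU and all CUDA devices) and puts cuDNN in deterministic mode so runs are repeatable and comparable. Some GPU ops can still be nondeterministic, so expect close rather than bit-identical results across machines.  
- `show_grid(tensor, nrow, title)`: Displays a batch of images as a grid, mapping them from `[-1,1]` back to `[0,1]` when the batch contains negative values.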

These are lightweight utilities used throughout training and evaluation.

In [11]:
import random, numpy as np
import matplotlib.pyplot as plt
from torchvision.utils import make_grid

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    print(f"Seed fixed at {seed}")

def show_grid(tensor, nrow=8, title=None):
    """Display a grid of images (expects [B,C,H,W])"""
    t = tensor.detach().cpu()
    if t.min() < 0:   # convert from [-1,1] to [0,1]
        t = (t + 1) / 2
    grid = make_grid(t, nrow=nrow, pad_value=0.5)
    plt.figure(figsize=(8,8))
    plt.axis("off")
    if title:
        plt.title(title)
    plt.imshow(grid.permute(1,2,0))
    plt.show()
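
A quick way to confirm that seeding behaves as intended is to draw random values twice under the same seed and once under a different one. The sketch below uses a hypothetical, trimmed-down `seed_everything` helper (NumPy and the cuDNN flags are left out for brevity) and checks CPU draws only.

```python
import random
import torch

def seed_everything(seed: int = 42) -> None:
    """Hypothetical reduced variant of set_seed; not the notebook's function."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable

def draw():
    """One Python-level draw and one tensor draw, both on CPU."""
    return random.random(), torch.randn(4)

seed_everything(42)
py_a, t_a = draw()
seed_everything(42)
py_b, t_b = draw()
seed_everything(43)
py_c, t_c = draw()

print("same seed reproduces draws:", py_a == py_b and torch.equal(t_a, t_b))
print("different seed reproduces draws:", torch.equal(t_a, t_c))
```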

### Configuration (what this code does)

We centralize all hyperparameters into a **Config dataclass** so that training can be easily adjusted.  

Key fields:  
- **dataset**: choose from `'mnist'`, `'oxford102'`, or `'cub'`.  
- **image_size**: resize all images (64 is laptop-friendly; 128+ for higher fidelity).  
- **batch_size / epochs / lr**: standard training settings.  
- **diffusion params**:  
  - `num_train_timesteps` → number of forward diffusion steps (default 1000).  
  - `beta_schedule` → schedule for variance growth (`linear`, `scaled_linear`, or `squaredcos_cap_v2`).  
  - `prediction_type` → what the model predicts (`epsilon` = noise).  
- **logging / sampling**: how often to log loss and save sample grids.  
- **checkpointing**: output directory and frequency of saving weights.  

This ensures reproducibility and makes it easy to rerun with different datasets or scales.

In [13]:
from dataclasses import dataclass

@dataclass
class Config:
    # Dataset options
    dataset: str = "oxford102"      # 'mnist' | 'oxford102' | 'cub'
    data_root: str = "data"         # expects structured subfolders

    # Image & training
    image_size: int = 64
    channels: int = 3               # 1 for MNIST, 3 otherwise
    batch_size: int = 64
    num_workers: int = 4
    epochs: int = 50
    lr: float = 2e-4
    betas: tuple = (0.9, 0.999)

    # Diffusion hyperparameters
    num_train_timesteps: int = 1000
    beta_schedule: str = "linear"       # or 'scaled_linear', 'squaredcos_cap_v2'
    prediction_type: str = "epsilon"    # predict noise

    # Logging / sampling
    log_every: int = 100
    sample_every: int = 1               # epochs
    n_sample_per_class: int = 8

    # Checkpointing
    out_dir: str = "outputs/diffusion"
    ckpt_every: int = 5
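
One way to rerun with a different dataset or scale is to derive variants from a base config with `dataclasses.replace` rather than editing fields in place. The sketch below uses a hypothetical, reduced `DemoConfig` (not the notebook's `Config`) and adds two assumptions of its own: the dataset name is validated, and `channels` is derived from it.

```python
from dataclasses import dataclass, replace, asdict

@dataclass
class DemoConfig:
    """Hypothetical reduced stand-in for the notebook's Config."""
    dataset: str = "oxford102"
    image_size: int = 64
    channels: int = 3
    batch_size: int = 64
    epochs: int = 50
    beta_schedule: str = "linear"

    def __post_init__(self):
        allowed = {"mnist", "oxford102", "cub"}
        if self.dataset not in allowed:
            raise ValueError(f"dataset must be one of {sorted(allowed)}, got {self.dataset!r}")
        # Assumed rule: MNIST is grayscale, the other datasets are RGB.
        self.channels = 1 if self.dataset == "mnist" else 3

base = DemoConfig()
# replace() builds a new instance, so __post_init__ re-validates and re-derives channels.
mnist_cfg = replace(base, dataset="mnist", image_size=32, batch_size=128)
cub_cosine = replace(base, dataset="cub", beta_schedule="squaredcos_cap_v2")
print(asdict(mnist_cfg))
```

Validating the name up front means a mis-cased value fails at construction time instead of later, when a data path is built from it.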

### Datasets & Dataloaders (Oxford102 + CUB only)

We prepare datasets for the diffusion model:

- **Oxford-102 Flowers (`oxford102`)**: Expects `data/oxford102/train` and `data/oxford102/val`.  
- **CUB-200 Birds (`cub`)**: Expects `data/cub/train` and `data/cub/val`.  

All images are resized to `cfg.image_size`, converted to tensor, and normalized to `[-1,1]`.  
We return `train_loader`, `val_loader`, and `num_classes` (102 or 200).