# **Summary of Data Preparation Steps for DDPM Image Generation**
*This document outlines the key steps for preparing the Flowers and CelebA datasets to train a Denoising Diffusion Probabilistic Model (DDPM), explaining each step's purpose.*

# **Steps**

* **Image Resizing**
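  * Purpose: Standardize all images to a fixed resolution (e.g., 256x256 pixels) so that every input has the same shape, which is required to stack images into batches and feed them to a fixed-size network.
  * Implementation: We resize images to 256x256 pixels using `transforms.Resize((256, 256))`. Passing a (height, width) tuple resizes to exactly that size and does not preserve the original aspect ratio.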

* **Normalization**
  * Purpose: Scale image pixel values to the range $[-1, 1]$ to prepare images for noise addition and removal in the DDPM pipeline, which typically performs best within this range.
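  * Implementation: We apply `transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])` after `transforms.ToTensor()`. `ToTensor` scales pixel values to [0, 1], and the normalization then computes (x - 0.5) / 0.5 per channel, which maps [0, 1] to [-1, 1].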

* **Data Augmentation (Training Only)**
  * Purpose: Increase variability in the training data with random flips, rotations, and color adjustments, which helps the model generalize.
  * Implementation: Augmentations include `RandomHorizontalFlip`, `RandomRotation(15)`, and `ColorJitter`, defined in `augmentation_transforms`. These are applied only to the training set; validation and test images use `common_transforms`.

* **Dataset Splitting**
  * Purpose: Divide the dataset into training, validation, and test sets to enable effective model evaluation and generalization testing.
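  * Implementation: We split each dataset into 80% training, 10% validation, and 10% test sets using two successive calls to scikit-learn's `train_test_split`: the first holds out the test set, and the second carves the validation set out of the remainder.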

* **DataLoader Creation**
  * Purpose: Enable efficient, batched loading of data for model training and evaluation. Shuffling in the training DataLoader reorders the samples every epoch, so batches do not contain the same images in the same order each time.
  * Implementation: We create `DataLoader` instances for each split with a batch size of 32 and 4 worker processes, with shuffling enabled only for training.
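
The normalization step above can be checked by hand. The following is a minimal sketch in plain Python (no PyTorch required) of the per-channel formula that `transforms.Normalize` applies, `(x - mean) / std`, together with its inverse, which is useful when viewing generated samples. The function names `normalize` and `denormalize` are illustrative and do not appear in the notebook.

```python
def normalize(x, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] (the range ToTensor produces) to [-1, 1]."""
    return (x - mean) / std


def denormalize(y, mean=0.5, std=0.5):
    """Invert normalize(): map a value in [-1, 1] back to [0, 1]."""
    return y * std + mean


# Black, mid-gray, and white pixels after ToTensor.
print(normalize(0.0))  # -1.0
print(normalize(0.5))  # 0.0
print(normalize(1.0))  # 1.0

# Round trip recovers the original value.
print(denormalize(normalize(0.25)))  # 0.25
```

The same inverse, applied to a whole tensor, is what one would typically use before saving or plotting model outputs.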

In [1]:
import os
import random
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset
from torchvision import transforms
from torchvision.datasets import ImageFolder
import torch
import zipfile
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [2]:
!pip install kaggle

!kaggle datasets download -d alxmamaev/flowers-recognition
!kaggle datasets download -d jessicali9530/celeba-dataset

Dataset URL: https://www.kaggle.com/datasets/alxmamaev/flowers-recognition
License(s): unknown
Downloading flowers-recognition.zip to /content
 90% 202M/225M [00:01<00:00, 132MB/s]
100% 225M/225M [00:01<00:00, 147MB/s]
Dataset URL: https://www.kaggle.com/datasets/jessicali9530/celeba-dataset
License(s): other
Downloading celeba-dataset.zip to /content
 98% 1.31G/1.33G [00:06<00:00, 211MB/s]
100% 1.33G/1.33G [00:06<00:00, 204MB/s]


In [3]:
with zipfile.ZipFile('flowers-recognition.zip', 'r') as zip_ref:
    zip_ref.extractall('flowers-recognition')

with zipfile.ZipFile('celeba-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall('celeba')
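
Re-running the cell above extracts both archives again, which is slow for the 1.33 GB CelebA file. One possible way to make the step idempotent is sketched below, assuming that a non-empty target directory means the archive was already extracted. The helper name `extract_once` is hypothetical and not part of the notebook.

```python
import zipfile
from pathlib import Path


def extract_once(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already has content.

    Returns True if extraction happened, False if it was skipped.
    """
    target = Path(target_dir)
    # Assumption: any existing content means a previous run already extracted the archive.
    if target.is_dir() and any(target.iterdir()):
        return False
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target)
    return True
```

In this notebook it would be called as `extract_once('flowers-recognition.zip', 'flowers-recognition')` and `extract_once('celeba-dataset.zip', 'celeba')`. Note that this check does not detect a partially extracted archive.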

In [5]:
flowers_dataset_path = "flowers-recognition/flowers"
celeba_images_path = "celeba/img_align_celeba"
celeba_attributes_path = "celeba/list_attr_celeba.csv"

common_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

augmentation_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

def split_dataset(dataset, val_split=0.1, test_split=0.1):
    # Hold out the test set first.
    indices = list(range(len(dataset)))
    train_idx, test_idx = train_test_split(indices, test_size=test_split)
    # val_split is a fraction of the full dataset, so rescale it to the
    # (1 - test_split) share that remains after the test set is removed.
    train_idx, val_idx = train_test_split(train_idx, test_size=val_split / (1 - test_split))
    return Subset(dataset, train_idx), Subset(dataset, val_idx), Subset(dataset, test_idx)

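flowers_dataset = ImageFolder(root=flowers_dataset_path, transform=common_transforms)
flowers_train, flowers_val, flowers_test = split_dataset(flowers_dataset)
# Re-point the training indices at an augmented view of the same files (training only).
flowers_augmented = ImageFolder(root=flowers_dataset_path, transform=augmentation_transforms)
flowers_train = Subset(flowers_augmented, flowers_train.indices)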

flowers_train_loader = DataLoader(flowers_train, batch_size=32, shuffle=True, num_workers=4)
flowers_val_loader = DataLoader(flowers_val, batch_size=32, shuffle=False, num_workers=4)
flowers_test_loader = DataLoader(flowers_test, batch_size=32, shuffle=False, num_workers=4)

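celeba_images = ImageFolder(root=celeba_images_path, transform=common_transforms)
celeba_train, celeba_val, celeba_test = split_dataset(celeba_images)
# Re-point the training indices at an augmented view of the same files (training only).
celeba_augmented = ImageFolder(root=celeba_images_path, transform=augmentation_transforms)
celeba_train = Subset(celeba_augmented, celeba_train.indices)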

celeba_train_loader = DataLoader(celeba_train, batch_size=32, shuffle=True, num_workers=4)
celeba_val_loader = DataLoader(celeba_val, batch_size=32, shuffle=False, num_workers=4)
celeba_test_loader = DataLoader(celeba_test, batch_size=32, shuffle=False, num_workers=4)
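
Because the split is done in two stages, the second `train_test_split` call only sees the data left after the test set is removed, so the fraction passed to it determines the final proportions. The sketch below works through that arithmetic in plain Python. The helper `two_stage_fractions` is illustrative only, and it ignores the integer rounding that happens on real datasets.

```python
def two_stage_fractions(second_stage_fraction, test_split=0.1):
    """Return (train, val, test) shares of the full dataset for a two-stage split.

    second_stage_fraction is the test_size passed to the second split,
    which operates on the (1 - test_split) share left after the first split.
    """
    remaining = 1.0 - test_split
    val = remaining * second_stage_fraction
    train = remaining - val
    return train, val, test_split


val_split, test_split = 0.1, 0.1

# Rescaling by the remaining share gives roughly 80% / 10% / 10%.
print(two_stage_fractions(val_split / (1 - test_split), test_split))

# Rescaling by the target training share instead gives roughly 78.75% / 11.25% / 10%.
print(two_stage_fractions(val_split / (1 - val_split - test_split), test_split))
```

For an 80/10/10 split, the second call therefore needs `test_size = val_split / (1 - test_split)`, which is about 0.111 here.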