# 01 - Prepare Datasets

This notebook prepares the datasets for the Format Matters project.

**Datasets:**
1. **CIFAR-10** (small): Auto-download with torchvision
2. **ImageNet-mini** (medium, preferred): Manual upload/download
3. **Tiny-ImageNet-200** (medium, fallback): Auto-download

**Output:**
- Datasets extracted to `data/raw/<dataset>/`
- Verification of dataset integrity
- Summary statistics

In [1]:
import os
import sys
import shutil
import tarfile
import zipfile
from pathlib import Path
from collections import defaultdict
import urllib.request

import torch
import torchvision
from torchvision import datasets
from PIL import Image
from tqdm.auto import tqdm

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## Configuration

In [2]:
# Detect environment (local vs Kaggle)
IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ

if IS_KAGGLE:
    BASE_DIR = Path('/kaggle/working/format-matters')
    INPUT_DIR = Path('/kaggle/input')  # For Kaggle datasets
else:
    BASE_DIR = Path('..').resolve()
    INPUT_DIR = None

DATA_DIR = BASE_DIR / 'data'
RAW_DIR = DATA_DIR / 'raw'

print(f"Environment: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Base directory: {BASE_DIR}")
print(f"Data directory: {DATA_DIR}")

# Create directories
RAW_DIR.mkdir(parents=True, exist_ok=True)

Environment: Local
Base directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters
Data directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data


## 1. CIFAR-10 (Small Dataset)

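Download CIFAR-10 with torchvision, then write each image as a PNG file under `data/raw/cifar10/<split>/<class>/` so the format builders can read it. The torchvision test split is stored as `val`.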
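
As a rough illustration of the on-disk layout this step produces, the sketch below maps a sample index and label to its target path. `cifar_image_path` and `CIFAR10_CLASSES` are hypothetical names introduced here for illustration; the class order and the zero-padded `{idx:05d}.png` naming follow the extraction cell in this section.

```python
from pathlib import Path

# CIFAR-10 class names in label order (list index == integer label).
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']


def cifar_image_path(root: Path, split: str, idx: int, label: int) -> Path:
    """Hypothetical helper: path where sample `idx` with class `label` is written."""
    return root / split / CIFAR10_CLASSES[label] / f"{idx:05d}.png"


# Sample 42 of the train split, labelled 3 ('cat').
print(cifar_image_path(Path('data/raw/cifar10'), 'train', 42, 3).as_posix())
# → data/raw/cifar10/train/cat/00042.png
```

Because the file name is the dataset index, names are unique within a split but not contiguous within a class folder.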

In [3]:
def prepare_cifar10():
    """
    Download CIFAR-10 and extract to individual image files.
    """
    cifar_dir = RAW_DIR / 'cifar10'
    cifar_dir.mkdir(parents=True, exist_ok=True)
    
    print("Downloading CIFAR-10...")
    
    # Download using torchvision
    train_dataset = datasets.CIFAR10(
        root=str(cifar_dir / 'torchvision'),
        train=True,
        download=True
    )
    
    test_dataset = datasets.CIFAR10(
        root=str(cifar_dir / 'torchvision'),
        train=False,
        download=True
    )
    
    # Class names
    class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']
    
    # Extract to individual files
    for split_name, dataset in [('train', train_dataset), ('val', test_dataset)]:
        split_dir = cifar_dir / split_name
        
        # Skip if already extracted
        if split_dir.exists() and any(split_dir.rglob('*.png')):
            print(f"  {split_name} already extracted")
            continue
        
        print(f"  Extracting {split_name} split...")
        
        for class_idx in range(10):
            class_dir = split_dir / class_names[class_idx]
            class_dir.mkdir(parents=True, exist_ok=True)
        
        # Save images
        for idx, (img, label) in enumerate(tqdm(dataset, desc=f"  {split_name}")):
            class_name = class_names[label]
            img_path = split_dir / class_name / f"{idx:05d}.png"
            img.save(img_path)
    
    # Verify
    train_count = len(list((cifar_dir / 'train').rglob('*.png')))
    val_count = len(list((cifar_dir / 'val').rglob('*.png')))
    
    print(f"\n✓ CIFAR-10 prepared:")
    print(f"  Train: {train_count:,} images")
    print(f"  Val: {val_count:,} images")
    print(f"  Location: {cifar_dir}")
    
    return train_count, val_count

# Run preparation
cifar_train, cifar_val = prepare_cifar10()

Downloading CIFAR-10...


100%|█████████████████████████████████████████████████████████████████████████████████| 170M/170M [00:11<00:00, 15.4MB/s]


  Extracting train split...



  Extracting val split...




✓ CIFAR-10 prepared:
  Train: 50,000 images
  Val: 10,000 images
  Location: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10


## 2. ImageNet-mini (Medium Dataset - Preferred)

**Manual Setup Required:**

### For Local:
1. Download ImageNet-mini dataset
2. Extract to `data/raw/imagenet-mini/`
3. Structure should be:
   ```
   imagenet-mini/
     train/
       <class1>/
         img1.JPEG
         img2.JPEG
       <class2>/
         ...
     val/
       <class1>/
         ...
   ```
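
One way to confirm that an extracted copy matches the structure above is to count images per class folder. This is a minimal sketch using only the standard library; `check_imagefolder_layout` and `IMAGE_SUFFIXES` are hypothetical names, not part of the notebook or its utilities, and the accepted extensions are an assumption based on the patterns the notebook counts.

```python
from pathlib import Path

# Assumed image extensions, compared case-insensitively.
IMAGE_SUFFIXES = {'.jpeg', '.jpg', '.png'}


def check_imagefolder_layout(root: Path, splits=('train', 'val')) -> dict:
    """Return {split: {class_name: image_count}} for a <split>/<class>/<image> tree.

    Raises FileNotFoundError if a split directory is missing.
    """
    report = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"Missing split directory: {split_dir}")
        # Count only image files that sit directly inside each class folder.
        report[split] = {
            class_dir.name: sum(
                1 for p in class_dir.iterdir()
                if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
            )
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return report
```

Calling `check_imagefolder_layout(Path('data/raw/imagenet-mini'))` after extraction returns one entry per class; an empty dictionary for a split, or classes with a count of zero, suggests the archive was unpacked one level too deep or too shallow.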

### For Kaggle:
1. Upload ImageNet-mini as a Kaggle Dataset
2. Add it to this notebook via "Add Data"
3. Run the cell below to copy to working directory

In [4]:
def prepare_imagenet_mini():
    """
    Prepare ImageNet-mini dataset.
    """
    imagenet_dir = RAW_DIR / 'imagenet-mini'
    
    # Check if already exists
    if imagenet_dir.exists() and (imagenet_dir / 'train').exists():
        train_count = len(list((imagenet_dir / 'train').rglob('*.JPEG'))) + \
                     len(list((imagenet_dir / 'train').rglob('*.jpg'))) + \
                     len(list((imagenet_dir / 'train').rglob('*.png')))
        val_count = len(list((imagenet_dir / 'val').rglob('*.JPEG'))) + \
                   len(list((imagenet_dir / 'val').rglob('*.jpg'))) + \
                   len(list((imagenet_dir / 'val').rglob('*.png')))
        
        if train_count > 0:
            print(f"✓ ImageNet-mini already prepared:")
            print(f"  Train: {train_count:,} images")
            print(f"  Val: {val_count:,} images")
            print(f"  Location: {imagenet_dir}")
            return train_count, val_count
    
    # For Kaggle: try to copy from input
    if IS_KAGGLE and INPUT_DIR:
        # Look for imagenet-mini in input datasets
        possible_names = ['imagenet-mini', 'imagenet_mini', 'imagenetmini']
        source_dir = None
        
        for name in possible_names:
            candidate = INPUT_DIR / name
            if candidate.exists():
                source_dir = candidate
                break
        
        if source_dir:
            print(f"Copying ImageNet-mini from {source_dir}...")
            shutil.copytree(source_dir, imagenet_dir, dirs_exist_ok=True)
            
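            # Count the same extensions as the "already prepared" check above
            patterns = ('*.JPEG', '*.jpg', '*.png')
            train_count = sum(len(list((imagenet_dir / 'train').rglob(p))) for p in patterns)
            val_count = sum(len(list((imagenet_dir / 'val').rglob(p))) for p in patterns)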
            
            print(f"\n✓ ImageNet-mini copied:")
            print(f"  Train: {train_count:,} images")
            print(f"  Val: {val_count:,} images")
            return train_count, val_count
    
    print("⚠ ImageNet-mini not found")
    print("  Please follow manual setup instructions above")
    return 0, 0

# Try to prepare ImageNet-mini
imagenet_train, imagenet_val = prepare_imagenet_mini()

✓ ImageNet-mini already prepared:
  Train: 34,745 images
  Val: 3,923 images
  Location: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini


## 3. Tiny-ImageNet-200 (Medium Dataset - Fallback)

If ImageNet-mini is not available, use Tiny-ImageNet-200 as a fallback. The cell below is commented out; uncomment it to enable the fallback download.

In [5]:
# def download_file(url, dest_path):
#     """
#     Download file with progress bar.
#     """
#     dest_path = Path(dest_path)
#     dest_path.parent.mkdir(parents=True, exist_ok=True)
    
#     class DownloadProgressBar(tqdm):
#         def update_to(self, b=1, bsize=1, tsize=None):
#             if tsize is not None:
#                 self.total = tsize
#             self.update(b * bsize - self.n)
    
#     with DownloadProgressBar(unit='B', unit_scale=True, miniters=1, desc=dest_path.name) as t:
#         urllib.request.urlretrieve(url, dest_path, reporthook=t.update_to)


# def prepare_tiny_imagenet():
#     """
#     Download and prepare Tiny-ImageNet-200 dataset.
#     """
#     tiny_dir = RAW_DIR / 'tiny-imagenet-200'
    
#     # Check if already exists
#     if tiny_dir.exists() and (tiny_dir / 'train').exists():
#         train_count = len(list((tiny_dir / 'train').rglob('*.JPEG')))
#         val_count = len(list((tiny_dir / 'val').rglob('*.JPEG')))
        
#         if train_count > 0:
#             print(f"✓ Tiny-ImageNet-200 already prepared:")
#             print(f"  Train: {train_count:,} images")
#             print(f"  Val: {val_count:,} images")
#             print(f"  Location: {tiny_dir}")
#             return train_count, val_count
    
#     print("Downloading Tiny-ImageNet-200...")
    
#     # Download
#     url = 'http://cs231n.stanford.edu/tiny-imagenet-200.zip'
#     zip_path = RAW_DIR / 'tiny-imagenet-200.zip'
    
#     if not zip_path.exists():
#         download_file(url, zip_path)
    
#     # Extract
#     print("Extracting...")
#     with zipfile.ZipFile(zip_path, 'r') as zip_ref:
#         zip_ref.extractall(RAW_DIR)
    
#     # Reorganize validation set (it has a different structure)
#     val_dir = tiny_dir / 'val'
#     if (val_dir / 'images').exists():
#         print("Reorganizing validation set...")
        
#         # Read val annotations
#         val_annotations = {}
#         with open(val_dir / 'val_annotations.txt', 'r') as f:
#             for line in f:
#                 parts = line.strip().split('\t')
#                 if len(parts) >= 2:
#                     val_annotations[parts[0]] = parts[1]
        
#         # Move images to class folders
#         images_dir = val_dir / 'images'
#         for img_file in images_dir.glob('*.JPEG'):
#             if img_file.name in val_annotations:
#                 class_name = val_annotations[img_file.name]
#                 class_dir = val_dir / class_name
#                 class_dir.mkdir(exist_ok=True)
#                 shutil.move(str(img_file), str(class_dir / img_file.name))
        
#         # Remove old images directory
#         if images_dir.exists():
#             shutil.rmtree(images_dir)
    
#     # Clean up zip
#     if zip_path.exists():
#         zip_path.unlink()
    
#     # Count images
#     train_count = len(list((tiny_dir / 'train').rglob('*.JPEG')))
#     val_count = len(list((tiny_dir / 'val').rglob('*.JPEG')))
    
#     print(f"\n✓ Tiny-ImageNet-200 prepared:")
#     print(f"  Train: {train_count:,} images")
#     print(f"  Val: {val_count:,} images")
#     print(f"  Location: {tiny_dir}")
    
#     return train_count, val_count

# # Prepare Tiny-ImageNet if ImageNet-mini not available
# if imagenet_train == 0:
#     print("\nImageNet-mini not available, preparing Tiny-ImageNet-200 as fallback...\n")
#     tiny_train, tiny_val = prepare_tiny_imagenet()
# else:
#     print("\nImageNet-mini available, skipping Tiny-ImageNet-200")
#     tiny_train, tiny_val = 0, 0

## 4. Dataset Summary

In [6]:
def analyze_dataset(dataset_path: Path, name: str):
    """
    Analyze dataset structure and statistics.
    """
    if not dataset_path.exists():
        return None
    
    stats = {
        'name': name,
        'path': str(dataset_path),
    }
    
    for split in ['train', 'val']:
        split_dir = dataset_path / split
        if not split_dir.exists():
            continue
        
        # Count images
        image_files = list(split_dir.rglob('*.JPEG')) + \
                     list(split_dir.rglob('*.jpg')) + \
                     list(split_dir.rglob('*.png'))
        
        # Count classes
        classes = set()
        for img_path in image_files:
            classes.add(img_path.parent.name)
        
        # Sample image sizes
        sizes = []
        for img_path in image_files[:100]:  # Sample first 100
            try:
                with Image.open(img_path) as img:
                    sizes.append(img.size)
            except Exception:
                # Skip unreadable or corrupt images when sampling sizes
                pass
        
        stats[f'{split}_images'] = len(image_files)
        stats[f'{split}_classes'] = len(classes)
        if sizes:
            avg_width = sum(s[0] for s in sizes) / len(sizes)
            avg_height = sum(s[1] for s in sizes) / len(sizes)
            stats[f'{split}_avg_size'] = f"{avg_width:.0f}x{avg_height:.0f}"
    
    return stats


print("\n" + "="*60)
print("DATASET SUMMARY")
print("="*60 + "\n")

datasets_info = []

# CIFAR-10
cifar_stats = analyze_dataset(RAW_DIR / 'cifar10', 'CIFAR-10')
if cifar_stats:
    datasets_info.append(cifar_stats)
    print(f"CIFAR-10:")
    print(f"  Train: {cifar_stats.get('train_images', 0):,} images, {cifar_stats.get('train_classes', 0)} classes")
    print(f"  Val: {cifar_stats.get('val_images', 0):,} images, {cifar_stats.get('val_classes', 0)} classes")
    print(f"  Avg size: {cifar_stats.get('train_avg_size', 'N/A')}")
    print()

# ImageNet-mini
imagenet_stats = analyze_dataset(RAW_DIR / 'imagenet-mini', 'ImageNet-mini')
if imagenet_stats and imagenet_stats.get('train_images', 0) > 0:
    datasets_info.append(imagenet_stats)
    print(f"ImageNet-mini:")
    print(f"  Train: {imagenet_stats.get('train_images', 0):,} images, {imagenet_stats.get('train_classes', 0)} classes")
    print(f"  Val: {imagenet_stats.get('val_images', 0):,} images, {imagenet_stats.get('val_classes', 0)} classes")
    print(f"  Avg size: {imagenet_stats.get('train_avg_size', 'N/A')}")
    print()

# Tiny-ImageNet-200
# tiny_stats = analyze_dataset(RAW_DIR / 'tiny-imagenet-200', 'Tiny-ImageNet-200')
# if tiny_stats and tiny_stats.get('train_images', 0) > 0:
#     datasets_info.append(tiny_stats)
#     print(f"Tiny-ImageNet-200:")
#     print(f"  Train: {tiny_stats.get('train_images', 0):,} images, {tiny_stats.get('train_classes', 0)} classes")
#     print(f"  Val: {tiny_stats.get('val_images', 0):,} images, {tiny_stats.get('val_classes', 0)} classes")
#     print(f"  Avg size: {tiny_stats.get('train_avg_size', 'N/A')}")
#     print()

print("="*60)
print(f"\n✓ {len(datasets_info)} dataset(s) prepared and ready for format building")


DATASET SUMMARY

CIFAR-10:
  Train: 50,000 images, 10 classes
  Val: 10,000 images, 10 classes
  Avg size: 32x32

ImageNet-mini:
  Train: 34,745 images, 1000 classes
  Val: 3,923 images, 1000 classes
  Avg size: 494x378


✓ 2 dataset(s) prepared and ready for format building


## 5. Acceptance Checks

In [9]:
print("\nRunning acceptance checks...\n")

checks_passed = 0
checks_total = 0

# Check 1: CIFAR-10 exists and has correct counts
checks_total += 1
if cifar_train >= 50000 and cifar_val >= 10000:
    print("✓ CIFAR-10: Correct image counts")
    checks_passed += 1
else:
    print(f"✗ CIFAR-10: Expected ≥50k train, ≥10k val, got {cifar_train}, {cifar_val}")

# Check 2: At least one medium dataset available
checks_total += 1
medium_available = (imagenet_train >= 30000) #or (tiny_train >= 50000)
if medium_available:
    print("✓ Medium dataset: Available with ≥30k training images")
    checks_passed += 1
else:
    print("✗ Medium dataset: No dataset with ≥30k training images found")

# Check 3: Directory structure
checks_total += 1
required_dirs = [
    RAW_DIR / 'cifar10/train',
    RAW_DIR / 'cifar10/val',
]
all_dirs_exist = all(d.exists() for d in required_dirs)
if all_dirs_exist:
    print("✓ Directory structure: All required directories exist")
    checks_passed += 1
else:
    print("✗ Directory structure: Some required directories missing")

# Summary
print(f"\n{'='*60}")
print(f"Acceptance checks: {checks_passed}/{checks_total} passed")
print(f"{'='*60}")

if checks_passed == checks_total:
    print("\n✅ All checks passed! Ready to proceed with format building.")
else:
    print("\n⚠ Some checks failed. Please review and fix issues before proceeding.")


Running acceptance checks...

✓ CIFAR-10: Correct image counts
✓ Medium dataset: Available with ≥30k training images
✓ Directory structure: All required directories exist

Acceptance checks: 3/3 passed

✅ All checks passed! Ready to proceed with format building.


## ✅ Dataset Preparation Complete

**Next Steps:**
1. Run format builder notebooks (02-05) to convert datasets
2. Each builder will read from `data/raw/<dataset>/`
3. Converted formats will be written to `data/built/<dataset>/<format>/`
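
The path convention in the steps above can be sketched as two small helpers. Both function names are hypothetical, and `some-format` is a placeholder rather than an actual format name from the builder notebooks.

```python
from pathlib import Path


def raw_dir(base_dir: Path, dataset: str) -> Path:
    """Hypothetical helper: input directory written by this notebook."""
    return base_dir / 'data' / 'raw' / dataset


def built_dir(base_dir: Path, dataset: str, fmt: str) -> Path:
    """Hypothetical helper: output directory for one dataset/format pair."""
    return base_dir / 'data' / 'built' / dataset / fmt


base = Path('format-matters')
print(raw_dir(base, 'cifar10').as_posix())
# → format-matters/data/raw/cifar10
print(built_dir(base, 'cifar10', 'some-format').as_posix())
# → format-matters/data/built/cifar10/some-format
```

Keeping raw and built data in sibling trees means a builder can be rerun without touching the extracted source images.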

**Available Datasets:**
- CIFAR-10 (small, always available)
- ImageNet-mini or Tiny-ImageNet-200 (medium, for realistic experiments)