# 03 - Build WebDataset (TAR Shards)

This notebook builds each dataset in WebDataset format, packing samples into TAR-based shards.

**Format:** TAR archives with optional compression
- Sequential I/O friendly
- Configurable shard sizes
- Optional zstd compression
- Efficient for streaming from object storage

**Variants:**
- Shard sizes: 64MB, 256MB, 1024MB
- Compression: none, zstd

**Output:**
- `data/built/<dataset>/webdataset/<variant>/*.tar[.zst]`
- Build statistics logged to `runs/<session>/builds/summary.csv`
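
As a rough illustration of the shard layout, the sketch below writes two samples into an in-memory TAR using only the standard library, mirroring the `<key>.<ext>` plus `<key>.cls` naming that the builder in this notebook uses. The payload bytes are placeholders rather than real images, and `write_samples` and `group_by_key` are hypothetical helper names, not part of the `webdataset` package.

```python
import io
import tarfile
from collections import defaultdict


def write_samples(samples):
    """Write (key, {ext: bytes}) pairs into an in-memory TAR and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def group_by_key(tar_bytes):
    """Group TAR members that share a basename into one sample dict."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            key, ext = member.name.split(".", 1)
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


# Placeholder payloads stand in for encoded image bytes.
samples = [
    ("00000000", {"png": b"fake-image-0", "cls": b"3"}),
    ("00000001", {"png": b"fake-image-1", "cls": b"7"}),
]
restored = group_by_key(write_samples(samples))
print(sorted(restored))
print(restored["00000001"]["cls"])
```

Because members of one sample sit next to each other in the archive, a reader can assemble samples in a single sequential pass, which is what makes the format friendly to streaming.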

In [1]:
import os
import sys
import time
import json
import tarfile
import io
from pathlib import Path
from collections import defaultdict

import webdataset as wds
from PIL import Image
from tqdm.auto import tqdm

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## Configuration

In [2]:
# Detect environment
IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()

RAW_DIR = BASE_DIR / 'data/raw'
BUILT_DIR = BASE_DIR / 'data/built'

# Create run directory for this session
RUN_DIR = BASE_DIR / 'runs' / time.strftime('%Y%m%d-%H%M%S') / 'builds'
RUN_DIR.mkdir(parents=True, exist_ok=True)

SUMMARY_CSV = RUN_DIR / 'summary.csv'
SUMMARY_CSV.touch(exist_ok=True)

print(f"Environment: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Base directory: {BASE_DIR}")
print(f"Run directory: {RUN_DIR}")
print(f"Summary log: {SUMMARY_CSV}")

Environment: Local
Base directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters
Run directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-133730\builds
Summary log: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-133730\builds\summary.csv


## Build Configuration

Configure which variants to build:

In [3]:
# Shard sizes in MB
SHARD_SIZES = [64, 256, 1024]

# Compression options
COMPRESSIONS = ['none', 'zstd']

# Generate all variants
VARIANTS = []
for shard_mb in SHARD_SIZES:
    for compression in COMPRESSIONS:
        VARIANTS.append({
            'shard_mb': shard_mb,
            'compression': compression,
            'name': f"shard{shard_mb}_{compression}"
        })

print(f"Will build {len(VARIANTS)} variants:")
for v in VARIANTS:
    print(f"  - {v['name']}")

Will build 6 variants:
  - shard64_none
  - shard64_zstd
  - shard256_none
  - shard256_zstd
  - shard1024_none
  - shard1024_zstd
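
The same grid can also be written with `itertools.product`, which preserves the ordering shown above (shard size varies slowest). This is an equivalent formulation offered as a sketch; the lowercase `variants` name is used here only to avoid clobbering the notebook's `VARIANTS`.

```python
from itertools import product

SHARD_SIZES = [64, 256, 1024]
COMPRESSIONS = ["none", "zstd"]

# One dict per (shard size, compression) pair, in the same order as the nested loops.
variants = [
    {"shard_mb": mb, "compression": comp, "name": f"shard{mb}_{comp}"}
    for mb, comp in product(SHARD_SIZES, COMPRESSIONS)
]
names = [v["name"] for v in variants]
print(names)
```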


## WebDataset Builder

In [4]:
def build_webdataset_split(
    dataset_name: str,
    split: str,
    raw_path: Path,
    output_path: Path,
    shard_mb: int,
    compression: str,
    class_to_label: dict,
    image_extensions: list
):
    """
    Build WebDataset shards for a single split.
    
    Args:
        dataset_name: Name of dataset
        split: Split name ('train' or 'val')
        raw_path: Path to raw dataset split directory
        output_path: Path to output directory
        shard_mb: Target shard size in MB
        compression: Compression type ('none' or 'zstd')
        class_to_label: Mapping from class name to label index
        image_extensions: List of image file extensions to include
    
    Returns:
        Dictionary with build statistics
    """
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Find all image files
    image_files = []
    for ext in image_extensions:
        image_files.extend(raw_path.rglob(f'*{ext}'))
    
    # Sort for reproducibility
    image_files = sorted(image_files)
    split_class_count = len({f.parent.name for f in image_files})
    print(f"    Found {len(image_files):,} images in {split_class_count} classes")
    
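    # Determine shard pattern. The '.zst' suffix is passed to the writer as part
    # of the file name; the sizes logged for the 'zstd' and 'none' variants are
    # identical, so check that the shards are actually compressed.
    shard_pattern = "file://" + (output_path / f"{split}-%06d.tar").as_posix()
    if compression == 'zstd':
        shard_pattern += '.zst'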
    
    # Target shard size in bytes
    maxsize = shard_mb * 1024 * 1024
    
    # Create WebDataset writer
    with wds.ShardWriter(shard_pattern, maxsize=maxsize) as sink:
        for idx, img_path in enumerate(tqdm(image_files, desc=f"    Writing {split}")):
            # Read image
            with open(img_path, 'rb') as f:
                img_bytes = f.read()
            
            # Get label
            class_name = img_path.parent.name
            if class_name not in class_to_label:
                raise KeyError(f"Class {class_name} missing from mapping for split {split}")
            label = class_to_label[class_name]
            
            # Determine image extension
            ext = img_path.suffix.lower().lstrip('.')
            if ext == 'jpeg':
                ext = 'jpg'
            
            # Create sample
            sample = {
                '__key__': f"{idx:08d}",
                ext: img_bytes,
                'cls': str(label).encode('utf-8'),  # Store label as text
            }
            
            sink.write(sample)
    
    # Count shards and total size
    shard_files = list(output_path.glob(f"{split}-*.tar*"))
    total_bytes = sum(f.stat().st_size for f in shard_files)
    
    return {
        'items': len(image_files),
        'bytes_on_disk': total_bytes,
        'num_files': len(shard_files),
        'avg_file_size': total_bytes // len(shard_files) if shard_files else 0,
    }


def build_webdataset(
    dataset_name: str,
    raw_path: Path,
    output_path: Path,
    shard_mb: int,
    compression: str,
    variant_name: str
):
    """
    Build WebDataset for a dataset with specific configuration.
    
    Args:
        dataset_name: Name of dataset
        raw_path: Path to raw dataset directory
        output_path: Path to output directory
        shard_mb: Target shard size in MB
        compression: Compression type ('none' or 'zstd')
        variant_name: Variant identifier
    
    Returns:
        Dictionary with build statistics
    """
    print(f"\nBuilding WebDataset for {dataset_name} ({variant_name})...")
    print(f"  Source: {raw_path}")
    print(f"  Output: {output_path}")
    print(f"  Shard size: {shard_mb}MB, Compression: {compression}")
    
    start_time = time.time()
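    # Extensions are matched as written (lowercase). On case-sensitive
    # filesystems, files such as 'IMG.JPG' would not be picked up by rglob.
    image_extensions = {'.jpg', '.jpeg', '.png'}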
    all_class_names = set()
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            continue
        for ext in image_extensions:
            for img_path in split_dir.rglob(f'*{ext}'):
                all_class_names.add(img_path.parent.name)
    class_names = sorted(all_class_names)
    class_to_label = {name: idx for idx, name in enumerate(class_names)}
    print(f"  Total classes across splits: {len(class_names)}")
    
    # Process each split
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            print(f"  ⚠ {split} split not found, skipping")
            continue
        
        print(f"\n  Processing {split} split...")
        
        split_stats = build_webdataset_split(
            dataset_name, split, split_dir, output_path,
            shard_mb, compression, class_to_label, image_extensions
        )
        
        print(f"    ✓ {split}: {split_stats['num_files']} shards, "
              f"{format_bytes(split_stats['bytes_on_disk'])}")
        
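        # Log to summary. Elapsed time is measured from the start of the whole
        # build, so the 'val' row also includes the time spent on 'train'.
        build_time = time.time() - start_time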
        row = {
            'stage': 'build',
            'dataset': dataset_name,
            'format': 'webdataset',
            'variant': variant_name,
            'split': split,
            'items': split_stats['items'],
            'bytes_on_disk': split_stats['bytes_on_disk'],
            'num_files': split_stats['num_files'],
            'avg_file_size': split_stats['avg_file_size'],
            'build_wall_s': build_time,
        }
        append_to_summary(SUMMARY_CSV, row)
    
    build_time = time.time() - start_time
    print(f"\n  ✓ Build completed in {build_time:.2f}s")
    
    return {'dataset': dataset_name, 'variant': variant_name, 'build_time': build_time}
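
The builder relies on `rglob(f'*{ext}')`, whose case handling depends on the filesystem. A minimal sketch of case-insensitive discovery combined with the shared label mapping is shown below. `find_images` and `build_class_mapping` are hypothetical helpers, and the temporary directory tree is made up for the demonstration.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_images(root):
    """Return image files under root, matching extensions case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def build_class_mapping(raw_path, splits=("train", "val")):
    """Map each class directory name to a stable integer label across splits."""
    class_names = set()
    for split in splits:
        split_dir = Path(raw_path) / split
        if split_dir.exists():
            class_names.update(p.parent.name for p in find_images(split_dir))
    return {name: idx for idx, name in enumerate(sorted(class_names))}


# Throwaway tree with mixed-case suffixes and one non-image file.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["train/cat/a.PNG", "train/dog/b.jpg",
                "val/bird/c.JPEG", "val/bird/notes.txt"]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"placeholder")
    mapping = build_class_mapping(tmp)
    n_train = len(find_images(Path(tmp) / "train"))

print(mapping)
print(n_train)
```

Building the mapping from the union of all splits, as the notebook does, keeps label indices consistent even when a class appears in only one split.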


## Build WebDatasets for All Datasets and Variants

In [5]:
# Find available datasets
available_datasets = []
for dataset_name in ['cifar10', 'imagenet-mini']:
    dataset_path = RAW_DIR / dataset_name
    if dataset_path.exists() and (dataset_path / 'train').exists():
        available_datasets.append(dataset_name)

print(f"Found {len(available_datasets)} dataset(s): {', '.join(available_datasets)}")
print("\n" + "="*60)

Found 2 dataset(s): cifar10, imagenet-mini



In [6]:
# Build all variants for all datasets
build_results = []

for dataset_name in available_datasets:
    raw_path = RAW_DIR / dataset_name
    
    for variant in VARIANTS:
        output_path = BUILT_DIR / dataset_name / 'webdataset' / variant['name']
        
        result = build_webdataset(
            dataset_name=dataset_name,
            raw_path=raw_path,
            output_path=output_path,
            shard_mb=variant['shard_mb'],
            compression=variant['compression'],
            variant_name=variant['name']
        )
        build_results.append(result)
        
        print("="*60)

print(f"\n✓ Built {len(build_results)} WebDataset variant(s)")


Building WebDataset for cifar10 (shard64_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\webdataset\shard64_none
  Shard size: 64MB, Compression: none
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard64_none/train-000000.tar 0 0.0 GB 0


    Writing train:   0%|          | 0/50000 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard64_none/train-000001.tar 30063 0.1 GB 30063
    ✓ train: 2 shards, 290.2 MB

  Processing val split...
    Found 10,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard64_none/val-000000.tar 0 0.0 GB 0


    Writing val:   0%|          | 0/10000 [00:00<?, ?it/s]

    ✓ val: 1 shards, 58.1 MB

  ✓ Build completed in 51.76s

Building WebDataset for cifar10 (shard64_zstd)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\webdataset\shard64_zstd
  Shard size: 64MB, Compression: zstd
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard64_zstd/train-000000.tar.zst 0 0.0 GB 0


    Writing train:   0%|          | 0/50000 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard64_zstd/train-000001.tar.zst 30063 0.1 GB 30063
    ✓ train: 2 shards, 290.2 MB

  Processing val split...
    Found 10,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard64_zstd/val-000000.tar.zst 0 0.0 GB 0


    Writing val:   0%|          | 0/10000 [00:00<?, ?it/s]

    ✓ val: 1 shards, 58.1 MB

  ✓ Build completed in 28.56s

Building WebDataset for cifar10 (shard256_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\webdataset\shard256_none
  Shard size: 256MB, Compression: none
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard256_none/train-000000.tar 0 0.0 GB 0


    Writing train:   0%|          | 0/50000 [00:00<?, ?it/s]

    ✓ train: 1 shards, 290.2 MB

  Processing val split...
    Found 10,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard256_none/val-000000.tar 0 0.0 GB 0


    Writing val:   0%|          | 0/10000 [00:00<?, ?it/s]

    ✓ val: 1 shards, 58.1 MB

  ✓ Build completed in 26.95s

Building WebDataset for cifar10 (shard256_zstd)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\webdataset\shard256_zstd
  Shard size: 256MB, Compression: zstd
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard256_zstd/train-000000.tar.zst 0 0.0 GB 0


    Writing train:   0%|          | 0/50000 [00:00<?, ?it/s]

    ✓ train: 1 shards, 290.2 MB

  Processing val split...
    Found 10,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard256_zstd/val-000000.tar.zst 0 0.0 GB 0


    Writing val:   0%|          | 0/10000 [00:00<?, ?it/s]

    ✓ val: 1 shards, 58.1 MB

  ✓ Build completed in 27.12s

Building WebDataset for cifar10 (shard1024_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\webdataset\shard1024_none
  Shard size: 1024MB, Compression: none
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard1024_none/train-000000.tar 0 0.0 GB 0


    Writing train:   0%|          | 0/50000 [00:00<?, ?it/s]

    ✓ train: 1 shards, 290.2 MB

  Processing val split...
    Found 10,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard1024_none/val-000000.tar 0 0.0 GB 0


    Writing val:   0%|          | 0/10000 [00:00<?, ?it/s]

    ✓ val: 1 shards, 58.1 MB

  ✓ Build completed in 26.23s

Building WebDataset for cifar10 (shard1024_zstd)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\webdataset\shard1024_zstd
  Shard size: 1024MB, Compression: zstd
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard1024_zstd/train-000000.tar.zst 0 0.0 GB 0


    Writing train:   0%|          | 0/50000 [00:00<?, ?it/s]

    ✓ train: 1 shards, 290.2 MB

  Processing val split...
    Found 10,000 images in 10 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/cifar10/webdataset/shard1024_zstd/val-000000.tar.zst 0 0.0 GB 0


    Writing val:   0%|          | 0/10000 [00:00<?, ?it/s]

    ✓ val: 1 shards, 58.1 MB

  ✓ Build completed in 26.34s

Building WebDataset for imagenet-mini (shard64_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\webdataset\shard64_none
  Shard size: 64MB, Compression: none
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/train-000000.tar 0 0.0 GB 0


    Writing train:   0%|          | 0/34745 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/train-000001.tar 583 0.1 GB 583
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/train-000002.tar 593 0.1 GB 1176
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/train-000003.tar 454 0.1 GB 1630
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/train-000004.tar 514 0.1 GB 2144
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/train-000005.tar 582 0.1 GB 2726
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/train-000006.tar 565 0.1 G

    Writing val:   0%|          | 0/3923 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/val-000001.tar 458 0.1 GB 458
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/val-000002.tar 518 0.1 GB 976
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/val-000003.tar 521 0.1 GB 1497
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/val-000004.tar 567 0.1 GB 2064
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/val-000005.tar 566 0.1 GB 2630
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_none/val-000006.tar 561 0.1 GB 3191
# writ

    Writing train:   0%|          | 0/34745 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/train-000001.tar.zst 583 0.1 GB 583
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/train-000002.tar.zst 593 0.1 GB 1176
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/train-000003.tar.zst 454 0.1 GB 1630
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/train-000004.tar.zst 514 0.1 GB 2144
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/train-000005.tar.zst 582 0.1 GB 2726
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/train-

    Writing val:   0%|          | 0/3923 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/val-000001.tar.zst 458 0.1 GB 458
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/val-000002.tar.zst 518 0.1 GB 976
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/val-000003.tar.zst 521 0.1 GB 1497
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/val-000004.tar.zst 567 0.1 GB 2064
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/val-000005.tar.zst 566 0.1 GB 2630
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard64_zstd/val-000006.tar.zs

    Writing train:   0%|          | 0/34745 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_none/train-000001.tar 2141 0.3 GB 2141
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_none/train-000002.tar 2166 0.3 GB 4307
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_none/train-000003.tar 2920 0.3 GB 7227
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_none/train-000004.tar 2565 0.3 GB 9792
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_none/train-000005.tar 2211 0.3 GB 12003
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_none/train-000006.

    Writing val:   0%|          | 0/3923 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_none/val-000001.tar 2060 0.3 GB 2060
    ✓ val: 2 shards, 494.4 MB

  ✓ Build completed in 62.47s

Building WebDataset for imagenet-mini (shard256_zstd)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\webdataset\shard256_zstd
  Shard size: 256MB, Compression: zstd
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_zstd/train-000000.tar.zst 0 0.0 GB 0


    Writing train:   0%|          | 0/34745 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_zstd/train-000001.tar.zst 2141 0.3 GB 2141
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_zstd/train-000002.tar.zst 2166 0.3 GB 4307
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_zstd/train-000003.tar.zst 2920 0.3 GB 7227
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_zstd/train-000004.tar.zst 2565 0.3 GB 9792
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_zstd/train-000005.tar.zst 2211 0.3 GB 12003
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard25

    Writing val:   0%|          | 0/3923 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard256_zstd/val-000001.tar.zst 2060 0.3 GB 2060
    ✓ val: 2 shards, 494.4 MB

  ✓ Build completed in 63.55s

Building WebDataset for imagenet-mini (shard1024_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\webdataset\shard1024_none
  Shard size: 1024MB, Compression: none
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_none/train-000000.tar 0 0.0 GB 0


    Writing train:   0%|          | 0/34745 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_none/train-000001.tar 9790 1.1 GB 9790
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_none/train-000002.tar 9810 1.1 GB 19600
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_none/train-000003.tar 10647 1.1 GB 30247
    ✓ train: 4 shards, 3.6 GB

  Processing val split...
    Found 3,923 images in 1000 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_none/val-000000.tar 0 0.0 GB 0


    Writing val:   0%|          | 0/3923 [00:00<?, ?it/s]

    ✓ val: 1 shards, 494.4 MB

  ✓ Build completed in 78.74s

Building WebDataset for imagenet-mini (shard1024_zstd)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\webdataset\shard1024_zstd
  Shard size: 1024MB, Compression: zstd
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_zstd/train-000000.tar.zst 0 0.0 GB 0


    Writing train:   0%|          | 0/34745 [00:00<?, ?it/s]

# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_zstd/train-000001.tar.zst 9790 1.1 GB 9790
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_zstd/train-000002.tar.zst 9810 1.1 GB 19600
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_zstd/train-000003.tar.zst 10647 1.1 GB 30247
    ✓ train: 4 shards, 3.6 GB

  Processing val split...
    Found 3,923 images in 1000 classes
# writing file://C:/Users/arjya/Fall 2025/Systems for ML/Project 1/SML/format-matters/data/built/imagenet-mini/webdataset/shard1024_zstd/val-000000.tar.zst 0 0.0 GB 0


    Writing val:   0%|          | 0/3923 [00:00<?, ?it/s]

    ✓ val: 1 shards, 494.4 MB

  ✓ Build completed in 88.39s

✓ Built 12 WebDataset variant(s)


## Verification

In [7]:
print("\nVerifying WebDataset shards...\n")

for dataset_name in available_datasets:
    print(f"{dataset_name}:")
    
    for variant in VARIANTS:
        wds_dir = BUILT_DIR / dataset_name / 'webdataset' / variant['name']
        
        if not wds_dir.exists():
            print(f"  ✗ {variant['name']}: directory not found")
            continue
        
        # Count shards
        train_shards = list(wds_dir.glob('train-*.tar*'))
        val_shards = list(wds_dir.glob('val-*.tar*'))
        
        if not train_shards:
            print(f"  ✗ {variant['name']}: no train shards found")
            continue
        
        # Calculate sizes
        train_size = sum(f.stat().st_size for f in train_shards)
        val_size = sum(f.stat().st_size for f in val_shards)
        
        print(f"  ✓ {variant['name']}:")
        print(f"      Train: {len(train_shards)} shards, {format_bytes(train_size)}")
        print(f"      Val: {len(val_shards)} shards, {format_bytes(val_size)}")
    
    print()


Verifying WebDataset shards...

cifar10:
  ✓ shard64_none:
      Train: 2 shards, 290.2 MB
      Val: 1 shards, 58.1 MB
  ✓ shard64_zstd:
      Train: 2 shards, 290.2 MB
      Val: 1 shards, 58.1 MB
  ✓ shard256_none:
      Train: 1 shards, 290.2 MB
      Val: 1 shards, 58.1 MB
  ✓ shard256_zstd:
      Train: 1 shards, 290.2 MB
      Val: 1 shards, 58.1 MB
  ✓ shard1024_none:
      Train: 1 shards, 290.2 MB
      Val: 1 shards, 58.1 MB
  ✓ shard1024_zstd:
      Train: 1 shards, 290.2 MB
      Val: 1 shards, 58.1 MB

imagenet-mini:
  ✓ shard64_none:
      Train: 56 shards, 3.6 GB
      Val: 8 shards, 494.5 MB
  ✓ shard64_zstd:
      Train: 56 shards, 3.6 GB
      Val: 8 shards, 494.5 MB
  ✓ shard256_none:
      Train: 14 shards, 3.6 GB
      Val: 2 shards, 494.4 MB
  ✓ shard256_zstd:
      Train: 14 shards, 3.6 GB
      Val: 2 shards, 494.4 MB
  ✓ shard1024_none:
      Train: 4 shards, 3.6 GB
      Val: 1 shards, 494.4 MB
  ✓ shard1024_zstd:
      Train: 4 shards, 3.6 GB
      Val: 1 s
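
The sizes reported for each `*_zstd` variant match the corresponding `*_none` variant, which suggests that the `.zst` suffix alone may not have triggered compression in this run. One way to check is to read the first four bytes of a shard: a regular Zstandard frame begins with the magic bytes `28 B5 2F FD`. The sketch below uses only the standard library; `is_zstd_frame` is a hypothetical helper, and the temporary files stand in for real shards.

```python
import tempfile
from pathlib import Path

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # little-endian encoding of 0xFD2FB528


def is_zstd_frame(path):
    """Return True if the file begins with the Zstandard frame magic number."""
    with open(path, "rb") as f:
        return f.read(4) == ZSTD_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # Only the 4-byte header is realistic here; the rest is filler.
    compressed_like = Path(tmp) / "a.tar.zst"
    compressed_like.write_bytes(ZSTD_MAGIC + b"\x00" * 16)
    # An uncompressed TAR starts with the first member's file name instead.
    plain_tar_like = Path(tmp) / "b.tar.zst"
    plain_tar_like.write_bytes(b"00000000.png" + b"\x00" * 500)
    results = (is_zstd_frame(compressed_like), is_zstd_frame(plain_tar_like))

print(results)
```

Applied to this notebook's output, the helper could be called on any file matched by `wds_dir.glob('train-*.tar.zst')`. A `False` result would mean the shard is a plain TAR despite its name.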

## Sample Data Inspection

In [8]:
# Inspect first shard of first dataset/variant
if available_datasets and VARIANTS:
    dataset_name = available_datasets[0]
    variant = VARIANTS[0]
    wds_dir = BUILT_DIR / dataset_name / 'webdataset' / variant['name']
    
    train_shards = sorted(wds_dir.glob('train-*.tar*'))
    if train_shards:
        print(f"Inspecting first shard of {dataset_name} ({variant['name']}):\n")
        print(f"Shard: {train_shards[0].name}")
        print(f"Size: {format_bytes(train_shards[0].stat().st_size)}")
        
        # Force WebDataset to interpret path as local file (works on Windows + Linux)
        shard_path = "file://" + train_shards[0].as_posix()

        print("\nFirst 5 samples:")
        dataset = wds.WebDataset(shard_path)

        for i, sample in enumerate(dataset):
            if i >= 5:
                break

            print(f"\nSample {i}:")
            print(f"  Keys: {list(sample.keys())}")
            print(f"  __key__: {sample.get('__key__', 'N/A')}")

            for ext in ['jpg', 'jpeg', 'png']:
                if ext in sample:
                    img_bytes = sample[ext]
                    print(f"  Image ({ext}): {len(img_bytes)} bytes")
                    break

            if 'cls' in sample:
                label = sample['cls'].decode('utf-8') if isinstance(sample['cls'], bytes) else sample['cls']
                print(f"  Label: {label}")


Inspecting first shard of cifar10 (shard64_none):

Shard: train-000000.tar
Size: 173.7 MB

First 5 samples:

Sample 0:
  Keys: ['__key__', '__url__', 'cls', 'png']
  __key__: 00000000
  Image (png): 2063 bytes
  Label: 0

Sample 1:
  Keys: ['__key__', '__url__', 'cls', 'png']
  __key__: 00000001
  Image (png): 2169 bytes
  Label: 0

Sample 2:
  Keys: ['__key__', '__url__', 'cls', 'png']
  __key__: 00000002
  Image (png): 2237 bytes
  Label: 0

Sample 3:
  Keys: ['__key__', '__url__', 'cls', 'png']
  __key__: 00000003
  Image (png): 2104 bytes
  Label: 0

Sample 4:
  Keys: ['__key__', '__url__', 'cls', 'png']
  __key__: 00000004
  Image (png): 2181 bytes
  Label: 0
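
The keys printed above (`__key__`, `cls`, `png`) reflect the WebDataset convention that TAR members sharing a basename form one sample, with the file extension as the field name. The following is a minimal sketch of that layout using only the standard-library `tarfile` module. It is not the notebook's builder: `write_shard`, `read_shard`, and the placeholder payloads are hypothetical names introduced for illustration.

```python
import io
import tarfile
from collections import defaultdict


def write_shard(samples):
    """Write (key, png_bytes, label) tuples into an in-memory TAR shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, png_bytes, label in samples:
            fields = (("png", png_bytes), ("cls", str(label).encode("utf-8")))
            for ext, payload in fields:
                # One TAR member per field, named <key>.<ext>
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(shard_bytes):
    """Group TAR members by basename so each key maps to {extension: bytes}."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


# Placeholder payloads stand in for real PNG bytes.
shard = write_shard([("00000000", b"fake-png-0", 0), ("00000001", b"fake-png-1", 3)])
samples = read_shard(shard)
for key, fields in samples.items():
    print(key, sorted(fields), int(fields["cls"].decode("utf-8")))
```

Because members are written and read strictly in order, a reader can reconstruct samples in one sequential pass, which is what makes the format friendly to streaming.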


## Build Summary

In [9]:
# Read and display summary
if SUMMARY_CSV.exists() and SUMMARY_CSV.stat().st_size > 0:
    import pandas as pd

    cols = [
        "stage", "dataset", "format", "variant", "split",
        "items", "bytes_on_disk", "num_files", "avg_file_size",
        "build_wall_s", "timestamp"
    ]

    summary_df = pd.read_csv(SUMMARY_CSV, names=cols, header=None)

    # With header=None, any header row in the file is read as data; drop it
    # so it does not show up as a bogus "dataset / variant / split" entry.
    summary_df = summary_df[summary_df["dataset"] != "dataset"].copy()

    # Convert all numeric columns safely
    numeric_cols = ["items", "bytes_on_disk", "num_files", "avg_file_size", "build_wall_s"]
    summary_df[numeric_cols] = summary_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

    print("\nBuild Summary:")
    print("=" * 80)

    for dataset in summary_df["dataset"].dropna().unique():
        print(f"\n{dataset}:")
        dataset_df = summary_df[summary_df["dataset"] == dataset]

        for variant in dataset_df["variant"].dropna().unique():
            print(f"\n  {variant}:")
            variant_df = dataset_df[dataset_df["variant"] == variant]

            for _, row in variant_df.iterrows():
                split = row["split"] if pd.notna(row["split"]) else "Unknown"

                items = f"{int(row['items']):,}" if pd.notna(row["items"]) else "N/A"
                shards = f"{int(row['num_files'])}" if pd.notna(row["num_files"]) else "N/A"
                size = format_bytes(row["bytes_on_disk"]) if pd.notna(row["bytes_on_disk"]) else "N/A"
                avg_size = format_bytes(row["avg_file_size"]) if pd.notna(row["avg_file_size"]) else "N/A"

                print(f"    {split}:")
                print(f"      Items: {items}")
                print(f"      Shards: {shards}")
                print(f"      Size: {size}")
                print(f"      Avg shard: {avg_size}")

    print("\n" + "=" * 80)
    print(f"\nSummary saved to: {SUMMARY_CSV}")
else:
    print("No summary data available")



Build Summary:

cifar10:

  shard64_none:
    train:
      Items: 50,000
      Shards: 2
      Size: 290.2 MB
      Avg shard: 145.1 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 58.1 MB
      Avg shard: 58.1 MB

  shard64_zstd:
    train:
      Items: 50,000
      Shards: 2
      Size: 290.2 MB
      Avg shard: 145.1 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 58.1 MB
      Avg shard: 58.1 MB

  shard256_none:
    train:
      Items: 50,000
      Shards: 1
      Size: 290.2 MB
      Avg shard: 290.2 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 58.1 MB
      Avg shard: 58.1 MB

  shard256_zstd:
    train:
      Items: 50,000
      Shards: 1
      Size: 290.2 MB
      Avg shard: 290.2 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 58.1 MB
      Avg shard: 58.1 MB

  shard1024_none:
    train:
      Items: 50,000
      Shards:

## ✅ WebDataset Build Complete

**What was created:**
- TAR-based shards with configurable sizes
- Multiple variants with different shard sizes and compression
- Sequential I/O friendly format

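**Variants built:**
- `shard64_none`: 64 MB shard-size setting, no compression
- `shard64_zstd`: 64 MB shard-size setting, zstd compression
- `shard256_none`: 256 MB shard-size setting, no compression
- `shard256_zstd`: 256 MB shard-size setting, zstd compression
- `shard1024_none`: 1024 MB shard-size setting, no compression
- `shard1024_zstd`: 1024 MB shard-size setting, zstd compression

These sizes are the configured setting, not measured sizes: the CIFAR-10 summary above reports an average of 145.1 MB per train shard for `shard64_none`.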
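
The six variant names follow a `shard<size>_<compression>` pattern, so the list can be generated rather than typed by hand. The sketch below shows one way to do that. Only the `name` key is known from this notebook (it is read as `variant['name']`); `make_variants`, `shard_size_mb`, and `compression` are hypothetical names used for illustration.

```python
from itertools import product

SHARD_SIZES_MB = [64, 256, 1024]
COMPRESSIONS = ["none", "zstd"]


def make_variants(sizes_mb, compressions):
    """Return one config dict per (shard size, compression) combination."""
    return [
        # 'shard_size_mb' and 'compression' are assumed keys, not the notebook's
        {"name": f"shard{size}_{comp}", "shard_size_mb": size, "compression": comp}
        for size, comp in product(sizes_mb, compressions)
    ]


variants = make_variants(SHARD_SIZES_MB, COMPRESSIONS)
for v in variants:
    print(v["name"])
```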

**Output locations:**
- `data/built/<dataset>/webdataset/<variant>/*.tar[.zst]`
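
To sanity-check a variant directory after a build, it can help to count its shards and sum their sizes on disk. This is a minimal sketch, assuming the directory layout above. `shard_stats` is a hypothetical helper, and the demo uses a temporary directory with dummy files instead of real shards.

```python
import tempfile
from pathlib import Path


def shard_stats(variant_dir):
    """Return (num_shards, total_bytes) for all *.tar* files in a variant directory."""
    shards = sorted(Path(variant_dir).glob("*.tar*"))
    return len(shards), sum(p.stat().st_size for p in shards)


# Demo on a throwaway directory that mimics data/built/<dataset>/webdataset/<variant>/
with tempfile.TemporaryDirectory() as tmp:
    variant_dir = Path(tmp) / "data" / "built" / "demo" / "webdataset" / "shard64_none"
    variant_dir.mkdir(parents=True)
    (variant_dir / "train-000000.tar").write_bytes(b"\0" * 1024)
    (variant_dir / "train-000001.tar").write_bytes(b"\0" * 512)

    num_shards, total_bytes = shard_stats(variant_dir)
    print(f"{num_shards} shards, {total_bytes} bytes, "
          f"{total_bytes / num_shards:.0f} bytes per shard on average")
```

Comparing the average against the configured shard size is a quick way to spot a build whose shards are not being cut where expected.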

**Next steps:**
1. Run `12_loader_webdataset.ipynb` to create the dataloader
2. Or continue with other format builders (04-05)
3. Then run training experiments (20-21)