# 20 - Baseline Training Experiments

This notebook runs baseline training experiments across all data formats to compare:
- Data loading throughput
- Training time per epoch
- Resource utilization (GPU, CPU, disk I/O)
- Memory usage

**Experiment Design:**
- Simple ResNet-18 model
- Fixed hyperparameters across all formats
- 3 epochs for quick comparison
- Same batch size and workers
- Alternating run order to control for system state
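
One way to realize an alternating run order is to rotate the format list between repetitions so that no format always runs first. The sketch below is an illustration under that assumption, not this notebook's code: `rotated_orders` is a hypothetical helper, and the run loop in this notebook executes `FORMATS` once in its listed order.

```python
from collections import deque

def rotated_orders(formats, repetitions):
    """Yield one run order per repetition, shifting the starting format each time."""
    order = deque(formats)
    for _ in range(repetitions):
        yield list(order)
        order.rotate(-1)  # move the current first format to the end

formats = ["webdataset", "csv", "tfrecord", "lmdb"]
orders = list(rotated_orders(formats, repetitions=4))
for run_order in orders:
    print(run_order)
```

With as many repetitions as formats, each format runs first exactly once, which spreads warm-up and cache effects across formats instead of charging them to a single one.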

**Formats Tested:**
1. CSV (baseline)
2. WebDataset (shard256_none)
3. TFRecord (shard256_none)
4. LMDB (compress_none)

**Output:**
- Training metrics logged to `runs/<session>/train_baselines/summary.csv`
- Resource monitoring logs per experiment

In [1]:
import os
import sys
import time
import json
from pathlib import Path
from collections import defaultdict

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models
from tqdm.auto import tqdm

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## Configuration

In [2]:
# Detect environment
IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()

# Create run directory for this session
RUN_DIR = BASE_DIR / 'runs' / time.strftime('%Y%m%d-%H%M%S') / 'train_baselines'
RUN_DIR.mkdir(parents=True, exist_ok=True)

SUMMARY_CSV = RUN_DIR / 'summary.csv'
SUMMARY_CSV.touch(exist_ok=True)

print(f"Environment: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Base directory: {BASE_DIR}")
print(f"Run directory: {RUN_DIR}")
print(f"Summary log: {SUMMARY_CSV}")

# Write system info
write_sysinfo(RUN_DIR / 'sysinfo.json')

Environment: Local
Base directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters
Run directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-145553\train_baselines
Summary log: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-145553\train_baselines\summary.csv
✓ System info written to C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-145553\train_baselines\sysinfo.json


## Training Configuration

In [3]:
# Training hyperparameters
CONFIG = {
    'batch_size': 64,
    'num_workers': 0,
    'num_epochs': 3,
    'learning_rate': 0.001,
    'momentum': 0.9,
    'weight_decay': 1e-4,
    'seed': 42,
}

# Formats to test (format_name, loader_notebook, variant)
FORMATS = [
    ('webdataset', '12_loader_webdataset.ipynb', 'shard256_none'),
    ('csv', '11_loader_csv.ipynb', 'default'),
    ('tfrecord', '13_loader_tfrecord.ipynb', 'shard256_none'),
    ('lmdb', '14_loader_lmdb.ipynb', 'compress_none'),
]

# Dataset to use (first available)
BUILT_DIR = BASE_DIR / 'data' / 'built'
DATASET = None
for ds in ['cifar10', 'imagenet-mini', 'tiny-imagenet-200']:
    if (BUILT_DIR / ds).exists():
        DATASET = ds
        break

if DATASET is None:
    raise RuntimeError("No datasets found. Run dataset preparation notebooks first.")

print(f"\nTraining Configuration:")
print(f"  Dataset: {DATASET}")
print(f"  Batch size: {CONFIG['batch_size']}")
print(f"  Num workers: {CONFIG['num_workers']}")
print(f"  Num epochs: {CONFIG['num_epochs']}")
print(f"  Learning rate: {CONFIG['learning_rate']}")
print(f"\nFormats to test: {len(FORMATS)}")
for fmt_name, _, variant in FORMATS:
    print(f"  - {fmt_name} ({variant})")


Training Configuration:
  Dataset: cifar10
  Batch size: 64
  Num workers: 0
  Num epochs: 3
  Learning rate: 0.001

Formats to test: 4
  - webdataset (shard256_none)
  - csv (default)
  - tfrecord (shard256_none)
  - lmdb (compress_none)
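
As a quick check of what this configuration implies, the number of batches per epoch follows from the dataset size and the batch size. The sketch assumes the CIFAR-10 split sizes reported in the loader output (50,000 training and 10,000 validation samples) and that the last partial batch is kept rather than dropped.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)

train_batches = batches_per_epoch(50_000, 64)
val_batches = batches_per_epoch(10_000, 64)
print(train_batches, val_batches)  # prints: 782 157
```

These values match the `0/782` and `0/157` totals shown in the progress bars of the loaders that report a length.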


## Model Definition

In [4]:
def create_model(num_classes: int, device: torch.device):
    """
    Create a ResNet-18 model for training.
    
    Args:
        num_classes: Number of output classes
        device: Device to place model on
    
    Returns:
        PyTorch model
    """
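    # Random initialization; `weights=None` replaces the deprecated `pretrained=False` (torchvision >= 0.13)
    model = models.resnet18(weights=None)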
    
    # Modify final layer for num_classes
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    
    model = model.to(device)
    
    print(f"Model: ResNet-18")
    print(f"Parameters: {count_parameters(model):,}")
    print(f"Device: {device}")
    
    return model
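
The parameter count printed for this model can be checked by hand. The sketch assumes the standard torchvision ResNet-18, which has 11,689,512 parameters with its 1000-class head and a 512-dimensional input to `fc`; replacing the head changes only the weights and biases of that final layer.

```python
RESNET18_PARAMS_1000_CLASSES = 11_689_512  # standard torchvision ResNet-18 (assumed)
FC_IN_FEATURES = 512                       # input width of the final linear layer

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def resnet18_params(num_classes):
    """Swap the 1000-class head for a num_classes head and recount."""
    backbone = RESNET18_PARAMS_1000_CLASSES - linear_params(FC_IN_FEATURES, 1000)
    return backbone + linear_params(FC_IN_FEATURES, num_classes)

cifar10_params = resnet18_params(10)
print(f"{cifar10_params:,}")  # prints: 11,181,642
```

This agrees with the `Parameters: 11,181,642` line in the experiment output for CIFAR-10.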

## Training Loop

In [5]:
def train_epoch(model, dataloader, criterion, optimizer, device, epoch):
    """
    Train for one epoch.
    
    Args:
        model: PyTorch model
        dataloader: Training dataloader
        criterion: Loss function
        optimizer: Optimizer
        device: Device to train on
        epoch: Current epoch number
    
    Returns:
        Dictionary with training metrics
    """
    model.train()
    
    total_loss = 0.0
    correct = 0
    total = 0
    num_batches = 0
    
    start_time = time.time()
    
    pbar = tqdm(dataloader, desc=f"Epoch {epoch}")
    for batch_idx, (images, labels) in enumerate(pbar):
        num_batches += 1
        # Move to device
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        # Statistics
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
        
        # Update progress bar
        pbar.set_postfix({
            'loss': total_loss / num_batches,
            'acc': 100. * correct / total if total > 0 else 0.0
        })
    
    if num_batches == 0:
        raise RuntimeError('No batches were produced by the dataloader.')
    
    epoch_time = time.time() - start_time
    avg_loss = total_loss / num_batches
    accuracy = 100. * correct / total if total > 0 else 0.0
    throughput = total / epoch_time if epoch_time > 0 else 0.0
    
    return {
        'loss': avg_loss,
        'accuracy': accuracy,
        'time': epoch_time,
        'samples_per_sec': throughput,
    }


def validate(model, dataloader, criterion, device):
    """
    Validate the model.
    
    Args:
        model: PyTorch model
        dataloader: Validation dataloader
        criterion: Loss function
        device: Device to validate on
    
    Returns:
        Dictionary with validation metrics
    """
    model.eval()
    
    total_loss = 0.0
    correct = 0
    total = 0
    num_batches = 0
    
    with torch.no_grad():
        for images, labels in tqdm(dataloader, desc="Validation"):
            num_batches += 1
            images = images.to(device)
            labels = labels.to(device)
            
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    if num_batches == 0:
        raise RuntimeError('No batches were produced by the validation dataloader.')
    
    avg_loss = total_loss / num_batches
    accuracy = 100. * correct / total if total > 0 else 0.0
    
    return {
        'loss': avg_loss,
        'accuracy': accuracy,
    }
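
The `samples_per_sec` value returned by `train_epoch` is the number of samples seen divided by the wall-clock time of the training pass, so validation time is not included. As a worked check with the timings from the WebDataset run in this notebook's output, and assuming all 50,000 CIFAR-10 training samples were seen in the epoch:

```python
def samples_per_sec(total_samples, epoch_time_s):
    """Same formula as in train_epoch: samples divided by training wall time."""
    return total_samples / epoch_time_s if epoch_time_s > 0 else 0.0

# First WebDataset epoch: 2405.47 s of training time
throughput = samples_per_sec(50_000, 2405.47)
print(f"{throughput:.1f} samples/s")  # prints: 20.8 samples/s

# Training time over three epochs versus the experiment's total time
train_times = [2405.47, 2475.04, 2444.17]
total_time = 7874.65
non_training_time = total_time - sum(train_times)
print(f"{non_training_time:.2f} s outside the training passes")
```

The remainder of roughly 550 s is time spent outside `train_epoch` within the timed region, which in `run_experiment` is mainly the three validation passes.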


## Experiment Runner

In [6]:
def run_experiment(format_name, loader_notebook, variant, dataset, config, device):
    """
    Run training experiment for a single format.
    
    Args:
        format_name: Name of format (e.g., 'csv', 'webdataset')
        loader_notebook: Path to loader notebook
        variant: Format variant
        dataset: Dataset name
        config: Training configuration dict
        device: Device to train on
    
    Returns:
        Dictionary with experiment results
    """
    print(f"\n{'='*80}")
    print(f"Running experiment: {format_name} ({variant})")
    print(f"{'='*80}\n")
    
    # Set seed for reproducibility
    set_seed(config['seed'])
    
    # Load the appropriate dataloader
    print(f"Loading dataloader from {loader_notebook}...")
    %run ./{loader_notebook}
    
    # Create dataloaders
    print(f"\nCreating dataloaders...")
    train_loader = make_dataloader(
        dataset=dataset,
        split='train',
        batch_size=config['batch_size'],
        num_workers=config['num_workers'],
        variant=variant,
        shuffle=True,
        pin_memory=True
    )
    
    val_loader = make_dataloader(
        dataset=dataset,
        split='val',
        batch_size=config['batch_size'],
        num_workers=config['num_workers'],
        variant=variant,
        shuffle=False,
        pin_memory=True
    )
    
    # Determine number of classes
    if dataset == 'cifar10':
        num_classes = 10
    elif dataset == 'tiny-imagenet-200':
        num_classes = 200
    else:
        num_classes = 1000  # imagenet-mini
    
    # Create model
    print(f"\nCreating model...")
    model = create_model(num_classes, device)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(
        model.parameters(),
        lr=config['learning_rate'],
        momentum=config['momentum'],
        weight_decay=config['weight_decay']
    )
    
    # Start resource monitoring
    log_path = RUN_DIR / f"{format_name}_{variant}_metrics.csv"
    monitor_thread, stop_event = start_monitor(log_path, interval=0.5)
    
    # Training loop
    print(f"\nStarting training for {config['num_epochs']} epochs...\n")
    
    experiment_start = time.time()
    epoch_results = []
    
    for epoch in range(1, config['num_epochs'] + 1):
        # Train
        train_metrics = train_epoch(
            model, train_loader, criterion, optimizer, device, epoch
        )
        
        # Validate
        val_metrics = validate(model, val_loader, criterion, device)
        
        # Log results
        epoch_result = {
            'epoch': epoch,
            'train_loss': train_metrics['loss'],
            'train_acc': train_metrics['accuracy'],
            'train_time': train_metrics['time'],
            'train_samples_per_sec': train_metrics['samples_per_sec'],
            'val_loss': val_metrics['loss'],
            'val_acc': val_metrics['accuracy'],
        }
        epoch_results.append(epoch_result)
        
        print(f"\nEpoch {epoch} Summary:")
        print(f"  Train Loss: {train_metrics['loss']:.4f}, Acc: {train_metrics['accuracy']:.2f}%")
        print(f"  Val Loss: {val_metrics['loss']:.4f}, Acc: {val_metrics['accuracy']:.2f}%")
        print(f"  Time: {train_metrics['time']:.2f}s, Throughput: {train_metrics['samples_per_sec']:.1f} samples/s")
    
    total_time = time.time() - experiment_start
    
    # Stop resource monitoring
    stop_monitor(monitor_thread, stop_event)
    
    # Compute resource metrics
    resource_metrics = compute_metrics_from_logs(log_path)
    
    # Compile results
    results = {
        'format': format_name,
        'variant': variant,
        'dataset': dataset,
        'total_time': total_time,
        'epochs': epoch_results,
        'resources': resource_metrics,
    }
    
    # Log to summary CSV
    for epoch_result in epoch_results:
        row = {
            'stage': 'train',
            'format': format_name,
            'variant': variant,
            'dataset': dataset,
            **epoch_result,
            **resource_metrics,
        }
        append_to_summary(SUMMARY_CSV, row)
    
    print(f"\n✓ Experiment completed in {total_time:.2f}s")
    print(f"  Final train acc: {epoch_results[-1]['train_acc']:.2f}%")
    print(f"  Final val acc: {epoch_results[-1]['val_acc']:.2f}%")
    
    return results

## Run All Experiments

In [7]:
# Get device
device = get_device()
print(f"Using device: {device}\n")

# Run experiments
all_results = []

for format_name, loader_notebook, variant in FORMATS:
    try:
        results = run_experiment(
            format_name=format_name,
            loader_notebook=loader_notebook,
            variant=variant,
            dataset=DATASET,
            config=CONFIG,
            device=device
        )
        all_results.append(results)
        
        # Small delay between experiments
        time.sleep(5)
        
    except Exception as e:
        print(f"\n✗ Experiment failed for {format_name}: {e}")
        import traceback
        traceback.print_exc()
        continue

print(f"\n{'='*80}")
print(f"All experiments completed!")
print(f"{'='*80}")

Using device: cpu


Running experiment: webdataset (shard256_none)

✓ Random seed set to 42
Loading dataloader from 12_loader_webdataset.ipynb...
Running WebDataset DataLoader smoke test...

Testing with dataset: cifar10, variant: shard1024_none

Found 1 shard(s) for cifar10/train (shard1024_none)
Example files: ['train-000000.tar']

DataLoader created:
  Batch size: 32
  Num workers: 0
  Variant: shard1024_none
[('cifar10', 'shard1024_none'), ('cifar10', 'shard1024_zstd'), ('cifar10', 'shard256_none'), ('cifar10', 'shard256_zstd'), ('cifar10', 'shard64_none'), ('



Model: ResNet-18
Parameters: 11,181,642
Device: cpu
✓ Resource monitoring started (logging to C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-145553\train_baselines\webdataset_shard256_none_metrics.csv)

Starting training for 3 epochs...



Epoch 1: 0it [00:00, ?it/s]



Validation: 0it [00:00, ?it/s]


Epoch 1 Summary:
  Train Loss: 1.4466, Acc: 49.54%
  Val Loss: 1.8776, Acc: 36.00%
  Time: 2405.47s, Throughput: 20.8 samples/s


Epoch 2: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


Epoch 2 Summary:
  Train Loss: 1.2536, Acc: 58.00%
  Val Loss: 1.6444, Acc: 40.76%
  Time: 2475.04s, Throughput: 20.2 samples/s


Epoch 3: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


Epoch 3 Summary:
  Train Loss: 1.1481, Acc: 61.74%
  Val Loss: 1.5785, Acc: 43.45%
  Time: 2444.17s, Throughput: 20.5 samples/s
✓ Resource monitoring stopped

✓ Experiment completed in 7874.65s
  Final train acc: 61.74%
  Final val acc: 43.45%

Running experiment: csv (default)

✓ Random seed set to 42
Loading dataloader from 11_loader_csv.ipynb...
Running CSV DataLoader smoke test...

Testing with dataset: cifar10

Loaded CSV dataset: 50000 samples from train.csv

DataLoader created:
  Dataset size: 50,000
  Batch size: 16
  Num batches: 3,125
  Num workers: 0





Epoch 1:   0%|          | 0/782 [00:00<?, ?it/s]

Validation:   0%|          | 0/157 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/782 [00:00<?, ?it/s]

Validation:   0%|          | 0/157 [00:00<?, ?it/s]

Epoch 3:   0%|          | 0/782 [00:00<?, ?it/s]

Validation:   0%|          | 0/157 [00:00<?, ?it/s]




Epoch 1: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


Epoch 1 Summary:
  Train Loss: 1.6814, Acc: 37.66%
  Val Loss: 1.5212, Acc: 43.79%
  Time: 2370.93s, Throughput: 21.1 samples/s


Epoch 2: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


Epoch 2 Summary:
  Train Loss: 1.3170, Acc: 52.35%
  Val Loss: 1.2236, Acc: 55.09%
  Time: 2361.28s, Throughput: 21.2 samples/s


Epoch 3: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


Epoch 3 Summary:
  Train Loss: 1.1288, Acc: 59.04%
  Val Loss: 1.1599, Acc: 58.42%
  Time: 2358.10s, Throughput: 21.2 samples/s
✓ Resource monitoring stopped

✓ Experiment completed in 7625.80s
  Final train acc: 59.04%
  Final val acc: 58.42%

Running experiment: lmdb (compress_none)

✓ Random seed set to 42
Loading dataloader from 14_loader_lmdb.ipynb...
Running LMDB DataLoader smoke test...

Testing with dataset: cifar10, variant: compress_lz4

Loaded LMDB dataset: 50,000 samples from train.lmdb
  Compression: lz4

DataLoader created:
  Dataset size: 50,000
  



Epoch 1:   0%|          | 0/782 [00:00<?, ?it/s]

Validation:   0%|          | 0/157 [00:00<?, ?it/s]


Epoch 1 Summary:
  Train Loss: 1.6875, Acc: 37.38%
  Val Loss: 1.4479, Acc: 46.98%
  Time: 2324.72s, Throughput: 21.5 samples/s


Epoch 2:   0%|          | 0/782 [00:00<?, ?it/s]

Validation:   0%|          | 0/157 [00:00<?, ?it/s]


Epoch 2 Summary:
  Train Loss: 1.3210, Acc: 52.04%
  Val Loss: 1.2647, Acc: 54.49%
  Time: 2315.06s, Throughput: 21.6 samples/s


Epoch 3:   0%|          | 0/782 [00:00<?, ?it/s]

Validation:   0%|          | 0/157 [00:00<?, ?it/s]


Epoch 3 Summary:
  Train Loss: 1.1375, Acc: 59.36%
  Val Loss: 1.1191, Acc: 59.79%
  Time: 2322.17s, Throughput: 21.5 samples/s
✓ Resource monitoring stopped

✓ Experiment completed in 7475.59s
  Final train acc: 59.36%
  Final val acc: 59.79%

All experiments completed!


## Results Summary

In [8]:
import pandas as pd

if SUMMARY_CSV.exists() and SUMMARY_CSV.stat().st_size > 0:
    df = pd.read_csv(SUMMARY_CSV)
    
    print("\n" + "="*80)
    print("BASELINE EXPERIMENTS SUMMARY")
    print("="*80)
    
    # Group by format and show final epoch results
    final_epoch = df[df['epoch'] == CONFIG['num_epochs']]
    
    print(f"\nFinal Epoch ({CONFIG['num_epochs']}) Results:\n")
    print(f"{'Format':<15} {'Variant':<20} {'Train Acc':<12} {'Val Acc':<12} {'Time (s)':<12} {'Throughput':<15}")
    print("-" * 90)
    
    for _, row in final_epoch.iterrows():
        print(f"{row['format']:<15} {row['variant']:<20} "
              f"{row['train_acc']:>10.2f}% {row['val_acc']:>10.2f}% "
              f"{row['train_time']:>10.2f}s {row['train_samples_per_sec']:>12.1f} samp/s")
    
    # Resource utilization summary
    print(f"\n\nResource Utilization (Mean):\n")
    print(f"{'Format':<15} {'Variant':<20} {'GPU %':<10} {'CPU %':<10} {'Disk R (MB/s)':<15} {'Disk W (MB/s)':<15}")
    print("-" * 90)
    
    for _, row in final_epoch.iterrows():
        gpu_util = row.get('gpu_util_mean', 0) or 0
        cpu_util = row.get('cpu_util_mean', 0) or 0
        disk_r = row.get('disk_read_mb_s_mean', 0) or 0
        disk_w = row.get('disk_write_mb_s_mean', 0) or 0
        
        print(f"{row['format']:<15} {row['variant']:<20} "
              f"{gpu_util:>8.1f}% {cpu_util:>8.1f}% "
              f"{disk_r:>13.2f} {disk_w:>13.2f}")
    
    print("\n" + "="*80)
    print(f"\nResults saved to: {SUMMARY_CSV}")
    print(f"Resource logs saved to: {RUN_DIR}")
else:
    print("No results available")


BASELINE EXPERIMENTS SUMMARY

Final Epoch (3) Results:

Format          Variant              Train Acc    Val Acc      Time (s)     Throughput     
------------------------------------------------------------------------------------------
webdataset      shard256_none             61.74%      43.45%    2444.17s         20.5 samp/s
csv             default                   59.87%      61.41%    2349.66s         21.3 samp/s
tfrecord        shard256_none             59.04%      58.42%    2358.10s         21.2 samp/s
lmdb            compress_none             59.36%      59.79%    2322.17s         21.5 samp/s


Resource Utilization (Mean):

Format          Variant              GPU %      CPU %      Disk R (MB/s)   Disk W (MB/s)  
------------------------------------------------------------------------------------------
webdataset      shard256_none             nan%    792.8%          0.13          0.00
csv             default                   nan%    792.5%          0.05          0.00
tfre
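
The `nan%` in the GPU column most likely arises because this run used the CPU, so no GPU utilization was recorded, and an expression such as `nan or 0` evaluates to `nan` in Python (NaN is truthy). A minimal sketch of a fallback that treats missing values as zero follows; `value_or_zero` is a hypothetical helper, not part of the notebook's utilities.

```python
import math

def value_or_zero(value):
    """Return 0.0 for None or NaN, otherwise the value as a float."""
    if value is None:
        return 0.0
    value = float(value)
    return 0.0 if math.isnan(value) else value

missing = float("nan")
naive = missing or 0           # NaN is truthy, so this stays NaN
safe = value_or_zero(missing)  # explicit NaN check falls back to 0.0
print(naive, safe)  # prints: nan 0.0
```

Whether to print `0.0` or leave `nan` visible is a presentation choice: `nan` makes it obvious that no GPU was measured, while `0.0` keeps the table numeric.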

## ✅ Baseline Experiments Complete

**What was measured:**
- Training throughput (samples/second)
- Time per epoch
- Model accuracy (train and validation)
- GPU utilization
- CPU utilization
- Disk I/O (read/write rates)
- Memory usage

**Key Insights:**
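- Training throughput was nearly identical across formats in this run: 20.5 to 21.5 samples/s in the final epoch, or about 2,320 to 2,440 s per epoch.
- The run used the CPU (`Using device: cpu`), so GPU utilization is reported as `nan`. For the rows shown, mean CPU utilization was about 792% and mean disk reads stayed at or below 0.13 MB/s, which suggests that model compute rather than data loading was the bottleneck and limits how much the format choice can affect throughput here.
- WebDataset reached a lower final validation accuracy (43.45%) than the other formats (58.42% to 61.41%) despite a comparable training accuracy; this run does not explain the gap.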
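
To compare throughput across formats, the per-epoch values can be averaged per format. The sketch below hard-codes the samples/s figures printed in the epoch summaries for WebDataset and LMDB instead of reading `summary.csv`, so that it runs on its own; in the notebook the same numbers would come from the `train_samples_per_sec` column.

```python
from statistics import mean

# Per-epoch training throughput (samples/s) copied from the epoch summaries
throughput = {
    "webdataset": [20.8, 20.2, 20.5],
    "lmdb": [21.5, 21.6, 21.5],
}

mean_throughput = {fmt: mean(values) for fmt, values in throughput.items()}
baseline = mean_throughput["webdataset"]

for fmt, value in mean_throughput.items():
    ratio = value / baseline
    print(f"{fmt:<12} {value:6.2f} samples/s  ({ratio:.2f}x webdataset)")
```

Averaging over epochs smooths out per-epoch noise; with only three epochs and a single run per format, small differences like these should be read with caution.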

**Next steps:**
1. Run scaling experiments (21_train_scaling.ipynb)
2. Analyze results (30_analysis_summary.ipynb)
3. Create visualizations (31_analysis_plots.ipynb)
4. Generate decision guide (40_decision_guide.ipynb)