# 11 - CSV DataLoader

This notebook implements a PyTorch DataLoader for the CSV manifest format.

**Format:** CSV manifests with one row per image (`path` and `label` columns)
- Reads the manifest from `train.csv` / `val.csv`
- Loads each image from its own file on disk
- Applies `STANDARD_TRANSFORM` unless a custom transform is passed
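
As a rough illustration of the manifest layout, the sketch below writes a tiny CSV with the `path` and `label` columns that `CSVDataset` requires and reads it back. It uses the standard-library `csv` module instead of pandas to stay dependency-free, and the helper names, file names, and label values are made up for the example.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED_COLS = ["path", "label"]  # the columns CSVDataset validates

def write_manifest(csv_path, rows):
    """Write (path, label) pairs as a manifest with a header row."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(REQUIRED_COLS)
        writer.writerows(rows)

def read_manifest(csv_path):
    """Read a manifest, returning parallel lists of paths and integer labels."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED_COLS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"CSV is missing columns: {missing}")
        rows = list(reader)
    paths = [Path(r["path"]) for r in rows]
    labels = [int(r["label"]) for r in rows]
    return paths, labels

# Hypothetical example rows; real manifests come from the build notebook
example_rows = [("images/cat_0001.png", 0), ("images/dog_0001.png", 1)]
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "train.csv"
    write_manifest(manifest, example_rows)
    paths, labels = read_manifest(manifest)
print(paths, labels)
```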

**Usage in other notebooks:**
```python
%run ./11_loader_csv.ipynb
loader = make_dataloader('cifar10', 'train', batch_size=64, num_workers=4)
```

In [1]:
import os
from pathlib import Path
from typing import Optional

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## CSV Dataset Class

In [2]:
class CSVDataset(Dataset):
    """
    PyTorch Dataset for CSV manifest format.
    
    Args:
        csv_path: Path to CSV file (train.csv or val.csv)
        transform: Torchvision transforms to apply
    """
    
    def __init__(self, csv_path: Path, transform=None):
        self.csv_path = Path(csv_path)
        self.transform = transform
        
        # Read CSV
        self.df = pd.read_csv(csv_path)
        
        # Validate columns
        required_cols = ['path', 'label']
        missing_cols = [col for col in required_cols if col not in self.df.columns]
        if missing_cols:
            raise ValueError(f"CSV must contain columns {required_cols}; missing: {missing_cols}")
        
        # Convert paths to Path objects
        self.paths = [Path(p) for p in self.df['path'].tolist()]
        self.labels = self.df['label'].tolist()
        
        # Use self.csv_path (always a Path) so a str argument does not break .name
        print(f"Loaded CSV dataset: {len(self)} samples from {self.csv_path.name}")
    
    def __len__(self):
        return len(self.paths)
    
    def __getitem__(self, idx):
        # Log progress for every 10th sample (noisy; mainly useful for debugging)
        if idx % 10 == 0:
            print(f"Loading sample {idx}/{len(self.paths)}")
    
        # Load image
        img_path = self.paths[idx]
        try:
            image = Image.open(img_path).convert('RGB')
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise RuntimeError(f"Failed to load image {img_path}: {e}") from e
    
        if self.transform:
            image = self.transform(image)
    
        label = self.labels[idx]
        return image, label
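
Because `CSVDataset` opens images lazily in `__getitem__`, a missing or misnamed file only surfaces when its index is reached, possibly deep into an epoch. One way to catch this earlier is a pre-flight check over the manifest paths. The helper below, `find_missing_files`, is a hypothetical addition and not part of this notebook; it assumes manifest paths are either absolute or relative to a caller-supplied `root`.

```python
import tempfile
from pathlib import Path

def find_missing_files(paths, root=None):
    """Return the subset of `paths` that do not exist as files on disk."""
    missing = []
    for p in paths:
        p = Path(p)
        # Resolve relative entries against `root` when one is given
        full = p if p.is_absolute() or root is None else Path(root) / p
        if not full.is_file():
            missing.append(p)
    return missing

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.png").write_bytes(b"")  # placeholder file, not a real image
    missing = find_missing_files(["a.png", "b.png"], root=tmp)
print(missing)
```

With a `CSVDataset` instance `ds`, calling `find_missing_files(ds.paths)` before training would apply the same check to the loaded manifest.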

## DataLoader Factory Function

In [3]:
def make_dataloader(
    dataset: str,
    split: str,
    batch_size: int,
    num_workers: int,
    pin_memory: bool = True,
    variant: Optional[str] = None,
    shuffle: bool = True,
    transform=None,
) -> DataLoader:
    """
    Create a DataLoader for CSV manifest format.
    
    Args:
        dataset: Dataset name (e.g., 'cifar10', 'imagenet-mini')
        split: Split name ('train' or 'val')
        batch_size: Batch size
        num_workers: Number of data loading workers
        pin_memory: Whether to pin memory for faster GPU transfer
        variant: Format variant (unused for CSV, kept for API consistency)
        shuffle: Whether to shuffle data
        transform: Custom transform (uses STANDARD_TRANSFORM if None)
    
    Returns:
        PyTorch DataLoader
    """
    # Detect environment
    IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
    BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()
    
    # Build path to CSV file
    csv_path = BASE_DIR / 'data' / 'built' / dataset / 'csv' / 'default' / f'{split}.csv'
    
    if not csv_path.exists():
        raise FileNotFoundError(f"CSV file not found: {csv_path}")
    
    # Use standard transform if none provided
    if transform is None:
        transform = STANDARD_TRANSFORM
    
    # Create dataset
    dataset_obj = CSVDataset(csv_path, transform=transform)
    
    # Create dataloader
    dataloader = DataLoader(
        dataset_obj,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        pin_memory=pin_memory,
        drop_last=False,
    )
    
    return dataloader
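
The directory layout that `make_dataloader` expects can be easier to see in isolation. The sketch below rebuilds the same `data/built/<dataset>/csv/default/<split>.csv` path and lists which splits are present; `manifest_path` and `available_splits` are hypothetical helper names used only for illustration, and the base directory is a temporary folder instead of the real project root.

```python
import tempfile
from pathlib import Path

def manifest_path(base_dir, dataset, split):
    """Build <base_dir>/data/built/<dataset>/csv/default/<split>.csv."""
    return Path(base_dir) / "data" / "built" / dataset / "csv" / "default" / f"{split}.csv"

def available_splits(base_dir, dataset, splits=("train", "val")):
    """Return the splits whose manifest file exists under base_dir."""
    return [s for s in splits if manifest_path(base_dir, dataset, s).exists()]

with tempfile.TemporaryDirectory() as tmp:
    # Create only the train manifest so the val split is reported as absent
    target = manifest_path(tmp, "cifar10", "train")
    target.parent.mkdir(parents=True)
    target.write_text("path,label\n")
    found = available_splits(tmp, "cifar10")
print(found)
```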

## Smoke Test

In [4]:
if __name__ == "__main__":
    print("Running CSV DataLoader smoke test...\n")
    
    # Detect environment
    IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
    BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()
    BUILT_DIR = BASE_DIR / 'data' / 'built'
    
    # Find available datasets
    available_datasets = []
    for dataset_name in ['cifar10', 'imagenet-mini']:
        csv_dir = BUILT_DIR / dataset_name / 'csv' / 'default'
        if csv_dir.exists() and (csv_dir / 'train.csv').exists():
            available_datasets.append(dataset_name)
    
    if not available_datasets:
        print("⚠ No CSV datasets found. Run 02_build_csv_manifest.ipynb first.")
    else:
        # Test with first available dataset
        test_dataset = available_datasets[0]
        print(f"Testing with dataset: {test_dataset}\n")
        
        try:
            # Create dataloader
            loader = make_dataloader(
                dataset=test_dataset,
                split='train',
                batch_size=16,
                num_workers=0,
                pin_memory=False,
                shuffle=False
            )
            
            print("\nDataLoader created:")
            print(f"  Dataset size: {len(loader.dataset):,}")
            print(f"  Batch size: {loader.batch_size}")
            print(f"  Num batches: {len(loader):,}")
            print(f"  Num workers: {loader.num_workers}")
            
            # Load first batch
            print("\nLoading first batch...")
            with Timer("First batch"):
                images, labels = next(iter(loader))
            
            print("\nBatch shapes:")
            print(f"  Images: {images.shape} ({images.dtype})")
            print(f"  Labels: {labels.shape} ({labels.dtype})")
            print(f"  Image range: [{images.min():.3f}, {images.max():.3f}]")
            print(f"  Label range: [{labels.min()}, {labels.max()}]")
            
            # Load a few more batches to test throughput
            print("\nTesting throughput (10 batches)...")
            with Timer("10 batches"):
                for i, (images, labels) in enumerate(loader):
                    if i >= 9:
                        break
            
            print("\n✓ CSV DataLoader smoke test passed!")
            
        except Exception as e:
            print(f"\n✗ Smoke test failed: {e}")
            import traceback
            traceback.print_exc()

Running CSV DataLoader smoke test...

Testing with dataset: cifar10

Loaded CSV dataset: 50000 samples from train.csv

DataLoader created:
  Dataset size: 50,000
  Batch size: 16
  Num batches: 3,125
  Num workers: 0

Loading first batch...
Loading sample 0/50000
Loading sample 10/50000
First batch took 0.10s

Batch shapes:
  Images: torch.Size([16, 3, 224, 224]) (torch.float32)
  Labels: torch.Size([16]) (torch.int64)
  Image range: [-2.118, 2.640]
  Label range: [0, 0]

Testing throughput (10 batches)...
Loading sample 0/50000
Loading sample 10/50000
Loading sample 20/50000
Loading sample 30/50000
Loading sample 40/50000
Loading sample 50/50000
Loading sample 60/50000
Loading sample 70/50000
Loading sample 80/50000
Loading sample 90/50000
Loading sample 100/50000
Loading sample 110/50000
Loading sample 120/50000
Loading sample 130/50000
Loading sample 140/50000
Loading sample 150/50000
10 batches took 0.39s

✓ CSV DataLoader smoke test passed!
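
The timing above can be read as a throughput figure: 10 batches of 16 images in 0.39 s is 160 / 0.39 ≈ 410 images/s. That number includes the per-sample progress prints and was taken with `num_workers=0`, so it is only a rough baseline. A minimal sketch of the same measurement follows; `measure_throughput` is a hypothetical helper, and the stand-in batches are plain lists so the example runs without PyTorch.

```python
import time

def measure_throughput(batches, max_batches=10):
    """Time up to `max_batches` batches; return (samples, seconds, samples_per_sec)."""
    samples = 0
    start = time.perf_counter()
    for i, (images, labels) in enumerate(batches):
        samples += len(labels)  # works for lists and 1-D tensors alike
        if i + 1 >= max_batches:
            break
    seconds = time.perf_counter() - start
    rate = samples / seconds if seconds > 0 else float("inf")
    return samples, seconds, rate

# Stand-in batches: 12 batches of 16 (image, label) placeholders
fake_batches = [([None] * 16, [0] * 16) for _ in range(12)]
samples, seconds, rate = measure_throughput(fake_batches)
print(f"{samples} samples in {seconds:.4f}s")
```

Passing the `loader` from the smoke test in place of `fake_batches` should work the same way, since each batch is an `(images, labels)` pair.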


## ✅ CSV DataLoader Ready

**Usage:**
```python
# In training notebooks
%run ./11_loader_csv.ipynb

train_loader = make_dataloader(
    dataset='cifar10',
    split='train',
    batch_size=64,
    num_workers=4,
    pin_memory=True,
    shuffle=True
)

val_loader = make_dataloader(
    dataset='cifar10',
    split='val',
    batch_size=64,
    num_workers=4,
    pin_memory=True,
    shuffle=False
)
```

**Features:**
- Simple CSV-based manifest
- Loads images from individual files
- Standard PyTorch DataLoader interface
- Configurable workers and batch size

**Next steps:**
1. Create the other format loaders (notebooks 12-14)
2. Run the training experiments (notebooks 20-21)