# 02 - Build CSV Manifest (Baseline Format)

This notebook creates CSV manifest files for the baseline format.

**Format:** one CSV file per split with columns `path,label,split`
- Each row lists one image: its absolute path, integer class label, and split name
- Images remain as individual files on disk
- No compression or repacking (canonical baseline)

**Output:**
- `data/built/<dataset>/csv/default/train.csv`
- `data/built/<dataset>/csv/default/val.csv`
- Build statistics logged to `runs/<session>/builds/summary.csv`, where `<session>` is a `YYYYMMDD-HHMMSS` timestamp
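
To make the `path,label,split` layout concrete, here is a small sketch that writes two rows to an in-memory buffer and reads them back. The paths are hypothetical and exist only for illustration. It also shows that `csv.writer` leaves paths containing spaces unquoted and quotes only fields that contain the delimiter, so reading the manifest with a CSV parser rather than `str.split(',')` is the safer choice.

```python
import csv
import io

# Hypothetical example rows (not real files): (path, label, split)
rows = [
    ('/data/raw/demo/train/cat/img 001.png', 0, 'train'),  # space in path
    ('/data/raw/demo/train/dog/img,002.png', 1, 'train'),  # comma in path
]

# newline='' disables newline translation, as the csv module recommends
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['path', 'label', 'split'])
writer.writerows(rows)

manifest_text = buffer.getvalue()
print(manifest_text)

# Read the same text back; DictReader uses the header row for field names
buffer.seek(0)
parsed = [(r['path'], int(r['label']), r['split']) for r in csv.DictReader(buffer)]
print(parsed == rows)
```

Labels come back as strings from the CSV, so the sketch converts them with `int()` before comparing.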

In [1]:
import os
import sys
import time
import csv
from pathlib import Path
from collections import defaultdict

import pandas as pd
from tqdm.auto import tqdm

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## Configuration

In [2]:
# Detect environment
IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()

RAW_DIR = BASE_DIR / 'data/raw'
BUILT_DIR = BASE_DIR / 'data/built'

# Create run directory for this session
RUN_DIR = BASE_DIR / 'runs' / time.strftime('%Y%m%d-%H%M%S') / 'builds'
RUN_DIR.mkdir(parents=True, exist_ok=True)

SUMMARY_CSV = RUN_DIR / 'summary.csv'
SUMMARY_CSV.touch(exist_ok=True)

print(f"Environment: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Base directory: {BASE_DIR}")
print(f"Run directory: {RUN_DIR}")
print(f"Summary log: {SUMMARY_CSV}")

Environment: Local
Base directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters
Run directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-133536\builds
Summary log: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-133536\builds\summary.csv


## CSV Manifest Builder

In [3]:
def build_csv_manifest(dataset_name: str, raw_path: Path, output_path: Path):
    """
    Build CSV manifest for a dataset.
    
    Args:
        dataset_name: Name of dataset (e.g., 'cifar10')
        raw_path: Path to raw dataset directory
        output_path: Path to output CSV directory
    
    Returns:
        Dictionary with build statistics
    """
    print(f"\nBuilding CSV manifest for {dataset_name}...")
    print(f"  Source: {raw_path}")
    print(f"  Output: {output_path}")
    
    output_path.mkdir(parents=True, exist_ok=True)
    
    stats = {
        'stage': 'build',
        'dataset': dataset_name,
        'format': 'csv',
        'variant': 'default',
    }
    
    start_time = time.time()
    
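    # Collect class names from every split so label IDs are identical in train and val.
    # Extensions are compared case-insensitively, so a single pass also catches .JPG/.PNG files.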
    image_extensions = {'.jpg', '.jpeg', '.png'}
    all_class_names = set()
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            continue
        for img_path in split_dir.rglob('*.*'):
            if img_path.suffix.lower() in image_extensions:
                all_class_names.add(img_path.parent.name)
    
    class_names = sorted(all_class_names)
    class_to_label = {name: idx for idx, name in enumerate(class_names)}
    print(f"  Total classes across splits: {len(class_names)}")
    
    # Process each split
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            print(f"  ⚠ {split} split not found, skipping")
            continue
        
        csv_path = output_path / f"{split}.csv"
        
        print(f"\n  Processing {split} split...")
        
        # Find all image files (one pass with a lowercased suffix check, so no file is listed twice)
        image_files = [
            f for f in split_dir.rglob('*.*')
            if f.suffix.lower() in image_extensions
        ]
        
        print(f"    Found {len(image_files):,} images")
        split_class_count = len({f.parent.name for f in image_files})
        print(f"    Classes in split: {split_class_count}")
        
        # Write CSV
        with open(csv_path, 'w', newline='', encoding='utf-8') as f:  # UTF-8 matches pd.read_csv's default
            writer = csv.writer(f)
            writer.writerow(['path', 'label', 'split'])
            
            for img_path in tqdm(image_files, desc=f"    Writing {split}.csv"):
                class_name = img_path.parent.name
                if class_name not in class_to_label:
                    raise KeyError(f"Class {class_name} missing from mapping for split {split}")
                label = class_to_label[class_name]
                # Use absolute path for reliability
                abs_path = str(img_path.resolve())
                writer.writerow([abs_path, label, split])
        
        # Get file size
        csv_size = csv_path.stat().st_size
        
        print(f"    ✓ {split}.csv: {len(image_files):,} rows, {format_bytes(csv_size)}")
        
        # Log split stats
        split_stats = stats.copy()
        split_stats.update({
            'split': split,
            'items': len(image_files),
            'bytes_on_disk': csv_size,
            'num_files': 1,
            'avg_file_size': csv_size,
        })
        
        # Don't append yet, wait for total time
        if split == 'train':
            train_stats = split_stats
        else:
            val_stats = split_stats
    
    # Total wall time for the whole build (all splits); the same value is recorded on each split's row
    build_time = time.time() - start_time
    
    # Append to summary
    if 'train_stats' in locals():
        train_stats['build_wall_s'] = build_time
        append_to_summary(SUMMARY_CSV, train_stats)
    
    if 'val_stats' in locals():
        val_stats['build_wall_s'] = build_time
        append_to_summary(SUMMARY_CSV, val_stats)
    
    print(f"\n  ✓ Build completed in {build_time:.2f}s")
    
    return {
        'dataset': dataset_name,
        'format': 'csv',
        'build_time': build_time,
    }
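
The label assignment inside `build_csv_manifest` can be looked at in isolation: class names are the parent folder names of the image files, collected across all splits, sorted, and numbered from zero. The sketch below reproduces only that step on a throwaway directory tree. The helper name `build_class_to_label` and the demo file names are made up for illustration and are not part of the notebook.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}


def build_class_to_label(raw_path, splits=('train', 'val')):
    """Map each class folder name to an integer ID (hypothetical helper)."""
    class_names = set()
    for split in splits:
        split_dir = Path(raw_path) / split
        if not split_dir.exists():
            continue
        for img_path in split_dir.rglob('*.*'):
            # Lowercase the suffix so .PNG and .png are treated alike
            if img_path.suffix.lower() in IMAGE_EXTENSIONS:
                class_names.add(img_path.parent.name)
    # Sorting makes the mapping deterministic across runs and machines
    return {name: idx for idx, name in enumerate(sorted(class_names))}


# Demo on a temporary tree: <root>/<split>/<class>/<file>
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ['train/cat/a.jpg', 'train/dog/b.PNG',
                'val/bird/c.jpeg', 'val/cat/notes.txt']:
        file_path = root / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()
    demo_mapping = build_class_to_label(root)

print(demo_mapping)  # {'bird': 0, 'cat': 1, 'dog': 2}
```

Because the mapping is built from the union of both splits, a class that appears only in `val` (here `bird`) still gets an ID, and `cat` receives the same ID in both splits.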


## Build CSV Manifests for All Datasets

In [4]:
# Find available datasets
available_datasets = []

for dataset_name in ['cifar10', 'imagenet-mini']:
    dataset_path = RAW_DIR / dataset_name
    if dataset_path.exists() and (dataset_path / 'train').exists():
        available_datasets.append(dataset_name)

print(f"Found {len(available_datasets)} dataset(s): {', '.join(available_datasets)}")
print("\n" + "="*60)

Found 2 dataset(s): cifar10, imagenet-mini



In [5]:
# Build CSV manifests
build_results = []

for dataset_name in available_datasets:
    raw_path = RAW_DIR / dataset_name
    output_path = BUILT_DIR / dataset_name / 'csv' / 'default'
    
    result = build_csv_manifest(dataset_name, raw_path, output_path)
    build_results.append(result)
    
    print("="*60)

print(f"\n✓ Built CSV manifests for {len(build_results)} dataset(s)")


Building CSV manifest for cifar10...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\csv\default
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images
    Classes in split: 10


    ✓ train.csv: 50,000 rows, 5.6 MB

  Processing val split...
    Found 10,000 images
    Classes in split: 10


    ✓ val.csv: 10,000 rows, 1.1 MB

  ✓ Build completed in 22.64s

Building CSV manifest for imagenet-mini...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\csv\default
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images
    Classes in split: 1000


    ✓ train.csv: 34,745 rows, 4.6 MB

  Processing val split...
    Found 3,923 images
    Classes in split: 1000


    ✓ val.csv: 3,923 rows, 551.3 KB

  ✓ Build completed in 18.55s

✓ Built CSV manifests for 2 dataset(s)


## Verification

In [6]:
print("\nVerifying CSV manifests...\n")

for dataset_name in available_datasets:
    csv_dir = BUILT_DIR / dataset_name / 'csv' / 'default'
    
    print(f"{dataset_name}:")
    
    for split in ['train', 'val']:
        csv_path = csv_dir / f"{split}.csv"
        
        if not csv_path.exists():
            print(f"  ✗ {split}.csv not found")
            continue
        
        # Read CSV
        df = pd.read_csv(csv_path)
        
        # Check columns
        expected_cols = ['path', 'label', 'split']
        if list(df.columns) != expected_cols:
            print(f"  ✗ {split}.csv: Invalid columns {list(df.columns)}")
            continue
        
        # Check paths exist (sample)
        sample_paths = df['path'].sample(min(5, len(df))).tolist()
        paths_exist = all(Path(p).exists() for p in sample_paths)
        
        # Stats
        num_rows = len(df)
        num_classes = df['label'].nunique()
        file_size = csv_path.stat().st_size
        
        status = "✓" if paths_exist else "⚠"
        print(f"  {status} {split}.csv: {num_rows:,} rows, {num_classes} classes, {format_bytes(file_size)}")
        
        if not paths_exist:
            print(f"      Warning: Some sampled paths don't exist")
    
    print()


Verifying CSV manifests...

cifar10:
  ✓ train.csv: 50,000 rows, 10 classes, 5.6 MB
  ✓ val.csv: 10,000 rows, 10 classes, 1.1 MB

imagenet-mini:
  ✓ train.csv: 34,745 rows, 1000 classes, 4.6 MB
  ✓ val.csv: 3,923 rows, 1000 classes, 551.3 KB
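
The verification cell checks five sampled paths per file. If a stricter check is wanted, one option is to scan every row and also confirm that each class folder maps to exactly one label across all manifests. The following is a sketch under that assumption, using only the standard library. `check_manifests` and the demo files are hypothetical names introduced here.

```python
import csv
import tempfile
from pathlib import Path


def check_manifests(csv_paths):
    """Return (missing_paths, conflicting_classes) for the given manifests."""
    missing = []
    labels_by_class = {}
    for csv_path in csv_paths:
        with open(csv_path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                img_path = Path(row['path'])
                if not img_path.exists():
                    missing.append(row['path'])
                # The class name is the image's parent folder
                labels_by_class.setdefault(img_path.parent.name, set()).add(int(row['label']))
    # A class with more than one label means the splits disagree
    conflicts = {name: sorted(ids) for name, ids in labels_by_class.items() if len(ids) > 1}
    return missing, conflicts


# Demo: one deliberately missing file and one deliberately inconsistent label
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cat_train = root / 'train' / 'cat' / 'a.jpg'
    cat_val = root / 'val' / 'cat' / 'b.jpg'
    for file_path in (cat_train, cat_val):
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()
    ghost = root / 'val' / 'dog' / 'c.jpg'  # never created on disk

    def write_manifest(name, rows):
        out_path = root / name
        with open(out_path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['path', 'label', 'split'])
            writer.writerows((str(p), label, split) for p, label, split in rows)
        return out_path

    train_csv = write_manifest('train.csv', [(cat_train, 0, 'train')])
    val_csv = write_manifest('val.csv', [(cat_val, 1, 'val'), (ghost, 2, 'val')])
    missing, conflicts = check_manifests([train_csv, val_csv])

print(len(missing), conflicts)  # 1 {'cat': [0, 1]}
```

A full scan touches the filesystem once per row, so it is slower than sampling; on the datasets above that is tens of thousands of `exists()` calls.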



## Sample Data Inspection

In [7]:
# Show sample from first dataset
if available_datasets:
    dataset_name = available_datasets[0]
    csv_path = BUILT_DIR / dataset_name / 'csv' / 'default' / 'train.csv'
    
    print(f"Sample from {dataset_name} train.csv:\n")
    df = pd.read_csv(csv_path)
    print(df.head(10))
    
    print(f"\nDataset statistics:")
    print(f"  Total rows: {len(df):,}")
    print(f"  Unique labels: {df['label'].nunique()}")
    print(f"  Label distribution:")
    print(df['label'].value_counts().head(10))

Sample from cifar10 train.csv:

                                                path  label  split
0  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
1  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
2  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
3  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
4  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
5  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
6  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
7  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
8  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train
9  C:\Users\arjya\Fall 2025\Systems for ML\Projec...      0  train

Dataset statistics:
  Total rows: 50,000
  Unique labels: 10
  Label distribution:
label
0    5000
1    5000
2    5000
3    5000
4    5000
5    5000
6    5000
7    5000
8    5000
9    5000
Name: count, dtype: int64


## Build Summary

In [9]:
if SUMMARY_CSV.exists() and SUMMARY_CSV.stat().st_size > 0:
    # Column order matches the rows written by build_csv_manifest
    cols = [
        "stage", "dataset", "format", "variant", "split",
        "items", "bytes_on_disk", "num_files", "avg_file_size",
        "build_wall_s", "timestamp"
    ]
    # summary.csv starts with a header row; header=0 skips it and applies the names above
    summary_df = pd.read_csv(SUMMARY_CSV, names=cols, header=0)

    # Convert numeric columns safely
    numeric_cols = ["items", "bytes_on_disk", "num_files", "avg_file_size", "build_wall_s"]
    summary_df[numeric_cols] = summary_df[numeric_cols].apply(pd.to_numeric, errors='coerce')

    print("\nBuild Summary:")
    print("=" * 80)

    for _, row in summary_df.iterrows():
        dataset = row["dataset"] if pd.notna(row["dataset"]) else "Unknown"
        split = row["split"] if pd.notna(row["split"]) else "Unknown"

        items = (
            f"{int(row['items']):,}" if pd.notna(row["items"]) else "N/A"
        )
        csv_size = (
            format_bytes(row["bytes_on_disk"]) if pd.notna(row["bytes_on_disk"]) else "N/A"
        )
        build_time = (
            f"{row['build_wall_s']:.2f}s" if pd.notna(row["build_wall_s"]) else "N/A"
        )

        print(f"\n{dataset} - {split}:")
        print(f"  Items: {items}")
        print(f"  CSV size: {csv_size}")
        print(f"  Build time: {build_time}")

    print("\n" + "=" * 80)
    print(f"\nSummary saved to: {SUMMARY_CSV}")
else:
    print("No summary data available")



Build Summary:

cifar10 - train:
  Items: 50,000
  CSV size: 5.6 MB
  Build time: 22.64s

cifar10 - val:
  Items: 10,000
  CSV size: 1.1 MB
  Build time: 22.64s

imagenet-mini - train:
  Items: 34,745
  CSV size: 4.6 MB
  Build time: 18.55s

imagenet-mini - val:
  Items: 3,923
  CSV size: 551.3 KB
  Build time: 18.55s


Summary saved to: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-133536\builds\summary.csv


## ✅ CSV Manifest Build Complete

**What was created:**
- CSV manifest files for each dataset and split
- Format: `path,label,split` with absolute paths
- No compression (baseline format)

**Output locations:**
- `data/built/<dataset>/csv/default/train.csv`
- `data/built/<dataset>/csv/default/val.csv`

**Next steps:**
1. Run `11_loader_csv.ipynb` to create the dataloader
2. Or continue with other format builders (03-05)
3. Then run training experiments (20-21)
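
As a rough picture of how a loader could consume these manifests, the sketch below loads one CSV into an indexable list of `(path, label)` pairs. It is an assumption-laden illustration, not the contents of `11_loader_csv.ipynb`: the class name `ManifestIndex` and the demo paths are hypothetical, and image decoding and transforms are left out.

```python
import csv
import tempfile
from pathlib import Path


class ManifestIndex:
    """Indexable view over one manifest (hypothetical helper)."""

    def __init__(self, csv_path):
        # Load all rows up front; the manifests above are only a few MB
        with open(csv_path, newline='', encoding='utf-8') as f:
            self.samples = [(row['path'], int(row['label'])) for row in csv.DictReader(f)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # A real loader would open and decode the image at this point
        return self.samples[idx]


# Demo with a throwaway manifest and made-up paths
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / 'val.csv'
    with open(demo_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['path', 'label', 'split'])
        writer.writerow(['/data/demo/val/cat/1.png', 0, 'val'])
        writer.writerow(['/data/demo/val/dog/2.png', 1, 'val'])
    index = ManifestIndex(demo_csv)

print(len(index), index[1])
```

The `__len__`/`__getitem__` pair is the same protocol a map-style dataset exposes, so a class like this could be wrapped by a framework dataset with little change.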