# 04 - Build TFRecord Format

This notebook converts the raw image datasets into sharded TFRecord files for efficient data loading.

**Format:** TensorFlow's TFRecord (binary protocol buffer)
- Efficient binary serialization
- Sequential I/O friendly
- Configurable shard sizes
- Optional compression (GZIP, ZLIB)
- Readable natively in TensorFlow; PyTorch needs a separate TFRecord reader
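To make the "binary" and "sequential" properties concrete: an uncompressed TFRecord file is a plain concatenation of length-prefixed records. The sketch below walks that framing with the standard library only. It skips CRC verification, and `iter_tfrecord_payloads` is an illustrative name, not part of this project or of TensorFlow.

```python
import struct
from typing import BinaryIO, Iterator


def iter_tfrecord_payloads(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw record payloads from an uncompressed TFRecord stream.

    Each record is laid out (little-endian) as:
      uint64 length | uint32 CRC of length | payload | uint32 CRC of payload
    The CRC fields are read and discarded here, not verified.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte length CRC
        if not header:
            return  # clean end of file
        if len(header) < 12:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # 4-byte payload CRC
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated record body")
        yield payload
```

Because every record is read front to back with no index, access is naturally sequential. For a `.tfrecord.gz` shard, opening the file with `gzip.open(path, "rb")` before passing it in should work, since the compression is applied to the file stream as a whole.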

**Variants:**
- Shard sizes: 64MB, 256MB, 1024MB
- Compression: none, gzip

**Output:**
- `data/built/<dataset>/tfrecord/<variant>/*.tfrecord[.gz]`
- Build statistics logged to `runs/<session>/builds/summary.csv`
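As a rough illustration of this layout, the helper below composes a shard path from its parts. The `{split}-{index:06d}` file name follows the naming used by the builder later in this notebook; `shard_path` itself is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path


def shard_path(built_dir: Path, dataset: str, variant: str,
               split: str, index: int, gzip: bool) -> Path:
    """Compose a shard path following the output layout described above."""
    ext = ".tfrecord.gz" if gzip else ".tfrecord"
    return built_dir / dataset / "tfrecord" / variant / f"{split}-{index:06d}{ext}"


path = shard_path(Path("data/built"), "cifar10", "shard64_gzip", "train", 0, True)
print(path.as_posix())  # data/built/cifar10/tfrecord/shard64_gzip/train-000000.tfrecord.gz
```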

In [1]:
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.




In [2]:
import os
import sys
import time
import json
from pathlib import Path
from collections import defaultdict

import tensorflow as tf
from PIL import Image
from tqdm.auto import tqdm
import numpy as np

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## Configuration

In [3]:
# Detect environment
IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()

RAW_DIR = BASE_DIR / 'data/raw'
BUILT_DIR = BASE_DIR / 'data/built'

# Create run directory for this session
RUN_DIR = BASE_DIR / 'runs' / time.strftime('%Y%m%d-%H%M%S') / 'builds'
RUN_DIR.mkdir(parents=True, exist_ok=True)

SUMMARY_CSV = RUN_DIR / 'summary.csv'
SUMMARY_CSV.touch(exist_ok=True)

print(f"Environment: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Base directory: {BASE_DIR}")
print(f"Run directory: {RUN_DIR}")
print(f"Summary log: {SUMMARY_CSV}")

Environment: Local
Base directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters
Run directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-135410\builds
Summary log: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-135410\builds\summary.csv


## Build Configuration

Configure which variants to build:

In [4]:
# Shard sizes in MB
SHARD_SIZES = [64, 256, 1024]

# Compression options
COMPRESSIONS = ['none', 'gzip']

# Generate all variants
VARIANTS = []
for shard_mb in SHARD_SIZES:
    for compression in COMPRESSIONS:
        VARIANTS.append({
            'shard_mb': shard_mb,
            'compression': compression,
            'name': f"shard{shard_mb}_{compression}"
        })

print(f"Will build {len(VARIANTS)} variants:")
for v in VARIANTS:
    print(f"  - {v['name']}")

Will build 6 variants:
  - shard64_none
  - shard64_gzip
  - shard256_none
  - shard256_gzip
  - shard1024_none
  - shard1024_gzip
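Later stages only see the variant name (it is also the directory name), so it can help to recover the configuration from it. A minimal sketch, assuming names always follow the `shard<MB>_<compression>` pattern generated above; `parse_variant` is a hypothetical helper.

```python
import re

_VARIANT_RE = re.compile(r"^shard(?P<shard_mb>\d+)_(?P<compression>none|gzip)$")


def parse_variant(name: str) -> dict:
    """Turn a name such as 'shard256_gzip' back into its variant dict."""
    match = _VARIANT_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    return {
        "shard_mb": int(match["shard_mb"]),
        "compression": match["compression"],
        "name": name,
    }


print(parse_variant("shard256_gzip"))
```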


## TFRecord Helper Functions

In [5]:
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def create_example(image_bytes, label):
    """
    Create a TFRecord example from image bytes and label.
    
    Args:
        image_bytes: Raw image bytes (JPEG/PNG encoded)
        label: Integer label
    
    Returns:
        tf.train.Example
    """
    feature = {
        'image': _bytes_feature(image_bytes),
        'label': _int64_feature(label),
    }
    
    return tf.train.Example(features=tf.train.Features(feature=feature))
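Since `create_example` stores the encoded file bytes unchanged, nothing checks that they are actually a JPEG or PNG. One cheap safeguard, sketched here as an optional addition rather than something the notebook does, is to look at the leading magic bytes before writing. `sniff_image_format` is a hypothetical helper.

```python
def sniff_image_format(data: bytes) -> str:
    """Return 'jpeg', 'png', or 'unknown' based on the leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):       # JPEG start-of-image marker
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        return "png"
    return "unknown"
```

A builder could call this on `img_bytes` and skip or log files that come back as `'unknown'`, for instance a text file saved with a `.jpg` extension.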

## TFRecord Builder

In [6]:
def build_tfrecord_split(
    dataset_name: str,
    split: str,
    raw_path: Path,
    output_path: Path,
    shard_mb: int,
    compression: str,
    class_to_label: dict,
    image_extensions: list
):
    """
    Build TFRecord shards for a single split.
    
    Args:
        dataset_name: Name of dataset
        split: Split name ('train' or 'val')
        raw_path: Path to raw dataset split directory
        output_path: Path to output directory
        shard_mb: Target shard size in MB
        compression: Compression type ('none' or 'gzip')
        class_to_label: Mapping from class name to label index
        image_extensions: List of image file extensions to include
    
    Returns:
        Dictionary with build statistics
    """
    output_path.mkdir(parents=True, exist_ok=True)
    
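    # Find all image files. Compare extensions case-insensitively so that
    # files such as '*.JPEG' are also picked up on case-sensitive filesystems.
    extensions = {ext.lower() for ext in image_extensions}
    image_files = [
        p for p in raw_path.rglob('*')
        if p.is_file() and p.suffix.lower() in extensions
    ]
    
    # Sort for reproducibility
    image_files = sorted(image_files)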
    split_class_count = len({f.parent.name for f in image_files})
    print(f"    Found {len(image_files):,} images in {split_class_count} classes")
    
    # Set compression options
    if compression == 'gzip':
        if hasattr(tf.io, 'TFRecordCompressionType'):
            compression_type = tf.io.TFRecordCompressionType.GZIP
        else:
            compression_type = 'GZIP'
        file_ext = '.tfrecord.gz'
    else:
        if hasattr(tf.io, 'TFRecordCompressionType'):
            compression_type = tf.io.TFRecordCompressionType.NONE
        else:
            compression_type = None
        file_ext = '.tfrecord'
    
    # Target shard size in bytes
    maxsize = shard_mb * 1024 * 1024
    
    # Write TFRecords
    shard_idx = 0
    current_size = 0
    writer = None
    shard_files = []
    
    options = tf.io.TFRecordOptions(compression_type=compression_type)
    
    for idx, img_path in enumerate(tqdm(image_files, desc=f"    Writing {split}")):
        # Read image bytes
        with open(img_path, 'rb') as f:
            img_bytes = f.read()
        
        # Get label
        class_name = img_path.parent.name
        if class_name not in class_to_label:
            raise KeyError(f"Class {class_name} missing from mapping for split {split}")
        label = class_to_label[class_name]
        
        # Create example
        example = create_example(img_bytes, label)
        serialized = example.SerializeToString()
        
        # Check if we need a new shard
        if writer is None or current_size >= maxsize:
            if writer is not None:
                writer.close()
            
            shard_path = output_path / f"{split}-{shard_idx:06d}{file_ext}"
            shard_files.append(shard_path)
            writer = tf.io.TFRecordWriter(str(shard_path), options=options)
            current_size = 0
            shard_idx += 1
        
        # Write example
        writer.write(serialized)
        current_size += len(serialized)
    
    # Close final writer
    if writer is not None:
        writer.close()
    
    # Calculate statistics
    total_bytes = sum(f.stat().st_size for f in shard_files)
    
    return {
        'items': len(image_files),
        'bytes_on_disk': total_bytes,
        'num_files': len(shard_files),
        'avg_file_size': total_bytes // len(shard_files) if shard_files else 0,
    }


def build_tfrecord(
    dataset_name: str,
    raw_path: Path,
    output_path: Path,
    shard_mb: int,
    compression: str,
    variant_name: str
):
    """
    Build TFRecord for a dataset with specific configuration.
    
    Args:
        dataset_name: Name of dataset
        raw_path: Path to raw dataset directory
        output_path: Path to output directory
        shard_mb: Target shard size in MB
        compression: Compression type ('none' or 'gzip')
        variant_name: Variant identifier
    
    Returns:
        Dictionary with build statistics
    """
    print(f"\nBuilding TFRecord for {dataset_name} ({variant_name})...")
    print(f"  Source: {raw_path}")
    print(f"  Output: {output_path}")
    print(f"  Shard size: {shard_mb}MB, Compression: {compression}")
    
    start_time = time.time()
    # Collect class names from every split, matching image extensions
    # case-insensitively (e.g. '.JPEG' on case-sensitive filesystems)
    image_extensions = {'.jpg', '.jpeg', '.png'}
    all_class_names = set()
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            continue
        for img_path in split_dir.rglob('*'):
            if img_path.is_file() and img_path.suffix.lower() in image_extensions:
                all_class_names.add(img_path.parent.name)
    class_names = sorted(all_class_names)
    class_to_label = {name: idx for idx, name in enumerate(class_names)}
    print(f"  Total classes across splits: {len(class_names)}")
    
    # Process each split
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            print(f"  ⚠ {split} split not found, skipping")
            continue
        
        print(f"\n  Processing {split} split...")
        
        split_stats = build_tfrecord_split(
            dataset_name, split, split_dir, output_path,
            shard_mb, compression, class_to_label, image_extensions
        )
        
        print(f"    ✓ {split}: {split_stats['num_files']} shards, "
              f"{format_bytes(split_stats['bytes_on_disk'])}")
        
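        # Log to summary. Note: build_wall_s is measured from the start of this
        # variant's build, so the 'val' row also includes the 'train' time.
        build_time = time.time() - start_time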
        row = {
            'stage': 'build',
            'dataset': dataset_name,
            'format': 'tfrecord',
            'variant': variant_name,
            'split': split,
            'items': split_stats['items'],
            'bytes_on_disk': split_stats['bytes_on_disk'],
            'num_files': split_stats['num_files'],
            'avg_file_size': split_stats['avg_file_size'],
            'build_wall_s': build_time,
        }
        append_to_summary(SUMMARY_CSV, row)
    
    build_time = time.time() - start_time
    print(f"\n  ✓ Build completed in {build_time:.2f}s")
    
    return {'dataset': dataset_name, 'variant': variant_name, 'build_time': build_time}
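The rollover rule in `build_tfrecord_split` opens a new shard only once the current one has already reached the target, so a shard can overshoot by up to one record. The simulation below mirrors that check on a list of record sizes; `simulate_shards` is an illustrative helper, and the sizes in the usage note are made up.

```python
from typing import Iterable, List


def simulate_shards(record_sizes: Iterable[int], shard_mb: int) -> List[int]:
    """Return the payload bytes per shard under the builder's rollover rule."""
    maxsize = shard_mb * 1024 * 1024
    shards: List[int] = []
    for size in record_sizes:
        # Same check as the builder: start a new shard only when the current
        # one has already reached the target, before writing the next record.
        if not shards or shards[-1] >= maxsize:
            shards.append(0)
        shards[-1] += size
    return shards
```

With three 40 MB records and a 64 MB target this yields two shards of 80 MB and 40 MB. Two further points follow from the builder code: the counter tracks serialized bytes before compression, so gzip shards end up smaller on disk than the target, and it excludes the 16 bytes of framing per record, which is consistent with the first CIFAR-10 shard being reported as 64.5 MB rather than 64.0 MB.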


## Build TFRecords for All Datasets and Variants

In [7]:
# Find available datasets
available_datasets = []
for dataset_name in ['cifar10', 'imagenet-mini', 'tiny-imagenet-200']:
    dataset_path = RAW_DIR / dataset_name
    if dataset_path.exists() and (dataset_path / 'train').exists():
        available_datasets.append(dataset_name)

print(f"Found {len(available_datasets)} dataset(s): {', '.join(available_datasets)}")
print("\n" + "="*60)

Found 2 dataset(s): cifar10, imagenet-mini



In [8]:
# Build all variants for all datasets
build_results = []

for dataset_name in available_datasets:
    raw_path = RAW_DIR / dataset_name
    
    for variant in VARIANTS:
        output_path = BUILT_DIR / dataset_name / 'tfrecord' / variant['name']
        
        result = build_tfrecord(
            dataset_name=dataset_name,
            raw_path=raw_path,
            output_path=output_path,
            shard_mb=variant['shard_mb'],
            compression=variant['compression'],
            variant_name=variant['name']
        )
        build_results.append(result)
        
        print("="*60)

print(f"\n✓ Built {len(build_results)} TFRecord variant(s)")


Building TFRecord for cifar10 (shard64_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\tfrecord\shard64_none
  Shard size: 64MB, Compression: none
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 2 shards, 110.4 MB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1 shards, 22.1 MB

  ✓ Build completed in 48.08s

Building TFRecord for cifar10 (shard64_gzip)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\tfrecord\shard64_gzip
  Shard size: 64MB, Compression: gzip
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 2 shards, 107.2 MB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1 shards, 21.4 MB

  ✓ Build completed in 19.01s

Building TFRecord for cifar10 (shard256_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\tfrecord\shard256_none
  Shard size: 256MB, Compression: none
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 1 shards, 110.4 MB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1 shards, 22.1 MB

  ✓ Build completed in 10.06s

Building TFRecord for cifar10 (shard256_gzip)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\tfrecord\shard256_gzip
  Shard size: 256MB, Compression: gzip
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 1 shards, 107.2 MB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1 shards, 21.4 MB

  ✓ Build completed in 12.08s

Building TFRecord for cifar10 (shard1024_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\tfrecord\shard1024_none
  Shard size: 1024MB, Compression: none
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 1 shards, 110.4 MB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1 shards, 22.1 MB

  ✓ Build completed in 9.21s

Building TFRecord for cifar10 (shard1024_gzip)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\tfrecord\shard1024_gzip
  Shard size: 1024MB, Compression: gzip
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 1 shards, 107.2 MB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1 shards, 21.4 MB

  ✓ Build completed in 19.75s

Building TFRecord for imagenet-mini (shard64_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\tfrecord\shard64_none
  Shard size: 64MB, Compression: none
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 56 shards, 3.5 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 8 shards, 480.3 MB

  ✓ Build completed in 73.26s

Building TFRecord for imagenet-mini (shard64_gzip)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\tfrecord\shard64_gzip
  Shard size: 64MB, Compression: gzip
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 56 shards, 3.4 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 8 shards, 476.1 MB

  ✓ Build completed in 230.10s

Building TFRecord for imagenet-mini (shard256_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\tfrecord\shard256_none
  Shard size: 256MB, Compression: none
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 14 shards, 3.5 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 2 shards, 480.3 MB

  ✓ Build completed in 70.25s

Building TFRecord for imagenet-mini (shard256_gzip)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\tfrecord\shard256_gzip
  Shard size: 256MB, Compression: gzip
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 14 shards, 3.4 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 2 shards, 476.1 MB

  ✓ Build completed in 248.99s

Building TFRecord for imagenet-mini (shard1024_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\tfrecord\shard1024_none
  Shard size: 1024MB, Compression: none
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 4 shards, 3.5 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 1 shards, 480.3 MB

  ✓ Build completed in 84.91s

Building TFRecord for imagenet-mini (shard1024_gzip)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\tfrecord\shard1024_gzip
  Shard size: 1024MB, Compression: gzip
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 4 shards, 3.4 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 1 shards, 476.1 MB

  ✓ Build completed in 226.68s

✓ Built 12 TFRecord variant(s)
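The log above allows a quick check of what gzip buys for these datasets. The arithmetic below uses only figures copied from that log. The small savings are what one would expect when the payload is already-compressed JPEG/PNG data, though that explanation is an inference, not something the notebook measures.

```python
# (uncompressed MB, gzip MB), copied from the build log above
sizes_mb = {
    ("cifar10", "train"): (110.4, 107.2),
    ("cifar10", "val"): (22.1, 21.4),
    ("imagenet-mini", "val"): (480.3, 476.1),
}


def saving_pct(uncompressed: float, compressed: float) -> float:
    """Percentage of disk space saved by compression, to one decimal place."""
    return round((1 - compressed / uncompressed) * 100, 1)


for key, (raw, gz) in sizes_mb.items():
    print(key, f"{saving_pct(raw, gz)}% smaller with gzip")

# imagenet-mini, shard64: 73.26 s uncompressed vs 230.10 s with gzip
print(f"gzip build slowdown: {230.10 / 73.26:.1f}x")
```

This gives savings of 2.9%, 3.2%, and 0.9%, against a roughly 3.1x longer build for imagenet-mini. Single-run timings are noisy, though: the first CIFAR-10 build (uncompressed, 48.08 s) was slower than its gzip counterpart (19.01 s), possibly because of warm-up effects.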


## Verification

In [9]:
print("\nVerifying TFRecord shards...\n")

for dataset_name in available_datasets:
    print(f"{dataset_name}:")
    
    for variant in VARIANTS:
        tfr_dir = BUILT_DIR / dataset_name / 'tfrecord' / variant['name']
        
        if not tfr_dir.exists():
            print(f"  ✗ {variant['name']}: directory not found")
            continue
        
        # Count shards
        train_shards = list(tfr_dir.glob('train-*.tfrecord*'))
        val_shards = list(tfr_dir.glob('val-*.tfrecord*'))
        
        if not train_shards:
            print(f"  ✗ {variant['name']}: no train shards found")
            continue
        
        # Calculate sizes
        train_size = sum(f.stat().st_size for f in train_shards)
        val_size = sum(f.stat().st_size for f in val_shards)
        
        print(f"  ✓ {variant['name']}:")
        print(f"      Train: {len(train_shards)} shards, {format_bytes(train_size)}")
        print(f"      Val: {len(val_shards)} shards, {format_bytes(val_size)}")
    
    print()


Verifying TFRecord shards...

cifar10:
  ✓ shard64_none:
      Train: 2 shards, 110.4 MB
      Val: 1 shards, 22.1 MB
  ✓ shard64_gzip:
      Train: 2 shards, 107.2 MB
      Val: 1 shards, 21.4 MB
  ✓ shard256_none:
      Train: 1 shards, 110.4 MB
      Val: 1 shards, 22.1 MB
  ✓ shard256_gzip:
      Train: 1 shards, 107.2 MB
      Val: 1 shards, 21.4 MB
  ✓ shard1024_none:
      Train: 1 shards, 110.4 MB
      Val: 1 shards, 22.1 MB
  ✓ shard1024_gzip:
      Train: 1 shards, 107.2 MB
      Val: 1 shards, 21.4 MB

imagenet-mini:
  ✓ shard64_none:
      Train: 56 shards, 3.5 GB
      Val: 8 shards, 480.3 MB
  ✓ shard64_gzip:
      Train: 56 shards, 3.4 GB
      Val: 8 shards, 476.1 MB
  ✓ shard256_none:
      Train: 14 shards, 3.5 GB
      Val: 2 shards, 480.3 MB
  ✓ shard256_gzip:
      Train: 14 shards, 3.4 GB
      Val: 2 shards, 476.1 MB
  ✓ shard1024_none:
      Train: 4 shards, 3.5 GB
      Val: 1 shards, 480.3 MB
  ✓ shard1024_gzip:
      Train: 4 shards, 3.4 GB
      Val: 1 sha

## Sample Data Inspection

In [10]:
# Inspect first shard of first dataset/variant
if available_datasets and VARIANTS:
    dataset_name = available_datasets[0]
    variant = VARIANTS[0]
    tfr_dir = BUILT_DIR / dataset_name / 'tfrecord' / variant['name']
    
    train_shards = sorted(tfr_dir.glob('train-*.tfrecord*'))
    if train_shards:
        print(f"Inspecting first shard of {dataset_name} ({variant['name']}):\n")
        print(f"Shard: {train_shards[0].name}")
        print(f"Size: {format_bytes(train_shards[0].stat().st_size)}")
        
        # Read first few samples
        print("\nFirst 5 samples:")
        
        # Determine compression
        compression = 'GZIP' if train_shards[0].suffix == '.gz' else ''
        
        dataset = tf.data.TFRecordDataset(
            str(train_shards[0]),
            compression_type=compression
        )
        
        for i, raw_record in enumerate(dataset.take(5)):
            example = tf.train.Example()
            example.ParseFromString(raw_record.numpy())
            
            # Extract features
            image_bytes = example.features.feature['image'].bytes_list.value[0]
            label = example.features.feature['label'].int64_list.value[0]
            
            print(f"\nSample {i}:")
            print(f"  Image bytes: {len(image_bytes)}")
            print(f"  Label: {label}")

Inspecting first shard of cifar10 (shard64_none):

Shard: train-000000.tfrecord
Size: 64.5 MB

First 5 samples:

Sample 0:
  Image bytes: 2063
  Label: 0

Sample 1:
  Image bytes: 2169
  Label: 0

Sample 2:
  Image bytes: 2237
  Label: 0

Sample 3:
  Image bytes: 2104
  Label: 0

Sample 4:
  Image bytes: 2181
  Label: 0
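All five samples carry label 0 because the builder writes files in sorted path order, so each shard holds long runs of a single class. The loader therefore has to shuffle, or the build could randomize the order up front. The sketch below shows the second option with a fixed seed so builds stay reproducible. It is an alternative under that assumption, not what this notebook does, and `deterministic_shuffle` is a hypothetical helper.

```python
import random
from pathlib import Path
from typing import List, Sequence


def deterministic_shuffle(paths: Sequence[Path], seed: int = 0) -> List[Path]:
    """Return the paths in a seeded pseudo-random order (same seed, same order)."""
    ordered = sorted(paths)  # start from a stable order, independent of input order
    random.Random(seed).shuffle(ordered)
    return ordered
```

Sorting first means the result depends only on the set of paths and the seed, not on the order in which the filesystem happened to list them.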


## Build Summary

In [12]:
# Read and display summary
import pandas as pd
import os

# Ensure column headers exist
expected_cols = [
    "stage", "dataset", "format", "variant", "split",
    "items", "bytes_on_disk", "num_files", "avg_file_size",
    "build_wall_s", "timestamp"
]

# If file exists but lacks headers, rewrite with them
if SUMMARY_CSV.exists() and SUMMARY_CSV.stat().st_size > 0:
    # Try reading with headers first
    try:
        summary_df = pd.read_csv(SUMMARY_CSV)
        # If headers are missing (numeric columns instead of named ones)
        if list(summary_df.columns) != expected_cols:
            summary_df = pd.read_csv(SUMMARY_CSV, names=expected_cols, header=None)
            summary_df.to_csv(SUMMARY_CSV, index=False)
    except pd.errors.ParserError:
        # In case of malformed header, force rewrite
        summary_df = pd.read_csv(SUMMARY_CSV, names=expected_cols, header=None)
        summary_df.to_csv(SUMMARY_CSV, index=False)
else:
    # No data yet; the check below reports this once
    summary_df = None

if summary_df is not None and not summary_df.empty:
    print("\nBuild Summary:")
    print("=" * 80)

    for dataset in summary_df['dataset'].unique():
        print(f"\n{dataset}:")
        dataset_df = summary_df[summary_df['dataset'] == dataset]

        for variant in dataset_df['variant'].unique():
            variant_df = dataset_df[dataset_df['variant'] == variant]
            print(f"\n  {variant}:")

            for _, row in variant_df.iterrows():
                print(f"    {row['split']}:")
                print(f"      Items: {int(row['items']):,}")
                print(f"      Shards: {int(row['num_files'])}")
                print(f"      Size: {format_bytes(row['bytes_on_disk'])}")
                print(f"      Avg shard: {format_bytes(row['avg_file_size'])}")

    print("\n" + "=" * 80)
    print(f"\nSummary saved to: {SUMMARY_CSV}")
else:
    print("No summary data available")



Build Summary:

cifar10:

  shard64_none:
    train:
      Items: 50,000
      Shards: 2
      Size: 110.4 MB
      Avg shard: 55.2 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 22.1 MB
      Avg shard: 22.1 MB

  shard64_gzip:
    train:
      Items: 50,000
      Shards: 2
      Size: 107.2 MB
      Avg shard: 53.6 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 21.4 MB
      Avg shard: 21.4 MB

  shard256_none:
    train:
      Items: 50,000
      Shards: 1
      Size: 110.4 MB
      Avg shard: 110.4 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 22.1 MB
      Avg shard: 22.1 MB

  shard256_gzip:
    train:
      Items: 50,000
      Shards: 1
      Size: 107.2 MB
      Avg shard: 107.2 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 21.4 MB
      Avg shard: 21.4 MB

  shard1024_none:
    train:
      Items: 50,000
      Shards: 1
      Size: 110.4 MB
      Avg shard: 110.4 MB
    val:
      Items: 10,000
      Shards: 1
      Size: 
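For downstream comparisons it is often enough to total the bytes per variant. A minimal sketch using only the standard library, assuming the summary file has a header row with the `dataset`, `variant`, and `bytes_on_disk` columns listed in `expected_cols`; `total_bytes_by_variant` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO, Tuple


def total_bytes_by_variant(csv_file: TextIO) -> Dict[Tuple[str, str], int]:
    """Sum bytes_on_disk over all splits for each (dataset, variant) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[(row["dataset"], row["variant"])] += int(float(row["bytes_on_disk"]))
    return dict(totals)


# Tiny in-memory example with made-up byte counts
sample = io.StringIO(
    "dataset,variant,split,bytes_on_disk\n"
    "cifar10,shard64_none,train,1000\n"
    "cifar10,shard64_none,val,200\n"
)
print(total_bytes_by_variant(sample))  # {('cifar10', 'shard64_none'): 1200}
```

With the real file, the call would be wrapped in `with open(SUMMARY_CSV, newline="") as f:`.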

## ✅ TFRecord Build Complete

**What was created:**
- TFRecord shards with configurable sizes
- Multiple variants with different shard sizes and compression
- Sequential I/O friendly format
- Readable natively in TensorFlow; PyTorch needs a separate TFRecord reader

**Variants built:**
- `shard64_none`: 64MB shards, no compression
- `shard64_gzip`: 64MB shards, gzip compression
- `shard256_none`: 256MB shards, no compression
- `shard256_gzip`: 256MB shards, gzip compression
- `shard1024_none`: 1024MB shards, no compression
- `shard1024_gzip`: 1024MB shards, gzip compression

**Output locations:**
- `data/built/<dataset>/tfrecord/<variant>/*.tfrecord[.gz]`

**Next steps:**
1. Run `13_loader_tfrecord.ipynb` to create the dataloader
2. Or continue with other format builders (05)
3. Then run training experiments (20-21)