# 05 - Build LMDB Format

This notebook builds each dataset in LMDB (Lightning Memory-Mapped Database) format for efficient data loading.

**Format:** LMDB key-value store
- Memory-mapped database, well suited to fast random access
- One database directory per split (`data.mdb` plus `lock.mdb`)
- Optional per-value compression (zstd, lz4), applied by this notebook before writing; LMDB itself does not compress
- Used by computer vision frameworks such as Caffe

**Variants:**
- Compression: none, zstd, lz4

**Output:**
- `data/built/<dataset>/lmdb/<variant>/<split>.lmdb/`
- Build statistics logged to `runs/<session>/summary.csv`

In [1]:
# pip install zstandard

In [2]:
# pip install lz4

In [3]:
import os
import sys
import time
import json
import pickle
from pathlib import Path
from collections import defaultdict

import lmdb
from PIL import Image
from tqdm.auto import tqdm
import numpy as np

# Optional compression libraries
try:
    import zstandard as zstd
    HAS_ZSTD = True
except ImportError:
    HAS_ZSTD = False
    print("⚠ zstandard not available, zstd compression will be skipped")

try:
    import lz4.frame
    HAS_LZ4 = True
except ImportError:
    HAS_LZ4 = False
    print("⚠ lz4 not available, lz4 compression will be skipped")

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## Configuration

In [4]:
# Detect environment
IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()

RAW_DIR = BASE_DIR / 'data/raw'
BUILT_DIR = BASE_DIR / 'data/built'

# Create run directory for this session
RUN_DIR = BASE_DIR / 'runs' / time.strftime('%Y%m%d-%H%M%S') / 'builds'
RUN_DIR.mkdir(parents=True, exist_ok=True)

SUMMARY_CSV = RUN_DIR / 'summary.csv'
SUMMARY_CSV.touch(exist_ok=True)

print(f"Environment: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Base directory: {BASE_DIR}")
print(f"Run directory: {RUN_DIR}")
print(f"Summary log: {SUMMARY_CSV}")

Environment: Local
Base directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters
Run directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-141322\builds
Summary log: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs\20251127-141322\builds\summary.csv


## Build Configuration

Configure which variants to build:

In [5]:
# Compression options (only include available ones)
COMPRESSIONS = ['none']
if HAS_ZSTD:
    COMPRESSIONS.append('zstd')
if HAS_LZ4:
    COMPRESSIONS.append('lz4')

# Generate all variants
VARIANTS = []
for compression in COMPRESSIONS:
    VARIANTS.append({
        'compression': compression,
        'name': f"compress_{compression}"
    })

print(f"Will build {len(VARIANTS)} variants:")
for v in VARIANTS:
    print(f"  - {v['name']}")

Will build 3 variants:
  - compress_none
  - compress_zstd
  - compress_lz4
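
The compression variants operate on image bytes that are already JPEG/PNG encoded, so a general-purpose compressor may have little redundancy left to remove. The sketch below illustrates the effect with the standard-library `zlib` (a stand-in chosen for portability, not the zstd/lz4 codecs this notebook uses) and random bytes as a rough proxy for high-entropy image payloads. Both choices, and the helper name `compression_ratio`, are assumptions made for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 6)) / len(data)


# Highly repetitive bytes compress well...
repetitive = b"label=0;" * 512
# ...while high-entropy bytes (a rough proxy for JPEG payloads) barely shrink.
high_entropy = os.urandom(4096)

print(f"repetitive:   {compression_ratio(repetitive):.3f}")
print(f"high entropy: {compression_ratio(high_entropy):.3f}")
```

Measuring the ratio on a handful of real images from each dataset would show whether the zstd and lz4 variants are worth their decode cost.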


## LMDB Helper Functions

In [6]:
def compress_data(data: bytes, compression: str) -> bytes:
    """
    Compress data using specified compression algorithm.
    
    Args:
        data: Raw bytes to compress
        compression: Compression type ('none', 'zstd', 'lz4')
    
    Returns:
        Compressed bytes (or original if compression='none')
    """
    if compression == 'none':
        return data
    elif compression == 'zstd' and HAS_ZSTD:
        compressor = zstd.ZstdCompressor(level=3)
        return compressor.compress(data)
    elif compression == 'lz4' and HAS_LZ4:
        return lz4.frame.compress(data)
    else:
        raise ValueError(f"Unsupported compression: {compression}")


def create_lmdb_entry(image_bytes: bytes, label: int, compression: str) -> bytes:
    """
    Create an LMDB entry from image bytes and label.
    
    Args:
        image_bytes: Raw image bytes (JPEG/PNG encoded)
        label: Integer label
        compression: Compression type
    
    Returns:
        Serialized entry bytes
    """
    # Compress image if requested
    compressed_image = compress_data(image_bytes, compression)
    
    # Create entry dict
    entry = {
        'image': compressed_image,
        'label': label,
        'compression': compression,
    }
    
    # Serialize with pickle
    return pickle.dumps(entry)
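
A reader has to undo these two steps in reverse order: unpickle the entry, then decompress the image according to the `compression` field stored with it. The following is a minimal sketch of that inverse, assuming the entry layout shown above. `decompress_data` and `decode_lmdb_entry` are hypothetical names, not functions defined elsewhere in this project.

```python
import pickle

# Optional codecs, guarded the same way as in the build notebook.
try:
    import zstandard as zstd
except ImportError:
    zstd = None

try:
    import lz4.frame as lz4_frame
except ImportError:
    lz4_frame = None


def decompress_data(data: bytes, compression: str) -> bytes:
    """Inverse of compress_data for 'none', 'zstd', and 'lz4'."""
    if compression == 'none':
        return data
    if compression == 'zstd' and zstd is not None:
        return zstd.ZstdDecompressor().decompress(data)
    if compression == 'lz4' and lz4_frame is not None:
        return lz4_frame.decompress(data)
    raise ValueError(f"Unsupported compression: {compression}")


def decode_lmdb_entry(entry_bytes: bytes):
    """Return (encoded_image_bytes, label) from a serialized entry."""
    entry = pickle.loads(entry_bytes)
    image_bytes = decompress_data(entry['image'], entry['compression'])
    return image_bytes, entry['label']


# Round trip with an uncompressed entry built the same way as create_lmdb_entry.
raw = b'\xff\xd8 placeholder image payload'
stored = pickle.dumps({'image': raw, 'label': 3, 'compression': 'none'})
image_bytes, label = decode_lmdb_entry(stored)
print(len(image_bytes), label)
```

Because `pickle.loads` can execute arbitrary code, this approach is only appropriate for databases you built yourself.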

## LMDB Builder

In [7]:
def build_lmdb_split(
    dataset_name: str,
    split: str,
    raw_path: Path,
    output_path: Path,
    compression: str,
    class_to_label: dict,
    image_extensions: list
):
    """
    Build LMDB database for a single split.
    
    Args:
        dataset_name: Name of dataset
        split: Split name ('train' or 'val')
        raw_path: Path to raw dataset split directory
        output_path: Path to output directory
        compression: Compression type ('none', 'zstd', 'lz4')
        class_to_label: Mapping from class name to label index
        image_extensions: List of image file extensions to include
    
    Returns:
        Dictionary with build statistics
    """
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Find all image files
    image_files = []
    for ext in image_extensions:
        image_files.extend(raw_path.rglob(f'*{ext}'))
    
    # Sort for reproducibility
    image_files = sorted(image_files)
    split_class_count = len({f.parent.name for f in image_files})
    print(f"    Found {len(image_files):,} images in {split_class_count} classes")
    
    # Create LMDB database
    lmdb_path = output_path / f"{split}.lmdb"
    
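    # Estimate map size: assume ~100 KiB per image, doubled for headroom.
    # This is not derived from the raw data size, so a dataset averaging more
    # than ~200 KiB per image would need a larger value (lmdb.MapFullError otherwise).
    estimated_size = len(image_files) * 100 * 1024  # 100 KiB per image estimate
    map_size = max(estimated_size * 2, 1024 * 1024 * 1024)  # At least 1 GiB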
    
    # Open LMDB environment
    env = lmdb.open(
        str(lmdb_path),
        map_size=map_size,
        writemap=True,
        map_async=True,
        max_dbs=0
    )
    
    # Write data
    with env.begin(write=True) as txn:
        for idx, img_path in enumerate(tqdm(image_files, desc=f"    Writing {split}")):
            # Read image bytes
            with open(img_path, 'rb') as f:
                img_bytes = f.read()
            
            # Get label
            class_name = img_path.parent.name
            if class_name not in class_to_label:
                raise KeyError(f"Class {class_name} missing from mapping for split {split}")
            label = class_to_label[class_name]
            
            # Create entry
            entry_bytes = create_lmdb_entry(img_bytes, label, compression)
            
            # Write to LMDB with index as key
            key = f"{idx:08d}".encode('utf-8')
            txn.put(key, entry_bytes)
        
        # Store metadata
        metadata = {
            'num_samples': len(image_files),
            'num_classes': len(class_to_label),
            'class_names': sorted(class_to_label, key=class_to_label.get),
            'compression': compression,
        }
        txn.put(b'__metadata__', pickle.dumps(metadata))
    
    # Close environment
    env.close()
    
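    # Calculate statistics.
    # This sums the files inside the .lmdb directory (data.mdb and lock.mdb).
    # Where data.mdb is preallocated to map_size (as the sizes logged by this
    # run suggest), the result is the reserved size, not the bytes actually used.
    lmdb_size = sum(f.stat().st_size for f in lmdb_path.iterdir())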
    
    return {
        'items': len(image_files),
        'bytes_on_disk': lmdb_size,
        'num_files': 1,  # LMDB is a single database (directory)
        'avg_file_size': lmdb_size,
    }


def build_lmdb(
    dataset_name: str,
    raw_path: Path,
    output_path: Path,
    compression: str,
    variant_name: str
):
    """
    Build LMDB for a dataset with specific configuration.
    
    Args:
        dataset_name: Name of dataset
        raw_path: Path to raw dataset directory
        output_path: Path to output directory
        compression: Compression type ('none', 'zstd', 'lz4')
        variant_name: Variant identifier
    
    Returns:
        Dictionary with build statistics
    """
    print(f"\nBuilding LMDB for {dataset_name} ({variant_name})...")
    print(f"  Source: {raw_path}")
    print(f"  Output: {output_path}")
    print(f"  Compression: {compression}")
    
    start_time = time.time()
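    # Extensions are lowercase only. rglob follows the platform's case rules,
    # so files such as 'IMG.JPG' match on Windows but are missed on Linux.
    image_extensions = {'.jpg', '.jpeg', '.png'}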
    all_class_names = set()
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            continue
        for ext in image_extensions:
            for img_path in split_dir.rglob(f'*{ext}'):
                all_class_names.add(img_path.parent.name)
    class_names = sorted(all_class_names)
    class_to_label = {name: idx for idx, name in enumerate(class_names)}
    print(f"  Total classes across splits: {len(class_names)}")
    
    # Process each split
    for split in ['train', 'val']:
        split_dir = raw_path / split
        if not split_dir.exists():
            print(f"  ⚠ {split} split not found, skipping")
            continue
        
        print(f"\n  Processing {split} split...")
        
        split_stats = build_lmdb_split(
            dataset_name, split, split_dir, output_path,
            compression, class_to_label, image_extensions
        )
        
        print(f"    ✓ {split}: {format_bytes(split_stats['bytes_on_disk'])}")
        
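        # Log to summary.
        # build_wall_s is cumulative from the start of build_lmdb (class scan
        # included), so the value logged for 'val' also contains the 'train' time.
        build_time = time.time() - start_time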
        row = {
            'stage': 'build',
            'dataset': dataset_name,
            'format': 'lmdb',
            'variant': variant_name,
            'split': split,
            'items': split_stats['items'],
            'bytes_on_disk': split_stats['bytes_on_disk'],
            'num_files': split_stats['num_files'],
            'avg_file_size': split_stats['avg_file_size'],
            'build_wall_s': build_time,
        }
        append_to_summary(SUMMARY_CSV, row)
    
    build_time = time.time() - start_time
    print(f"\n  ✓ Build completed in {build_time:.2f}s")
    
    return {'dataset': dataset_name, 'variant': variant_name, 'build_time': build_time}
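
The builder keys each sample by its zero-padded index. LMDB compares keys byte by byte by default, so the padding is what keeps cursor iteration in numeric order. The sketch below shows the difference. `make_key` and `make_unpadded_key` are hypothetical helpers that mirror the `f"{idx:08d}"` scheme used above.

```python
def make_key(idx: int) -> bytes:
    """Zero-padded key, matching the scheme used by the builder."""
    return f"{idx:08d}".encode("utf-8")


def make_unpadded_key(idx: int) -> bytes:
    """Counter-example: the same index without padding."""
    return str(idx).encode("utf-8")


indices = [0, 2, 10, 100]
padded = [make_key(i) for i in indices]
unpadded = [make_unpadded_key(i) for i in indices]

# Byte-wise sorting preserves numeric order only for the padded keys.
print(sorted(padded) == padded)      # True
print(sorted(unpadded) == unpadded)  # False

# The metadata key starts with '_' (0x5F), which sorts after the digits.
print(max(padded + [b"__metadata__"]))  # b'__metadata__'
```

An eight-digit key caps a split at 99,999,999 samples, and a reader that iterates with a cursor needs to skip the trailing `__metadata__` record.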


## Build LMDB for All Datasets and Variants

In [8]:
# Find available datasets
available_datasets = []
for dataset_name in ['cifar10', 'imagenet-mini', 'tiny-imagenet-200']:
    dataset_path = RAW_DIR / dataset_name
    if dataset_path.exists() and (dataset_path / 'train').exists():
        available_datasets.append(dataset_name)

print(f"Found {len(available_datasets)} dataset(s): {', '.join(available_datasets)}")
print("\n" + "="*60)

Found 2 dataset(s): cifar10, imagenet-mini



In [9]:
# Build all variants for all datasets
build_results = []

for dataset_name in available_datasets:
    raw_path = RAW_DIR / dataset_name
    
    for variant in VARIANTS:
        output_path = BUILT_DIR / dataset_name / 'lmdb' / variant['name']
        
        result = build_lmdb(
            dataset_name=dataset_name,
            raw_path=raw_path,
            output_path=output_path,
            compression=variant['compression'],
            variant_name=variant['name']
        )
        build_results.append(result)
        
        print("="*60)

print(f"\n✓ Built {len(build_results)} LMDB variant(s)")


Building LMDB for cifar10 (compress_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\lmdb\compress_none
  Compression: none
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 9.5 GB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1.9 GB

  ✓ Build completed in 39.96s

Building LMDB for cifar10 (compress_zstd)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\lmdb\compress_zstd
  Compression: zstd
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 9.5 GB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1.9 GB

  ✓ Build completed in 12.71s

Building LMDB for cifar10 (compress_lz4)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\cifar10
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\cifar10\lmdb\compress_lz4
  Compression: lz4
  Total classes across splits: 10

  Processing train split...
    Found 50,000 images in 10 classes


    ✓ train: 9.5 GB

  Processing val split...
    Found 10,000 images in 10 classes


    ✓ val: 1.9 GB

  ✓ Build completed in 10.98s

Building LMDB for imagenet-mini (compress_none)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\lmdb\compress_none
  Compression: none
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 6.6 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 1.0 GB

  ✓ Build completed in 58.38s

Building LMDB for imagenet-mini (compress_zstd)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\lmdb\compress_zstd
  Compression: zstd
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 6.6 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 1.0 GB

  ✓ Build completed in 72.92s

Building LMDB for imagenet-mini (compress_lz4)...
  Source: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\raw\imagenet-mini
  Output: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\data\built\imagenet-mini\lmdb\compress_lz4
  Compression: lz4
  Total classes across splits: 1000

  Processing train split...
    Found 34,745 images in 1000 classes


    ✓ train: 6.6 GB

  Processing val split...
    Found 3,923 images in 1000 classes


    ✓ val: 1.0 GB

  ✓ Build completed in 50.86s

✓ Built 6 LMDB variant(s)


## Verification

In [10]:
print("\nVerifying LMDB databases...\n")

for dataset_name in available_datasets:
    print(f"{dataset_name}:")
    
    for variant in VARIANTS:
        lmdb_dir = BUILT_DIR / dataset_name / 'lmdb' / variant['name']
        
        if not lmdb_dir.exists():
            print(f"  ✗ {variant['name']}: directory not found")
            continue
        
        # Check databases
        train_db = lmdb_dir / 'train.lmdb'
        val_db = lmdb_dir / 'val.lmdb'
        
        if not train_db.exists():
            print(f"  ✗ {variant['name']}: no train database found")
            continue
        
        # Calculate sizes
        train_size = sum(f.stat().st_size for f in train_db.iterdir())
        val_size = sum(f.stat().st_size for f in val_db.iterdir()) if val_db.exists() else 0
        
        print(f"  ✓ {variant['name']}:")
        print(f"      Train: {format_bytes(train_size)}")
        if val_db.exists():
            print(f"      Val: {format_bytes(val_size)}")
    
    print()


Verifying LMDB databases...

cifar10:
  ✓ compress_none:
      Train: 9.5 GB
      Val: 1.9 GB
  ✓ compress_zstd:
      Train: 9.5 GB
      Val: 1.9 GB
  ✓ compress_lz4:
      Train: 9.5 GB
      Val: 1.9 GB

imagenet-mini:
  ✓ compress_none:
      Train: 6.6 GB
      Val: 1.0 GB
  ✓ compress_zstd:
      Train: 6.6 GB
      Val: 1.0 GB
  ✓ compress_lz4:
      Train: 6.6 GB
      Val: 1.0 GB
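
The sizes above are identical across compression variants, and they coincide with the `map_size` computed in `build_lmdb_split` rather than with the amount of data written. The sketch below redoes that arithmetic for the four splits. `estimate_map_size` is a hypothetical helper that mirrors the builder's formula, and the conversion assumes `format_bytes` reports 1024-based units.

```python
GIB = 1024 ** 3


def estimate_map_size(num_images: int) -> int:
    """Same formula as build_lmdb_split: 100 KiB per image, doubled, 1 GiB floor."""
    estimated = num_images * 100 * 1024
    return max(estimated * 2, GIB)


splits = [
    ("cifar10/train", 50_000),
    ("cifar10/val", 10_000),
    ("imagenet-mini/train", 34_745),
    ("imagenet-mini/val", 3_923),
]

for name, num_images in splits:
    print(f"{name}: {estimate_map_size(num_images) / GIB:.1f} GiB")
# cifar10/train: 9.5 GiB
# cifar10/val: 1.9 GiB
# imagenet-mini/train: 6.6 GiB
# imagenet-mini/val: 1.0 GiB
```

The match suggests that on this run the data file was sized to the full map, in which case these figures say nothing about how much each compression variant actually saves.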



## Sample Data Inspection

In [11]:
# Inspect first database of first dataset/variant
if available_datasets and VARIANTS:
    dataset_name = available_datasets[0]
    variant = VARIANTS[0]
    lmdb_dir = BUILT_DIR / dataset_name / 'lmdb' / variant['name']
    
    train_db = lmdb_dir / 'train.lmdb'
    if train_db.exists():
        print(f"Inspecting train database of {dataset_name} ({variant['name']}):\n")
        print(f"Database: {train_db.name}")
        
        train_size = sum(f.stat().st_size for f in train_db.iterdir())
        print(f"Size: {format_bytes(train_size)}")
        
        # Open database and read metadata
        env = lmdb.open(str(train_db), readonly=True, lock=False)
        
        with env.begin() as txn:
            # Read metadata
            metadata_bytes = txn.get(b'__metadata__')
            if metadata_bytes:
                metadata = pickle.loads(metadata_bytes)
                print(f"\nMetadata:")
                print(f"  Num samples: {metadata['num_samples']:,}")
                print(f"  Num classes: {metadata['num_classes']}")
                print(f"  Compression: {metadata['compression']}")
            
            # Read first few samples
            print("\nFirst 5 samples:")
            for i in range(5):
                key = f"{i:08d}".encode('utf-8')
                entry_bytes = txn.get(key)
                
                if entry_bytes:
                    entry = pickle.loads(entry_bytes)
                    print(f"\nSample {i}:")
                    print(f"  Image bytes: {len(entry['image'])}")
                    print(f"  Label: {entry['label']}")
                    print(f"  Compression: {entry['compression']}")
        
        env.close()

Inspecting train database of cifar10 (compress_none):

Database: train.lmdb
Size: 9.5 GB

Metadata:
  Num samples: 50,000
  Num classes: 10
  Compression: none

First 5 samples:

Sample 0:
  Image bytes: 2063
  Label: 0
  Compression: none

Sample 1:
  Image bytes: 2169
  Label: 0
  Compression: none

Sample 2:
  Image bytes: 2237
  Label: 0
  Compression: none

Sample 3:
  Image bytes: 2104
  Label: 0
  Compression: none

Sample 4:
  Image bytes: 2181
  Label: 0
  Compression: none
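
The samples above are roughly 2 KB each, so 50,000 of them amount to something on the order of 100 MB, far below the 9.5 GB reported for the database. One way to estimate the space actually in use is to multiply the number of pages LMDB has allocated by its page size, using `Environment.info()` and `Environment.stat()` from py-lmdb. `lmdb_used_bytes` and `measure_lmdb` are hypothetical helpers, not part of this project.

```python
def lmdb_used_bytes(info: dict, stat: dict) -> int:
    """Bytes covered by pages LMDB has used so far: (last_pgno + 1) * psize."""
    return (info["last_pgno"] + 1) * stat["psize"]


def measure_lmdb(path) -> int:
    """Open an LMDB directory read-only and return its used size in bytes."""
    import lmdb  # imported lazily so the pure helper above has no dependency

    env = lmdb.open(str(path), readonly=True, lock=False)
    try:
        return lmdb_used_bytes(env.info(), env.stat())
    finally:
        env.close()


# Illustration with made-up page counts (not taken from this run):
print(lmdb_used_bytes({"last_pgno": 9}, {"psize": 4096}))  # 40960
```

Logging this value alongside `bytes_on_disk` would make the compression variants comparable even when the data file is preallocated.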


## Build Summary

In [12]:
# # Read and display summary
# if SUMMARY_CSV.exists() and SUMMARY_CSV.stat().st_size > 0:
#     import pandas as pd
#     summary_df = pd.read_csv(SUMMARY_CSV)
    
#     print("\nBuild Summary:")
#     print("="*80)
    
#     for dataset in summary_df['dataset'].unique():
#         print(f"\n{dataset}:")
#         dataset_df = summary_df[summary_df['dataset'] == dataset]
        
#         for variant in dataset_df['variant'].unique():
#             variant_df = dataset_df[dataset_df['variant'] == variant]
#             print(f"\n  {variant}:")
            
#             for _, row in variant_df.iterrows():
#                 print(f"    {row['split']}:")
#                 print(f"      Items: {row['items']:,}")
#                 print(f"      Size: {format_bytes(row['bytes_on_disk'])}")
    
#     print("\n" + "="*80)
#     print(f"\nSummary saved to: {SUMMARY_CSV}")
# else:
#     print("No summary data available")

In [13]:
# === Read and display LMDB build summary ===
import pandas as pd  # explicit import; pandas is not imported in the cells above

expected_cols = [
    "stage", "dataset", "format", "variant", "split",
    "items", "bytes_on_disk", "num_files", "avg_file_size",
    "build_wall_s", "timestamp"
]

if SUMMARY_CSV.exists() and SUMMARY_CSV.stat().st_size > 0:
    try:
        summary_df = pd.read_csv(SUMMARY_CSV)
        # If old file without headers, reload with defined column names
        if not set(expected_cols).issubset(summary_df.columns):
            summary_df = pd.read_csv(SUMMARY_CSV, names=expected_cols, header=None)
            summary_df.to_csv(SUMMARY_CSV, index=False)
    except pd.errors.ParserError:
        summary_df = pd.read_csv(SUMMARY_CSV, names=expected_cols, header=None)
        summary_df.to_csv(SUMMARY_CSV, index=False)
else:
    print("No summary data available")
    summary_df = None

if summary_df is not None and not summary_df.empty:
    # Filter only LMDB entries
    lmdb_df = summary_df[summary_df['format'] == 'lmdb']
    if lmdb_df.empty:
        print("No LMDB builds found in summary.")
    else:
        print("\nLMDB Build Summary:")
        print("=" * 80)

        for dataset in lmdb_df['dataset'].unique():
            print(f"\n{dataset}:")
            dataset_df = lmdb_df[lmdb_df['dataset'] == dataset]

            for variant in dataset_df['variant'].unique():
                variant_df = dataset_df[dataset_df['variant'] == variant]
                print(f"\n  {variant}:")

                for _, row in variant_df.iterrows():
                    print(f"    {row['split']}:")
                    print(f"      Items: {int(row['items']):,}")
                    print(f"      Size: {format_bytes(row['bytes_on_disk'])}")
                    if 'num_files' in row and not pd.isna(row['num_files']):
                        print(f"      Files: {int(row['num_files'])}")
                    if 'avg_file_size' in row and not pd.isna(row['avg_file_size']):
                        print(f"      Avg file: {format_bytes(row['avg_file_size'])}")
                    if 'build_wall_s' in row and not pd.isna(row['build_wall_s']):
                        print(f"      Build time: {row['build_wall_s']:.2f}s")

        print("\n" + "=" * 80)
        print(f"\nSummary saved to: {SUMMARY_CSV}")
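elif summary_df is not None:
    # The missing-file case was already reported above; this covers a file with no rows
    print("Summary file contains no rows")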


LMDB Build Summary:

cifar10:

  compress_none:
    train:
      Items: 50,000
      Size: 9.5 GB
      Files: 1
      Avg file: 9.5 GB
      Build time: 34.01s
    val:
      Items: 10,000
      Size: 1.9 GB
      Files: 1
      Avg file: 1.9 GB
      Build time: 39.95s

  compress_zstd:
    train:
      Items: 50,000
      Size: 9.5 GB
      Files: 1
      Avg file: 9.5 GB
      Build time: 10.74s
    val:
      Items: 10,000
      Size: 1.9 GB
      Files: 1
      Avg file: 1.9 GB
      Build time: 12.71s

  compress_lz4:
    train:
      Items: 50,000
      Size: 9.5 GB
      Files: 1
      Avg file: 9.5 GB
      Build time: 9.27s
    val:
      Items: 10,000
      Size: 1.9 GB
      Files: 1
      Avg file: 1.9 GB
      Build time: 10.97s

imagenet-mini:

  compress_none:
    train:
      Items: 34,745
      Size: 6.6 GB
      Files: 1
      Avg file: 6.6 GB
      Build time: 53.34s
    val:
      Items: 3,923
      Size: 1.0 GB
      Files: 1
      Avg file: 1.0 GB
      Build t
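
Since `build_wall_s` is measured from the start of `build_lmdb`, the value logged for each split is cumulative: the `val` figure includes the `train` time. A per-split duration can be recovered by differencing consecutive rows of the same dataset and variant. The sketch below does this for the cifar10 `compress_none` rows shown above. `per_split_seconds` is a hypothetical helper.

```python
def per_split_seconds(cumulative):
    """Convert (split, cumulative_seconds) pairs, in build order, to durations."""
    durations = {}
    previous = 0.0
    for split, total in cumulative:
        durations[split] = round(total - previous, 2)
        previous = total
    return durations


rows = [("train", 34.01), ("val", 39.95)]
print(per_split_seconds(rows))  # {'train': 34.01, 'val': 5.94}
```

The `train` figure still includes the class-discovery scan that runs before the first split is written.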

## ✅ LMDB Build Complete

**What was created:**
- LMDB databases (one directory per split)
- Memory-mapped for fast random access
- Multiple variants with different compression options

**Variants built:**
- `compress_none`: No compression
- `compress_zstd`: Zstandard compression (if available)
- `compress_lz4`: LZ4 compression (if available)

**Output locations:**
- `data/built/<dataset>/lmdb/<variant>/<split>.lmdb/`

**Next steps:**
1. Run `14_loader_lmdb.ipynb` to create the dataloader
2. Then run training experiments (20-21)
3. Finally run analysis notebooks (30-31, 40)