# 30 - Analysis Summary

This notebook aggregates and summarizes all experimental results:
- Build statistics (disk usage, build time)
- Baseline training results (throughput, accuracy)
- Scaling experiments (batch size, workers)
- Resource utilization (GPU, CPU, disk I/O)

**Output:**
- Comprehensive summary tables
- Statistical comparisons
- Performance rankings
- Key findings and insights

In [1]:
import os
import sys
from pathlib import Path
from collections import defaultdict

import pandas as pd
import numpy as np

# Load common utilities
%run ./10_common_utils.ipynb

✓ Common utilities loaded successfully

Available functions:
  - set_seed(seed)
  - get_transforms(augment)
  - write_sysinfo(path)
  - time_first_batch(dataloader, device)
  - start_monitor(log_path, interval)
  - stop_monitor(thread, stop_event)
  - append_to_summary(path, row_dict)
  - compute_metrics_from_logs(log_path)
  - get_device()
  - format_bytes(bytes)
  - count_parameters(model)

Constants:
  - STANDARD_TRANSFORM


## Configuration

In [2]:
# Detect environment
IS_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
BASE_DIR = Path('/kaggle/working/format-matters') if IS_KAGGLE else Path('..').resolve()

RUNS_DIR = BASE_DIR / 'runs'

print(f"Environment: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Base directory: {BASE_DIR}")
print(f"Runs directory: {RUNS_DIR}")

Environment: Local
Base directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters
Runs directory: C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\runs


## Load All Results

In [3]:
def load_all_summaries(runs_dir):
    """
    Load all summary.csv files from runs directory.
    
    Args:
        runs_dir: Path to runs directory
    
    Returns:
        Dictionary of DataFrames by experiment type
    """
    summaries = {
        'builds': [],
        'train_baselines': [],
        'train_scaling': [],
    }
    
    if not runs_dir.exists():
        print(f"⚠ Runs directory not found: {runs_dir}")
        return summaries
    
    # Find all summary.csv files
    for summary_file in runs_dir.rglob('summary.csv'):
        if summary_file.stat().st_size == 0:
            continue
        
        try:
            df = pd.read_csv(summary_file)
            
            # Determine experiment type from path
            if 'builds' in str(summary_file):
                summaries['builds'].append(df)
            elif 'train_baselines' in str(summary_file):
                summaries['train_baselines'].append(df)
            elif 'train_scaling' in str(summary_file):
                summaries['train_scaling'].append(df)
        except Exception as e:
            print(f"⚠ Failed to load {summary_file}: {e}")
    
    # Concatenate DataFrames
    for key in summaries:
        if summaries[key]:
            summaries[key] = pd.concat(summaries[key], ignore_index=True)
            print(f"✓ Loaded {len(summaries[key])} rows from {key}")
        else:
            summaries[key] = pd.DataFrame()
            print(f"⚠ No data found for {key}")
    
    return summaries

# Load all results
results = load_all_summaries(RUNS_DIR)

✓ Loaded 68 rows from builds
✓ Loaded 12 rows from train_baselines
⚠ No data found for train_scaling
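
The loader above assigns each `summary.csv` to an experiment type by checking whether the type name occurs anywhere in the full path string, so a parent directory with a similar name could cause a misclassification. A minimal sketch of a stricter variant, assuming each experiment type appears as a directory name below the runs directory; the helper name `classify_summary` and the example paths are hypothetical:

```python
from pathlib import PurePosixPath

EXPERIMENT_TYPES = ("builds", "train_baselines", "train_scaling")

def classify_summary(path, runs_dir):
    """Return the experiment type that is a directory component of `path`
    relative to `runs_dir`, or None if no component matches."""
    relative_parts = PurePosixPath(path).relative_to(runs_dir).parts
    for experiment in EXPERIMENT_TYPES:
        if experiment in relative_parts:
            return experiment
    return None

# Hypothetical layout: the parent path contains "builds" as a substring,
# which a plain substring test on the full path would match first.
runs = PurePosixPath("/data/builds_backup/runs")
print(classify_summary(runs / "train_baselines" / "lmdb" / "summary.csv", runs))
print(classify_summary(runs / "misc" / "summary.csv", runs))
```

Matching on whole path components relative to the runs directory ignores anything above it, at the cost of requiring exact directory names.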


## 1. Build Statistics Summary

In [4]:
if not results['builds'].empty:
    print("\n" + "="*80)
    print("BUILD STATISTICS SUMMARY")
    print("="*80)
    
    builds_df = results['builds']
    
    # Group by format and variant
    summary = builds_df.groupby(['format', 'variant']).agg({
        'items': 'sum',
        'bytes_on_disk': 'sum',
        'num_files': 'sum',
        'build_wall_s': 'mean',
    }).reset_index()
    
    print("\nDisk Usage by Format:\n")
    print(f"{'Format':<15} {'Variant':<20} {'Items':<10} {'Size':<15} {'Files':<8} {'Build Time':<12}")
    print("-" * 90)
    
    for _, row in summary.iterrows():
        print(f"{row['format']:<15} {row['variant']:<20} "
              f"{int(row['items']):<10} {format_bytes(row['bytes_on_disk']):<15} "
              f"{int(row['num_files']):<8} {row['build_wall_s']:.2f}s")
    
    # Compression ratios
    print("\n\nCompression Analysis:\n")
    
    # Compare compressed vs uncompressed variants
    for format_name in summary['format'].unique():
        format_data = summary[summary['format'] == format_name]
        
        if len(format_data) > 1:
            print(f"\n{format_name.upper()}:")
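            # Baseline is the first variant in sorted order (for LMDB, compress_lz4),
            # which is not necessarily the uncompressed variant
            baseline = format_data.iloc[0]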
            
            for _, variant in format_data.iterrows():
                ratio = variant['bytes_on_disk'] / baseline['bytes_on_disk']
                savings = (1 - ratio) * 100
                print(f"  {variant['variant']:<25} {format_bytes(variant['bytes_on_disk']):<15} "
                      f"({savings:+.1f}% vs baseline)")
else:
    print("\n⚠ No build statistics available")


BUILD STATISTICS SUMMARY

Disk Usage by Format:

Format          Variant              Items      Size            Files    Build Time  
------------------------------------------------------------------------------------------
csv             default              296004     35.4 MB         8        29.39s
lmdb            compress_lz4         98668      19.1 GB         4        29.19s
lmdb            compress_none        98668      19.1 GB         4        46.42s
lmdb            compress_zstd        98668      19.1 GB         4        40.66s
tfrecord        shard1024_gzip       98668      4.0 GB          7        116.29s
tfrecord        shard1024_none       98668      4.1 GB          7        44.95s
tfrecord        shard256_gzip        98668      4.0 GB          18       122.70s
tfrecord        shard256_none        98668      4.1 GB          18       37.93s
tfrecord        shard64_gzip         98668      4.0 GB          67       117.32s
tfrecord        shard64_none         98668      4.
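
The compression analysis computes, for each variant, `ratio = bytes_on_disk / baseline_bytes` and `savings = (1 - ratio) * 100`, taking the first row of each format group as the baseline. A minimal sketch of the same arithmetic with the uncompressed variant chosen explicitly as the baseline; the sizes are hypothetical round numbers, not values from these runs:

```python
import pandas as pd

# Hypothetical sizes in bytes, chosen only to make the arithmetic easy to follow.
sizes = pd.DataFrame({
    "variant": ["compress_none", "compress_zstd"],
    "bytes_on_disk": [1000, 750],
})

# Pick the uncompressed variant explicitly instead of relying on row order.
is_baseline = sizes["variant"] == "compress_none"
baseline_bytes = sizes.loc[is_baseline, "bytes_on_disk"].iloc[0]

# ratio < 1 means the variant is smaller than the baseline.
sizes["ratio"] = sizes["bytes_on_disk"] / baseline_bytes
sizes["savings_pct"] = (1 - sizes["ratio"]) * 100

print(sizes.to_string(index=False))
```

With these inputs the compressed variant has a ratio of 0.75, which corresponds to a 25% saving against the uncompressed baseline.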

## 2. Training Performance Summary

In [5]:
if not results['train_baselines'].empty:
    print("\n" + "="*80)
    print("TRAINING PERFORMANCE SUMMARY")
    print("="*80)
    
    train_df = results['train_baselines']
    
    # Get final epoch results
    final_epoch = train_df.groupby(['format', 'variant'])['epoch'].max().reset_index()
    final_results = train_df.merge(final_epoch, on=['format', 'variant', 'epoch'])
    
    print("\nFinal Epoch Performance:\n")
    print(f"{'Format':<15} {'Variant':<20} {'Train Acc':<12} {'Val Acc':<12} {'Throughput':<15}")
    print("-" * 80)
    
    for _, row in final_results.iterrows():
        print(f"{row['format']:<15} {row['variant']:<20} "
              f"{row['train_acc']:>10.2f}% {row['val_acc']:>10.2f}% "
              f"{row['train_samples_per_sec']:>12.1f} samp/s")
    
    # Throughput comparison
    print("\n\nThroughput Ranking:\n")
    ranked = final_results.sort_values('train_samples_per_sec', ascending=False)
    
    # Baseline is the slowest run (last row after the descending sort)
    baseline_throughput = ranked.iloc[-1]['train_samples_per_sec']
    
    print(f"{'Rank':<6} {'Format':<15} {'Variant':<20} {'Throughput':<15} {'vs Baseline':<15}")
    print("-" * 80)
    
    for rank, (_, row) in enumerate(ranked.iterrows(), 1):
        speedup = row['train_samples_per_sec'] / baseline_throughput
        print(f"{rank:<6} {row['format']:<15} {row['variant']:<20} "
              f"{row['train_samples_per_sec']:>12.1f} samp/s {speedup:>12.2f}x")
    
    # Resource utilization
    print("\n\nResource Utilization:\n")
    print(f"{'Format':<15} {'Variant':<20} {'GPU %':<10} {'CPU %':<10} {'Disk R':<12} {'Disk W':<12}")
    print("-" * 85)
    
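    for _, row in final_results.iterrows():
        # NaN is truthy, so `value or 0` would pass it through; check with pd.isna
        gpu = 0 if pd.isna(row.get('gpu_util_mean')) else row['gpu_util_mean']
        cpu = 0 if pd.isna(row.get('cpu_util_mean')) else row['cpu_util_mean']
        disk_r = 0 if pd.isna(row.get('disk_read_mb_s_mean')) else row['disk_read_mb_s_mean']
        disk_w = 0 if pd.isna(row.get('disk_write_mb_s_mean')) else row['disk_write_mb_s_mean']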
        
        print(f"{row['format']:<15} {row['variant']:<20} "
              f"{gpu:>8.1f}% {cpu:>8.1f}% "
              f"{disk_r:>9.2f} MB/s {disk_w:>9.2f} MB/s")
else:
    print("\n⚠ No training baseline results available")


TRAINING PERFORMANCE SUMMARY

Final Epoch Performance:

Format          Variant              Train Acc    Val Acc      Throughput     
--------------------------------------------------------------------------------
webdataset      shard256_none             61.74%      43.45%         20.5 samp/s
csv             default                   59.87%      61.41%         21.3 samp/s
tfrecord        shard256_none             59.04%      58.42%         21.2 samp/s
lmdb            compress_none             59.36%      59.79%         21.5 samp/s


Throughput Ranking:

Rank   Format          Variant              Throughput      vs Baseline    
--------------------------------------------------------------------------------
1      lmdb            compress_none                21.5 samp/s         1.05x
2      csv             default                      21.3 samp/s         1.04x
3      tfrecord        shard256_none                21.2 samp/s         1.04x
4      webdataset      shard256_none         
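
The training summary keeps only the last epoch of each run by taking the maximum epoch per (format, variant) pair and merging it back onto the log, then ranks runs relative to the slowest one. A minimal sketch of those steps on a hypothetical two-run log (the throughput numbers are illustrative, not measured):

```python
import pandas as pd

# Hypothetical per-epoch log: two formats, two epochs each.
log = pd.DataFrame({
    "format": ["lmdb", "lmdb", "csv", "csv"],
    "variant": ["compress_none", "compress_none", "default", "default"],
    "epoch": [1, 2, 1, 2],
    "train_samples_per_sec": [20.0, 22.0, 19.0, 20.0],
})

# Step 1: highest epoch number seen for each (format, variant) pair.
last_epoch = log.groupby(["format", "variant"])["epoch"].max().reset_index()

# Step 2: the inner merge keeps only rows whose epoch equals that maximum.
final = log.merge(last_epoch, on=["format", "variant", "epoch"])

# Step 3: rank by throughput and express each run relative to the slowest.
ranked = final.sort_values("train_samples_per_sec", ascending=False)
ranked = ranked.reset_index(drop=True)
slowest = ranked["train_samples_per_sec"].min()
ranked["vs_slowest"] = ranked["train_samples_per_sec"] / slowest

print(ranked.to_string(index=False))
```

If the same epoch were logged more than once for a run, the merge would keep every matching row, so duplicates in the log would carry through to the ranking.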

## 3. Scaling Analysis

In [6]:
if not results['train_scaling'].empty:
    print("\n" + "="*80)
    print("SCALING ANALYSIS")
    print("="*80)
    
    scaling_df = results['train_scaling']
    
    # Batch size scaling
    print("\nBatch Size Scaling:\n")
    
    for format_name in scaling_df['format'].unique():
        format_data = scaling_df[
            (scaling_df['format'] == format_name) & 
            (scaling_df['num_workers'] == 4)  # Fixed workers
        ].sort_values('batch_size')
        
        if not format_data.empty:
            print(f"\n{format_name.upper()}:")
            print(f"  {'Batch Size':<12} {'Throughput':<20} {'GPU Util %':<12}")
            print("  " + "-" * 50)
            
            for _, row in format_data.iterrows():
                gpu = 0 if pd.isna(row.get('gpu_util_mean')) else row['gpu_util_mean']
                print(f"  {row['batch_size']:<12} {row['samples_per_sec']:>17.1f} samp/s {gpu:>10.1f}%")
    
    # Worker scaling
    print("\n\nWorker Scaling:\n")
    
    for format_name in scaling_df['format'].unique():
        format_data = scaling_df[
            (scaling_df['format'] == format_name) & 
            (scaling_df['batch_size'] == 64)  # Fixed batch size
        ].sort_values('num_workers')
        
        if not format_data.empty:
            print(f"\n{format_name.upper()}:")
            print(f"  {'Workers':<10} {'Throughput':<20} {'CPU Util %':<12}")
            print("  " + "-" * 50)
            
            for _, row in format_data.iterrows():
                cpu = 0 if pd.isna(row.get('cpu_util_mean')) else row['cpu_util_mean']
                print(f"  {row['num_workers']:<10} {row['samples_per_sec']:>17.1f} samp/s {cpu:>10.1f}%")
    
    # Best configurations
    print("\n\nOptimal Configurations:\n")
    print(f"{'Format':<15} {'Batch Size':<12} {'Workers':<10} {'Throughput':<20}")
    print("-" * 65)
    
    for format_name in scaling_df['format'].unique():
        format_data = scaling_df[scaling_df['format'] == format_name]
        best = format_data.loc[format_data['samples_per_sec'].idxmax()]
        print(f"{best['format']:<15} {best['batch_size']:<12} "
              f"{best['num_workers']:<10} {best['samples_per_sec']:>17.1f} samp/s")
else:
    print("\n⚠ No scaling results available")


⚠ No scaling results available
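
Because no scaling data was found, the scaling cell produced no tables in this run. A minimal sketch of how the per-format optimum and the batch-size scaling factor are derived, using hypothetical results (the column names follow the cell above; the numbers are illustrative, not measured):

```python
import pandas as pd

# Hypothetical scaling results with a fixed worker count of 4.
scaling = pd.DataFrame({
    "format": ["lmdb", "lmdb", "lmdb", "csv", "csv", "csv"],
    "batch_size": [32, 64, 128, 32, 64, 128],
    "num_workers": [4, 4, 4, 4, 4, 4],
    "samples_per_sec": [100.0, 180.0, 240.0, 90.0, 150.0, 140.0],
})

# Row label of the highest-throughput run within each format.
best_rows = scaling.groupby("format")["samples_per_sec"].idxmax()
best = scaling.loc[best_rows].set_index("format")

# Scaling factor: throughput at the largest batch size over the smallest.
by_batch = scaling.sort_values("batch_size").groupby("format")["samples_per_sec"]
scaling_factor = by_batch.last() / by_batch.first()

print(best[["batch_size", "num_workers", "samples_per_sec"]])
print(scaling_factor.round(2))
```

In this toy data the second format peaks at batch size 64, so its best configuration differs from the endpoint used for the scaling factor; the two numbers answer different questions.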


## 4. Key Findings

In [7]:
print("\n" + "="*80)
print("KEY FINDINGS")
print("="*80)

findings = []

# Disk usage findings
if not results['builds'].empty:
    builds_df = results['builds']
    total_by_format = builds_df.groupby('format')['bytes_on_disk'].sum()
    most_compact = total_by_format.idxmin()
    least_compact = total_by_format.idxmax()
    
    findings.append(f"\n📊 DISK USAGE:")
    findings.append(f"  • Most compact format: {most_compact.upper()} ({format_bytes(total_by_format[most_compact])})")
    findings.append(f"  • Largest format: {least_compact.upper()} ({format_bytes(total_by_format[least_compact])})")
    
    ratio = total_by_format[least_compact] / total_by_format[most_compact]
    findings.append(f"  • Size difference: {ratio:.2f}x")

# Training performance findings
if not results['train_baselines'].empty:
    train_df = results['train_baselines']
    final_epoch = train_df.groupby(['format', 'variant'])['epoch'].max().reset_index()
    final_results = train_df.merge(final_epoch, on=['format', 'variant', 'epoch'])
    
    fastest = final_results.loc[final_results['train_samples_per_sec'].idxmax()]
    slowest = final_results.loc[final_results['train_samples_per_sec'].idxmin()]
    
    findings.append(f"\n⚡ TRAINING THROUGHPUT:")
    findings.append(f"  • Fastest format: {fastest['format'].upper()} ({fastest['train_samples_per_sec']:.1f} samples/s)")
    findings.append(f"  • Slowest format: {slowest['format'].upper()} ({slowest['train_samples_per_sec']:.1f} samples/s)")
    
    speedup = fastest['train_samples_per_sec'] / slowest['train_samples_per_sec']
    findings.append(f"  • Performance difference: {speedup:.2f}x")
    
    # GPU utilization
    if 'gpu_util_mean' in final_results.columns:
        # Check if there are any non-NaN GPU values
        gpu_values = final_results['gpu_util_mean'].dropna()
        if len(gpu_values) > 0:
            avg_gpu = gpu_values.mean()
            findings.append(f"\n🎮 GPU UTILIZATION:")
            findings.append(f"  • Average GPU utilization: {avg_gpu:.1f}%")
            
            best_gpu = final_results.loc[final_results['gpu_util_mean'].idxmax()]
            findings.append(f"  • Best GPU utilization: {best_gpu['format'].upper()} ({best_gpu['gpu_util_mean']:.1f}%)")
        else:
            findings.append(f"\n🎮 GPU UTILIZATION:")
            findings.append(f"  • No GPU detected (CPU-only training)")

# Scaling findings
if not results['train_scaling'].empty:
    scaling_df = results['train_scaling']
    
    findings.append(f"\n📈 SCALING CHARACTERISTICS:")
    
    # Best batch size scaling
    for format_name in scaling_df['format'].unique():
        format_data = scaling_df[
            (scaling_df['format'] == format_name) & 
            (scaling_df['num_workers'] == 4)
        ].sort_values('batch_size')
        
        if len(format_data) >= 2:
            first = format_data.iloc[0]
            last = format_data.iloc[-1]
            scaling_factor = last['samples_per_sec'] / first['samples_per_sec']
            findings.append(f"  • {format_name.upper()} batch scaling: {scaling_factor:.2f}x improvement")

# Print all findings
for finding in findings:
    print(finding)

print("\n" + "="*80)


KEY FINDINGS

📊 DISK USAGE:
  • Most compact format: CSV (35.4 MB)
  • Largest format: LMDB (57.2 GB)
  • Size difference: 1653.86x

⚡ TRAINING THROUGHPUT:
  • Fastest format: LMDB (21.5 samples/s)
  • Slowest format: WEBDATASET (20.5 samples/s)
  • Performance difference: 1.05x

🎮 GPU UTILIZATION:
  • No GPU detected (CPU-only training)
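
The disk-usage totals above sum `bytes_on_disk` over every variant of a format, so the LMDB figure appears to combine its three variants of roughly 19.1 GB each, and a format with more variants looks larger than one with fewer. A minimal sketch of an alternative comparison that uses the smallest variant of each format, on hypothetical data (format names and sizes are illustrative only):

```python
import pandas as pd

# Hypothetical build rows: format "a" has two variants, format "b" has one.
builds = pd.DataFrame({
    "format": ["a", "a", "b"],
    "variant": ["none", "zstd", "default"],
    "bytes_on_disk": [1000, 600, 900],
})

# Summing over variants (as in the findings cell) counts format "a" twice.
total_by_format = builds.groupby("format")["bytes_on_disk"].sum()

# Alternative: size of each variant first, then the smallest variant per format.
per_variant = builds.groupby(["format", "variant"])["bytes_on_disk"].sum()
smallest_variant = per_variant.groupby(level="format").min()

print("most compact by total:", total_by_format.idxmin())
print("most compact by smallest variant:", smallest_variant.idxmin())
```

The two rankings disagree on this toy data, which is why the choice of aggregation is worth stating next to any "most compact" claim.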



## 5. Export Summary Report

In [8]:
# Create summary report directory
REPORT_DIR = BASE_DIR / 'reports'
REPORT_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

# Export summary tables
if not results['builds'].empty:
    builds_summary = results['builds'].groupby(['format', 'variant']).agg({
        'items': 'sum',
        'bytes_on_disk': 'sum',
        'num_files': 'sum',
        'build_wall_s': 'mean',
    }).reset_index()
    builds_summary.to_csv(REPORT_DIR / 'builds_summary.csv', index=False)
    print(f"✓ Exported builds summary to {REPORT_DIR / 'builds_summary.csv'}")

if not results['train_baselines'].empty:
    final_epoch = results['train_baselines'].groupby(['format', 'variant'])['epoch'].max().reset_index()
    train_summary = results['train_baselines'].merge(final_epoch, on=['format', 'variant', 'epoch'])
    train_summary.to_csv(REPORT_DIR / 'training_summary.csv', index=False)
    print(f"✓ Exported training summary to {REPORT_DIR / 'training_summary.csv'}")

if not results['train_scaling'].empty:
    results['train_scaling'].to_csv(REPORT_DIR / 'scaling_summary.csv', index=False)
    print(f"✓ Exported scaling summary to {REPORT_DIR / 'scaling_summary.csv'}")

print(f"\n✓ All summaries exported to {REPORT_DIR}")

✓ Exported builds summary to C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\reports\builds_summary.csv
✓ Exported training summary to C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\reports\training_summary.csv

✓ All summaries exported to C:\Users\arjya\Fall 2025\Systems for ML\Project 1\SML\format-matters\reports
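
A quick way to confirm that an exported report is usable downstream is to read it back and compare it with the in-memory table. A minimal sketch using a temporary directory in place of the real reports directory; the two rows reuse build times from the table in section 1, and the variable names are illustrative:

```python
import tempfile
from pathlib import Path

import pandas as pd

summary = pd.DataFrame({
    "format": ["lmdb", "csv"],
    "variant": ["compress_none", "default"],
    "build_wall_s": [46.42, 29.39],
})

with tempfile.TemporaryDirectory() as tmp:
    # parents=True creates any missing intermediate directories as well.
    report_dir = Path(tmp) / "reports"
    report_dir.mkdir(parents=True, exist_ok=True)

    out_path = report_dir / "builds_summary.csv"
    summary.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    restored = pd.read_csv(out_path)

print(list(restored.columns))
print(len(restored), "rows restored")
```

Writing with `index=False` keeps the row index out of the file, so the restored table has the same columns as the original.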


## ✅ Analysis Summary Complete

**What was analyzed:**
- Build statistics (disk usage, compression ratios)
- Training performance (throughput, accuracy)
- Resource utilization (GPU, CPU, disk I/O)
- Scaling characteristics (batch size, workers); no scaling data was found in this run
- Optimal configurations per format (requires scaling data)

**Key Outputs:**
- Comprehensive summary tables
- Performance rankings
- Key findings and insights
- Exported CSV reports

**Next steps:**
1. Create visualizations (31_analysis_plots.ipynb)
2. Generate decision guide (40_decision_guide.ipynb)