# UMAP Benchmark: Preprocess Results

## Overview

This notebook preprocesses UMAP benchmark results by flattening nested JSON files into a single CSV table suitable for statistical analysis.

**Key Features:**
- Extracts performance metrics from benchmark runs
- Combines machine metadata, git information, and UI performance data
- Automatically skips incomplete results (those with no metric rows, whether the metrics are missing or empty)
- Outputs a clean, analysis-ready CSV file

**Input:** Multiple JSON files in `../results/` following the pattern `bench-runs-*.json`

**Output:** `../outputs/preprocessed.csv` with all benchmark data in tabular format

## Setup and Configuration

Import required libraries and configure paths:
- `results_dir`: Directory containing benchmark JSON files
- `output_path`: Path for the output CSV file

In [None]:
import json
import glob
import os
import pandas as pd

# Paths relative to bench/analysis/ folder
results_dir = "../results"
output_path = "../outputs/preprocessed.csv"


## Data Processing

The following cell processes each benchmark JSON file and extracts:

1. **Top-level metadata**: Generation time, run count, WASM configuration
2. **Machine information**: Platform, CPU, memory, load averages
3. **Git context**: Commit hash, branch, dirty status
4. **Result metrics**: Run number, duration, and the `stats` start time and duration
5. **Performance data**: Runtime, memory usage, trustworthiness, FPS, responsiveness

Each metric row (one per dataset) becomes a separate record in the output, with all parent context preserved.
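
The idea of one record per metric row, each carrying its parent context, can be sketched as follows. This is a simplified illustration rather than the notebook's full extraction: the `file_context` and `result` dictionaries and all their values are made up, and only a few fields are copied.

```python
# Hypothetical file-level context and a single result (values are invented)
file_context = {"git_commit": "abc123", "machine_platform": "linux"}
result = {
    "run": 1,
    "metrics": [
        {"datasetName": "a", "runtimeMs": 10.0},
        {"datasetName": "b", "runtimeMs": 20.0},
    ],
}

records = []
for metric in result["metrics"]:
    # Copy the context so every record owns its own dictionary
    record = dict(file_context)
    record["result_run"] = result["run"]
    record["dataset_name"] = metric.get("datasetName")
    record["runtime_ms"] = metric.get("runtimeMs")
    records.append(record)

print(len(records))
print(records[0])
```

Copying the context with `dict(...)` before adding per-row fields keeps the records independent, so a later change to one record cannot leak into another.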

Note: Supports both the new flat `metrics[]` format and the legacy `uiMetrics[].rows[]` format for backward compatibility.
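
The format fallback can be isolated into a small helper. The sketch below is an illustration under assumptions, not the notebook's actual code path: the function name `extract_metric_rows` and the sample dictionaries are hypothetical, while the keys `metrics`, `uiMetrics`, and `rows` are the ones named in the note above.

```python
def extract_metric_rows(result):
    """Return the per-dataset metric rows of one result, whichever format it uses.

    Hypothetical helper for illustration only.
    """
    metrics = result.get("metrics")
    if metrics is not None:
        # New format: rows are stored directly under "metrics"
        return list(metrics)
    # Legacy format: rows are nested under each "uiMetrics" entry
    rows = []
    for ui in result.get("uiMetrics") or []:
        rows.extend(ui.get("rows") or [])
    return rows


# Made-up sample results: new format, legacy format, and no metrics at all
new_style = {"run": 1, "metrics": [{"datasetName": "a"}, {"datasetName": "b"}]}
old_style = {"run": 2, "uiMetrics": [{"rows": [{"datasetName": "a"}]}, {"rows": None}]}
empty = {"run": 3}

print(len(extract_metric_rows(new_style)))
print(len(extract_metric_rows(old_style)))
print(len(extract_metric_rows(empty)))
```

A result for which the helper returns an empty list corresponds to the "incomplete" case that is skipped during processing.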

In [None]:
rows = []
files = sorted(glob.glob(os.path.join(results_dir, "bench-runs-*.json")))

for path in files:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    top = {
        "generated_at": data.get("generatedAt"),
        "runs_declared": data.get("runs"),
        "wasm_features_file": data.get("wasmFeatures"),
        "wasm_preload": data.get("wasmPreload"),
    }

    machine = data.get("machine") or {}
    # Pad loadAvg to three entries so a missing or short list yields None instead of an IndexError
    load_avg = (list(machine.get("loadAvg") or []) + [None] * 3)[:3]
    top.update({
        "machine_platform": machine.get("platform"),
        "machine_release": machine.get("release"),
        "machine_arch": machine.get("arch"),
        "cpu_model": machine.get("cpuModel"),
        "cpu_cores": machine.get("cpuCores"),
        "total_mem_bytes": machine.get("totalMemBytes"),
        "load_avg_1": load_avg[0],
        "load_avg_5": load_avg[1],
        "load_avg_15": load_avg[2],
        "hostname": machine.get("hostname"),
    })

    git = data.get("git") or {}
    top.update({
        "git_commit": git.get("commit"),
        "git_branch": git.get("branch"),
        "git_status_dirty": git.get("statusDirty"),
    })

    for result in data.get("results") or []:
        # Support both old format (uiMetrics[].rows[]) and new format (metrics[])
        metrics_list = result.get("metrics")
        if metrics_list is None:
            # Old format fallback
            ui_metrics = result.get("uiMetrics") or []
            metrics_list = []
            for ui in ui_metrics:
                metrics_list.extend(ui.get("rows") or [])
        
        if not metrics_list:
            continue
            
        base = dict(top)
        base.update({
            "result_run": result.get("run"),
            "result_duration_ms": result.get("durationMs"),
        })

        stats = result.get("stats") or {}
        base.update({
            "stats_start_time": stats.get("startTime"),
            "stats_duration_ms": stats.get("duration"),
        })

        for row in metrics_list:
            # Do not infer renderingEnabled. If missing, default to False.
            rendering_enabled = row.get("renderingEnabled")
            if rendering_enabled is None:
                rendering_enabled = False
            
            out = dict(base)
            out.update({
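                # Fall back to the old field name only when datasetIndex is absent; a plain `or` would discard index 0
                "dataset_index": row["datasetIndex"] if row.get("datasetIndex") is not None else row.get("runId"),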
                "timestamp": row.get("timestamp"),
                "dataset_name": row.get("datasetName"),
                "dataset_size": row.get("datasetSize"),
                "dimensions": row.get("dimensions"),
                "wasm_features": row.get("wasmFeatures"),
                "rendering_enabled": rendering_enabled,
                "runtime_ms": row.get("runtimeMs"),
                "memory_delta_mb": row.get("memoryDeltaMb"),
                "trustworthiness": row.get("trustworthiness"),
                "fps_avg": row.get("fpsAvg"),
                "responsiveness_ms": row.get("responsivenessMs"),
            })
            rows.append(out)

## Processing Summary

Review the data extraction results:


In [None]:
print(f"Files processed: {len(files)}")
print(f"Total rows extracted: {len(rows)}")
print("\nSample of first row keys:")
if rows:
    for key in list(rows[0].keys())[:10]:
        print(f"  - {key}")


## Create DataFrame and Export

Convert the extracted rows into a pandas DataFrame with a consistent column order, then export to CSV. The output file will be ready for statistical analysis.
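
How `reindex(columns=...)` enforces a consistent column order can be seen on a toy example. This is a minimal sketch with made-up records and a shortened, hypothetical column list (`wanted`): columns absent from the data are created and filled with missing values, and columns not in the list are dropped.

```python
import pandas as pd

# Hypothetical records: the second one lacks "runtime_ms" and carries an extra key
records = [
    {"dataset_name": "a", "runtime_ms": 12.5},
    {"dataset_name": "b", "unexpected": True},
]
wanted = ["runtime_ms", "dataset_name", "fps_avg"]

# Select and order columns; "fps_avg" is added as all-missing, "unexpected" is dropped
frame = pd.DataFrame(records).reindex(columns=wanted)

print(list(frame.columns))
print(frame.isna().sum())
```

Because extra keys are dropped silently, any new field added to the benchmark output also has to be added to the column list before it appears in the CSV.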


In [None]:
columns = [
    # File/run metadata
    "generated_at",
    "runs_declared",
    "result_run",
    "result_duration_ms",
    "stats_start_time",
    "stats_duration_ms",
    # WASM config (file-level)
    "wasm_features_file",
    "wasm_preload",
    # Machine info
    "machine_platform",
    "machine_release",
    "machine_arch",
    "cpu_model",
    "cpu_cores",
    "total_mem_bytes",
    "load_avg_1",
    "load_avg_5",
    "load_avg_15",
    "hostname",
    # Git context
    "git_commit",
    "git_branch",
    "git_status_dirty",
    # Per-dataset metrics (the core data)
    "dataset_index",
    "timestamp",
    "dataset_name",
    "dataset_size",
    "dimensions",
    "wasm_features",
    "rendering_enabled",
    "runtime_ms",
    "memory_delta_mb",
    "trustworthiness",
    "fps_avg",
    "responsiveness_ms",
]

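# Create the output directory if needed so to_csv does not fail on a missing folder
os.makedirs(os.path.dirname(output_path), exist_ok=True)

df = pd.DataFrame(rows).reindex(columns=columns)
df.to_csv(output_path, index=False)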

print(f"✓ Successfully saved {len(df)} rows to '{output_path}'")
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {len(df.columns)}")

df.head()