# UMAP Benchmark: Preprocess Results

## Overview

This notebook preprocesses UMAP benchmark results by flattening nested JSON files into a single CSV table suitable for statistical analysis.

**Key Features:**
- Extracts performance metrics from benchmark runs
- Combines machine metadata, git information, and UI performance data
- Automatically skips incomplete results (results without UI metrics, and UI entries without rows)
- Outputs a clean, analysis-ready CSV file

**Input:** Multiple JSON files in the `results/` directory following the pattern `bench-runs-*.json`

**Output:** A single `preprocessed.csv` file with all benchmark data in tabular format


## Setup and Configuration

Import required libraries and configure paths:
- `results_dir`: Directory containing benchmark JSON files
- `output_path`: Path for the output CSV file


In [1]:
import json
import glob
import os
import pandas as pd

results_dir = "results"
output_path = "preprocessed.csv"
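
A guard on the input directory can catch a wrong working directory before any processing starts. The sketch below is an optional addition, not part of the original notebook; `find_benchmark_files` is a hypothetical helper name, and raising `FileNotFoundError` on an empty match is an assumed policy.

```python
import glob
import os


def find_benchmark_files(results_dir: str) -> list[str]:
    """Return sorted bench-runs-*.json paths, failing loudly if none exist."""
    pattern = os.path.join(results_dir, "bench-runs-*.json")
    files = sorted(glob.glob(pattern))
    if not files:
        # Assumed policy: an empty match is treated as a configuration error
        raise FileNotFoundError(f"No files match {pattern!r}")
    return files
```

Calling `find_benchmark_files(results_dir)` would return the same sorted list that the processing cell builds with `glob.glob`.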


## Data Processing

The following cell processes each benchmark JSON file and extracts:

1. **Top-level metadata**: Generation time, run count, WASM configuration
2. **Machine information**: Platform, CPU, memory, load averages
3. **Git context**: Commit hash, branch, dirty status
4. **Result metrics**: Run number, WASM features, duration, plus the start time and duration reported under `stats`
5. **Performance data**: Runtime, memory delta, trustworthiness, FPS, responsiveness

Each UI metric row becomes a separate record in the output, with all parent context preserved.
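
The flattening pattern can be seen in isolation on a toy input. This is a simplified sketch that keeps only three fields; `flatten_results` is a hypothetical name and the sample values are invented for illustration, not taken from real benchmark files.

```python
def flatten_results(data: dict) -> list[dict]:
    """Return one flat record per UI metric row, copying parent context."""
    top = {"generated_at": data.get("generatedAt")}
    records = []
    for result in data.get("results") or []:
        base = {**top, "result_run": result.get("run")}
        # `or []` treats a missing key, None, and an empty list the same way
        for ui in result.get("uiMetrics") or []:
            for row in ui.get("rows") or []:
                records.append({**base, "runtime_ms": row.get("runtimeMs")})
    return records


# Hypothetical input: values are made up for illustration only.
sample = {
    "generatedAt": "2026-01-01T00:00:00Z",
    "results": [
        {"run": 1, "uiMetrics": [{"rows": [{"runtimeMs": 10.0}, {"runtimeMs": 12.5}]}]},
        {"run": 2, "uiMetrics": []},                # skipped: no UI metrics
        {"run": 3, "uiMetrics": [{"rows": None}]},  # skipped: no rows
    ],
}

flat = flatten_results(sample)
print(len(flat))  # 2
```

Both surviving records come from run 1 and carry the same `generated_at` and `result_run` values, which is what "parent context preserved" means here.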


In [2]:
rows = []
files = sorted(glob.glob(os.path.join(results_dir, "bench-runs-*.json")))

for path in files:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    top = {
        "generated_at": data.get("generatedAt"),
        "runs_declared": data.get("runs"),
        "wasm_features_file": data.get("wasmFeatures"),
        "wasm_preload": data.get("wasmPreload"),
    }

    machine = data.get("machine") or {}
    # Pad to three entries so a missing or short loadAvg cannot raise IndexError
    load_avg = (list(machine.get("loadAvg") or []) + [None] * 3)[:3]
    top.update({
        "machine_platform": machine.get("platform"),
        "machine_release": machine.get("release"),
        "machine_arch": machine.get("arch"),
        "cpu_model": machine.get("cpuModel"),
        "cpu_cores": machine.get("cpuCores"),
        "total_mem_bytes": machine.get("totalMemBytes"),
        "load_avg_1": load_avg[0],
        "load_avg_5": load_avg[1],
        "load_avg_15": load_avg[2],
        "hostname": machine.get("hostname"),
    })

    git = data.get("git") or {}
    top.update({
        "git_commit": git.get("commit"),
        "git_branch": git.get("branch"),
        "git_status_dirty": git.get("statusDirty"),
    })

    for result in data.get("results") or []:
        # Skip results with empty or missing uiMetrics
        ui_metrics = result.get("uiMetrics")
        if not ui_metrics:
            continue
            
        base = dict(top)
        base.update({
            "result_run": result.get("run"),
            "result_wasm_features": result.get("wasmFeatures"),
            "result_duration_ms": result.get("durationMs"),
        })

        stats = result.get("stats") or {}
        base.update({
            "stats_start_time": stats.get("startTime"),
            "stats_duration_ms": stats.get("duration"),
        })

        for ui in ui_metrics:
            # Skip UI entries with empty or missing rows
            ui_rows = ui.get("rows")
            if not ui_rows:
                continue
                
            for row in ui_rows:
                out = dict(base)
                out.update({
                    "ui_run_id": row.get("runId"),
                    "ui_timestamp": row.get("timestamp"),
                    "scope": row.get("scope"),
                    "dataset_name": row.get("datasetName"),
                    "dataset_size": row.get("datasetSize"),
                    "dimensions": row.get("dimensions"),
                    "ui_wasm_features": row.get("wasmFeatures"),
                    "wasm_mode": row.get("wasmMode"),
                    "runtime_ms": row.get("runtimeMs"),
                    "memory_delta_mb": row.get("memoryDeltaMb"),
                    "trustworthiness": row.get("trustworthiness"),
                    "fps_avg": row.get("fpsAvg"),
                    "responsiveness_ms": row.get("responsivenessMs"),
                })
                rows.append(out)


## Processing Summary

Review the data extraction results:


In [3]:
print(f"Files processed: {len(files)}")
print(f"Total rows extracted: {len(rows)}")
print("\nSample of first row keys:")
if rows:
    for key in list(rows[0].keys())[:10]:
        print(f"  - {key}")


Files processed: 21
Total rows extracted: 420

Sample of first row keys:
  - generated_at
  - runs_declared
  - wasm_features_file
  - wasm_preload
  - machine_platform
  - machine_release
  - machine_arch
  - cpu_model
  - cpu_cores
  - total_mem_bytes


## Create DataFrame and Export

Convert the extracted rows into a pandas DataFrame with a consistent column order, then export to CSV. The output file will be ready for statistical analysis.
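
One property of `reindex(columns=...)` is worth knowing before relying on the export: columns that are listed but absent from the data are created and filled with NaN, and columns that are present but not listed are dropped. A minimal sketch with hypothetical records (the `unexpected` key and all values are invented):

```python
import pandas as pd

# Hypothetical records: the second one lacks "runtime_ms" and has an extra key.
records = [
    {"dataset_name": "a", "runtime_ms": 1.5},
    {"dataset_name": "b", "unexpected": "x"},
]
columns = ["runtime_ms", "dataset_name", "fps_avg"]

demo = pd.DataFrame(records).reindex(columns=columns)

print(list(demo.columns))           # ['runtime_ms', 'dataset_name', 'fps_avg']
print(demo["fps_avg"].isna().all())  # True: listed but absent, so filled with NaN
```

A misspelled name in the `columns` list would therefore produce a silent all-NaN column rather than an error, so a check such as `df.isna().all()` after the export step may be useful.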


In [4]:
columns = [
    "generated_at",
    "runs_declared",
    "result_run",
    "result_wasm_features",
    "result_duration_ms",
    "stats_start_time",
    "stats_duration_ms",
    "wasm_features_file",
    "wasm_preload",
    "machine_platform",
    "machine_release",
    "machine_arch",
    "cpu_model",
    "cpu_cores",
    "total_mem_bytes",
    "load_avg_1",
    "load_avg_5",
    "load_avg_15",
    "hostname",
    "git_commit",
    "git_branch",
    "git_status_dirty",
    "ui_run_id",
    "ui_timestamp",
    "scope",
    "dataset_name",
    "dataset_size",
    "dimensions",
    "ui_wasm_features",
    "wasm_mode",
    "runtime_ms",
    "memory_delta_mb",
    "trustworthiness",
    "fps_avg",
    "responsiveness_ms",
]

df = pd.DataFrame(rows).reindex(columns=columns)
df.to_csv(output_path, index=False)

print(f"✓ Successfully saved {len(df)} rows to '{output_path}'")
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {len(df.columns)}")

df.head()


✓ Successfully saved 420 rows to 'preprocessed.csv'

DataFrame shape: (420, 35)
Columns: 35


| | generated_at | runs_declared | result_run | result_wasm_features | result_duration_ms | stats_start_time | stats_duration_ms | wasm_features_file | wasm_preload | machine_platform | ... | dataset_name | dataset_size | dimensions | ui_wasm_features | wasm_mode | runtime_ms | memory_delta_mb | trustworthiness | fps_avg | responsiveness_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2026-01-31T12:39:27.635Z | 10 | 1 | none | 43526 | 2026-01-31T12:39:45.906Z | 25293.62 | none | True | linux | ... | Iris Dataset (150 points, 4D) | 150 | 4 | none | js | 2326.9 | -0.055005 | 0.986882 | 60.0 | 8.6 |
| 1 | 2026-01-31T12:39:27.635Z | 10 | 1 | none | 43526 | 2026-01-31T12:39:45.906Z | 25293.62 | none | True | linux | ... | Small Random (80 points) | 80 | 10 | none | js | 3404.4 | -4.720348 | 0.876677 | 48.204624 | 24.0 |
| 2 | 2026-01-31T12:39:27.635Z | 10 | 2 | none | 45124 | 2026-01-31T12:40:27.525Z | 28781.219 | none | True | linux | ... | Iris Dataset (150 points, 4D) | 150 | 4 | none | js | 2297.1 | 2.98244 | 0.991627 | 60.0 | 9.072727 |
| 3 | 2026-01-31T12:39:27.635Z | 10 | 2 | none | 45124 | 2026-01-31T12:40:27.525Z | 28781.219 | none | True | linux | ... | Small Random (80 points) | 80 | 10 | none | js | 3291.8 | 14.746518 | 0.830674 | 48.0 | 22.9 |
| 4 | 2026-01-31T12:39:27.635Z | 10 | 3 | none | 46734 | 2026-01-31T12:41:11.846Z | 31195.437 | none | True | linux | ... | Iris Dataset (150 points, 4D) | 150 | 4 | none | js | 2334.0 | 1.219504 | 0.988568 | 59.994001 | 9.881818 |
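
As a first look at the exported data, runtimes can be grouped by dataset and WASM mode. The sketch below is one possible starting point, not part of the original notebook. It uses an in-memory stand-in for `preprocessed.csv` so that it runs on its own; every value in it, including the `wasm` mode label, is invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for preprocessed.csv; all values are invented.
csv_text = """dataset_name,wasm_mode,runtime_ms
A,js,10.0
A,js,14.0
A,wasm,6.0
B,js,20.0
"""

df_check = pd.read_csv(io.StringIO(csv_text))

# Row count and median runtime per (dataset, mode) combination
summary = (
    df_check.groupby(["dataset_name", "wasm_mode"])["runtime_ms"]
    .agg(["count", "median"])
    .reset_index()
)
print(summary)
```

With the real file, `pd.read_csv(output_path)` would replace the `io.StringIO` stand-in.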
