# UMAP Benchmark: Preprocess Results

## Overview

This notebook preprocesses UMAP benchmark results by flattening nested JSON files into a single tabular CSV format suitable for statistical analysis. 

**Key Features:**
- Extracts performance metrics from benchmark runs
- Combines machine metadata, git information, and UI performance data
- Skips incomplete results automatically (results with no metric rows in either format)
- Outputs a clean, analysis-ready CSV file

**Input:** Multiple JSON files in `../results/` following the pattern `bench-runs-*.json`

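**Output:** `../outputs/preprocessed.csv` with all benchmark data in tabular format. The file is written twice: first as the raw flattened table, then overwritten with the cleaned dataset that adds the `feature`, `feature_name`, and `machine_type` columns.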

## Setup and Configuration

Import required libraries and configure paths:
- `results_dir`: Directory containing benchmark JSON files
- `output_path`: Path for the output CSV file

In [1]:
import json
import glob
import os
import pandas as pd

# Paths relative to bench/analysis/ folder
results_dir = "../results"
output_path = "../outputs/preprocessed.csv"


## Data Processing

The following cell processes each benchmark JSON file and extracts:

1. **Top-level metadata**: Generation time, declared run count, WASM feature set and preload flag
2. **Machine information**: Platform, release, architecture, CPU, memory, load averages, hostname
3. **Git context**: Commit hash, branch, dirty status
4. **Result metrics**: Run number and duration, plus the start time and duration from the result's `stats` block
5. **Performance data**: Runtime, memory usage, trustworthiness, FPS, responsiveness

Each metric row (one per dataset) becomes a separate record in the output, with all parent context preserved.

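Note: The cell supports both the new flat `metrics[]` format and the legacy `uiMetrics[].rows[]` format for backward compatibility. The legacy format is read only when a result has no `metrics` key; results that yield no metric rows are skipped.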
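
The format fallback can be isolated into a small helper, which makes it easy to check against hand-written samples. The following is a minimal sketch: the helper name `extract_metric_rows` and the sample payloads are illustrative assumptions, not part of the benchmark code.

```python
def extract_metric_rows(result):
    """Return the per-dataset metric rows for one result, in either format."""
    metrics = result.get("metrics")
    if metrics is not None:
        # New format: rows are stored directly under "metrics".
        return list(metrics)
    # Legacy format: rows are nested under each "uiMetrics" entry.
    rows = []
    for ui in result.get("uiMetrics") or []:
        rows.extend(ui.get("rows") or [])
    return rows


# Hypothetical sample payloads, for illustration only.
new_style = {"run": 1, "metrics": [{"datasetName": "A", "runtimeMs": 12.5}]}
legacy_style = {
    "run": 1,
    "uiMetrics": [
        {"rows": [{"datasetName": "A", "runtimeMs": 12.5}]},
        {"rows": None},  # an entry without rows contributes nothing
    ],
}
incomplete = {"run": 2}  # no metrics in either format

new_rows = extract_metric_rows(new_style)
legacy_rows = extract_metric_rows(legacy_style)
empty_rows = extract_metric_rows(incomplete)  # empty list, so the result would be skipped
```

Both formats produce the same list of row dictionaries, so the rest of the flattening logic does not need to know which format a file used.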

In [2]:
rows = []
files = sorted(glob.glob(os.path.join(results_dir, "bench-runs-*.json")))

for path in files:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    top = {
        "generated_at": data.get("generatedAt"),
        "runs_declared": data.get("runs"),
        "wasm_features_file": data.get("wasmFeatures"),
        "wasm_preload": data.get("wasmPreload"),
    }

    machine = data.get("machine") or {}
    # Fall back to three empty slots when loadAvg is missing or null
    load_avg = machine.get("loadAvg") or [None, None, None]
    top.update({
        "machine_platform": machine.get("platform"),
        "machine_release": machine.get("release"),
        "machine_arch": machine.get("arch"),
        "cpu_model": machine.get("cpuModel"),
        "cpu_cores": machine.get("cpuCores"),
        "total_mem_bytes": machine.get("totalMemBytes"),
        "load_avg_1": load_avg[0],
        "load_avg_5": load_avg[1],
        "load_avg_15": load_avg[2],
        "hostname": machine.get("hostname"),
    })

    git = data.get("git") or {}
    top.update({
        "git_commit": git.get("commit"),
        "git_branch": git.get("branch"),
        "git_status_dirty": git.get("statusDirty"),
    })

    for result in data.get("results", []):
        # Support both old format (uiMetrics[].rows[]) and new format (metrics[])
        metrics_list = result.get("metrics")
        if metrics_list is None:
            # Old format fallback
            ui_metrics = result.get("uiMetrics") or []
            metrics_list = []
            for ui in ui_metrics:
                metrics_list.extend(ui.get("rows") or [])
        
        if not metrics_list:
            continue
            
        base = dict(top)
        base.update({
            "result_run": result.get("run"),
            "result_duration_ms": result.get("durationMs"),
        })

        stats = result.get("stats") or {}
        base.update({
            "stats_start_time": stats.get("startTime"),
            "stats_duration_ms": stats.get("duration"),
        })

        for row in metrics_list:
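            # Do not infer renderingEnabled. If missing, default to False.
            rendering_enabled = row.get("renderingEnabled")
            if rendering_enabled is None:
                rendering_enabled = False

            # Prefer datasetIndex; fall back to the legacy runId field only when
            # datasetIndex is absent. Testing for None keeps a valid index of 0,
            # which `datasetIndex or runId` would have discarded.
            dataset_index = row.get("datasetIndex")
            if dataset_index is None:
                dataset_index = row.get("runId")

            out = dict(base)
            out.update({
                "dataset_index": dataset_index,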
                "timestamp": row.get("timestamp"),
                "dataset_name": row.get("datasetName"),
                "dataset_size": row.get("datasetSize"),
                "dimensions": row.get("dimensions"),
                "wasm_features": row.get("wasmFeatures"),
                "rendering_enabled": rendering_enabled,
                "runtime_ms": row.get("runtimeMs"),
                "memory_delta_mb": row.get("memoryDeltaMb"),
                "trustworthiness": row.get("trustworthiness"),
                "fps_avg": row.get("fpsAvg"),
                "responsiveness_ms": row.get("responsivenessMs"),
            })
            rows.append(out)

## Processing Summary

Review the data extraction results:


In [3]:
print(f"Files processed: {len(files)}")
print(f"Total rows extracted: {len(rows)}")
print("\nSample of first row keys:")
if rows:
    for key in list(rows[0].keys())[:10]:
        print(f"  - {key}")


Files processed: 25
Total rows extracted: 500

Sample of first row keys:
  - generated_at
  - runs_declared
  - wasm_features_file
  - wasm_preload
  - machine_platform
  - machine_release
  - machine_arch
  - cpu_model
  - cpu_cores
  - total_mem_bytes


## Create DataFrame and Export

Convert the extracted rows into a pandas DataFrame with a consistent column order, then export to CSV. The output file will be ready for statistical analysis.
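
The consistent column order comes from `DataFrame.reindex(columns=...)`, which also drops keys that are not listed and creates all-NaN columns for listed names that never appeared in the data. A minimal sketch on toy rows (the keys `a`, `b`, `c`, and `extra` are made up for illustration):

```python
import pandas as pd

# Toy rows with hypothetical keys: "extra" is not wanted, "c" is never present.
toy_rows = [{"b": 2, "a": 1, "extra": "x"}, {"a": 3, "b": 4}]
toy = pd.DataFrame(toy_rows).reindex(columns=["a", "b", "c"])

# Columns follow the requested order; "extra" is dropped and "c" is all-NaN.
col_names = list(toy.columns)
c_all_missing = bool(toy["c"].isna().all())
```

This means a typo in the `columns` list would not raise an error. It would silently produce an empty column, so a quick check for all-NaN columns after export can be worthwhile.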


In [4]:
columns = [
    # File/run metadata
    "generated_at",
    "runs_declared",
    "result_run",
    "result_duration_ms",
    "stats_start_time",
    "stats_duration_ms",
    # WASM config (file-level)
    "wasm_features_file",
    "wasm_preload",
    # Machine info
    "machine_platform",
    "machine_release",
    "machine_arch",
    "cpu_model",
    "cpu_cores",
    "total_mem_bytes",
    "load_avg_1",
    "load_avg_5",
    "load_avg_15",
    "hostname",
    # Git context
    "git_commit",
    "git_branch",
    "git_status_dirty",
    # Per-dataset metrics (the core data)
    "dataset_index",
    "timestamp",
    "dataset_name",
    "dataset_size",
    "dimensions",
    "wasm_features",
    "rendering_enabled",
    "runtime_ms",
    "memory_delta_mb",
    "trustworthiness",
    "fps_avg",
    "responsiveness_ms",
]

df = pd.DataFrame(rows).reindex(columns=columns)

# Create output directory if it doesn't exist
os.makedirs(os.path.dirname(output_path), exist_ok=True)

df.to_csv(output_path, index=False)

print(f"✓ Successfully saved {len(df)} rows to '{output_path}'")
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {len(df.columns)}")

df.head()

✓ Successfully saved 500 rows to '../outputs/preprocessed.csv'

DataFrame shape: (500, 33)
Columns: 33


,generated_at,runs_declared,result_run,result_duration_ms,stats_start_time,stats_duration_ms,wasm_features_file,wasm_preload,machine_platform,machine_release,...,dataset_name,dataset_size,dimensions,wasm_features,rendering_enabled,runtime_ms,memory_delta_mb,trustworthiness,fps_avg,responsiveness_ms
0,2026-02-07T03:48:51.452Z,10,1,40567,2026-02-07T03:49:07.248Z,25205.813,none,True,linux,6.6.87.2-microsoft-standard-WSL2,...,"Iris Dataset (150 points, 4D)",150,4,none,True,2366.1,-0.013417,0.987833,60.03937,8.281818
1,2026-02-07T03:48:51.452Z,10,1,40567,2026-02-07T03:49:07.248Z,25205.813,none,True,linux,6.6.87.2-microsoft-standard-WSL2,...,Small Random (80 points),80,10,none,True,3326.6,15.658463,0.860789,47.216211,21.033333
2,2026-02-07T03:48:51.452Z,10,2,37428,2026-02-07T03:49:46.215Z,23675.601,none,True,linux,6.6.87.2-microsoft-standard-WSL2,...,"Iris Dataset (150 points, 4D)",150,4,none,True,2392.7,-0.051042,0.987807,60.003935,8.709091
3,2026-02-07T03:48:51.452Z,10,2,37428,2026-02-07T03:49:46.215Z,23675.601,none,True,linux,6.6.87.2-microsoft-standard-WSL2,...,Small Random (80 points),80,10,none,True,3021.8,8.784738,0.853083,48.190401,21.14
4,2026-02-07T03:48:51.452Z,10,3,36905,2026-02-07T03:50:23.352Z,23448.497,none,True,linux,6.6.87.2-microsoft-standard-WSL2,...,"Iris Dataset (150 points, 4D)",150,4,none,True,2350.5,0.020347,0.987706,60.015742,8.490909


## Data Standardization and Cleaning

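Prepare the dataset for analysis by:
- Standardizing WASM feature names into readable labels (comma-separated feature lists are collapsed to `All Features`)
- Mapping machine platforms to readable machine types (`darwin` to `MacBook`, `linux` to `Linux`; other values are kept as-is)
- Converting the five metric columns to numeric types, coercing unparseable values to NaN
- Removing rows with a missing `runtime_ms` or `feature_name`
- Copying the result into `df_analysis`, the analysis-ready dataset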
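
The label standardization and numeric coercion can be tried on a few toy values before running them on the full table. This is a minimal sketch under stated assumptions: the input values are invented, `"simd"` stands in for any feature name missing from the mapping, and only a subset of the mapping is shown.

```python
import pandas as pd

# Invented raw values: padded/mixed-case text, a feature list, a missing value,
# and "simd" as a made-up name that the mapping does not know.
raw = pd.Series(["none", " Dist ", "dist,tree", None, "simd"])

# Normalize text, then collapse any comma-separated list to "all".
feature = raw.fillna("none").str.lower().str.strip()
feature = feature.mask(feature.str.contains(",", regex=False), "all")

# Map to display labels; unmapped values fall back to the normalized text.
feature_map = {"none": "Baseline (JS)", "dist": "Distance", "all": "All Features"}
feature_name = feature.map(feature_map).fillna(feature)

# Unparseable or missing metric values become NaN instead of raising.
runtime = pd.to_numeric(pd.Series(["12.5", "n/a", None]), errors="coerce")
```

Because unknown feature names pass through unchanged, they never trigger the `dropna` step. Comparing `feature_name` against the expected labels is one way to spot them.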

In [5]:
# Standardize column names and prepare data
df_clean = df.copy()

# Create standardized feature column
if 'wasm_features' in df_clean.columns:
    df_clean['feature'] = df_clean['wasm_features'].fillna('none').str.lower().str.strip()
else:
    df_clean['feature'] = df_clean.get('feature', 'none')

# Map comma-separated features to 'all'
df_clean.loc[df_clean['feature'].astype(str).str.contains(',', na=False), 'feature'] = 'all'

# Standardize feature names
feature_map = {
    'none': 'Baseline (JS)',
    'dist': 'Distance',
    'tree': 'Tree',
    'matrix': 'Matrix',
    'nn': 'NN Descent',
    'opt': 'Optimizer',
    'all': 'All Features'
}
df_clean['feature_name'] = df_clean['feature'].map(feature_map).fillna(df_clean['feature'])

# Identify machine types
if 'machine_platform' in df_clean.columns:
    platform_map = {'darwin': 'MacBook', 'linux': 'Linux'}
    df_clean['machine_type'] = df_clean['machine_platform'].map(platform_map).fillna(df_clean['machine_platform'])
else:
    if 'machine_type' not in df_clean.columns:
        df_clean['machine_type'] = pd.NA

# Convert numeric columns
numeric_cols = ['runtime_ms', 'memory_delta_mb', 'trustworthiness', 'fps_avg', 'responsiveness_ms']
for col in numeric_cols:
    if col in df_clean.columns:
        df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

# Remove missing values in key columns
before = len(df_clean)
required = [c for c in ['runtime_ms', 'feature_name'] if c in df_clean.columns]
df_clean = df_clean.dropna(subset=required)
dropped = before - len(df_clean)

# Create analysis dataset
df_analysis = df_clean.copy()

# Define standard feature order
feature_order = ['Baseline (JS)', 'Distance', 'Tree', 'Matrix', 'NN Descent', 'Optimizer', 'All Features']
feature_order = [f for f in feature_order if f in df_analysis['feature_name'].unique()]

print(f"✓ Dataset ready: {len(df_analysis):,} measurements (dropped {dropped} rows)")
print(f"Features: {sorted(df_analysis['feature_name'].unique())}")
print(f"Datasets: {df_analysis['dataset_name'].nunique()}")
print(f"\nMeasurements per feature:")
print(df_analysis['feature_name'].value_counts().sort_index())

✓ Dataset ready: 500 measurements (dropped 0 rows)
Features: ['All Features', 'Baseline (JS)', 'Distance', 'Matrix', 'NN Descent', 'Optimizer', 'Tree']
Datasets: 6

Measurements per feature:
feature_name
All Features     100
Baseline (JS)     60
Distance          60
Matrix            60
NN Descent        60
Optimizer        100
Tree              60
Name: count, dtype: int64


In [6]:
# Save cleaned analysis-ready dataset.
# Note: this is the same path as `output_path`, so the raw export from In [4] is overwritten.
output_cleaned_path = "../outputs/preprocessed.csv"
df_analysis.to_csv(output_cleaned_path, index=False)

print(f"✓ Saved cleaned dataset to '{output_cleaned_path}'")
print(f"  Rows: {len(df_analysis):,}")
print(f"  Added columns: feature, feature_name, machine_type")

✓ Saved cleaned dataset to '../outputs/preprocessed.csv'
  Rows: 500
  Added columns: feature, feature_name, machine_type
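
The `feature_order` list built in the cleaning cell is not used again in this notebook. One way a downstream analysis might apply it is as an ordered categorical, so that grouped summaries and plots follow the intended order instead of alphabetical order. A minimal sketch with invented runtimes and a shortened order list:

```python
import pandas as pd

# Shortened order list and invented measurements, for illustration only.
feature_order = ["Baseline (JS)", "Distance", "All Features"]
toy = pd.DataFrame({
    "feature_name": ["Distance", "All Features", "Baseline (JS)", "Distance"],
    "runtime_ms": [90.0, 70.0, 100.0, 110.0],
})

# An ordered categorical makes groupby sort by the declared order.
toy["feature_name"] = pd.Categorical(
    toy["feature_name"], categories=feature_order, ordered=True
)
median_runtime = toy.groupby("feature_name", observed=True)["runtime_ms"].median()
```

Note that the categorical dtype is not preserved in the CSV, so it would need to be reapplied after loading `preprocessed.csv`.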
