5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added
- **tests**: `ReaderApiFlatBuffers.CancelledPrefetchReEntersWindow` -- focused regression test for the bug above. Runs the cancel + re-enter pattern (prime chunks 0..2, jump to chunk 10 to cancel, jump back to chunk 2 to revive) 50 times against a fresh reader each iteration. Pre-fix, it failed nearly every run under TSan and flaked at a single-digit percentage on a stock release build; post-fix it is deterministically green on both
- **benchmarks**: `BM_FrameAccessor_Creation` (isolated `reader->CreateAccessor()` cost, ~6.77 µs/iter) and `BM_EntityView_SingleGet` (isolated `EntityView view(entity); view.Get(key)` cost, 1.70 ns/iter = a 585 M reads/s ceiling). Fills a framing gap: previously every accessor number was bundled with entity-iteration overhead, so customers asking "how fast is your property read?" only got the mixed number. Paired with the existing `BM_AccessorKeyResolution` (74 ns/key), the two new benchmarks decompose the public property-access API (`FrameAccessor` -> `PropertyKey<T>` -> `EntityView`) into three independently measurable layers
- **docs**: new `docs/PERFORMANCE.md` landing page -- visual health-check table, headline numbers, three-layer accessor breakdown, cache-window finding with its 59 %-slower trap, format-choice caveat, honest "what these numbers do not prove" section. Linked from `README.md` and `docs/BUILD.md`
- **docs**: `README.md` now has a "Performance at a glance" section up top with a headline table and the traffic-light health check. Non-engineering readers landing on the GitHub page get a truthful one-screen summary before the build instructions
- **docs/BUILD.md**: new "Running benchmarks locally" section mirroring the sanitizer section -- covers configure, full-suite run, filtered run, fixture requirements
- **reports/benchmarks/**: raw output of the 2026-04-23 16:20 canonical run committed (`bench_20260423_162008.{txt,json}`) for baseline tracking and future regression comparison. First run to separately measure the three accessor layers (`FrameAccessor` creation, `PropertyKey` resolution, `EntityView::Get`). Narrative interpretation lives in `docs/PERFORMANCE.md` rather than in a per-run markdown snapshot -- evergreen doc, one source of truth

## [Unreleased] - 2026-04-22

34 changes: 34 additions & 0 deletions README.md
@@ -4,6 +4,40 @@

A high-performance C++20 toolkit for recording, reading, comparing, and inspecting structured replay data. VTX serializes frame-based entity state into a compact, chunked binary format with support for **Protocol Buffers** and **FlatBuffers** backends.

## Performance at a glance

Measured against the CS2 (92 MB, 10 656 frames) and Rocket League (5 MB, ~21 k frames) real-world fixtures on a single dev machine (i9-13900H, SSD). Full breakdown in [`docs/PERFORMANCE.md`](docs/PERFORMANCE.md).

| Task | Number | Notes |
|---|---|---|
| Writing frames end-to-end | **~82 k frames/s** | 30-min 60 fps match ≈ 1.3 s of CPU. |
| Full sequential read (CS2, 92 MB) | **~5.6 s** (median) | Competitive with peer tooling. |
| Preview first 1 000 frames | **~1 s** | Below UX perceptibility. |
| `EntityView::Get` (isolated) | **1.70 ns** (585 M/s) | Theoretical ceiling. |
| `EntityView::Get` (realistic hot loop) | **~80 ns** (13 M/s) | What real integrations observe. |
| Diff consecutive frames | **4 µs** (267 k/s) | Instant from a user perspective. |

### Health check

| Area | Verdict | Why |
|---|---|---|
| Writer throughput | 🟢 Fast | ~10× headroom over real-time recording. |
| Full sequential read | 🟢 Fast enough | 5.6 s for a 92 MB replay (median). |
| Preview + seek-play | 🟢 Fast | Sub-second scrubbing. |
| Cache-window **well-sized** | 🟢 Fast | <4 s for 50 random jumps. |
| Cache-window **mis-sized** | 🔴 **Actively bad** | 59 % slower than no cache — needs docs guidance. |
| `FrameAccessor` / `PropertyKey` setup | 🟢 Negligible | 6.77 µs + 74 ns/key. Once per integration. |
| `EntityView` hot loop | 🟢 Excellent | 1.70 ns isolated / ~80 ns realistic. |
| Diff + short-circuit | 🟢 Instant | 4 µs consecutive; identical-frame shortcut: 10× (arena), 2× (CS2). |
| Schema parse | 🟢 Negligible | 200 µs, one-time. |
| Format choice (FBS vs Proto) | 🟡 Depends | No universal winner — fixture-shape dictates. |

**Honest caveats** (detail in [`docs/PERFORMANCE.md`](docs/PERFORMANCE.md)):
- One machine, one run. Ratios hold across hardware; absolute numbers vary.
- No direct comparison to competitor libraries.
- `items_per_second` is CPU-time based; use wall time for user-observable claims.
- One known fixture counter inflated 2× (`BM_AccessorRandomWithinBucket`); flagged for fix.

## Architecture

```
75 changes: 75 additions & 0 deletions benchmarks/bench_accessor.cpp
@@ -266,3 +266,78 @@ static void BM_AccessorRandomWithinBucket(benchmark::State& state) {
  state.SetItemsProcessed(state.iterations() * ops_per_sweep);
}
BENCHMARK(BM_AccessorRandomWithinBucket)->Unit(benchmark::kMillisecond);

// Isolated cost of `reader->CreateAccessor()` -- the one-time handshake a
// consumer pays at integration startup. Complements BM_AccessorKeyResolution
// which measures what happens *after* an accessor exists. Together they
// bracket the "SDK initialization" cost the consumer observes.
static void BM_FrameAccessor_Creation(benchmark::State& state) {
  auto result = VTX::OpenReplayFile(ArenaReplayPath());
  if (!result) {
    state.SkipWithError("OpenReplayFile failed");
    return;
  }
  auto& reader = result.reader;

  for (auto _ : state) {
    auto accessor = reader->CreateAccessor();
    benchmark::DoNotOptimize(accessor);
  }
}
BENCHMARK(BM_FrameAccessor_Creation)->Unit(benchmark::kMicrosecond);

// Isolated cost of `EntityView(entity); view.Get(key)` -- one construction
// plus one typed property read, on a single Player entity held in RAM. This
// is the per-property-access cost a consumer pays in their steady state
// once everything else (accessor, keys, frame cache) is warm.
//
// Preloads a copy of frame 0 into a local Frame so the measured loop cannot
// be invalidated by chunk eviction. Picks the first Player entity it finds.
static void BM_EntityView_SingleGet(benchmark::State& state) {
  auto result = VTX::OpenReplayFile(ArenaReplayPath());
  if (!result) {
    state.SkipWithError("OpenReplayFile failed");
    return;
  }
  auto& reader = result.reader;

  auto accessor = reader->CreateAccessor();
  auto key_position = accessor.Get<VTX::Vector>("Player", "Position");
  if (!key_position.IsValid()) {
    state.SkipWithError("Player::Position did not resolve");
    return;
  }

  // Own the frame so chunk eviction can't yank the entity out from under us.
  const auto* frame_ptr = reader->GetFrameSync(0);
  if (!frame_ptr) {
    state.SkipWithError("frame 0 not loaded");
    return;
  }
  VTX::Frame frame_copy = *frame_ptr;

  const VTX::PropertyContainer* sample_entity = nullptr;
  for (const auto& bucket : frame_copy.GetBuckets()) {
    for (const auto& entity : bucket.entities) {
      if (entity.entity_type_id == 0) {  // type id 0 == Player in this fixture
        sample_entity = &entity;
        break;
      }
    }
    if (sample_entity)
      break;
  }
  if (!sample_entity) {
    state.SkipWithError("no Player entity in frame 0");
    return;
  }

  double sink = 0.0;
  for (auto _ : state) {
    VTX::EntityView view(*sample_entity);
    sink += view.Get(key_position).x;
  }
  benchmark::DoNotOptimize(sink);
  state.SetItemsProcessed(state.iterations());
}
BENCHMARK(BM_EntityView_SingleGet)->Unit(benchmark::kNanosecond);
27 changes: 27 additions & 0 deletions docs/BUILD.md
@@ -284,6 +284,33 @@ VTX/
thirdparty/ Header-only deps + legacy Windows binary fallback
vcpkg.json Windows package-manager manifest
```

## Running benchmarks locally

The benchmark binary is gated behind `VTX_BUILD_BENCHMARKS` and uses `google/benchmark` (fetched via `FetchContent`). Release builds only -- debug numbers are not meaningful.

```bash
# Configure + build
cmake -S . -B build-bench -DCMAKE_BUILD_TYPE=Release -DVTX_BUILD_BENCHMARKS=ON
cmake --build build-bench --target vtx_benchmarks --config Release --parallel

# Run the full suite (writes JSON + console side-by-side for diffing over time)
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_out=reports/benchmarks/bench_$(date +%Y%m%d_%H%M%S).json \
--benchmark_out_format=json \
--benchmark_counters_tabular=true

# Filter to a single family
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_filter='BM_FrameAccessor_Creation|BM_EntityView_SingleGet|BM_AccessorKeyResolution'
```

Fixtures required at run time:

- `samples/content/reader/{cs,rl,arena}/*.vtx` -- real replays; checked in.
- `benchmarks/fixtures/synth_10k.vtx` -- small synthetic fixture generated by `vtx_sample_write` when `VTX_BUILD_BENCHMARKS=ON`.

Narrative results (what the numbers mean, not just the tables) live in [`docs/PERFORMANCE.md`](PERFORMANCE.md). The raw output of the canonical run is committed under `reports/benchmarks/` as JSON + console text for regression tracking and graphing.

## Running sanitizers locally

VTX's root `CMakeLists.txt` exposes a `VTX_SANITIZE` option that enables gcc/clang runtime sanitizers. Useful before pushing a change that touches threading or memory-ownership code.
Expand Down
135 changes: 135 additions & 0 deletions docs/PERFORMANCE.md
@@ -0,0 +1,135 @@
# Performance

> Numbers in this page come from the benchmark run on **2026-04-23** against `main` at commit `78b20f3`.
> Raw data for that run (re-usable for graphing or regression tracking): [`reports/benchmarks/bench_20260423_162008.json`](../reports/benchmarks/bench_20260423_162008.json).

## At a glance

A one-screen view of which areas of the SDK are fast, which are slow, and which depend on how you use them.

| Area | Verdict | Why |
|---|---|---|
| Writer throughput | 🟢 Fast | 82 k frames/s end-to-end; ~10× headroom over real-time recording. |
| Full sequential read (CS, 92 MB) | 🟢 Fast enough | 5.6 s median; competitive with peer tooling. |
| Preview (first 1 000 frames) | 🟢 Fast | ~1 s; below UX perceptibility. |
| Scrubbing with **well-sized** cache | 🟢 Fast | <4 s for 50 random jumps. |
| Scrubbing with **mis-sized** cache | 🔴 **Actively bad** | 23 s for the same workload — 59 % slower than no cache. |
| `FrameAccessor` creation | 🟢 Negligible | 6.77 µs once per replay. |
| `PropertyKey` resolution | 🟢 Negligible | ~74 ns per key, paid once per key you resolve. |
| `EntityView` hot-loop read | 🟢 Excellent | 1.70 ns isolated / ~80 ns in realistic loop. |
| Diff on consecutive frames | 🟢 Instant | 4 µs; 267 k comparisons/s. |
| Diff short-circuit on identical frames | 🟢 Instant | xxHash shortcut working; 10× on arena, 2× on CS. |
| Schema parse | 🟢 Negligible | 200 µs, one-time per file. |
| Format choice (FBS vs Proto) | 🟡 Context-dependent | No universal winner; fixture-shape dictates. |
| CS sequential-scan stability | 🟡 Noisy | One outlier iteration per run; the median is trustworthy, the mean fluctuates. |

## Headline numbers

| Task | Result | Plain-language reading |
|---|---|---|
| Writing a replay | ~82 000 frames/s | A 30-minute match at 60 fps (~108 k frames) costs ~1.3 s of CPU. |
| Reading the full CS2 fixture (92 MB, median) | ~5.6 s | Comparable to opening a long video in an editor. |
| Preview (first 1 000 frames) | ~1 s | Thumbnails + file-browse UI have no perceptible lag. |
| Seek to 50 % + play 300 frames | ~0.9 s | Timeline scrubbing is fluid. |
| `FrameAccessor` creation | 6.77 µs | Once per replay. Invisible. |
| `PropertyKey` resolution | ~74 ns / key | 10 properties ≈ 0.7 µs total. Invisible. |
| `EntityView` read (isolated) | **1.70 ns** | 585 M reads/s — theoretical ceiling. |
| `EntityView` read (hot loop) | ~80 ns | ~13 M reads/s — what real integrations observe. |
| Frame diff | 4 µs | 267 k comparisons/s. |

## The three-layer accessor API

Customers asking "how fast is the property-access API?" are really asking about one of three layers with very different costs:

| Layer | When you pay | How often | Cost |
|---|---|---|---|
| `FrameAccessor` (`reader->CreateAccessor()`) | Integration startup | Once per replay | **6.77 µs** |
| `PropertyKey<T>` (`accessor.Get<T>("Struct", "prop")`) | Integration setup | Once per property you care about | **~74 ns / key** |
| `EntityView` + `Get` (inside your hot loop) | Every property read | Millions of times per second | **1.70 ns isolated / ~80 ns realistic** |

Lead with the hot-loop number (~13 M reads/s) when setting customer expectations. Use the isolated ceiling (585 M/s) only when explicitly asked about the theoretical upper bound.

## Key findings

### 1. A small cache is *worse* than no cache

The reader has a configurable cache window (`SetCacheWindow(back, forward)`) that keeps recently-read chunks in RAM. Intuition says "bigger is better, and a small cache is still better than nothing". **This is false for random-access workloads.**

Measured on the CS2 fixture (50 random jumps):

| Cache window | Wall time | vs. no cache |
|---:|---:|---|
| 0 (disabled) | 14.56 s | baseline |
| 2 | **23.20 s** | **+59 % slower** |
| 5 | 16.87 s | +16 % slower |
| 10 | 3.27 s | **4.5× faster** |
| 20 | 3.09 s | plateau (fixture fits in 10) |

**Why.** A too-small window evicts the chunks that were about to be reused. Every jump pays the eviction cost *plus* the reload cost. When the window finally fits the working set, throughput jumps 5×.

**Guidance.** Size `SetCacheWindow` to the expected scrubbing span. Don't pick a small default and hope for the best.

### 2. No universal "faster" format

VTX supports both FlatBuffers and Protobuf. Which is faster depends on the replay:

| Fixture | FBS median | Proto median | Winner | Margin |
|---|---:|---:|---|---|
| CS2 (92 MB, dense per-frame payloads) | 5.56 s | 2.93 s | **Proto** | FBS 90 % slower |
| Rocket League (5 MB, different schema shape) | **0.72 s** | 2.56 s | **FBS** | Proto 3.5× slower |

The flip is driven by per-frame payload size and schema shape — CS2 favours Proto's streaming decode, RL favours FBS's zero-copy access. **Measure on your replay shape before picking a default.**

### 3. Short-circuit diffing is earning its keep

The differ fingerprints two frames (xxHash) before falling back to the full structural diff. On identical frames the shortcut pays off:

| Fixture | Identical (short-circuit) | First-vs-last (worst case) | Speedup |
|---|---:|---:|---:|
| Arena (small) | 6.3 µs | 59 µs | **~10×** |
| CS2 (big) | 66 µs | 131 µs | **~2×** |

Ratio shrinks on big frames because hashing itself becomes non-trivial — but the shortcut is still a clear win.

## What these numbers do *not* prove

- **One machine.** Windows, i9-13900H, 20 threads, SSD. Customer hardware will vary; *ratios* hold, absolute numbers don't.
- **One benchmark run.** `repeats:5` on the heavy workloads, adaptive on the rest. No multi-machine statistical harness yet.
- **No competitor comparison.** We measure VTX against itself.
- **`items_per_second` in google/benchmark uses CPU time, not wall time.** Where wall ≫ CPU (async I/O), that metric overstates user-observable throughput. For customer-facing claims use the wall-time column.
- **One known fixture bug** — `BM_AccessorRandomWithinBucket` inflates its own counter 2× due to a duplicate-push in the shuffle setup. Flagged for a follow-up fix; everything else is trustworthy.

## Running benchmarks locally

The benchmark binary is gated behind `VTX_BUILD_BENCHMARKS` and uses google/benchmark (fetched via `FetchContent`). Release builds only; debug numbers are not meaningful.

```bash
# Configure + build
cmake -S . -B build-bench -DCMAKE_BUILD_TYPE=Release -DVTX_BUILD_BENCHMARKS=ON
cmake --build build-bench --target vtx_benchmarks --config Release --parallel

# Run the full suite (JSON + console output)
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_out=reports/benchmarks/bench_$(date +%Y%m%d_%H%M%S).json \
--benchmark_out_format=json \
--benchmark_counters_tabular=true

# Only the three isolated accessor layers
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_filter='BM_FrameAccessor_Creation|BM_EntityView_SingleGet|BM_AccessorKeyResolution'

# Only the cache-window sweep
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_filter='BM_CS_AccessorRandomAccess_CacheSweep_FBS'
```

Fixtures required: CS, RL, and arena replays under `samples/content/reader/{cs,rl,arena}/`. The small `synth_10k.vtx` fixture is generated at build time by `vtx_sample_write` whenever `VTX_BUILD_BENCHMARKS=ON`.

## Where the raw data lives

| Path | What it is |
|---|---|
| [`reports/benchmarks/bench_20260423_162008.json`](../reports/benchmarks/bench_20260423_162008.json) | Raw google/benchmark JSON — reusable for graphing or regression tracking. |
| [`reports/benchmarks/bench_20260423_162008.txt`](../reports/benchmarks/bench_20260423_162008.txt) | Console output as produced. |

This page is the canonical narrative version of that data. If the benchmarks are re-run, update the numbers here (and commit the new raw outputs alongside) rather than maintaining a parallel per-run markdown report.