5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added
- **tests**: `ReaderApiFlatBuffers.CancelledPrefetchReEntersWindow` -- focused regression test for the bug above. Runs the cancel + re-enter pattern (prime chunks 0..2, jump to chunk 10 to cancel, jump back to chunk 2 to revive) 50 times against a fresh reader each iteration. Pre-fix, it failed nearly every run under TSan and flaked at a single-digit percentage on a stock release build; post-fix it is deterministically green on both
- **benchmarks**: `BM_FrameAccessor_Creation` (isolated `reader->CreateAccessor()` cost, ~6.77 µs/iter) and `BM_EntityView_SingleGet` (isolated `EntityView view(entity); view.Get(key)` cost, 1.70 ns/iter = a 585 M reads/s ceiling). Fills a framing gap: previously every accessor number was bundled with entity-iteration overhead, so customers asking "how fast is your property read?" only got the mixed number. Paired with the existing `BM_AccessorKeyResolution` (74 ns/key), the two new benchmarks decompose the public property-access API (`FrameAccessor` -> `PropertyKey<T>` -> `EntityView`) into three independently measurable layers
- **docs**: new `docs/PERFORMANCE.md` landing page -- visual health-check table, headline numbers, three-layer accessor breakdown, cache-window finding with its 59 %-slower trap, format-choice caveat, honest "what these numbers do not prove" section. Linked from `README.md` and `docs/BUILD.md`
- **docs**: `README.md` now has a "Performance at a glance" section up top with a headline table and the traffic-light health check. Non-engineering readers landing on the GitHub page get a truthful one-screen summary before the build instructions
- **docs/BUILD.md**: new "Running benchmarks locally" section mirroring the sanitizer section -- covers configure, full-suite run, filtered run, fixture requirements
- **reports/benchmarks/**: raw output of the 2026-04-23 16:20 canonical run committed (`bench_20260423_162008.{txt,json}`) for baseline tracking and future regression comparison. First run to separately measure the three accessor layers (`FrameAccessor` creation, `PropertyKey` resolution, `EntityView::Get`). Narrative interpretation lives in `docs/PERFORMANCE.md` rather than in a per-run markdown snapshot -- evergreen doc, one source of truth

## [Unreleased] - 2026-04-22

34 changes: 34 additions & 0 deletions README.md
@@ -4,6 +4,40 @@

A high-performance C++20 toolkit for recording, reading, comparing, and inspecting structured replay data. VTX serializes frame-based entity state into a compact, chunked binary format with support for **Protocol Buffers** and **FlatBuffers** backends.

## Performance at a glance

Measured against the CS2 (92 MB, 10 656 frames) and Rocket League (5 MB, ~21 k frames) real-world fixtures on a single dev machine (i9-13900H, SSD). Full breakdown in [`docs/PERFORMANCE.md`](docs/PERFORMANCE.md).

| Task | Number | Notes |
|---|---|---|
| Writing frames end-to-end | **~82 k frames/s** | 30-min 60 fps match ≈ 1.3 s of CPU. |
| Full sequential read (CS2, 92 MB) | **~5.6 s** (median) | Competitive with peer tooling. |
| Preview first 1 000 frames | **~1 s** | Below UX perceptibility. |
| `EntityView::Get` (isolated) | **1.70 ns** (585 M/s) | Theoretical ceiling. |
| `EntityView::Get` (realistic hot loop) | **~80 ns** (13 M/s) | What real integrations observe. |
| Diff consecutive frames | **4 µs** (267 k/s) | Instant from a user perspective. |

### Health check

| Area | Verdict | Why |
|---|---|---|
| Writer throughput | 🟢 Fast | ~10× headroom over real-time recording. |
| Full sequential read | 🟢 Fast enough | 5.6 s for a 92 MB replay (median). |
| Preview + seek-play | 🟢 Fast | Sub-second scrubbing. |
| Cache-window **well-sized** | 🟢 Fast | <4 s for 50 random jumps. |
| Cache-window **mis-sized** | 🔴 **Actively bad** | 59 % slower than no cache — needs docs guidance. |
| `FrameAccessor` / `PropertyKey` setup | 🟢 Negligible | 6.77 µs + 74 ns/key. Once per integration. |
| `EntityView` hot loop | 🟢 Excellent | 1.70 ns isolated / ~80 ns realistic. |
| Diff + short-circuit | 🟢 Instant | 4 µs consecutive; identical-frame shortcut: 10× (arena), 2× (CS2). |
| Schema parse | 🟢 Negligible | 200 µs, one-time. |
| Format choice (FBS vs Proto) | 🟡 Depends | No universal winner — fixture-shape dictates. |

**Honest caveats** (detail in [`docs/PERFORMANCE.md`](docs/PERFORMANCE.md)):
- One machine, one run. Ratios hold across hardware; absolute numbers vary.
- No direct comparison to competitor libraries.
- `items_per_second` is CPU-time based; use wall time for user-observable claims.
- One known fixture counter inflated 2× (`BM_AccessorRandomWithinBucket`); flagged for fix.

## Architecture

```
75 changes: 75 additions & 0 deletions benchmarks/bench_accessor.cpp
@@ -266,3 +266,78 @@ static void BM_AccessorRandomWithinBucket(benchmark::State& state) {
  state.SetItemsProcessed(state.iterations() * ops_per_sweep);
}
BENCHMARK(BM_AccessorRandomWithinBucket)->Unit(benchmark::kMillisecond);

// Isolated cost of `reader->CreateAccessor()` -- the one-time handshake a
// consumer pays at integration startup. Complements BM_AccessorKeyResolution
// which measures what happens *after* an accessor exists. Together they
// bracket the "SDK initialization" cost the consumer observes.
static void BM_FrameAccessor_Creation(benchmark::State& state) {
  auto result = VTX::OpenReplayFile(ArenaReplayPath());
  if (!result) {
    state.SkipWithError("OpenReplayFile failed");
    return;
  }
  auto& reader = result.reader;

  for (auto _ : state) {
    auto accessor = reader->CreateAccessor();
    benchmark::DoNotOptimize(accessor);
  }
}
BENCHMARK(BM_FrameAccessor_Creation)->Unit(benchmark::kMicrosecond);

// Isolated cost of `EntityView(entity); view.Get(key)` -- one construction
// plus one typed property read, on a single Player entity held in RAM. This
// is the per-property-access cost a consumer pays in their steady state
// once everything else (accessor, keys, frame cache) is warm.
//
// Preloads a copy of frame 0 into a local Frame so the measured loop cannot
// be invalidated by chunk eviction. Picks the first Player entity it finds.
static void BM_EntityView_SingleGet(benchmark::State& state) {
  auto result = VTX::OpenReplayFile(ArenaReplayPath());
  if (!result) {
    state.SkipWithError("OpenReplayFile failed");
    return;
  }
  auto& reader = result.reader;

  auto accessor = reader->CreateAccessor();
  auto key_position = accessor.Get<VTX::Vector>("Player", "Position");
  if (!key_position.IsValid()) {
    state.SkipWithError("Player::Position did not resolve");
    return;
  }

  // Own the frame so chunk eviction can't yank the entity out from under us.
  const auto* frame_ptr = reader->GetFrameSync(0);
  if (!frame_ptr) {
    state.SkipWithError("frame 0 not loaded");
    return;
  }
  VTX::Frame frame_copy = *frame_ptr;

  const VTX::PropertyContainer* sample_entity = nullptr;
  for (const auto& bucket : frame_copy.GetBuckets()) {
    for (const auto& entity : bucket.entities) {
      if (entity.entity_type_id == 0) {  // type id 0 == Player in this fixture
        sample_entity = &entity;
        break;
      }
    }
    if (sample_entity)
      break;
  }
  if (!sample_entity) {
    state.SkipWithError("no Player entity in frame 0");
    return;
  }

  double sink = 0.0;
  for (auto _ : state) {
    VTX::EntityView view(*sample_entity);
    sink += view.Get(key_position).x;
  }
  benchmark::DoNotOptimize(sink);
  state.SetItemsProcessed(state.iterations());
}
BENCHMARK(BM_EntityView_SingleGet)->Unit(benchmark::kNanosecond);
27 changes: 27 additions & 0 deletions docs/BUILD.md
@@ -284,6 +284,33 @@ VTX/
thirdparty/ Header-only deps + legacy Windows binary fallback
vcpkg.json Windows package-manager manifest
```

## Running benchmarks locally

The benchmark binary is gated behind `VTX_BUILD_BENCHMARKS` and uses `google/benchmark` (fetched via `FetchContent`). Release builds only -- debug numbers are not meaningful.

```bash
# Configure + build
cmake -S . -B build-bench -DCMAKE_BUILD_TYPE=Release -DVTX_BUILD_BENCHMARKS=ON
cmake --build build-bench --target vtx_benchmarks --config Release --parallel

# Run the full suite (writes JSON + console side-by-side for diffing over time)
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_out=reports/benchmarks/bench_$(date +%Y%m%d_%H%M%S).json \
--benchmark_out_format=json \
--benchmark_counters_tabular=true

# Filter to a single family
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_filter='BM_FrameAccessor_Creation|BM_EntityView_SingleGet|BM_AccessorKeyResolution'
```

Fixtures required at run time:

- `samples/content/reader/{cs,rl,arena}/*.vtx` -- real replays; checked in.
- `benchmarks/fixtures/synth_10k.vtx` -- small synthetic fixture generated by `vtx_sample_write` when `VTX_BUILD_BENCHMARKS=ON`.

Narrative results (what the numbers mean, not just the tables) live in [`docs/PERFORMANCE.md`](PERFORMANCE.md). The raw output of the canonical run is committed under `reports/benchmarks/` as JSON + console text for regression tracking and graphing.

## Running sanitizers locally

VTX's root `CMakeLists.txt` exposes a `VTX_SANITIZE` option that enables gcc/clang runtime sanitizers. Useful before pushing a change that touches threading or memory-ownership code.
Expand Down
135 changes: 135 additions & 0 deletions docs/PERFORMANCE.md
@@ -0,0 +1,135 @@
# Performance

> Numbers in this page come from the benchmark run on **2026-04-23** against `main` at commit `78b20f3`.
> Raw data for that run (re-usable for graphing or regression tracking): [`reports/benchmarks/bench_20260423_162008.json`](../reports/benchmarks/bench_20260423_162008.json).

## At a glance

A one-screen view of which areas of the SDK are fast, which are slow, and which depend on how you use them.

| Area | Verdict | Why |
|---|---|---|
| Writer throughput | 🟢 Fast | 82 k frames/s end-to-end; ~10× headroom over real-time recording. |
| Full sequential read (CS, 92 MB) | 🟢 Fast enough | 5.6 s median; competitive with peer tooling. |
| Preview (first 1 000 frames) | 🟢 Fast | ~1 s; below UX perceptibility. |
| Scrubbing with **well-sized** cache | 🟢 Fast | <4 s for 50 random jumps. |
| Scrubbing with **mis-sized** cache | 🔴 **Actively bad** | 23 s for the same workload — 59 % slower than no cache. |
| `FrameAccessor` creation | 🟢 Negligible | 6.77 µs once per replay. |
| `PropertyKey` resolution | 🟢 Negligible | ~74 ns per key, paid once per key you resolve. |
| `EntityView` hot-loop read | 🟢 Excellent | 1.70 ns isolated / ~80 ns in realistic loop. |
| Diff on consecutive frames | 🟢 Instant | 4 µs; 267 k comparisons/s. |
| Diff short-circuit on identical frames | 🟢 Instant | xxHash shortcut working; 10× on arena, 2× on CS. |
| Schema parse | 🟢 Negligible | 200 µs, one-time per file. |
| Format choice (FBS vs Proto) | 🟡 Context-dependent | No universal winner; fixture-shape dictates. |
| CS sequential-scan stability | 🟡 Noisy | One outlier iteration per run; the median is trustworthy, the mean fluctuates. |

## Headline numbers

| Task | Result | Plain-language reading |
|---|---|---|
| Writing a replay | ~82 000 frames/s | A 30-minute match at 60 fps (~108 k frames) costs ~1.3 s of CPU. |
| Reading the full CS2 fixture (92 MB, median) | ~5.6 s | Comparable to opening a long video in an editor. |
| Preview (first 1 000 frames) | ~1 s | Thumbnails + file-browse UI have no perceptible lag. |
| Seek to 50 % + play 300 frames | ~0.9 s | Timeline scrubbing is fluid. |
| `FrameAccessor` creation | 6.77 µs | Once per replay. Invisible. |
| `PropertyKey` resolution | ~74 ns / key | 10 properties ≈ 0.7 µs total. Invisible. |
| `EntityView` read (isolated) | **1.70 ns** | 585 M reads/s — theoretical ceiling. |
| `EntityView` read (hot loop) | ~80 ns | ~13 M reads/s — what real integrations observe. |
| Frame diff | 4 µs | 267 k comparisons/s. |

## The three-layer accessor API

Customers asking "how fast is the property-access API?" are really asking about one of three layers with very different costs:

| Layer | When you pay | How often | Cost |
|---|---|---|---|
| `FrameAccessor` (`reader->CreateAccessor()`) | Integration startup | Once per replay | **6.77 µs** |
| `PropertyKey<T>` (`accessor.Get<T>("Struct", "prop")`) | Integration setup | Once per property you care about | **~74 ns / key** |
| `EntityView` + `Get` (inside your hot loop) | Every property read | Millions of times per second | **1.70 ns isolated / ~80 ns realistic** |

Lead with the hot-loop number (~13 M reads/s) when setting customer expectations. Use the isolated ceiling (585 M/s) only when explicitly asked about the theoretical upper bound.

## Key findings

### 1. A small cache is *worse* than no cache

The reader has a configurable cache window (`SetCacheWindow(back, forward)`) that keeps recently-read chunks in RAM. Intuition says "bigger is better, and a small cache is still better than nothing". **This is false for random-access workloads.**

Measured on the CS2 fixture (50 random jumps):

| Cache window | Wall time | vs. no cache |
|---:|---:|---|
| 0 (disabled) | 14.56 s | baseline |
| 2 | **23.20 s** | **+59 % slower** |
| 5 | 16.87 s | +16 % slower |
| 10 | 3.27 s | **4.5× faster** |
| 20 | 3.09 s | plateau (fixture fits in 10) |

**Why.** A too-small window evicts the chunks that were about to be reused. Every jump pays the eviction cost *plus* the reload cost. When the window finally fits the working set, throughput jumps 5×.

**Guidance.** Size `SetCacheWindow` to the expected scrubbing span. Don't pick a small default and hope for the best.

### 2. No universal "faster" format

VTX supports both FlatBuffers and Protobuf. Which is faster depends on the replay:

| Fixture | FBS median | Proto median | Winner | Margin |
|---|---:|---:|---|---|
| CS2 (92 MB, dense per-frame payloads) | 5.56 s | 2.93 s | **Proto** | FBS 90 % slower |
| Rocket League (5 MB, different schema shape) | **0.72 s** | 2.56 s | **FBS** | Proto 3.5× slower |

The flip is driven by per-frame payload size and schema shape — CS2 favours Proto's streaming decode, RL favours FBS's zero-copy access. **Measure on your replay shape before picking a default.**

### 3. Short-circuit diffing is earning its keep

The differ fingerprints two frames (xxHash) before falling back to the full structural diff. On identical frames the shortcut pays off:

| Fixture | Identical (short-circuit) | First-vs-last (worst case) | Speedup |
|---|---:|---:|---:|
| Arena (small) | 6.3 µs | 59 µs | **~10×** |
| CS2 (big) | 66 µs | 131 µs | **~2×** |

Ratio shrinks on big frames because hashing itself becomes non-trivial — but the shortcut is still a clear win.

## What these numbers do *not* prove

- **One machine.** Windows, i9-13900H, 20 threads, SSD. Customer hardware will vary; *ratios* hold, absolute numbers don't.
- **One benchmark run.** `repeats:5` on the heavy workloads, adaptive on the rest. No multi-machine statistical harness yet.
- **No competitor comparison.** We measure VTX against itself.
- **`items_per_second` in google/benchmark uses CPU time, not wall time.** Where wall ≫ CPU (async I/O), that metric overstates user-observable throughput. For customer-facing claims use the wall-time column.
- **One known fixture bug** — `BM_AccessorRandomWithinBucket` inflates its own counter 2× due to a duplicate-push in the shuffle setup. Flagged for a follow-up fix; everything else is trustworthy.

## Running benchmarks locally

The benchmark binary is gated behind `VTX_BUILD_BENCHMARKS` and uses google/benchmark (fetched via `FetchContent`). Release builds only; debug numbers are not meaningful.

```bash
# Configure + build
cmake -S . -B build-bench -DCMAKE_BUILD_TYPE=Release -DVTX_BUILD_BENCHMARKS=ON
cmake --build build-bench --target vtx_benchmarks --config Release --parallel

# Run the full suite (JSON + console output)
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_out=reports/benchmarks/bench_$(date +%Y%m%d_%H%M%S).json \
--benchmark_out_format=json \
--benchmark_counters_tabular=true

# Only the three isolated accessor layers
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_filter='BM_FrameAccessor_Creation|BM_EntityView_SingleGet|BM_AccessorKeyResolution'

# Only the cache-window sweep
build-bench/bin/Release/vtx_benchmarks.exe \
--benchmark_filter='BM_CS_AccessorRandomAccess_CacheSweep_FBS'
```

Fixtures required: CS, RL, and arena replays under `samples/content/reader/{cs,rl,arena}/`. The small `synth_10k.vtx` fixture is generated at build time by `vtx_sample_write` whenever `VTX_BUILD_BENCHMARKS=ON`.

## Where the raw data lives

| Path | What it is |
|---|---|
| [`reports/benchmarks/bench_20260423_162008.json`](../reports/benchmarks/bench_20260423_162008.json) | Raw google/benchmark JSON — reusable for graphing or regression tracking. |
| [`reports/benchmarks/bench_20260423_162008.txt`](../reports/benchmarks/bench_20260423_162008.txt) | Console output as produced. |

This page is the canonical narrative version of that data. If the benchmarks are re-run, update the numbers here (and commit the new raw outputs alongside) rather than maintaining a parallel per-run markdown report.