# Performance
Headline numbers from real fixtures. For the full breakdown (methodology, hardware, caveats, per-benchmark numbers) see `docs/PERFORMANCE.md` in the repo.
- CS2: 92 MB, 10 656 frames.
- Rocket League: 5 MB, ~21 000 frames.
Measured on a modern dev laptop. One machine, one run. Ratios hold across hardware; absolute numbers vary.
| Task | Number | Notes |
|---|---|---|
| Writing frames end-to-end | ~82 k frames/s | A 30-min 60 fps match ≈ 1.3 s of CPU. |
| Full sequential read (CS2, 92 MB) | ~5.6 s (median) | ~16 MB/s decoded. |
| Preview first 1 000 frames | ~1 s | Fast feedback for preview UI. |
| `EntityView::Get` (realistic hot loop) | ~80 ns (13 M/s) | What real integrations observe. See Micro-benchmarks for the methodology and the isolated-loop number. |
| Diff consecutive frames | 4 µs (267 k/s) | Instant from a user perspective. |
| Area | Verdict | Why |
|---|---|---|
| Writer throughput | Fast | ~10x headroom over real-time at 60 fps. |
| Full sequential read | Fast enough | 5.6 s for a 92 MB replay (median). |
| Preview + seek-play | Fast | Sub-second scrubbing. |
| Cache window, well-sized | Fast | < 4 s for 50 random jumps. |
| Cache window, mis-sized | Actively bad | 59 % slower than no cache. See Cache-window sizing. |
| `FrameAccessor` / `PropertyKey` setup | Negligible | 6.77 µs + 74 ns/key. Once per integration. |
| `EntityView` hot loop | Excellent | ~80 ns in realistic loops. |
| Diff + short-circuit | Instant | 4 µs consecutive; 10x and 2x shortcuts for non-consecutive diffs. |
| Schema parse | Negligible | 200 µs, one-time. |
| Format choice (FBS vs. Proto) | Depends | No universal winner. Fixture shape dictates. |
## Micro-benchmarks

The benchmark suite in `benchmarks/` includes an isolated `EntityView::Get` loop that clocks in at 1.70 ns/op (585 M/s). That number is the theoretical ceiling for the instruction sequence on this CPU, not a realistic integration cost.
Caveats on how to read it:
- The tight loop fits in L1. A real game integration walks a larger working set and pays ~80 ns per call from memory latency alone.
- google/benchmark's `DoNotOptimize` guards the return value but does not prevent all loop-invariant hoisting; treat micro-benchmark numbers as "the call is at worst this cheap," not "this is what you will observe."
- One known fixture counter inflates 2x (`BM_AccessorRandomWithinBucket`); flagged for fix in `docs/PERFORMANCE.md`.
The realistic ~80 ns / 13 M/s number is what to plan against when integrating. The isolated number is useful for regression detection, not capacity planning.
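For orientation, the shape of such an isolated loop looks like the sketch below. It is deliberately self-contained: an L1-resident array stands in for `EntityView::Get` so the snippet compiles without the SDK (see `benchmarks/` for the real suite).

```cpp
#include <array>
#include <cstdint>

#include <benchmark/benchmark.h>

// Self-contained stand-in for the isolated hot loop: a 64-entry array that
// fits in L1 plays the role of EntityView::Get so this sketch compiles on
// its own. The shape of the loop is the point, not its body.
static void BM_IsolatedHotLoop(benchmark::State& state) {
  std::array<std::uint64_t, 64> values{};
  std::size_t i = 0;
  for (auto _ : state) {
    // DoNotOptimize keeps the result live, but per the caveat above it does
    // not block all loop-invariant hoisting, so read the resulting number
    // as "the call is at worst this cheap".
    benchmark::DoNotOptimize(values[i++ & 63]);
  }
}
BENCHMARK(BM_IsolatedHotLoop);
BENCHMARK_MAIN();
```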
## Cache-window sizing

The reader's chunk cache is the single sharpest performance edge in the SDK. A well-sized cache speeds random access dramatically; a mis-sized cache can be 59 % slower than no cache at all.
The knob is `IVtxReaderFacade::SetCacheWindow(backward, forward)`. Defaults favour sequential playback. The "mis-sized default" failure mode shows up when you run a random-access workload against the default window: the cache evicts chunks that are about to be re-requested. Two safe patterns, sketched in code after this list:
- Sequential playback / short-range scrubbing: keep the default window. Optionally call `WarmAt(next_frame)` before a known jump.
- Heavy random access (timeline scrubbing, analytics): either raise the window until your working set fits, or call `SetCacheWindow(0, 0)` to disable caching entirely. Zero cache beats a wrong cache.
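A minimal sketch of both patterns. `SetCacheWindow` and `WarmAt` are the knobs named above; the `vtx` namespace, parameter types, and how the facade is obtained are assumptions:

```cpp
#include <cstdint>

// Assumed: a vtx::IVtxReaderFacade obtained elsewhere, and WarmAt taking a
// frame index. Only the two calls below are documented in this section.
void ConfigureCache(vtx::IVtxReaderFacade& reader,
                    bool heavy_random_access,
                    std::uint32_t known_jump_target) {
  if (heavy_random_access) {
    // Timeline scrubbing / analytics: if raising the window until the
    // working set fits is impractical, disable the cache outright --
    // zero cache beats a wrong cache.
    reader.SetCacheWindow(/*backward=*/0, /*forward=*/0);
  } else {
    // Sequential playback / short-range scrubbing: keep the default
    // window and optionally pre-warm before a jump you know is coming.
    reader.WarmAt(known_jump_target);
  }
}
```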
If you don't know your access pattern yet, measure before tuning. Sizing is workload-specific and the wrong setting is worse than none.
A better default is a known gap. Track the discussion in the issue tracker or open a proposal: an adaptive policy that observes access patterns and resizes the window automatically would close this cliff for most users.
If you are reading a lot of properties per frame:
- Resolve `PropertyKey`s once during setup (74 ns each, one-time cost).
- Use `PropertyAddressCache` for O(1) lookup rather than name-based queries.
- Batch property reads per entity to keep cache lines hot.
- The ~80 ns/call figure assumes a warm cache line. Bouncing between entities and properties at random will be slower; stride your access to be locality-friendly. (A sketch of this setup pattern follows the list.)
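The sketch below shows the setup-once / read-hot split. `PropertyKey` and `FrameAccessor` are named in this section, but the `ResolveKey`/`Get` signatures, entity iteration, and `OnPlayer` consumer are illustrative assumptions:

```cpp
// Assumed API shapes throughout -- only the type names come from this doc.
struct PlayerKeys {
  vtx::PropertyKey pos_x, pos_y, health;
};

// Setup: ~74 ns per key, paid once per integration, never in the hot path.
PlayerKeys ResolveKeysOnce(vtx::FrameAccessor& accessor) {
  return {accessor.ResolveKey("pos_x"),
          accessor.ResolveKey("pos_y"),
          accessor.ResolveKey("health")};
}

// Hot path: batch all reads for one entity together so its cache lines stay
// warm; each Get is ~80 ns when the line is already resident.
void ReadFrame(vtx::FrameAccessor& accessor, const PlayerKeys& keys) {
  for (auto entity : accessor.Entities()) {  // assumed iteration API
    OnPlayer(entity.Get(keys.pos_x),
             entity.Get(keys.pos_y),
             entity.Get(keys.health));       // OnPlayer: hypothetical sink
  }
}
```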
The Diff Engine computes deltas between two frame states at ~4 µs per consecutive pair (267 k diffs/s). Non-consecutive diffs have short-circuits: a 10x shortcut for large jumps and a 2x shortcut for common patterns. Diffing is effectively free in a real-time playback loop.
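To see why it is effectively free, put the numbers in a playback-loop shape. Everything below except the figures is an assumed API (`DiffEngine`, `FrameState`, `ReadFrame`, `FrameCount`, and `Apply` are illustrative names, not the SDK's confirmed entry points):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical playback loop: ~4 us per consecutive diff against a 16.7 ms
// frame budget at 60 fps leaves diffing well inside the noise floor.
void PlaybackWithDiffs(vtx::IVtxReaderFacade& reader, vtx::DiffEngine& engine) {
  vtx::FrameState prev = reader.ReadFrame(0);
  for (std::uint32_t f = 1; f < reader.FrameCount(); ++f) {
    vtx::FrameState cur = reader.ReadFrame(f);
    Apply(engine.Diff(prev, cur));  // ~4 us for consecutive frames
    prev = std::move(cur);
  }
}
```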
Protobuf produces smaller files; FlatBuffers is faster at diffing and data access. Neither wins universally. The right choice depends on fixture shape and what your pipeline is bottlenecked on.
- I/O-bound pipeline, archival, or network transmission: Protobuf.
- CPU-bound real-time playback or heavy random access: FlatBuffers.
- Not sure: start with one and switch if measurement says so. Both are supported by the same SDK; the format is announced in the file's magic bytes and auto-detected on read (see the sketch below).
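In code, the choice only exists on the write side; reading is format-blind. A sketch, with every identifier assumed (`CreateWriter`, `Open`, and the `Format` enum are illustrative names modelling the behaviour described above, not the SDK's confirmed API):

```cpp
// Hypothetical calls: explicit format on write, magic-byte detection on read.
auto writer = vtx::CreateWriter("match.vtx", vtx::Format::kFlatBuffers);
// ... write frames ...
auto reader = vtx::Open("match.vtx");  // format read from the magic bytes
```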
See the File Format section (Serialisation backends) for the wider comparison.
Known limitations:
- One machine, one run. Ratios hold across hardware; absolute numbers vary.
- No direct comparison to competitor libraries included in-tree.
- `items_per_second` is CPU-time-based; use wall time for user-observable claims.
- One known fixture counter inflated 2x (`BM_AccessorRandomWithinBucket`); flagged for fix in `docs/PERFORMANCE.md`.
The benchmark suite is built with google/benchmark and gated behind a CMake flag:

```sh
cmake -S . -B build -A x64 -DVTX_BUILD_BENCHMARKS=ON
cmake --build build --config Release --parallel
```

Benchmark binaries install alongside the SDK. Full instructions and the fixture layout are in `docs/PERFORMANCE.md`.
VTX is an open, self-describing binary format for real-time state data. Apache-2.0. (c) 2026 Zenos Interactive.