

Performance

Headline numbers from real fixtures. For the full breakdown (methodology, hardware, caveats, per-benchmark numbers) see docs/PERFORMANCE.md in the repo.

Test fixtures

  • CS2: 92 MB, 10 656 frames.
  • Rocket League: 5 MB, ~21 000 frames.

Measured on a modern dev laptop. One machine, one run. Ratios hold across hardware; absolute numbers vary.

At a glance

| Task | Number | Notes |
| --- | --- | --- |
| Writing frames end-to-end | ~82 k frames/s | A 30-min 60 fps match (~108 k frames) costs ≈ 1.3 s of CPU. |
| Full sequential read (CS2, 92 MB) | ~5.6 s (median) | ~16 MB/s decoded. |
| Preview first 1 000 frames | ~1 s | Fast feedback for a preview UI. |
| EntityView::Get (realistic hot loop) | ~80 ns (13 M/s) | What real integrations observe. See Micro-benchmarks for the methodology and the isolated-loop number. |
| Diff consecutive frames | ~4 µs (267 k/s) | Instant from a user's perspective. |

Health check

| Area | Verdict | Why |
| --- | --- | --- |
| Writer throughput | Fast | ~10× headroom over real-time at 60 fps. |
| Full sequential read | Fast enough | 5.6 s for a 92 MB replay (median). |
| Preview + seek-play | Fast | Sub-second scrubbing. |
| Cache window, well-sized | Fast | < 4 s for 50 random jumps. |
| Cache window, mis-sized | Actively bad | 59 % slower than no cache. See Cache-window sizing. |
| FrameAccessor / PropertyKey setup | Negligible | 6.77 µs + 74 ns/key, once per integration. |
| EntityView hot loop | Excellent | ~80 ns in realistic loops. |
| Diff + short-circuit | Instant | ~4 µs consecutive; 10× / 2× short-circuits. |
| Schema parse | Negligible | ~200 µs, one-time. |
| Format choice (FBS vs. Proto) | Depends | No universal winner; fixture shape dictates. |

Micro-benchmarks

The benchmark suite in benchmarks/ includes an isolated EntityView::Get loop that clocks in at 1.70 ns / op (585 M/s). That number is the theoretical ceiling for the instruction sequence on this CPU, not a realistic integration cost.

Caveats on how to read it:

  • The tight loop fits in L1. A real game integration walks a larger working set and pays ~80 ns per call from memory latency alone.
  • google/benchmark's DoNotOptimize guards the return value but does not prevent all loop-invariant hoisting; treat micro-benchmark numbers as "the call is at worst this cheap," not "this is what you will observe."
  • One known fixture counter inflates 2x (BM_AccessorRandomWithinBucket); flagged for fix in docs/PERFORMANCE.md.

The realistic ~80 ns / 13 M/s number is what to plan against when integrating. The isolated number is useful for regression detection, not capacity planning.
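To make the hoisting caveat concrete, here is roughly what such an isolated loop looks like in google/benchmark terms. This is a sketch, not the suite's actual source: the EntityView interface, the fixture helpers (MakeTestEntityView, ResolveKey), and the property name are assumptions; only benchmark::State, DoNotOptimize, and BENCHMARK are real google/benchmark API.

#include <benchmark/benchmark.h>

// Sketch of an isolated Get loop -- fixture helpers and the EntityView
// interface are assumed, not copied from the benchmark suite.
static void BM_EntityViewGet(benchmark::State& state) {
  EntityView view = MakeTestEntityView();       // hypothetical fixture setup
  PropertyKey key = ResolveKey("m_vecOrigin");  // hypothetical, resolved once
  for (auto _ : state) {
    // DoNotOptimize keeps the return value alive, but `view` and `key`
    // are loop-invariant: address computation can still be hoisted, and
    // after one iteration the whole working set sits in L1. Hence
    // 1.70 ns/op here versus ~80 ns in a real integration.
    benchmark::DoNotOptimize(view.Get(key));
  }
}
BENCHMARK(BM_EntityViewGet);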

Cache-window sizing

The reader's chunk cache is the single sharpest performance edge in the SDK. A well-sized cache speeds random access dramatically; a mis-sized cache can be 59 % slower than no cache at all.

The knob is IVtxReaderFacade::SetCacheWindow(backward, forward). Defaults favour sequential playback. The "mis-sized default" failure mode shows up when you run a random-access workload against the default window: the cache evicts chunks that are about to be re-requested. Two safe patterns, sketched in code after this list:

  • Sequential playback / short-range scrubbing: keep the default window. Optionally call WarmAt(next_frame) before a known jump.
  • Heavy random access (timeline scrubbing, analytics): either raise the window until your working set fits, or call SetCacheWindow(0, 0) to disable caching entirely. Zero cache beats a wrong cache.
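A minimal sketch of the two patterns, assuming a reader handle is already open. Only SetCacheWindow and WarmAt come from this page; the workload enum, the window sizes, and the function signature are illustrative.

#include <cstdint>

enum class Workload { kSequentialPlayback, kHeavyRandomAccess };

// Sketch: the enum, sizes, and plumbing are assumptions, not SDK API.
void ConfigureCache(IVtxReaderFacade& reader, Workload workload,
                    uint64_t next_known_jump) {
  switch (workload) {
    case Workload::kSequentialPlayback:
      // Keep the default window; optionally pre-warm a known jump target.
      reader.WarmAt(next_known_jump);
      break;
    case Workload::kHeavyRandomAccess:
      // Grow the window until the working set fits (sizes are placeholders)...
      reader.SetCacheWindow(/*backward=*/256, /*forward=*/256);
      // ...or disable caching entirely with SetCacheWindow(0, 0).
      // Zero cache beats a wrong cache.
      break;
  }
}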

If you don't know your access pattern yet, measure before tuning. Sizing is workload-specific and the wrong setting is worse than none.

A better default is a known gap. Track the discussion in the issue tracker, or open a proposal: an adaptive policy that observes access patterns and resizes the window automatically would close this cliff for most users.

Hot-loop guidance

If you are reading a lot of properties per frame:

  • Resolve PropertyKeys once during setup (74 ns each, one-time cost).
  • Use PropertyAddressCache for O(1) lookup rather than name-based queries.
  • Batch property reads per entity to keep cache lines hot.
  • The ~80 ns / call figure assumes a warm cache line. Bouncing between entities and properties randomly will be slower; stride your access to be locality-friendly (see the sketch below).
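Putting the list together, a hedged sketch of the setup/hot-loop split. PropertyKey, PropertyAddressCache, and EntityView are named on this page; their exact signatures, the property names, and the Frame/Process types are assumptions.

// Assumed shapes throughout -- only the type names come from this page.
struct PlayerKeys {
  PropertyKey origin;
  PropertyKey health;
};

// Setup: resolve keys once (~74 ns each, one-time cost).
PlayerKeys ResolvePlayerKeys(PropertyAddressCache& cache) {
  return PlayerKeys{cache.Resolve("m_vecOrigin"),   // hypothetical names
                    cache.Resolve("m_iHealth")};
}

// Hot loop: batch reads per entity so consecutive Gets stay on warm
// cache lines instead of bouncing across the working set.
void ReadFrame(const Frame& frame, const PlayerKeys& keys) {
  for (EntityView entity : frame.Entities()) {
    auto origin = entity.Get(keys.origin);  // ~80 ns with a warm line
    auto health = entity.Get(keys.health);  // same entity, same locality
    Process(origin, health);                // assumed consumer
  }
}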

Diffing

The Diff Engine computes deltas between two frame states in ~4 µs per consecutive pair (267 k diffs/s). Non-consecutive diffs short-circuit: a 10× shortcut for large jumps and a 2× shortcut for common patterns. Diffing is effectively free in a real-time playback loop.
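For scale, a sketch of diffing inside a 60 fps playback loop. The Diff Engine's entry point is not documented on this page, so DiffFrames, FrameDelta, and ApplyDelta are placeholders for the idea, not SDK API.

// Placeholders throughout: the real Diff Engine API is in the SDK docs.
FrameDelta delta = DiffFrames(previous_frame, current_frame);
// ~4 µs per consecutive pair against a 16.7 ms frame budget at 60 fps
// is ~0.02 % of the frame -- effectively free.
ApplyDelta(render_state, delta);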

Choosing a serialisation backend

Protobuf produces smaller files; FlatBuffers is faster at diffing and data access. Neither wins universally. The right choice depends on fixture shape and what your pipeline is bottlenecked on.

  • I/O-bound pipeline, archival, or network transmission: Protobuf.
  • CPU-bound real-time playback or heavy random access: FlatBuffers.
  • Not sure: start with one, switch if measurement says so. Both are supported by the same SDK; the format is announced in the file's magic bytes and auto-detected on read.
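In practice that makes the backend a write-time decision only. A sketch, with every name assumed; the page confirms only that the format is chosen when writing and auto-detected from magic bytes when reading.

// All names here are illustrative -- check the SDK headers for the real ones.
VtxWriterOptions options;
options.backend = SerialisationBackend::kFlatBuffers;  // or kProtobuf
auto writer = CreateWriter("match.vtx", options);

// Readers never name a backend: the file's magic bytes announce the format
// and the SDK auto-detects it on open.
auto reader = CreateReader("match.vtx");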

See File Format → Serialisation backends for the wider comparison.

Caveats

  • One machine, one run. Ratios hold across hardware; absolute numbers vary.
  • No direct comparison to competitor libraries included in-tree.
  • items_per_second is CPU-time-based; use wall time for user-observable claims.
  • One known fixture counter inflated 2x (BM_AccessorRandomWithinBucket); flagged for fix in docs/PERFORMANCE.md.

Reproducing the benchmarks

The benchmark suite is built with google/benchmark and gated behind a CMake flag:

cmake -S . -B build -A x64 -DVTX_BUILD_BENCHMARKS=ON
cmake --build build --config Release --parallel

Benchmark binaries install alongside the SDK. Full instructions and the fixture layout are in docs/PERFORMANCE.md.
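Once built, the binaries accept google/benchmark's standard flags; for example, --benchmark_filter narrows a run to one benchmark family. The binary name and path below are assumptions; check your build tree.

build/benchmarks/Release/vtx_benchmarks --benchmark_filter=EntityView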
