

Performance

Headline numbers from real fixtures. For the full breakdown (methodology, hardware, caveats, per-benchmark numbers) see docs/PERFORMANCE.md in the repo.

Test fixtures

  • CS2: 92 MB, 10 656 frames.
  • Rocket League: 5 MB, ~21 000 frames.

Measured on a modern dev laptop. One machine, one run. Ratios hold across hardware; absolute numbers vary.

At a glance

| Task | Number | Notes |
| --- | --- | --- |
| Writing frames end-to-end | ~82 k frames/s | A 30-min 60 fps match (~108 k frames) costs ≈ 1.3 s of CPU. |
| Full sequential read (CS2, 92 MB) | ~5.6 s (median) | ~16 MB/s decoded. |
| Preview first 1 000 frames | ~1 s | Fast feedback for a preview UI. |
| EntityView::Get (realistic hot loop) | ~80 ns (13 M/s) | What real integrations observe. See Micro-benchmarks for the methodology and the isolated-loop number. |
| Diff consecutive frames | ~4 µs (267 k/s) | Instant from a user's perspective. |

Health check

| Area | Verdict | Why |
| --- | --- | --- |
| Writer throughput | Fast | ~10× headroom over real-time at 60 fps. |
| Full sequential read | Fast enough | 5.6 s for a 92 MB replay (median). |
| Preview + seek-play | Fast | Sub-second scrubbing. |
| Cache window, well-sized | Fast | < 4 s for 50 random jumps. |
| Cache window, mis-sized | Actively bad | 59 % slower than no cache. See Cache-window sizing. |
| FrameAccessor / PropertyKey setup | Negligible | 6.77 µs + 74 ns/key, once per integration. |
| EntityView hot loop | Excellent | ~80 ns in realistic loops. |
| Diff + short-circuit | Instant | ~4 µs consecutive; 10× / 2× short-circuits. |
| Schema parse | Negligible | ~200 µs, one-time. |
| Format choice (FBS vs. Proto) | Depends | No universal winner; fixture shape dictates. |

Micro-benchmarks

The benchmark suite in benchmarks/ includes an isolated EntityView::Get loop that clocks in at 1.70 ns / op (585 M/s). That number is the theoretical ceiling for the instruction sequence on this CPU, not a realistic integration cost.

Caveats on how to read it:

  • The tight loop fits in L1. A real game integration walks a larger working set and pays ~80 ns per call from memory latency alone.
  • google/benchmark's DoNotOptimize guards the return value but does not prevent all loop-invariant hoisting; treat micro-benchmark numbers as "the call is at worst this cheap," not "this is what you will observe."
  • One known fixture counter inflates 2x (BM_AccessorRandomWithinBucket); flagged for fix in docs/PERFORMANCE.md.

The realistic ~80 ns / 13 M/s number is what to plan against when integrating. The isolated number is useful for regression detection, not capacity planning.
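To make the hoisting caveat concrete, here is roughly what such an isolated loop looks like in google/benchmark terms. This is a sketch, not the suite's actual source: the EntityView interface, the fixture helpers (MakeTestEntityView, ResolveKey), and the property name are assumptions; only benchmark::State, DoNotOptimize, and BENCHMARK are real google/benchmark API.

#include <benchmark/benchmark.h>

// Sketch of an isolated Get loop -- fixture helpers and the EntityView
// interface are assumed, not copied from the benchmark suite.
static void BM_EntityViewGet(benchmark::State& state) {
  EntityView view = MakeTestEntityView();       // hypothetical fixture setup
  PropertyKey key = ResolveKey("m_vecOrigin");  // hypothetical, resolved once
  for (auto _ : state) {
    // DoNotOptimize keeps the return value alive, but `view` and `key`
    // are loop-invariant: address computation can still be hoisted, and
    // after one iteration the whole working set sits in L1. Hence
    // 1.70 ns/op here versus ~80 ns in a real integration.
    benchmark::DoNotOptimize(view.Get(key));
  }
}
BENCHMARK(BM_EntityViewGet);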

Cache-window sizing

The reader's chunk cache is the single sharpest performance edge in the SDK. A well-sized cache speeds random access dramatically; a mis-sized cache can be 59 % slower than no cache at all.

The knob is IVtxReaderFacade::SetCacheWindow(backward, forward). Defaults favour sequential playback. The "mis-sized default" failure mode shows up when you run a random-access workload against the default window: the cache evicts chunks that are about to be re-requested. Two safe patterns, sketched in code after this list:

  • Sequential playback / short-range scrubbing: keep the default window. Optionally call WarmAt(next_frame) before a known jump.
  • Heavy random access (timeline scrubbing, analytics): either raise the window until your working set fits, or call SetCacheWindow(0, 0) to disable caching entirely. Zero cache beats a wrong cache.
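A minimal sketch of the two patterns, assuming a reader handle is already open. Only SetCacheWindow and WarmAt come from this page; the workload enum, the window sizes, and the function signature are illustrative.

#include <cstdint>

enum class Workload { kSequentialPlayback, kHeavyRandomAccess };

// Sketch: the enum, sizes, and plumbing are assumptions, not SDK API.
void ConfigureCache(IVtxReaderFacade& reader, Workload workload,
                    uint64_t next_known_jump) {
  switch (workload) {
    case Workload::kSequentialPlayback:
      // Keep the default window; optionally pre-warm a known jump target.
      reader.WarmAt(next_known_jump);
      break;
    case Workload::kHeavyRandomAccess:
      // Grow the window until the working set fits (sizes are placeholders)...
      reader.SetCacheWindow(/*backward=*/256, /*forward=*/256);
      // ...or disable caching entirely with SetCacheWindow(0, 0).
      // Zero cache beats a wrong cache.
      break;
  }
}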

If you don't know your access pattern yet, measure before tuning. Sizing is workload-specific and the wrong setting is worse than none.

A better default is a known gap. Track the discussion in the issue tracker, or open a proposal: an adaptive policy that observes access patterns and resizes the window automatically would close this cliff for most users.

Hot-loop guidance

If you are reading a lot of properties per frame:

  • Resolve PropertyKeys once during setup (74 ns each, one-time cost).
  • Use PropertyAddressCache for O(1) lookup rather than name-based queries.
  • Batch property reads per entity to keep cache lines hot.
  • The ~80 ns / call figure assumes a warm cache line. Bouncing between entities and properties randomly will be slower; stride your access to be locality-friendly (see the sketch below).
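Putting the list together, a hedged sketch of the setup/hot-loop split. PropertyKey, PropertyAddressCache, and EntityView are named on this page; their exact signatures, the property names, and the Frame/Process types are assumptions.

// Assumed shapes throughout -- only the type names come from this page.
struct PlayerKeys {
  PropertyKey origin;
  PropertyKey health;
};

// Setup: resolve keys once (~74 ns each, one-time cost).
PlayerKeys ResolvePlayerKeys(PropertyAddressCache& cache) {
  return PlayerKeys{cache.Resolve("m_vecOrigin"),   // hypothetical names
                    cache.Resolve("m_iHealth")};
}

// Hot loop: batch reads per entity so consecutive Gets stay on warm
// cache lines instead of bouncing across the working set.
void ReadFrame(const Frame& frame, const PlayerKeys& keys) {
  for (EntityView entity : frame.Entities()) {
    auto origin = entity.Get(keys.origin);  // ~80 ns with a warm line
    auto health = entity.Get(keys.health);  // same entity, same locality
    Process(origin, health);                // assumed consumer
  }
}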

Diffing

The Diff Engine computes deltas between two frame states in ~4 µs per consecutive pair (267 k diffs/s). Non-consecutive diffs short-circuit: a 10× shortcut for large jumps and a 2× shortcut for common patterns. Diffing is effectively free in a real-time playback loop.
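For scale, a sketch of diffing inside a 60 fps playback loop. The Diff Engine's entry point is not documented on this page, so DiffFrames, FrameDelta, and ApplyDelta are placeholders for the idea, not SDK API.

// Placeholders throughout: the real Diff Engine API is in the SDK docs.
FrameDelta delta = DiffFrames(previous_frame, current_frame);
// ~4 µs per consecutive pair against a 16.7 ms frame budget at 60 fps
// is ~0.02 % of the frame -- effectively free.
ApplyDelta(render_state, delta);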

Choosing a serialisation backend

Protobuf produces smaller files; FlatBuffers is faster at diffing and data access. Neither wins universally. The right choice depends on fixture shape and what your pipeline is bottlenecked on.

  • I/O-bound pipeline, archival, or network transmission: Protobuf.
  • CPU-bound real-time playback or heavy random access: FlatBuffers.
  • Not sure: start with one, switch if measurement says so. Both are supported by the same SDK; the format is announced in the file's magic bytes and auto-detected on read.
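In practice that makes the backend a write-time decision only. A sketch, with every name assumed; the page confirms only that the format is chosen when writing and auto-detected from magic bytes when reading.

// All names here are illustrative -- check the SDK headers for the real ones.
VtxWriterOptions options;
options.backend = SerialisationBackend::kFlatBuffers;  // or kProtobuf
auto writer = CreateWriter("match.vtx", options);

// Readers never name a backend: the file's magic bytes announce the format
// and the SDK auto-detects it on open.
auto reader = CreateReader("match.vtx");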

See File Format → Serialisation backends for the wider comparison.

Caveats

  • One machine, one run. Ratios hold across hardware; absolute numbers vary.
  • No direct comparison to competitor libraries included in-tree.
  • items_per_second is CPU-time-based; use wall time for user-observable claims.
  • One known fixture counter inflated 2x (BM_AccessorRandomWithinBucket); flagged for fix in docs/PERFORMANCE.md.

Reproducing the benchmarks

The benchmark suite is built with google/benchmark and gated behind a CMake flag:

cmake -S . -B build -A x64 -DVTX_BUILD_BENCHMARKS=ON
cmake --build build --config Release --parallel

Benchmark binaries install alongside the SDK. Full instructions and the fixture layout are in docs/PERFORMANCE.md.
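Once built, the binaries accept google/benchmark's standard flags; for example, --benchmark_filter narrows a run to one benchmark family. The binary name and path below are assumptions; check your build tree.

build/benchmarks/Release/vtx_benchmarks --benchmark_filter=EntityView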
