perf(ffi): fold legacy qjd_parse to single allocation; preserve pooled API#13
membphis wants to merge 2 commits into
The decoder/document pooling refactor (commit 0721d7d) split the legacy qjd_parse path into two heap allocations (`qjd_decoder` + `qjd_doc`) and made every `qjd_get_*` walk a state-machine check via `check_doc_alive`. That made the new pooled API faster when the decoder is reused, but left the one-shot `qjd_parse()` users paying double-alloc / double-free per call plus a gen check per field access, with no upside since their doc cannot be stale.

Co-locate `qjd_doc` and its private `qjd_decoder` in a single `#[repr(C)] OwnedDocBlock` allocation:
- qjd_parse: one Box::new instead of two; parse in place to avoid a ~100-byte stack-to-heap memcpy of the freshly-constructed Decoder.
- qjd_free: branch on owns_decoder and recover the block pointer via the offset-0 invariant for the legacy path.
- check_doc_alive: legacy docs skip the state/gen check (unreachable by construction, since the decoder is private) and reach the decoder by static offset from the doc pointer, not by loading doc.decoder.

The pooled API (`qjd_decoder_new` + `qjd_decoder_parse`) is unchanged and still owns_decoder = false. All existing tests, including the count-allocs gate (`pooled < legacy / 2`), still pass.
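The layout described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the struct fields, the `generation` counter, and the function names (`parse_owned`, `decoder_parse`, `free_doc`, `decoder_free`) are assumptions, and real parsing and doc pooling are left out.

```rust
use std::ptr;

// All names below are illustrative stand-ins, not the crate's real definitions.
struct Decoder {
    generation: u32, // assumed: bumped on every pooled parse
}

#[repr(C)]
struct Doc {
    decoder: *mut Decoder, // only loaded on the pooled path
    generation: u32,       // decoder generation captured at parse time
    owns_decoder: bool,    // true for the legacy one-shot path
}

// #[repr(C)] keeps declaration order, so `doc` sits at offset 0 and a pointer
// to the block is also a valid pointer to the doc (the offset-0 invariant).
#[repr(C)]
struct OwnedDocBlock {
    doc: Doc,
    decoder: Decoder,
}

/// Legacy-style parse: one Box holds both the doc and its private decoder.
fn parse_owned() -> *mut Doc {
    let raw = Box::into_raw(Box::new(OwnedDocBlock {
        doc: Doc { decoder: ptr::null_mut(), generation: 0, owns_decoder: true },
        decoder: Decoder { generation: 0 },
    }));
    // SAFETY: `raw` was just produced by Box::into_raw and is uniquely owned.
    unsafe { (*raw).doc.decoder = ptr::addr_of_mut!((*raw).decoder) };
    raw.cast::<Doc>()
}

fn decoder_new() -> *mut Decoder {
    Box::into_raw(Box::new(Decoder { generation: 0 }))
}

/// Pooled-style parse: the doc borrows a caller-owned decoder. Each parse
/// bumps the generation, which marks docs from earlier parses as stale.
unsafe fn decoder_parse(dec: *mut Decoder) -> *mut Doc {
    unsafe {
        (*dec).generation += 1;
        Box::into_raw(Box::new(Doc {
            decoder: dec,
            generation: (*dec).generation,
            owns_decoder: false,
        }))
    }
}

/// Returns the decoder if the doc is still valid, None if it is stale.
unsafe fn check_doc_alive(doc: *mut Doc) -> Option<*mut Decoder> {
    unsafe {
        if (*doc).owns_decoder {
            // Fast path: no generation check, decoder found by static offset.
            let block = doc.cast::<OwnedDocBlock>();
            return Some(ptr::addr_of_mut!((*block).decoder));
        }
        let dec = (*doc).decoder;
        if (*dec).generation == (*doc).generation { Some(dec) } else { None }
    }
}

/// Frees a doc; the legacy path recovers the whole block from the doc pointer.
unsafe fn free_doc(doc: *mut Doc) {
    unsafe {
        if (*doc).owns_decoder {
            drop(Box::from_raw(doc.cast::<OwnedDocBlock>()));
        } else {
            drop(Box::from_raw(doc));
        }
    }
}

unsafe fn decoder_free(dec: *mut Decoder) {
    unsafe { drop(Box::from_raw(dec)) }
}
```

In this sketch a doc from `parse_owned` always passes `check_doc_alive`, while a pooled doc turns stale as soon as the same decoder parses again. The pooled docs here are boxed individually only to keep the example short.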
The single-run-with-mean output the bench used to print swung 30-40%
between invocations on noisy machines, making it hard to tell signal
from noise when comparing perf commits.
- bench() now runs a warmup pass (JIT trace compile, pool fill), then
five timed rounds. Reports median and mean ops/s plus the round-by-
round min..max range so reviewers can see whether a delta is real.
- Add an `interleaved 100k,200k,500k,1m` scenario that rotates through
four payload sizes, matching a server that handles varying request
sizes back to back. The single-payload loops cannot exercise the
doc pool the way real traffic does.
- For each scenario, probe `qd.new_decoder` and run two extra qd
variants when present:
    quickdecode pooled :parse          reused decoder across iters
    quickdecode new_decoder()+parse    one-shot per iter (no reuse)
So a reader can directly compare the legacy qd.parse path, the
pool-API-with-reuse path, and the realistic "user creates a fresh
decoder per request" pattern in one bench run.
Also ship benches/perf_probe.lua: a minimal hammer over qd.parse on a
fixed payload for use under `perf record` when investigating FFI hot
paths. Not invoked by Makefile targets.
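
The reporting scheme described above (one warmup pass, several timed rounds, then median, mean, and min..max of ops/s) can be sketched as below. The actual bench is a Lua script; this Rust version only mirrors the statistics, and `RoundStats`, `summarize`, and `bench` are hypothetical names chosen for the illustration.

```rust
use std::time::Instant;

// Hypothetical summary of per-round throughput (ops/s).
struct RoundStats {
    median: f64,
    mean: f64,
    min: f64,
    max: f64,
}

/// Summarize per-round ops/s values; panics on an empty slice.
fn summarize(rounds: &[f64]) -> RoundStats {
    assert!(!rounds.is_empty(), "need at least one round");
    let mut sorted = rounds.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let mean = sorted.iter().sum::<f64>() / n as f64;
    RoundStats { median, mean, min: sorted[0], max: sorted[n - 1] }
}

/// Run one untimed warmup pass of `iters` calls, then `rounds` timed passes.
fn bench<F: FnMut()>(mut work: F, iters: u32, rounds: usize) -> RoundStats {
    for _ in 0..iters {
        work(); // warmup: results discarded
    }
    let mut ops_per_sec = Vec::with_capacity(rounds);
    for _ in 0..rounds {
        let start = Instant::now();
        for _ in 0..iters {
            work();
        }
        // Clamp so a very fast round cannot divide by zero.
        let secs = start.elapsed().as_secs_f64().max(1e-9);
        ops_per_sec.push(iters as f64 / secs);
    }
    summarize(&ops_per_sec)
}
```

Reporting the median next to the min..max range shows whether a delta between two commits is larger than the round-to-round spread.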
Closing in favor of a clean revert of #11 (commit 0721d7d). Final benches (3-run median-of-medians) showed this fold approach still ran 14-27% behind scan-only on 100 KB–5 MB payloads, the very range that dominates the API gateway traffic this library serves. Given the production usage pattern (fresh decoder per request, no reuse), the pool API does not pay off, and the residual fold cost is not worth retaining for the optionality. Revert PR coming next.
Summary
- Co-locate `qjd_doc` and its private `qjd_decoder` in a single `#[repr(C)] OwnedDocBlock` allocation, restoring the legacy `qjd_parse` path to one-Box-per-parse (was two after #11, "perf: decoder/document instance pooling (#6)").
- `check_doc_alive` now fast-paths `owns_decoder == true` docs: their decoder is unreachable to the pool API, so state/gen are pristine by construction. It skips the runtime checks and reaches the decoder by static offset from the doc pointer.
- Bench compares `qd.parse` (legacy), `qd.new_decoder():parse` (pooled, reused), and `qd.new_decoder():parse` (one-shot per iter).
- Add `benches/perf_probe.lua`, a minimal hammer for use under `perf record` when investigating FFI hot paths.

The pooled API (`qjd_decoder_new` + `qjd_decoder_parse`) is unchanged. All existing tests pass, including the count-allocs gate (`pooled < legacy / 2`).

Why
PR #11 (commit 0721d7d) introduced the pool refactor and, as a side effect, the one-shot `qjd_parse` path started doing two heap allocations per call (decoder + doc handle) and walking a state-machine check on every `qjd_get_*`. That regressed legacy users without giving them anything in exchange: by construction their doc cannot be stale.

Profiling (`perf record` on a 100 KB-payload hammer) showed the regression concentrated in `drop_in_place<SkipCache>`, `qjd_free`'s double-`Box::from_raw`, and the `check_doc_alive` branch chain. Folding the allocations closes that gap while leaving the pooled fast path intact.

Bench: legacy qd.parse vs scan-only (7a895e5)
LuaJIT 2.1, payload generator from `benches/lua_bench.lua`, 5 warmup + 5 timed rounds, median ops/s. Numbers are noisy on this machine (esp. 100K–500K, see range columns in raw output); treat single-digit-percent gaps as in-noise.

[Table: `qd.parse` on scan-only vs `qd.parse` on this PR; rows not preserved.]

(*) The 60K–500K rows show ranges that overlap between scan-only and this PR runs (e.g. fold 100K range observed 45,310..99,502); the median deltas here are partly real and partly noise. Multiple back-to-back runs are needed to nail down the exact size of the residual gap: `OwnedDocBlock` is ~5 bytes larger than the pre-pool `qjd_doc` and may land in a different allocator size class.

Bench: pooled API vs legacy in this PR
The fold preserves the real win of the pool refactor: users who keep one `qd.new_decoder()` alive across requests get a clean acceleration over the one-shot path.

[Table: columns `qd.parse` (legacy), `dec:parse` (reused), `new_decoder()+parse`; rows not preserved.]

Users who do not (or cannot) hoist the decoder across requests get roughly the same perf as legacy `qd.parse`. Users who can hoist it get +30%–95% on small-to-medium payloads where FFI/alloc overhead dominates.

Test plan
- `cargo test --release` (132 tests pass)
- `cargo test --release --no-default-features` (scalar-only)
- `cargo test --features test-panic --release`
- `cargo test --release --features count-allocs --test alloc_count` (legacy + pooled invariants hold; legacy ratio improves)
- `make test` (Lua busted): `busted` is not installed on this dev machine; CI will run it
- `make bench` against `7a895e5` baseline (numbers above)
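
For readers unfamiliar with allocation-count gates such as `pooled < legacy / 2`, one common way to build one is a counting global allocator. The sketch below is an assumption about the general technique, not the contents of the crate's `alloc_count` test; `CountingAlloc`, `count_allocs`, `legacy_style`, and `pooled_style` are hypothetical names, and the two workloads only imitate the allocation pattern of the two-allocation one-shot path and a reused decoder.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical allocator wrapper that counts every allocation request.
struct CountingAlloc;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations observed while `f` runs (process-wide counter).
fn count_allocs<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    f();
    ALLOCS.load(Ordering::Relaxed) - before
}

/// Imitates a one-shot path with two heap allocations per parse.
fn legacy_style(iters: usize) {
    for i in 0..iters {
        let decoder = black_box(Box::new([0u8; 64]));
        let doc = black_box(Box::new(i));
        drop(doc);
        drop(decoder);
    }
}

/// Imitates a reused decoder: one allocation up front, none per parse.
fn pooled_style(iters: usize) {
    let mut decoder = black_box(Box::new([0u8; 64]));
    for i in 0..iters {
        decoder[0] = i as u8;
        black_box(&mut decoder);
    }
}
```

A gate in this style would call `count_allocs` on both workloads with the same iteration count and assert that the pooled count is below half the legacy count. The counter is process-wide, so other threads can inflate it, which is why a ratio check is more robust than an exact equality.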