perf(ffi): fold legacy qjd_parse to single allocation; preserve pooled API by membphis · Pull Request #13 · api7/lua-qjson

membphis · 2026-05-15T23:24:40Z

Summary

- Co-locate `qjd_doc` and its private `qjd_decoder` in a single `#[repr(C)] OwnedDocBlock` allocation, restoring the legacy `qjd_parse` path to one Box per parse (it was two after #11, "perf: decoder/document instance pooling (#6)").
- `check_doc_alive` now fast-paths docs with `owns_decoder == true`: their decoder is unreachable to the pool API, so state/gen are pristine by construction. The fast path skips the runtime checks and reaches the decoder by static offset from the doc pointer.
- The bench harness gains a warmup pass plus a 5-round median, an interleaved-payload scenario, and side-by-side variants for `qd.parse` (legacy), `qd.new_decoder():parse` (pooled, reused), and `qd.new_decoder():parse` (one-shot per iter).
- Add `benches/perf_probe.lua`, a minimal hammer for use under `perf record` when investigating FFI hot paths.

The pooled API (qjd_decoder_new + qjd_decoder_parse) is unchanged. All existing tests pass, including the count-allocs gate (pooled < legacy / 2).
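
To make the co-location idea concrete, here is a minimal sketch of a `#[repr(C)]` block whose first field is the doc handle, so the doc pointer and the block pointer coincide and the decoder sits at a fixed offset. The `Doc` and `Decoder` types, their fields, and the helper names `alloc_block`, `decoder_of`, and `free_block` are hypothetical stand-ins; the real `qjd_doc` / `qjd_decoder` definitions are not shown in this PR description.

```rust
// Hypothetical stand-ins for qjd_doc / qjd_decoder; the real fields differ.
#[repr(C)]
pub struct Doc {
    pub owns_decoder: bool,
    pub generation: u32,
}

#[repr(C)]
pub struct Decoder {
    pub generation: u32,
    pub scratch: [u8; 16],
}

/// One heap allocation holding both parts. `doc` must stay the first field
/// so that a pointer to the block is also a valid pointer to the doc.
#[repr(C)]
pub struct OwnedDocBlock {
    pub doc: Doc,
    pub decoder: Decoder,
}

/// Allocate a single block and hand out only the doc pointer,
/// the way a one-shot parse entry point might.
pub fn alloc_block(generation: u32) -> *mut Doc {
    let block = Box::new(OwnedDocBlock {
        doc: Doc { owns_decoder: true, generation },
        decoder: Decoder { generation, scratch: [0; 16] },
    });
    // repr(C) places the first field at offset 0.
    Box::into_raw(block) as *mut Doc
}

/// Reach the co-located decoder from the doc pointer by field offset,
/// without loading a pointer stored in the doc.
///
/// # Safety
/// `doc` must come from `alloc_block` and must not have been freed.
pub unsafe fn decoder_of(doc: *mut Doc) -> *mut Decoder {
    let block = doc as *mut OwnedDocBlock;
    unsafe { std::ptr::addr_of_mut!((*block).decoder) }
}

/// Free the whole block through the doc pointer.
///
/// # Safety
/// `doc` must come from `alloc_block` and must not be used afterwards.
pub unsafe fn free_block(doc: *mut Doc) {
    unsafe { drop(Box::from_raw(doc as *mut OwnedDocBlock)) }
}
```

The design choice this illustrates is that the offset-0 invariant lets one `Box` serve both handles; it only holds while `doc` remains the first field of a `#[repr(C)]` struct.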

Why

PR #11 (commit 0721d7d) introduced the pool refactor and, as a side effect, the one-shot qjd_parse path started doing two heap allocations per call (decoder + doc handle) and walking a state-machine check on every qjd_get_*. That regressed legacy users without giving them anything in exchange: by construction their doc cannot be stale.

Profiling (perf record on a 100 KB-payload hammer) showed the regression concentrated in drop_in_place<SkipCache>, qjd_free's double-Box::from_raw, and the check_doc_alive branch chain. Folding the allocations closes that gap while leaving the pooled fast path intact.

Bench: legacy qd.parse vs scan-only (`7a895e5`)

LuaJIT 2.1, payload generator from benches/lua_bench.lua, 5 warmup + 5 timed rounds, median ops/s. Numbers are noisy on this machine (esp. 100K–500K, see range columns in raw output); treat single-digit-percent gaps as in-noise.

| payload | scan-only `qd.parse` (ops/s) | this PR `qd.parse` (ops/s) | Δ |
| --- | --- | --- | --- |
| small 2 KB | 148,898 | ~134,500 | −10% |
| medium 60 KB | 151,607 | ~114,600 | −24% * |
| 100 KB | 93,721 | ~68,300 | −27% * |
| 200 KB | 83,752 | ~54,800 | −35% * |
| 500 KB | 42,194 | ~30,300 | −28% * |
| 1 MB | 12,744 | ~14,900 | +17% |
| 2 MB | 6,605 | ~8,200 | +24% |
| 5 MB | 2,140 | ~2,200 | ~0% |
| 10 MB | 593 | ~490 | −17% |
| interleaved 100k,200k,500k,1m | 26,443 | ~25,200 | −5% |

(*) The 60K–500K rows show ranges that overlap between scan-only and this PR runs (e.g. fold 100K range observed 45,310..99,502); the median deltas here are partly real and partly noise. Multiple back-to-back runs are needed to nail down the exact size of the residual gap — OwnedDocBlock is ~5 bytes larger than the pre-pool qjd_doc and may land in a different allocator size class.
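
As a worked example of the arithmetic behind the Δ column and the noise caveat, the sketch below computes a percent delta between two medians and checks whether a baseline median falls inside the candidate's observed min..max range. The function names are hypothetical; the numbers used in the note afterwards are the 100 KB figures quoted above.

```rust
/// Percent change of `candidate` relative to `baseline` (negative means slower).
pub fn delta_pct(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// True when `value` lies inside the closed range `lo..=hi`.
pub fn in_range(value: f64, lo: f64, hi: f64) -> bool {
    lo <= value && value <= hi
}

/// A crude noise heuristic (an assumption of this sketch, not the bench's logic):
/// if the baseline median sits inside the candidate's observed round range,
/// a single run cannot separate the two.
pub fn looks_like_noise(baseline_median: f64, candidate_min: f64, candidate_max: f64) -> bool {
    in_range(baseline_median, candidate_min, candidate_max)
}
```

For the 100 KB row, `delta_pct(93_721.0, 68_300.0)` is about −27.1, yet the scan-only median of 93,721 lies inside the fold run's observed range 45,310..99,502, which is why the starred rows are flagged as only partly real.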

Bench: pooled API vs legacy in this PR

The fold preserves the real win of the pool refactor — users who keep one qd.new_decoder() alive across requests get a clean acceleration over the one-shot path:

| payload | `qd.parse` (legacy) | `dec:parse` (reused) | one-shot `new_decoder()+parse` |
| --- | --- | --- | --- |
| small 2 KB | ~134,500 | ~221,900 | ~138,100 |
| medium 60 KB | ~114,600 | ~179,500 | ~103,800 |
| 100 KB | ~68,300 | ~133,300 | ~79,500 |
| 200 KB | ~54,800 | ~72,300 | ~59,000 |
| 500 KB | ~30,300 | ~34,100 | ~26,400 |
| 1 MB | ~14,900 | ~18,700 | ~16,900 |
| 2 MB | ~8,200 | ~8,600 | ~7,800 |
| interleaved | ~25,200 | ~30,500 | n/a |

Users who do not (or cannot) hoist the decoder across requests get roughly the same perf as legacy qd.parse. Users who can hoist it get roughly +30% to +95% on small-to-medium payloads (the 2 KB to 200 KB rows above), where FFI/alloc overhead dominates.

Test plan

- `cargo test --release` (132 tests pass)
- `cargo test --release --no-default-features` (scalar-only)
- `cargo test --features test-panic --release`
- `cargo test --release --features count-allocs --test alloc_count` (legacy + pooled invariants hold; legacy ratio improves)
- `make test` (Lua busted): busted is not installed on this dev machine; CI will run it
- `make bench` against the 7a895e5 baseline (numbers above)
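
For readers unfamiliar with allocation-count gates such as `pooled < legacy / 2`, the following is a minimal sketch of how such a check can be built with a counting global allocator. It is not the project's `count-allocs` implementation, which is not shown here; `CountingAlloc`, `count_allocs`, `one_shot`, and `reused` are hypothetical names, and the two workloads are simple stand-ins for a per-call allocation versus a reused buffer.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread counter, const-initialised so touching it never allocates.
    static ALLOCS: Cell<usize> = const { Cell::new(0) };
}

/// Wraps the system allocator and counts every allocation on the calling thread.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // try_with: skip counting if thread-local storage is already torn down.
        let _ = ALLOCS.try_with(|c| c.set(c.get() + 1));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations the current thread performs while running `f`.
pub fn count_allocs<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCS.with(|c| c.get());
    f();
    ALLOCS.with(|c| c.get()) - before
}

/// Stand-in for a one-shot path: one fresh heap allocation per call.
pub fn one_shot(iters: usize) {
    for _ in 0..iters {
        std::hint::black_box(Box::new(0u64));
    }
}

/// Stand-in for a pooled path: one buffer allocated up front, then reused.
pub fn reused(iters: usize) {
    let mut buf: Vec<u8> = Vec::with_capacity(64);
    for _ in 0..iters {
        buf.clear();
        buf.extend_from_slice(b"payload");
        std::hint::black_box(&buf);
    }
}
```

A gate in this style would assert something like `count_allocs(|| reused(n)) < count_allocs(|| one_shot(n)) / 2`; counting per thread keeps allocations made by other threads out of the result.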

The decoder/document pooling refactor (commit 0721d7d) split the legacy qjd_parse path into two heap allocations (`qjd_decoder` + `qjd_doc`) and made every `qjd_get_*` walk a state-machine check via `check_doc_alive`. That made the new pooled API faster when the decoder is reused, but left the one-shot `qjd_parse()` users paying double-alloc / double-free per call plus a gen check per field access, with no upside since their doc cannot be stale.

Co-locate `qjd_doc` and its private `qjd_decoder` in a single `#[repr(C)] OwnedDocBlock` allocation:

- qjd_parse: one Box::new instead of two; parse in place to avoid a ~100-byte stack-to-heap memcpy of the freshly-constructed Decoder.
- qjd_free: branch on owns_decoder and recover the block pointer via the offset-0 invariant for the legacy path.
- check_doc_alive: legacy docs skip the state/gen check (a stale state is unreachable by construction, since the decoder is private) and reach the decoder by static offset from the doc pointer, not by loading doc.decoder.

The pooled API (`qjd_decoder_new` + `qjd_decoder_parse`) is unchanged and still sets owns_decoder = false. All existing tests, including the count-allocs gate (`pooled < legacy / 2`), still pass.
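
The free-path branch described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `Doc`, `Decoder`, `parse_legacy`, `parse_pooled`, and `free_doc` are hypothetical names, and the drop counter exists only to make the ownership difference observable.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical decoder that records when it is dropped.
pub struct Decoder {
    pub drops: Rc<Cell<u32>>,
}

impl Drop for Decoder {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

#[repr(C)]
pub struct Doc {
    pub owns_decoder: bool,
    /// Only meaningful for pooled docs; legacy docs find the decoder by offset.
    pub decoder: *mut Decoder,
}

/// Legacy layout: doc first (offset 0), private decoder right behind it.
#[repr(C)]
pub struct OwnedDocBlock {
    pub doc: Doc,
    pub decoder: Decoder,
}

/// One-shot path: a single allocation that owns its decoder.
pub fn parse_legacy(drops: Rc<Cell<u32>>) -> *mut Doc {
    let block = Box::new(OwnedDocBlock {
        doc: Doc { owns_decoder: true, decoder: std::ptr::null_mut() },
        decoder: Decoder { drops },
    });
    Box::into_raw(block) as *mut Doc
}

/// Pooled path: the doc handle borrows a decoder owned elsewhere.
pub fn parse_pooled(decoder: *mut Decoder) -> *mut Doc {
    Box::into_raw(Box::new(Doc { owns_decoder: false, decoder }))
}

/// Free a doc from either path.
///
/// # Safety
/// `doc` must come from `parse_legacy` or `parse_pooled` and must not be reused.
pub unsafe fn free_doc(doc: *mut Doc) {
    unsafe {
        if (*doc).owns_decoder {
            // Offset-0 invariant: the doc pointer is also the block pointer.
            drop(Box::from_raw(doc as *mut OwnedDocBlock));
        } else {
            // Pooled: release only the handle; the decoder outlives it.
            drop(Box::from_raw(doc));
        }
    }
}
```

In this sketch, freeing a legacy doc drops its decoder in the same deallocation, while freeing a pooled doc leaves the shared decoder untouched.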

The single-run-with-mean output the bench used to print swung 30-40% between invocations on noisy machines, making it hard to tell signal from noise when comparing perf commits.

- bench() now runs a warmup pass (JIT trace compile, pool fill), then five timed rounds. It reports median and mean ops/s plus the round-by-round min..max range so reviewers can see whether a delta is real.
- Add an `interleaved 100k,200k,500k,1m` scenario that rotates through four payload sizes, matching a server that handles varying request sizes back to back. The single-payload loops cannot exercise the doc pool the way real traffic does.
- For each scenario, probe `qd.new_decoder` and run two extra qd variants when present:
  - `quickdecode pooled :parse` (reused decoder across iters)
  - `quickdecode new_decoder()+parse` (one-shot per iter, no reuse)

  So a reader can directly compare the legacy qd.parse path, the pool-API-with-reuse path, and the realistic "user creates a fresh decoder per request" pattern in one bench run.

Also ship benches/perf_probe.lua: a minimal hammer over qd.parse on a fixed payload for use under `perf record` when investigating FFI hot paths. Not invoked by Makefile targets.
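
The warmup-then-rounds measurement described above can be sketched in a few lines. The actual harness lives in benches/lua_bench.lua and is written in Lua; this Rust version only illustrates the statistics, and `RoundStats`, `summarize`, and `bench` are hypothetical names.

```rust
use std::time::Instant;

/// Summary of per-round throughput samples (ops/s).
pub struct RoundStats {
    pub median: f64,
    pub mean: f64,
    pub min: f64,
    pub max: f64,
}

/// Reduce per-round ops/s samples to median, mean, and min..max.
pub fn summarize(mut samples: Vec<f64>) -> RoundStats {
    assert!(!samples.is_empty(), "need at least one round");
    samples.sort_by(|a, b| a.total_cmp(b));
    let n = samples.len();
    let median = if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2.0
    };
    let mean = samples.iter().sum::<f64>() / n as f64;
    RoundStats { median, mean, min: samples[0], max: samples[n - 1] }
}

/// Run one untimed warmup pass, then `rounds` timed passes of `iters` calls each.
pub fn bench<F: FnMut()>(mut work: F, iters: u32, rounds: usize) -> RoundStats {
    // Warmup: lets caches, pools, and any JIT settle before timing starts.
    for _ in 0..iters {
        work();
    }
    let mut samples = Vec::with_capacity(rounds);
    for _ in 0..rounds {
        let start = Instant::now();
        for _ in 0..iters {
            work();
        }
        samples.push(iters as f64 / start.elapsed().as_secs_f64());
    }
    summarize(samples)
}
```

Reporting the median alongside min..max is what lets a reviewer see that one slow round does not shift the headline number, while a wide range still signals a noisy run.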

membphis · 2026-05-15T23:53:44Z

Closing in favor of a clean revert of #11 (commit 0721d7d). Final benches (3-run median-of-medians) showed this fold approach still ran 14-27% behind scan-only on 100 KB–5 MB payloads, the very range that dominates the API gateway traffic this library serves. Given the production usage pattern (fresh decoder per request, no reuse), the pool API does not pay off, and the residual fold cost is not worth retaining for the optionality. Revert PR coming next.

membphis added 2 commits May 15, 2026 23:23

membphis closed this May 15, 2026

membphis mentioned this pull request May 15, 2026

Revert "perf: decoder/document instance pooling (#11)" #14

Merged

5 tasks

membphis wants to merge 2 commits into main from perf/legacy-qjd-parse-single-alloc
