
perf(ffi): fold legacy qjd_parse to single allocation; preserve pooled API #13

Closed: membphis wants to merge 2 commits into main from perf/legacy-qjd-parse-single-alloc
@membphis
Collaborator

Summary

  • Co-locate qjd_doc and its private qjd_decoder in a single #[repr(C)] OwnedDocBlock allocation, restoring the legacy qjd_parse path to one Box per parse (it was two after #11, "perf: decoder/document instance pooling (#6)").
  • check_doc_alive now fast-paths owns_decoder == true docs: their decoder is unreachable to the pool API, so state/gen are pristine by construction — skip the runtime checks and reach the decoder by static offset from the doc pointer.
  • Bench harness gains warmup + 5-round median, an interleaved-payload scenario, and side-by-side variants for qd.parse (legacy), qd.new_decoder():parse (pooled, reused), and qd.new_decoder():parse (one-shot per iter).
  • Add benches/perf_probe.lua, a minimal hammer over qd.parse on a fixed payload, for use under perf record when investigating FFI hot paths. It is not invoked by any Makefile target.

The pooled API (qjd_decoder_new + qjd_decoder_parse) is unchanged. All existing tests pass, including the count-allocs gate (pooled < legacy / 2).
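The co-location and the check_doc_alive fast path can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the PR's actual code: `Decoder`, `Doc`, `parse_legacy`, and every field other than `owns_decoder` are hypothetical stand-ins, and the real qjd_doc / qjd_decoder layouts are not shown on this page.

```rust
// Hypothetical, simplified stand-ins for the real FFI types.
#[repr(C)]
struct Decoder {
    generation: u64,
    _scratch: [u8; 32],
}

#[repr(C)]
struct Doc {
    owns_decoder: bool,
    generation: u64,       // generation captured at parse time (pooled path)
    decoder: *mut Decoder, // only consulted on the pooled path
}

// `doc` must stay the first field: with #[repr(C)] it sits at offset 0,
// so the *mut Doc handed to callers is also the address of the block.
#[repr(C)]
struct OwnedDocBlock {
    doc: Doc,
    decoder: Decoder,
}

/// Legacy one-shot path: a single Box holds both the doc and its decoder.
fn parse_legacy() -> *mut Doc {
    let block = Box::new(OwnedDocBlock {
        doc: Doc { owns_decoder: true, generation: 0, decoder: std::ptr::null_mut() },
        decoder: Decoder { generation: 0, _scratch: [0; 32] },
    });
    Box::into_raw(block).cast::<Doc>()
}

/// Safety: `doc` must point to a live Doc created by `parse_legacy` or to a
/// live pooled Doc whose `decoder` pointer is null or valid.
unsafe fn check_doc_alive(doc: *mut Doc) -> Option<*mut Decoder> {
    unsafe {
        if (*doc).owns_decoder {
            // Fast path: the decoder is private to this block, so no
            // state/generation check; reach it by static offset.
            let block = doc.cast::<OwnedDocBlock>();
            return Some(std::ptr::addr_of_mut!((*block).decoder));
        }
        // Pooled path (assumed shape): the doc is stale once the decoder
        // has moved on to a newer generation.
        let dec = (*doc).decoder;
        if dec.is_null() || (*dec).generation != (*doc).generation {
            return None;
        }
        Some(dec)
    }
}
```

The design point is that the legacy branch never loads `doc.decoder`: the decoder address is a compile-time offset from the doc pointer, which is what removes the extra pointer chase on every accessor call.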

Why

PR #11 (commit 0721d7d) introduced the pool refactor and, as a side effect, the one-shot qjd_parse path started doing two heap allocations per call (decoder + doc handle) and walking a state-machine check on every qjd_get_*. That regressed legacy users without giving them anything in exchange: by construction, their doc cannot be stale.

Profiling (perf record on a 100 KB-payload hammer) showed the regression concentrated in drop_in_place<SkipCache>, qjd_free's double-Box::from_raw, and the check_doc_alive branch chain. Folding the allocations closes that gap while leaving the pooled fast path intact.

Bench: legacy qd.parse vs scan-only (7a895e5)

LuaJIT 2.1, payload generator from benches/lua_bench.lua, 5 warmup + 5 timed rounds, median ops/s. Numbers are noisy on this machine (esp. 100K–500K, see range columns in raw output); treat single-digit-percent gaps as in-noise.
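For readers unfamiliar with the methodology, the warmup-then-median scheme looks roughly like the sketch below. The actual harness is Lua (benches/lua_bench.lua); this Rust version only illustrates the statistics, and `bench`, `summarize`, and `BenchReport` are hypothetical names.

```rust
use std::time::Instant;

/// Per-scenario summary in ops/s.
struct BenchReport {
    median: f64,
    mean: f64,
    min: f64,
    max: f64,
}

/// Reduce per-round throughput samples to median, mean, and min..max range.
fn summarize(mut ops: Vec<f64>) -> BenchReport {
    assert!(!ops.is_empty(), "need at least one timed round");
    ops.sort_by(|a, b| a.total_cmp(b));
    let n = ops.len();
    let median = if n % 2 == 1 {
        ops[n / 2]
    } else {
        (ops[n / 2 - 1] + ops[n / 2]) / 2.0
    };
    let mean = ops.iter().sum::<f64>() / n as f64;
    BenchReport { median, mean, min: ops[0], max: ops[n - 1] }
}

/// Run `work` untimed for `warmup_iters`, then time `rounds` rounds of
/// `iters` calls each and summarize the per-round ops/s.
fn bench<F: FnMut()>(mut work: F, warmup_iters: u32, rounds: u32, iters: u32) -> BenchReport {
    for _ in 0..warmup_iters {
        work(); // lets caches, pools, and (in LuaJIT) traces settle
    }
    let mut ops = Vec::with_capacity(rounds as usize);
    for _ in 0..rounds {
        let start = Instant::now();
        for _ in 0..iters {
            work();
        }
        ops.push(iters as f64 / start.elapsed().as_secs_f64());
    }
    summarize(ops)
}
```

Reporting the median alongside the min..max range is what allows a reader to judge whether a delta between two commits exceeds the round-to-round spread.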

| payload | scan-only qd.parse (ops/s) | this PR qd.parse (ops/s) | Δ |
|---|---|---|---|
| small 2 KB | 148,898 | ~134,500 | −10% |
| medium 60 KB | 151,607 | ~114,600 | −24% * |
| 100 KB | 93,721 | ~68,300 | −27% * |
| 200 KB | 83,752 | ~54,800 | −35% * |
| 500 KB | 42,194 | ~30,300 | −28% * |
| 1 MB | 12,744 | ~14,900 | +17% |
| 2 MB | 6,605 | ~8,200 | +24% |
| 5 MB | 2,140 | ~2,200 | ~0% |
| 10 MB | 593 | ~490 | −17% |
| interleaved 100k,200k,500k,1m | 26,443 | ~25,200 | −5% |

(*) The 60K–500K rows show ranges that overlap between scan-only and this PR runs (e.g. fold 100K range observed 45,310..99,502); the median deltas here are partly real and partly noise. Multiple back-to-back runs are needed to nail down the exact size of the residual gap — OwnedDocBlock is ~5 bytes larger than the pre-pool qjd_doc and may land in a different allocator size class.
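The noise caveat can be made concrete with the 100 KB row. The sketch below recomputes the Δ column and checks whether the scan-only median falls inside the range observed for this PR; the helper names are hypothetical and the numbers are copied from the table and the footnote above.

```rust
/// Percent change of `candidate` relative to `baseline`.
fn pct_delta(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// True when `value` lies inside the observed min..max range of another run,
/// i.e. the two results are not cleanly separated.
fn within_range(value: f64, lo: f64, hi: f64) -> bool {
    lo <= value && value <= hi
}

/// 100 KB row: (median delta in percent, does scan-only fall inside the fold range?).
fn hundred_kb_row() -> (f64, bool) {
    let scan_only_median = 93_721.0;
    let fold_median = 68_300.0;
    let (fold_lo, fold_hi) = (45_310.0, 99_502.0);
    (
        pct_delta(scan_only_median, fold_median),
        within_range(scan_only_median, fold_lo, fold_hi),
    )
}
```

For this row the median delta rounds to −27%, yet the scan-only median sits inside the fold run's own range, which is why the starred rows should not be read as precise.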

Bench: pooled API vs legacy in this PR

The fold preserves the real win of the pool refactor — users who keep one qd.new_decoder() alive across requests get a clean acceleration over the one-shot path:

| payload | qd.parse (legacy) | dec:parse (reused) | one-shot new_decoder()+parse |
|---|---|---|---|
| small 2 KB | ~134,500 | ~221,900 | ~138,100 |
| medium 60 KB | ~114,600 | ~179,500 | ~103,800 |
| 100 KB | ~68,300 | ~133,300 | ~79,500 |
| 200 KB | ~54,800 | ~72,300 | ~59,000 |
| 500 KB | ~30,300 | ~34,100 | ~26,400 |
| 1 MB | ~14,900 | ~18,700 | ~16,900 |
| 2 MB | ~8,200 | ~8,600 | ~7,800 |
| interleaved | ~25,200 | ~30,500 | |

Users who do not (or cannot) hoist the decoder across requests get roughly the same perf as legacy qd.parse. Users who can hoist it get +30%–95% on small-to-medium payloads (2 KB–200 KB in the table above), where FFI/alloc overhead dominates.

Test plan

  • cargo test --release (132 tests pass)
  • cargo test --release --no-default-features (scalar-only)
  • cargo test --features test-panic --release
  • cargo test --release --features count-allocs --test alloc_count (legacy + pooled invariants hold; legacy ratio improves)
  • make test (Lua busted) — busted is not installed on this dev machine; CI will run it
  • make bench against 7a895e5 baseline (numbers above)

membphis added 2 commits May 15, 2026 23:23
The decoder/document pooling refactor (commit 0721d7d) split the legacy
qjd_parse path into two heap allocations (`qjd_decoder` + `qjd_doc`) and
made every `qjd_get_*` walk a state-machine check via `check_doc_alive`.
That made the new pooled API faster when the decoder is reused, but
left the one-shot `qjd_parse()` users paying double-alloc / double-free
per call plus a gen check per field access, with no upside since their
doc cannot be stale.

Co-locate `qjd_doc` and its private `qjd_decoder` in a single
`#[repr(C)] OwnedDocBlock` allocation:

  - qjd_parse: one Box::new instead of two; parse in place to avoid a
    ~100-byte stack-to-heap memcpy of the freshly-constructed Decoder.
  - qjd_free: branch on owns_decoder and recover the block pointer via
    the offset-0 invariant for the legacy path.
  - check_doc_alive: legacy docs skip the state/gen check (unreachable
    by construction — the decoder is private) and reach the decoder by
    static offset from the doc pointer, not by loading doc.decoder.
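A minimal sketch of the qjd_free branch described above, assuming simplified types. Names other than `owns_decoder` and `OwnedDocBlock` are hypothetical, and the pooled branch shown here (free only the doc handle, leave the decoder to its owner) is an assumption rather than something this commit message spells out. A drop counter stands in for observing deallocation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts Decoder drops so the two free paths can be observed.
static DECODERS_DROPPED: AtomicUsize = AtomicUsize::new(0);

struct Decoder {
    _scratch: Vec<u8>,
}

impl Drop for Decoder {
    fn drop(&mut self) {
        DECODERS_DROPPED.fetch_add(1, Ordering::SeqCst);
    }
}

#[repr(C)]
struct Doc {
    owns_decoder: bool,
    decoder: *mut Decoder, // pooled path only; null for legacy docs
}

// Offset-0 invariant: `doc` is the first field of a #[repr(C)] struct.
#[repr(C)]
struct OwnedDocBlock {
    doc: Doc,
    decoder: Decoder,
}

fn new_legacy_doc() -> *mut Doc {
    let block = Box::new(OwnedDocBlock {
        doc: Doc { owns_decoder: true, decoder: std::ptr::null_mut() },
        decoder: Decoder { _scratch: Vec::new() },
    });
    Box::into_raw(block).cast::<Doc>()
}

fn new_pooled_doc(decoder: *mut Decoder) -> *mut Doc {
    Box::into_raw(Box::new(Doc { owns_decoder: false, decoder }))
}

/// Safety: `doc` must be null or a pointer returned by one of the
/// constructors above that has not been freed yet.
unsafe fn free_doc(doc: *mut Doc) {
    if doc.is_null() {
        return;
    }
    unsafe {
        if (*doc).owns_decoder {
            // Legacy: the doc pointer is the block pointer, so a single
            // Box::from_raw frees the doc and its decoder together.
            drop(Box::from_raw(doc.cast::<OwnedDocBlock>()));
        } else {
            // Pooled (assumed): the decoder outlives the doc handle.
            drop(Box::from_raw(doc));
        }
    }
}
```

The cast in the legacy branch is only sound because the pointer originally came from a `Box<OwnedDocBlock>`; applying it to a pooled doc would free with the wrong layout, which is why the branch on `owns_decoder` comes first.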

The pooled API (`qjd_decoder_new` + `qjd_decoder_parse`) is unchanged
and still owns_decoder = false. All existing tests, including the
count-allocs gate (`pooled < legacy / 2`), still pass.

The single-run-with-mean output the bench used to print swung 30-40%
between invocations on noisy machines, making it hard to tell signal
from noise when comparing perf commits.

- bench() now runs a warmup pass (JIT trace compile, pool fill), then
  five timed rounds. Reports median and mean ops/s plus the round-by-
  round min..max range so reviewers can see whether a delta is real.
- Add an `interleaved 100k,200k,500k,1m` scenario that rotates through
  four payload sizes, matching a server that handles varying request
  sizes back to back. The single-payload loops cannot exercise the
  doc pool the way real traffic does.
- For each scenario, probe `qd.new_decoder` and run two extra qd
  variants when present:
    quickdecode pooled :parse           — reused decoder across iters
    quickdecode new_decoder()+parse     — one-shot per iter (no reuse)
  So a reader can directly compare the legacy qd.parse path, the
  pool-API-with-reuse path, and the realistic "user creates a fresh
  decoder per request" pattern in one bench run.
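The interleaved scenario amounts to a round-robin over the payload set. A rough sketch, with `run_interleaved` as a hypothetical name and the parse call abstracted into a closure (the real bench is Lua and calls qd.parse or dec:parse):

```rust
/// Call `parse` `iters` times, rotating through `payloads` in order so that
/// consecutive calls see different payload sizes.
fn run_interleaved<F: FnMut(&[u8])>(payloads: &[Vec<u8>], iters: usize, mut parse: F) {
    if payloads.is_empty() {
        return; // nothing to rotate through
    }
    for i in 0..iters {
        parse(&payloads[i % payloads.len()]);
    }
}
```

Compared with hammering one fixed payload, the rotation changes the size of every consecutive request, so any buffer or pooled document sized for the previous payload has to cope with a different one on the next call.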

Also ship benches/perf_probe.lua: a minimal hammer over qd.parse on a
fixed payload for use under `perf record` when investigating FFI hot
paths. Not invoked by Makefile targets.
@membphis
Collaborator Author

Closing in favor of a clean revert of #11 (commit 0721d7d). Final benches (3-run median-of-medians) showed this fold approach still ran 14-27% behind scan-only on 100 KB–5 MB payloads, the very range that dominates the API gateway traffic this library serves. Given the production usage pattern (fresh decoder per request, no reuse), the pool API does not pay off, and the residual fold cost is not worth retaining for the optionality. Revert PR coming next.

membphis closed this May 15, 2026