build: Makefile with bench target + AVX2 scanner perf fixes by membphis · Pull Request #3 · api7/lua-qjson

membphis · 2026-05-15T16:12:37Z

Summary

Add a root Makefile wrapping build / test / lint / bench / clean so contributors don't have to remember the LD_LIBRARY_PATH / LUA_CPATH / LUAJIT dance. Defaults autodetect OpenResty's LuaJIT; all paths are overridable.
Rewrite the Lua bench: adds 100 K – 10 M scenarios, reports decodes/sec (replacing per-op μs), and models payloads as realistic multimodal chat requests (one ~1.5 KB text part + N base64 image parts, 50–500 KB each).
Fix two latent AVX2 scanner perf issues uncovered by the new bench:
1. Tail-bypass: when buf.len() % 64 != 0 the scanner was clearing its AVX2-emitted indices and re-scanning the whole buffer with ScalarScanner. Bracket pairing is now a post-pass (validate_brackets) over the merged indices, and the <64-byte tail uses scan_emit_resume with carried in_string / bs_carry state.
2. String-skip fast path: when the previous chunk left in_string == 1 and real_quote == 0, the whole chunk is string interior — skip structural_mask_chunk (14 cmpeq + 14 movemask) and emit_bits. Mathematically equivalent to the slow path; verified via the existing scalar parity proptest plus targeted unit tests.

Combined impact on the multimodal bench (10 MB payload): qd goes from 33 → 352 ops/s (~11×), beating lua-cjson by 11× and lua-resty-simdjson by 4.3×, while still allocating only ~0.3 KB per parse (vs ~10 MB for both alternatives).

README's Deferred section drops the now-completed AVX2 tail-bypass item and gains three new follow-up perf items (shuffle-based structural set check, adaptive out.reserve, AVX-512 backend) plus the # Safety doc backfill that make lint -D warnings currently surfaces.

Test plan

cargo test --release — 70 unit + 3 + 10 + 5 + 3 + 12 + 1 + 1 integration tests + 2000-case scanner cross-check proptest all pass
cargo test --release --no-default-features — scalar-only build, all 60 unit + integration tests pass
make help / make build / make bench exercised manually
Three-way bench (cjson / simdjson / qd) sanity-checked across 9 size scenarios, 3 runs for stability
CI green
make lint — will surface 22 missing_safety_doc warnings (pre-existing, tracked in README Deferred for a follow-up PR)

bench wires up LD_LIBRARY_PATH and LUA_CPATH so 'make bench' invokes luajit on benches/lua_bench.lua without manual env setup. LUAJIT and LUA_CPATH are overridable. Defaults autodetect luajit on PATH and fall back to OpenResty's install path. README lists '# Safety' doc backfill on the 22 unsafe FFI exports as deferred — 'make lint' uses -D warnings and currently fails until that follow-up lands.

Adds 7 new size scenarios (100k/200k/500k/1m/2m/5m/10m) generated in-memory from a repeated message template, so no large fixture files are added to the repo. Replaces the per-op microsecond column with a decodes-per-second throughput column for easier cross-size comparison. Iter counts scale down with payload size, with a floor of 10.

Replaces the prior synthetic payload (many small message objects) with the realistic case the bench is meant to reflect: one ~1.5 KB text question plus N base64 image parts, each 50-500 KB, repeated until the payload reaches the target size. This is what production traffic looks like for vision chat-completion APIs.

…date Before: when buf.len() % 64 != 0 the AVX2 scanner cleared its emitted indices and re-scanned the whole buffer with ScalarScanner, because the scalar fallback validates bracket pairing on a partial slice and would falsely reject a tail that closes a bracket opened in the AVX2 prefix. For realistic JSON payloads (which are essentially never 64-aligned) this made the AVX2 path dead code. After: bracket pairing is moved out-of-line into a post-pass 'validate_brackets' that walks the emitted index array. The AVX2 path emits structural offsets for the main loop, hands the <64-byte tail off to a scalar 'scan_emit_resume' (carrying in_string and bs_carry state from the last AVX2 chunk), then runs the post-pass. Measured on 1-10 MB multimodal chat payloads (one ~1.5 KB text part plus base64 image parts): size before after speedup ------ ------- ------- ------- 1 M 335 op/s 4000 12x 2 M 162 2000 12x 5 M 69 683 10x 10 M 33 325 10x This beats lua-resty-simdjson by 2-3x on the same workload while keeping qd's existing memory-locality win (no full Lua table built). Cross-validated against ScalarScanner via existing proptest harness (2000 cases, including string/escape sequences) plus new targeted tests for unaligned tails, strings crossing the chunk boundary, and backslash-escape sequences straddling the boundary.

When the previous chunk left us inside a string (in_string == 1) and the current chunk's real_quote mask is 0, the entire chunk is string interior — no structurals can appear and in_string stays 1. Skip the structural_mask_chunk computation (14 cmpeq + 14 movemask) and the PCLMUL prefix-XOR. bs_carry is still updated via the backslash mask so escape state remains correct across the boundary. Mathematically equivalent to the slow path in this case (inside = all-1s makes final_mask = 0, emit_bits is a no-op, new_in_string = 1). Verified by the existing scalar parity proptest + two new unit tests (long quote-free string interior, escaped quotes inside a string). Measured on the same multimodal bench: size tail-fix only + fast path gain ------ ------------- ----------- ---- 1 M 4000 op/s 6500 +60% 2 M 2000 2900 +45% 5 M 683 830 +22% 10 M 325 420 +29% Against lua-resty-simdjson at 10 MB: qd 420 vs simdjson 96 = 4.4x. README's deferred list gains three follow-up perf items (shuffle-based structural set check, adaptive out.reserve, AVX-512 backend) that were identified while diagnosing this workload.

* chore: address PR #3 review hygiene items Five Important findings + four small Minor follow-ups from the PR #3 local code review: Important - tests/scanner_crosscheck.rs — drop the stale "AVX2 does not validate brackets" comment and tighten the proptest to require full Result equality (Ok/Err verdict + error offset) and indices equality on every case, not just on Ok. After the tail-bypass fix scalar and AVX2 run the same scan_emit_resume + validate_brackets pipeline, so this is now enforceable. Still passes the 2000-case proptest. - benches/lua_bench.lua — switch image-size RNG from math.random (which delegates to libc rand() and varies across machines) to a deterministic Park-Miller LCG. Same target_bytes now produces byte-identical output on any LuaJIT 2.1 host. - benches/lua_bench.lua — tighten the loop so the actual payload size matches its label. Cap the per-iteration `upper` at `remaining` and allow the last image to shrink below the 50 KB floor when fewer bytes remain. Observed: every scenario now lands within ~0.1% of its label (100k -> 102351 bytes, 1m -> 1048527, 10m -> 10485711) vs up to +49% before. - src/scan/avx2.rs — remove the dead `else if in_string != 0` branch inside the tail handler. `i < buf.len()` makes the `scalar_start <= buf.len()` check trivially true, and scan_emit_resume already returns Err(buf.len()) when start == buf.len() and in_string is set. Replace the unreachable branch with a comment that documents the invariant. - src/scan/mod.rs — drop the inaccurate "the check is defensive" wording from validate_brackets. The function is correctness-coupled with the scanner that produced its index list; a forged quote would flip in_string and mask later mismatches. Minor - Makefile — match the spec's "target — description" help format (em-dash separator) and tighten the awk FS pattern from `:.*## ` to `:[^#]*## ` so descriptions containing `##` aren't truncated by the greedy `.*`. - benches/lua_bench.lua — bump 2m / 5m / 10m iters from 10 to 20 so bigger-payload measurements ride out one-shot allocator / page-fault noise. - src/scan/avx2.rs — rename `escaped_quotes_do_not_trip_fastpath` to `escaped_quotes_remain_correct_with_fastpath`. The test asserts parity with scalar, not that the branch was taken (we have no counter to observe that), so the name should reflect what's actually checked. cargo test --release: 70+3+10+1+5+3+12+1+1 = 106 unit/integration tests plus the 2000-case proptest, all pass. cargo test --release --no-default-features (scalar-only build): 60+3+10+1+5+3+12+1+1 = 96 tests, all pass. * docs: align review-followup comments with actual behavior Address the Important #1 + three Minor findings from the local review of PR #4 (the docs drifted from the code introduced in the same PR): - benches/lua_bench.lua: rewrite the size-accuracy comment. The prior draft referenced a `remaining + slack` upper-bound expression that the final code does not use; replace with the two-branch behaviour actually implemented (normal `min(500K, remaining)`; final image falls through to `max(1024, remaining)`) and the real worst-case overshoot (~1 KB, not "~10 KB"). - Makefile: explain the `:[^#]*## ` FS choice and its tradeoff (targets with `#` in the prerequisite list won't render — none today). - tests/scanner_crosscheck.rs: clarify that the indices assertion holds for both Ok and Err cases because scan_emit_resume always completes emission before any potential Err, and validate_brackets does not modify the index list. - src/scan/avx2.rs: spell out the scalar_start invariant including the exact boundary case (scalar_start == buf.len() when i == buf.len()-1 && in_string != 0 && bs_carry != 0) that scan_emit_resume's post-loop in_str check covers. No code changes. All tests still pass.

membphis added 6 commits May 15, 2026 15:21

docs: spec for root Makefile with bench/test/lint targets

85bd9fc

membphis merged commit 3fa8633 into main May 15, 2026
2 checks passed

membphis mentioned this pull request May 15, 2026

chore: PR #3 review hygiene followups #4

Merged

5 tasks

This was referenced May 15, 2026

perf(scan): memchr-based fast path for in-string content #5

Closed

perf(scan): scanner micro-optimizations — shuffle structural detection + adaptive reserve + SmallVec #8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

build: Makefile with bench target + AVX2 scanner perf fixes#3

build: Makefile with bench target + AVX2 scanner perf fixes#3
membphis merged 6 commits into
mainfrom
worktree-makefile-add-bench-target

membphis commented May 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

membphis commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

membphis commented May 15, 2026 •

edited

Loading