build: Makefile with bench target + AVX2 scanner perf fixes#3
Merged
Conversation
bench wires up LD_LIBRARY_PATH and LUA_CPATH so 'make bench' invokes luajit on benches/lua_bench.lua without manual env setup. LUAJIT and LUA_CPATH are overridable. Defaults autodetect luajit on PATH and fall back to OpenResty's install path. README lists '# Safety' doc backfill on the 22 unsafe FFI exports as deferred — 'make lint' uses -D warnings and currently fails until that follow-up lands.
Adds 7 new size scenarios (100k/200k/500k/1m/2m/5m/10m) generated in-memory from a repeated message template, so no large fixture files are added to the repo. Replaces the per-op microsecond column with a decodes-per-second throughput column for easier cross-size comparison. Iter counts scale down with payload size, with a floor of 10.
Replaces the prior synthetic payload (many small message objects) with the realistic case the bench is meant to reflect: one ~1.5 KB text question plus N base64 image parts, each 50-500 KB, repeated until the payload reaches the target size. This is what production traffic looks like for vision chat-completion APIs.
…date Before: when buf.len() % 64 != 0 the AVX2 scanner cleared its emitted indices and re-scanned the whole buffer with ScalarScanner, because the scalar fallback validates bracket pairing on a partial slice and would falsely reject a tail that closes a bracket opened in the AVX2 prefix. For realistic JSON payloads (which are essentially never 64-aligned) this made the AVX2 path dead code. After: bracket pairing is moved out-of-line into a post-pass 'validate_brackets' that walks the emitted index array. The AVX2 path emits structural offsets for the main loop, hands the <64-byte tail off to a scalar 'scan_emit_resume' (carrying in_string and bs_carry state from the last AVX2 chunk), then runs the post-pass. Measured on 1-10 MB multimodal chat payloads (one ~1.5 KB text part plus base64 image parts): size before after speedup ------ ------- ------- ------- 1 M 335 op/s 4000 12x 2 M 162 2000 12x 5 M 69 683 10x 10 M 33 325 10x This beats lua-resty-simdjson by 2-3x on the same workload while keeping qd's existing memory-locality win (no full Lua table built). Cross-validated against ScalarScanner via existing proptest harness (2000 cases, including string/escape sequences) plus new targeted tests for unaligned tails, strings crossing the chunk boundary, and backslash-escape sequences straddling the boundary.
When the previous chunk left us inside a string (in_string == 1) and the current chunk's real_quote mask is 0, the entire chunk is string interior — no structurals can appear and in_string stays 1. Skip the structural_mask_chunk computation (14 cmpeq + 14 movemask) and the PCLMUL prefix-XOR. bs_carry is still updated via the backslash mask so escape state remains correct across the boundary. Mathematically equivalent to the slow path in this case (inside = all-1s makes final_mask = 0, emit_bits is a no-op, new_in_string = 1). Verified by the existing scalar parity proptest + two new unit tests (long quote-free string interior, escaped quotes inside a string). Measured on the same multimodal bench: size tail-fix only + fast path gain ------ ------------- ----------- ---- 1 M 4000 op/s 6500 +60% 2 M 2000 2900 +45% 5 M 683 830 +22% 10 M 325 420 +29% Against lua-resty-simdjson at 10 MB: qd 420 vs simdjson 96 = 4.4x. README's deferred list gains three follow-up perf items (shuffle-based structural set check, adaptive out.reserve, AVX-512 backend) that were identified while diagnosing this workload.
5 tasks
membphis
added a commit
that referenced
this pull request
May 15, 2026
* chore: address PR #3 review hygiene items Five Important findings + four small Minor follow-ups from the PR #3 local code review: Important - tests/scanner_crosscheck.rs — drop the stale "AVX2 does not validate brackets" comment and tighten the proptest to require full Result equality (Ok/Err verdict + error offset) and indices equality on every case, not just on Ok. After the tail-bypass fix scalar and AVX2 run the same scan_emit_resume + validate_brackets pipeline, so this is now enforceable. Still passes the 2000-case proptest. - benches/lua_bench.lua — switch image-size RNG from math.random (which delegates to libc rand() and varies across machines) to a deterministic Park-Miller LCG. Same target_bytes now produces byte-identical output on any LuaJIT 2.1 host. - benches/lua_bench.lua — tighten the loop so the actual payload size matches its label. Cap the per-iteration `upper` at `remaining` and allow the last image to shrink below the 50 KB floor when fewer bytes remain. Observed: every scenario now lands within ~0.1% of its label (100k -> 102351 bytes, 1m -> 1048527, 10m -> 10485711) vs up to +49% before. - src/scan/avx2.rs — remove the dead `else if in_string != 0` branch inside the tail handler. `i < buf.len()` makes the `scalar_start <= buf.len()` check trivially true, and scan_emit_resume already returns Err(buf.len()) when start == buf.len() and in_string is set. Replace the unreachable branch with a comment that documents the invariant. - src/scan/mod.rs — drop the inaccurate "the check is defensive" wording from validate_brackets. The function is correctness-coupled with the scanner that produced its index list; a forged quote would flip in_string and mask later mismatches. Minor - Makefile — match the spec's "target — description" help format (em-dash separator) and tighten the awk FS pattern from `:.*## ` to `:[^#]*## ` so descriptions containing `##` aren't truncated by the greedy `.*`. - benches/lua_bench.lua — bump 2m / 5m / 10m iters from 10 to 20 so bigger-payload measurements ride out one-shot allocator / page-fault noise. - src/scan/avx2.rs — rename `escaped_quotes_do_not_trip_fastpath` to `escaped_quotes_remain_correct_with_fastpath`. The test asserts parity with scalar, not that the branch was taken (we have no counter to observe that), so the name should reflect what's actually checked. cargo test --release: 70+3+10+1+5+3+12+1+1 = 106 unit/integration tests plus the 2000-case proptest, all pass. cargo test --release --no-default-features (scalar-only build): 60+3+10+1+5+3+12+1+1 = 96 tests, all pass. * docs: align review-followup comments with actual behavior Address the Important #1 + three Minor findings from the local review of PR #4 (the docs drifted from the code introduced in the same PR): - benches/lua_bench.lua: rewrite the size-accuracy comment. The prior draft referenced a `remaining + slack` upper-bound expression that the final code does not use; replace with the two-branch behaviour actually implemented (normal `min(500K, remaining)`; final image falls through to `max(1024, remaining)`) and the real worst-case overshoot (~1 KB, not "~10 KB"). - Makefile: explain the `:[^#]*## ` FS choice and its tradeoff (targets with `#` in the prerequisite list won't render — none today). - tests/scanner_crosscheck.rs: clarify that the indices assertion holds for both Ok and Err cases because scan_emit_resume always completes emission before any potential Err, and validate_brackets does not modify the index list. - src/scan/avx2.rs: spell out the scalar_start invariant including the exact boundary case (scalar_start == buf.len() when i == buf.len()-1 && in_string != 0 && bs_carry != 0) that scan_emit_resume's post-loop in_str check covers. No code changes. All tests still pass.
This was referenced May 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makefilewrappingbuild/test/lint/bench/cleanso contributors don't have to remember theLD_LIBRARY_PATH/LUA_CPATH/LUAJITdance. Defaults autodetect OpenResty's LuaJIT; all paths are overridable.buf.len() % 64 != 0the scanner was clearing its AVX2-emitted indices and re-scanning the whole buffer withScalarScanner. Bracket pairing is now a post-pass (validate_brackets) over the merged indices, and the <64-byte tail usesscan_emit_resumewith carriedin_string/bs_carrystate.in_string == 1andreal_quote == 0, the whole chunk is string interior — skipstructural_mask_chunk(14 cmpeq + 14 movemask) andemit_bits. Mathematically equivalent to the slow path; verified via the existing scalar parity proptest plus targeted unit tests.Combined impact on the multimodal bench (10 MB payload): qd goes from 33 → 352 ops/s (~11×), beating lua-cjson by 11× and lua-resty-simdjson by 4.3×, while still allocating only ~0.3 KB per parse (vs ~10 MB for both alternatives).
README's Deferred section drops the now-completed AVX2 tail-bypass item and gains three new follow-up perf items (shuffle-based structural set check, adaptive
out.reserve, AVX-512 backend) plus the# Safetydoc backfill thatmake lint -D warningscurrently surfaces.Test plan
cargo test --release— 70 unit + 3 + 10 + 5 + 3 + 12 + 1 + 1 integration tests + 2000-case scanner cross-check proptest all passcargo test --release --no-default-features— scalar-only build, all 60 unit + integration tests passmake help/make build/make benchexercised manuallymake lint— will surface 22missing_safety_docwarnings (pre-existing, tracked in README Deferred for a follow-up PR)