Skip to content

build: Makefile with bench target + AVX2 scanner perf fixes#3

Merged
membphis merged 6 commits into
mainfrom
worktree-makefile-add-bench-target
May 15, 2026
Merged

build: Makefile with bench target + AVX2 scanner perf fixes#3
membphis merged 6 commits into
mainfrom
worktree-makefile-add-bench-target

Conversation

@membphis
Copy link
Copy Markdown
Collaborator

@membphis membphis commented May 15, 2026

Summary

  • Add a root Makefile wrapping build / test / lint / bench / clean so contributors don't have to remember the LD_LIBRARY_PATH / LUA_CPATH / LUAJIT dance. Defaults autodetect OpenResty's LuaJIT; all paths are overridable.
  • Rewrite the Lua bench: adds 100 K – 10 M scenarios, reports decodes/sec (replacing per-op μs), and models payloads as realistic multimodal chat requests (one ~1.5 KB text part + N base64 image parts, 50–500 KB each).
  • Fix two latent AVX2 scanner perf issues uncovered by the new bench:
    1. Tail-bypass: when buf.len() % 64 != 0 the scanner was clearing its AVX2-emitted indices and re-scanning the whole buffer with ScalarScanner. Bracket pairing is now a post-pass (validate_brackets) over the merged indices, and the <64-byte tail uses scan_emit_resume with carried in_string / bs_carry state.
    2. String-skip fast path: when the previous chunk left in_string == 1 and real_quote == 0, the whole chunk is string interior — skip structural_mask_chunk (14 cmpeq + 14 movemask) and emit_bits. Mathematically equivalent to the slow path; verified via the existing scalar parity proptest plus targeted unit tests.

Combined impact on the multimodal bench (10 MB payload): qd goes from 33 → 352 ops/s (~11×), beating lua-cjson by 11× and lua-resty-simdjson by 4.3×, while still allocating only ~0.3 KB per parse (vs ~10 MB for both alternatives).

README's Deferred section drops the now-completed AVX2 tail-bypass item and gains three new follow-up perf items (shuffle-based structural set check, adaptive out.reserve, AVX-512 backend) plus the # Safety doc backfill that make lint -D warnings currently surfaces.

Test plan

  • cargo test --release — 70 unit + 3 + 10 + 5 + 3 + 12 + 1 + 1 integration tests + 2000-case scanner cross-check proptest all pass
  • cargo test --release --no-default-features — scalar-only build, all 60 unit + integration tests pass
  • make help / make build / make bench exercised manually
  • Three-way bench (cjson / simdjson / qd) sanity-checked across 9 size scenarios, 3 runs for stability
  • CI green
  • make lint — will surface 22 missing_safety_doc warnings (pre-existing, tracked in README Deferred for a follow-up PR)

membphis added 6 commits May 15, 2026 15:21
bench wires up LD_LIBRARY_PATH and LUA_CPATH so 'make bench' invokes
luajit on benches/lua_bench.lua without manual env setup. LUAJIT and
LUA_CPATH are overridable. Defaults autodetect luajit on PATH and fall
back to OpenResty's install path.

README lists '# Safety' doc backfill on the 22 unsafe FFI exports as
deferred — 'make lint' uses -D warnings and currently fails until that
follow-up lands.
Adds 7 new size scenarios (100k/200k/500k/1m/2m/5m/10m) generated
in-memory from a repeated message template, so no large fixture
files are added to the repo.

Replaces the per-op microsecond column with a decodes-per-second
throughput column for easier cross-size comparison. Iter counts
scale down with payload size, with a floor of 10.
Replaces the prior synthetic payload (many small message objects) with
the realistic case the bench is meant to reflect: one ~1.5 KB text
question plus N base64 image parts, each 50-500 KB, repeated until the
payload reaches the target size. This is what production traffic looks
like for vision chat-completion APIs.
…date

Before: when buf.len() % 64 != 0 the AVX2 scanner cleared its emitted
indices and re-scanned the whole buffer with ScalarScanner, because the
scalar fallback validates bracket pairing on a partial slice and would
falsely reject a tail that closes a bracket opened in the AVX2 prefix.
For realistic JSON payloads (which are essentially never 64-aligned)
this made the AVX2 path dead code.

After: bracket pairing is moved out-of-line into a post-pass
'validate_brackets' that walks the emitted index array. The AVX2 path
emits structural offsets for the main loop, hands the <64-byte tail
off to a scalar 'scan_emit_resume' (carrying in_string and bs_carry
state from the last AVX2 chunk), then runs the post-pass.

Measured on 1-10 MB multimodal chat payloads (one ~1.5 KB text part
plus base64 image parts):

  size   before    after    speedup
  ------ -------   -------  -------
  1 M    335 op/s  4000     12x
  2 M    162       2000     12x
  5 M    69        683      10x
  10 M   33        325      10x

This beats lua-resty-simdjson by 2-3x on the same workload while
keeping qd's existing memory-locality win (no full Lua table built).

Cross-validated against ScalarScanner via existing proptest harness
(2000 cases, including string/escape sequences) plus new targeted
tests for unaligned tails, strings crossing the chunk boundary, and
backslash-escape sequences straddling the boundary.
When the previous chunk left us inside a string (in_string == 1) and
the current chunk's real_quote mask is 0, the entire chunk is string
interior — no structurals can appear and in_string stays 1. Skip the
structural_mask_chunk computation (14 cmpeq + 14 movemask) and the
PCLMUL prefix-XOR. bs_carry is still updated via the backslash mask
so escape state remains correct across the boundary.

Mathematically equivalent to the slow path in this case (inside =
all-1s makes final_mask = 0, emit_bits is a no-op, new_in_string = 1).
Verified by the existing scalar parity proptest + two new unit tests
(long quote-free string interior, escaped quotes inside a string).

Measured on the same multimodal bench:

  size   tail-fix only   + fast path   gain
  ------ -------------   -----------   ----
  1 M    4000 op/s       6500          +60%
  2 M    2000            2900          +45%
  5 M    683             830           +22%
  10 M   325             420           +29%

Against lua-resty-simdjson at 10 MB: qd 420 vs simdjson 96 = 4.4x.

README's deferred list gains three follow-up perf items (shuffle-based
structural set check, adaptive out.reserve, AVX-512 backend) that were
identified while diagnosing this workload.
@membphis membphis merged commit 3fa8633 into main May 15, 2026
2 checks passed
@membphis membphis mentioned this pull request May 15, 2026
5 tasks
membphis added a commit that referenced this pull request May 15, 2026
* chore: address PR #3 review hygiene items

Five Important findings + four small Minor follow-ups from the PR #3
local code review:

Important
- tests/scanner_crosscheck.rs — drop the stale "AVX2 does not validate
  brackets" comment and tighten the proptest to require full Result
  equality (Ok/Err verdict + error offset) and indices equality on every
  case, not just on Ok. After the tail-bypass fix scalar and AVX2 run the
  same scan_emit_resume + validate_brackets pipeline, so this is now
  enforceable. Still passes the 2000-case proptest.
- benches/lua_bench.lua — switch image-size RNG from math.random (which
  delegates to libc rand() and varies across machines) to a deterministic
  Park-Miller LCG. Same target_bytes now produces byte-identical output
  on any LuaJIT 2.1 host.
- benches/lua_bench.lua — tighten the loop so the actual payload size
  matches its label. Cap the per-iteration `upper` at `remaining` and
  allow the last image to shrink below the 50 KB floor when fewer bytes
  remain. Observed: every scenario now lands within ~0.1% of its label
  (100k -> 102351 bytes, 1m -> 1048527, 10m -> 10485711) vs up to +49%
  before.
- src/scan/avx2.rs — remove the dead `else if in_string != 0` branch
  inside the tail handler. `i < buf.len()` makes the `scalar_start <=
  buf.len()` check trivially true, and scan_emit_resume already returns
  Err(buf.len()) when start == buf.len() and in_string is set. Replace
  the unreachable branch with a comment that documents the invariant.
- src/scan/mod.rs — drop the inaccurate "the check is defensive" wording
  from validate_brackets. The function is correctness-coupled with the
  scanner that produced its index list; a forged quote would flip
  in_string and mask later mismatches.

Minor
- Makefile — match the spec's "target — description" help format (em-dash
  separator) and tighten the awk FS pattern from `:.*## ` to `:[^#]*## `
  so descriptions containing `##` aren't truncated by the greedy `.*`.
- benches/lua_bench.lua — bump 2m / 5m / 10m iters from 10 to 20 so
  bigger-payload measurements ride out one-shot allocator / page-fault
  noise.
- src/scan/avx2.rs — rename `escaped_quotes_do_not_trip_fastpath` to
  `escaped_quotes_remain_correct_with_fastpath`. The test asserts parity
  with scalar, not that the branch was taken (we have no counter to
  observe that), so the name should reflect what's actually checked.

cargo test --release: 70+3+10+1+5+3+12+1+1 = 106 unit/integration tests
plus the 2000-case proptest, all pass. cargo test --release
--no-default-features (scalar-only build): 60+3+10+1+5+3+12+1+1 = 96
tests, all pass.

* docs: align review-followup comments with actual behavior

Address the Important #1 + three Minor findings from the local review
of PR #4 (the docs drifted from the code introduced in the same PR):

- benches/lua_bench.lua: rewrite the size-accuracy comment. The prior
  draft referenced a `remaining + slack` upper-bound expression that
  the final code does not use; replace with the two-branch behaviour
  actually implemented (normal `min(500K, remaining)`; final image
  falls through to `max(1024, remaining)`) and the real worst-case
  overshoot (~1 KB, not "~10 KB").
- Makefile: explain the `:[^#]*## ` FS choice and its tradeoff (targets
  with `#` in the prerequisite list won't render — none today).
- tests/scanner_crosscheck.rs: clarify that the indices assertion holds
  for both Ok and Err cases because scan_emit_resume always completes
  emission before any potential Err, and validate_brackets does not
  modify the index list.
- src/scan/avx2.rs: spell out the scalar_start invariant including the
  exact boundary case (scalar_start == buf.len() when i == buf.len()-1
  && in_string != 0 && bs_carry != 0) that scan_emit_resume's post-loop
  in_str check covers.

No code changes. All tests still pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant