perf(scan): memchr2 cross-chunk jump in NEON and AVX2 fast probe #33
Merged
After a fast-probe miss (no quote/backslash in the current 64B chunk), both NEON and AVX2 scanners now call memchr::memchr2 to skip ahead to the 64B-aligned chunk containing the next interesting byte rather than advancing one chunk at a time. A 256-byte remaining-buffer threshold gates the call so short payloads never pay the memchr2 call overhead; above that threshold the jump amortizes immediately.

Measured on Apple M4 (NEON), "parse + access 3 fields" workload:
- 2 KB (small_api.json): 648,761 ops/s; regression eliminated, flat vs. pre-jump baseline
- 100 KB: 245,700 ops/s; 17.2x over cjson (+125% vs. pre-jump 108,932)
- 1 MB: 34,884 ops/s; 23.7x over cjson (+193% vs. pre-jump 11,905)
- 10 MB: 3,406 ops/s; 22.7x over cjson (+180% vs. pre-jump 1,218)

AVX2 receives the identical change; compile-verified on aarch64; x86_64 parity is covered by CI.
… regression

The 256-byte threshold still fired memchr2 across most of a 2 KB document (only the last few chunks were exempt), and the call overhead per fast-probe miss outweighed the scanner work it replaced. The net result was a ~10% regression on small_api.json under the 'make bench' methodology, where cjson runs first and leaves a polluted heap.

Bumping the threshold to 4 KB means memchr2 is never called on payloads ≤4 KB total, restoring baseline parity. On larger payloads only the final 4 KB forgoes the jump, which is invisible against MB-scale gains.

3-run median 'qd.parse' on Apple M4 vs main:
- 2 KB: -2% (flat, within noise)
- 60 KB: +60%
- 100 KB: +69%
- 1 MB: +107%
- 10 MB: +109%

README ARM64 numbers updated to reflect the post-threshold reality.
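The "only the last few chunks were exempt" observation follows from simple chunk arithmetic. The sketch below is illustrative only: `jump_eligible_chunks` is a hypothetical helper, and it assumes the gate compares the bytes remaining from the start of the current chunk against the threshold, which the commit message does not spell out.

```rust
/// Chunk width of the fast probe, as stated in the commit message.
const CHUNK: usize = 64;

/// Hypothetical helper (not from the PR): counts how many full 64B chunks
/// of a payload would be allowed to call `memchr2` after a miss, ASSUMING
/// the gate is "bytes remaining from the chunk start > threshold".
/// Returns (eligible_chunks, total_chunks).
fn jump_eligible_chunks(payload_len: usize, threshold: usize) -> (usize, usize) {
    let total = payload_len / CHUNK;
    let eligible = (0..total)
        .filter(|i| payload_len - i * CHUNK > threshold)
        .count();
    (eligible, total)
}
```

Under that assumption, `jump_eligible_chunks(2048, 256)` returns `(28, 32)`, so a 2 KB document could trigger the call on 28 of its 32 chunks, while `jump_eligible_chunks(2048, 4096)` returns `(0, 32)`. For a 1 MiB payload at the 4 KB threshold, only 64 of 16,384 chunks are exempt.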
Summary
After an in-string fast-probe miss (no `"` or `\` in the current 64B chunk), both NEON and AVX2 scanners now call `memchr::memchr2` to skip directly to the 64B-aligned chunk containing the next interesting byte, instead of advancing one chunk at a time.

A 4 KB remaining-buffer threshold gates the call so payloads ≤4 KB never pay the `memchr2` call overhead. Above the threshold the jump amortizes immediately. On large payloads only the final 4 KB forgoes the jump, which is invisible against MB-scale gains.
Closes #26.
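
The control flow described in the summary can be sketched roughly as follows. This is a minimal scalar model, not the PR's NEON/AVX2 code: `next_chunk_after_miss` and `find_either` are hypothetical names, `find_either` is a dependency-free stand-in for `memchr::memchr2`, and the gate is assumed to compare the bytes remaining from the current chunk start against the threshold.

```rust
/// Chunk width used by the fast probe.
const CHUNK: usize = 64;
/// Remaining-buffer threshold at or below which the jump is skipped.
const JUMP_THRESHOLD: usize = 4 * 1024;

/// Stand-in for `memchr::memchr2` so the sketch needs no external crate:
/// index of the first byte equal to `a` or `b`, if any.
fn find_either(a: u8, b: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&c| c == a || c == b)
}

/// Hypothetical helper: called after the in-string chunk at `pos`
/// (64B-aligned) was found to contain neither `"` nor `\`.
/// Returns the start of the next chunk the main loop should examine.
fn next_chunk_after_miss(buf: &[u8], pos: usize) -> usize {
    debug_assert!(pos % CHUNK == 0);
    let next = pos + CHUNK;
    // Assumed gate: short remaining buffer, advance one chunk as before.
    if buf.len().saturating_sub(pos) <= JUMP_THRESHOLD {
        return next;
    }
    match find_either(b'"', b'\\', &buf[next..]) {
        // Round down to the 64B boundary of the chunk holding the hit.
        Some(off) => (next + off) & !(CHUNK - 1),
        // Nothing interesting left: go to the last 64B boundary.
        None => buf.len() & !(CHUNK - 1),
    }
}
```

In this model a hit at byte 5000 of an 8 KB buffer sends the scanner from chunk 0 straight to offset 4992, the 64B-aligned chunk that contains the hit; any payload of 4 KB or less always takes the single-chunk path.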

Measured impact (Apple M4, NEON, `quickdecode.parse` + access 3 fields)
3-run median on each branch (`make bench`):

End-to-end scanner throughput on large string-heavy payloads rises from ~17 GB/s to ~36 GB/s: the fast probe was ALU-bound rather than memory-bound, and `memchr2` exposes a much tighter inner loop.

Note on small payloads
Small JSON (≤4 KB) is flat, not improved. The 4 KB threshold guarantees `memchr2` is never called on those payloads, so they match baseline performance to within bench noise. There's no meaningful win to be had at that size: scanner work is a small fraction of total parse time for 2 KB inputs (Lua FFI dispatch dominates), so any scanner-targeted optimization is invisible there. The threshold's job is to ensure no regression, which it achieves.

Why this works
In an in-string chunk with no `"` and no `\`, the `in_string` polarity cannot flip and no escape sequence can begin. The skipped span is therefore invariant for both `in_string` and `bs_carry`, so jumping multiple chunks at once is sound. The jump lands on a 64B boundary so the main SIMD loop invariants are preserved.

Test plan
- `cargo test --release`
- `cargo test --release --no-default-features` (scalar control)
- `make lint` (clippy `-D warnings`)
- `make bench` 3-run medians match README
- … `tests/scanner_crosscheck.rs` on x86 runner)

Commits
- `7844484`: initial NEON + AVX2 cross-chunk jump (256B threshold)
- `2fa76cb`: bump threshold to 4 KB to eliminate small-payload regression
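
The soundness argument under "Why this works" can be illustrated with a byte-at-a-time model of the two pieces of state it names. This is a simplified sketch, not the PR's scanner: `ScanState`, `step`, and `scan` are hypothetical, and `bs_carry` is modeled here as "the previous byte was an unescaped backslash" rather than as a SIMD carry between chunks.

```rust
/// Simplified model of the scanner state named in the description.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ScanState {
    in_string: bool,
    bs_carry: bool,
}

/// Advance the model by one byte.
fn step(mut s: ScanState, byte: u8) -> ScanState {
    if s.bs_carry {
        // This byte is escaped by the preceding backslash.
        s.bs_carry = false;
        return s;
    }
    match byte {
        b'\\' if s.in_string => s.bs_carry = true,
        b'"' => s.in_string = !s.in_string,
        _ => {}
    }
    s
}

/// Advance the model across a whole span.
fn scan(start: ScanState, bytes: &[u8]) -> ScanState {
    bytes.iter().fold(start, |s, &b| step(s, b))
}
```

Because the chunk that missed contains no backslash, `bs_carry` is false when the skipped span begins. Scanning any span with no `"` and no `\` from that state returns the same state, which is why skipping the span is equivalent to walking it chunk by chunk in this model.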