feat(scan): ARM64 NEON scanner + validate_brackets fusion #17
Merged
Add src/scan/neon.rs: a 64-byte-per-iteration NEON scanner for aarch64
that mirrors the AVX2 scanner structure. Uses four uint8x16_t registers
per chunk, a pairwise-sum movemask16 helper to simulate x86 movemask, and
vmull_p64 (PMULL, requires aes target feature) for prefix-XOR
inside-string masking. Dispatched automatically by the OnceCell resolver
when is_aarch64_feature_detected!("aes") is true.
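As a portable illustration of the prefix-XOR step, the sketch below computes the inside-string mask for one 64-bit quote bitmask without any intrinsics. It is not the code in src/scan/neon.rs: `prefix_xor`, `clmul_lo`, and `quote_mask` are hypothetical helper names, and escape handling is ignored. The point it shows is that a carry-less multiply by an all-ones operand (what PMULL provides in hardware) yields the same low 64 bits as a shift-XOR cascade.

```rust
/// Prefix-XOR via a shift-XOR cascade: bit i of the result is the XOR of
/// bits 0..=i of `m`. Applied to a quote bitmask, the result is set from
/// each opening quote up to (but not including) its closing quote.
fn prefix_xor(mut m: u64) -> u64 {
    m ^= m << 1;
    m ^= m << 2;
    m ^= m << 4;
    m ^= m << 8;
    m ^= m << 16;
    m ^= m << 32;
    m
}

/// Software carry-less multiply, low 64 bits only. Stands in for the
/// hardware PMULL instruction so the sketch runs on any target.
fn clmul_lo(a: u64, b: u64) -> u64 {
    let mut r = 0u64;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            r ^= a << i;
        }
    }
    r
}

/// Bitmask of '"' positions in the first 64 bytes of `chunk`.
/// Assumption: no escaped quotes (the real scanner masks those out first).
fn quote_mask(chunk: &[u8]) -> u64 {
    chunk
        .iter()
        .take(64)
        .enumerate()
        .fold(0, |m, (i, &b)| m | (u64::from(b == b'"') << i))
}

fn main() {
    let chunk = br#"{"a":"b"}"#;
    let quotes = quote_mask(chunk); // quotes at byte offsets 1, 3, 5, 7
    let inside = prefix_xor(quotes);
    // Multiplying by all-ones carry-lessly gives the same mask.
    assert_eq!(inside, clmul_lo(quotes, u64::MAX));
    println!("quotes = {quotes:#04x}, inside-string = {inside:#04x}");
}
```

Structural characters whose bit is set in the inside-string mask can then be cleared with a single AND-NOT before offsets are emitted.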
Fuse validate_brackets into ScalarScanner via scan_and_validate: single
pass emits structural offsets and validates bracket pairing inline,
eliminating the second walk over indices. NeonScanner and Avx2Scanner
keep their two-pass design (emit via scan_emit_resume + validate).
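A minimal sketch of what a fused single pass can look like, assuming a simplified structural set (`{ } [ ] : ,` plus opening quotes) and an error offset that points at the offending byte. The signature, the bracket stack, and the choice to report the innermost unclosed opener at end of input are assumptions for illustration, not the crate's actual `scan_and_validate`; unterminated strings are not diagnosed here.

```rust
/// Hypothetical fused scan: pushes structural offsets into `indices` and
/// checks bracket pairing in the same loop, returning at the first mismatch.
fn scan_and_validate(input: &[u8], indices: &mut Vec<u32>) -> Result<(), usize> {
    let mut stack: Vec<(u8, usize)> = Vec::new(); // (opener byte, offset)
    let mut in_string = false;
    let mut escaped = false;

    for (i, &b) in input.iter().enumerate() {
        if in_string {
            // Inside a string only escapes and the closing quote matter.
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => {
                in_string = true;
                indices.push(i as u32);
            }
            b'{' | b'[' => {
                indices.push(i as u32);
                stack.push((b, i));
            }
            b'}' | b']' => {
                indices.push(i as u32);
                let expected = if b == b'}' { b'{' } else { b'[' };
                match stack.pop() {
                    Some((open, _)) if open == expected => {}
                    // Mismatch or stray closer: abort without scanning further.
                    _ => return Err(i),
                }
            }
            b':' | b',' => indices.push(i as u32),
            _ => {}
        }
    }
    // Assumption: an unclosed opener is reported at its own offset.
    match stack.last() {
        Some(&(_, at)) => Err(at),
        None => Ok(()),
    }
}

fn main() {
    let mut indices = Vec::new();
    let verdict = scan_and_validate(br#"{"a":[1,2]}"#, &mut indices);
    println!("{verdict:?} {indices:?}");

    let mut partial = Vec::new();
    let verdict = scan_and_validate(b"}{", &mut partial);
    println!("{verdict:?} {partial:?}");
}
```

Because validation happens as each offset is emitted, a failing input leaves `indices` holding only the offsets seen up to the error, which is the trade-off a two-pass scanner does not have.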
Move find_escape_mask_with_carry and emit_bits from avx2.rs to
scan/mod.rs as pub(crate), shared by both NEON and AVX2 scanners.
Extend scanner_crosscheck.rs with a proptest scalar-vs-NEON cross-check
gated on target_arch = "aarch64". Add NeonScanner to __test_api.
Bench on arm64 (M-series): medium fixture (60 KB) 6.1 us/iter ~9.9 GB/s;
small fixture (2 KB) 1.1 us/iter ~1.9 GB/s.
After fusing validate_brackets into ScalarScanner, the fused scalar
aborts at the first bracket mismatch while AVX2 still emits all
structural offsets before validate_brackets runs. The two paths
therefore disagree on the partial-emit indices in error cases, even
though they always agree on the Ok/Err verdict and the error offset.
Mirror the existing NEON crosscheck and only compare indices when the
result is Ok. Result equality (same Err offset on failure) remains a
required invariant. Minimal failing seed: "}{" — scalar emits [0],
AVX2 emits [0, 1], both return Err(0).
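The comparison rule described above can be sketched as a small predicate. `Outcome` and `outcomes_agree` are hypothetical names used only for illustration; the actual assertions live in scanner_crosscheck.rs and may be structured differently.

```rust
/// What one scanner produced: the emitted structural offsets and the verdict.
type Outcome = (Vec<u32>, Result<(), usize>);

/// Verdicts (including the Err offset) must always match.
/// Emitted indices are compared only when both scanners returned Ok.
fn outcomes_agree(a: &Outcome, b: &Outcome) -> bool {
    if a.1 != b.1 {
        return false;
    }
    a.1.is_err() || a.0 == b.0
}

fn main() {
    // The minimal failing seed "}{" from the commit message.
    let scalar: Outcome = (vec![0], Err(0));
    let avx2: Outcome = (vec![0, 1], Err(0));
    assert!(outcomes_agree(&scalar, &avx2));

    // A differing error offset is still a real disagreement.
    let other: Outcome = (vec![0, 1], Err(1));
    assert!(!outcomes_agree(&scalar, &other));
}
```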
- Status: note ARM64 NEON/PMULL scanner alongside AVX2
- Benchmarks: add Apple M4 NEON results row (2 KB / 100 KB / 1 MB / 10 MB), showing 2.6x-8.0x speedup over lua-cjson on the same multimodal workload
- Roadmap: remove the "ARM64 NEON scanner backend" deferred entry (shipped)
- Roadmap: rewrite the validate_brackets fusion entry: fused into ScalarScanner via scan_and_validate; SIMD scanners still two-pass

docs/benchmarks.md retains its Skylake methodology; a separate ARM64 section there can be added later if needed.
This was referenced May 16, 2026
Summary
- NEON scanner (`src/scan/neon.rs`): a 64-byte-per-iteration scanner for aarch64 that mirrors the AVX2 scanner. Uses four `uint8x16_t` registers per chunk, `vpaddlq`-based movemask simulation, and `vmull_p64` (PMULL, requires the `aes` target feature) for prefix-XOR inside-string masking. Dispatched at runtime when `is_aarch64_feature_detected!("aes")` is true.
- `validate_brackets` fusion in `ScalarScanner`: single-pass emit+validate via `scan_and_validate`, eliminating the second walk over `indices`. NeonScanner and Avx2Scanner keep their two-pass design.
- Move `find_escape_mask_with_carry` and `emit_bits` from `avx2.rs` to `scan/mod.rs` as `pub(crate)`, shared by both SIMD scanners.
- New proptest `scalar_neon_bit_identical` (aarch64-gated).

Performance (Apple M4, multimodal payload benchmark)
Brings ARM64 performance in line with the x86_64 AVX2 numbers documented in the README.
Test plan
- `cargo test --release` passes (incl. new proptest)
- `cargo test --release --no-default-features` passes (scalar-only gate)
- `make bench` runs end-to-end on Apple M4