feat(scan): ARM64 NEON scanner + validate_brackets fusion #17

Merged
membphis merged 3 commits into main from worktree-arm64-scanner
May 16, 2026
Conversation

@membphis
Collaborator

Summary

  • ARM64 NEON scanner (src/scan/neon.rs): a 64-byte-per-iteration scanner for aarch64 that mirrors the AVX2 scanner. Uses four uint8x16_t registers per chunk, vpaddlq-based movemask simulation, and vmull_p64 (PMULL, requires aes target feature) for prefix-XOR inside-string masking. Dispatched at runtime when is_aarch64_feature_detected!("aes") is true.
  • validate_brackets fusion in ScalarScanner: single-pass emit+validate via scan_and_validate, eliminating the second walk over indices. NeonScanner and Avx2Scanner keep their two-pass design.
  • Move find_escape_mask_with_carry and emit_bits from avx2.rs to scan/mod.rs as pub(crate), shared by both SIMD scanners.
  • Cross-check test: 2000-case proptest scalar_neon_bit_identical (aarch64-gated).
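The prefix-XOR step mentioned in the first bullet can be illustrated without any SIMD. A carry-less multiply of a quote bitmask by an all-ones constant yields, in its low 64 bits, the running XOR of the mask; the sketch below computes the same value with shifts so it runs on any target. The function names and the carry convention are assumptions for illustration, not the code in src/scan/neon.rs, and backslash escapes are ignored here.

```rust
/// Prefix XOR over a 64-bit mask: bit i of the result is the XOR of
/// input bits 0..=i. This equals the low 64 bits of a carry-less
/// multiply by an all-ones constant, computed here with plain shifts.
fn prefix_xor(mut m: u64) -> u64 {
    m ^= m << 1;
    m ^= m << 2;
    m ^= m << 4;
    m ^= m << 8;
    m ^= m << 16;
    m ^= m << 32;
    m
}

/// Hypothetical helper: bitmask of '"' bytes in the first 64 bytes of
/// `chunk` (escape handling is deliberately left out of this sketch).
fn quote_mask(chunk: &[u8]) -> u64 {
    chunk
        .iter()
        .take(64)
        .enumerate()
        .fold(0u64, |acc, (i, &b)| acc | (((b == b'"') as u64) << i))
}

/// Mask of bytes that lie inside a string. The opening quote is
/// included and the closing quote is excluded. `carry` is all-ones
/// when the previous chunk ended inside a string, else zero.
fn inside_string_mask(quotes: u64, carry: &mut u64) -> u64 {
    let inside = prefix_xor(quotes) ^ *carry;
    // Broadcast bit 63 to every bit for the next chunk: 0 -> 0, 1 -> !0.
    *carry = (inside >> 63).wrapping_neg();
    inside
}
```

For the input `{"a":1}` the quote mask is `0b1010` and the inside-string mask is `0b0110`, covering the opening quote and the `a`.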

Performance (Apple M4, multimodal payload benchmark)

| Size | cjson | quickdecode (was scalar) | quickdecode (NEON) | speedup vs cjson |
|---|---|---|---|---|
| 2 KB | 255K | 296K | 654K | 2.6x |
| 60 KB | 25K | 22K | 168K | 6.7x |
| 100 KB | 15K | 19K | 109K | 7.1x |
| 500 KB | 3.1K | 4.0K | 25K | 8.2x |
| 1 MB | 1.5K | 1.8K | 12K | 7.8x |
| 10 MB | 153 | 195 | 1,218 | 8.0x |

Brings ARM64 performance in line with the x86_64 AVX2 numbers documented in the README.

Test plan

  • cargo test --release passes (incl. new proptest)
  • cargo test --release --no-default-features passes (scalar-only gate)
  • make bench runs end-to-end on Apple M4
  • CI: x86_64 default-features
  • CI: x86_64 scalar-only (--no-default-features)
  • CI: x86_64 test-panic feature
  • CI: busted Lua suite

membphis added 3 commits May 16, 2026 18:48
Add src/scan/neon.rs: a 64-byte-per-iteration NEON scanner for aarch64
that mirrors the AVX2 scanner structure. Uses four uint8x16_t registers
per chunk, movemask16 via pairwise-sum for movemask simulation, and
vmull_p64 (PMULL, requires aes target feature) for prefix-XOR
inside-string masking. Dispatched automatically by the OnceCell resolver
when is_aarch64_feature_detected!("aes") is true.
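A minimal sketch of the one-time runtime dispatch described above, assuming a `ScannerKind` enum and a `std::sync::OnceLock` standing in for the crate's OnceCell resolver. The enum, the function names, and the AVX2 detection condition are hypothetical; only the `is_aarch64_feature_detected!("aes")` check comes from this PR.

```rust
use std::sync::OnceLock;

/// Hypothetical tag for the selected backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScannerKind {
    Scalar,
    Neon,
    Avx2,
}

/// Probe CPU features and pick a backend. The aarch64 branch uses the
/// check named in this PR; the x86_64 condition is an assumption.
fn detect_scanner() -> ScannerKind {
    #[cfg(target_arch = "aarch64")]
    {
        // PMULL (vmull_p64) is gated behind the "aes" target feature.
        if std::arch::is_aarch64_feature_detected!("aes") {
            return ScannerKind::Neon;
        }
    }
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return ScannerKind::Avx2;
        }
    }
    ScannerKind::Scalar
}

/// Resolve once and cache, so later calls skip feature detection.
fn resolved_scanner() -> ScannerKind {
    static RESOLVED: OnceLock<ScannerKind> = OnceLock::new();
    *RESOLVED.get_or_init(detect_scanner)
}
```

Caching the decision keeps feature probing off the hot path; every call after the first is a single atomic load.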

Fuse validate_brackets into ScalarScanner via scan_and_validate: single
pass emits structural offsets and validates bracket pairing inline,
eliminating the second walk over indices. NeonScanner and Avx2Scanner
keep their two-pass design (emit via scan_emit_resume + validate).
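The fused single-pass idea can be sketched as below. This is an illustration under stated assumptions, not the crate's `scan_and_validate`: the signature, the set of bytes treated as structural (opening quotes, brackets, colons, commas), and the error offset reported for an unclosed bracket are all hypothetical. Unterminated strings are not diagnosed.

```rust
/// Emit structural offsets and validate bracket pairing in one pass.
/// On a mismatch the scan stops at once and returns Err(offset);
/// `indices` then holds only the offsets emitted up to that point.
fn scan_and_validate(input: &[u8], indices: &mut Vec<u32>) -> Result<(), usize> {
    let mut stack: Vec<(u8, usize)> = Vec::new(); // open brackets and their offsets
    let mut in_string = false;
    let mut escaped = false;
    for (i, &b) in input.iter().enumerate() {
        if in_string {
            if escaped {
                escaped = false; // this byte was escaped by a backslash
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => {
                in_string = true;
                indices.push(i as u32);
            }
            b'{' | b'[' => {
                indices.push(i as u32);
                stack.push((b, i));
            }
            b'}' | b']' => {
                indices.push(i as u32);
                let want = if b == b'}' { b'{' } else { b'[' };
                match stack.pop() {
                    Some((open, _)) if open == want => {}
                    _ => return Err(i), // abort: no second walk needed
                }
            }
            b':' | b',' => indices.push(i as u32),
            _ => {}
        }
    }
    // Assumption: an unclosed bracket is reported at the innermost opener.
    match stack.last() {
        Some(&(_, open_at)) => Err(open_at),
        None => Ok(()),
    }
}
```

Because validation happens while emitting, an invalid document is rejected at the first mismatch instead of after a full emit pass.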

Move find_escape_mask_with_carry and emit_bits from avx2.rs to
scan/mod.rs as pub(crate), shared by both NEON and AVX2 scanners.
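For readers unfamiliar with these two helpers, here is a rough sketch of what functions with these names typically do in a bitmask-based JSON scanner. The signatures are assumptions, and the escape computation follows the widely used add-with-carry formulation from simdjson rather than anything confirmed by this PR.

```rust
/// Append `base + bit position` for every set bit, lowest bit first.
/// (Hypothetical signature.)
fn emit_bits(mut mask: u64, base: u32, out: &mut Vec<u32>) {
    while mask != 0 {
        out.push(base + mask.trailing_zeros());
        mask &= mask - 1; // clear the lowest set bit
    }
}

/// Given a bitmask of backslash bytes, return the mask of bytes that
/// are escaped, i.e. preceded by an odd-length run of backslashes.
/// `carry` is 1 when the previous 64-byte block ended with an
/// unmatched backslash, so that bit 0 of this block is escaped.
fn find_escape_mask_with_carry(backslash: u64, carry: &mut u64) -> u64 {
    const EVEN_BITS: u64 = 0x5555_5555_5555_5555;
    // A backslash that is itself escaped does not start an escape.
    let backslash = backslash & !*carry;
    let follows_escape = (backslash << 1) | *carry;
    // Runs of backslashes that begin on an odd bit position.
    let odd_starts = backslash & !EVEN_BITS & !follows_escape;
    // The addition ripples a carry through each odd-start run.
    let (even_start_seqs, overflow) = odd_starts.overflowing_add(backslash);
    *carry = overflow as u64;
    let invert_mask = even_start_seqs << 1;
    (EVEN_BITS ^ invert_mask) & follows_escape
}
```

With a backslash mask of `0b11` (two consecutive backslashes) the result is `0b10`: the second backslash is escaped and the byte after it is not.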

Extend scanner_crosscheck.rs with a proptest scalar-vs-NEON cross-check
gated on target_arch = "aarch64". Add NeonScanner to __test_api.

Bench on arm64 (M-series): medium fixture (60 KB) 6.1 us/iter ~9.9 GB/s;
small fixture (2 KB) 1.1 us/iter ~1.9 GB/s.

After fusing validate_brackets into ScalarScanner, the fused scalar
aborts at the first bracket mismatch while AVX2 still emits all
structural offsets before validate_brackets runs. The two paths
therefore disagree on the partial-emit indices in error cases, even
though they always agree on the Ok/Err verdict and the error offset.

Mirror the existing NEON crosscheck and only compare indices when the
result is Ok. Result equality (same Err offset on failure) remains a
required invariant. Minimal failing seed: "}{" — scalar emits [0],
AVX2 emits [0, 1], both return Err(0).
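The relaxed comparison rule can be sketched with two toy scanners that look only at brackets. Everything here (`fused`, `two_pass`, `crosscheck`, and the offset reported for unclosed brackets) is a hypothetical stand-in for the real scanners and test harness; it exists to reproduce the "}{" seed.

```rust
type Scan = (Vec<u32>, Result<(), usize>);

fn is_bracket(b: u8) -> bool {
    matches!(b, b'{' | b'}' | b'[' | b']')
}

fn closes(open: u8, close: u8) -> bool {
    matches!((open, close), (b'{', b'}') | (b'[', b']'))
}

/// Second pass of the two-pass design: walk already-emitted offsets.
fn validate_brackets(input: &[u8], indices: &[u32]) -> Result<(), usize> {
    let mut stack = Vec::new();
    for &i in indices {
        let b = input[i as usize];
        match b {
            b'{' | b'[' => stack.push((b, i as usize)),
            _ => match stack.pop() {
                Some((open, _)) if closes(open, b) => {}
                _ => return Err(i as usize),
            },
        }
    }
    match stack.last() {
        Some(&(_, at)) => Err(at),
        None => Ok(()),
    }
}

/// Two-pass path: emit every bracket offset, then validate.
fn two_pass(input: &[u8]) -> Scan {
    let indices: Vec<u32> = input
        .iter()
        .enumerate()
        .filter(|&(_, &b)| is_bracket(b))
        .map(|(i, _)| i as u32)
        .collect();
    let verdict = validate_brackets(input, &indices);
    (indices, verdict)
}

/// Fused path: emit and validate together, aborting on first mismatch.
fn fused(input: &[u8]) -> Scan {
    let mut indices = Vec::new();
    let mut stack = Vec::new();
    for (i, &b) in input.iter().enumerate() {
        if !is_bracket(b) {
            continue;
        }
        indices.push(i as u32);
        if matches!(b, b'{' | b'[') {
            stack.push((b, i));
            continue;
        }
        match stack.pop() {
            Some((open, _)) if closes(open, b) => {}
            _ => return (indices, Err(i)),
        }
    }
    let verdict = match stack.last() {
        Some(&(_, at)) => Err(at),
        None => Ok(()),
    };
    (indices, verdict)
}

/// Verdicts must always match; indices are compared only on Ok.
fn crosscheck(input: &[u8]) -> bool {
    let (a_idx, a_res) = fused(input);
    let (b_idx, b_res) = two_pass(input);
    a_res == b_res && (a_res.is_err() || a_idx == b_idx)
}
```

On `b"}{"` the fused path returns `([0], Err(0))` and the two-pass path returns `([0, 1], Err(0))`, so a strict index comparison fails while `crosscheck` passes.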
- Status: note ARM64 NEON/PMULL scanner alongside AVX2
- Benchmarks: add Apple M4 NEON results row (2 KB / 100 KB / 1 MB / 10 MB),
  showing 2.6x-8.0x speedup over lua-cjson on the same multimodal workload
- Roadmap: remove the "ARM64 NEON scanner backend" deferred entry (shipped)
- Roadmap: rewrite the validate_brackets fusion entry — fused into
  ScalarScanner via scan_and_validate; SIMD scanners still two-pass

docs/benchmarks.md retains its Skylake methodology; a separate ARM64
section there can be added later if needed.
membphis merged commit 53a1169 into main May 16, 2026
2 checks passed
membphis deleted the worktree-arm64-scanner branch May 16, 2026 11:10