feat(scan): ARM64 NEON scanner + validate_brackets fusion by membphis · Pull Request #17 · api7/lua-qjson

membphis · 2026-05-16T10:51:18Z

Summary

ARM64 NEON scanner (src/scan/neon.rs): a 64-byte-per-iteration scanner for aarch64 that mirrors the AVX2 scanner. Uses four uint8x16_t registers per chunk, vpaddlq-based movemask simulation, and vmull_p64 (PMULL, requires aes target feature) for prefix-XOR inside-string masking. Dispatched at runtime when is_aarch64_feature_detected!("aes") is true.
validate_brackets fusion in ScalarScanner: single-pass emit+validate via scan_and_validate, eliminating the second walk over indices. NeonScanner and Avx2Scanner keep their two-pass design.
Move find_escape_mask_with_carry and emit_bits from avx2.rs to scan/mod.rs as pub(crate), shared by both SIMD scanners.
Cross-check test: 2000-case proptest scalar_neon_bit_identical (aarch64-gated).
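
The prefix-XOR inside-string masking mentioned above can be illustrated portably. This is a minimal sketch, not the code in src/scan/neon.rs: the helper names `byte_mask`, `prefix_xor`, and `clmul_lo` are hypothetical, and the carry-less multiply is emulated in software rather than issued as PMULL. It shows that multiplying the quote bitmask by an all-ones operand gives the running XOR of the quote bits, which marks the bytes inside a string. Escaped quotes are ignored here; the PR handles those separately via find_escape_mask_with_carry.

```rust
/// Build a bitmask with bit i set when chunk[i] == needle (first 64 bytes only).
/// Hypothetical stand-in for the SIMD compare + movemask step.
fn byte_mask(chunk: &[u8], needle: u8) -> u64 {
    chunk
        .iter()
        .take(64)
        .enumerate()
        .fold(0u64, |m, (i, &b)| m | (u64::from(b == needle) << i))
}

/// Prefix XOR: output bit i is the XOR of input bits 0..=i.
/// Shift-and-XOR cascade, the scalar equivalent of the PMULL trick.
fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Software carry-less multiply, low 64 bits only.
/// A hardware PMULL/PCLMULQDQ instruction computes the same product in one step.
fn clmul_lo(a: u64, b: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= a << i;
        }
    }
    acc
}

/// Mask of bytes inside a string: from each opening quote up to, but not
/// including, its closing quote. Assumes no escaped quotes in the chunk.
fn inside_string_mask(chunk: &[u8]) -> u64 {
    prefix_xor(byte_mask(chunk, b'"'))
}
```

For the input `{"a":1}` the quotes sit at offsets 1 and 3, so the quote mask is 0b1010 and the inside-string mask is 0b0110: the opening quote and the `a` are flagged, the closing quote is not. `clmul_lo(mask, u64::MAX)` produces the same value, which is why a single carry-less multiply can replace the six-step cascade.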

Performance (Apple M4, multimodal payload benchmark)

Size	cjson	quickdecode (was scalar)	quickdecode (NEON)	speedup vs cjson
2 KB	255K	296K	654K	2.6x
60 KB	25K	22K	168K	6.7x
100 KB	15K	19K	109K	7.1x
500 KB	3.1K	4.0K	25K	8.2x
1 MB	1.5K	1.8K	12K	7.8x
10 MB	153	195	1,218	8.0x

Brings ARM64 performance in line with the x86_64 AVX2 numbers documented in the README.

Test plan

cargo test --release passes (incl. new proptest)
cargo test --release --no-default-features passes (scalar-only gate)
make bench runs end-to-end on Apple M4
CI: x86_64 default-features
CI: x86_64 scalar-only (--no-default-features)
CI: x86_64 test-panic feature
CI: busted Lua suite

Add src/scan/neon.rs: a 64-byte-per-iteration NEON scanner for aarch64 that mirrors the AVX2 scanner structure. Uses four uint8x16_t registers per chunk, a pairwise-sum movemask16 helper to simulate movemask, and vmull_p64 (PMULL, requires aes target feature) for prefix-XOR inside-string masking. Dispatched automatically by the OnceCell resolver when is_aarch64_feature_detected!("aes") is true.

Fuse validate_brackets into ScalarScanner via scan_and_validate: a single pass emits structural offsets and validates bracket pairing inline, eliminating the second walk over indices. NeonScanner and Avx2Scanner keep their two-pass design (emit via scan_emit_resume + validate).

Move find_escape_mask_with_carry and emit_bits from avx2.rs to scan/mod.rs as pub(crate), shared by both NEON and AVX2 scanners.

Extend scanner_crosscheck.rs with a proptest scalar-vs-NEON cross-check gated on target_arch = "aarch64". Add NeonScanner to __test_api.

Bench on arm64 (M-series): medium fixture (60 KB) 6.1 us/iter ~9.9 GB/s; small fixture (2 KB) 1.1 us/iter ~1.9 GB/s.
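
The runtime dispatch described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the project's resolver: the `Backend` enum and the `detect_backend` and `backend` functions are hypothetical names, std's `OnceLock` stands in for the OnceCell the commit mentions, and the x86_64 branch checks only avx2 (the real AVX2 scanner may require more features).

```rust
use std::sync::OnceLock;

/// Hypothetical tag for the scanner implementation chosen at runtime.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Scalar,
    Avx2,
    Neon,
}

/// Probe CPU features once. The NEON path is gated on "aes" because
/// vmull_p64 (PMULL) is exposed under that target feature.
fn detect_backend() -> Backend {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("aes") {
            return Backend::Neon;
        }
    }
    #[cfg(target_arch = "x86_64")]
    {
        // Assumption: only avx2 is checked in this sketch.
        if std::arch::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    Backend::Scalar
}

/// Cached resolver: the feature probe runs on the first call only.
fn backend() -> Backend {
    static BACKEND: OnceLock<Backend> = OnceLock::new();
    *BACKEND.get_or_init(detect_backend)
}
```

Caching the result keeps the per-call cost to a single atomic load, and the scalar fallback means the same binary still works on CPUs without the required features.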

After fusing validate_brackets into ScalarScanner, the fused scalar aborts at the first bracket mismatch while AVX2 still emits all structural offsets before validate_brackets runs. The two paths therefore disagree on the partial-emit indices in error cases, even though they always agree on the Ok/Err verdict and the error offset. Mirror the existing NEON crosscheck and only compare indices when the result is Ok. Result equality (same Err offset on failure) remains a required invariant. Minimal failing seed: "}{" — scalar emits [0], AVX2 emits [0, 1], both return Err(0).
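
The divergence on "}{" can be reproduced with a deliberately simplified model. This is a sketch under assumptions, not the crate's code: it treats only `{`, `}`, `[`, `]` as structural, ignores strings, and reports an unclosed bracket at `buf.len()`. The function signatures and the `crosscheck` helper are hypothetical; only the names scan_and_validate and validate_brackets come from the PR.

```rust
/// Second pass of the two-pass design: validate pairing over emitted offsets.
fn validate_brackets(buf: &[u8], indices: &[u32]) -> Result<(), usize> {
    let mut stack: Vec<u8> = Vec::new();
    for &idx in indices {
        let i = idx as usize;
        match buf[i] {
            b'{' | b'[' => stack.push(buf[i]),
            b'}' | b']' => {
                let want = if buf[i] == b'}' { b'{' } else { b'[' };
                if stack.pop() != Some(want) {
                    return Err(i);
                }
            }
            _ => {}
        }
    }
    // Assumption: unclosed brackets are reported at end of input.
    if stack.is_empty() { Ok(()) } else { Err(buf.len()) }
}

/// Two-pass model (SIMD scanners): emit every offset, then validate.
fn scan_two_pass(buf: &[u8], out: &mut Vec<u32>) -> Result<(), usize> {
    for (i, &b) in buf.iter().enumerate() {
        if matches!(b, b'{' | b'}' | b'[' | b']') {
            out.push(i as u32);
        }
    }
    validate_brackets(buf, out)
}

/// Fused model (scalar): emit and validate together, abort at first mismatch.
fn scan_and_validate(buf: &[u8], out: &mut Vec<u32>) -> Result<(), usize> {
    let mut stack: Vec<u8> = Vec::new();
    for (i, &b) in buf.iter().enumerate() {
        match b {
            b'{' | b'[' => {
                out.push(i as u32);
                stack.push(b);
            }
            b'}' | b']' => {
                out.push(i as u32);
                let want = if b == b'}' { b'{' } else { b'[' };
                if stack.pop() != Some(want) {
                    return Err(i); // later offsets are never emitted
                }
            }
            _ => {}
        }
    }
    if stack.is_empty() { Ok(()) } else { Err(buf.len()) }
}

/// Cross-check rule from the commit: results must always match,
/// indices are compared only when the result is Ok.
fn crosscheck(buf: &[u8]) {
    let (mut fused, mut two_pass) = (Vec::new(), Vec::new());
    let r1 = scan_and_validate(buf, &mut fused);
    let r2 = scan_two_pass(buf, &mut two_pass);
    assert_eq!(r1, r2);
    if r1.is_ok() {
        assert_eq!(fused, two_pass);
    }
}
```

On `}{` the fused path stops after emitting offset 0, while the two-pass path has already emitted offsets 0 and 1 before validation fails at offset 0. Comparing indices unconditionally would therefore fail even though both paths agree on the result.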

- Status: note ARM64 NEON/PMULL scanner alongside AVX2
- Benchmarks: add Apple M4 NEON results row (2 KB / 100 KB / 1 MB / 10 MB), showing 2.6x-8.0x speedup over lua-cjson on the same multimodal workload
- Roadmap: remove the "ARM64 NEON scanner backend" deferred entry (shipped)
- Roadmap: rewrite the validate_brackets fusion entry: fused into ScalarScanner via scan_and_validate; SIMD scanners still two-pass

docs/benchmarks.md retains its Skylake methodology; a separate ARM64 section there can be added later if needed.

membphis added 3 commits May 16, 2026 18:48

membphis merged commit 53a1169 into main May 16, 2026
2 checks passed

membphis deleted the worktree-arm64-scanner branch May 16, 2026 11:10

This was referenced May 16, 2026

perf(scan): fuse validate_brackets into SIMD emit loops #18

Closed

perf(decode): SIMD-accelerated backslash search in decode_string fast path #20

Closed
