Background
The NEON scanner (src/scan/neon.rs) currently performs structural-character detection with structural_mask64, which does 7 × vceqq_u8 per 16-byte vector (one per char in {, }, [, ], :, ,, ") — 28 compare instructions per 64-byte chunk. It also calls byte_mask64 twice more to build the \ and " masks needed for escape detection and the inside-string PMULL prefix-XOR. Total detection work per non-fast-path chunk is roughly 400 NEON ops (~315 structural + ~44 \ + ~44 ").
Issue #8 B1 proposed a shuffle-based set-membership check for AVX2 (_mm256_shuffle_epi8). ARM64 has the equivalent primitive vqtbl1q_u8 (16-byte single-table lookup), and — crucially — has enough bits per tag byte (8) to fuse all three masks (structural / " / \) into a single table-lookup pass, eliminating both byte_mask64 calls. This three-in-one fusion is NEON-specific because the tag fits in one byte; the AVX2 lane layout does not enable the same trick as cleanly.
Proposal
Nibble dual-LUT classifier
Construct two static 16-byte tables. For each byte vector:
hi = vshrq_n_u8(c, 4) // 0..15
lo = vandq_u8(c, vdupq_n_u8(0x0F)) // 0..15
tag = vandq_u8(vqtbl1q_u8(HI_LUT, hi), vqtbl1q_u8(LO_LUT, lo))
Tag bit assignment:
| char | bit |
|---|---|
| " 0x22 | 0 |
| , 0x2C | 1 |
| : 0x3A | 2 |
| [ 0x5B | 3 |
| ] 0x5D | 4 |
| { 0x7B | 5 |
| } 0x7D | 6 |
| \ 0x5C | 7 |
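The two tables follow mechanically from this bit assignment: each LUT entry is the OR of the tag bits of every classified byte that shares that nibble. A minimal sketch of that construction, assuming a compile-time const fn is acceptable here (CLASSES and build_luts are hypothetical names for illustration, not taken from neon.rs):

```rust
/// (byte, tag bit) pairs copied from the bit-assignment table.
const CLASSES: [(u8, u8); 8] = [
    (b'"', 0),
    (b',', 1),
    (b':', 2),
    (b'[', 3),
    (b']', 4),
    (b'{', 5),
    (b'}', 6),
    (b'\\', 7),
];

/// Builds (HI_LUT, LO_LUT). Each entry is the OR of the tag bits of every
/// classified byte whose high (resp. low) nibble equals the entry index.
const fn build_luts() -> ([u8; 16], [u8; 16]) {
    let mut hi = [0u8; 16];
    let mut lo = [0u8; 16];
    let mut i = 0;
    while i < CLASSES.len() {
        let (byte, bit) = CLASSES[i];
        let hi_idx = (byte >> 4) as usize;
        let lo_idx = (byte & 0x0F) as usize;
        hi[hi_idx] = hi[hi_idx] | (1u8 << bit);
        lo[lo_idx] = lo[lo_idx] | (1u8 << bit);
        i += 1;
    }
    (hi, lo)
}

const LUTS: ([u8; 16], [u8; 16]) = build_luts();
pub const HI_LUT: [u8; 16] = LUTS.0;
pub const LO_LUT: [u8; 16] = LUTS.1;
```

Generating the arrays this way is optional; it can also serve purely as a cross-check on hand-written literals, since any change to the bit assignment then propagates to both tables at once.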
Tables (all other indices = 0):
HI_LUT[2]=0x03 HI_LUT[3]=0x04 HI_LUT[5]=0x98 HI_LUT[7]=0x60
LO_LUT[2]=0x01 LO_LUT[A]=0x04 LO_LUT[B]=0x28 LO_LUT[C]=0x82 LO_LUT[D]=0x50
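One way the three-line lookup might be written with the core::arch::aarch64 intrinsics is sketched below. The function name classify16 and the [u8; 16] return type are assumptions for illustration (the real scanner would presumably keep the tag in a register and feed vtstq_u8 directly), and the non-AArch64 fallback exists only so the sketch builds on other hosts.

```rust
static HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
static LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];

/// Classifies 16 bytes with two table lookups and returns one tag byte per lane.
#[cfg(target_arch = "aarch64")]
pub fn classify16(chunk: &[u8; 16]) -> [u8; 16] {
    use core::arch::aarch64::*;
    let mut out = [0u8; 16];
    // SAFETY: every pointer refers to a live 16-byte array, and NEON is a
    // baseline feature of AArch64 targets.
    unsafe {
        // In a real chunk loop these two table loads would be hoisted.
        let hi_lut = vld1q_u8(HI_LUT.as_ptr());
        let lo_lut = vld1q_u8(LO_LUT.as_ptr());
        let c = vld1q_u8(chunk.as_ptr());
        let hi = vshrq_n_u8::<4>(c); // high nibble, 0..15
        let lo = vandq_u8(c, vdupq_n_u8(0x0F)); // low nibble, 0..15
        let tag = vandq_u8(vqtbl1q_u8(hi_lut, hi), vqtbl1q_u8(lo_lut, lo));
        vst1q_u8(out.as_mut_ptr(), tag);
    }
    out
}

/// Portable stand-in that computes the same per-lane tag without NEON.
#[cfg(not(target_arch = "aarch64"))]
pub fn classify16(chunk: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, &b) in out.iter_mut().zip(chunk.iter()) {
        *o = HI_LUT[(b >> 4) as usize] & LO_LUT[(b & 0x0F) as usize];
    }
    out
}
```

Because every index fed to vqtbl1q_u8 is a nibble in 0..15, the instruction's out-of-range behavior (returning 0 for indices of 16 or more) never comes into play.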
After the lookup, derive the three u64 masks (quote, backslash, structural) with vtstq_u8, a single instruction that computes (a & b) != 0 and returns 0xFF/0x00 per lane, so its result feeds the existing movemask16 directly:
quote_mask = movemask( vtstq_u8(tag, splat(0x01)) ) // bit 0
backslash_mask = movemask( vtstq_u8(tag, splat(0x80)) ) // bit 7
struct_mask = movemask( vtstq_u8(tag, splat(0x7F)) ) // bits 0..6
The downstream pipeline (find_escape_mask_with_carry, inside_string_neon via PMULL, emit_bits) is unchanged.
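To make the mask derivation concrete, here is a scalar model of one 64-byte chunk. It is a sketch rather than the NEON code: ChunkMasks and classify_chunk are hypothetical names, and it assumes that bit i of each mask corresponds to byte i of the chunk (least significant bit first), which is how a movemask is usually laid out but is not confirmed by the text above.

```rust
static HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
static LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];

/// The three per-chunk bitmasks consumed by the downstream pipeline.
#[derive(Debug, PartialEq, Eq)]
pub struct ChunkMasks {
    pub quote: u64,
    pub backslash: u64,
    pub structural: u64,
}

/// Scalar model: one tag per byte, then three bit tests per tag.
pub fn classify_chunk(chunk: &[u8; 64]) -> ChunkMasks {
    let (mut quote, mut backslash, mut structural) = (0u64, 0u64, 0u64);
    for (i, &b) in chunk.iter().enumerate() {
        let tag = HI_LUT[(b >> 4) as usize] & LO_LUT[(b & 0x0F) as usize];
        // Each line stands in for vtstq_u8(tag, splat(m)) followed by movemask.
        quote |= u64::from((tag & 0x01) != 0) << i; // bit 0
        backslash |= u64::from((tag & 0x80) != 0) << i; // bit 7
        structural |= u64::from((tag & 0x7F) != 0) << i; // bits 0..6
    }
    ChunkMasks { quote, backslash, structural }
}
```

Note that the structural mask includes quotes (bit 0 is inside 0x7F) but never backslashes, matching the seven-character set used by structural_mask64.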
Correctness — exhaustive verification
For all 256 byte values, HI_LUT[byte>>4] & LO_LUT[byte&0x0F] ≠ 0 ⇔ byte ∈ {}[]:,"\. Worked through manually for each structural char and for adjacent non-structural neighbors that share a nibble:
| byte | hi & lo | result |
|---|---|---|
| * 0x2A | 0x03 & 0x04 | 0 ✓ |
| + 0x2B | 0x03 & 0x28 | 0 ✓ |
| - 0x2D | 0x03 & 0x50 | 0 ✓ |
| ; 0x3B | 0x04 & 0x28 | 0 ✓ |
| < 0x3C | 0x04 & 0x82 | 0 ✓ |
| = 0x3D | 0x04 & 0x50 | 0 ✓ |
| Z 0x5A | 0x98 & 0x04 | 0 ✓ |
| r 0x72 | 0x60 & 0x01 | 0 ✓ |
| z 0x7A | 0x60 & 0x04 | 0 ✓ |
| \| 0x7C | 0x60 & 0x82 | 0 ✓ |
| any byte ≥ 0x80 | HI_LUT[8..F] = 0 | 0 ✓ |
Cross-contamination check on shared low-nibble C: , (0x2C) yields 0x02 only (bit 1, comma) and \ (0x5C) yields 0x80 only (bit 7, backslash) — neither contaminates the other because their respective hi-rows lack the other's bit. An exhaustive 256-byte unit test will guard this in CI.
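A sketch of what such an exhaustive check might look like follows. The reference function is written as a plain match so that it shares no logic with the tables; expected_tag, lut_tag and first_mismatch are hypothetical helper names. It checks a slightly stronger property than the equivalence stated above: every classified byte must produce exactly its own tag bit, which also covers the cross-contamination case.

```rust
static HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
static LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];

/// Reference tag for a byte, independent of the lookup tables.
pub fn expected_tag(b: u8) -> u8 {
    match b {
        b'"' => 0x01,
        b',' => 0x02,
        b':' => 0x04,
        b'[' => 0x08,
        b']' => 0x10,
        b'{' => 0x20,
        b'}' => 0x40,
        b'\\' => 0x80,
        _ => 0,
    }
}

/// Tag produced by the nibble dual-LUT classifier.
pub fn lut_tag(b: u8) -> u8 {
    HI_LUT[(b >> 4) as usize] & LO_LUT[(b & 0x0F) as usize]
}

/// Returns the first byte whose LUT tag differs from the reference, if any.
pub fn first_mismatch() -> Option<u8> {
    (0u8..=255).find(|&b| lut_tag(b) != expected_tag(b))
}
```

Returning the offending byte rather than a bare bool tends to make a CI failure easier to diagnose, since the mismatch points straight at the table row that needs fixing.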
Estimated impact
Per non-fast-path 64-byte chunk:
| Stage | Current | Proposed |
|---|---|---|
| byte_mask64('\\') | ~44 ops | merged |
| byte_mask64('"') | ~44 ops | merged |
| structural_mask64 (7 chars × 4 quarter-vectors) | ~315 ops | - |
| LUT classify + 3× vtstq_u8 + 3× movemask | - | ~140 ops |
| escape mask + PMULL + emit_bits (unchanged) | ~30 ops | ~30 ops |
| total per non-fast-path chunk | ~433 ops | ~170 ops |
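The totals in this table are plain sums of the per-stage estimates, and the ratio between them is where the headline figure comes from. A small sketch of that arithmetic, using only the approximate op counts from the table (the constant and function names are made up for illustration):

```rust
// Approximate per-chunk op counts copied from the table; all are estimates.
const BYTE_MASK_BACKSLASH: u32 = 44;
const BYTE_MASK_QUOTE: u32 = 44;
const STRUCTURAL_MASK: u32 = 315;
const LUT_CLASSIFY: u32 = 140; // LUT classify + 3x vtstq_u8 + 3x movemask
const DOWNSTREAM: u32 = 30; // escape mask + PMULL + emit_bits, unchanged

/// Estimated ops per non-fast-path chunk in the current scanner.
pub fn current_total() -> u32 {
    BYTE_MASK_BACKSLASH + BYTE_MASK_QUOTE + STRUCTURAL_MASK + DOWNSTREAM
}

/// Estimated ops per non-fast-path chunk with the fused classifier.
pub fn proposed_total() -> u32 {
    LUT_CLASSIFY + DOWNSTREAM
}

/// Ratio of estimated op counts (current / proposed).
pub fn op_ratio() -> f64 {
    f64::from(current_total()) / f64::from(proposed_total())
}
```

This is a ratio of instruction-count estimates only; it says nothing about latency, port pressure, or the fast-path chunks that skip classification.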
Estimated ~2.5× fewer ops on non-fast-path chunks (433 / 170 ≈ 2.5). The end-to-end gain depends on the in-string fast-probe hit rate (the existing short-circuit at neon.rs:103, which skips classification when in_string && (backslash | quote) == 0):
| Workload | Fast-path hit rate | Expected end-to-end gain |
|---|---|---|
| object-heavy (small_api.json, config JSON) | ~0% | 15–25% |
| mixed (medium_resp.json) | ~40% | 8–12% |
| string-heavy (multimodal base64) | ~80% | 3–6% |
These estimates run slightly higher than the original #8 B1 figure because the fused three-in-one classification removes the two byte_mask64 calls that #8 B1 does not address.
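Validation plan
- cargo test --release: NEON gate
- cargo test --release --no-default-features: scalar unchanged (control)
- cargo test --release --features test-panic: FFI panic barrier intact
- tests/scanner_crosscheck.rs proptest passes (≥ 2000 cases)
- nibble_lut_classifies_all_256_bytes_correctly (the exhaustive 256-byte unit test)
- make bench 3-run median, before vs after, on small_api.json and medium_resp.json; report per-fixture delta separately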
Implementation notes
- Lives entirely in src/scan/neon.rs. ~150 LOC. AVX2 and scalar paths untouched.
- HI_LUT / LO_LUT as module-level static. Verify vld1q_u8 is hoisted out of the chunk loop (inspect with cargo asm if in doubt).
- vqtbl1q_u8 is 1-cycle latency / 2 IPC throughput on Apple M1/M2/M3. Older Cortex-A cores have higher latency, but instruction-count reduction alone still wins.
- The in-string fast-probe path stays as-is and continues to skip classification entirely on plain-string chunks.
- Once landed, structural_mask64 and the dual-purpose byte_mask64 become dead in neon.rs and should be removed. AVX2 keeps its own copies (separate file).
References