perf(scan): NEON nibble-LUT classifier — fuse structural / quote / backslash detection #31

@membphis

Background

The NEON scanner (src/scan/neon.rs) currently performs structural-character detection with structural_mask64, which issues 7 × vceqq_u8 per 16-byte vector (one for each of the seven characters { } [ ] : , "), i.e. 28 compare instructions per 64-byte chunk. It then calls byte_mask64 twice more to build the \ and " masks needed for escape detection and the inside-string PMULL prefix-XOR. Total detection work per non-fast-path chunk is roughly 400 NEON ops (~315 structural + ~44 for \ + ~44 for ").

Issue #8 B1 proposed a shuffle-based set-membership check for AVX2 (_mm256_shuffle_epi8). ARM64 has the equivalent primitive vqtbl1q_u8 (16-byte single-table lookup), and — crucially — has enough bits per tag byte (8) to fuse all three masks (structural / " / \) into a single table-lookup pass, eliminating both byte_mask64 calls. This three-in-one fusion is NEON-specific because the tag fits in one byte; the AVX2 lane layout does not enable the same trick as cleanly.

Proposal

Nibble dual-LUT classifier

Construct two static 16-byte tables. For each 16-byte input vector c:

hi  = vshrq_n_u8(c, 4)                 // 0..15
lo  = vandq_u8(c, vdupq_n_u8(0x0F))    // 0..15
tag = vandq_u8(vqtbl1q_u8(HI_LUT, hi), vqtbl1q_u8(LO_LUT, lo))

Tag bit assignment:

char  byte  bit
"     0x22  0
,     0x2C  1
:     0x3A  2
[     0x5B  3
]     0x5D  4
{     0x7B  5
}     0x7D  6
\     0x5C  7

Tables (all other indices = 0):

HI_LUT[2]=0x03  HI_LUT[3]=0x04  HI_LUT[5]=0x98  HI_LUT[7]=0x60
LO_LUT[2]=0x01  LO_LUT[A]=0x04  LO_LUT[B]=0x28  LO_LUT[C]=0x82  LO_LUT[D]=0x50
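
As a cross-check on the table values, the two LUTs can be derived mechanically from the bit assignment. The following is a minimal Rust sketch, not the code in src/scan/neon.rs: CLASSES, build_lut, and classify are hypothetical names introduced here, and classify is only a scalar model of one lane of the vector lookup.

```rust
/// (byte, tag bit) pairs copied from the bit-assignment table.
const CLASSES: [(u8, u8); 8] = [
    (b'"', 0),
    (b',', 1),
    (b':', 2),
    (b'[', 3),
    (b']', 4),
    (b'{', 5),
    (b'}', 6),
    (b'\\', 7),
];

/// Build one 16-entry table: entry i collects the tag bits of every class
/// byte whose selected nibble (high or low) equals i.
const fn build_lut(use_high_nibble: bool) -> [u8; 16] {
    let mut lut = [0u8; 16];
    let mut i = 0;
    while i < CLASSES.len() {
        let (byte, bit) = CLASSES[i];
        let nibble = if use_high_nibble { byte >> 4 } else { byte & 0x0F };
        lut[nibble as usize] |= 1u8 << bit;
        i += 1;
    }
    lut
}

pub const HI_LUT: [u8; 16] = build_lut(true);
pub const LO_LUT: [u8; 16] = build_lut(false);

/// Scalar model of one lane: AND of the two nibble lookups.
pub const fn classify(byte: u8) -> u8 {
    HI_LUT[(byte >> 4) as usize] & LO_LUT[(byte & 0x0F) as usize]
}
```

With these definitions classify(b',') evaluates to 0x02 and classify(b'a') to 0, and the derived tables match the hand-written values listed above.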

After the lookup, derive the three u64 masks with vtstq_u8, a single instruction that computes (a & b) != 0 per lane and returns 0xFF or 0x00, so its result feeds the existing movemask16 directly:

quote_mask     = movemask( vtstq_u8(tag, splat(0x01)) )    // bit 0
backslash_mask = movemask( vtstq_u8(tag, splat(0x80)) )    // bit 7
struct_mask    = movemask( vtstq_u8(tag, splat(0x7F)) )    // bits 0..6
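
One way the lookup and the three mask extractions might fit together for a full 64-byte chunk is sketched below in Rust. This is an illustration under assumptions, not the implementation: ChunkMasks, classify_chunk, and this movemask16 body are hypothetical (the repository's own movemask16 may be written differently), and the non-AArch64 branch is a scalar stand-in so the sketch compiles on any host.

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::*;

pub const HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
pub const LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];

/// Masks for one 64-byte chunk; bit i corresponds to chunk[i].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChunkMasks {
    pub quote: u64,
    pub backslash: u64,
    pub structural: u64,
}

/// Hypothetical stand-in for movemask16: one result bit per 0xFF/0x00 lane.
#[cfg(target_arch = "aarch64")]
fn movemask16(v: uint8x16_t) -> u64 {
    const WEIGHTS: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];
    // SAFETY: NEON is part of the AArch64 baseline; WEIGHTS holds 16 bytes.
    unsafe {
        let m = vandq_u8(v, vld1q_u8(WEIGHTS.as_ptr()));
        u64::from(vaddv_u8(vget_low_u8(m))) | (u64::from(vaddv_u8(vget_high_u8(m))) << 8)
    }
}

#[cfg(target_arch = "aarch64")]
pub fn classify_chunk(chunk: &[u8; 64]) -> ChunkMasks {
    let mut out = ChunkMasks { quote: 0, backslash: 0, structural: 0 };
    // SAFETY: every load below reads 16 in-bounds bytes.
    unsafe {
        // Tables and splats are loaded once, outside the per-vector loop.
        let hi_lut = vld1q_u8(HI_LUT.as_ptr());
        let lo_lut = vld1q_u8(LO_LUT.as_ptr());
        let low_nibble = vdupq_n_u8(0x0F);
        for q in 0..4usize {
            let c = vld1q_u8(chunk.as_ptr().add(q * 16));
            let hi = vshrq_n_u8::<4>(c);
            let lo = vandq_u8(c, low_nibble);
            let tag = vandq_u8(vqtbl1q_u8(hi_lut, hi), vqtbl1q_u8(lo_lut, lo));
            let shift = q * 16;
            out.quote |= movemask16(vtstq_u8(tag, vdupq_n_u8(0x01))) << shift;
            out.backslash |= movemask16(vtstq_u8(tag, vdupq_n_u8(0x80))) << shift;
            out.structural |= movemask16(vtstq_u8(tag, vdupq_n_u8(0x7F))) << shift;
        }
    }
    out
}

/// Scalar stand-in with the same contract, used on non-AArch64 hosts.
#[cfg(not(target_arch = "aarch64"))]
pub fn classify_chunk(chunk: &[u8; 64]) -> ChunkMasks {
    let mut out = ChunkMasks { quote: 0, backslash: 0, structural: 0 };
    for (i, &b) in chunk.iter().enumerate() {
        let tag = HI_LUT[(b >> 4) as usize] & LO_LUT[(b & 0x0F) as usize];
        out.quote |= u64::from((tag & 0x01) != 0) << i;
        out.backslash |= u64::from((tag & 0x80) != 0) << i;
        out.structural |= u64::from((tag & 0x7F) != 0) << i;
    }
    out
}
```

Both branches are meant to return identical masks for the same input, so the scalar one can double as a reference when cross-checking the NEON path.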

The downstream pipeline (find_escape_mask_with_carry, inside_string_neon via PMULL, emit_bits) is unchanged.
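
For readers unfamiliar with how the quote and backslash masks are consumed downstream, here is a scalar reference model in Rust. It is an assumption-laden sketch: escape_mask_model, prefix_xor, and inside_string_model are hypothetical names, and the real find_escape_mask_with_carry and inside_string_neon may use different signatures and a branch-free formulation. The shift-XOR cascade is swapped in for the PMULL carry-less multiply; both compute the same prefix XOR.

```rust
/// Bytes preceded by an unescaped backslash. `carry` is true on entry when
/// the previous chunk ended with an unescaped backslash, and is updated on exit.
pub fn escape_mask_model(backslash: u64, carry: &mut bool) -> u64 {
    let mut escaped = 0u64;
    let mut pending = *carry;
    for i in 0..64 {
        if pending {
            // This byte is escaped, so it cannot start a new escape itself.
            escaped |= 1u64 << i;
            pending = false;
        } else if (backslash >> i) & 1 == 1 {
            pending = true;
        }
    }
    *carry = pending;
    escaped
}

/// Prefix XOR: bit i of the result is the XOR of bits 0..=i of `x`.
pub fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// In-string mask: set from each opening quote up to, but not including,
/// the matching closing quote. `prev_in_string` carries state across chunks.
pub fn inside_string_model(quote: u64, escaped: u64, prev_in_string: &mut bool) -> u64 {
    let real_quotes = quote & !escaped;
    let mut in_string = prefix_xor(real_quotes);
    if *prev_in_string {
        in_string = !in_string;
    }
    *prev_in_string = (in_string >> 63) & 1 == 1;
    in_string
}
```

The classifier only changes how quote and backslash are produced; a model like this takes them as plain u64 inputs, which is why the downstream stages need no changes.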

Correctness — exhaustive verification

For all 256 byte values, HI_LUT[byte>>4] & LO_LUT[byte&0x0F] ≠ 0 ⇔ byte ∈ {}[]:,"\. This was worked through by hand for each structural char and for the adjacent non-structural neighbors that share a nibble with one:

char  byte    hi & lo            result
*     0x2A    0x03 & 0x04        0 ✓
+     0x2B    0x03 & 0x28        0 ✓
-     0x2D    0x03 & 0x50        0 ✓
;     0x3B    0x04 & 0x28        0 ✓
<     0x3C    0x04 & 0x82        0 ✓
=     0x3D    0x04 & 0x50        0 ✓
Z     0x5A    0x98 & 0x04        0 ✓
r     0x72    0x60 & 0x01        0 ✓
z     0x7A    0x60 & 0x04        0 ✓
|     0x7C    0x60 & 0x82        0 ✓
any   ≥ 0x80  HI_LUT[8..F] = 0   0 ✓

Cross-contamination check on shared low-nibble C: , (0x2C) yields 0x02 only (bit 1, comma) and \ (0x5C) yields 0x80 only (bit 7, backslash) — neither contaminates the other because their respective hi-rows lack the other's bit. An exhaustive 256-byte unit test will guard this in CI.
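
The exhaustive guard could look roughly like the following Rust sketch. expected_tag and first_mismatch are hypothetical helpers introduced for illustration, and the check is slightly stronger than the stated equivalence, since it compares the exact tag bit for every byte and not only zero versus non-zero.

```rust
pub const HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
pub const LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];

/// Expected tag for a byte, taken directly from the bit-assignment table.
fn expected_tag(byte: u8) -> u8 {
    match byte {
        b'"' => 0x01,
        b',' => 0x02,
        b':' => 0x04,
        b'[' => 0x08,
        b']' => 0x10,
        b'{' => 0x20,
        b'}' => 0x40,
        b'\\' => 0x80,
        _ => 0,
    }
}

/// Returns the first byte whose LUT tag differs from the expected tag, if any.
pub fn first_mismatch() -> Option<u8> {
    (0u8..=255).find(|&b| {
        let tag = HI_LUT[(b >> 4) as usize] & LO_LUT[(b & 0x0F) as usize];
        tag != expected_tag(b)
    })
}

#[cfg(test)]
#[test]
fn nibble_lut_classifies_all_256_bytes_correctly() {
    assert_eq!(first_mismatch(), None);
}
```

Returning the offending byte instead of a bare boolean makes a CI failure point at the exact table entry that regressed.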

Estimated impact

Per non-fast-path 64-byte chunk:

Stage                                            Current   Proposed
byte_mask64('\\')                                ~44 ops   merged
byte_mask64('"')                                 ~44 ops   merged
structural_mask64 (7 chars × 4 quarter-vectors)  ~315 ops  -
LUT classify + 3× vtstq_u8 + 3× movemask         -         ~140 ops
escape mask + PMULL + emit_bits (unchanged)      ~30 ops   ~30 ops
total per non-fast-path chunk                    ~433 ops  ~170 ops

On non-fast-path chunks this is ~2.5× fewer ops (433 / 170 ≈ 2.5), which the estimate treats as a ~2.5× speedup. The end-to-end gain depends on how often the in-string fast probe hits (the existing short-circuit at neon.rs:103, which skips classification when in_string && (backslash | quote) == 0):

Workload                                    Fast-path hit rate  Expected end-to-end gain
object-heavy (small_api.json, config JSON)  ~0%                 15–25%
mixed (medium_resp.json)                    ~40%                8–12%
string-heavy (multimodal base64)            ~80%                3–6%

These estimates run slightly higher than the original #8 B1 figure because the fused three-in-one classification removes the two byte_mask64 calls that #8 B1 does not address.

Validation plan

  • cargo test --release — NEON gate
  • cargo test --release --no-default-features — scalar unchanged (control)
  • cargo test --release --features test-panic — FFI panic barrier intact
  • Existing tests/scanner_crosscheck.rs proptest passes (≥ 2000 cases)
  • New exhaustive test nibble_lut_classifies_all_256_bytes_correctly
  • make bench 3-run median, before vs after, on small_api.json and medium_resp.json — report per-fixture delta separately

Implementation notes

  • Lives entirely in src/scan/neon.rs. ~150 LOC. AVX2 and scalar paths untouched.
  • HI_LUT / LO_LUT as module-level static. Verify vld1q_u8 is hoisted out of the chunk loop (inspect with cargo asm if in doubt).
  • vqtbl1q_u8 has 1-cycle latency and a throughput of 2 per cycle on Apple M1/M2/M3. Older Cortex-A cores have higher latency, but the instruction-count reduction alone should still win.
  • The in-string fast-probe path stays as-is and continues to skip classification entirely on plain-string chunks.
  • Once landed, structural_mask64 and the dual-purpose byte_mask64 become dead in neon.rs and should be removed. AVX2 keeps its own copies (separate file).
