perf(scan): NEON nibble-LUT classifier — fuse structural / quote / backslash detection

## Background

The NEON scanner (`src/scan/neon.rs`) currently performs structural-character detection with `structural_mask64`, which does **7 × `vceqq_u8`** per 16-byte vector (one per char in `{`, `}`, `[`, `]`, `:`, `,`, `"`) — **28 compare instructions** per 64-byte chunk. It also calls `byte_mask64` twice more to build the `\` and `"` masks needed for escape detection and the inside-string PMULL prefix-XOR. Total detection work per non-fast-path chunk is roughly **400 NEON ops** (~315 structural + ~44 `\` + ~44 `"`).

Issue #8 B1 proposed a shuffle-based set-membership check for AVX2 (`_mm256_shuffle_epi8`). ARM64 has the equivalent primitive `vqtbl1q_u8` (16-byte single-table lookup). The lookup yields an 8-bit tag per input byte, and there are exactly 8 characters of interest (7 structural chars plus `\`), so each character can own one tag bit. That is enough to **fuse all three masks** (structural / `"` / `\`) into a **single table-lookup pass**, eliminating both `byte_mask64` calls. This three-in-one fusion is NEON-specific because the tag fits in one byte; the AVX2 lane layout does not enable the same trick as cleanly.

## Proposal

### Nibble dual-LUT classifier

Construct two `static` 16-byte tables. For each byte vector:

```
hi  = vshrq_n_u8(c, 4)                 // 0..15
lo  = vandq_u8(c, vdupq_n_u8(0x0F))    // 0..15
tag = vandq_u8(vqtbl1q_u8(HI_LUT, hi), vqtbl1q_u8(LO_LUT, lo))
```

Tag bit assignment:

| char | bit |
|---|---|
| `"`  0x22 | 0 |
| `,`  0x2C | 1 |
| `:`  0x3A | 2 |
| `[`  0x5B | 3 |
| `]`  0x5D | 4 |
| `{`  0x7B | 5 |
| `}`  0x7D | 6 |
| `\`  0x5C | 7 |

Tables (all other indices = 0):

```
HI_LUT[2]=0x03  HI_LUT[3]=0x04  HI_LUT[5]=0x98  HI_LUT[7]=0x60
LO_LUT[2]=0x01  LO_LUT[A]=0x04  LO_LUT[B]=0x28  LO_LUT[C]=0x82  LO_LUT[D]=0x50
```
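To avoid transcription errors, the two tables can be derived from the bit assignment instead of being typed in by hand. The following is a minimal sketch, assuming a hypothetical `build_luts` const fn and `CHARS` array (neither exists in the current code): it ORs each char's tag bit into the row for its high nibble and the row for its low nibble, and the result should reproduce the constants listed above.

```rust
/// (byte, tag bit) pairs copied from the bit-assignment table.
const CHARS: [(u8, u8); 8] = [
    (b'"', 0), (b',', 1), (b':', 2), (b'[', 3),
    (b']', 4), (b'{', 5), (b'}', 6), (b'\\', 7),
];

/// Hypothetical helper: builds (HI_LUT, LO_LUT) at compile time.
const fn build_luts() -> ([u8; 16], [u8; 16]) {
    let mut hi = [0u8; 16];
    let mut lo = [0u8; 16];
    let mut i = 0;
    while i < CHARS.len() {
        let byte = CHARS[i].0;
        let bit = CHARS[i].1;
        // A char's bit must appear in both of its nibble rows so the AND keeps it.
        hi[(byte >> 4) as usize] |= 1u8 << bit;
        lo[(byte & 0x0F) as usize] |= 1u8 << bit;
        i += 1;
    }
    (hi, lo)
}

static HI_LUT: [u8; 16] = build_luts().0;
static LO_LUT: [u8; 16] = build_luts().1;

fn main() {
    println!("HI_LUT = {:02X?}", HI_LUT);
    println!("LO_LUT = {:02X?}", LO_LUT);
}
```

Deriving the tables this way keeps the bit-assignment table as the single source of truth if a tag bit is ever reassigned.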

After the lookup, derive the three u64 masks (quote, backslash, structural) using `vtstq_u8`, a single-instruction `(a & b) != 0` that returns 0xFF/0x00 per lane and therefore feeds the existing `movemask16` directly:

```
quote_mask     = movemask( vtstq_u8(tag, splat(0x01)) )    // bit 0
backslash_mask = movemask( vtstq_u8(tag, splat(0x80)) )    // bit 7
struct_mask    = movemask( vtstq_u8(tag, splat(0x7F)) )    // bits 0..6
```
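As a portable reference for what the three lines above compute, the following scalar sketch classifies one 64-byte chunk and sets bit `i` of each mask from the tag of byte `i`. The names `classify64_scalar` and `Masks` are hypothetical, and the bit order (byte 0 in the least significant bit) is an assumption about what `movemask16` produces.

```rust
const HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
const LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];

/// Hypothetical container for the three per-chunk masks.
#[derive(Debug, Default, PartialEq)]
struct Masks {
    quote: u64,
    backslash: u64,
    structural: u64,
}

/// Scalar model of the fused classifier for one 64-byte chunk.
fn classify64_scalar(chunk: &[u8; 64]) -> Masks {
    let mut m = Masks::default();
    for (i, &c) in chunk.iter().enumerate() {
        let tag = HI_LUT[(c >> 4) as usize] & LO_LUT[(c & 0x0F) as usize];
        m.quote |= u64::from((tag & 0x01) != 0) << i; // tag bit 0
        m.backslash |= u64::from((tag & 0x80) != 0) << i; // tag bit 7
        m.structural |= u64::from((tag & 0x7F) != 0) << i; // tag bits 0..6
    }
    m
}

fn main() {
    // Pad a short JSON fragment with spaces to fill one chunk.
    let mut chunk = [b' '; 64];
    let src = br#"{"a\"b":[1,2]}"#;
    chunk[..src.len()].copy_from_slice(src);
    let m = classify64_scalar(&chunk);
    println!("quote      = {:#018x}", m.quote);
    println!("backslash  = {:#018x}", m.backslash);
    println!("structural = {:#018x}", m.structural);
}
```

Note that the structural mask includes `"` (tag bit 0), matching the 7-char set that `structural_mask64` compares against today.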

The downstream pipeline (`find_escape_mask_with_carry`, `inside_string_neon` via PMULL, `emit_bits`) is unchanged.

### Correctness — exhaustive verification

For all 256 byte values, `HI_LUT[byte>>4] & LO_LUT[byte&0x0F] ≠ 0` ⇔ byte ∈ `{}[]:,"\`. A nonzero tag requires a high nibble in {2, 3, 5, 7} and a low nibble in {2, A, B, C, D}, so only 20 byte values can produce one: the 8 target chars and the 12 non-structural neighbors that land in both sets of nibble rows. Worked through manually, each of those 12 neighbors ANDs to zero:

| byte | hi & lo | result |
|---|---|---|
| `*` 0x2A | 0x03 & 0x04 | 0 ✓ |
| `+` 0x2B | 0x03 & 0x28 | 0 ✓ |
| `-` 0x2D | 0x03 & 0x50 | 0 ✓ |
| `2` 0x32 | 0x04 & 0x01 | 0 ✓ |
| `;` 0x3B | 0x04 & 0x28 | 0 ✓ |
| `<` 0x3C | 0x04 & 0x82 | 0 ✓ |
| `=` 0x3D | 0x04 & 0x50 | 0 ✓ |
| `R` 0x52 | 0x98 & 0x01 | 0 ✓ |
| `Z` 0x5A | 0x98 & 0x04 | 0 ✓ |
| `r` 0x72 | 0x60 & 0x01 | 0 ✓ |
| `z` 0x7A | 0x60 & 0x04 | 0 ✓ |
| `\|` 0x7C | 0x60 & 0x82 | 0 ✓ |
| any byte ≥ 0x80 | HI_LUT[8..F] = 0 | 0 ✓ |

Cross-contamination check on shared low-nibble C: `,` (0x2C) yields 0x02 only (bit 1, comma) and `\` (0x5C) yields 0x80 only (bit 7, backslash) — neither contaminates the other because their respective hi-rows lack the other's bit. An exhaustive 256-byte unit test will guard this in CI.
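The exhaustive guard mentioned above could look roughly like the following. This is a sketch under assumptions: the helpers `tag_of`, `expected_tag`, and `first_mismatch` are hypothetical, the tables are repeated so the block is self-contained, and the check runs against a scalar model of the lookup rather than the NEON path itself.

```rust
const HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
const LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];

/// Scalar model of the fused lookup for one byte.
fn tag_of(b: u8) -> u8 {
    HI_LUT[(b >> 4) as usize] & LO_LUT[(b & 0x0F) as usize]
}

/// Expected tag, taken directly from the bit-assignment table.
fn expected_tag(b: u8) -> u8 {
    match b {
        b'"' => 0x01,
        b',' => 0x02,
        b':' => 0x04,
        b'[' => 0x08,
        b']' => 0x10,
        b'{' => 0x20,
        b'}' => 0x40,
        b'\\' => 0x80,
        _ => 0x00,
    }
}

/// First byte value whose tag differs from the expected one, if any.
fn first_mismatch() -> Option<u8> {
    (0..=u8::MAX).find(|&b| tag_of(b) != expected_tag(b))
}

#[test]
fn nibble_lut_classifies_all_256_bytes_correctly() {
    assert_eq!(first_mismatch(), None);
}

fn main() {
    match first_mismatch() {
        None => println!("all 256 byte values classify as expected"),
        Some(b) => println!("mismatch at byte {:#04x}", b),
    }
}
```

Comparing the exact tag value, rather than only zero versus nonzero, also catches cross-contamination such as `,` picking up the backslash bit.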

## Estimated impact

Per non-fast-path 64-byte chunk:

| Stage | Current | Proposed |
|---|---:|---:|
| `byte_mask64('\\')` | ~44 ops | merged |
| `byte_mask64('"')`  | ~44 ops | merged |
| `structural_mask64` (7 chars × 4 quarter-vectors) | ~315 ops | — |
| LUT classify + 3× `vtstq_u8` + 3× movemask | — | ~140 ops |
| escape mask + PMULL + emit_bits (unchanged) | ~30 ops | ~30 ops |
| **total per non-fast-path chunk** | **~433 ops** | **~170 ops** |

That is ~2.5× fewer ops (433 / 170 ≈ 2.5) on non-fast-path chunks; it is an instruction-count estimate, not a measured speedup. End-to-end gain depends on the in-string fast-probe hit rate (existing `neon.rs:103` short-circuit that skips classification when `in_string && (backslash | quote) == 0`):

| Workload | Fast-path hit rate | Expected end-to-end |
|---|---|---:|
| object-heavy (small_api.json, config JSON) | ~0% | **15–25%** |
| mixed (medium_resp.json) | ~40% | **8–12%** |
| string-heavy (multimodal base64) | ~80% | **3–6%** |

These estimates run slightly higher than the original #8 B1 figure because the fused three-in-one classification removes the two `byte_mask64` calls that #8 B1 does not address.
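The per-chunk arithmetic behind these figures can be reproduced with a few lines. This is a rough sketch: it uses only the op counts quoted above, assumes fast-path chunks are unaffected by the change, and does not model the scanner's share of total runtime, so it shows the trend across workloads rather than the end-to-end percentages. The function name is hypothetical.

```rust
const CURRENT_OPS: u32 = 433; // per non-fast-path chunk, from the stage table
const PROPOSED_OPS: u32 = 170;

/// Ops saved per 100 chunks when `hit_pct` percent of chunks take the
/// in-string fast path (those chunks are assumed to cost the same either way).
fn ops_saved_per_100_chunks(hit_pct: u32) -> u32 {
    (100 - hit_pct) * (CURRENT_OPS - PROPOSED_OPS)
}

fn main() {
    let ratio = f64::from(CURRENT_OPS) / f64::from(PROPOSED_OPS);
    println!("op-count ratio on non-fast-path chunks: {:.2}x", ratio);
    for (name, hit_pct) in [("object-heavy", 0u32), ("mixed", 40), ("string-heavy", 80)] {
        println!(
            "{}: {} ops saved per 100 chunks",
            name,
            ops_saved_per_100_chunks(hit_pct)
        );
    }
}
```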

## Validation plan

- [ ] `cargo test --release` — NEON gate
- [ ] `cargo test --release --no-default-features` — scalar unchanged (control)
- [ ] `cargo test --release --features test-panic` — FFI panic barrier intact
- [ ] Existing `tests/scanner_crosscheck.rs` proptest passes (≥ 2000 cases)
- [ ] New exhaustive test `nibble_lut_classifies_all_256_bytes_correctly`
- [ ] `make bench` 3-run median, before vs after, on small_api.json and medium_resp.json — report per-fixture delta separately

## Implementation notes

- Lives entirely in `src/scan/neon.rs`. ~150 LOC. AVX2 and scalar paths untouched.
- `HI_LUT` / `LO_LUT` as module-level `static`. Verify `vld1q_u8` is hoisted out of the chunk loop (inspect with `cargo asm` if in doubt).
- `vqtbl1q_u8` is 1-cycle latency / 2 IPC throughput on Apple M1/M2/M3. Older Cortex-A cores have higher latency, but instruction-count reduction alone still wins.
- The in-string fast-probe path stays as-is and continues to skip classification entirely on plain-string chunks.
- Once landed, `structural_mask64` and the dual-purpose `byte_mask64` become dead in `neon.rs` and should be removed. AVX2 keeps its own copies (separate file).
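
A possible shape for the NEON path, with the table loads placed before the chunk loop, is sketched below. Everything here is an assumption for illustration: `scan_masks` is a hypothetical name, the weighted-add movemask is a stand-in for the existing `movemask16`, a trailing partial chunk is simply ignored, and a scalar fallback is included only so the block also builds on non-ARM hosts.

```rust
static HI_LUT: [u8; 16] = [0, 0, 0x03, 0x04, 0, 0x98, 0, 0x60, 0, 0, 0, 0, 0, 0, 0, 0];
static LO_LUT: [u8; 16] = [0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0x04, 0x28, 0x82, 0x50, 0, 0];
/// Tag-bit selectors for [quote, backslash, structural].
const SELECTORS: [u8; 3] = [0x01, 0x80, 0x7F];

/// Returns [quote, backslash, structural] masks for each full 64-byte chunk.
#[cfg(target_arch = "aarch64")]
fn scan_masks(buf: &[u8]) -> Vec<[u64; 3]> {
    use core::arch::aarch64::*;
    // Per-lane bit weights for the stand-in movemask.
    static WEIGHTS: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];
    let mut out = Vec::with_capacity(buf.len() / 64);
    // SAFETY: NEON is baseline on aarch64; every load reads 16 bytes from a
    // 16-byte table or from inside a 64-byte chunk of `buf`.
    unsafe {
        // Loaded once, outside the chunk loop.
        let hi_lut = vld1q_u8(HI_LUT.as_ptr());
        let lo_lut = vld1q_u8(LO_LUT.as_ptr());
        let weights = vld1q_u8(WEIGHTS.as_ptr());
        let low_nibble = vdupq_n_u8(0x0F);
        for chunk in buf.chunks_exact(64) {
            let mut masks = [0u64; 3];
            for k in 0..4 {
                let c = vld1q_u8(chunk.as_ptr().add(16 * k));
                let tag = vandq_u8(
                    vqtbl1q_u8(hi_lut, vshrq_n_u8::<4>(c)),
                    vqtbl1q_u8(lo_lut, vandq_u8(c, low_nibble)),
                );
                for (slot, sel) in SELECTORS.iter().enumerate() {
                    // 0xFF/0x00 per lane, reduced to one bit per lane.
                    let hit = vandq_u8(vtstq_u8(tag, vdupq_n_u8(*sel)), weights);
                    let m = u64::from(vaddv_u8(vget_low_u8(hit)))
                        | (u64::from(vaddv_u8(vget_high_u8(hit))) << 8);
                    masks[slot] |= m << (16 * k);
                }
            }
            out.push(masks);
        }
    }
    out
}

/// Scalar fallback with the same result, for non-aarch64 hosts.
#[cfg(not(target_arch = "aarch64"))]
fn scan_masks(buf: &[u8]) -> Vec<[u64; 3]> {
    buf.chunks_exact(64)
        .map(|chunk| {
            let mut masks = [0u64; 3];
            for (i, &c) in chunk.iter().enumerate() {
                let tag = HI_LUT[(c >> 4) as usize] & LO_LUT[(c & 0x0F) as usize];
                for (slot, sel) in SELECTORS.iter().enumerate() {
                    masks[slot] |= u64::from((tag & *sel) != 0) << i;
                }
            }
            masks
        })
        .collect()
}

fn main() {
    let mut buf = vec![b' '; 64];
    buf[..5].copy_from_slice(br#"{"k":"#);
    println!("{:x?}", scan_masks(&buf));
}
```

Whether the compiler actually keeps the three table registers live across the loop in the real code still needs to be confirmed from the generated assembly, as noted above.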

## References

- Related: #8 (AVX2 shuffle-structural — same idea, different ISA)
- Related: #25 (validate_brackets fusion — orthogonal; can stack on top)
- Control gate: scalar scanner — unaffected

