Context
The AVX2 scanner's string-skip fast path (added in PR #3, src/scan/avx2.rs:35-43) detects when an entire 64-byte chunk lies inside a string and skips the structural-mask computation. The per-chunk work is still significant:
- 2 × loadu (free)
- backslash mask: ~6 ops
- quote mask: ~6 ops
- find_escape_mask_with_carry: ~10 scalar ALU ops + branches
- fast-path branch
In total that is ~25 ops per "skip" chunk. For the multimodal bench's 10 MB scenario (~95% of chunks inside the giant base64 strings), this is the dominant cost.
Proposal
Replace the SIMD-mask path with a memchr-style search for the next interesting byte (" or \) while in_string holds:
- 1 SIMD load + 1 cmpeq against either-of-two-bytes + 1 movemask + 1 test-and-skip per chunk
- glibc's memchr peaks at ~30 GB/s; our base64 payload has no quotes/backslashes mid-string so the search bails per chunk with 0 hits
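A minimal sketch of what such a search loop might look like, not the scanner's actual code. The names next_special, next_special_scalar and next_special_avx2 are hypothetical. The sketch works on 32-byte AVX2 vectors, so a 64-byte chunk would take two iterations, and it spells out "either-of-two-bytes" as two compares plus an OR. Escape-carry handling after a hit is deliberately left out.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: index of the first `"` or `\` at or after `start`.
fn next_special_scalar(buf: &[u8], start: usize) -> Option<usize> {
    buf.get(start..)?
        .iter()
        .position(|&b| b == b'"' || b == b'\\')
        .map(|i| start + i)
}

/// AVX2 variant (hypothetical helper): per 32-byte block, one unaligned load,
/// two byte compares OR-ed together, one movemask, one test-and-skip.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn next_special_avx2(buf: &[u8], start: usize) -> Option<usize> {
    let quote = _mm256_set1_epi8(b'"' as i8);
    let backslash = _mm256_set1_epi8(b'\\' as i8);
    let mut i = start;
    while i + 32 <= buf.len() {
        // In bounds: the loop condition guarantees 32 readable bytes at `i`.
        let v = _mm256_loadu_si256(buf.as_ptr().add(i) as *const __m256i);
        let hits = _mm256_or_si256(
            _mm256_cmpeq_epi8(v, quote),
            _mm256_cmpeq_epi8(v, backslash),
        );
        let mask = _mm256_movemask_epi8(hits) as u32;
        if mask != 0 {
            // The lowest set bit corresponds to the first matching byte.
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32; // no quote or backslash in this block: skip it
    }
    // Fewer than 32 bytes remain: finish with the scalar loop.
    next_special_scalar(buf, i)
}

/// Dispatcher: uses AVX2 when the CPU reports it, the scalar loop otherwise.
fn next_special(buf: &[u8], start: usize) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // AVX2 availability was checked at runtime just above.
            return unsafe { next_special_avx2(buf, start) };
        }
    }
    next_special_scalar(buf, start)
}
```

While in_string holds, the scanner would call something like next_special(buf, pos) and jump straight to the returned index, falling back to the existing mask path only from there. Whether this reaches memchr-like throughput is an assumption that the benchmarks would have to confirm.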
Estimated impact (op-count analysis, not measured)
| size | est. speedup |
|---|---|
| 100 KB – 1 MB | ~1.5–2× |
| 5 MB – 10 MB | ~3× |
The 10 MB scan would drop from ~2.9 ms to ~1 ms per iter. The lower bound could be ~1.5× if cache or front-end effects dominate.
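To make the two bounds concrete, the arithmetic can be sketched as below. The helper projected_ms is hypothetical, and the inputs are the estimates quoted above, not measurements.

```rust
/// Projected per-iteration time for a given baseline and speedup factor.
/// Hypothetical helper for illustration only.
fn projected_ms(baseline_ms: f64, speedup: f64) -> f64 {
    baseline_ms / speedup
}
```

With the ~2.9 ms baseline, projected_ms(2.9, 3.0) gives about 0.97 ms (the ~1 ms figure) and projected_ms(2.9, 1.5) gives about 1.93 ms, which is what the 10 MB scan would cost if only the lower bound holds.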
Validation plan
- scanner_crosscheck proptest (scalar/AVX2 parity)
- make bench3-run median before/after, posted in PR
Related