perf(scan): cheaper in-string fast probe in AVX2 scanner #12

Merged
membphis merged 2 commits into main from worktree-avx2-memchr-string-skip
May 15, 2026

Conversation

@membphis
Collaborator

Closes #5.

Summary

  • Replace the in_string fast-path condition (real_quote == 0, requires computing backslash + escape masks first) with an early probe that checks for " or \ directly. When neither byte appears in a chunk, skip the entire structural-mask + PCLMUL prefix-XOR path: bs_carry is necessarily 0 (no backslashes → no trailing run) and in_string stays 1.
  • Per-chunk cost on pure string-interior chunks drops from ~25 vector + scalar ops to ~10 vector ops (4 cmpeq + 2 or + 2 movemask + shift/or), removing find_escape_mask_with_carry from the hot path. Such chunks dominate string-heavy payloads (multimodal chat-completion responses, base64 image parts, long log lines).
  • New probe condition (quote | backslash) == 0 is a strict subset of the prior fast path real_quote == 0, so any chunk that would have hit the prior path either hits the new one or falls back to the unchanged slow path with identical output. Correctness argument written up in docs/superpowers/specs/2026-05-15-avx2-memchr-string-skip-design.md.
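For readers who want to see the op count concretely, here is a minimal sketch of what such a probe could look like. It is not the code from this PR: the function names, the 64-byte chunk size, and the runtime-detection wrapper are all assumptions made for illustration. The AVX2 body uses exactly the ops listed above (4 cmpeq, 2 or, 2 movemask, one shift/or).

```rust
// Hypothetical sketch; names and the 64-byte chunk size are assumptions,
// not taken from the PR's source.

/// Scalar reference: bit i is set when chunk[i] is `"` or `\`.
fn probe_scalar(chunk: &[u8; 64]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'"' || b == b'\\' {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// AVX2 version: two 32-byte loads cover the 64-byte chunk.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn probe_avx2(chunk: &[u8; 64]) -> u64 {
    use std::arch::x86_64::*;
    let quote = _mm256_set1_epi8(b'"' as i8);
    let bslash = _mm256_set1_epi8(b'\\' as i8);
    let lo = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
    let hi = _mm256_loadu_si256(chunk.as_ptr().add(32) as *const __m256i);
    // 4 cmpeq + 2 or
    let lo_hit = _mm256_or_si256(_mm256_cmpeq_epi8(lo, quote), _mm256_cmpeq_epi8(lo, bslash));
    let hi_hit = _mm256_or_si256(_mm256_cmpeq_epi8(hi, quote), _mm256_cmpeq_epi8(hi, bslash));
    // 2 movemask + shift/or
    let lo_bits = _mm256_movemask_epi8(lo_hit) as u32 as u64;
    let hi_bits = _mm256_movemask_epi8(hi_hit) as u32 as u64;
    lo_bits | (hi_bits << 32)
}

/// Uses AVX2 when the CPU supports it, otherwise the scalar reference.
fn probe(chunk: &[u8; 64]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { probe_avx2(chunk) };
        }
    }
    probe_scalar(chunk)
}
```

In this sketch, a result of 0 while in_string is set means the chunk can be skipped; any nonzero result sends the chunk to the unchanged slow path.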

Test plan

  • cargo test --release — 72 tests pass, includes 2 new regression tests and an upsized 1 MB long_string_engages_skip_fastpath
  • cargo test --release --no-default-features (scalar-only gate)
  • cargo test --features test-panic --release (FFI panic-barrier gate)
  • cargo clippy --release --all-targets -- -D warnings — clean
  • scanner_crosscheck proptest (2000 cases, scalar vs AVX2 parity)
  • Lua busted suite — runs in CI; not exercised locally (no luajit/busted on dev host)
  • make bench median-of-three before/after on the 10 MB synthetic multimodal scenario (results to be posted in a follow-up comment)

New / changed tests

| test | new / modified | purpose |
| --- | --- | --- |
| long_string_engages_skip_fastpath | modified | bump from 10 KB to 1 MB so thousands of consecutive probe-hit chunks exercise the new path |
| long_string_with_periodic_backslash | new | alternates probe-hit chunks and slow-path chunks via injected `\n` / `\"` sequences |
| bs_carry_one_at_pure_string_chunk_boundary | new | construct prior chunk ending in odd-length backslash run (bs_carry=1) followed by pure-interior chunks; asserts the probe correctly resets bs_carry to 0 |
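The bs_carry boundary case can be illustrated with a small scalar model. This is a sketch under assumptions, not the scanner's implementation: `State`, `step_slow`, and `step_probe` are hypothetical names, and bs_carry is modelled as "the first byte of the next chunk is escaped". The point is that a pure-interior chunk entered with bs_carry=1 consumes the escape on its first byte and leaves with bs_carry=0, which is what the skip hard-codes.

```rust
// Hypothetical scalar model of the per-chunk state; not the PR's code.
#[derive(Clone, Copy, Debug, PartialEq)]
struct State {
    in_string: bool,
    bs_carry: bool, // previous chunk ended in an odd-length backslash run
}

/// Byte-by-byte reference: tracks escapes and unescaped quotes.
fn step_slow(state: State, chunk: &[u8]) -> State {
    let mut in_string = state.in_string;
    let mut escaped = state.bs_carry;
    for &b in chunk {
        if escaped {
            // This byte is consumed by the preceding backslash.
            escaped = false;
            continue;
        }
        match b {
            b'\\' => escaped = true,
            b'"' => in_string = !in_string,
            _ => {}
        }
    }
    State { in_string, bs_carry: escaped }
}

/// Probe-first variant: skip when inside a string and the chunk holds
/// neither `"` nor `\`; otherwise defer to the reference path.
fn step_probe(state: State, chunk: &[u8]) -> State {
    let has_special = chunk.iter().any(|&b| b == b'"' || b == b'\\');
    if state.in_string && !has_special {
        return State { in_string: true, bs_carry: false };
    }
    step_slow(state, chunk)
}
```

Feeding both functions a 64-byte run of plain bytes with `bs_carry: true` on entry yields `bs_carry: false` on exit in each case, mirroring what bs_carry_one_at_pure_string_chunk_boundary asserts.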

Deferred

README Roadmap / Deferred gains one bullet for the next step (cross-chunk memchr2 jump for very long single strings) that this PR explicitly does not pursue. The chunk-granularity probe is a strict subset and won't conflict with that future work.
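As a rough idea of what the deferred cross-chunk jump might involve, the sketch below searches the rest of the buffer for the next `"` or `\` and rounds the hit down to a chunk boundary so a chunked scanner could resume there. Everything here is an assumption for illustration: the helper names, the 64-byte chunk size, and the requirement that `from` is chunk-aligned with bs_carry already 0. It uses a std-only search; a real version would presumably use `memchr2` from the `memchr` crate.

```rust
// Hypothetical sketch of the deferred idea; not part of this PR.
const CHUNK: usize = 64; // assumed chunk size

/// Index of the next `"` or `\` at or after `from`, if any.
fn next_quote_or_backslash(buf: &[u8], from: usize) -> Option<usize> {
    buf.get(from..)?
        .iter()
        .position(|&b| b == b'"' || b == b'\\')
        .map(|off| from + off)
}

/// Offset where chunked scanning should resume. Assumes `from` is
/// chunk-aligned, so the rounded-down hit never precedes `from`.
fn resume_chunk_start(buf: &[u8], from: usize) -> usize {
    match next_quote_or_backslash(buf, from) {
        Some(hit) => hit / CHUNK * CHUNK,
        None => buf.len(), // the rest of the buffer is string interior
    }
}
```

The chunk-granularity probe in this PR remains useful either way, since short strings would not benefit from a whole-buffer search.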

membphis added 2 commits May 15, 2026 17:25
Design for replacing the current in_string fast-path condition
(real_quote == 0, requires computing backslash+escape masks) with
a cheaper quote-or-backslash probe (~10 vector ops, no scalar ALU,
no branches). Targets the ~25 ops/chunk cost on string-heavy
payloads such as multimodal 10 MB base64-style values.

Includes correctness proofs for bs_carry / in_string preservation
on skipped chunks, op-count comparison, test matrix, and a
synthetic bench fixture (10 MB, run-time generated, not committed).

Cross-chunk memchr2 jumps are deferred and tracked in
README Roadmap / Deferred.
Replace the in_string fast-path condition (`real_quote == 0`,
requires computing backslash + escape masks first) with an early
probe that checks for `"` or `\` directly. When the probe finds
neither in a chunk, skip the entire structural-mask + PCLMUL
prefix-XOR path; bs_carry is necessarily 0 leaving the chunk (no
backslashes → no trailing run) and in_string stays 1.

Per-chunk cost on pure string-interior chunks drops from ~25
vector + scalar ops to ~10 vector ops (4 cmpeq + 2 or + 2 movemask
+ shift/or), removing find_escape_mask_with_carry from the hot
path. Such chunks dominate string-heavy payloads (multimodal
chat-completion responses, base64 image parts, long log lines).

The new probe condition (`(quote | backslash) == 0`) is a strict
subset of the prior fast path (`real_quote == 0`), so any chunk
that would have hit the prior path either hits the new one or
falls back to the unchanged slow path with identical output.
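
The subset relation can be checked with a scalar stand-in for the masks.
This is a sketch under assumptions: the helper names are hypothetical and
`escaped_mask` is a simple loop, not find_escape_mask_with_carry. Since
real_quote is the quote mask with escaped positions cleared, a zero
quote-or-backslash mask forces real_quote to zero, while a chunk holding
only an escaped quote shows the inclusion is strict.

```rust
// Hypothetical scalar stand-ins; not the scanner's actual mask code.
const CHUNK: usize = 64; // assumed chunk size

/// Bit i is set when chunk[i] == needle.
fn byte_mask(chunk: &[u8; CHUNK], needle: u8) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == needle {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// Bit i is set when chunk[i] is escaped by a preceding backslash
/// (or by a carry from the previous chunk).
fn escaped_mask(chunk: &[u8; CHUNK], bs_carry: bool) -> u64 {
    let mut mask = 0u64;
    let mut escaped = bs_carry;
    for (i, &b) in chunk.iter().enumerate() {
        if escaped {
            mask |= 1u64 << i;
            escaped = false;
        } else if b == b'\\' {
            escaped = true;
        }
    }
    mask
}

/// Prior fast-path input: quotes that are not escaped.
fn real_quote_mask(chunk: &[u8; CHUNK], bs_carry: bool) -> u64 {
    byte_mask(chunk, b'"') & !escaped_mask(chunk, bs_carry)
}

/// New probe input: any quote or backslash byte.
fn probe_mask(chunk: &[u8; CHUNK]) -> u64 {
    byte_mask(chunk, b'"') | byte_mask(chunk, b'\\')
}
```

A chunk containing only `\"` has real_quote_mask == 0 but probe_mask != 0,
so it took the prior fast path and now takes the slow path instead.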

Closes #5.

Test updates:
- long_string_engages_skip_fastpath: 10 KB → 1 MB so thousands of
  consecutive probe-hit chunks exercise the new path
- long_string_with_periodic_backslash (new): alternates probe-hit
  and slow path via injected `\n` / `\"` escape sequences
- bs_carry_one_at_pure_string_chunk_boundary (new): verifies the
  probe correctly resets bs_carry when entering a pure-interior
  chunk with bs_carry=1 from a prior chunk ending in an odd-length
  backslash run

README Roadmap / Deferred gains a single bullet tracking the
future memchr2 cross-chunk jump that this commit explicitly does
not pursue.
@membphis membphis merged commit 7a895e5 into main May 15, 2026
1 check passed
@membphis membphis deleted the worktree-avx2-memchr-string-skip branch May 15, 2026 22:23

Development

Successfully merging this pull request may close these issues.

perf(scan): memchr-based fast path for in-string content

1 participant