perf(scan): cheaper in-string fast probe in AVX2 scanner #12
Merged
Design for replacing the current in_string fast-path condition (real_quote == 0, requires computing backslash+escape masks) with a cheaper quote-or-backslash probe (~10 vector ops, no scalar ALU, no branches). Targets the ~25 ops/chunk cost on string-heavy payloads such as multimodal 10 MB base64-style values. Includes correctness proofs for bs_carry / in_string preservation on skipped chunks, op-count comparison, test matrix, and a synthetic bench fixture (10 MB, run-time generated, not committed). Cross-chunk memchr2 jumps are deferred and tracked in README Roadmap / Deferred.
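The probe described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the 64-byte chunk width (two 32-byte AVX2 loads) is an assumption inferred from the "2 movemask" count, and the names `CHUNK`, `probe_mask_scalar`, `probe_mask_avx2`, and `probe_mask` are hypothetical. A scalar fallback is included so the sketch also runs on machines without AVX2.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Assumed chunk width: 64 bytes, handled as two 32-byte AVX2 vectors.
const CHUNK: usize = 64;

/// Scalar reference (hypothetical helper): bit i is set when byte i is `"` or `\`.
fn probe_mask_scalar(chunk: &[u8; CHUNK]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'"' || b == b'\\' {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// AVX2 sketch (hypothetical helper): 4 cmpeq + 2 or + 2 movemask + shift/or.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn probe_mask_avx2(chunk: &[u8; CHUNK]) -> u64 {
    // SAFETY: the caller guarantees AVX2 is available; both unaligned loads
    // stay inside the 64-byte array.
    unsafe {
        let quote = _mm256_set1_epi8(b'"' as i8);
        let backslash = _mm256_set1_epi8(b'\\' as i8);
        let lo = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        let hi = _mm256_loadu_si256(chunk.as_ptr().add(32) as *const __m256i);
        let hit_lo = _mm256_or_si256(
            _mm256_cmpeq_epi8(lo, quote),
            _mm256_cmpeq_epi8(lo, backslash),
        );
        let hit_hi = _mm256_or_si256(
            _mm256_cmpeq_epi8(hi, quote),
            _mm256_cmpeq_epi8(hi, backslash),
        );
        // Go through u32 first so the i32 result is not sign-extended.
        let m_lo = _mm256_movemask_epi8(hit_lo) as u32 as u64;
        let m_hi = _mm256_movemask_epi8(hit_hi) as u32 as u64;
        m_lo | (m_hi << 32)
    }
}

/// Dispatches to AVX2 when the CPU supports it, otherwise to the scalar reference.
fn probe_mask(chunk: &[u8; CHUNK]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at run time just above.
            return unsafe { probe_mask_avx2(chunk) };
        }
    }
    probe_mask_scalar(chunk)
}
```

A chunk qualifies for the skip when `probe_mask(&chunk) == 0` while the scanner is inside a string. In a real hot loop the run-time feature check would presumably be hoisted out of the per-chunk call.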
Replace the in_string fast-path condition (`real_quote == 0`, which requires computing the backslash + escape masks first) with an early probe that checks for `"` or `\` directly. When the probe finds neither in a chunk, skip the entire structural-mask + PCLMUL prefix-XOR path; bs_carry is necessarily 0 leaving the chunk (no backslashes → no trailing run) and in_string stays 1.

Per-chunk cost on pure string-interior chunks drops from ~25 vector + scalar ops to ~10 vector ops (4 cmpeq + 2 or + 2 movemask + shift/or), removing find_escape_mask_with_carry from the hot path. This is the dominant cost on string-heavy payloads (multimodal chat-completion responses, base64 image parts, long log lines).

The new probe condition (`(quote | backslash) == 0`) is a strict subset of the prior fast path (`real_quote == 0`), so any chunk that would have hit the prior path either hits the new one or falls back to the unchanged slow path with identical output.

Closes #5.

Test updates:
- long_string_engages_skip_fastpath: 10 KB → 1 MB so thousands of consecutive probe-hit chunks exercise the new path
- long_string_with_periodic_backslash (new): alternates probe-hit and slow path via injected `\n` / `\"` escape sequences
- bs_carry_one_at_pure_string_chunk_boundary (new): verifies the probe correctly resets bs_carry when entering a pure-interior chunk with bs_carry=1 from a prior chunk ending in an odd-length backslash run

README Roadmap / Deferred gains a single bullet tracking the future memchr2 cross-chunk jump that this commit explicitly does not pursue.
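The state-preservation argument (in_string stays 1, bs_carry leaves the chunk as 0) can be checked against a simplified scalar model. The sketch below is an illustration under assumptions, not the scanner's actual code: `ScanState`, `slow_path`, and `with_probe` are hypothetical names, chunks are assumed to be 64 bytes, and the model treats every unescaped backslash as escaping the next byte.

```rust
/// Hypothetical cross-chunk state carried by the scanner.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ScanState {
    /// True when the chunk boundary falls inside a JSON string.
    in_string: bool,
    /// True when the previous chunk ended in an odd-length backslash run,
    /// so the first byte of this chunk is escaped.
    bs_carry: bool,
}

/// Byte-at-a-time reference for what the full mask path computes.
fn slow_path(state: ScanState, chunk: &[u8; 64]) -> ScanState {
    let mut in_string = state.in_string;
    let mut escaped = state.bs_carry;
    for &b in chunk.iter() {
        if escaped {
            // This byte is consumed by the preceding backslash.
            escaped = false;
        } else if b == b'\\' {
            escaped = true;
        } else if b == b'"' {
            in_string = !in_string;
        }
    }
    ScanState { in_string, bs_carry: escaped }
}

/// Same transition, but with the quote-or-backslash probe in front.
fn with_probe(state: ScanState, chunk: &[u8; 64]) -> ScanState {
    let probe_hit = !chunk.iter().any(|&b| b == b'"' || b == b'\\');
    if state.in_string && probe_hit {
        // No quote: in_string cannot flip. No backslash: no trailing run,
        // so bs_carry is 0 on exit even if it was 1 on entry.
        return ScanState { in_string: true, bs_carry: false };
    }
    slow_path(state, chunk)
}
```

Entering a pure-interior chunk with `bs_carry` set is the case the new boundary test targets: the escaped first byte is neither `"` nor `\`, so both paths leave the chunk with `in_string` set and `bs_carry` cleared.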
Closes #5.
Summary
- Replace the in_string fast-path condition (`real_quote == 0`, which requires computing the backslash + escape masks first) with an early probe that checks for `"` or `\` directly. When neither byte appears in a chunk, skip the entire structural-mask + PCLMUL prefix-XOR path: `bs_carry` is necessarily 0 (no backslashes → no trailing run) and `in_string` stays 1.
- Per-chunk cost on pure string-interior chunks drops from ~25 vector + scalar ops to ~10 vector ops, removing `find_escape_mask_with_carry` from the hot path. This is the dominant cost on string-heavy payloads (multimodal chat-completion responses, base64 image parts, long log lines).
- The new probe condition `(quote | backslash) == 0` is a strict subset of the prior fast path `real_quote == 0`, so any chunk that would have hit the prior path either hits the new one or falls back to the unchanged slow path with identical output. The correctness argument is written up in `docs/superpowers/specs/2026-05-15-avx2-memchr-string-skip-design.md`.

Test plan
- `cargo test --release`: 72 tests pass, including 2 new regression tests and an upsized 1 MB `long_string_engages_skip_fastpath`
- `cargo test --release --no-default-features` (scalar-only gate)
- `cargo test --features test-panic --release` (FFI panic-barrier gate)
- `cargo clippy --release --all-targets -- -D warnings`: clean
- `scanner_crosscheck` proptest (2000 cases, scalar vs AVX2 parity)
- `make bench` median-of-three before/after on the `10m` synthetic multimodal scenario: to be posted in a follow-up comment

New / changed tests
- `long_string_engages_skip_fastpath`: upsized from 10 KB to 1 MB so thousands of consecutive probe-hit chunks exercise the new path
- `long_string_with_periodic_backslash` (new): alternates probe-hit and slow path via injected `\n` / `\"` sequences
- `bs_carry_one_at_pure_string_chunk_boundary` (new): verifies the probe correctly resets `bs_carry` when entering a pure-interior chunk with `bs_carry` = 1 from a prior chunk ending in an odd-length backslash run

Deferred
README Roadmap / Deferred gains one bullet for the next step (cross-chunk `memchr2` jump for very long single strings) that this PR explicitly does not pursue. The chunk-granularity probe is a strict subset and won't conflict with that future work.