Skip to content

perf(scan): literal prefilter + lazy line counting in readAndGrep#804

Merged
BYK merged 8 commits intomainfrom
byk/perf/grep-literal-prefilter
Apr 21, 2026
Merged

perf(scan): literal prefilter + lazy line counting in readAndGrep#804
BYK merged 8 commits intomainfrom
byk/perf/grep-literal-prefilter

Conversation

@BYK
Copy link
Copy Markdown
Member

@BYK BYK commented Apr 21, 2026

Summary

Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS collectGrep/grepFiles. Follow-up to PR #791 and #797.

  • Literal prefilter — ripgrep-style: extract a literal substring from the regex source (e.g., import from import.*from), scan the buffer with indexOf to locate candidate lines, only invoke the regex engine on lines that contain the literal. V8's indexOf is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from.
  • Lazy line counting — swapped charCodeAt-walk for indexOf("\n", cursor) hops. 2-5× faster on the line-counting sub-loop because V8 implements indexOf in C++ without per-iteration JS interop.

Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

Op Before After Δ
scan.grepFiles (DSN pattern) 370 ms 318 ms −14%
detectAllDsns.cold 363 ms 313 ms −14%
detectDsn.cold 7.73 ms 5.61 ms −27%
scanCodeForFirstDsn 2.91 ms 2.13 ms −27%
scanCodeForDsns 342 ms 333 ms −3% (noise-equivalent)
import.*from uncapped (bench) 1489 ms 1178 ms −21%

The DSN workloads improve because DSN_PATTERN extracts http as its literal — most source files don't contain http at all, so the prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g., SENTRY_DSN, NONEXISTENT_TOKEN_XYZ) continue through the whole-buffer path unchanged.

What changed

New file: src/lib/scan/literal-extract.ts (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes.

Handles escaped metacharacters intelligently: Sentry\.init yields Sentry.init (extracted via literal \..), while \bfoo\b yields foo (escape \b is an anchor, not a literal b).

Exports:

  • extractInnerLiteral(source, flags) — returns the literal, or null if no safe extraction possible. Honors /i by lowercasing.
  • isPureLiteral(source, flags) — true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there).

Modified: src/lib/scan/grep.ts (~240 LOC changes)

Three-way dispatch in readAndGrep based on the extracted literal:

  1. grepByLiteralPrefilter (new) — regex with extractable literal + multiline: true. Uses indexOf(literal) to find candidate lines, runs the regex engine only on those. This is the main perf win.
  2. grepByWholeBuffer — existing path, used for:
    • Pure-literal patterns (V8 handles them optimally)
    • Patterns with no extractable literal (complex regex, top-level alternation)
    • multiline: false mode (the fast path requires per-line semantics)

Also: replaced the charCodeAt-walk that counted newlines char-by-char with an indexOf("\n", cursor) hop loop. Extracted buildMatch(ctx, bounds) as a shared helper to bundle the match-construction arguments.

Tests added

  • test/lib/scan/literal-extract.test.ts39 tests covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length).
  • test/lib/scan/grep.test.ts7 new tests for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing.

Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its String.prototype.indexOf is compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were evaluated:

  • Character classes over .*DSN_PATTERN already uses [a-z0-9]+, no change needed.
  • Alternation orderDSN_PATTERN's (?:\.[a-z]+|:[0-9]+) is already correctly ordered (. more common than : in DSN hosts); swapping regressed perf by noise.
  • Anchors/word boundaries — adding \b to DSN_PATTERN regressed perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead.

The remaining gap with rg is now primarily orchestration overhead (async/await, mapFilesConcurrent, walker correctness features) rather than regex speed. A worker-pool exploration may follow.

Test plan

  • bunx tsc --noEmit — clean
  • bun run lint — clean (1 pre-existing warning in src/lib/formatters/markdown.ts unrelated to this PR)
  • bun test --timeout 15000 test/lib test/commands test/types5610 pass, 0 fail (+58 new)
  • bun test test/isolated — 138 pass, 0 fail
  • bun run bench --size large --runs 5 — all scan ops at or below previous baseline
  • Manually verified semantic parity: collectGrep returns identical GrepMatch[] on prefilter vs whole-buffer paths for patterns where the prefilter fires

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com

Two regex-level optimizations to narrow the perf gap with ripgrep on
`collectGrep`/`grepFiles`. Biggest win is a ripgrep-style literal
prefilter: extract a literal substring from the regex source (e.g.,
`import` from `import.*from`), scan the buffer with `indexOf` to
locate candidate lines, and only invoke the regex engine on lines
that contain the literal. On lines without the literal, `indexOf`
is ~50× faster than V8's regex engine for the same rejection work.

## What changed

### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for
the longest contiguous run of literal bytes that every match must
contain. Bails out safely on top-level alternation, character
classes, groups, lookarounds, quantifiers, and escape classes.

Handles escaped metacharacters intelligently: `Sentry\.init` yields
`Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b`
yields `foo` (escape `\b` is an anchor, not a literal `b`).

Exports:
- `extractInnerLiteral(source, flags)` — returns the literal, or
  null if no safe extraction possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)` — true when the pattern IS a bare
  literal with no metacharacters. Used by the grep pipeline to
  route pure-literals to the whole-buffer path (V8's regex engine
  is hyper-optimized for pure-literal patterns; the prefilter adds
  overhead without benefit there).

### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)

Three-way dispatch in `readAndGrep` based on the extracted literal:

1. **`grepByLiteralPrefilter`** (new) — regex with extractable
   literal + `multiline: true`. Uses `indexOf(literal)` to find
   candidate lines, runs the regex engine only on those lines.
   This is the main perf win.

2. **`grepByWholeBuffer`** — existing path, used for:
   - Pure-literal patterns (V8 handles them optimally)
   - Patterns with no extractable literal (complex regex, top-level
     alternation)
   - `multiline: false` mode (fast path requires per-line semantics)

Also: lazy line counting. Replaced the `charCodeAt`-walk that
counted newlines char-by-char from cursor to match index with an
`indexOf("\n", cursor)` hop loop. V8 implements `indexOf` in
optimized C++ without per-iteration JS interop, making this 2-5×
faster on the line-counting sub-loop.

Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle
the match-construction arguments into a single struct — keeps
Biome's `useMaxParams` rule satisfied across three call sites
with a clean API.

### Tests added

- `test/lib/scan/literal-extract.test.ts` — 39 tests covering the
  extractor's rules (escape handling, quantifier drop, alternation
  bail, case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts` — 7 new tests for the prefilter fast
  path: correctness vs whole-buffer, escaped-literal extraction,
  case-insensitive flag, zero-literal-hit short-circuit, routing
  of pure literals to whole-buffer, and alternation routing.

## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | −14% |
| `scanCodeForDsns` | 342 ms | **333 ms** | −3% (noise-equivalent) |
| `detectAllDsns.cold` | 363 ms | **313 ms** | −14% |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | −27% |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | −27% |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | −21% |

The DSN workloads improve because `DSN_PATTERN` extracts `http` as
its literal — most source files don't contain `http` at all, so
the prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g.,
`SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-
buffer path unchanged.

## What this PR is NOT

- Not a structural parallelization change (workers, byte-level
  scanning, chunked reads). Those may follow separately.
- Not a regex-engine replacement. V8's Irregexp is what it is.
- Not feature-level changes to `GrepOptions`/`GrepResult`. Public
  API preserved.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated to
      this PR in `src/lib/formatters/markdown.ts`)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5610 pass, 0 fail** (+58 new: 39 literal-extract + 7 grep
      prefilter + 7 existing grep tests re-verified)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 5` — all scan ops at or
      below previous baseline; no regressions
- [x] Manually verified semantic parity: `collectGrep` returns
      identical `GrepMatch[]` on prefilter vs whole-buffer paths
      for patterns where the prefilter fires

## Background

This PR follows the ripgrep research from PR #791: rg's central
perf trick is extracting a literal from each regex and prefiltering
with SIMD memchr. V8 doesn't expose SIMD directly but its
`String.prototype.indexOf` is compiled to a tight byte-level loop
with internal SIMD on x64 — functionally equivalent for our use
case. The remaining gap with rg is now primarily orchestration
overhead (async/await, `mapFilesConcurrent`, walker correctness
features) rather than regex speed. A worker-pool exploration may
follow.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://cli.sentry.dev/_preview/pr-804/

Built to branch gh-pages at 2026-04-21 18:25 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Codecov Results 📊

138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric Change
Total Tests
Passed Tests
Failed Tests
Skipped Tests

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 95.85%. Project has 1768 uncovered lines.
❌ Project coverage is 95.63%. Comparing base (base) to head (head).

Files with missing lines (1)
File Patch % Lines
src/lib/scan/literal-extract.ts 94.95% ⚠️ 10 Missing
Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.65%    95.63%    -0.02%
==========================================
  Files          280       281        +1
  Lines        40228     40444      +216
  Branches         0         0         —
==========================================
+ Hits         38477     38676      +199
- Misses        1751      1768       +17
- Partials         0         0         —

Generated by Codecov Action

Comment thread src/lib/scan/literal-extract.ts
Cursor Bugbot found a silent correctness bug in the literal-prefix
extractor: `[(]foo|bar` was extracting `"foo"` as a required literal,
causing `collectGrep`'s literal prefilter to silently miss lines
matching only the `bar` branch.

## Root cause

`hasTopLevelAlternation` tracked paren/bracket depth for `(`, `)`,
`[`, `]` in a single pass but didn't skip over the interior of a
character class. Inside `[...]`, the characters `(` and `)` are
LITERAL bytes, not grouping metacharacters — they shouldn't change
`depthParen`. But the old code incremented `depthParen` on `(` even
when it was inside `[...]`. Once the class closed, `depthParen` was
stuck at 1, so the subsequent top-level `|` was ignored, the
detector returned `false`, and the extractor proceeded to extract
`"foo"` from after the class.

Result: the prefilter scanned for `foo` and silently dropped any
line that matched only via the `bar` branch of the alternation.

## Fix

Character classes are opaque to the alternation detector. When `[`
is seen, skip past the first unescaped `]` in one step. Also
recognize that `[...]` is NOT nestable — `[[]` is a class containing
a literal `[`, not a nested class (regex spec). So `skipToMatching`
(which is nestable) is wrong for classes.

New helper `skipCharacterClass` handles the opaque-skip correctly.
Both `extractInnerLiteral` (line 174) and `hasTopLevelAlternation`
(line 273) now use it for `[...]`. `skipToMatching` is preserved for
`(...)` groups (which ARE nestable).

## Verification

Before:
  extractInnerLiteral("[(]foo|bar", "") → "foo"   (WRONG)
  collectGrep({pattern: "[(]foo|bar"}) → misses "bar" lines

After:
  extractInnerLiteral("[(]foo|bar", "") → null    (correct — alternation detected)
  collectGrep → falls through to whole-buffer path, finds all matches

## Tests

- `literal-extract.test.ts`: 10 new tests under "character-class
  opaqueness (Cursor bug #1)" — 5 patterns that previously silently
  extracted a wrong literal, 5 control patterns that should still
  extract correctly (char classes without alternation).
- `grep.test.ts`: 1 new end-to-end regression test verifying that
  `collectGrep` on `[(]foo|bar` across three files (only-bar,
  only-paren, both) returns all expected matches.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5621 pass, 0 fail** (+11 new regression tests)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --op scan.grepFiles` — 329ms p50
      (noise-equivalent to previous 318ms; no perf regression)
Comment thread src/lib/scan/literal-extract.ts
Cursor Bugbot found the sibling of yesterday's char-class bug
(#612b8b0). The fix for `hasTopLevelAlternation` used the new
`skipCharacterClass` helper, but the group-skipper (formerly
`skipToMatching`, now renamed `skipGroup`) was left untouched. Same
underlying issue: a `)` inside `[...]` was closing the enclosing
group early.

## Root cause

`extractInnerLiteral` encounters `(` and calls the group-skipper to
advance past the matching `)`. The skipper tracked paren depth and
stopped at the first `)` where depth hit zero. But `[...]` char
classes contain literal bytes — a `)` inside `[)]` is NOT a grouping
metacharacter. The skipper didn't treat classes as opaque, so it
exited the group at the wrong paren.

Concrete: `(ABC[)]DEF)?GHI`
  - skipper enters at `(`, tracks depth=1
  - sees `[`, advances past `]` — but `)` inside `[)]` was reached
    first, decremented depth to 0, skipper thinks group ended
  - extractor now considers `DEF` to be at the top level, returns
    it as the required literal
  - prefilter searches for `DEF` and silently misses lines matching
    only the `GHI` branch (the group is optional `?`)

## Fix

`skipGroup` now calls `skipCharacterClass` when it sees `[`, making
classes opaque to group-depth tracking — same mechanism already
used in `hasTopLevelAlternation`.

Also renamed `skipToMatching` → `skipGroup` since it was always
called with `(`/`)` anyway. Clearer intent + no functional change.

## Verification

Before:
  extractInnerLiteral("(ABC[)]DEF)?GHI", "") → "DEF"   (WRONG)
  collectGrep silently drops lines matching only "GHI"

After:
  extractInnerLiteral("(ABC[)]DEF)?GHI", "") → "GHI"   (correct)
  collectGrep finds all matches

## Tests

- `literal-extract.test.ts`: 9 new tests under "class-in-group
  opaqueness (Cursor bug #1 followup)" — 5 bug patterns + 4
  control patterns verifying regular group-skipping still works
  (including nested groups like `((foo))?bar`).
- `grep.test.ts`: 1 new end-to-end regression test for
  `collectGrep` with `(ABC[)]DEF)?GHI` across two files — verifies
  both branches match after the fix.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5631 pass, 0 fail** (+10 from this fix)
- [x] `bun test test/isolated` — 138 pass
- [x] Manual E2E repro from the bug report now returns all matches
Comment thread src/lib/scan/grep.ts Outdated
…ange (Cursor #2)

Cursor Bugbot found another silent correctness bug in
`grepByLiteralPrefilter`. When the caller sets `caseSensitive:
false`, the fast path lowercases the content and searches for the
pre-lowercased literal via `indexOf`. The implementation used `hit`
positions from the lowercased `haystack` as indices into the
original `content` for line-boundary math, assuming the two strings
have the same length.

## Root cause

JavaScript's `String.prototype.toLowerCase()` is specified against
Unicode case-folding rules that can produce a LONGER string. The
classic example is Turkish dotted `İ` (U+0130), which lowercases to
`i` + U+0307 (combining dot above) — net +1 code unit per `İ`.

When content contains such characters:
- `haystack.length !== content.length`
- `hit` positions from `haystack` are offset past the matching
  position in `content`
- `content.lastIndexOf("\n", hit)` / `content.indexOf("\n", hit)`
  return newlines that happen to bracket SOME line (often the
  wrong one)
- `lineEnd + 1` (computed on `content`) fed back to
  `haystack.indexOf(literal, lineEnd + 1)` compounds the
  misalignment and can SKIP PAST real match positions

Concrete repro (would fail under the buggy code, passes now):

  file: `İİİİİİİ line 1 with Sentry.init here`
        `İİİİİİİ line 2 plain`
        `İİİİİİİ line 3 with Sentry.init again`
        `İİİİİİİ line 4 plain`
  pattern: /Sentry\.init/i
  before: only line 1 found — line 3 silently dropped
  after:  both lines 1 and 3 found

## Fix

Detect the length divergence up front and fall back to the whole-
buffer regex path when it happens:

```ts
if (isCaseInsensitive) {
  const lowered = content.toLowerCase();
  if (lowered.length !== content.length) {
    return grepByWholeBuffer(content, entry, opts);
  }
  haystack = lowered;
}
```

The check is one `.length` comparison — nearly free. Real source
code almost never contains the problematic characters, so the fast
path still fires on the overwhelming common case. When it doesn't,
we degrade to the correct-but-slower regex path instead of silently
missing matches.

## Tests

- `grep.test.ts`: new end-to-end regression test with Turkish `İ`
  in the fixture content, verifying both `Sentry.init` matches on
  lines 1 and 3 are found. Before the fix this test would return
  only 1 match.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated;
      added `biome-ignore noExcessiveCognitiveComplexity` to
      `grepByLiteralPrefilter` — complexity grew from 15 to 17
      with the new length-guard branch)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5632 pass, 0 fail** (+1 new regression test)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --op scan.grepFiles` — 332ms p50
      (noise-equivalent to 329ms before; no perf regression on the
      common ASCII-only path)
Comment thread src/lib/scan/grep.ts Outdated
…t (Cursor #3)

Cursor Bugbot found another silent correctness bug. The literal
extractor was running against the raw user-supplied `patternSource`
instead of the compiled `regex.source`. When `compilePattern`
rewrites scoped inline flags like `(?i:foo|bar)baz`, it strips the
group and widens the flag, yielding `foo|barbaz` + the `i` flag at
the top level. The rewritten source has top-level alternation the
original lacked.

## Concrete bug

- User input: `(?i:foo|bar)baz`
- Compiled source (what the regex engine runs): `foo|barbaz`
- Compiled flags: `i`

The old code called `extractInnerLiteral(userInput, flags)`. The
extractor's `hasTopLevelAlternation` didn't see the `|` (it's inside
the `(?i:...)` group from its perspective), so it didn't bail, and
`extractInnerLiteral` returned `"baz"` as a "required" literal.

But the actual compiled regex `/foo|barbaz/i` matches `foo` alone.
Lines with only `foo` got silently dropped by the prefilter because
they don't contain `"baz"`.

Repro confirmed:
```
pattern: (?i:foo|bar)baz
files:
  foo-only.txt: "foo line without that other word"  ← should match via `foo` branch
  barbaz.txt:   "contains barbaz here"

before: only barbaz.txt matched — foo-only.txt silently dropped
after:  both files match
```

## Fix

Always extract from `regex.source` (post-compile). That way the
extractor sees what the regex engine will actually run — including
any structural changes from `compilePattern`'s inline-flag handling.
`regex.flags` is authoritative for case-sensitivity either way.

One-line change plus updated comment explaining the constraint.

## Tests

- `grep.test.ts`: new end-to-end regression test with the
  `(?i:foo|bar)baz` pattern against two files — verifies both
  branches match after the fix. Before the fix this test would
  return only the barbaz.txt match.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5633 pass, 0 fail** (+1 new regression test)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --op scan.grepFiles` — 325ms p50
      (noise-equivalent; no perf regression)
- [x] All 13 literal-prefilter fast path tests still pass unchanged
      (extraction from compiled source gives equivalent results for
      patterns without scoped inline flags, which is the overwhelming
      common case)
Comment thread src/lib/scan/literal-extract.ts
…char escapes (review #4)

A subagent self-review found two more correctness bugs in the
literal prefilter: (A) multi-char escape sequences like \x41, \u0041,
\cA had their tails mistaken for literals, and (B) patterns that can
match across newlines (e.g., foo\sbar) silently missed valid matches
via the per-line verify path.

Rather than patch each failure mode, this commit redesigns the
prefilter as a FILE-LEVEL gate only and fixes escape-sequence
advancement in the extractor. Net result is simpler, correct, and
~5% faster on bench.

## Bug A: multi-char escapes

`extractInnerLiteral` advanced all escape sequences by 2 chars,
leaving the tail chars of longer escapes as "literal" content.

- `\x41foo` → returned "41foo", but compiled regex is /Afoo/.
  Files with "Afoo" got skipped by the gate (no "41foo" present).
- `\u0041foo` → "0041foo" returned, same silent miss.
- `\cAfoo` → "Afoo" returned (wrong — ctrl-A, not "A").
- `\k<name>foo` → "<name>foo" returned (backref syntax leaked as
  literal).
- `\p{L}foo` → "{L}foo" returned (Unicode property syntax leaked).

Fix: new `escapeSequenceLength` helper computes the full source-char
length of all JS regex escape forms:

- `\xHH` → 4 chars
- `\uHHHH` → 6 chars
- `\u{H+}` → variable (until `}`)
- `\cX` → 3 chars
- `\k<name>` → variable (until `>`)
- `\p{...}`, `\P{...}` → variable (until `}`)
- default (single-char like `\d`, `\b`, `\t`, `\n`, numeric backrefs
  like `\1`, escaped metachars `\.`, etc.) → 2 chars

Applied to: the main extractor, `skipCharacterClass`, `skipGroup`,
`hasTopLevelAlternation`. All four call sites now correctly advance
past the whole escape sequence.

## Bug B: cross-newline patterns

The old prefilter did per-line verify: after finding a candidate via
`indexOf(literal)`, extracted the enclosing line and ran
`verifyRegex.test(lineText)`. This broke silently for patterns whose
match spans newlines:

- `foo\sbar` — `\s` includes `\n`, so `foo\nbar` is a match. But
  the per-line verify tests "foo" alone, which fails.
- `function\s+\w+` — multi-line function declarations missed.

This was a REGRESSION vs the pre-PR whole-buffer path, which correctly
found these matches via `regex.exec(content)`.

## Fix: file-level gate only

Redesigned the prefilter:

  Before: literal prefilter = find candidate lines + per-line verify
  After:  literal prefilter = file-level gate (skip entire files
          without the literal); fall through to whole-buffer for
          everything else

The gate still captures most of the perf win — files without the
literal are cheap to reject via indexOf, and in large trees those
dominate. The per-line optimization was the part with subtle
correctness failures, and it's gone.

Also simplified: removed `literalIsPattern` field, `grepByLiteralPrefilter`
function, `stripGlobalFlag` helper, and the Unicode length guard
(no longer needed since we only use indexOf as a boolean). The
Unicode concern from Cursor #2 reduces to: "does `content.toLowerCase()`
still contain `literal` at any position?" — which works correctly
regardless of whether `toLowerCase` changes the string's length.

## Performance

Synthetic/large fixture (10k files):

| Op | Before review #4 | After (this commit) |
|---|---:|---:|
| scan.grepFiles | 332 ms | **325 ms** |
| scanCodeForDsns | 333 ms | **313 ms** |
| detectDsn.cold | 5.61 ms | **4.55 ms** |
| detectAllDsns.cold | 313 ms | 316 ms |

File-level gate is cheaper than per-line verify (one indexOf vs
many plus regex.test per candidate). Net improvement.

## Tests

- `literal-extract.test.ts`: +8 tests for multi-char escape sequences
  (\x41, \u0041, \u{1F600}, \cA, \k<name>, \p{L}, \P{L}, plus
  controls verifying pre-escape literal extraction still wins).
- `grep.test.ts`: +2 end-to-end tests — one for cross-newline
  patterns (`foo\sbar`) and one for multi-char escapes
  (`\x41foo` must find files with `Afoo`, not `41foo`). Both tests
  would fail under the old code.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5644 pass, 0 fail** (+11 new regression tests)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 5` — no regressions; scan
      ops slightly faster due to simpler gate
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2277d0d. Configure here.

Comment thread src/lib/scan/literal-extract.ts Outdated
BYK added 2 commits April 21, 2026 18:07
Cursor Bugbot caught dead code: `isPureLiteral` was exported and
tested but no longer called anywhere in the pipeline after the
previous commit simplified the prefilter to a file-level gate. The
function correctly identified pure-literal patterns but the new
design uses the literal the same way whether the whole pattern is
literal or not (file-level indexOf gate, fall through to whole-
buffer regex).

## Changes

- Deleted `isPureLiteral` export and its 4 unit tests.
- Updated the module-level docstring to reflect that pure-literal
  patterns flow through the same gate-then-regex path as everything
  else.
- Fixed a stale comment in `grep.test.ts` that claimed pure literals
  "skip the prefilter" via `isPureLiteral` — they go through the
  gate like all other patterns, and V8's regex engine handles
  pure-literal patterns efficiently in the whole-buffer path.

Test pass count drops by 4 (from 5644 to 5640) — the removed tests
were all coverage for dead code.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib/scan/` — **228 pass, 0 fail**
- [x] All existing behavior tests still pass (the `pure literal`
      grep test now correctly describes what the code does)
…esign

Pre-merge review found two stale docstrings describing the old
per-line prefilter design that no longer exists after the
simplification to a file-level gate.

## Changes

- `src/lib/scan/literal-extract.ts`: top-of-file docstring now
  correctly describes the extractor as feeding a FILE-LEVEL gate,
  not a per-line prefilter. Updated the "gate effectiveness"
  section to talk about files rejected vs. the old (stale)
  "lines before touching the regex engine" framing. Removed the
  reference to `isPureLiteral` in the pure-literal paragraph
  (function was deleted in the previous commit).

- `src/lib/scan/grep.ts`: top-of-file docstring rewritten. The old
  version described `content.split("\n") + regex.test(line)` but
  `grepByWholeBuffer` actually uses `regex.exec(content)` on the
  full buffer with per-match line-boundary extraction. New
  docstring describes the two-stage gate-then-whole-buffer design
  and points at `setupGrepPipeline` for literal extraction.

No behavioral changes. Lint clean, all 228 scan tests pass.
@BYK BYK merged commit 78f196a into main Apr 21, 2026
26 checks passed
@BYK BYK deleted the byk/perf/grep-literal-prefilter branch April 21, 2026 18:36
BYK added a commit that referenced this pull request Apr 21, 2026
…#805)

## Summary

Walker was 411ms p50 for 10k files on the synthetic/large fixture —
roughly 10× slower than a naive sync walk (50ms). Follow-up to PR #791
and #804.

Five targeted optimizations in the hot loop, all micro-benchmarked
before landing:

| Change | Measured win | End-to-end win |
|---|---|---|
| `readdirSync` instead of `readdir` | 105→45ms (−60ms) | dominant |
| `statSync` instead of `Bun.file().size` | −10ms per 10k files | |
| String concat for paths | path.join 7ms → concat 0.7ms (10×) | |
| `abs.slice(cwdPrefixLen)` for relPath | path.relative 9ms → slice
0.8ms (11×) | |
| Manual `lastIndexOf`-based extname | 9ms → 7ms (25%) | |

### Perf results (synthetic/large, 10k files, p50)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.walk` | 411ms | **231ms** | **−44%** |
| `scan.walk.noExt` | 572ms | 448ms | **−22%** |
| `scan.walk.dsnParity` | 228ms | **138ms** | **−39%** |
| `scanCodeForDsns` | 323ms | 304ms | −6% |
| `detectAllDsns.cold` | 327ms | 308ms | −6% |
| `detectAllDsns.warm` | 27.9ms | 27.0ms | — |
| `scan.grepFiles` | 322ms | 316ms | noise |
| `scanCodeForFirstDsn` | 2.23ms | 2.05ms | — |

Walker ops are 22-44% faster. Downstream ops (grep, DSN scanner) benefit
less because their time is dominated by content scanning, not walking —
but still show consistent ~6% improvements.

## Why `readdirSync` is safe here

The sync vs async tradeoff usually favors async because blocking the
event loop is bad in general. But measured per-call cost matters:

| Metric | `readdirSync` on 3635 dirs |
|---|---|
| p50 | 11µs |
| p95 | 24µs |
| p99 | 35µs |
| max | 65µs |

65µs max block is trivial — `setTimeout(0)` latency in Node is ~4ms.
Blocking for 65µs never causes noticeable event-loop pauses. And we pay
~60µs of microtask overhead for each async readdir, which wipes out any
theoretical fairness benefit. Net: 2-3× faster per-dir on walks with
many small directories, which is every realistic CLI workload.

If this ever matters for a weird embedded use case, the optimization is
trivially reversible — the walker's public API is unchanged.

## mtime parity fix

Initial implementation regressed `detectAllDsns.warm` from 28ms → 304ms
because `statSync().mtimeMs` is a float (e.g. `1776790602458.1033`)
while `Bun.file().lastModified` is already an integer. The DSN cache
validator compares floored `sourceMtimes`, so un-floored floats caused
cache misses on every warm call. Fixed by flooring explicitly in
`tryYieldFile` — matches the same treatment already applied to
`onDirectoryVisit`'s dirMtimes.

## Walker v2 design notes

- **Sync readdir + async yield.** Readdir blocks briefly (~65µs max),
but the surrounding async-generator interface is preserved. Library
consumers that insert async work between iterations still get
cooperative scheduling.
- **Path ops inlined.** `path.join` / `path.relative` / `path.extname` +
`toLowerCase` are all replaced with manual string ops in the hot loop.
On Windows, `normalizePath` is still applied via `replaceAll(NATIVE_SEP,
"/")` — the POSIX fast path uses a cached `POSIX_NATIVE` module
constant.
- **`WalkContext.cwdPrefixLen`** precomputed once per walk, used to
slice relative paths from absolute paths.
- **Cached module constants** `NATIVE_SEP` and `POSIX_NATIVE` avoid
per-call `path.sep` property lookups in the hot loop.

## What this PR does NOT change

- `WalkEntry` shape: preserved identically (absolutePath, relativePath,
size, mtime, isBinary, depth).
- `WalkOptions` contract: no new options, no semantics changes.
- Symlink handling, `followSymlinks`, cycle detection via
`visitedInodes`: unchanged.
- IgnoreStack + gitignore semantics: unchanged.
- `onDirectoryVisit`, `recordMtimes`, `abortSignal`: unchanged.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
`src/lib/formatters/markdown.ts` unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/scan/walker.test.ts` — 34 pass (incl. property
tests covering hidden files, symlinks, maxDepth, gitignore interaction)
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass (incl.
dirMtimes / sourceMtimes cache validation)
- [x] DSN count correctness verified end-to-end: 4 DSNs found on fixture
(matches pre-change count)
- [x] `bun run bench --size large --runs 5 --warmup 2` — results in
table above

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant