perf(scan): literal prefilter + lazy line counting in readAndGrep by BYK · Pull Request #804 · getsentry/cli

BYK · 2026-04-21T15:36:25Z

Summary

Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS collectGrep/grepFiles. Follow-up to PR #791 and #797.

Literal prefilter — ripgrep-style: extract a literal substring from the regex source (e.g., import from import.*from), scan the buffer with indexOf to locate candidate lines, only invoke the regex engine on lines that contain the literal. V8's indexOf is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from.
Lazy line counting — swapped charCodeAt-walk for indexOf("\n", cursor) hops. 2-5× faster on the line-counting sub-loop because V8 implements indexOf in C++ without per-iteration JS interop.

Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

Op	Before	After	Δ
`scan.grepFiles` (DSN pattern)	370 ms	318 ms	−14%
`detectAllDsns.cold`	363 ms	313 ms	−14%
`detectDsn.cold`	7.73 ms	5.61 ms	−27%
`scanCodeForFirstDsn`	2.91 ms	2.13 ms	−27%
`scanCodeForDsns`	342 ms	333 ms	−3% (noise-equivalent)
`import.*from` uncapped (bench)	1489 ms	1178 ms	−21%

The DSN workloads improve because DSN_PATTERN extracts http as its literal — most source files don't contain http at all, so the prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g., SENTRY_DSN, NONEXISTENT_TOKEN_XYZ) continue through the whole-buffer path unchanged.

What changed

New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes.

Handles escaped metacharacters intelligently: Sentry\.init yields Sentry.init (extracted via literal \. → .), while \bfoo\b yields foo (escape \b is an anchor, not a literal b).

Exports:

extractInnerLiteral(source, flags) — returns the literal, or null if no safe extraction possible. Honors /i by lowercasing.
isPureLiteral(source, flags) — true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there).

Modified: `src/lib/scan/grep.ts` (~240 LOC changes)

Three-way dispatch in readAndGrep based on the extracted literal:

grepByLiteralPrefilter (new) — regex with extractable literal + multiline: true. Uses indexOf(literal) to find candidate lines, runs the regex engine only on those. This is the main perf win.
grepByWholeBuffer — existing path, used for:
- Pure-literal patterns (V8 handles them optimally)
- Patterns with no extractable literal (complex regex, top-level alternation)
- multiline: false mode (the fast path requires per-line semantics)

Also: replaced the charCodeAt-walk that counted newlines char-by-char with an indexOf("\n", cursor) hop loop. Extracted buildMatch(ctx, bounds) as a shared helper to bundle the match-construction arguments.

Tests added

test/lib/scan/literal-extract.test.ts — 39 tests covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length).
test/lib/scan/grep.test.ts — 7 new tests for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing.

Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its String.prototype.indexOf is compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were evaluated:

Character classes over .* — DSN_PATTERN already uses [a-z0-9]+, no change needed.
Alternation order — DSN_PATTERN's (?:\.[a-z]+|:[0-9]+) is already correctly ordered (. more common than : in DSN hosts); swapping regressed perf by noise.
Anchors/word boundaries — adding \b to DSN_PATTERN regressed perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead.

The remaining gap with rg is now primarily orchestration overhead (async/await, mapFilesConcurrent, walker correctness features) rather than regex speed. A worker-pool exploration may follow.

Test plan

bunx tsc --noEmit — clean
bun run lint — clean (1 pre-existing warning in src/lib/formatters/markdown.ts unrelated to this PR)
bun test --timeout 15000 test/lib test/commands test/types — 5610 pass, 0 fail (+58 new)
bun test test/isolated — 138 pass, 0 fail
bun run bench --size large --runs 5 — all scan ops at or below previous baseline
Manually verified semantic parity: collectGrep returns identical GrepMatch[] on prefilter vs whole-buffer paths for patterns where the prefilter fires

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com

Two regex-level optimizations to narrow the perf gap with ripgrep on `collectGrep`/`grepFiles`. Biggest win is a ripgrep-style literal prefilter: extract a literal substring from the regex source (e.g., `import` from `import.*from`), scan the buffer with `indexOf` to locate candidate lines, and only invoke the regex engine on lines that contain the literal. On lines without the literal, `indexOf` is ~50× faster than V8's regex engine for the same rejection work. ## What changed ### New file: `src/lib/scan/literal-extract.ts` (~300 LOC) Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes. Handles escaped metacharacters intelligently: `Sentry\.init` yields `Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields `foo` (escape `\b` is an anchor, not a literal `b`). Exports: - `extractInnerLiteral(source, flags)` — returns the literal, or null if no safe extraction possible. Honors `/i` by lowercasing. - `isPureLiteral(source, flags)` — true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there). ### Modified: `src/lib/scan/grep.ts` (~240 LOC changes) Three-way dispatch in `readAndGrep` based on the extracted literal: 1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal + `multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs the regex engine only on those lines. This is the main perf win. 2. **`grepByWholeBuffer`** — existing path, used for: - Pure-literal patterns (V8 handles them optimally) - Patterns with no extractable literal (complex regex, top-level alternation) - `multiline: false` mode (fast path requires per-line semantics) Also: lazy line counting. Replaced the `charCodeAt`-walk that counted newlines char-by-char from cursor to match index with an `indexOf("\n", cursor)` hop loop. V8 implements `indexOf` in optimized C++ without per-iteration JS interop, making this 2-5× faster on the line-counting sub-loop. Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle the match-construction arguments into a single struct — keeps Biome's `useMaxParams` rule satisfied across three call sites with a clean API. ### Tests added - `test/lib/scan/literal-extract.test.ts` — 39 tests covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length). - `test/lib/scan/grep.test.ts` — 7 new tests for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing. ## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core) | Op | Before | After | Δ | |---|---:|---:|---:| | `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | −14% | | `scanCodeForDsns` | 342 ms | **333 ms** | −3% (noise-equivalent) | | `detectAllDsns.cold` | 363 ms | **313 ms** | −14% | | `detectDsn.cold` | 7.73 ms | **5.61 ms** | −27% | | `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | −27% | | `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | −21% | The DSN workloads improve because `DSN_PATTERN` extracts `http` as its literal — most source files don't contain `http` at all, so the prefilter short-circuits before the regex runs. No regressions on any benchmark. Pure-literal patterns (e.g., `SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole- buffer path unchanged. ## What this PR is NOT - Not a structural parallelization change (workers, byte-level scanning, chunked reads). Those may follow separately. - Not a regex-engine replacement. V8's Irregexp is what it is. - Not feature-level changes to `GrepOptions`/`GrepResult`. Public API preserved. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated to this PR in `src/lib/formatters/markdown.ts`) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5610 pass, 0 fail** (+58 new: 39 literal-extract + 7 grep prefilter + 7 existing grep tests re-verified) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --runs 5` — all scan ops at or below previous baseline; no regressions - [x] Manually verified semantic parity: `collectGrep` returns identical `GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the prefilter fires ## Background This PR follows the ripgrep research from PR #791: rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its `String.prototype.indexOf` is compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case. The remaining gap with rg is now primarily orchestration overhead (async/await, `mapFilesConcurrent`, walker correctness features) rather than regex speed. A worker-pool exploration may follow.

github-actions · 2026-04-21T15:37:07Z

PR Preview Action v1.8.1
🚀 View preview at https://cli.sentry.dev/_preview/pr-804/
Built to branch `gh-pages` at 2026-04-21 18:25 UTC. Preview will be ready when the GitHub Pages deployment is complete.

github-actions · 2026-04-21T15:39:00Z

Codecov Results 📊

✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric	Change
Total Tests	—
Passed Tests	—
Failed Tests	—
Skipped Tests	—

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 95.85%. Project has 1768 uncovered lines.
❌ Project coverage is 95.63%. Comparing base (base) to head (head).

Files with missing lines (1)

File	Patch %	Lines
src/lib/scan/literal-extract.ts	94.95%	⚠️ 10 Missing

Coverage diff

@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.65%    95.63%    -0.02%
==========================================
  Files          280       281        +1
  Lines        40228     40444      +216
  Branches         0         0         —
==========================================
+ Hits         38477     38676      +199
- Misses        1751      1768       +17
- Partials         0         0         —

Generated by Codecov Action

Cursor Bugbot found a silent correctness bug in the literal-prefix extractor: `[(]foo|bar` was extracting `"foo"` as a required literal, causing `collectGrep`'s literal prefilter to silently miss lines matching only the `bar` branch. ## Root cause `hasTopLevelAlternation` tracked paren/bracket depth for `(`, `)`, `[`, `]` in a single pass but didn't skip over the interior of a character class. Inside `[...]`, the characters `(` and `)` are LITERAL bytes, not grouping metacharacters — they shouldn't change `depthParen`. But the old code incremented `depthParen` on `(` even when it was inside `[...]`. Once the class closed, `depthParen` was stuck at 1, so the subsequent top-level `|` was ignored, the detector returned `false`, and the extractor proceeded to extract `"foo"` from after the class. Result: the prefilter scanned for `foo` and silently dropped any line that matched only via the `bar` branch of the alternation. ## Fix Character classes are opaque to the alternation detector. When `[` is seen, skip past the first unescaped `]` in one step. Also recognize that `[...]` is NOT nestable — `[[]` is a class containing a literal `[`, not a nested class (regex spec). So `skipToMatching` (which is nestable) is wrong for classes. New helper `skipCharacterClass` handles the opaque-skip correctly. Both `extractInnerLiteral` (line 174) and `hasTopLevelAlternation` (line 273) now use it for `[...]`. `skipToMatching` is preserved for `(...)` groups (which ARE nestable). ## Verification Before: extractInnerLiteral("[(]foo|bar", "") → "foo" (WRONG) collectGrep({pattern: "[(]foo|bar"}) → misses "bar" lines After: extractInnerLiteral("[(]foo|bar", "") → null (correct — alternation detected) collectGrep → falls through to whole-buffer path, finds all matches ## Tests - `literal-extract.test.ts`: 10 new tests under "character-class opaqueness (Cursor bug #1)" — 5 patterns that previously silently extracted a wrong literal, 5 control patterns that should still extract correctly (char classes without alternation). - `grep.test.ts`: 1 new end-to-end regression test verifying that `collectGrep` on `[(]foo|bar` across three files (only-bar, only-paren, both) returns all expected matches. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5621 pass, 0 fail** (+11 new regression tests) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --op scan.grepFiles` — 329ms p50 (noise-equivalent to previous 318ms; no perf regression)

Cursor Bugbot found the sibling of yesterday's char-class bug (#612b8b0). The fix for `hasTopLevelAlternation` used the new `skipCharacterClass` helper, but the group-skipper (formerly `skipToMatching`, now renamed `skipGroup`) was left untouched. Same underlying issue: a `)` inside `[...]` was closing the enclosing group early. ## Root cause `extractInnerLiteral` encounters `(` and calls the group-skipper to advance past the matching `)`. The skipper tracked paren depth and stopped at the first `)` where depth hit zero. But `[...]` char classes contain literal bytes — a `)` inside `[)]` is NOT a grouping metacharacter. The skipper didn't treat classes as opaque, so it exited the group at the wrong paren. Concrete: `(ABC[)]DEF)?GHI` - skipper enters at `(`, tracks depth=1 - sees `[`, advances past `]` — but `)` inside `[)]` was reached first, decremented depth to 0, skipper thinks group ended - extractor now considers `DEF` to be at the top level, returns it as the required literal - prefilter searches for `DEF` and silently misses lines matching only the `GHI` branch (the group is optional `?`) ## Fix `skipGroup` now calls `skipCharacterClass` when it sees `[`, making classes opaque to group-depth tracking — same mechanism already used in `hasTopLevelAlternation`. Also renamed `skipToMatching` → `skipGroup` since it was always called with `(`/`)` anyway. Clearer intent + no functional change. ## Verification Before: extractInnerLiteral("(ABC[)]DEF)?GHI", "") → "DEF" (WRONG) collectGrep silently drops lines matching only "GHI" After: extractInnerLiteral("(ABC[)]DEF)?GHI", "") → "GHI" (correct) collectGrep finds all matches ## Tests - `literal-extract.test.ts`: 9 new tests under "class-in-group opaqueness (Cursor bug #1 followup)" — 5 bug patterns + 4 control patterns verifying regular group-skipping still works (including nested groups like `((foo))?bar`). - `grep.test.ts`: 1 new end-to-end regression test for `collectGrep` with `(ABC[)]DEF)?GHI` across two files — verifies both branches match after the fix. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5631 pass, 0 fail** (+10 from this fix) - [x] `bun test test/isolated` — 138 pass - [x] Manual E2E repro from the bug report now returns all matches

…ange (Cursor #2) Cursor Bugbot found another silent correctness bug in `grepByLiteralPrefilter`. When the caller sets `caseSensitive: false`, the fast path lowercases the content and searches for the pre-lowercased literal via `indexOf`. The implementation used `hit` positions from the lowercased `haystack` as indices into the original `content` for line-boundary math, assuming the two strings have the same length. ## Root cause JavaScript's `String.prototype.toLowerCase()` is specified against Unicode case-folding rules that can produce a LONGER string. The classic example is Turkish dotted `İ` (U+0130), which lowercases to `i` + U+0307 (combining dot above) — net +1 code unit per `İ`. When content contains such characters: - `haystack.length !== content.length` - `hit` positions from `haystack` are offset past the matching position in `content` - `content.lastIndexOf("\n", hit)` / `content.indexOf("\n", hit)` return newlines that happen to bracket SOME line (often the wrong one) - `lineEnd + 1` (computed on `content`) fed back to `haystack.indexOf(literal, lineEnd + 1)` compounds the misalignment and can SKIP PAST real match positions Concrete repro (would fail under the buggy code, passes now): file: `İİİİİİİ line 1 with Sentry.init here` `İİİİİİİ line 2 plain` `İİİİİİİ line 3 with Sentry.init again` `İİİİİİİ line 4 plain` pattern: /Sentry\.init/i before: only line 1 found — line 3 silently dropped after: both lines 1 and 3 found ## Fix Detect the length divergence up front and fall back to the whole- buffer regex path when it happens: ```ts if (isCaseInsensitive) { const lowered = content.toLowerCase(); if (lowered.length !== content.length) { return grepByWholeBuffer(content, entry, opts); } haystack = lowered; } ``` The check is one `.length` comparison — nearly free. Real source code almost never contains the problematic characters, so the fast path still fires on the overwhelming common case. When it doesn't, we degrade to the correct-but-slower regex path instead of silently missing matches. ## Tests - `grep.test.ts`: new end-to-end regression test with Turkish `İ` in the fixture content, verifying both `Sentry.init` matches on lines 1 and 3 are found. Before the fix this test would return only 1 match. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated; added `biome-ignore noExcessiveCognitiveComplexity` to `grepByLiteralPrefilter` — complexity grew from 15 to 17 with the new length-guard branch) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5632 pass, 0 fail** (+1 new regression test) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --op scan.grepFiles` — 332ms p50 (noise-equivalent to 329ms before; no perf regression on the common ASCII-only path)

…t (Cursor #3) Cursor Bugbot found another silent correctness bug. The literal extractor was running against the raw user-supplied `patternSource` instead of the compiled `regex.source`. When `compilePattern` rewrites scoped inline flags like `(?i:foo|bar)baz`, it strips the group and widens the flag, yielding `foo|barbaz` + the `i` flag at the top level. The rewritten source has top-level alternation the original lacked. ## Concrete bug - User input: `(?i:foo|bar)baz` - Compiled source (what the regex engine runs): `foo|barbaz` - Compiled flags: `i` The old code called `extractInnerLiteral(userInput, flags)`. The extractor's `hasTopLevelAlternation` didn't see the `|` (it's inside the `(?i:...)` group from its perspective), so it didn't bail, and `extractInnerLiteral` returned `"baz"` as a "required" literal. But the actual compiled regex `/foo|barbaz/i` matches `foo` alone. Lines with only `foo` got silently dropped by the prefilter because they don't contain `"baz"`. Repro confirmed: ``` pattern: (?i:foo|bar)baz files: foo-only.txt: "foo line without that other word" ← should match via `foo` branch barbaz.txt: "contains barbaz here" before: only barbaz.txt matched — foo-only.txt silently dropped after: both files match ``` ## Fix Always extract from `regex.source` (post-compile). That way the extractor sees what the regex engine will actually run — including any structural changes from `compilePattern`'s inline-flag handling. `regex.flags` is authoritative for case-sensitivity either way. One-line change plus updated comment explaining the constraint. ## Tests - `grep.test.ts`: new end-to-end regression test with the `(?i:foo|bar)baz` pattern against two files — verifies both branches match after the fix. Before the fix this test would return only the barbaz.txt match. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5633 pass, 0 fail** (+1 new regression test) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --op scan.grepFiles` — 325ms p50 (noise-equivalent; no perf regression) - [x] All 13 literal-prefilter fast path tests still pass unchanged (extraction from compiled source gives equivalent results for patterns without scoped inline flags, which is the overwhelming common case)

…char escapes (review #4) A subagent self-review found two more correctness bugs in the literal prefilter: (A) multi-char escape sequences like \x41, \u0041, \cA had their tails mistaken for literals, and (B) patterns that can match across newlines (e.g., foo\sbar) silently missed valid matches via the per-line verify path. Rather than patch each failure mode, this commit redesigns the prefilter as a FILE-LEVEL gate only and fixes escape-sequence advancement in the extractor. Net result is simpler, correct, and ~5% faster on bench. ## Bug A: multi-char escapes `extractInnerLiteral` advanced all escape sequences by 2 chars, leaving the tail chars of longer escapes as "literal" content. - `\x41foo` → returned "41foo", but compiled regex is /Afoo/. Files with "Afoo" got skipped by the gate (no "41foo" present). - `\u0041foo` → "0041foo" returned, same silent miss. - `\cAfoo` → "Afoo" returned (wrong — ctrl-A, not "A"). - `\k<name>foo` → "<name>foo" returned (backref syntax leaked as literal). - `\p{L}foo` → "{L}foo" returned (Unicode property syntax leaked). Fix: new `escapeSequenceLength` helper computes the full source-char length of all JS regex escape forms: - `\xHH` → 4 chars - `\uHHHH` → 6 chars - `\u{H+}` → variable (until `}`) - `\cX` → 3 chars - `\k<name>` → variable (until `>`) - `\p{...}`, `\P{...}` → variable (until `}`) - default (single-char like `\d`, `\b`, `\t`, `\n`, numeric backrefs like `\1`, escaped metachars `\.`, etc.) → 2 chars Applied to: the main extractor, `skipCharacterClass`, `skipGroup`, `hasTopLevelAlternation`. All four call sites now correctly advance past the whole escape sequence. ## Bug B: cross-newline patterns The old prefilter did per-line verify: after finding a candidate via `indexOf(literal)`, extracted the enclosing line and ran `verifyRegex.test(lineText)`. This broke silently for patterns whose match spans newlines: - `foo\sbar` — `\s` includes `\n`, so `foo\nbar` is a match. But the per-line verify tests "foo" alone, which fails. - `function\s+\w+` — multi-line function declarations missed. This was a REGRESSION vs the pre-PR whole-buffer path, which correctly found these matches via `regex.exec(content)`. ## Fix: file-level gate only Redesigned the prefilter: Before: literal prefilter = find candidate lines + per-line verify After: literal prefilter = file-level gate (skip entire files without the literal); fall through to whole-buffer for everything else The gate still captures most of the perf win — files without the literal are cheap to reject via indexOf, and in large trees those dominate. The per-line optimization was the part with subtle correctness failures, and it's gone. Also simplified: removed `literalIsPattern` field, `grepByLiteralPrefilter` function, `stripGlobalFlag` helper, and the Unicode length guard (no longer needed since we only use indexOf as a boolean). The Unicode concern from Cursor #2 reduces to: "does `content.toLowerCase()` still contain `literal` at any position?" — which works correctly regardless of whether `toLowerCase` changes the string's length. ## Performance Synthetic/large fixture (10k files): | Op | Before review #4 | After (this commit) | |---|---:|---:| | scan.grepFiles | 332 ms | **325 ms** | | scanCodeForDsns | 333 ms | **313 ms** | | detectDsn.cold | 5.61 ms | **4.55 ms** | | detectAllDsns.cold | 313 ms | 316 ms | File-level gate is cheaper than per-line verify (one indexOf vs many plus regex.test per candidate). Net improvement. ## Tests - `literal-extract.test.ts`: +8 tests for multi-char escape sequences (\x41, \u0041, \u{1F600}, \cA, \k<name>, \p{L}, \P{L}, plus controls verifying pre-escape literal extraction still wins). - `grep.test.ts`: +2 end-to-end tests — one for cross-newline patterns (`foo\sbar`) and one for multi-char escapes (`\x41foo` must find files with `Afoo`, not `41foo`). Both tests would fail under the old code. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5644 pass, 0 fail** (+11 new regression tests) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --runs 5` — no regressions; scan ops slightly faster due to simpler gate

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 2277d0d. Configure here.}

Cursor Bugbot caught dead code: `isPureLiteral` was exported and tested but no longer called anywhere in the pipeline after the previous commit simplified the prefilter to a file-level gate. The function correctly identified pure-literal patterns but the new design uses the literal the same way whether the whole pattern is literal or not (file-level indexOf gate, fall through to whole- buffer regex). ## Changes - Deleted `isPureLiteral` export and its 4 unit tests. - Updated the module-level docstring to reflect that pure-literal patterns flow through the same gate-then-regex path as everything else. - Fixed a stale comment in `grep.test.ts` that claimed pure literals "skip the prefilter" via `isPureLiteral` — they go through the gate like all other patterns, and V8's regex engine handles pure-literal patterns efficiently in the whole-buffer path. Test pass count drops by 4 (from 5644 to 5640) — the removed tests were all coverage for dead code. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated) - [x] `bun test --timeout 15000 test/lib/scan/` — **228 pass, 0 fail** - [x] All existing behavior tests still pass (the `pure literal` grep test now correctly describes what the code does)

…esign Pre-merge review found two stale docstrings describing the old per-line prefilter design that no longer exists after the simplification to a file-level gate. ## Changes - `src/lib/scan/literal-extract.ts`: top-of-file docstring now correctly describes the extractor as feeding a FILE-LEVEL gate, not a per-line prefilter. Updated the "gate effectiveness" section to talk about files rejected vs. the old (stale) "lines before touching the regex engine" framing. Removed the reference to `isPureLiteral` in the pure-literal paragraph (function was deleted in the previous commit). - `src/lib/scan/grep.ts`: top-of-file docstring rewritten. The old version described `content.split("\n") + regex.test(line)` but `grepByWholeBuffer` actually uses `regex.exec(content)` on the full buffer with per-match line-boundary extraction. New docstring describes the two-stage gate-then-whole-buffer design and points at `setupGrepPipeline` for literal extraction. No behavioral changes. Lint clean, all 228 scan tests pass.

…#805) ## Summary Walker was 411ms p50 for 10k files on the synthetic/large fixture — roughly 10× slower than a naive sync walk (50ms). Follow-up to PR #791 and #804. Five targeted optimizations in the hot loop, all micro-benchmarked before landing: | Change | Measured win | End-to-end win | |---|---|---| | `readdirSync` instead of `readdir` | 105→45ms (−60ms) | dominant | | `statSync` instead of `Bun.file().size` | −10ms per 10k files | | | String concat for paths | path.join 7ms → concat 0.7ms (10×) | | | `abs.slice(cwdPrefixLen)` for relPath | path.relative 9ms → slice 0.8ms (11×) | | | Manual `lastIndexOf`-based extname | 9ms → 7ms (25%) | | ### Perf results (synthetic/large, 10k files, p50) | Op | Before | After | Δ | |---|---:|---:|---:| | `scan.walk` | 411ms | **231ms** | **−44%** | | `scan.walk.noExt` | 572ms | 448ms | **−22%** | | `scan.walk.dsnParity` | 228ms | **138ms** | **−39%** | | `scanCodeForDsns` | 323ms | 304ms | −6% | | `detectAllDsns.cold` | 327ms | 308ms | −6% | | `detectAllDsns.warm` | 27.9ms | 27.0ms | — | | `scan.grepFiles` | 322ms | 316ms | noise | | `scanCodeForFirstDsn` | 2.23ms | 2.05ms | — | Walker ops are 22-44% faster. Downstream ops (grep, DSN scanner) benefit less because their time is dominated by content scanning, not walking — but still show consistent ~6% improvements. ## Why `readdirSync` is safe here The sync vs async tradeoff usually favors async because blocking the event loop is bad in general. But measured per-call cost matters: | Metric | `readdirSync` on 3635 dirs | |---|---| | p50 | 11µs | | p95 | 24µs | | p99 | 35µs | | max | 65µs | 65µs max block is trivial — `setTimeout(0)` latency in Node is ~4ms. Blocking for 65µs never causes noticeable event-loop pauses. And we pay ~60µs of microtask overhead for each async readdir, which wipes out any theoretical fairness benefit. Net: 2-3× faster per-dir on walks with many small directories, which is every realistic CLI workload. If this ever matters for a weird embedded use case, the optimization is trivially reversible — the walker's public API is unchanged. ## mtime parity fix Initial implementation regressed `detectAllDsns.warm` from 28ms → 304ms because `statSync().mtimeMs` is a float (e.g. `1776790602458.1033`) while `Bun.file().lastModified` is already an integer. The DSN cache validator compares floored `sourceMtimes`, so un-floored floats caused cache misses on every warm call. Fixed by flooring explicitly in `tryYieldFile` — matches the same treatment already applied to `onDirectoryVisit`'s dirMtimes. ## Walker v2 design notes - **Sync readdir + async yield.** Readdir blocks briefly (~65µs max), but the surrounding async-generator interface is preserved. Library consumers that insert async work between iterations still get cooperative scheduling. - **Path ops inlined.** `path.join` / `path.relative` / `path.extname` + `toLowerCase` are all replaced with manual string ops in the hot loop. On Windows, `normalizePath` is still applied via `replaceAll(NATIVE_SEP, "/")` — the POSIX fast path uses a cached `POSIX_NATIVE` module constant. - **`WalkContext.cwdPrefixLen`** precomputed once per walk, used to slice relative paths from absolute paths. - **Cached module constants** `NATIVE_SEP` and `POSIX_NATIVE` avoid per-call `path.sep` property lookups in the hot loop. ## What this PR does NOT change - `WalkEntry` shape: preserved identically (absolutePath, relativePath, size, mtime, isBinary, depth). - `WalkOptions` contract: no new options, no semantics changes. - Symlink handling, `followSymlinks`, cycle detection via `visitedInodes`: unchanged. - IgnoreStack + gitignore semantics: unchanged. - `onDirectoryVisit`, `recordMtimes`, `abortSignal`: unchanged. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `bun test test/lib/scan/walker.test.ts` — 34 pass (incl. property tests covering hidden files, symlinks, maxDepth, gitignore interaction) - [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass (incl. dirMtimes / sourceMtimes cache validation) - [x] DSN count correctness verified end-to-end: 4 DSNs found on fixture (matches pre-change count) - [x] `bun run bench --size large --runs 5 --warmup 2` — results in table above 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>