perf(scan): literal prefilter + lazy line counting in readAndGrep#804
Merged
perf(scan): literal prefilter + lazy line counting in readAndGrep#804
Conversation
Two regex-level optimizations to narrow the perf gap with ripgrep on
`collectGrep`/`grepFiles`. Biggest win is a ripgrep-style literal
prefilter: extract a literal substring from the regex source (e.g.,
`import` from `import.*from`), scan the buffer with `indexOf` to
locate candidate lines, and only invoke the regex engine on lines
that contain the literal. On lines without the literal, `indexOf`
is ~50× faster than V8's regex engine for the same rejection work.
## What changed
### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)
Conservative literal extractor. Walks a regex source looking for
the longest contiguous run of literal bytes that every match must
contain. Bails out safely on top-level alternation, character
classes, groups, lookarounds, quantifiers, and escape classes.
Handles escaped metacharacters intelligently: `Sentry\.init` yields
`Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b`
yields `foo` (escape `\b` is an anchor, not a literal `b`).
Exports:
- `extractInnerLiteral(source, flags)` — returns the literal, or
null if no safe extraction possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)` — true when the pattern IS a bare
literal with no metacharacters. Used by the grep pipeline to
route pure-literals to the whole-buffer path (V8's regex engine
is hyper-optimized for pure-literal patterns; the prefilter adds
overhead without benefit there).
### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)
Three-way dispatch in `readAndGrep` based on the extracted literal:
1. **`grepByLiteralPrefilter`** (new) — regex with extractable
literal + `multiline: true`. Uses `indexOf(literal)` to find
candidate lines, runs the regex engine only on those lines.
This is the main perf win.
2. **`grepByWholeBuffer`** — existing path, used for:
- Pure-literal patterns (V8 handles them optimally)
- Patterns with no extractable literal (complex regex, top-level
alternation)
- `multiline: false` mode (fast path requires per-line semantics)
Also: lazy line counting. Replaced the `charCodeAt`-walk that
counted newlines char-by-char from cursor to match index with an
`indexOf("\n", cursor)` hop loop. V8 implements `indexOf` in
optimized C++ without per-iteration JS interop, making this 2-5×
faster on the line-counting sub-loop.
Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle
the match-construction arguments into a single struct — keeps
Biome's `useMaxParams` rule satisfied across three call sites
with a clean API.
### Tests added
- `test/lib/scan/literal-extract.test.ts` — 39 tests covering the
extractor's rules (escape handling, quantifier drop, alternation
bail, case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts` — 7 new tests for the prefilter fast
path: correctness vs whole-buffer, escaped-literal extraction,
case-insensitive flag, zero-literal-hit short-circuit, routing
of pure literals to whole-buffer, and alternation routing.
## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)
| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | −14% |
| `scanCodeForDsns` | 342 ms | **333 ms** | −3% (noise-equivalent) |
| `detectAllDsns.cold` | 363 ms | **313 ms** | −14% |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | −27% |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | −27% |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | −21% |
The DSN workloads improve because `DSN_PATTERN` extracts `http` as
its literal — most source files don't contain `http` at all, so
the prefilter short-circuits before the regex runs.
No regressions on any benchmark. Pure-literal patterns (e.g.,
`SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-
buffer path unchanged.
## What this PR is NOT
- Not a structural parallelization change (workers, byte-level
scanning, chunked reads). Those may follow separately.
- Not a regex-engine replacement. V8's Irregexp is what it is.
- Not feature-level changes to `GrepOptions`/`GrepResult`. Public
API preserved.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated to
this PR in `src/lib/formatters/markdown.ts`)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5610 pass, 0 fail** (+58 new: 39 literal-extract + 7 grep
prefilter + 7 existing grep tests re-verified)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 5` — all scan ops at or
below previous baseline; no regressions
- [x] Manually verified semantic parity: `collectGrep` returns
identical `GrepMatch[]` on prefilter vs whole-buffer paths
for patterns where the prefilter fires
## Background
This PR follows the ripgrep research from PR #791: rg's central
perf trick is extracting a literal from each regex and prefiltering
with SIMD memchr. V8 doesn't expose SIMD directly but its
`String.prototype.indexOf` is compiled to a tight byte-level loop
with internal SIMD on x64 — functionally equivalent for our use
case. The remaining gap with rg is now primarily orchestration
overhead (async/await, `mapFilesConcurrent`, walker correctness
features) rather than regex speed. A worker-pool exploration may
follow.
Contributor
|
Contributor
Codecov Results 📊✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms 📊 Comparison with Base Branch
✨ No test changes detected All tests are passing successfully. ✅ Patch coverage is 95.85%. Project has 1768 uncovered lines. Files with missing lines (1)
Coverage diff@@ Coverage Diff @@
## main #PR +/-##
==========================================
- Coverage 95.65% 95.63% -0.02%
==========================================
Files 280 281 +1
Lines 40228 40444 +216
Branches 0 0 —
==========================================
+ Hits 38477 38676 +199
- Misses 1751 1768 +17
- Partials 0 0 —Generated by Codecov Action |
Cursor Bugbot found a silent correctness bug in the literal-prefix
extractor: `[(]foo|bar` was extracting `"foo"` as a required literal,
causing `collectGrep`'s literal prefilter to silently miss lines
matching only the `bar` branch.
## Root cause
`hasTopLevelAlternation` tracked paren/bracket depth for `(`, `)`,
`[`, `]` in a single pass but didn't skip over the interior of a
character class. Inside `[...]`, the characters `(` and `)` are
LITERAL bytes, not grouping metacharacters — they shouldn't change
`depthParen`. But the old code incremented `depthParen` on `(` even
when it was inside `[...]`. Once the class closed, `depthParen` was
stuck at 1, so the subsequent top-level `|` was ignored, the
detector returned `false`, and the extractor proceeded to extract
`"foo"` from after the class.
Result: the prefilter scanned for `foo` and silently dropped any
line that matched only via the `bar` branch of the alternation.
## Fix
Character classes are opaque to the alternation detector. When `[`
is seen, skip past the first unescaped `]` in one step. Also
recognize that `[...]` is NOT nestable — `[[]` is a class containing
a literal `[`, not a nested class (regex spec). So `skipToMatching`
(which is nestable) is wrong for classes.
New helper `skipCharacterClass` handles the opaque-skip correctly.
Both `extractInnerLiteral` (line 174) and `hasTopLevelAlternation`
(line 273) now use it for `[...]`. `skipToMatching` is preserved for
`(...)` groups (which ARE nestable).
## Verification
Before:
extractInnerLiteral("[(]foo|bar", "") → "foo" (WRONG)
collectGrep({pattern: "[(]foo|bar"}) → misses "bar" lines
After:
extractInnerLiteral("[(]foo|bar", "") → null (correct — alternation detected)
collectGrep → falls through to whole-buffer path, finds all matches
## Tests
- `literal-extract.test.ts`: 10 new tests under "character-class
opaqueness (Cursor bug #1)" — 5 patterns that previously silently
extracted a wrong literal, 5 control patterns that should still
extract correctly (char classes without alternation).
- `grep.test.ts`: 1 new end-to-end regression test verifying that
`collectGrep` on `[(]foo|bar` across three files (only-bar,
only-paren, both) returns all expected matches.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5621 pass, 0 fail** (+11 new regression tests)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --op scan.grepFiles` — 329ms p50
(noise-equivalent to previous 318ms; no perf regression)
Cursor Bugbot found the sibling of yesterday's char-class bug
(#612b8b0). The fix for `hasTopLevelAlternation` used the new
`skipCharacterClass` helper, but the group-skipper (formerly
`skipToMatching`, now renamed `skipGroup`) was left untouched. Same
underlying issue: a `)` inside `[...]` was closing the enclosing
group early.
## Root cause
`extractInnerLiteral` encounters `(` and calls the group-skipper to
advance past the matching `)`. The skipper tracked paren depth and
stopped at the first `)` where depth hit zero. But `[...]` char
classes contain literal bytes — a `)` inside `[)]` is NOT a grouping
metacharacter. The skipper didn't treat classes as opaque, so it
exited the group at the wrong paren.
Concrete: `(ABC[)]DEF)?GHI`
- skipper enters at `(`, tracks depth=1
- sees `[`, advances past `]` — but `)` inside `[)]` was reached
first, decremented depth to 0, skipper thinks group ended
- extractor now considers `DEF` to be at the top level, returns
it as the required literal
- prefilter searches for `DEF` and silently misses lines matching
only the `GHI` branch (the group is optional `?`)
## Fix
`skipGroup` now calls `skipCharacterClass` when it sees `[`, making
classes opaque to group-depth tracking — same mechanism already
used in `hasTopLevelAlternation`.
Also renamed `skipToMatching` → `skipGroup` since it was always
called with `(`/`)` anyway. Clearer intent + no functional change.
## Verification
Before:
extractInnerLiteral("(ABC[)]DEF)?GHI", "") → "DEF" (WRONG)
collectGrep silently drops lines matching only "GHI"
After:
extractInnerLiteral("(ABC[)]DEF)?GHI", "") → "GHI" (correct)
collectGrep finds all matches
## Tests
- `literal-extract.test.ts`: 9 new tests under "class-in-group
opaqueness (Cursor bug #1 followup)" — 5 bug patterns + 4
control patterns verifying regular group-skipping still works
(including nested groups like `((foo))?bar`).
- `grep.test.ts`: 1 new end-to-end regression test for
`collectGrep` with `(ABC[)]DEF)?GHI` across two files — verifies
both branches match after the fix.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5631 pass, 0 fail** (+10 from this fix)
- [x] `bun test test/isolated` — 138 pass
- [x] Manual E2E repro from the bug report now returns all matches
…ange (Cursor #2) Cursor Bugbot found another silent correctness bug in `grepByLiteralPrefilter`. When the caller sets `caseSensitive: false`, the fast path lowercases the content and searches for the pre-lowercased literal via `indexOf`. The implementation used `hit` positions from the lowercased `haystack` as indices into the original `content` for line-boundary math, assuming the two strings have the same length. ## Root cause JavaScript's `String.prototype.toLowerCase()` is specified against Unicode case-folding rules that can produce a LONGER string. The classic example is Turkish dotted `İ` (U+0130), which lowercases to `i` + U+0307 (combining dot above) — net +1 code unit per `İ`. When content contains such characters: - `haystack.length !== content.length` - `hit` positions from `haystack` are offset past the matching position in `content` - `content.lastIndexOf("\n", hit)` / `content.indexOf("\n", hit)` return newlines that happen to bracket SOME line (often the wrong one) - `lineEnd + 1` (computed on `content`) fed back to `haystack.indexOf(literal, lineEnd + 1)` compounds the misalignment and can SKIP PAST real match positions Concrete repro (would fail under the buggy code, passes now): file: `İİİİİİİ line 1 with Sentry.init here` `İİİİİİİ line 2 plain` `İİİİİİİ line 3 with Sentry.init again` `İİİİİİİ line 4 plain` pattern: /Sentry\.init/i before: only line 1 found — line 3 silently dropped after: both lines 1 and 3 found ## Fix Detect the length divergence up front and fall back to the whole- buffer regex path when it happens: ```ts if (isCaseInsensitive) { const lowered = content.toLowerCase(); if (lowered.length !== content.length) { return grepByWholeBuffer(content, entry, opts); } haystack = lowered; } ``` The check is one `.length` comparison — nearly free. Real source code almost never contains the problematic characters, so the fast path still fires on the overwhelming common case. When it doesn't, we degrade to the correct-but-slower regex path instead of silently missing matches. ## Tests - `grep.test.ts`: new end-to-end regression test with Turkish `İ` in the fixture content, verifying both `Sentry.init` matches on lines 1 and 3 are found. Before the fix this test would return only 1 match. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated; added `biome-ignore noExcessiveCognitiveComplexity` to `grepByLiteralPrefilter` — complexity grew from 15 to 17 with the new length-guard branch) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5632 pass, 0 fail** (+1 new regression test) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --op scan.grepFiles` — 332ms p50 (noise-equivalent to 329ms before; no perf regression on the common ASCII-only path)
…t (Cursor #3) Cursor Bugbot found another silent correctness bug. The literal extractor was running against the raw user-supplied `patternSource` instead of the compiled `regex.source`. When `compilePattern` rewrites scoped inline flags like `(?i:foo|bar)baz`, it strips the group and widens the flag, yielding `foo|barbaz` + the `i` flag at the top level. The rewritten source has top-level alternation the original lacked. ## Concrete bug - User input: `(?i:foo|bar)baz` - Compiled source (what the regex engine runs): `foo|barbaz` - Compiled flags: `i` The old code called `extractInnerLiteral(userInput, flags)`. The extractor's `hasTopLevelAlternation` didn't see the `|` (it's inside the `(?i:...)` group from its perspective), so it didn't bail, and `extractInnerLiteral` returned `"baz"` as a "required" literal. But the actual compiled regex `/foo|barbaz/i` matches `foo` alone. Lines with only `foo` got silently dropped by the prefilter because they don't contain `"baz"`. Repro confirmed: ``` pattern: (?i:foo|bar)baz files: foo-only.txt: "foo line without that other word" ← should match via `foo` branch barbaz.txt: "contains barbaz here" before: only barbaz.txt matched — foo-only.txt silently dropped after: both files match ``` ## Fix Always extract from `regex.source` (post-compile). That way the extractor sees what the regex engine will actually run — including any structural changes from `compilePattern`'s inline-flag handling. `regex.flags` is authoritative for case-sensitivity either way. One-line change plus updated comment explaining the constraint. ## Tests - `grep.test.ts`: new end-to-end regression test with the `(?i:foo|bar)baz` pattern against two files — verifies both branches match after the fix. Before the fix this test would return only the barbaz.txt match. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5633 pass, 0 fail** (+1 new regression test) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --op scan.grepFiles` — 325ms p50 (noise-equivalent; no perf regression) - [x] All 13 literal-prefilter fast path tests still pass unchanged (extraction from compiled source gives equivalent results for patterns without scoped inline flags, which is the overwhelming common case)
…char escapes (review #4) A subagent self-review found two more correctness bugs in the literal prefilter: (A) multi-char escape sequences like \x41, \u0041, \cA had their tails mistaken for literals, and (B) patterns that can match across newlines (e.g., foo\sbar) silently missed valid matches via the per-line verify path. Rather than patch each failure mode, this commit redesigns the prefilter as a FILE-LEVEL gate only and fixes escape-sequence advancement in the extractor. Net result is simpler, correct, and ~5% faster on bench. ## Bug A: multi-char escapes `extractInnerLiteral` advanced all escape sequences by 2 chars, leaving the tail chars of longer escapes as "literal" content. - `\x41foo` → returned "41foo", but compiled regex is /Afoo/. Files with "Afoo" got skipped by the gate (no "41foo" present). - `\u0041foo` → "0041foo" returned, same silent miss. - `\cAfoo` → "Afoo" returned (wrong — ctrl-A, not "A"). - `\k<name>foo` → "<name>foo" returned (backref syntax leaked as literal). - `\p{L}foo` → "{L}foo" returned (Unicode property syntax leaked). Fix: new `escapeSequenceLength` helper computes the full source-char length of all JS regex escape forms: - `\xHH` → 4 chars - `\uHHHH` → 6 chars - `\u{H+}` → variable (until `}`) - `\cX` → 3 chars - `\k<name>` → variable (until `>`) - `\p{...}`, `\P{...}` → variable (until `}`) - default (single-char like `\d`, `\b`, `\t`, `\n`, numeric backrefs like `\1`, escaped metachars `\.`, etc.) → 2 chars Applied to: the main extractor, `skipCharacterClass`, `skipGroup`, `hasTopLevelAlternation`. All four call sites now correctly advance past the whole escape sequence. ## Bug B: cross-newline patterns The old prefilter did per-line verify: after finding a candidate via `indexOf(literal)`, extracted the enclosing line and ran `verifyRegex.test(lineText)`. This broke silently for patterns whose match spans newlines: - `foo\sbar` — `\s` includes `\n`, so `foo\nbar` is a match. But the per-line verify tests "foo" alone, which fails. - `function\s+\w+` — multi-line function declarations missed. This was a REGRESSION vs the pre-PR whole-buffer path, which correctly found these matches via `regex.exec(content)`. ## Fix: file-level gate only Redesigned the prefilter: Before: literal prefilter = find candidate lines + per-line verify After: literal prefilter = file-level gate (skip entire files without the literal); fall through to whole-buffer for everything else The gate still captures most of the perf win — files without the literal are cheap to reject via indexOf, and in large trees those dominate. The per-line optimization was the part with subtle correctness failures, and it's gone. Also simplified: removed `literalIsPattern` field, `grepByLiteralPrefilter` function, `stripGlobalFlag` helper, and the Unicode length guard (no longer needed since we only use indexOf as a boolean). The Unicode concern from Cursor #2 reduces to: "does `content.toLowerCase()` still contain `literal` at any position?" — which works correctly regardless of whether `toLowerCase` changes the string's length. ## Performance Synthetic/large fixture (10k files): | Op | Before review #4 | After (this commit) | |---|---:|---:| | scan.grepFiles | 332 ms | **325 ms** | | scanCodeForDsns | 333 ms | **313 ms** | | detectDsn.cold | 5.61 ms | **4.55 ms** | | detectAllDsns.cold | 313 ms | 316 ms | File-level gate is cheaper than per-line verify (one indexOf vs many plus regex.test per candidate). Net improvement. ## Tests - `literal-extract.test.ts`: +8 tests for multi-char escape sequences (\x41, \u0041, \u{1F600}, \cA, \k<name>, \p{L}, \P{L}, plus controls verifying pre-escape literal extraction still wins). - `grep.test.ts`: +2 end-to-end tests — one for cross-newline patterns (`foo\sbar`) and one for multi-char escapes (`\x41foo` must find files with `Afoo`, not `41foo`). Both tests would fail under the old code. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5644 pass, 0 fail** (+11 new regression tests) - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --runs 5` — no regressions; scan ops slightly faster due to simpler gate
Contributor
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2277d0d. Configure here.
Cursor Bugbot caught dead code: `isPureLiteral` was exported and
tested but no longer called anywhere in the pipeline after the
previous commit simplified the prefilter to a file-level gate. The
function correctly identified pure-literal patterns but the new
design uses the literal the same way whether the whole pattern is
literal or not (file-level indexOf gate, fall through to whole-
buffer regex).
## Changes
- Deleted `isPureLiteral` export and its 4 unit tests.
- Updated the module-level docstring to reflect that pure-literal
patterns flow through the same gate-then-regex path as everything
else.
- Fixed a stale comment in `grep.test.ts` that claimed pure literals
"skip the prefilter" via `isPureLiteral` — they go through the
gate like all other patterns, and V8's regex engine handles
pure-literal patterns efficiently in the whole-buffer path.
Test pass count drops by 4 (from 5644 to 5640) — the removed tests
were all coverage for dead code.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning unrelated)
- [x] `bun test --timeout 15000 test/lib/scan/` — **228 pass, 0 fail**
- [x] All existing behavior tests still pass (the `pure literal`
grep test now correctly describes what the code does)
…esign
Pre-merge review found two stale docstrings describing the old
per-line prefilter design that no longer exists after the
simplification to a file-level gate.
## Changes
- `src/lib/scan/literal-extract.ts`: top-of-file docstring now
correctly describes the extractor as feeding a FILE-LEVEL gate,
not a per-line prefilter. Updated the "gate effectiveness"
section to talk about files rejected vs. the old (stale)
"lines before touching the regex engine" framing. Removed the
reference to `isPureLiteral` in the pure-literal paragraph
(function was deleted in the previous commit).
- `src/lib/scan/grep.ts`: top-of-file docstring rewritten. The old
version described `content.split("\n") + regex.test(line)` but
`grepByWholeBuffer` actually uses `regex.exec(content)` on the
full buffer with per-match line-boundary extraction. New
docstring describes the two-stage gate-then-whole-buffer design
and points at `setupGrepPipeline` for literal extraction.
No behavioral changes. Lint clean, all 228 scan tests pass.
8 tasks
BYK
added a commit
that referenced
this pull request
Apr 21, 2026
…#805) ## Summary Walker was 411ms p50 for 10k files on the synthetic/large fixture — roughly 10× slower than a naive sync walk (50ms). Follow-up to PR #791 and #804. Five targeted optimizations in the hot loop, all micro-benchmarked before landing: | Change | Measured win | End-to-end win | |---|---|---| | `readdirSync` instead of `readdir` | 105→45ms (−60ms) | dominant | | `statSync` instead of `Bun.file().size` | −10ms per 10k files | | | String concat for paths | path.join 7ms → concat 0.7ms (10×) | | | `abs.slice(cwdPrefixLen)` for relPath | path.relative 9ms → slice 0.8ms (11×) | | | Manual `lastIndexOf`-based extname | 9ms → 7ms (25%) | | ### Perf results (synthetic/large, 10k files, p50) | Op | Before | After | Δ | |---|---:|---:|---:| | `scan.walk` | 411ms | **231ms** | **−44%** | | `scan.walk.noExt` | 572ms | 448ms | **−22%** | | `scan.walk.dsnParity` | 228ms | **138ms** | **−39%** | | `scanCodeForDsns` | 323ms | 304ms | −6% | | `detectAllDsns.cold` | 327ms | 308ms | −6% | | `detectAllDsns.warm` | 27.9ms | 27.0ms | — | | `scan.grepFiles` | 322ms | 316ms | noise | | `scanCodeForFirstDsn` | 2.23ms | 2.05ms | — | Walker ops are 22-44% faster. Downstream ops (grep, DSN scanner) benefit less because their time is dominated by content scanning, not walking — but still show consistent ~6% improvements. ## Why `readdirSync` is safe here The sync vs async tradeoff usually favors async because blocking the event loop is bad in general. But measured per-call cost matters: | Metric | `readdirSync` on 3635 dirs | |---|---| | p50 | 11µs | | p95 | 24µs | | p99 | 35µs | | max | 65µs | 65µs max block is trivial — `setTimeout(0)` latency in Node is ~4ms. Blocking for 65µs never causes noticeable event-loop pauses. And we pay ~60µs of microtask overhead for each async readdir, which wipes out any theoretical fairness benefit. Net: 2-3× faster per-dir on walks with many small directories, which is every realistic CLI workload. If this ever matters for a weird embedded use case, the optimization is trivially reversible — the walker's public API is unchanged. ## mtime parity fix Initial implementation regressed `detectAllDsns.warm` from 28ms → 304ms because `statSync().mtimeMs` is a float (e.g. `1776790602458.1033`) while `Bun.file().lastModified` is already an integer. The DSN cache validator compares floored `sourceMtimes`, so un-floored floats caused cache misses on every warm call. Fixed by flooring explicitly in `tryYieldFile` — matches the same treatment already applied to `onDirectoryVisit`'s dirMtimes. ## Walker v2 design notes - **Sync readdir + async yield.** Readdir blocks briefly (~65µs max), but the surrounding async-generator interface is preserved. Library consumers that insert async work between iterations still get cooperative scheduling. - **Path ops inlined.** `path.join` / `path.relative` / `path.extname` + `toLowerCase` are all replaced with manual string ops in the hot loop. On Windows, `normalizePath` is still applied via `replaceAll(NATIVE_SEP, "/")` — the POSIX fast path uses a cached `POSIX_NATIVE` module constant. - **`WalkContext.cwdPrefixLen`** precomputed once per walk, used to slice relative paths from absolute paths. - **Cached module constants** `NATIVE_SEP` and `POSIX_NATIVE` avoid per-call `path.sep` property lookups in the hot loop. ## What this PR does NOT change - `WalkEntry` shape: preserved identically (absolutePath, relativePath, size, mtime, isBinary, depth). - `WalkOptions` contract: no new options, no semantics changes. - Symlink handling, `followSymlinks`, cycle detection via `visitedInodes`: unchanged. - IgnoreStack + gitignore semantics: unchanged. - `onDirectoryVisit`, `recordMtimes`, `abortSignal`: unchanged. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `bun test test/lib/scan/walker.test.ts` — 34 pass (incl. property tests covering hidden files, symlinks, maxDepth, gitignore interaction) - [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass (incl. dirMtimes / sourceMtimes cache validation) - [x] DSN count correctness verified end-to-end: 4 DSNs found on fixture (matches pre-change count) - [x] `bun run bench --size large --runs 5 --warmup 2` — results in table above 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS
collectGrep/grepFiles. Follow-up to PR #791 and #797.importfromimport.*from), scan the buffer withindexOfto locate candidate lines, only invoke the regex engine on lines that contain the literal. V8'sindexOfis roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from.charCodeAt-walk forindexOf("\n", cursor)hops. 2-5× faster on the line-counting sub-loop because V8 implementsindexOfin C++ without per-iteration JS interop.Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)
scan.grepFiles(DSN pattern)detectAllDsns.colddetectDsn.coldscanCodeForFirstDsnscanCodeForDsnsimport.*fromuncapped (bench)The DSN workloads improve because
DSN_PATTERNextractshttpas its literal — most source files don't containhttpat all, so the prefilter short-circuits before the regex runs.No regressions on any benchmark. Pure-literal patterns (e.g.,
SENTRY_DSN,NONEXISTENT_TOKEN_XYZ) continue through the whole-buffer path unchanged.What changed
New file:
src/lib/scan/literal-extract.ts(~300 LOC)Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes.
Handles escaped metacharacters intelligently:
Sentry\.inityieldsSentry.init(extracted via literal\.→.), while\bfoo\byieldsfoo(escape\bis an anchor, not a literalb).Exports:
extractInnerLiteral(source, flags)— returns the literal, or null if no safe extraction possible. Honors/iby lowercasing.isPureLiteral(source, flags)— true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there).Modified:
src/lib/scan/grep.ts(~240 LOC changes)Three-way dispatch in
readAndGrepbased on the extracted literal:grepByLiteralPrefilter(new) — regex with extractable literal +multiline: true. UsesindexOf(literal)to find candidate lines, runs the regex engine only on those. This is the main perf win.grepByWholeBuffer— existing path, used for:multiline: falsemode (the fast path requires per-line semantics)Also: replaced the
charCodeAt-walk that counted newlines char-by-char with anindexOf("\n", cursor)hop loop. ExtractedbuildMatch(ctx, bounds)as a shared helper to bundle the match-construction arguments.Tests added
test/lib/scan/literal-extract.test.ts— 39 tests covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length).test/lib/scan/grep.test.ts— 7 new tests for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing.Why this approach
From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its
String.prototype.indexOfis compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case.Three of the five techniques in the Loggly regex-perf guide were evaluated:
.*—DSN_PATTERNalready uses[a-z0-9]+, no change needed.DSN_PATTERN's(?:\.[a-z]+|:[0-9]+)is already correctly ordered (.more common than:in DSN hosts); swapping regressed perf by noise.\btoDSN_PATTERNregressed perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead.The remaining gap with rg is now primarily orchestration overhead (async/await,
mapFilesConcurrent, walker correctness features) rather than regex speed. A worker-pool exploration may follow.Test plan
bunx tsc --noEmit— cleanbun run lint— clean (1 pre-existing warning insrc/lib/formatters/markdown.tsunrelated to this PR)bun test --timeout 15000 test/lib test/commands test/types— 5610 pass, 0 fail (+58 new)bun test test/isolated— 138 pass, 0 failbun run bench --size large --runs 5— all scan ops at or below previous baselinecollectGrepreturns identicalGrepMatch[]on prefilter vs whole-buffer paths for patterns where the prefilter fires🤖 Generated with Claude Code
Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com