feat(scan): pure-TS ripgrep-compatible scanner + DSN migration + perf overhaul (#791)
Semver Impact of This PR: 🟡 Minor (new features)
Codecov Results 📊 — ✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

✨ No test changes detected; all tests are passing. ✅ Patch coverage is 98.05%. Project has 1730 uncovered lines. Files with missing lines: 6.
Coverage diff:

```
@@            Coverage Diff             @@
##             main     #791     +/-  ##
==========================================
+ Coverage   95.58%   95.65%   +0.07%
==========================================
  Files         264      278      +14
  Lines       38454    39794    +1340
  Branches        0        0        —
==========================================
+ Hits        36755    38064    +1309
- Misses       1699     1730      +31
- Partials        0        0        —
```

Generated by Codecov Action
…dedup

Three findings from Cursor Bugbot + Seer on PR #791:

**Seer (MEDIUM): `followSymlinks: true` silently drops symlinks**

`readdir({withFileTypes: true})` uses lstat semantics — on a symlink, `isSymbolicLink()` returns true while `isFile()` and `isDirectory()` both return false. The existing `processEntry` only routed entries via `isDirectory()` / `isFile()`, so a symlink with `followSymlinks: true` passed the skip check but then fell through and returned null. The `inodeKey`/`visitedInodes` cycle detection in `maybeDescend` was unreachable dead code.

Fix: when a symlink is encountered with `followSymlinks: true`, `stat()` (not `lstat`) the target to learn its real type, and route to `maybeDescend` or `tryYieldFile` accordingly. Broken symlinks and stat errors are silently skipped (matching ripgrep's behavior). Circular symlinks are broken by the existing inode-based cycle detection.

Tests: 4 new symlink cases in `walker.test.ts` covering default-skip, follow-through, circular cycle detection, and broken symlinks.

**Cursor #2 (MEDIUM): concurrent grep workers share a stateful RegExp**

When a pre-compiled `/gm` RegExp was passed to `grepFiles`, `ensureGlobalMultilineFlags` returned it unchanged. Concurrent `readAndGrep` workers then shared the regex's `lastIndex` property. Strictly speaking this was safe today — JS is single-threaded and the match loop has no `await`, so workers can't actually interleave mid-iteration — but any future `await` inside the loop would turn it into a silent match-dropping bug.

Fix: always clone the regex at the start of `readAndGrep` via `new RegExp(src, flags)`. ~1µs per file, and it eliminates the foot-gun. Added a regression test that exercises the shared-regex path with multiple concurrent workers.

**Cursor #1 (LOW): shared utilities duplicated between grep and glob**

`compileMatchers`, `matchesAny`, `basenameOf`, `walkerRoot`, and `joinPosix` were duplicated between `src/lib/scan/grep.ts` and `glob.ts`. Extracted to a new internal module `src/lib/scan/path-utils.ts` (not part of the public barrel — an implementation detail). Net LOC reduction: ~80 lines across the two files.

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5443 pass, 0 fail** (+5 from this commit)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — no regression on `scanCodeForDsns` (312ms) or `scan.grepFiles` (278ms)
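The shared-`lastIndex` hazard and the clone fix are small enough to sketch standalone. `countMatches` below is a toy, not the PR's `readAndGrep` — only the clone line mirrors the actual fix:

```typescript
// A /g regex carries mutable state in lastIndex. If two callers share the
// object, one caller's progress corrupts the other's matching. The fix:
// clone at the entry point so each caller owns its own state.
function countMatches(re: RegExp, text: string): number {
  const local = new RegExp(re.source, re.flags); // private lastIndex
  let n = 0;
  while (local.exec(text) !== null) n++;
  return n;
}

const shared = /foo/gm;
shared.lastIndex = 4; // simulate another "worker" mid-iteration
console.log(countMatches(shared, "foo foo")); // 2 — the clone ignores the dirty state
```

Without the clone, the second caller would resume matching from `lastIndex: 4` and silently drop the first match.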
… overhaul
Introduces `src/lib/scan/` — a pure-TypeScript, dependency-free-at-runtime
ripgrep-compatible file scanner (walker + ignore + binary detection + grep +
glob) — and migrates the DSN scanner at `src/lib/dsn/code-scanner.ts` to
consume it. Also ships a local benchmark harness, a data-driven concurrency
tuning pass, and two rounds of `IgnoreStack` optimization.
- **`src/lib/scan/`** (~2.7k LOC across 10 source files):
walker (DFS with time budget + monorepo descent hook + gitignore), grep
(`AsyncIterable<GrepMatch>` + `collectGrep`), glob (picomatch-backed),
shared `mapFilesConcurrent` helper, regex helpers
(`compilePattern`, `ensureGlobalFlag`, `ensureGlobalMultilineFlags`).
Policy-free core — the DSN scanner composes its own preset via
`src/lib/dsn/scan-options.ts`.
- **DSN scanner migration**: `src/lib/dsn/code-scanner.ts` shrinks from
**715 → 435 LOC** (−280). All 5 public exports + wire shapes preserved
byte-identical. Telemetry span/attributes unchanged. New behavior: respects
nested `.gitignore` files (matches git's cumulative semantics — DSNs in
files covered by a subdirectory `.gitignore` are no longer falsely
detected).
- **Benchmark harness**: `script/bench.ts` + `script/bench-sweep.ts`,
deterministic xorshift32-seeded fixture generator at
`test/fixtures/bench/`. Three presets (small/medium/large ≈ 100/1k/10k
files). `bun run bench`, `bench:save`, `bench:compare`, `bench:sweep`.
Baselines gitignored (machine-specific).
- **Concurrency tuning**: data-driven sweep found the knee at
`availableParallelism()` on a 4-core box (old default `50` was 8% slower
at the fixture size). `CONCURRENCY_LIMIT` now `Math.max(2, cores)`.
- **IgnoreStack rewrite**: split into `#rootIg` + `#nestedByRelDir` with a
root-only fast path. Per-query cost dropped from 3.76µs → 0.27µs (14×)
on the common case, 5.76µs → 1.93µs (3×) on nested.
- **Grep whole-buffer matchAll**: replaced `content.split("\n")` + per-line
`regex.test` with whole-buffer iteration via `regex.exec` + lazy line-number
counting. 12× faster on profiled grep CPU phase.
- **Literal-prefix fast path** in DSN scanner: short-circuits `matchAll` when
content has neither `"http"` nor `"HTTP"`. ~3% improvement on large fixture.
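The nested-`.gitignore` semantics mentioned above (git's cumulative rule: ignore files layer root-to-leaf, with deeper `!` patterns able to re-include) can be illustrated with a toy. This handles literal names only — the real `IgnoreStack` delegates pattern matching to the `ignore` package, and the types here are hypothetical:

```typescript
// Toy cumulative .gitignore evaluation: literal names, no globs.
// Deeper .gitignore files are applied after shallower ones, so a nested
// "!dist" can re-include what the root "dist" ignored.
type IgnoreFile = { dir: string; patterns: string[] }; // "" = repo root; "!" negates

function isIgnored(relPath: string, stack: IgnoreFile[]): boolean {
  let ignored = false;
  for (const { dir, patterns } of stack) {
    // Only .gitignore files on the path to the file apply.
    if (dir !== "" && !relPath.startsWith(dir + "/")) continue;
    const local = dir === "" ? relPath : relPath.slice(dir.length + 1);
    for (const p of patterns) {
      const negated = p.startsWith("!");
      const pat = negated ? p.slice(1) : p;
      // Literal match against any path segment.
      if (local === pat || local.split("/").includes(pat)) ignored = !negated;
    }
  }
  return ignored;
}

console.log(isIgnored("packages/app/dist/index.js", [
  { dir: "", patterns: ["dist"] },
])); // true — root ignores dist/
console.log(isIgnored("packages/app/dist/index.js", [
  { dir: "", patterns: ["dist"] },
  { dir: "packages/app", patterns: ["!dist"] },
])); // false — nested .gitignore re-includes it
```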
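The whole-buffer `matchAll` + lazy line-number counting described in the grep bullet above can be sketched as follows (simplified; not the actual `readAndGrep`):

```typescript
// Whole-buffer matching: run the regex over the entire content instead of
// splitting into lines, and count newlines lazily — only up to each match.
function grepBuffer(
  content: string,
  pattern: RegExp,
): Array<{ line: number; text: string }> {
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags); // matchAll requires /g
  const out: Array<{ line: number; text: string }> = [];
  let line = 1;
  let cursor = 0;
  for (const m of content.matchAll(re)) {
    // Hop newline-to-newline with indexOf instead of walking char-by-char.
    let nl: number;
    while ((nl = content.indexOf("\n", cursor)) !== -1 && nl < (m.index as number)) {
      line++;
      cursor = nl + 1;
    }
    out.push({ line, text: m[0] });
  }
  return out;
}

console.log(grepBuffer("a\nb foo\nfoo c\n", /foo/));
// [ { line: 2, text: 'foo' }, { line: 3, text: 'foo' } ]
```

Files with zero matches never pay for line counting at all — the loop body simply never runs.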
| Op | Baseline | Now | Δ |
|-----------------------|---------:|-------:|------------------------|
| `detectDsn.cold` | 179ms | 3.9ms | **46× faster** |
| `scanCodeForFirstDsn` | 183ms | 2.1ms | **88× faster** |
| `scanCodeForDsns` | 298ms | 300ms | parity |
| `detectAllDsns.cold` | 304ms | 310ms | parity |
| `scan.ignore` (µbench)| (new op) | 2.0ms | 81% faster than pre-opt|
The common-case hot paths (`detectDsn.cold`, `scanCodeForFirstDsn`) are
dramatically faster thanks to `mapFilesConcurrent`'s cooperative early-exit
flag. Find-all paths hold at parity while gaining nested-gitignore correctness,
richer cache coverage (`dirMtimes`), and a 40% LOC reduction.
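The cooperative early-exit flag can be sketched with a simplified, limiter-free stand-in for `mapFilesConcurrent` (names and shapes here are illustrative, not the module's real signature):

```typescript
// Cooperative early exit: a shared flag that queued work checks before
// starting, so a stop-on-first hit cancels the rest of the walk cheaply.
async function mapFilesConcurrent<T>(
  files: string[],
  worker: (f: string, earlyExit: { done: boolean }) => Promise<T | null>,
): Promise<T[]> {
  const earlyExit = { done: false };
  const results: T[] = [];
  await Promise.all(
    files.map(async (f) => {
      if (earlyExit.done) return; // bail before doing any work
      const r = await worker(f, earlyExit);
      if (r !== null) results.push(r);
    }),
  );
  return results;
}

async function main() {
  let processed = 0;
  const found = await mapFilesConcurrent(
    Array.from({ length: 100 }, (_, i) => `file-${i}.ts`),
    async (f, flag) => {
      processed++;
      if (f === "file-0.ts") {
        flag.done = true; // first hit: tell remaining workers to stop
        return "dsn";
      }
      return null;
    },
  );
  console.log(found.length, processed); // 1 1 — 99 files never processed
}
main();
```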
Non-goals, explicitly rejected with evidence:

- **Worker threads**: profiling showed CPU work (45ms DSN regex) is smaller
than worker bootstrap + postMessage serialization. Net negative.
- **Replace `ignore` package with picomatch**: gitignore syntax isn't
compatible with glob syntax. Would need a 200-400 LOC translator with
subtle correctness edge cases.
## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in markdown.ts)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5433 pass, 0 fail** (+152 new scan + bench + dsn tests)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — all success criteria met, benchmarks stable
Addresses two issues raised in PR 791 code review:

**C1: Mixed-case DSN schemes silently dropped** (regression in PR 791)

The literal-prefix fast path in `extractDsnsFromContent` used two case-sensitive `indexOf` calls (`"http"` and `"HTTP"`), which covered only all-lowercase and all-uppercase scheme prefixes. Mixed-case URLs like `Https://...` or `hTtP://...` returned `[]` even though the `DSN_PATTERN` regex (with the `/i` flag) would have matched.

Impact: any source file with non-standard scheme casing (rare but possible in generated code or test fixtures) would silently fail DSN auto-detection — `sentry issue list` would behave as if no DSN existed. It also hits the cache-verifier path via `extractFirstDsnFromContent`.

Fix: replaced the two `indexOf` calls with `/http/i.test(content)`. Adds ~16ms on a 10k-file scan (V8 regex vs raw `indexOf`), a trade we accept for correctness.

**C2: Property test couldn't catch the mixed-case regression**

The property test for the fast path used a lowercase-only arbitrary charset, so it could never generate the input class where the old fast path diverged from the slow path. Widened `BENIGN_CHARS` to include uppercase letters and tightened the JSDoc explaining the invariant. Also added an explicit `test.each` regression suite in `code-scanner.test.ts` pinning detection across five scheme casings (all-lower, all-upper, title, mixed, http no-s).

**Concurrency retune: back to `availableParallelism`**

Previously set to `Math.max(2, availableParallelism())` in PR 3.5. First attempted to raise it to `cores × 4` (up to 32) based on a microbenchmark showing that pure-I/O per-file work has a knee at ~16-32 in-flight tasks. The measurement was correct but didn't reflect the real workload: every current scan-module caller is walker-fed, and the walker's serial `readdir` descent is the bottleneck. Raising concurrency regressed `scanCodeForDsns` ~15% because more in-flight tasks created scheduling pressure without additional useful work.

Final choice: `Math.min(16, Math.max(2, availableParallelism()))` — the same as before this commit (4 on a 4-core box, floor of 2 for tiny CI runners, cap of 16 for large workstations). The JSDoc now honestly documents the tension between walker-fed and pure-I/O workloads and tells callers they can override per call via `opts.concurrency` when they know the shape of their work.

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5438 pass, 0 fail** (+5 new parametrized regression tests)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — `scanCodeForDsns` 312ms (PR 0 baseline 298ms, +4.7%, under the 1.2× bar); hot-path ops `detectDsn.cold` 3.76ms / `scanCodeForFirstDsn` 2.10ms unchanged (46-87× faster than baseline)
- [x] Manual: mixed-case DSN casings (`Https://`, `hTtP://`, `HttpS://`) all detected post-fix
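The fixed probe can be sketched like this. `DSN_PATTERN` below is a simplified stand-in, not the scanner's actual regex — the point is the case-insensitive short-circuit:

```typescript
// Literal-prefix fast path with a case-insensitive probe. A single /i test
// covers "http", "HTTP", and every mixed casing — the two case-sensitive
// indexOf calls it replaces missed "Https", "hTtP", etc.
const DSN_PATTERN = /https?:\/\/[a-f0-9]+@[a-z0-9.]+\/\d+/gi;

function extractDsns(content: string): string[] {
  if (!/http/i.test(content)) return []; // ~92% of files exit here
  return Array.from(content.matchAll(DSN_PATTERN), (m) => m[0]);
}

console.log(extractDsns("const x = 1;")); // [] — fast exit, matchAll never runs
console.log(extractDsns("dsn: 'Https://abc123@o1.ingest.sentry.io/42'")); // one match
```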
… truncation overshoot

Two follow-up findings from Cursor Bugbot on commit dd4fde0:

**Cursor (MEDIUM): error path leaks partial dirMtimes**

The `scanCodeForDsns` catch block was returning `{ dsns: [], sourceMtimes: {}, dirMtimes }`, where the last field carried partial state populated by `onDirectoryVisit` callbacks that fired before the walker threw. If a downstream caller cached this partial result, the cache verifier would later check only the dirs we happened to reach pre-error and silently bless the cache for dirs the walker never visited — missing DSNs in those unvisited dirs.

Fix: return all three maps empty on error. This matches the pre-PR-3 scanner's behavior and forces a full rescan on the next attempt.

**Cursor (LOW): collectGrep truncation off-by-one**

`collectGrep` forwarded `maxResults` directly to `grepFilesInternal`, which flipped `stats.truncated = true` the moment `matchesEmitted >= maxResults`. So asking for `maxResults: 3` against a corpus with exactly 3 matches (and no more) falsely reported truncation.

Fix: mirror `collectGlob`'s `+1` overshoot pattern. Ask the iterator for `maxResults + 1` internally; stop draining at `matches.length >= maxResults`; only set `truncated = true` if we actually saw the overshoot match. This also matches our own documented lore about the pattern (which I violated — the reviewer's catch).

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5445 pass, 0 fail** (+2 regression tests)
- [x] `bun test test/isolated` — 138 pass
- [x] Neither fix affects the happy-path runtime (the error branch never runs; the truncation probe adds one iteration only when the corpus has strictly more than `maxResults` matches)
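The `+1` overshoot pattern is small enough to sketch standalone (toy iterator, not `grepFilesInternal`):

```typescript
// Toy bounded source: yields at most `limit` matches.
async function* grep(matches: string[], limit: number) {
  let emitted = 0;
  for (const m of matches) {
    if (emitted >= limit) return;
    yield m;
    emitted++;
  }
}

// Ask for one extra: only if the overshoot match actually arrives did the
// corpus really have more than maxResults matches.
async function collect(all: string[], maxResults: number) {
  const out: string[] = [];
  let truncated = false;
  for await (const m of grep(all, maxResults + 1)) {
    if (out.length >= maxResults) {
      truncated = true; // we saw the probe match — genuinely more results
      break;
    }
    out.push(m);
  }
  return { matches: out, truncated };
}

async function main() {
  console.log((await collect(["a", "b", "c"], 3)).truncated);      // false — exactly 3
  console.log((await collect(["a", "b", "c", "d"], 3)).truncated); // true  — strictly more
}
main();
```

Forwarding `maxResults` directly (the bug) cannot distinguish "exactly N" from "more than N".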
…onors opt-out

Two more findings from Cursor Bugbot on commit 5347702:

**Cursor (MEDIUM): producer errors swallowed on consumer break**

`mapFilesConcurrentStream` placed the `throw producerError` after its `try/finally` block. When a consumer `break`s early, the runtime calls the generator's `return()`, which runs `finally` but does NOT execute any statement after the try block. Producer errors were therefore silently lost on the break path (though visible on drain-to-completion). Verified with a standalone repro:

```js
async function* gen() {
  try {
    yield 1;
    yield 2;
  } finally {
    console.log("finally ran");
  }
  throw new Error("dead code");
}

for await (const x of gen()) {
  if (x === 1) break;
}
// → "finally ran" prints; the error does NOT surface.
```

Fix: move the rethrow inside the `finally` block. Added a regression test that reverts cleanly — when the fix is removed, the test fails with `caught === null`.

**Cursor (LOW): `multiline: false` silently overridden**

`readAndGrep` unconditionally applied `/m` via `ensureGlobalMultilineFlags`, making the `GrepOptions.multiline` option a no-op. A caller passing `multiline: false` expecting buffer-boundary anchoring got line-boundary anchoring instead. Root cause: the whole-buffer `matchAll` rewrite in PR 3.5 needed `/m` to preserve grep's line-boundary semantics (`^foo` matches any line starting with `foo`), but I hardcoded it rather than making the default explicit.

Fix: the `multiline` option now defaults to `true` (grep-like) and honors `multiline: false` when explicitly set (buffer-boundary JS semantics). Updated the JSDoc in `types.ts` to document the new contract.

Added two regression tests:

- default: `^foo` matches mid-file lines (grep-like)
- `multiline: false`: `^foo` only matches at buffer start

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5448 pass, 0 fail** (+3 from this commit)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large` — no perf regression (per-op: scan.grepFiles 313ms, scan.walk.dsnParity 212ms, scan.ignore 1.74ms, detectDsn.cold 5.85ms, scanCodeForFirstDsn 4.24ms, scanCodeForDsns 365ms)
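The anchoring difference behind the `multiline` option, in two lines:

```typescript
// /m makes ^ match at every line start (grep-like); without it, ^ anchors
// only at the start of the whole buffer (plain JS semantics).
const buf = "bar\nfoo\n";
console.log(/^foo/m.test(buf)); // true  — line-boundary anchoring
console.log(/^foo/.test(buf)); // false — buffer-boundary anchoring
```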
…sult

Follow-up to a Cursor Bugbot finding on the rebased commit:

`processEntry` returned `[]` for files with no DSNs, which `mapFilesConcurrent` pushes into its results array and fires `onResult` for (as a no-op over an empty array). On a 10k-file walk where ~99% of files contain no DSN, this accumulates ~9900 wasted `onResult` invocations and pushes ~9900 empty arrays into the otherwise-unused results buffer.

Fix: return `null` from `processEntry` when the file contributes nothing (no raw matches, all matches rejected by host validation, or the file was unreadable). `mapFilesConcurrent` already filters `null` results before invoking `onResult` and before pushing to results, so the empty-work path is fully skipped.

This also tidies a parallel subtlety: when `detected.length === 0` (raw matches existed but host validation rejected all of them), we now skip the sourceMtimes recording AND the onResult call, consistent with "no DSNs here". Previously we'd push an empty array and record no mtime but still fire onResult — an inconsistent signal.

Type change: `processEntry(...)` now returns `Promise<DetectedDsn[] | null>`. Private function, no external API impact.

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib/dsn/` — **150 pass, 0 fail** (DSN semantics unchanged: 4 DSNs, 76 sourceMtimes, 461 dirMtimes on the medium fixture)
- [x] `bun test test/lib/scan/` — 153 pass (scan module untouched)
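The null-skip contract can be sketched with a simplified (sequential) stand-in for `mapFilesConcurrent` — the real helper is concurrent, but the filtering behavior is the same:

```typescript
// A null result means "this file contributed nothing": it is dropped before
// onResult fires and before it reaches the results buffer.
async function mapFiles<T>(
  files: string[],
  process: (f: string) => Promise<T | null>,
  onResult: (r: T) => void,
): Promise<T[]> {
  const results: T[] = [];
  for (const f of files) {
    const r = await process(f);
    if (r === null) continue; // empty-work path fully skipped
    onResult(r);
    results.push(r);
  }
  return results;
}

async function main() {
  let calls = 0;
  const out = await mapFiles(
    ["a.ts", "b.ts", "c.ts"],
    async (f) => (f === "b.ts" ? ["dsn"] : null), // most files contribute nothing
    () => calls++,
  );
  console.log(calls, out.length); // 1 1 — no wasted callbacks or empty arrays
}
main();
```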
Two low-severity findings from Cursor Bugbot on commit 5bb480c:

**`filesRead` counter incremented before read succeeded**

`grepFilesInternal`'s per-file callback bumped `opts.stats.filesRead` *before* `readAndGrep` executed. `GrepStats` documents the field as "files whose content was read and tested against the pattern", but the actual count included failed opens (`readAndGrep` returns `null` on read error). Fixed: increment only after a non-null return.

**`grepFiles` and `collectGrep` duplicated pipeline setup**

Both entry points had ~15 lines of identical code for pattern compilation, matcher compilation, default resolution, walker construction, and filter wiring. Any future change to defaults needed to be applied to both; a divergence between them would be a silent correctness bug. Extracted to a shared `setupGrepPipeline()` helper that returns a `GrepPipelineSetup` bundle consumed by both entry points. `collectGrep` still manages its own `+1 probeLimit` override and the collector-side truncation flag after the shared setup runs.

## Test plan

- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning)
- [x] `bun test test/lib test/commands test/types` — **5464 pass, 0 fail**
- [x] `bun test test/lib/scan/` — 153 pass (same)
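The counter fix in miniature — `readAndGrep` here is a toy stand-in for the PR's helper, keeping only its "null on read error" contract:

```typescript
import { readFile } from "node:fs/promises";

// Count a file as "read" only after the read actually succeeded.
async function readAndGrep(path: string): Promise<string[] | null> {
  try {
    const content = await readFile(path, "utf8");
    return content.includes("foo") ? ["found"] : [];
  } catch {
    return null; // unreadable — must not bump filesRead
  }
}

async function main() {
  let filesRead = 0;
  for (const p of ["/no-such-file-a", "/no-such-file-b"]) {
    const result = await readAndGrep(p);
    if (result !== null) filesRead++; // increment only on success
  }
  console.log(filesRead); // 0 — both opens failed
}
main();
```

Incrementing before the call (the bug) would report 2 here, contradicting the documented meaning of the field.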
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7bddf97. Configure here.
```ts
let ignored = rootResult.ignored && !rootResult.unignored;
if (rootResult.unignored) {
  ignored = false;
}
```
Redundant conditional in isIgnored Tier 2 path
Low Severity
In the Tier 2 path of `isIgnored`, lines 216-218 (`if (rootResult.unignored) { ignored = false; }`) are dead code. Line 215 already computes `ignored = rootResult.ignored && !rootResult.unignored`, which correctly handles every combination: when `unignored` is true, the `&&` with `!unignored` already yields false. The subsequent `if` block can never change the value of `ignored`.
## Summary

Two regex-level optimizations to narrow the perf gap with ripgrep in our pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797.

- **Literal prefilter** — ripgrep-style: extract a literal substring from the regex source (e.g., `import` from `import.*from`), scan the buffer with `indexOf` to locate candidate lines, and only invoke the regex engine on lines that contain the literal. V8's `indexOf` is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from.
- **Lazy line counting** — swapped the `charCodeAt` walk for `indexOf("\n", cursor)` hops. 2-5× faster on the line-counting sub-loop because V8 implements `indexOf` in C++ without per-iteration JS interop.

## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** |
| `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** |
| `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** |

The DSN workloads improve because `DSN_PATTERN` extracts `http` as its literal — most source files don't contain `http` at all, so the prefilter short-circuits before the regex runs. No regressions on any benchmark. Pure-literal patterns (e.g., `SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer path unchanged.

## What changed

### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes.
Handles escaped metacharacters intelligently: `Sentry\.init` yields `Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields `foo` (the escape `\b` is an anchor, not a literal `b`). Exports:

- `extractInnerLiteral(source, flags)` — returns the literal, or null if no safe extraction is possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)` — true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there).

### Modified: `src/lib/scan/grep.ts` (~240 LOC changed)

Three-way dispatch in `readAndGrep` based on the extracted literal:

1. **`grepByLiteralPrefilter`** (new) — regex with an extractable literal + `multiline: true`. Uses `indexOf(literal)` to find candidate lines and runs the regex engine only on those. This is the main perf win.
2. **`grepByWholeBuffer`** — existing path, used for:
   - Pure-literal patterns (V8 handles them optimally)
   - Patterns with no extractable literal (complex regex, top-level alternation)
   - `multiline: false` mode (the fast path requires per-line semantics)

Also: replaced the `charCodeAt` walk that counted newlines char-by-char with an `indexOf("\n", cursor)` hop loop, and extracted `buildMatch(ctx, bounds)` as a shared helper to bundle the match-construction arguments.

### Tests added

- `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitivity, minimum length).
- `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing.
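A toy version of conservative literal extraction, to make the idea concrete. This handles only letters/digits, escaped dots, and the `. * + ?` metacharacters, and simply bails on anything else — the real `literal-extract.ts` covers far more (e.g., it recovers `foo` from `\bfoo\b`, where this toy just returns null):

```typescript
// Find the longest contiguous literal run that EVERY match of the pattern
// must contain; return null when no safe extraction is possible.
function extractInnerLiteral(source: string): string | null {
  let best = "";
  let cur = "";
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\" && source[i + 1] === ".") {
      cur += "."; // escaped metachar is a guaranteed literal byte
      i++;
    } else if (/[a-zA-Z0-9_ ]/.test(ch)) {
      cur += ch;
    } else if (ch === "." || ch === "*" || ch === "+" || ch === "?") {
      // * and ? make the preceding char optional — drop it from the run.
      if ((ch === "*" || ch === "?") && cur.length > 0) cur = cur.slice(0, -1);
      if (cur.length > best.length) best = cur;
      cur = "";
    } else {
      return null; // bail on classes, groups, alternation, anchors, …
    }
  }
  if (cur.length > best.length) best = cur;
  return best.length >= 3 ? best : null; // short literals aren't worth a prefilter
}

console.log(extractInnerLiteral("import.*from")); // "import"
console.log(extractInnerLiteral("Sentry\\.init")); // "Sentry.init"
console.log(extractInnerLiteral("https?")); // "http" — the DSN prefilter literal
```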
## Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly, but its `String.prototype.indexOf` is compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were evaluated:

- **Character classes over `.*`** — `DSN_PATTERN` already uses `[a-z0-9]+`; no change needed.
- **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is already correctly ordered (`.` is more common than `:` in DSN hosts); swapping regressed perf by noise.
- **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed* perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary-check overhead.

The remaining gap with rg is now primarily orchestration overhead (async/await, `mapFilesConcurrent`, walker correctness features) rather than regex speed. A worker-pool exploration may follow.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts`, unrelated to this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5610 pass, 0 fail** (+58 new)
- [x] `bun test test/isolated` — 138 pass, 0 fail
- [x] `bun run bench --size large --runs 5` — all scan ops at or below the previous baseline
- [x] Manually verified semantic parity: `collectGrep` returns identical `GrepMatch[]` on the prefilter vs whole-buffer paths for patterns where the prefilter fires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#805)

## Summary

The walker was 411ms p50 for 10k files on the synthetic/large fixture — roughly 10× slower than a naive sync walk (50ms). Follow-up to PR #791 and #804. Five targeted optimizations in the hot loop, all micro-benchmarked before landing:

| Change | Measured win | End-to-end win |
|---|---|---|
| `readdirSync` instead of `readdir` | 105→45ms (−60ms) | dominant |
| `statSync` instead of `Bun.file().size` | −10ms per 10k files | |
| String concat for paths | path.join 7ms → concat 0.7ms (10×) | |
| `abs.slice(cwdPrefixLen)` for relPath | path.relative 9ms → slice 0.8ms (11×) | |
| Manual `lastIndexOf`-based extname | 9ms → 7ms (25%) | |

### Perf results (synthetic/large, 10k files, p50)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.walk` | 411ms | **231ms** | **−44%** |
| `scan.walk.noExt` | 572ms | 448ms | **−22%** |
| `scan.walk.dsnParity` | 228ms | **138ms** | **−39%** |
| `scanCodeForDsns` | 323ms | 304ms | −6% |
| `detectAllDsns.cold` | 327ms | 308ms | −6% |
| `detectAllDsns.warm` | 27.9ms | 27.0ms | — |
| `scan.grepFiles` | 322ms | 316ms | noise |
| `scanCodeForFirstDsn` | 2.23ms | 2.05ms | — |

Walker ops are 22-44% faster. Downstream ops (grep, DSN scanner) benefit less because their time is dominated by content scanning, not walking — but they still show consistent ~6% improvements.

## Why `readdirSync` is safe here

The sync-vs-async tradeoff usually favors async because blocking the event loop is bad in general. But the measured per-call cost matters:

| Metric | `readdirSync` on 3635 dirs |
|---|---:|
| p50 | 11µs |
| p95 | 24µs |
| p99 | 35µs |
| max | 65µs |

A 65µs max block is trivial — `setTimeout(0)` latency in Node is ~4ms, so blocking for 65µs never causes noticeable event-loop pauses. And we pay ~60µs of microtask overhead for each async readdir, which wipes out any theoretical fairness benefit. Net: 2-3× faster per-dir on walks with many small directories, which is every realistic CLI workload.
If this ever matters for a weird embedded use case, the optimization is trivially reversible — the walker's public API is unchanged.

## mtime parity fix

The initial implementation regressed `detectAllDsns.warm` from 28ms → 304ms because `statSync().mtimeMs` is a float (e.g. `1776790602458.1033`) while `Bun.file().lastModified` is already an integer. The DSN cache validator compares floored `sourceMtimes`, so un-floored floats caused cache misses on every warm call. Fixed by flooring explicitly in `tryYieldFile` — the same treatment already applied to `onDirectoryVisit`'s dirMtimes.

## Walker v2 design notes

- **Sync readdir + async yield.** Readdir blocks briefly (~65µs max), but the surrounding async-generator interface is preserved. Library consumers that insert async work between iterations still get cooperative scheduling.
- **Path ops inlined.** `path.join` / `path.relative` / `path.extname` + `toLowerCase` are all replaced with manual string ops in the hot loop. On Windows, `normalizePath` is still applied via `replaceAll(NATIVE_SEP, "/")` — the POSIX fast path uses a cached `POSIX_NATIVE` module constant.
- **`WalkContext.cwdPrefixLen`** precomputed once per walk, used to slice relative paths from absolute paths.
- **Cached module constants** `NATIVE_SEP` and `POSIX_NATIVE` avoid per-call `path.sep` property lookups in the hot loop.

## What this PR does NOT change

- `WalkEntry` shape: preserved identically (absolutePath, relativePath, size, mtime, isBinary, depth).
- `WalkOptions` contract: no new options, no semantics changes.
- Symlink handling, `followSymlinks`, cycle detection via `visitedInodes`: unchanged.
- IgnoreStack + gitignore semantics: unchanged.
- `onDirectoryVisit`, `recordMtimes`, `abortSignal`: unchanged.
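The inlined path ops from the design notes above can be sketched and checked against the `node:path` equivalents they replace (illustrative, not the PR's exact code; `childPath`/`extLower` are hypothetical names):

```typescript
import * as path from "node:path";

const cwd = "/repo";
const cwdPrefixLen = cwd.length + 1; // precomputed once per walk

// String concat replaces path.join on the POSIX fast path
// (Windows would normalize separators afterward).
function childPath(parentAbs: string, name: string): string {
  return parentAbs + "/" + name;
}

// lastIndexOf replaces path.extname + toLowerCase.
// dot <= 0 mirrors path.extname: no extension for "name" or ".dotfile".
function extLower(name: string): string {
  const dot = name.lastIndexOf(".");
  return dot <= 0 ? "" : name.slice(dot).toLowerCase();
}

const abs = childPath("/repo/src/lib", "walker.ts");
const rel = abs.slice(cwdPrefixLen); // replaces path.relative(cwd, abs)

console.log(abs === path.join("/repo/src/lib", "walker.ts")); // true
console.log(rel === path.relative(cwd, abs)); // true on POSIX
console.log(extLower("Walker.TS")); // ".ts"
```

The library calls allocate and branch per invocation; in a loop over 10k entries those constant costs dominate, which is what the measured 10-11× wins reflect.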
## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts`, unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/scan/walker.test.ts` — 34 pass (incl. property tests covering hidden files, symlinks, maxDepth, gitignore interaction)
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass (incl. dirMtimes / sourceMtimes cache validation)
- [x] DSN count correctness verified end-to-end: 4 DSNs found on the fixture (matches the pre-change count)
- [x] `bun run bench --size large --runs 5 --warmup 2` — results in the tables above

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


## Summary

Introduces `src/lib/scan/` — a pure-TypeScript, dependency-free-at-runtime, ripgrep-compatible file scanner — and migrates the DSN scanner at `src/lib/dsn/code-scanner.ts` to consume it. Also ships a local benchmark harness with a data-driven concurrency tuning pass, a whole-buffer `matchAll` rewrite of the grep engine, and a two-tier `IgnoreStack` optimization.

The DSN scanner shrinks from 715 → 435 LOC (−280) while keeping all public export signatures unchanged. DSN detection's common-case hot paths (`detectDsn.cold`, `scanCodeForFirstDsn`) become 46-87× faster. The exhaustive find-all path holds near parity with pre-change numbers (+4.7%, well under the 20% budget) while gaining nested-`.gitignore` correctness and `dirMtimes` cache coverage.

## What's in the module
src/lib/scan/(10 files, ~2.7k LOC, exported via a barrel):readdir({withFileTypes}), time-budgeted exploration, monorepo descent hook,onDirectoryVisitobserver for cache-invalidation callersIgnoreStackwrapping theignorenpm package with a root-only fast path + per-dir nested.gitignoresupport (matches git's cumulative semantics)mapFilesConcurrent/mapFilesConcurrentStreamwith cooperative early-exit flag (never usespLimit.clearQueue()— that hangsPromise.all)compilePattern(rg-style leading inline flags → JS RegExp flags) +ensureGlobalFlag/ensureGlobalMultilineFlagshelpersgrepFiles(AsyncIterable<GrepMatch>) +collectGrep(sortedPromise<GrepResult>). Whole-buffermatchAlliteration.globFiles+collectGlob, picomatch-backedTEXT_EXTENSIONS,DEFAULT_SKIP_DIRS,DSN_ADDITIONAL_SKIP_DIRS,MAX_FILE_SIZE,CONCURRENCY_LIMIT, helpers (normalizePath,isMonorepoPackageDir)WalkEntry,WalkOptions,GrepMatch,GrepOptions,GlobOptions,IgnoreMatcherDSN-specific policy lives in
src/lib/dsn/scan-options.ts.scan/stays policy-free so the init wizard (follow-up PR) can bring its own preset.DSN scanner migration (
src/lib/dsn/code-scanner.ts)scanCodeForDsns/dsn.detect.code.gitignorestate.earlyExitmapFilesConcurrent.onResultDSN-specific content extraction (
extractDsnsFromContent,isCommentedLine,isValidDsnHost,getExpectedHost) stays incode-scanner.tsunchanged. The walker + concurrency + gitignore handling moved toscan/.Added a case-insensitive literal-prefix fast path to
extractDsnsFromContent(/http/i.test(content)) that short-circuitsmatchAllwhen the file contains no scheme prefix in any casing.Performance (synthetic/large, 10k files, 4-core)
## Performance (synthetic/large, 10k files, 4-core)

Benchmarked ops: `detectDsn.cold`, `scanCodeForFirstDsn`, `scanCodeForDsns`, `detectAllDsns.cold`, `scan.grepFiles`, `scan.ignore` (µbench).

The 46-87× wins on the stop-on-first hot paths come from `mapFilesConcurrent`'s cooperative early-exit flag — the old scanner's stop-on-first was less aggressive. That's the path every `sentry issue list` invocation hits.

The small regression on `scanCodeForDsns` is the cost of nested-gitignore correctness (~41 ms single-threaded) plus the case-insensitive scheme probe (~15 ms vs raw `indexOf`). Both are correctness improvements; the regression is within the planning budget (20%).

## Benchmark harness
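The fixture generator described below is deterministic because it draws from a seeded xorshift32 PRNG; a minimal illustrative version (not the exact `script/bench.ts` code):

```typescript
// Minimal xorshift32 PRNG — the kind of seeded generator the bench fixtures
// use for determinism (illustrative sketch, not the PR's exact code).
// Same seed → same fixture tree on every machine, so baselines are comparable.
function xorshift32(seed: number): () => number {
  let state = seed >>> 0 || 1; // state must be non-zero
  return () => {
    state ^= state << 13;
    state ^= state >>> 17;
    state ^= state << 5;
    state >>>= 0;
    return state / 0x1_0000_0000; // uniform float in [0, 1)
  };
}
```

A generator like this is a few lines, has a full 2³²−1 period, and avoids `Math.random()`'s unseedable, engine-specific output — the property that matters when baselines are machine-specific but fixtures must not be.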
`script/bench.ts` + `script/bench-sweep.ts`, deterministic xorshift32-seeded fixture generator at `test/fixtures/bench/`. Three presets (`small` ~100 files, `medium` ~1k, `large` ~10k). Baselines gitignored (machine-specific).

Op labels intentionally match Sentry span names (`scanCodeForDsns`, `findProjectRoot`, …) so local measurements correlate with production telemetry.

## Optimizations landed
- `mapFilesConcurrent` + per-file `extractDsnsFromContent(content, limit: 1)` — the early-exit wins (46-87× on stop-on-first).
- `CONCURRENCY_LIMIT` tied to `availableParallelism()` with floor 2 and cap 16. The JSDoc documents the tradeoff between walker-fed (our real workload) and pure-I/O workloads, and notes that callers can override per-call.
- `grep.ts::readAndGrep` whole-buffer `matchAll` — replaced `content.split("\n")` + per-line `regex.test` with `regex.exec` + lazy line-num counting. 12× faster on the profiled grep CPU phase.
- `/http/i.test(content)` short-circuits `matchAll` on the ~92% of files that can't possibly contain a DSN. Must be case-insensitive for correctness.
- `IgnoreStack` two-tier query — root-only fast path when no nested gitignores loaded (14× per-query speedup); prefix-walk slow path for the nested case (3× speedup). The `scan.ignore` microbench dropped from 10.9 ms → 2.1 ms.

## Non-goals, explicitly rejected with evidence
- … `grep.ts` covered the gap instead.
- Replacing `ignore` with `picomatch`: gitignore syntax (bare basename match, trailing `/` = subtree, leading `/` anchored) isn't compatible with picomatch glob syntax. Would require a 200-400 LOC translator with subtle correctness edge cases. The `ignore` package is ~800 LOC specifically because this is non-trivial.

## Behavior change (intentional)
Nested `.gitignore` now honored in DSN detection. Pre-change: only the project-root `.gitignore` was read. Post-change: every directory's `.gitignore` applies to its subtree with cumulative semantics (parent patterns apply, child negations override).
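The cumulative semantics can be illustrated with a toy matcher stack (the real `IgnoreStack` wraps the `ignore` npm package; this sketch supports only bare basenames and `*.ext` patterns, and every name in it is hypothetical):

```typescript
// Toy illustration of git's cumulative .gitignore semantics: rules are
// evaluated root-first down to the file's directory, and the LAST matching
// rule wins — so a child `!debug.log` overrides a parent `*.log`.
type Rule = { negated: boolean; pattern: string };

function parseGitignore(lines: string[]): Rule[] {
  return lines
    .filter((l) => l.trim() !== "" && !l.startsWith("#"))
    .map((l) =>
      l.startsWith("!")
        ? { negated: true, pattern: l.slice(1) }
        : { negated: false, pattern: l },
    );
}

// Supports only `name` (exact basename) and `*.ext` (suffix) patterns.
function matchesPattern(pattern: string, basename: string): boolean {
  return pattern.startsWith("*.")
    ? basename.endsWith(pattern.slice(1))
    : basename === pattern;
}

// `stack` is ordered root-first; each entry is one directory's rules.
function isIgnored(stack: Rule[][], basename: string): boolean {
  let ignored = false;
  for (const rules of stack) {
    for (const r of rules) {
      if (matchesPattern(r.pattern, basename)) ignored = !r.negated;
    }
  }
  return ignored;
}
```

Note the parent's `*.log` still applies to every other file in the subtree; only the specifically re-included name escapes.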
.gitignoreis no longer detected. This matches git's owncheck-ignorebehavior and is almost always what users want (a gitignored file should not leak DSNs to the CLI). If anyone depends on the old behavior, they can remove the relevant.gitignorerule.Review round fixes
- Mixed-case scheme probe: `Https://`, `hTtP://`, etc. were silently dropped by the initial `indexOf`-based probe. The probe is now `/http/i.test(content)`. Widened the property-test arbitrary to include uppercase so this class is actually exercised. Added 5 parametrized regression tests in `code-scanner.test.ts`.
- Dropped a `CONCURRENCY_LIMIT` bump (cores × 4, cap 32) that microbenchmarked well but regressed the real walker-fed workload ~15%. The JSDoc now documents why the "I/O-bound, therefore scale with something larger than cores" reasoning is misleading for this specific pipeline.

## What's NOT in this PR
- Init wizard migration (replacing the `src/lib/init/tools/grep.ts` + `glob.ts` subprocess shims with `collectGrep` / `collectGlob`). Separate follow-up PR with its own plan.

## Test plan
- [x] `bun run typecheck` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in `markdown.ts`, not from this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — 5464 pass, 0 fail (+183 new scan + bench + DSN tests including property tests via fast-check + 5 mixed-case regression tests)
- [x] `bun test test/isolated` — 138 pass, 0 fail
- [x] `bun run bench --size large` — `scanCodeForDsns` within 1.2× of pre-PR baseline (312 ms vs 298 ms = 1.05×)
- [x] `Https://`, `hTtP://`, `HttpS://` — all pass

🤖 Generated with Claude Code
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>