Skip to content

refactor(scan): transport grep worker line pool as bytes#826

Merged
BYK merged 1 commit intomainfrom
byk/perf/worker-postmessage-bytes
Apr 23, 2026
Merged

refactor(scan): transport grep worker line pool as bytes#826
BYK merged 1 commit intomainfrom
byk/perf/worker-postmessage-bytes

Conversation

@BYK
Copy link
Copy Markdown
Member

@BYK BYK commented Apr 23, 2026

Summary

Changes the grep worker's result protocol from {type, ints: Uint32Array, linePool: string} to {type, ints, linePoolBytes: Uint8Array}, encoding the pool to UTF-8 on the worker side and decoding it once on the main side. Both buffers now ride postMessage's zero-copy transfer path.

Why

Bun's fast-path postMessage for strings fires for bare strings and string-only plain objects, but not when a string is sent alongside a transferable. Our mixed message {ints: transferable, linePool: string} was on the slow structured-clone path for the string portion on every batch.

Packing the pool as a transferable Uint8Array lets both pieces go zero-copy. In a worker-messaging microbench this is ~8× faster per round-trip at realistic match densities.

Not a perf PR

End-to-end wall-clock on typical grep workloads is within ±2% of the previous protocol because messaging happens concurrently with walker I/O and worker regex work — it was never on the critical path. Measured multiple rounds of 50-run benches across zero-match, rare, common, and very-common grep patterns; the microbench-visible savings don't surface as user-visible speed.

Landing anyway for:

  • Memory pressure: the pool is no longer duplicated across the thread boundary on every batch.
  • Protocol correctness: the old string path silently papered over two protocol-level hazards that the bytes path surfaces and fixes (see below).
  • Future-proofing: if Bun later extends the fast path to mixed messages, or the pipeline becomes less overlapped, the zero-copy path delivers automatically.

Correctness fixes surfaced by the new transport

Both bugs were latent on the string protocol — structured-clone silently preserved surrogates and BOMs, so the bugs never manifested. Forcing TextEncoder.encode / TextDecoder.decode for the round-trip made them visible, and both are now tested:

1. maxLineLength truncation splitting a surrogate pair

grep-worker.js truncates long lines via line.slice(0, maxLineLength - 1) + "…". slice() is code-unit-based, so it can split a UTF-16 surrogate pair and leave a lone high surrogate at the boundary. TextEncoder.encode then replaces the lone half with U+FFFD — silent data loss.

Fix: before cutting, if the character at cut - 1 is a high surrogate (its low-surrogate partner is at cut, which we're about to exclude), back off one code unit. Drops both halves of the orphaned pair. Safe under all observable inputs — readFileSync("utf-8") can't produce lone surrogates on its own, so truncation is the only path that could create one.

2. TextDecoder BOM stripping (caught by Cursor Bugbot in review)

new TextDecoder("utf-8") defaults to ignoreBOM: false, which silently drops a leading U+FEFF. For a BOM-prefixed source file, the worker puts U+FEFF at pool index 0 and stores offsets against the pre-encode pool length; without ignoreBOM: true on the main-side decoder, the decoded pool is one code unit shorter than the worker expected and every lineOffset/lineLength in the batch shifts left by one — lines bleed into each other.

Fix: pass { ignoreBOM: true } to the LINE_POOL_DECODER constructor. Verified end-to-end with collectGrep on a real BOM-prefixed file: without the fix, matched lines came back as "TARGET firstT", "ARGET secondT", "ARGET third"; with the fix, they're intact.

Bench

Measured on the fx-large synthetic fixture (10k files), 50 runs × 5 warmup × 3 repeated rounds each side. p50:

op main PR Δ
collectGrep zero-match uncapped 302ms 298ms ~
collectGrep rare uncapped (SENTRY_DSN) 316ms 303ms −4%
collectGrep common uncapped (import.*from) 631ms 636ms +1%
collectGrep very-common uncapped (function\s+\w+) 636ms 629ms −1%

All deltas within ±4%, consistent with noise floor.

Tests

Three new tests in test/lib/scan/grep.test.ts:

All 41 grep tests pass; 5729 full-suite + 138 isolated pass. Typecheck + lint clean.

Review

  • Round 1 (subagent): caught the lone-surrogate correctness issue in the pre-existing truncation code.
  • Round 2 (subagent): verified the surrogate fix and blessed for ship.
  • Cursor Bugbot on force-push: caught the BOM-stripping bug on LINE_POOL_DECODER. Fixed with { ignoreBOM: true } + regression test.
  • Round 3 (subagent, pre-merge): verified both fixes hold, no other latent issues, ready to merge.

Test plan

  • bunx tsc --noEmit — clean
  • bun run lint — clean
  • bun test test/lib/scan/grep.test.ts — 41 pass
  • bun test test/lib test/commands test/types — 5729 pass
  • bun test test/isolated — 138 pass
  • Negative-verified both correctness regression tests catch the bugs they guard against

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 23, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://cli.sentry.dev/_preview/pr-826/

Built to branch gh-pages at 2026-04-23 10:06 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 23, 2026

Codecov Results 📊

138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric Change
Total Tests
Passed Tests
Failed Tests
Skipped Tests

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 1948 uncovered lines.
❌ Project coverage is 95.27%. Comparing base (base) to head (head).

Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.29%    95.27%    -0.02%
==========================================
  Files          284       284         —
  Lines        41151     41151         —
  Branches         0         0         —
==========================================
+ Hits         39211     39203        -8
- Misses        1940      1948        +8
- Partials         0         0         —

Generated by Codecov Action

@BYK BYK force-pushed the byk/perf/worker-postmessage-bytes branch from eb5f71f to 9d1fc63 Compare April 23, 2026 09:51
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9d1fc63. Configure here.

Comment thread src/lib/scan/worker-pool.ts Outdated
Changes the worker → main result protocol from `{type, ints:
Uint32Array, linePool: string}` to `{type, ints, linePoolBytes:
Uint8Array}`, encoding the pool to UTF-8 on the worker side and
decoding it once on the main side. Both buffers now ride
`postMessage`'s zero-copy transfer path.

## Why

Bun's [fast-path postMessage](https://bun.com/blog/how-we-made-postMessage-string-500x-faster)
for strings fires for bare strings and string-only plain objects,
but not when a string is sent alongside a transferable — our
mixed-message `{ints: transferable, linePool: string}` was on the
slow structured-clone path for the string portion on every batch.

Packing the pool as a transferable `Uint8Array` lets both pieces
go zero-copy. In a worker-messaging microbench this is ~8× faster
per round-trip at realistic match densities.

## This is a refactor, not a perf PR

**End-to-end wall-clock on typical grep workloads is within ±2% of
the previous protocol** because messaging happens concurrently
with walker I/O and worker regex work — it was never on the
critical path. I measured multiple rounds of 50-run benches
across zero-match, rare, common, and very-common grep patterns;
the microbench-visible savings don't surface as user-visible speed.

Landing anyway for:

- **Memory pressure**: the pool is no longer duplicated across the
  thread boundary on every batch.
- **Protocol correctness**: the old string path worked only
  because structured-clone silently preserved lone UTF-16
  surrogates. The new bytes path made that accidental guarantee
  visible by forcing it through `TextEncoder`, which exposed the
  truncation bug fixed in this same commit (see below).
- **Future-proofing**: if Bun later extends the fast path to mixed
  messages, or the pipeline becomes less overlapped, the zero-copy
  path delivers automatically.

## Truncation / surrogate-pair fix (also in this commit)

`grep-worker.js` truncates lines longer than `maxLineLength` via
`line.slice(0, maxLineLength - 1) + "…"`. `slice()` is code-unit-
based, so it can split a UTF-16 surrogate pair and leave a lone
high surrogate at the boundary. The old string protocol preserved
that (structured-clone is lone-surrogate-safe); the new bytes
protocol cannot (`TextEncoder.encode` replaces lone halves with
U+FFFD).

Fix: before cutting, if the character at `cut - 1` is a high
surrogate (its low-surrogate partner is at `cut`, which we're
about to exclude), back off one code unit. This drops both halves
of the orphaned pair and keeps `.length` / offsets correct in both
the worker and the decoded main-side string. Safe under all
observable inputs — `readFileSync("utf-8")` can't produce lone
surrogates on its own (invalid UTF-8 bytes are pre-replaced with
U+FFFD), so truncation is the only path that could create one.

## Tests

- New `preserves non-ASCII / multi-byte UTF-8 in matched lines` —
  covers emoji, CJK, accented chars, astral-plane codepoints
  through the full encode/decode round-trip.
- New `truncation at surrogate-pair boundary stays intact` —
  direct regression test for the truncation fix; verified FAILS
  when the backoff is reverted (U+FFFD leaks in) and PASSES with
  it.
- All 40 tests in `test/lib/scan/grep.test.ts` pass.
- All 5728 tests + 138 isolated pass.
- Typecheck + lint clean (1 pre-existing markdown.ts warning).

## Review

Two rounds of subagent review before commit:
1. Round 1 (`ses_2467b0231ffe3j49P7D1lN5kCU`): caught the lone-
   surrogate correctness issue at the truncation boundary.
2. Round 2 (`ses_2465f6151ffe9JyNQEpDoYsNYH`): verified the fix
   covers all reachable edge cases (`maxLineLength=1`/`2`,
   low-surrogate at boundary, non-surrogate chars, adjacent lone
   highs are structurally impossible given the `readFileSync`
   source guarantee), the regression test is targeted correctly,
   and no new issues were introduced.
@BYK BYK force-pushed the byk/perf/worker-postmessage-bytes branch from 9d1fc63 to 57c2fdc Compare April 23, 2026 10:05
@BYK BYK merged commit 555a774 into main Apr 23, 2026
26 checks passed
@BYK BYK deleted the byk/perf/worker-postmessage-bytes branch April 23, 2026 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant