refactor(scan): transport grep worker line pool as bytes by BYK · Pull Request #826 · getsentry/cli

BYK · 2026-04-23T09:24:38Z

Summary

Changes the grep worker's result protocol from {type, ints: Uint32Array, linePool: string} to {type, ints, linePoolBytes: Uint8Array}, encoding the pool to UTF-8 on the worker side and decoding it once on the main side. Both buffers now ride postMessage's zero-copy transfer path.

Why

Bun's fast-path postMessage for strings fires for bare strings and string-only plain objects, but not when a string is sent alongside a transferable. Our mixed message {ints: transferable, linePool: string} was on the slow structured-clone path for the string portion on every batch.

Packing the pool as a transferable Uint8Array lets both pieces go zero-copy. In a worker-messaging microbench this is ~8× faster per round-trip at realistic match densities.
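The pack and unpack steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `GrepBatchMessage`, `packBatch`, `unpackBatch`, and the layout of `ints` as `(lineOffset, lineLength)` pairs are assumptions made for the example. Only the field names `type`, `ints`, `linePoolBytes` and the decoder name `LINE_POOL_DECODER` come from the description.

```typescript
// Hypothetical message shape, modeled on the protocol named in the summary.
interface GrepBatchMessage {
  type: "batch";
  ints: Uint32Array;
  linePoolBytes: Uint8Array;
}

const LINE_POOL_ENCODER = new TextEncoder();
// ignoreBOM: true keeps a leading U+FEFF so offsets stay aligned.
const LINE_POOL_DECODER = new TextDecoder("utf-8", { ignoreBOM: true });

// Worker side: encode the pool once and list both backing buffers for transfer.
function packBatch(
  ints: Uint32Array,
  linePool: string,
): { message: GrepBatchMessage; transfer: ArrayBufferLike[] } {
  const linePoolBytes = LINE_POOL_ENCODER.encode(linePool);
  return {
    message: { type: "batch", ints, linePoolBytes },
    transfer: [ints.buffer, linePoolBytes.buffer],
  };
}

// Main side: decode the pool once, then slice lines out by code-unit offsets.
// Assumed layout: ints holds consecutive (lineOffset, lineLength) pairs.
function unpackBatch(message: GrepBatchMessage): string[] {
  const pool = LINE_POOL_DECODER.decode(message.linePoolBytes);
  const lines: string[] = [];
  for (let i = 0; i + 1 < message.ints.length; i += 2) {
    const start = message.ints[i];
    const length = message.ints[i + 1];
    lines.push(pool.slice(start, start + length));
  }
  return lines;
}
```

In a worker, the result would be sent with `postMessage(message, transfer)`, so that both `ArrayBuffer`s are moved rather than cloned.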

Not a perf PR

End-to-end wall-clock on typical grep workloads is within ±2% of the previous protocol because messaging happens concurrently with walker I/O and worker regex work — it was never on the critical path. Measured multiple rounds of 50-run benches across zero-match, rare, common, and very-common grep patterns; the microbench-visible savings don't surface as user-visible speed.

Landing anyway for:

Memory pressure: the pool is no longer duplicated across the thread boundary on every batch.
Protocol correctness: the old string path silently papered over two protocol-level hazards that the bytes path surfaces and fixes (see below).
Future-proofing: if Bun later extends the fast path to mixed messages, or the pipeline becomes less overlapped, the zero-copy path delivers automatically.

Correctness fixes surfaced by the new transport

Both bugs were latent on the string protocol — structured-clone silently preserved surrogates and BOMs, so the bugs never manifested. Forcing TextEncoder.encode / TextDecoder.decode for the round-trip made them visible, and both are now tested:

1. `maxLineLength` truncation splitting a surrogate pair

grep-worker.js truncates long lines via line.slice(0, maxLineLength - 1) + "…". slice() is code-unit-based, so it can split a UTF-16 surrogate pair and leave a lone high surrogate at the boundary. TextEncoder.encode then replaces the lone half with U+FFFD — silent data loss.

Fix: before cutting, if the character at cut - 1 is a high surrogate (its low-surrogate partner is at cut, which we're about to exclude), back off one code unit. Drops both halves of the orphaned pair. Safe under all observable inputs — readFileSync("utf-8") can't produce lone surrogates on its own, so truncation is the only path that could create one.
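A minimal sketch of the backoff, assuming a standalone helper named `truncateLine` (a hypothetical name; the actual code lives inline in `grep-worker.js` and may differ in detail):

```typescript
// Truncate to at most maxLineLength UTF-16 code units, ending in "…",
// without leaving a lone high surrogate at the cut point.
function truncateLine(line: string, maxLineLength: number): string {
  if (line.length <= maxLineLength) return line;

  let cut = maxLineLength - 1;
  if (cut > 0) {
    const last = line.charCodeAt(cut - 1);
    // High surrogates occupy 0xD800..0xDBFF. If the last kept unit is one,
    // its low-surrogate partner sits at `cut` and is about to be dropped,
    // so back off one unit and drop the whole pair.
    if (last >= 0xd800 && last <= 0xdbff) cut -= 1;
  }
  return line.slice(0, cut) + "…";
}

// "ab🙂cd" is 6 code units: a, b, high surrogate, low surrogate, c, d.
const sample = "ab🙂cd";
const naive = sample.slice(0, 3) + "…"; // ends in a lone high surrogate
const safe = truncateLine(sample, 4); // backs off to "ab…"

// Encoding then decoding shows what the bytes transport does to each.
const roundTrip = (s: string): string =>
  new TextDecoder("utf-8", { ignoreBOM: true }).decode(new TextEncoder().encode(s));
```

Here `roundTrip(naive)` contains U+FFFD in place of the orphaned half, while `roundTrip(safe)` comes back unchanged.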

2. `TextDecoder` BOM stripping (caught by Cursor Bugbot in review)

new TextDecoder("utf-8") defaults to ignoreBOM: false, which silently drops a leading U+FEFF. For a BOM-prefixed source file, the worker puts U+FEFF at pool index 0 and stores offsets against the pre-encode pool length; without ignoreBOM: true on the main-side decoder, the decoded pool is one code unit shorter than the worker expected and every lineOffset/lineLength in the batch shifts left by one — lines bleed into each other.

Fix: pass { ignoreBOM: true } to the LINE_POOL_DECODER constructor. Verified end-to-end with collectGrep on a real BOM-prefixed file: without the fix, matched lines came back as "TARGET firstT", "ARGET secondT", "ARGET third"; with the fix, they're intact.
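The offset shift can be reproduced in isolation. The sketch below assumes, for illustration only, that matched lines are concatenated into the pool with no separator and addressed by `(offset, length)` pairs in UTF-16 code units; the variable names are hypothetical.

```typescript
// Worker side: the first line carries the file's BOM at pool index 0.
const lines = ["\uFEFFTARGET first", "TARGET second", "TARGET third"];
const pool = lines.join("");

// Offsets are measured against the pre-encode pool.
const spans: Array<[number, number]> = [];
let offset = 0;
for (const line of lines) {
  spans.push([offset, line.length]);
  offset += line.length;
}

const bytes = new TextEncoder().encode(pool);

// Main side: slice each line back out of a decoded pool.
function sliceLines(decodedPool: string): string[] {
  return spans.map(([start, length]) => decodedPool.slice(start, start + length));
}

// Default decoder (ignoreBOM: false) drops the leading U+FEFF,
// so every slice lands one code unit too far to the right.
const shifted = sliceLines(new TextDecoder("utf-8").decode(bytes));

// With ignoreBOM: true the BOM is kept and the offsets line up.
const intact = sliceLines(new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes));
```

Under these assumptions `shifted` is `["TARGET firstT", "ARGET secondT", "ARGET third"]`, matching the corruption reported above, and `intact` equals the original `lines`.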

Bench

Measured on the fx-large synthetic fixture (10k files), 50 runs × 5 warmup × 3 repeated rounds each side. p50:

| op | main | PR | Δ |
| --- | --- | --- | --- |
| `collectGrep` zero-match uncapped | 302ms | 298ms | ~ |
| `collectGrep` rare uncapped (`SENTRY_DSN`) | 316ms | 303ms | −4% |
| `collectGrep` common uncapped (`import.*from`) | 631ms | 636ms | +1% |
| `collectGrep` very-common uncapped (`function\s+\w+`) | 636ms | 629ms | −1% |

All deltas within ±4%, consistent with noise floor.

Tests

Three new tests in test/lib/scan/grep.test.ts:

preserves non-ASCII / multi-byte UTF-8 in matched lines — emoji, CJK, accented chars, astral-plane codepoints through the encode/decode round-trip.
truncation at a surrogate-pair boundary doesn't leak U+FFFD: regression for fix 1 (surrogate-pair truncation); verified to fail when the backoff is reverted.
UTF-8 BOM at the start of a file preserves line offsets: regression for fix 2 (`TextDecoder` BOM stripping); verified to fail when ignoreBOM: true is removed.

All 41 grep tests pass; 5729 full-suite + 138 isolated pass. Typecheck + lint clean.

Review

Round 1 (subagent): caught the lone-surrogate correctness issue in the pre-existing truncation code.
Round 2 (subagent): verified the surrogate fix and blessed for ship.
Cursor Bugbot on force-push: caught the BOM-stripping bug on LINE_POOL_DECODER. Fixed with { ignoreBOM: true } + regression test.
Round 3 (subagent, pre-merge): verified both fixes hold, no other latent issues, ready to merge.

Test plan

bunx tsc --noEmit — clean
bun run lint — clean
bun test test/lib/scan/grep.test.ts — 41 pass
bun test test/lib test/commands test/types — 5729 pass
bun test test/isolated — 138 pass
Negative-verified both correctness regression tests catch the bugs they guard against

github-actions · 2026-04-23T09:25:14Z

PR Preview Action v1.8.1
🚀 View preview at https://cli.sentry.dev/_preview/pr-826/
Built to branch `gh-pages` at 2026-04-23 10:06 UTC. Preview will be ready when the GitHub Pages deployment is complete.

github-actions · 2026-04-23T09:26:48Z

Codecov Results 📊

✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric	Change
Total Tests	—
Passed Tests	—
Failed Tests	—
Skipped Tests	—

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 1948 uncovered lines.
❌ Project coverage is 95.27%. Comparing base (base) to head (head).

Coverage diff

@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.29%    95.27%    -0.02%
==========================================
  Files          284       284         —
  Lines        41151     41151         —
  Branches         0         0         —
==========================================
+ Hits         39211     39203        -8
- Misses        1940      1948        +8
- Partials         0         0         —

Generated by Codecov Action

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9d1fc63.

Changes the worker → main result protocol from `{type, ints: Uint32Array, linePool: string}` to `{type, ints, linePoolBytes: Uint8Array}`, encoding the pool to UTF-8 on the worker side and decoding it once on the main side. Both buffers now ride `postMessage`'s zero-copy transfer path.

## Why

Bun's [fast-path postMessage](https://bun.com/blog/how-we-made-postMessage-string-500x-faster) for strings fires for bare strings and string-only plain objects, but not when a string is sent alongside a transferable. Our mixed message `{ints: transferable, linePool: string}` was on the slow structured-clone path for the string portion on every batch.

Packing the pool as a transferable `Uint8Array` lets both pieces go zero-copy. In a worker-messaging microbench this is ~8× faster per round-trip at realistic match densities.

## This is a refactor, not a perf PR

**End-to-end wall-clock on typical grep workloads is within ±2% of the previous protocol** because messaging happens concurrently with walker I/O and worker regex work; it was never on the critical path. I measured multiple rounds of 50-run benches across zero-match, rare, common, and very-common grep patterns; the microbench-visible savings don't surface as user-visible speed.

Landing anyway for:

- **Memory pressure**: the pool is no longer duplicated across the thread boundary on every batch.
- **Protocol correctness**: the old string path worked only because structured-clone silently preserved lone UTF-16 surrogates. The new bytes path made that accidental guarantee visible by forcing it through `TextEncoder`, which exposed the truncation bug fixed in this same commit (see below).
- **Future-proofing**: if Bun later extends the fast path to mixed messages, or the pipeline becomes less overlapped, the zero-copy path delivers automatically.

## Truncation / surrogate-pair fix (also in this commit)

`grep-worker.js` truncates lines longer than `maxLineLength` via `line.slice(0, maxLineLength - 1) + "…"`. `slice()` is code-unit-based, so it can split a UTF-16 surrogate pair and leave a lone high surrogate at the boundary. The old string protocol preserved that (structured-clone is lone-surrogate-safe); the new bytes protocol cannot (`TextEncoder.encode` replaces lone halves with U+FFFD).

Fix: before cutting, if the character at `cut - 1` is a high surrogate (its low-surrogate partner is at `cut`, which we're about to exclude), back off one code unit. This drops both halves of the orphaned pair and keeps `.length` / offsets correct in both the worker and the decoded main-side string.

Safe under all observable inputs: `readFileSync("utf-8")` can't produce lone surrogates on its own (invalid UTF-8 bytes are pre-replaced with U+FFFD), so truncation is the only path that could create one.

## Tests

- New `preserves non-ASCII / multi-byte UTF-8 in matched lines`: covers emoji, CJK, accented chars, astral-plane codepoints through the full encode/decode round-trip.
- New `truncation at surrogate-pair boundary stays intact`: direct regression test for the truncation fix; verified FAILS when the backoff is reverted (U+FFFD leaks in) and PASSES with it.
- All 40 tests in `test/lib/scan/grep.test.ts` pass.
- All 5728 tests + 138 isolated pass.
- Typecheck + lint clean (1 pre-existing markdown.ts warning).

## Review

Two rounds of subagent review before commit:

1. Round 1 (`ses_2467b0231ffe3j49P7D1lN5kCU`): caught the lone-surrogate correctness issue at the truncation boundary.
2. Round 2 (`ses_2465f6151ffe9JyNQEpDoYsNYH`): verified the fix covers all reachable edge cases (`maxLineLength=1`/`2`, low-surrogate at boundary, non-surrogate chars, adjacent lone highs are structurally impossible given the `readFileSync` source guarantee), the regression test is targeted correctly, and no new issues were introduced.

BYK force-pushed the byk/perf/worker-postmessage-bytes branch from eb5f71f to 9d1fc63 Compare April 23, 2026 09:51

cursor Bot reviewed Apr 23, 2026

Comment thread src/lib/scan/worker-pool.ts Outdated

BYK force-pushed the byk/perf/worker-postmessage-bytes branch from 9d1fc63 to 57c2fdc Compare April 23, 2026 10:05

BYK merged commit 555a774 into main Apr 23, 2026
26 checks passed

BYK deleted the byk/perf/worker-postmessage-bytes branch April 23, 2026 10:14

sentry-release-bot Bot mentioned this pull request Apr 23, 2026

publish: getsentry/cli@0.29.0 getsentry/publish#7927

Closed

5 tasks
