refactor(scan): transport grep worker line pool as bytes#826
Merged
Conversation
Contributor
|
Contributor
Codecov Results 📊✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms 📊 Comparison with Base Branch
✨ No test changes detected All tests are passing successfully. ✅ Patch coverage is 100.00%. Project has 1948 uncovered lines. Coverage diff@@ Coverage Diff @@
## main #PR +/-##
==========================================
- Coverage 95.29% 95.27% -0.02%
==========================================
Files 284 284 —
Lines 41151 41151 —
Branches 0 0 —
==========================================
+ Hits 39211 39203 -8
- Misses 1940 1948 +8
- Partials 0 0 —Generated by Codecov Action |
eb5f71f to
9d1fc63
Compare
Contributor
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 9d1fc63. Configure here.
Changes the worker → main result protocol from `{type, ints:
Uint32Array, linePool: string}` to `{type, ints, linePoolBytes:
Uint8Array}`, encoding the pool to UTF-8 on the worker side and
decoding it once on the main side. Both buffers now ride
`postMessage`'s zero-copy transfer path.
## Why
Bun's [fast-path postMessage](https://bun.com/blog/how-we-made-postMessage-string-500x-faster)
for strings fires for bare strings and string-only plain objects,
but not when a string is sent alongside a transferable — our
mixed-message `{ints: transferable, linePool: string}` was on the
slow structured-clone path for the string portion on every batch.
Packing the pool as a transferable `Uint8Array` lets both pieces
go zero-copy. In a worker-messaging microbench this is ~8× faster
per round-trip at realistic match densities.
## This is a refactor, not a perf PR
**End-to-end wall-clock on typical grep workloads is within ±2% of
the previous protocol** because messaging happens concurrently
with walker I/O and worker regex work — it was never on the
critical path. I measured multiple rounds of 50-run benches
across zero-match, rare, common, and very-common grep patterns;
the microbench-visible savings don't surface as user-visible speed.
Landing anyway for:
- **Memory pressure**: the pool is no longer duplicated across the
thread boundary on every batch.
- **Protocol correctness**: the old string path worked only
because structured-clone silently preserved lone UTF-16
surrogates. The new bytes path made that accidental guarantee
visible by forcing it through `TextEncoder`, which exposed the
truncation bug fixed in this same commit (see below).
- **Future-proofing**: if Bun later extends the fast path to mixed
messages, or the pipeline becomes less overlapped, the zero-copy
path delivers automatically.
## Truncation / surrogate-pair fix (also in this commit)
`grep-worker.js` truncates lines longer than `maxLineLength` via
`line.slice(0, maxLineLength - 1) + "…"`. `slice()` is code-unit-
based, so it can split a UTF-16 surrogate pair and leave a lone
high surrogate at the boundary. The old string protocol preserved
that (structured-clone is lone-surrogate-safe); the new bytes
protocol cannot (`TextEncoder.encode` replaces lone halves with
U+FFFD).
Fix: before cutting, if the character at `cut - 1` is a high
surrogate (its low-surrogate partner is at `cut`, which we're
about to exclude), back off one code unit. This drops both halves
of the orphaned pair and keeps `.length` / offsets correct in both
the worker and the decoded main-side string. Safe under all
observable inputs — `readFileSync("utf-8")` can't produce lone
surrogates on its own (invalid UTF-8 bytes are pre-replaced with
U+FFFD), so truncation is the only path that could create one.
## Tests
- New `preserves non-ASCII / multi-byte UTF-8 in matched lines` —
covers emoji, CJK, accented chars, astral-plane codepoints
through the full encode/decode round-trip.
- New `truncation at surrogate-pair boundary stays intact` —
direct regression test for the truncation fix; verified FAILS
when the backoff is reverted (U+FFFD leaks in) and PASSES with
it.
- All 40 tests in `test/lib/scan/grep.test.ts` pass.
- All 5728 tests + 138 isolated pass.
- Typecheck + lint clean (1 pre-existing markdown.ts warning).
## Review
Two rounds of subagent review before commit:
1. Round 1 (`ses_2467b0231ffe3j49P7D1lN5kCU`): caught the lone-
surrogate correctness issue at the truncation boundary.
2. Round 2 (`ses_2465f6151ffe9JyNQEpDoYsNYH`): verified the fix
covers all reachable edge cases (`maxLineLength=1`/`2`,
low-surrogate at boundary, non-surrogate chars, adjacent lone
highs are structurally impossible given the `readFileSync`
source guarantee), the regression test is targeted correctly,
and no new issues were introduced.
9d1fc63 to
57c2fdc
Compare
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Changes the grep worker's result protocol from
{type, ints: Uint32Array, linePool: string}to{type, ints, linePoolBytes: Uint8Array}, encoding the pool to UTF-8 on the worker side and decoding it once on the main side. Both buffers now ridepostMessage's zero-copy transfer path.Why
Bun's fast-path
postMessagefor strings fires for bare strings and string-only plain objects, but not when a string is sent alongside a transferable. Our mixed message{ints: transferable, linePool: string}was on the slow structured-clone path for the string portion on every batch.Packing the pool as a transferable
Uint8Arraylets both pieces go zero-copy. In a worker-messaging microbench this is ~8× faster per round-trip at realistic match densities.Not a perf PR
End-to-end wall-clock on typical grep workloads is within ±2% of the previous protocol because messaging happens concurrently with walker I/O and worker regex work — it was never on the critical path. Measured multiple rounds of 50-run benches across zero-match, rare, common, and very-common grep patterns; the microbench-visible savings don't surface as user-visible speed.
Landing anyway for:
Correctness fixes surfaced by the new transport
Both bugs were latent on the string protocol — structured-clone silently preserved surrogates and BOMs, so the bugs never manifested. Forcing
TextEncoder.encode/TextDecoder.decodefor the round-trip made them visible, and both are now tested:1.
maxLineLengthtruncation splitting a surrogate pairgrep-worker.jstruncates long lines vialine.slice(0, maxLineLength - 1) + "…".slice()is code-unit-based, so it can split a UTF-16 surrogate pair and leave a lone high surrogate at the boundary.TextEncoder.encodethen replaces the lone half with U+FFFD — silent data loss.Fix: before cutting, if the character at
cut - 1is a high surrogate (its low-surrogate partner is atcut, which we're about to exclude), back off one code unit. Drops both halves of the orphaned pair. Safe under all observable inputs —readFileSync("utf-8")can't produce lone surrogates on its own, so truncation is the only path that could create one.2.
TextDecoderBOM stripping (caught by Cursor Bugbot in review)new TextDecoder("utf-8")defaults toignoreBOM: false, which silently drops a leading U+FEFF. For a BOM-prefixed source file, the worker puts U+FEFF at pool index 0 and stores offsets against the pre-encode pool length; withoutignoreBOM: trueon the main-side decoder, the decoded pool is one code unit shorter than the worker expected and everylineOffset/lineLengthin the batch shifts left by one — lines bleed into each other.Fix: pass
{ ignoreBOM: true }to theLINE_POOL_DECODERconstructor. Verified end-to-end withcollectGrepon a real BOM-prefixed file: without the fix, matched lines came back as"TARGET firstT","ARGET secondT","ARGET third"; with the fix, they're intact.Bench
Measured on the
fx-largesynthetic fixture (10k files), 50 runs × 5 warmup × 3 repeated rounds each side. p50:collectGrepzero-match uncappedcollectGreprare uncapped (SENTRY_DSN)collectGrepcommon uncapped (import.*from)collectGrepvery-common uncapped (function\s+\w+)All deltas within ±4%, consistent with noise floor.
Tests
Three new tests in
test/lib/scan/grep.test.ts:preserves non-ASCII / multi-byte UTF-8 in matched lines— emoji, CJK, accented chars, astral-plane codepoints through the encode/decode round-trip.truncation at a surrogate-pair boundary doesn't leak U+FFFD— regression for fix refactor: rename CLI from sry to sentry-cli-next #1; verified to fail when the backoff is reverted.UTF-8 BOM at the start of a file preserves line offsets— regression for fix refactor: use native Sentry device flow, remove oauth-proxy #2; verified to fail whenignoreBOM: trueis removed.All 41 grep tests pass; 5729 full-suite + 138 isolated pass. Typecheck + lint clean.
Review
LINE_POOL_DECODER. Fixed with{ ignoreBOM: true }+ regression test.Test plan
bunx tsc --noEmit— cleanbun run lint— cleanbun test test/lib/scan/grep.test.ts— 41 passbun test test/lib test/commands test/types— 5729 passbun test test/isolated— 138 pass