[wrangler] Fix start-worker-node (Windows fixtures) and tail (Linux packages) CI flakes #13662

Merged
petebacondarwin merged 3 commits into main from pbd/fix-start-worker-node-flake-diagnose
Apr 25, 2026

Conversation

@petebacondarwin
Contributor

@petebacondarwin petebacondarwin commented Apr 24, 2026

Fixes two independently reproducible CI flakes that surface whenever the wrangler turbo caches get invalidated.

1. start-worker-node-test fixture — Windows fixtures CI

Example failing run: https://github.com/cloudflare/workers-sdk/actions/runs/24879778860/job/72844998017

Every test inside fixtures/start-worker-node-test/src/config-errors.test.js passes in ~1.2 s, then the subprocess hangs until node --test's 50 s file-level timeout fires and cancels the file. Two prior PRs (#13515, #13604) attacked individual symptoms — #13604 moved the fixture onto its own CI step with a 50 s timeout — but the flake kept recurring.

Root cause

Three compounding bugs in the DevEnv.teardown() path leaked the esbuild child process so the Node event loop could not drain after worker.dispose() returned:

  1. bundleWorker leaks esbuild.BuildContext when the initial build throws. packages/wrangler/src/deployment-bundle/bundle.ts scoped ctx inside if (watch) { }. An initial-build failure (e.g. unresolvable entrypoint from the setConfig tests) threw out of that block, leaving ctx unreachable and its underlying esbuild child process alive for the lifetime of the parent Node process.
  2. runBuild's cleanup closure is fire-and-forget when the build is still in flight. packages/wrangler/src/dev/use-esbuild.ts returned void buildPromise.then(() => stopWatching?.()), so teardown could return before ctx.dispose() had run.
  3. BundlerController.teardown() removes the tmp dir before awaiting esbuild cleanup. packages/wrangler/src/api/startDevWorker/BundlerController.ts deleted .wrangler/tmp/bundle-XXXX first, making an in-flight rebuild fail with Could not resolve .wrangler/tmp/bundle-XXXX/middleware-loader.entry.ts — the exact noise visible in the failing CI log right after the timeout.
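The first bug reduces to a scoping problem. The sketch below is illustrative only (FakeBuildContext, bundleWorkerFixed, and createContext are hypothetical stand-ins, not wrangler's actual bundle.ts code); it shows how hoisting ctx lets the catch block dispose the context that a failing initial build would otherwise strand:

```typescript
// Hypothetical model of esbuild's context shape (rebuild/dispose);
// in real esbuild, dispose() is what terminates the child process.
class FakeBuildContext {
	disposed = false;
	async rebuild(): Promise<void> {
		// Simulates a failing initial build, e.g. an unresolvable entrypoint.
		throw new Error("Could not resolve entry-point");
	}
	async dispose(): Promise<void> {
		this.disposed = true;
	}
}

// Fixed shape: `ctx` is hoisted out of the `if (watch)` block, so the
// catch can still reach it when the initial build throws.
async function bundleWorkerFixed(
	watch: boolean,
	createContext: () => FakeBuildContext
): Promise<void> {
	let ctx: FakeBuildContext | undefined;
	try {
		if (watch) {
			ctx = createContext();
			await ctx.rebuild();
		}
	} catch (e) {
		// Before the fix, ctx was block-scoped and unreachable here, so
		// the esbuild child process outlived the build failure.
		await ctx?.dispose();
		throw e;
	}
}
```

The error still propagates to the caller; the only behavioural change is that the context is released on the failure path.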

Fixes (commit 1)

  • Hoist ctx to outer scope in bundleWorker and dispose it in the catch.
  • Make runBuild's cleanup closure await the build promise before calling stopWatching.
  • Reorder BundlerController.teardown() to dispose esbuild before removing the tmp dir, and abort #bundleBuildAborter so a finishing build cannot emit stale bundleStart / bundleComplete events into a torn-down bus.
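The last two fixes amount to an ordering change in teardown. A minimal sketch, under the assumption of illustrative wiring (teardownFixed and its parameters are hypothetical, though the names mirror the PR description):

```typescript
// Illustrative teardown ordering for fixes 2 and 3; this is a sketch,
// not BundlerController's actual implementation.
async function teardownFixed(opts: {
	bundleBuildAborter: AbortController;
	buildPromise: Promise<unknown>;
	stopWatching: () => Promise<void>;
	removeTmpDir: () => void;
}): Promise<void> {
	// Abort first so a build that finishes mid-teardown cannot emit
	// stale bundleStart/bundleComplete events into a torn-down bus.
	opts.bundleBuildAborter.abort();
	// Fix 2: await the in-flight build instead of the old fire-and-forget
	// `void buildPromise.then(() => stopWatching?.())`.
	await opts.buildPromise.catch(() => {});
	// ctx.dispose() under the hood: terminates the esbuild child process.
	await opts.stopWatching();
	// Fix 3: only delete the tmp dir once no rebuild can still try to
	// resolve files inside it.
	opts.removeTmpDir();
}
```

The invariant is simply "esbuild fully disposed before its working directory disappears", which is what the old ordering violated.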

Revert the #13604 workaround (commit 2)

  • Re-include ./fixtures/start-worker-node-test in the main parallel fixtures test run.
  • Delete the separate Run tests (start-worker-node) step.
  • Drop the node --test --test-timeout back from 50 s to 15 s.

2. tail.test.ts — Linux packages-and-tools CI

Example failing run: https://github.com/cloudflare/workers-sdk/actions/runs/24888217640/job/72873435065. Also visible intermittently on earlier PRs (e.g. #13622 on 2026-04-22). All 42 tests in tail.test.ts pass, then vitest reports 30× Unhandled Rejection: Error: Not logged in. from src/tail/index.ts:205.

Root cause

wrangler tail registered a process-level onExit(exit) handler via signal-exit but never removed it. In long-lived CLI runs this is harmless — the handler eventually runs once on shutdown. In unit tests that invoke wrangler tail dozens of times in the same vitest worker, every invocation accumulates a handler. When the worker later terminates, all of them fire simultaneously — each calling deleteTail(), which calls requireLoggedIn(), after the test's auth mocks have been torn down, producing the spurious rejections.

Fix (commit 3)

  • Capture the remove-function returned by onExit and call it the first time exit() runs.
  • Guard exit() against re-entry so it's idempotent if both the WebSocket close event and a real signal fire in the same session (the same pattern packages/wrangler/src/pages/deployment-tails.ts already uses).
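A sketch of the pattern, using a toy onExit in place of the signal-exit package (whose real onExit likewise returns a remover); startTailSession and deleteTail below are hypothetical stand-ins for the code in src/tail/index.ts:

```typescript
// Toy stand-in for signal-exit: registers a process-exit handler and
// returns a function that unregisters it.
const exitHandlers: Array<() => void> = [];
function onExit(fn: () => void): () => void {
	exitHandlers.push(fn);
	return () => {
		const i = exitHandlers.indexOf(fn);
		if (i !== -1) exitHandlers.splice(i, 1);
	};
}

// Fixed session shape: capture the remover, make exit() idempotent, and
// unregister on the first call so repeated invocations in one process
// never accumulate handlers.
function startTailSession(deleteTail: () => void): () => void {
	let exited = false;
	const removeOnExit = onExit(exit);
	function exit(): void {
		if (exited) return; // WebSocket close + real signal: runs only once
		exited = true;
		removeOnExit(); // the leaked handler the old code never removed
		deleteTail();
	}
	return exit;
}
```

With this shape, thirty tail invocations in one test worker leave zero handlers registered at shutdown, so nothing fires after the auth mocks are gone.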

Commit layout

Commit | Scope | Changeset
[wrangler] Fix esbuild resource leaks during unstable_startWorker teardown | §1 root cause | fix-esbuild-teardown-leaks.md (wrangler patch)
[ci] Revert start-worker-node flake workaround now that root cause is fixed | revert #13604 | none
[wrangler] Fix signal-exit listener leak in wrangler tail | §2 root cause | fix-wrangler-tail-exit-listener-leak.md (wrangler patch)

Verification

Stress-tested on CI by repeatedly running the fixture suite with --force --only to bypass turbo caching. Across one run that exercised the fix 15 times on three platforms we saw zero hangs, zero timeouts, zero cancellations on start-worker-node-test:

Platform | Clean cache-miss start-worker-node runs | Duration range
Linux | 7 | 1.58–1.98 s
Windows | 2 | 3.06–3.29 s
macOS | 6 | 1.34–3.02 s

(Linux and macOS would have run more but hit GitHub's 30-minute job cap; Windows completed 2 runs before an unrelated ECONNRESET networking flake in @fixture/worker-ratelimit aborted its step.) Every execution completed well under the restored 15 s node --test --test-timeout, whereas before the fix the subprocess would hit the artificially raised 50 s timeout. The stress-test workflow scaffolding has been removed from the branch — only the three fix commits remain.

Also verified:

  • BundlerController × 8 and full startDevWorker suite × 47 (8 todo) pass.
  • start-worker-node-test runs cleanly on macOS locally in ~1.6 s.

  • Tests
    • Tests included/updated — the start-worker-node-test fixture is what's being stabilised; the tail fix is exercised by the existing tail.test.ts that was surfacing the leak.
  • Public documentation
    • Documentation not necessary because: bug fixes, no user-facing API change.

@changeset-bot

changeset-bot Bot commented Apr 24, 2026

🦋 Changeset detected

Latest commit: eee802a

The changes in this PR will be included in the next version bump.


Contributor

@workers-devprod workers-devprod left a comment


Codeowners reviews satisfied

@github-project-automation github-project-automation Bot moved this from Untriaged to Approved in workers-sdk Apr 24, 2026
@ask-bonk
Contributor

ask-bonk Bot commented Apr 24, 2026

I've reviewed the PR thoroughly. This is a temporary diagnostic-only change to instrument test files and capture event-loop handle information on Windows CI to diagnose a known flake. Let me verify there are no concerns:

  1. No product code changes -- only test fixture files are modified.
  2. The diagnostic file is clearly marked as temporary with a comment at the top.
  3. The after hook ordering in index.test.js -- registerLeakDiagnostic is called after the existing after that runs worker.dispose(). In node:test, after hooks within a describe run in registration order (FIFO), so the diagnostic will correctly run after dispose(). Good.
  4. The after hook ordering in config-errors.test.js -- registerLeakDiagnostic is registered at the top of the describe block, before the tests. Since config-errors.test.js calls worker.dispose() inline within each test (not in an after hook), registration order doesn't matter here -- the diagnostic after will run after all tests complete, which is correct.
  5. The 500ms delay is intentionally ref'd (not unref'd) as documented in the comment -- reasonable for diagnostic purposes.
  6. Use of process._getActiveHandles() / process._getActiveRequests() -- undocumented but standard diagnostic APIs, fine for temporary instrumentation.
  7. No changeset needed -- this is test-only instrumentation with no user-facing impact, and the PR is marked draft.

Everything looks clean and fit for purpose.

LGTM


@pkg-pr-new

pkg-pr-new Bot commented Apr 24, 2026

  • create-cloudflare: npm i https://pkg.pr.new/create-cloudflare@13662
  • @cloudflare/kv-asset-handler: npm i https://pkg.pr.new/@cloudflare/kv-asset-handler@13662
  • miniflare: npm i https://pkg.pr.new/miniflare@13662
  • @cloudflare/pages-shared: npm i https://pkg.pr.new/@cloudflare/pages-shared@13662
  • @cloudflare/unenv-preset: npm i https://pkg.pr.new/@cloudflare/unenv-preset@13662
  • @cloudflare/vite-plugin: npm i https://pkg.pr.new/@cloudflare/vite-plugin@13662
  • @cloudflare/vitest-pool-workers: npm i https://pkg.pr.new/@cloudflare/vitest-pool-workers@13662
  • @cloudflare/workers-editor-shared: npm i https://pkg.pr.new/@cloudflare/workers-editor-shared@13662
  • @cloudflare/workers-utils: npm i https://pkg.pr.new/@cloudflare/workers-utils@13662
  • wrangler: npm i https://pkg.pr.new/wrangler@13662

commit: eee802a

@petebacondarwin petebacondarwin changed the title from "[ci] TEMP: diagnose start-worker-node Windows CI flake" to "[wrangler] Fix esbuild resource leaks during startDevWorker teardown (+ diagnostic)" Apr 24, 2026
@petebacondarwin petebacondarwin marked this pull request as ready for review April 24, 2026 11:40
@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

✅ All changesets look good

@workers-devprod workers-devprod requested review from a team and penalosa and removed request for a team April 24, 2026 11:40
@workers-devprod
Contributor

workers-devprod commented Apr 24, 2026

Codeowners approval required for this PR:

  • @cloudflare/wrangler
Show detailed file reviewers
  • .github/workflows/test-and-check.yml: [@cloudflare/wrangler]
  • packages/wrangler/src/api/startDevWorker/BundlerController.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/deployment-bundle/bundle.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/dev/use-esbuild.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/tail/index.ts: [@cloudflare/wrangler]

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.


@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from c3814fa to 0d4560f Compare April 24, 2026 11:56
@petebacondarwin petebacondarwin changed the title from "[wrangler] Fix esbuild resource leaks during startDevWorker teardown (+ diagnostic)" to "[wrangler] Fix start-worker-node Windows CI flake by fixing esbuild teardown leaks" Apr 24, 2026
@petebacondarwin petebacondarwin changed the title from "[wrangler] Fix start-worker-node Windows CI flake by fixing esbuild teardown leaks" to "[wrangler] Fix start-worker-node and tail Windows/Linux CI flakes" Apr 24, 2026
@petebacondarwin petebacondarwin changed the title from "[wrangler] Fix start-worker-node and tail Windows/Linux CI flakes" to "[wrangler] Fix start-worker-node (Windows fixtures) and tail (Linux packages) CI flakes" Apr 24, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from 662186f to 1b87b4a Compare April 24, 2026 13:55
@petebacondarwin petebacondarwin marked this pull request as draft April 24, 2026 14:29
@petebacondarwin petebacondarwin removed the request for review from penalosa April 24, 2026 14:29
@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from 628ee25 to bb638cc Compare April 24, 2026 16:08
@petebacondarwin petebacondarwin marked this pull request as ready for review April 24, 2026 17:07
@workers-devprod workers-devprod requested review from a team and removed request for a team April 24, 2026 17:07
…rdown

Three resource leaks in the `DevEnv` teardown path could prevent the
Node process from exiting cleanly after `worker.dispose()` returns:

1. `bundleWorker` in `bundle.ts` scoped `ctx` (the esbuild build
   context) inside the `if (watch)` block. A failing initial build
   threw out of that block, leaving `ctx` unreachable, so the esbuild
   child process never got disposed. Hoist `ctx` to outer scope and
   dispose it in the catch block.

2. `runBuild` in `use-esbuild.ts` returned a fire-and-forget cleanup
   function when the initial build was still in flight. Teardown could
   return before `ctx.dispose()` had been called, so the esbuild
   watcher outlived the parent `worker.dispose()`. Await the build
   promise before calling the bundler's stop handler.

3. `BundlerController.teardown()` removed the bundler's tmp directory
   before awaiting the esbuild cleanup, so an in-flight rebuild would
   error with "Could not resolve .wrangler/tmp/bundle-XXXX/middleware-
   loader.entry.ts" during dispose. Run the esbuild cleanup first,
   then remove the dir. Also abort the bundleBuildAborter so a finishing
   build cannot emit stale bundleStart/bundleComplete events.

These were identified while investigating why the `start-worker-node`
fixture's `config-errors.test.js` times out at node:test's 50s file-
level limit on Windows CI despite every individual test passing in
~1.2s. The describe block completed cleanly but the subprocess could
not exit because the esbuild child process kept the event loop alive.
… fixed

The previous PR #13604 worked around an intermittent flake in the
`start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other
   fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so
   that node:test would sometimes let the subprocess finish before
   cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the
actual esbuild resource leak in `unstable_startWorker` teardown, so
the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main
  fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
The `wrangler tail` command registered both a `tail.on("close", exit)`
listener and a process-level `onExit(exit)` handler via `signal-exit`,
but never removed the latter after `exit()` had run. In long-lived CLI
processes this is harmless — the handler eventually runs once on
shutdown. But in unit tests that repeatedly invoke `wrangler tail` in
the same process, every invocation accumulates a handler that fires
during test-runner shutdown. Those late invocations call `deleteTail()`
after the test's auth mocks have been torn down, producing spurious
"Not logged in" unhandled rejections which fail Linux CI (see e.g.
PR #13622's failing Tests (Linux, packages-and-tools)).

- Capture the remove function returned by `onExit` and call it as soon
  as `exit()` runs, so shutdown never re-fires the handler.
- Guard `exit()` against re-entry so it's idempotent if both the
  WebSocket `close` event and a real signal fire in the same session
  (the existing `pages tails` path already uses the same pattern).
@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from bb638cc to eee802a Compare April 24, 2026 20:42
@petebacondarwin petebacondarwin merged commit f2e2241 into main Apr 25, 2026
52 of 53 checks passed
@petebacondarwin petebacondarwin deleted the pbd/fix-start-worker-node-flake-diagnose branch April 25, 2026 06:20
@github-project-automation github-project-automation Bot moved this from Approved to Done in workers-sdk Apr 25, 2026
vaishnav-mk pushed a commit to vaishnav-mk/workers-sdk that referenced this pull request Apr 27, 2026