[wrangler] Fix start-worker-node (Windows fixtures) and tail (Linux packages) CI flakes #13662
Merged
petebacondarwin merged 3 commits into main on Apr 25, 2026
🦋 Changeset detected. Latest commit: eee802a. The changes in this PR will be included in the next version bump.
workers-devprod (Contributor) approved these changes on Apr 24, 2026. Codeowners reviews satisfied.

> I've reviewed the PR thoroughly. This is a temporary diagnostic-only change to instrument test files and capture event-loop handle information on Windows CI to diagnose a known flake. Let me verify there are no concerns:
>
> Everything looks clean and fit for purpose. LGTM
create-cloudflare
@cloudflare/kv-asset-handler
miniflare
@cloudflare/pages-shared
@cloudflare/unenv-preset
@cloudflare/vite-plugin
@cloudflare/vitest-pool-workers
@cloudflare/workers-editor-shared
@cloudflare/workers-utils
wrangler
✅ All changesets look good
Codeowners approval required for this PR.
Force-pushed c3814fa to 0d4560f, 662186f to 1b87b4a, and 628ee25 to bb638cc.
…rdown

Three resource leaks in the `DevEnv` teardown path could prevent the Node process from exiting cleanly after `worker.dispose()` returns:

1. `bundleWorker` in `bundle.ts` scoped `ctx` (the esbuild build context) inside the `if (watch)` block. A failing initial build threw out of that block, leaving `ctx` unreachable, so the esbuild child process never got disposed. Hoist `ctx` to outer scope and dispose it in the catch block.

2. `runBuild` in `use-esbuild.ts` returned a fire-and-forget cleanup function when the initial build was still in flight. Teardown could return before `ctx.dispose()` had been called, so the esbuild watcher outlived the parent `worker.dispose()`. Await the build promise before calling the bundler's stop handler.

3. `BundlerController.teardown()` removed the bundler's tmp directory before awaiting the esbuild cleanup, so an in-flight rebuild would error with "Could not resolve .wrangler/tmp/bundle-XXXX/middleware-loader.entry.ts" during dispose. Run the esbuild cleanup first, then remove the dir. Also abort the bundleBuildAborter so a finishing build cannot emit stale bundleStart/bundleComplete events.

These were identified while investigating why the `start-worker-node` fixture's `config-errors.test.js` times out at node:test's 50s file-level limit on Windows CI despite every individual test passing in ~1.2s. The describe block completed cleanly, but the subprocess could not exit because the esbuild child process kept the event loop alive.
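The first leak can be sketched with a stand-in build context. This is a minimal illustration of the hoisting pattern, not the real `bundleWorker`: `FakeBuildContext` and `createContext` are hypothetical, standing in for esbuild's `BuildContext` (which holds a child process open in the real code).

```typescript
// Sketch of leak 1 and its fix. `FakeBuildContext` stands in for
// esbuild's BuildContext; disposing it models killing the child process.
interface FakeBuildContext {
	disposed: boolean;
	dispose(): Promise<void>;
}

const created: FakeBuildContext[] = [];

function createContext(): FakeBuildContext {
	const ctx: FakeBuildContext = {
		disposed: false,
		async dispose() {
			this.disposed = true;
		},
	};
	created.push(ctx);
	return ctx;
}

// Before the fix, `ctx` lived inside `if (watch) { ... }`, so a throwing
// initial build left it unreachable. Hoisting it to the outer scope lets
// the catch block dispose it before re-throwing.
async function bundleWorker(watch: boolean, failInitialBuild: boolean) {
	let ctx: FakeBuildContext | undefined; // hoisted out of the if-block
	try {
		if (watch) {
			ctx = createContext();
			if (failInitialBuild) {
				throw new Error("initial build failed");
			}
		}
		return ctx;
	} catch (e) {
		await ctx?.dispose(); // without the hoist, `ctx` is not in scope here
		throw e;
	}
}
```

With this shape a failing initial build still tears down the context: the call rejects, but the created context ends up disposed, so nothing keeps the event loop alive.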
… fixed

The previous PR #13604 worked around an intermittent flake in the `start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so that node:test would sometimes let the subprocess finish before cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the actual esbuild resource leak in `unstable_startWorker` teardown, so the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
The `wrangler tail` command registered both a `tail.on("close", exit)`
listener and a process-level `onExit(exit)` handler via `signal-exit`,
but never removed the latter after `exit()` had run. In long-lived CLI
processes this is harmless — the handler eventually runs once on
shutdown. But in unit tests that repeatedly invoke `wrangler tail` in
the same process, every invocation accumulates a handler that fires
during test-runner shutdown. Those late invocations call `deleteTail()`
after the test's auth mocks have been torn down, producing spurious
"Not logged in" unhandled rejections which fail Linux CI (see e.g.
PR #13622's failing Tests (Linux, packages-and-tools)).
- Capture the remove function returned by `onExit` and call it as soon
as `exit()` runs, so shutdown never re-fires the handler.
- Guard `exit()` against re-entry so it's idempotent if both the
WebSocket `close` event and a real signal fire in the same session
(the existing `pages tails` path already uses the same pattern).
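A minimal sketch of that fix, using a hypothetical stand-in for `signal-exit`'s `onExit` (which, like the real module, returns a function that unregisters the handler). `startTailSession` and `deleteTail` are illustrative names, not the actual `wrangler tail` internals:

```typescript
// Stand-in for signal-exit: register a shutdown handler, get back a
// function that removes it again.
type Handler = () => void;
const exitHandlers = new Set<Handler>();

function onExit(fn: Handler): () => void {
	exitHandlers.add(fn);
	return () => {
		exitHandlers.delete(fn);
	};
}

// One `wrangler tail` session. `deleteTail` models the API call that
// fired after the test's auth mocks were torn down.
function startTailSession(deleteTail: () => void): () => void {
	let exited = false;
	const exit = () => {
		if (exited) return; // idempotent: close event and signal may both fire
		exited = true;
		removeOnExit(); // unregister so process shutdown cannot re-fire us
		deleteTail();
	};
	const removeOnExit = onExit(exit);
	return exit; // in the real code this is also wired to tail.on("close", exit)
}
```

Calling the returned `exit` twice performs the cleanup exactly once, and the process-level handler set is left empty, so repeated invocations in one test worker no longer accumulate handlers.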
Force-pushed bb638cc to eee802a.
vaishnav-mk pushed a commit to vaishnav-mk/workers-sdk that referenced this pull request on Apr 27, 2026:
…ackages) CI flakes (cloudflare#13662)
Fixes two independently reproducible CI flakes that surface whenever the wrangler turbo caches get invalidated.
### 1. `start-worker-node-test` fixture (Windows fixtures CI)

Example failing run: https://github.com/cloudflare/workers-sdk/actions/runs/24879778860/job/72844998017

Every test inside `fixtures/start-worker-node-test/src/config-errors.test.js` passes in ~1.2 s, then the subprocess hangs until `node --test`'s 50 s file-level timeout fires and cancels the file. Two prior PRs (#13515, #13604) attacked individual symptoms (#13604 moved the fixture onto its own CI step with a 50 s timeout), but the flake kept recurring.

#### Root cause
Three compounding bugs in the `DevEnv.teardown()` path leaked the esbuild child process, so the Node event loop could not drain after `worker.dispose()` returned:

1. **`bundleWorker` leaks `esbuild.BuildContext` when the initial build throws.** `packages/wrangler/src/deployment-bundle/bundle.ts` scoped `ctx` inside `if (watch) { }`. An initial-build failure (e.g. an unresolvable entrypoint from the `setConfig` tests) threw out of that block, leaving `ctx` unreachable and its underlying esbuild child process alive for the lifetime of the parent Node process.
2. **`runBuild`'s cleanup closure is fire-and-forget when the build is still in flight.** `packages/wrangler/src/dev/use-esbuild.ts` returned `void buildPromise.then(() => stopWatching?.())`, so teardown could return before `ctx.dispose()` had run.
3. **`BundlerController.teardown()` removes the tmp dir before awaiting esbuild cleanup.** `packages/wrangler/src/api/startDevWorker/BundlerController.ts` deleted `.wrangler/tmp/bundle-XXXX` first, making an in-flight rebuild fail with `Could not resolve .wrangler/tmp/bundle-XXXX/middleware-loader.entry.ts`, the exact noise visible in the failing CI log right after the timeout.

#### Fixes (commit 1)
- Hoist `ctx` to outer scope in `bundleWorker` and dispose it in the `catch`.
- Make `runBuild`'s cleanup closure `await` the build promise before calling `stopWatching`.
- Reorder `BundlerController.teardown()` to dispose esbuild before removing the tmp dir, and abort `#bundleBuildAborter` so a finishing build cannot emit stale `bundleStart`/`bundleComplete` events into a torn-down bus.

#### Revert the #13604 workaround (commit 2)
- Re-include `./fixtures/start-worker-node-test` in the main parallel fixtures test run.
- Delete the separate `Run tests (start-worker-node)` step.
- Drop `node --test --test-timeout` back from 50 s to 15 s.
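The fire-and-forget cleanup problem (bug 2) can be sketched as follows. `runBuild`, `buildPromise`, and `stopWatching` mirror the names above, but the timings and bodies are stand-ins for the `use-esbuild` internals, not the real implementation:

```typescript
// Sketch of leak 2: cleanup must await the in-flight build before
// stopping the watcher, otherwise teardown returns while the watcher
// (a child process in the real code) is still alive.
let watcherStopped = false;

function runBuild() {
	let stopWatching: (() => Promise<void>) | undefined;

	// The "build" finishes asynchronously and only then installs the
	// stop handler, so cleanup may race against it.
	const buildPromise = new Promise<void>((resolve) =>
		setTimeout(() => {
			stopWatching = async () => {
				watcherStopped = true;
			};
			resolve();
		}, 10)
	);

	// Before: `void buildPromise.then(() => stopWatching?.())`, so
	// teardown could return with the watcher still running.
	// After: cleanup awaits the build, then the stop handler.
	return async function cleanup() {
		await buildPromise;
		await stopWatching?.();
	};
}
```

With the awaited version, `cleanup()` resolves only once the watcher has actually been stopped, even when it is invoked while the initial build is still in flight.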
### 2. `tail.test.ts` (Linux `packages-and-tools` CI)

Example failing run: https://github.com/cloudflare/workers-sdk/actions/runs/24888217640/job/72873435065. Also visible intermittently on earlier PRs (e.g. #13622 on 2026-04-22).

All 42 tests in `tail.test.ts` pass, then vitest reports 30× `Unhandled Rejection: Error: Not logged in.` from `src/tail/index.ts:205`.

#### Root cause
`wrangler tail` registered a process-level `onExit(exit)` handler via `signal-exit` but never removed it. In long-lived CLI runs this is harmless: the handler eventually runs once on shutdown. In unit tests that invoke `wrangler tail` dozens of times in the same vitest worker, every invocation accumulates a handler. When the worker later terminates, all of them fire simultaneously, each calling `deleteTail()` → `requireLoggedIn()` after the test's auth mocks have been torn down, producing the spurious rejections.

#### Fix (commit 3)
- Capture the remove function returned by `onExit` and call it the first time `exit()` runs.
- Guard `exit()` against re-entry so it's idempotent if both the WebSocket `close` event and a real signal fire in the same session (the same pattern `packages/wrangler/src/pages/deployment-tails.ts` already uses).

#### Commit layout
- Fix esbuild resource leaks in `unstable_startWorker` teardown; changeset `fix-esbuild-teardown-leaks.md` (wrangler patch).
- Fix the `wrangler tail` exit-listener leak; changeset `fix-wrangler-tail-exit-listener-leak.md` (wrangler patch).

#### Verification
Stress-tested on CI by repeatedly running the fixture suite with `--force --only` to bypass turbo caching. Across one run that exercised the fix 15 times on three platforms, we saw zero hangs, zero timeouts, and zero cancellations on `start-worker-node-test` runs. (Linux and macOS would have run more but hit GitHub's 30 min job cap; Windows ran 2 before an unrelated `@fixture/worker-ratelimit` ECONNRESET Windows networking flake aborted its step.) Every execution completed well under the restored 15 s `node --test --test-timeout`, compared to the pre-fix behaviour where the subprocess would hit the artificially raised 50 s timeout. The stress-test workflow scaffolding has been removed from the branch; only the three fix commits remain.

Also verified:

- `BundlerController` × 8 and the full `startDevWorker` suite × 47 (8 todo) pass.
- `start-worker-node-test` runs cleanly on macOS locally in ~1.6 s.
- The `start-worker-node-test` fixture is what's being stabilised; the tail fix is exercised by the existing `tail.test.ts` that was surfacing the leak.