[wrangler] Fix start-worker-node (Windows fixtures) and tail (Linux packages) CI flakes #13662

Merged
petebacondarwin merged 3 commits into main from pbd/fix-start-worker-node-flake-diagnose
Apr 25, 2026

Conversation

@petebacondarwin
Contributor

@petebacondarwin petebacondarwin commented Apr 24, 2026

Fixes two independently reproducible CI flakes that surface whenever the wrangler turbo caches get invalidated.

1. start-worker-node-test fixture — Windows fixtures CI

Example failing run: https://github.com/cloudflare/workers-sdk/actions/runs/24879778860/job/72844998017

Every test inside fixtures/start-worker-node-test/src/config-errors.test.js passes in ~1.2 s, then the subprocess hangs until node --test's 50 s file-level timeout fires and cancels the file. Two prior PRs (#13515, #13604) attacked individual symptoms — #13604 moved the fixture onto its own CI step with a 50 s timeout — but the flake kept recurring.

Root cause

Three compounding bugs in the DevEnv.teardown() path leaked the esbuild child process so the Node event loop could not drain after worker.dispose() returned:

  1. bundleWorker leaks esbuild.BuildContext when the initial build throws. packages/wrangler/src/deployment-bundle/bundle.ts scoped ctx inside if (watch) { }. An initial-build failure (e.g. unresolvable entrypoint from the setConfig tests) threw out of that block, leaving ctx unreachable and its underlying esbuild child process alive for the lifetime of the parent Node process.
  2. runBuild's cleanup closure is fire-and-forget when the build is still in flight. packages/wrangler/src/dev/use-esbuild.ts returned void buildPromise.then(() => stopWatching?.()), so teardown could return before ctx.dispose() had run.
  3. BundlerController.teardown() removes the tmp dir before awaiting esbuild cleanup. packages/wrangler/src/api/startDevWorker/BundlerController.ts deleted .wrangler/tmp/bundle-XXXX first, making an in-flight rebuild fail with Could not resolve .wrangler/tmp/bundle-XXXX/middleware-loader.entry.ts — the exact noise visible in the failing CI log right after the timeout.
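The first bug reduces to a scoping problem. The sketch below is illustrative only (FakeBuildContext, bundleWorkerFixed, and createContext are hypothetical stand-ins, not wrangler's actual bundle.ts code); it shows how hoisting ctx lets the catch block dispose the context that a failing initial build would otherwise strand:

```typescript
// Hypothetical model of esbuild's context shape (rebuild/dispose);
// in real esbuild, dispose() is what terminates the child process.
class FakeBuildContext {
	disposed = false;
	async rebuild(): Promise<void> {
		// Simulates a failing initial build, e.g. an unresolvable entrypoint.
		throw new Error("Could not resolve entry-point");
	}
	async dispose(): Promise<void> {
		this.disposed = true;
	}
}

// Fixed shape: `ctx` is hoisted out of the `if (watch)` block, so the
// catch can still reach it when the initial build throws.
async function bundleWorkerFixed(
	watch: boolean,
	createContext: () => FakeBuildContext
): Promise<void> {
	let ctx: FakeBuildContext | undefined;
	try {
		if (watch) {
			ctx = createContext();
			await ctx.rebuild();
		}
	} catch (e) {
		// Before the fix, ctx was block-scoped and unreachable here, so
		// the esbuild child process outlived the build failure.
		await ctx?.dispose();
		throw e;
	}
}
```

The error still propagates to the caller; the only behavioural change is that the context is released on the failure path.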

Fixes (commit 1)

  • Hoist ctx to outer scope in bundleWorker and dispose it in the catch.
  • Make runBuild's cleanup closure await the build promise before calling stopWatching.
  • Reorder BundlerController.teardown() to dispose esbuild before removing the tmp dir, and abort #bundleBuildAborter so a finishing build cannot emit stale bundleStart / bundleComplete events into a torn-down bus.
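The last two fixes amount to an ordering change in teardown. A minimal sketch, under the assumption of illustrative wiring (teardownFixed and its parameters are hypothetical, though the names mirror the PR description):

```typescript
// Illustrative teardown ordering for fixes 2 and 3; this is a sketch,
// not BundlerController's actual implementation.
async function teardownFixed(opts: {
	bundleBuildAborter: AbortController;
	buildPromise: Promise<unknown>;
	stopWatching: () => Promise<void>;
	removeTmpDir: () => void;
}): Promise<void> {
	// Abort first so a build that finishes mid-teardown cannot emit
	// stale bundleStart/bundleComplete events into a torn-down bus.
	opts.bundleBuildAborter.abort();
	// Fix 2: await the in-flight build instead of the old fire-and-forget
	// `void buildPromise.then(() => stopWatching?.())`.
	await opts.buildPromise.catch(() => {});
	// ctx.dispose() under the hood: terminates the esbuild child process.
	await opts.stopWatching();
	// Fix 3: only delete the tmp dir once no rebuild can still try to
	// resolve files inside it.
	opts.removeTmpDir();
}
```

The invariant is simply "esbuild fully disposed before its working directory disappears", which is what the old ordering violated.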

Revert the #13604 workaround (commit 2)

  • Re-include ./fixtures/start-worker-node-test in the main parallel fixtures test run.
  • Delete the separate Run tests (start-worker-node) step.
  • Drop the node --test --test-timeout back from 50 s to 15 s.

2. tail.test.ts — Linux packages-and-tools CI

Example failing run: https://github.com/cloudflare/workers-sdk/actions/runs/24888217640/job/72873435065. Also visible intermittently on earlier PRs (e.g. #13622 on 2026-04-22). All 42 tests in tail.test.ts pass, then vitest reports 30× Unhandled Rejection: Error: Not logged in. from src/tail/index.ts:205.

Root cause

wrangler tail registered a process-level onExit(exit) handler via signal-exit but never removed it. In long-lived CLI runs this is harmless — the handler eventually runs once on shutdown. In unit tests that invoke wrangler tail dozens of times in the same vitest worker, every invocation accumulates a handler. When the worker later terminates, all of them fire simultaneously — each calling deleteTail(), which calls requireLoggedIn(), after the test's auth mocks have been torn down, producing the spurious rejections.

Fix (commit 3)

  • Capture the remove-function returned by onExit and call it the first time exit() runs.
  • Guard exit() against re-entry so it's idempotent if both the WebSocket close event and a real signal fire in the same session (the same pattern packages/wrangler/src/pages/deployment-tails.ts already uses).
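A sketch of the pattern, using a toy onExit in place of the signal-exit package (whose real onExit likewise returns a remover); startTailSession and deleteTail below are hypothetical stand-ins for the code in src/tail/index.ts:

```typescript
// Toy stand-in for signal-exit: registers a process-exit handler and
// returns a function that unregisters it.
const exitHandlers: Array<() => void> = [];
function onExit(fn: () => void): () => void {
	exitHandlers.push(fn);
	return () => {
		const i = exitHandlers.indexOf(fn);
		if (i !== -1) exitHandlers.splice(i, 1);
	};
}

// Fixed session shape: capture the remover, make exit() idempotent, and
// unregister on the first call so repeated invocations in one process
// never accumulate handlers.
function startTailSession(deleteTail: () => void): () => void {
	let exited = false;
	const removeOnExit = onExit(exit);
	function exit(): void {
		if (exited) return; // WebSocket close + real signal: runs only once
		exited = true;
		removeOnExit(); // the leaked handler the old code never removed
		deleteTail();
	}
	return exit;
}
```

With this shape, thirty tail invocations in one test worker leave zero handlers registered at shutdown, so nothing fires after the auth mocks are gone.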

Commit layout

Commit | Scope | Changeset
[wrangler] Fix esbuild resource leaks during unstable_startWorker teardown | §1 root cause | fix-esbuild-teardown-leaks.md (wrangler patch)
[ci] Revert start-worker-node flake workaround now that root cause is fixed | revert #13604 | none
[wrangler] Fix signal-exit listener leak in wrangler tail | §2 root cause | fix-wrangler-tail-exit-listener-leak.md (wrangler patch)

Verification

Stress-tested on CI by repeatedly running the fixture suite with --force --only to bypass turbo caching. Across one run that exercised the fix 15 times on three platforms we saw zero hangs, zero timeouts, zero cancellations on start-worker-node-test:

Platform | Clean cache-miss start-worker-node runs | Duration range
Linux | 7 | 1.58–1.98 s
Windows | 2 | 3.06–3.29 s
macOS | 6 | 1.34–3.02 s

(Linux and macOS would have run more but hit GitHub's 30-minute job cap; Windows completed 2 runs before an unrelated ECONNRESET networking flake in @fixture/worker-ratelimit aborted its step.) Every execution completed well under the restored 15 s node --test --test-timeout, whereas before the fix the subprocess would hit the artificially raised 50 s timeout. The stress-test workflow scaffolding has been removed from the branch — only the three fix commits remain.

Also verified:

  • BundlerController × 8 and full startDevWorker suite × 47 (8 todo) pass.
  • start-worker-node-test runs cleanly on macOS locally in ~1.6 s.

  • Tests
    • Tests included/updated — the start-worker-node-test fixture is what's being stabilised; the tail fix is exercised by the existing tail.test.ts that was surfacing the leak.
  • Public documentation
    • Documentation not necessary because: bug fixes, no user-facing API change.

@changeset-bot

changeset-bot Bot commented Apr 24, 2026

🦋 Changeset detected

Latest commit: eee802a

The changes in this PR will be included in the next version bump.


Contributor

@workers-devprod workers-devprod left a comment


Codeowners reviews satisfied

@github-project-automation github-project-automation Bot moved this from Untriaged to Approved in workers-sdk Apr 24, 2026
@ask-bonk
Contributor

ask-bonk Bot commented Apr 24, 2026

I've reviewed the PR thoroughly. This is a temporary diagnostic-only change to instrument test files and capture event-loop handle information on Windows CI to diagnose a known flake. Let me verify there are no concerns:

  1. No product code changes -- only test fixture files are modified.
  2. The diagnostic file is clearly marked as temporary with a comment at the top.
  3. The after hook ordering in index.test.js -- registerLeakDiagnostic is called after the existing after that runs worker.dispose(). In node:test, after hooks within a describe run in registration order (FIFO), so the diagnostic will correctly run after dispose(). Good.
  4. The after hook ordering in config-errors.test.js -- registerLeakDiagnostic is registered at the top of the describe block, before the tests. Since config-errors.test.js calls worker.dispose() inline within each test (not in an after hook), registration order doesn't matter here -- the diagnostic after will run after all tests complete, which is correct.
  5. The 500ms delay is intentionally ref'd (not unref'd) as documented in the comment -- reasonable for diagnostic purposes.
  6. Use of process._getActiveHandles() / process._getActiveRequests() -- undocumented but standard diagnostic APIs, fine for temporary instrumentation.
  7. No changeset needed -- this is test-only instrumentation with no user-facing impact, and the PR is marked draft.

Everything looks clean and fit for purpose.

LGTM


@pkg-pr-new

pkg-pr-new Bot commented Apr 24, 2026

  • create-cloudflare: npm i https://pkg.pr.new/create-cloudflare@13662
  • @cloudflare/kv-asset-handler: npm i https://pkg.pr.new/@cloudflare/kv-asset-handler@13662
  • miniflare: npm i https://pkg.pr.new/miniflare@13662
  • @cloudflare/pages-shared: npm i https://pkg.pr.new/@cloudflare/pages-shared@13662
  • @cloudflare/unenv-preset: npm i https://pkg.pr.new/@cloudflare/unenv-preset@13662
  • @cloudflare/vite-plugin: npm i https://pkg.pr.new/@cloudflare/vite-plugin@13662
  • @cloudflare/vitest-pool-workers: npm i https://pkg.pr.new/@cloudflare/vitest-pool-workers@13662
  • @cloudflare/workers-editor-shared: npm i https://pkg.pr.new/@cloudflare/workers-editor-shared@13662
  • @cloudflare/workers-utils: npm i https://pkg.pr.new/@cloudflare/workers-utils@13662
  • wrangler: npm i https://pkg.pr.new/wrangler@13662

commit: eee802a

@petebacondarwin petebacondarwin changed the title from "[ci] TEMP: diagnose start-worker-node Windows CI flake" to "[wrangler] Fix esbuild resource leaks during startDevWorker teardown (+ diagnostic)" Apr 24, 2026
@petebacondarwin petebacondarwin marked this pull request as ready for review April 24, 2026 11:40
@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

✅ All changesets look good

@workers-devprod workers-devprod requested review from a team and penalosa and removed request for a team April 24, 2026 11:40
@workers-devprod
Contributor

workers-devprod commented Apr 24, 2026

Codeowners approval required for this PR:

  • @cloudflare/wrangler
Show detailed file reviewers
  • .github/workflows/test-and-check.yml: [@cloudflare/wrangler]
  • packages/wrangler/src/api/startDevWorker/BundlerController.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/deployment-bundle/bundle.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/dev/use-esbuild.ts: [@cloudflare/wrangler]
  • packages/wrangler/src/tail/index.ts: [@cloudflare/wrangler]

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.


@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from c3814fa to 0d4560f Compare April 24, 2026 11:56
@petebacondarwin petebacondarwin changed the title from "[wrangler] Fix esbuild resource leaks during startDevWorker teardown (+ diagnostic)" to "[wrangler] Fix start-worker-node Windows CI flake by fixing esbuild teardown leaks" Apr 24, 2026
@petebacondarwin petebacondarwin changed the title from "[wrangler] Fix start-worker-node Windows CI flake by fixing esbuild teardown leaks" to "[wrangler] Fix start-worker-node and tail Windows/Linux CI flakes" Apr 24, 2026
@petebacondarwin petebacondarwin changed the title from "[wrangler] Fix start-worker-node and tail Windows/Linux CI flakes" to "[wrangler] Fix start-worker-node (Windows fixtures) and tail (Linux packages) CI flakes" Apr 24, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from 662186f to 1b87b4a Compare April 24, 2026 13:55
@petebacondarwin petebacondarwin marked this pull request as draft April 24, 2026 14:29
@petebacondarwin petebacondarwin removed the request for review from penalosa April 24, 2026 14:29
@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from 628ee25 to bb638cc Compare April 24, 2026 16:08
@petebacondarwin petebacondarwin marked this pull request as ready for review April 24, 2026 17:07
@workers-devprod workers-devprod requested review from a team and removed request for a team April 24, 2026 17:07
…rdown

Three resource leaks in the `DevEnv` teardown path could prevent the
Node process from exiting cleanly after `worker.dispose()` returns:

1. `bundleWorker` in `bundle.ts` scoped `ctx` (the esbuild build
   context) inside the `if (watch)` block. A failing initial build
   threw out of that block, leaving `ctx` unreachable, so the esbuild
   child process never got disposed. Hoist `ctx` to outer scope and
   dispose it in the catch block.

2. `runBuild` in `use-esbuild.ts` returned a fire-and-forget cleanup
   function when the initial build was still in flight. Teardown could
   return before `ctx.dispose()` had been called, so the esbuild
   watcher outlived the parent `worker.dispose()`. Await the build
   promise before calling the bundler's stop handler.

3. `BundlerController.teardown()` removed the bundler's tmp directory
   before awaiting the esbuild cleanup, so an in-flight rebuild would
   error with "Could not resolve .wrangler/tmp/bundle-XXXX/middleware-
   loader.entry.ts" during dispose. Run the esbuild cleanup first,
   then remove the dir. Also abort the bundleBuildAborter so a finishing
   build cannot emit stale bundleStart/bundleComplete events.

These were identified while investigating why the `start-worker-node`
fixture's `config-errors.test.js` times out at node:test's 50s file-
level limit on Windows CI despite every individual test passing in
~1.2s. The describe block completed cleanly but the subprocess could
not exit because the esbuild child process kept the event loop alive.
… fixed

The previous PR #13604 worked around an intermittent flake in the
`start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other
   fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so
   that node:test would sometimes let the subprocess finish before
   cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the
actual esbuild resource leak in `unstable_startWorker` teardown, so
the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main
  fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
The `wrangler tail` command registered both a `tail.on("close", exit)`
listener and a process-level `onExit(exit)` handler via `signal-exit`,
but never removed the latter after `exit()` had run. In long-lived CLI
processes this is harmless — the handler eventually runs once on
shutdown. But in unit tests that repeatedly invoke `wrangler tail` in
the same process, every invocation accumulates a handler that fires
during test-runner shutdown. Those late invocations call `deleteTail()`
after the test's auth mocks have been torn down, producing spurious
"Not logged in" unhandled rejections which fail Linux CI (see e.g.
PR #13622's failing Tests (Linux, packages-and-tools)).

- Capture the remove function returned by `onExit` and call it as soon
  as `exit()` runs, so shutdown never re-fires the handler.
- Guard `exit()` against re-entry so it's idempotent if both the
  WebSocket `close` event and a real signal fire in the same session
  (the existing `pages tails` path already uses the same pattern).
@petebacondarwin petebacondarwin force-pushed the pbd/fix-start-worker-node-flake-diagnose branch from bb638cc to eee802a Compare April 24, 2026 20:42
@petebacondarwin petebacondarwin merged commit f2e2241 into main Apr 25, 2026
52 of 53 checks passed
@petebacondarwin petebacondarwin deleted the pbd/fix-start-worker-node-flake-diagnose branch April 25, 2026 06:20
@github-project-automation github-project-automation Bot moved this from Approved to Done in workers-sdk Apr 25, 2026
vaishnav-mk pushed a commit to vaishnav-mk/workers-sdk that referenced this pull request Apr 27, 2026