sf playwright: fix worker-manager EADDRINUSE race + port-conflict diagnostics #4827
Merged
…flicts

The findAndHoldAvailablePort holder bound `127.0.0.1:0` while the worker-manager child (and other harness children) calls `listen(port)` without a host, binding the dual-stack wildcard `::port`. The kernel treated those as distinct scopes for port selection, so the OS port-0 allocator could legitimately hand the holder a port that still had a lingering bind on `::` from a previous run; the next child then crashed with `EADDRINUSE :::port` even though the holder had been correctly released.

- findAndHoldAvailablePort and fixtures.ts:waitForPortFree now bind wildcard, so allocation/teardown both see the same scope the child will use.
- New diagnosePortConflict probes the contested port across 127.0.0.1 / 0.0.0.0 / ::1 / :: and shells out to `ss -tlnp` so any recurrence leaves a trail describing which interface holds it.
- startIsolatedRealmStack retries the worker-manager spawn once on diagnosable EADDRINUSE with a fresh port (mirrors the existing prerender-server retry pattern). Retry is skipped when the caller pinned an explicit port.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR addresses Software Factory Playwright realm startup flakes caused by port scope mismatches during worker-manager startup and improves diagnostics when EADDRINUSE occurs.
Changes:
- Switches port availability checks/reservations to wildcard binds to match child process bind behavior.
- Adds `diagnosePortConflict(port)` for bind-scope and `ss` diagnostics.
- Adds a one-time worker-manager retry on detected dynamic-port `EADDRINUSE`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `packages/software-factory/tests/fixtures.ts` | Updates teardown port-free probing to use wildcard binding. |
| `packages/realm-test-harness/src/shared.ts` | Adds port conflict diagnostics and changes held-port allocation to wildcard binding. |
| `packages/realm-test-harness/src/isolated-realm-stack.ts` | Adds worker-manager bind failure detection, diagnostics, and retry logic. |
| `packages/realm-test-harness/src/index.ts` | Re-exports the new diagnostic helper. |
probeBind resolved its promise immediately after calling `server.close()`, which is asynchronous. In diagnosePortConflict the probes run sequentially, so a leftover-closing socket from a `FREE` scope could surface as a false EADDRINUSE on the next scope, masking which interface actually owns the contested port. Wait for the close callback before resolving so each probe leaves no state behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
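A minimal sketch of the close-before-resolve fix described in this commit, assuming a `probeBind(port, host)` shape that reports either `FREE` or the bind error code. The real helper's signature and return type are not shown in this PR page, so `probeBind`'s exact shape and the `probeScopes` wrapper here are illustrative assumptions, not the harness code.

```typescript
import * as net from "node:net";

// Errors from a failed bind carry a string `code` such as "EADDRINUSE".
type BindError = Error & { code?: string };

// Hypothetical probe: try to bind host:port, report "FREE" or the error code.
function probeBind(port: number, host: string): Promise<string> {
  return new Promise((resolve) => {
    const server = net.createServer();
    // A failed bind never reaches the listening state, so there is nothing to close.
    server.once("error", (err: BindError) => resolve(err.code ?? "UNKNOWN"));
    server.listen({ port, host }, () => {
      // server.close() is asynchronous: resolve only from its callback so the
      // socket is fully released before the caller probes the next scope.
      server.close(() => resolve("FREE"));
    });
  });
}

// Probe the four scopes named in the PR one after another (assumed ordering).
async function probeScopes(port: number): Promise<Record<string, string>> {
  const results: Record<string, string> = {};
  for (const host of ["127.0.0.1", "0.0.0.0", "::1", "::"]) {
    results[host] = await probeBind(port, host);
  }
  return results;
}
```

Because each probe awaits the close callback, a `FREE` result on one scope cannot leak a still-closing socket into the next probe.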
jurgenwerk
approved these changes
May 14, 2026
Summary
Recent flake on shard 2/3 of the SF Playwright suite (e.g. run 25836067730) had the worker-manager subprocess die immediately with `EADDRINUSE :::34301`, and the realm fixture polled for `runtime.json` for the full 240s timeout before declaring failure.

Root cause: `findAndHoldAvailablePort` (and `fixtures.ts:waitForPortFree`) bound `127.0.0.1` while the worker-manager child calls `listen(port)` without a host and therefore binds the dual-stack wildcard `::port`. The kernel treats those as distinct scopes for port selection, so the OS port-0 allocator could legitimately hand the holder a port that still had a lingering bind on `::` from a previous worker-manager process. The holder's release-before-spawn handoff was correct in isolation; the kernel just gave it a port the next child couldn't use.

Changes:
- `findAndHoldAvailablePort` and `waitForPortFree` now bind wildcard, so allocation and teardown both observe the same scope the child will use.
- `diagnosePortConflict(port)` probes the contested port across `127.0.0.1`, `0.0.0.0`, `::1`, and `::`, plus a best-effort `ss -tlnp` lookup, so any recurrence leaves a trail describing which interface holds the port and which pid.
- `startIsolatedRealmStack` retries the worker-manager spawn once on diagnosable EADDRINUSE with a freshly-allocated port (mirrors the existing prerender-server retry pattern). Retry is skipped when the caller pinned an explicit port, since there's nowhere else to put it.

The diagnostic logs on every EADDRINUSE detection, even when the retry succeeds, so we still get telemetry on whether the wildcard-bind fix is fully effective in the wild.
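The wildcard allocation and the single retry described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness code: `holdWildcardPort`, `allocatePort`, and `startWithOneRetry` are hypothetical names, and the real worker-manager is a subprocess whose failure is detected by the harness rather than surfaced as a thrown error with a `code` property.

```typescript
import * as net from "node:net";

interface HeldPort {
  port: number;
  release: () => Promise<void>;
}

// Hypothetical holder: listen(0) with no host binds the wildcard address
// (`::` when IPv6 is available, otherwise `0.0.0.0`), the same scope a child
// calling listen(port) without a host will use.
function holdWildcardPort(): Promise<HeldPort> {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once("error", reject);
    server.listen(0, () => {
      const address = server.address();
      if (address === null || typeof address === "string") {
        server.close();
        reject(new Error("unexpected listen address"));
        return;
      }
      resolve({
        port: address.port,
        // Resolve only once the socket is fully closed.
        release: () => new Promise<void>((done) => server.close(() => done())),
      });
    });
  });
}

// Allocate a port on the wildcard scope and release it for the child to take.
async function allocatePort(): Promise<number> {
  const held = await holdWildcardPort();
  await held.release();
  return held.port;
}

// Hypothetical retry wrapper: one extra attempt on EADDRINUSE with a fresh
// port, skipped when the caller pinned a port.
async function startWithOneRetry<T>(
  start: (port: number) => Promise<T>,
  pinnedPort?: number,
): Promise<T> {
  const firstPort = pinnedPort ?? (await allocatePort());
  try {
    return await start(firstPort);
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code !== "EADDRINUSE" || pinnedPort !== undefined) {
      throw err;
    }
    // The PR runs its port-conflict diagnostic at this point; omitted here.
    return start(await allocatePort());
  }
}
```

A caller that pins a port gets the original error back unchanged, which matches the PR's note that a pinned port leaves no alternative to retry on.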
Test plan
- `worker manager EADDRINUSE on port … port-conflict probe …` line is present in the failed-job log