sf playwright: fix worker-manager EADDRINUSE race + port-conflict diagnostics#4827

Merged
habdelra merged 2 commits into main from worktree-cs-flaky-sf-port-collision
May 14, 2026
@habdelra
Contributor

Summary

A recent flake on shard 2/3 of the SF Playwright suite (e.g. run 25836067730) had the worker-manager subprocess die immediately with EADDRINUSE :::34301, after which the realm fixture polled for runtime.json for the full 240s timeout before declaring failure.

Root cause: findAndHoldAvailablePort (and fixtures.ts:waitForPortFree) bound 127.0.0.1 while the worker-manager child calls listen(port) without a host and therefore binds the dual-stack wildcard ::port. The kernel treats those as distinct scopes for port selection, so the OS port-0 allocator could legitimately hand the holder a port that still had a lingering bind on :: from a previous worker-manager process. The holder's release-before-spawn handoff was correct in isolation; the kernel just gave it a port the next child couldn't use.
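
To make the scope point concrete, here is a minimal sketch of a holder that reserves a port on the wildcard address rather than on 127.0.0.1. The name `holdWildcardPort` and its return shape are hypothetical and are not the harness's actual `findAndHoldAvailablePort`; the sketch only assumes Node's documented behavior that `listen(port)` without a host binds the unspecified address (`::` when IPv6 is available, otherwise `0.0.0.0`).

```typescript
import * as net from "node:net";

// Hypothetical helper for illustration only; the real holder may differ.
// Calling listen(0) with no host asks the OS for a free port on the wildcard
// address, which is the same scope a child calling listen(port) without a
// host will bind later.
function holdWildcardPort(): Promise<{ port: number; release: () => Promise<void> }> {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once("error", reject);
    server.listen(0, () => {
      const address = server.address();
      if (address === null || typeof address === "string") {
        server.close();
        reject(new Error("expected a TCP address from the holder socket"));
        return;
      }
      resolve({
        port: address.port,
        // Resolve only once the socket is fully closed, so the caller can
        // hand the port to a child without racing the close.
        release: () => new Promise<void>((done) => server.close(() => done())),
      });
    });
  });
}
```

A caller would await `holdWildcardPort()`, keep the port reserved while preparing the child, then await `release()` immediately before spawning.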

Changes:

  • findAndHoldAvailablePort and waitForPortFree now bind wildcard, so allocation and teardown both observe the same scope the child will use.
  • New diagnosePortConflict(port) probes the contested port across 127.0.0.1, 0.0.0.0, ::1, and ::, plus a best-effort ss -tlnp lookup, so any recurrence leaves a trail describing which interface holds the port and which pid.
  • startIsolatedRealmStack retries the worker-manager spawn once on diagnosable EADDRINUSE with a freshly-allocated port (mirrors the existing prerender-server retry pattern). Retry is skipped when the caller pinned an explicit port — there's nowhere else to put it.

The diagnostic logs on every EADDRINUSE detection, even when the retry succeeds, so we still get telemetry on whether the wildcard-bind fix is fully effective in the wild.
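
The retry policy described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `startIsolatedRealmStack`: `spawnWithOneRetry`, `SpawnResult`, and the injected `allocatePort`, `spawn`, and `diagnose` callbacks are all hypothetical names introduced so the control flow can be shown in isolation.

```typescript
// Hypothetical result type: the spawn either succeeds or reports an error code.
type SpawnResult = { ok: true; port: number } | { ok: false; code: string };

// Hypothetical sketch of "retry once on EADDRINUSE unless the port is pinned".
async function spawnWithOneRetry(opts: {
  pinnedPort?: number;
  allocatePort: () => Promise<number>;
  spawn: (port: number) => Promise<SpawnResult>;
  diagnose: (port: number) => Promise<void>;
}): Promise<SpawnResult> {
  const firstPort = opts.pinnedPort ?? (await opts.allocatePort());
  const first = await opts.spawn(firstPort);
  if (first.ok || first.code !== "EADDRINUSE") {
    return first;
  }
  // Diagnose on every EADDRINUSE, even when the retry below succeeds.
  await opts.diagnose(firstPort);
  if (opts.pinnedPort !== undefined) {
    // The caller pinned the port, so there is no alternative port to try.
    return first;
  }
  // Exactly one retry, on a freshly allocated port.
  return opts.spawn(await opts.allocatePort());
}
```

Injecting the callbacks keeps the policy testable without spawning real processes; the actual harness may structure this differently.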

Test plan

  • CI Software Factory Tests pass on this PR
  • No regression on shards 1/3 and 3/3 over a couple of runs
  • (If a flake recurs) the new worker manager EADDRINUSE on port … port-conflict probe … line is present in the failed-job log

…flicts

The findAndHoldAvailablePort holder bound `127.0.0.1:0` while the
worker-manager child (and other harness children) calls `listen(port)`
without a host, binding the dual-stack wildcard `::port`. The kernel
treated those as distinct scopes for port selection, so the OS port-0
allocator could legitimately hand the holder a port that still had a
lingering bind on `::` from a previous run — the next child then
crashed with `EADDRINUSE :::port` even though the holder had been
correctly released.

- findAndHoldAvailablePort and fixtures.ts:waitForPortFree now bind
  wildcard, so allocation/teardown both see the same scope the child
  will use.
- New diagnosePortConflict probes the contested port across
  127.0.0.1 / 0.0.0.0 / ::1 / :: and shells out to `ss -tlnp` so any
  recurrence leaves a trail describing which interface holds it.
- startIsolatedRealmStack retries the worker-manager spawn once on
  diagnosable EADDRINUSE with a fresh port (mirrors the existing
  prerender-server retry pattern). Retry is skipped when the caller
  pinned an explicit port.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses Software Factory Playwright realm startup flakes caused by port scope mismatches during worker-manager startup and improves diagnostics when EADDRINUSE occurs.

Changes:

  • Switches port availability checks/reservations to wildcard binds to match child process bind behavior.
  • Adds diagnosePortConflict(port) for bind-scope and ss diagnostics.
  • Adds a one-time worker-manager retry on detected dynamic-port EADDRINUSE.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| packages/software-factory/tests/fixtures.ts | Updates teardown port-free probing to use wildcard binding. |
| packages/realm-test-harness/src/shared.ts | Adds port conflict diagnostics and changes held-port allocation to wildcard binding. |
| packages/realm-test-harness/src/isolated-realm-stack.ts | Adds worker-manager bind failure detection, diagnostics, and retry logic. |
| packages/realm-test-harness/src/index.ts | Re-exports the new diagnostic helper. |

Comment thread packages/realm-test-harness/src/shared.ts
probeBind resolved its promise immediately after calling
`server.close()`, which is asynchronous. In diagnosePortConflict the
probes run sequentially, so a leftover-closing socket from a `FREE`
scope could surface as a false EADDRINUSE on the next scope, masking
which interface actually owns the contested port.

Wait for the close callback before resolving so each probe leaves no
state behind.
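
A minimal sketch of the fix described in this commit message, assuming a probe shaped roughly like the one below. The signatures of `probeBind` and `probeScopes` here are guesses for illustration and may not match the real helpers in shared.ts; the point is only that the promise resolves from the `close` callback rather than right after `close()` is called.

```typescript
import * as net from "node:net";

// Hypothetical probe: try to bind one host/port pair and report the outcome
// as "FREE" or the error code (for example "EADDRINUSE").
function probeBind(port: number, host: string): Promise<string> {
  return new Promise((resolve) => {
    const server = net.createServer();
    server.once("error", (err: Error & { code?: string }) => {
      resolve(err.code ?? "UNKNOWN");
    });
    server.listen(port, host, () => {
      // server.close() is asynchronous. Resolving inside its callback
      // guarantees the socket is released before the next probe starts.
      server.close(() => resolve("FREE"));
    });
  });
}

// Probes run one after another, so each must leave no listening socket behind.
async function probeScopes(port: number, hosts: string[]): Promise<Record<string, string>> {
  const results: Record<string, string> = {};
  for (const host of hosts) {
    results[host] = await probeBind(port, host);
  }
  return results;
}
```

For example, `probeScopes(port, ["127.0.0.1", "0.0.0.0", "::1", "::"])` would cover the four scopes named in the PR; on hosts without IPv6 the `::1` and `::` entries would report an error code instead of `FREE`.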

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 14, 2026 02:14
@habdelra habdelra merged commit 4e0f0df into main May 14, 2026
29 checks passed
3 participants