flaky test - wait-for-host-standby: retry puppeteer.launch with diagnostics #4872

Open
habdelra wants to merge 1 commit into main from worktree-flaky-wait-for-host-standby

@habdelra

Why

Boxel CLI Tests on PR #4863 failed with this in the dev-stack startup:

[start:prerender-dev] [wait-for-host-standby] probing https://localhost:4200/_standby (max 600s)...
[start:prerender-dev] [wait-for-host-standby] unexpected failure: TimeoutError: Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!
    at ChromeLauncher.launch (puppeteer-core/src/node/BrowserLauncher.ts:260:15)
    at async main (packages/realm-server/scripts/wait-for-host-standby.ts:84:17)

puppeteer.launch timed out before the script's existing retry loop could even start. That loop only covers the post-launch page.goto and waitForFunction phases, so a slow Chrome cold start on a loaded CI runner aborts the whole script, which fails the Start dev stack step, which fails the job. Chrome cold-start time is a well-known source of flakiness on CI.

What this PR changes

packages/realm-server/scripts/wait-for-host-standby.ts: wrap puppeteer.launch in its own retry helper.

  • explicit 90s launch timeout per attempt (default was 30s)
  • up to 3 attempts with a 2s backoff between them, capped by the same 10-minute total deadline as the navigation loop
  • structured log lines for each attempt (executable path, args, dumpio, timeout, success/failure timing) so a future flake is debuggable from the CI log alone
  • on the final attempt, enable puppeteer's dumpio so Chrome's own stdout/stderr is piped through node — if launch still fails after the retries, we capture why (sandbox denial, missing shared library, GPU init crash) instead of a bare "Timed out … waiting for the WS endpoint URL"

The WAIT_FOR_HOST_STANDBY_VERBOSE=1 env var also now forces dumpio: true on every attempt for local repro.
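For readers without the diff open, the retry policy described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the name launchWithRetry, the RetryConfig shape, and the injected launch/sleep/now/log functions are all hypothetical, and clamping the per-attempt timeout to the time left before the deadline is one assumed reading of "capped by the same 10-minute total deadline".

```typescript
// Hypothetical sketch of the retry policy; names and shapes are assumptions,
// not the actual launchBrowserWithRetry implementation from the PR.
type LaunchOptions = { timeout: number; dumpio: boolean };
type Launcher<T> = (opts: LaunchOptions) => Promise<T>;

interface RetryConfig {
  maxAttempts: number; // 3 in the PR description
  backoffMs: number; // 2000 in the PR description
  launchTimeoutMs: number; // 90000 in the PR description
  deadline: number; // absolute time (ms) shared with the navigation loop
  verbose: boolean; // true when WAIT_FOR_HOST_STANDBY_VERBOSE=1
  sleep?: (ms: number) => Promise<void>;
  now?: () => number;
  log?: (line: string) => void;
}

async function launchWithRetry<T>(
  launch: Launcher<T>,
  cfg: RetryConfig,
): Promise<T> {
  const sleep =
    cfg.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const now = cfg.now ?? (() => Date.now());
  const log = cfg.log ?? ((line: string) => console.log(line));
  let lastError: unknown;

  for (let attempt = 1; attempt <= cfg.maxAttempts; attempt++) {
    const remaining = cfg.deadline - now();
    if (remaining <= 0) break; // total deadline already spent

    const isFinal = attempt === cfg.maxAttempts;
    const opts: LaunchOptions = {
      // Assumption: never wait longer than the time left before the deadline.
      timeout: Math.min(cfg.launchTimeoutMs, remaining),
      // dumpio on the final attempt, or on every attempt in verbose mode.
      dumpio: cfg.verbose || isFinal,
    };

    const started = now();
    log(
      `launch attempt ${attempt}/${cfg.maxAttempts} ` +
        `timeout=${opts.timeout}ms dumpio=${opts.dumpio}`,
    );
    try {
      const browser = await launch(opts);
      log(`launch succeeded after ${now() - started}ms`);
      return browser;
    } catch (err) {
      lastError = err;
      log(`launch attempt ${attempt} failed after ${now() - started}ms: ${String(err)}`);
      if (!isFinal) await sleep(cfg.backoffMs);
    }
  }
  throw lastError ?? new Error('deadline exceeded before first launch attempt');
}
```

In the real script the injected launcher would presumably be a thin wrapper such as `(opts) => puppeteer.launch({ ...baseOptions, timeout: opts.timeout, dumpio: opts.dumpio })`, where `baseOptions` is a placeholder for whatever launch options the script already passes. Injecting the launcher also keeps the policy testable without starting Chrome.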

What this does NOT do

  • Doesn't touch BrowserManager in the prerender server itself. That path also uses the puppeteer default 30s launch timeout, but it runs under a long-lived service with its own restart/recovery story — out of scope for this PR.
  • Doesn't change the post-launch retry budget. The 30s-per-phase + 10-minute total ceiling for page.goto/waitForFunction is unchanged.

Test plan

Boxel CLI Tests on CI failed because puppeteer.launch timed out after
the default 30s waiting for Chrome's DevTools WS endpoint URL to appear
in stdout. The existing retry loop only covered page.goto and
waitForFunction — a slow Chrome cold start aborted the script before
the loop could even run.

Wrap the launch in its own retry helper:

- explicit 90s launch timeout per attempt (default was 30s)
- up to 3 attempts with a 2s backoff between them, capped by the same
  10-minute total deadline as the navigation loop
- structured log lines for each attempt (executable path, args, dumpio,
  timeout, success/failure timing) so a future flake is debuggable from
  the CI log alone
- on the final attempt, enable puppeteer's dumpio so Chrome's own
  stdout/stderr is piped through node — if launch still fails after
  the retries, we capture *why* (sandbox denial, missing shared lib,
  GPU init crash) instead of a bare "Timed out … waiting for the WS
  endpoint URL"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from Copilot May 18, 2026 22:23
@habdelra habdelra changed the title wait-for-host-standby: retry puppeteer.launch with diagnostics flaky test - wait-for-host-standby: retry puppeteer.launch with diagnostics May 18, 2026

Copilot AI left a comment

Pull request overview

Wraps puppeteer.launch in wait-for-host-standby.ts with its own retry loop and a 90s per-attempt timeout to mitigate CI flakiness where Chrome cold-starts exceed Puppeteer's default 30s launch timeout before the existing post-launch retry loop ever runs.

Changes:

  • Add launchBrowserWithRetry helper with up to 3 attempts, 2s backoff, and a 90s timeout per attempt, bounded by the existing 10-minute total deadline.
  • Enable Puppeteer dumpio on the final attempt (and whenever WAIT_FOR_HOST_STANDBY_VERBOSE=1) to surface Chrome's own stderr when launch ultimately fails.
  • Replace the single inline puppeteer.launch call in main() with the new retry helper.

@github-actions

github-actions Bot commented May 18, 2026

Host Test Results

1 file, 1 suite, duration 1h 29m 31s
2 661 tests: 2 646 ✅ passed, 15 💤 skipped, 0 ❌ failed
2 680 runs: 2 665 ✅ passed, 15 💤 skipped, 0 ❌ failed

Results for commit 4af133e.

Realm Server Test Results

1 file, 1 suite, duration 8m 26s
1 408 tests: 1 407 ✅ passed, 0 💤 skipped, 1 ❌ failed
2 990 runs: 2 989 ✅ passed, 0 💤 skipped, 1 ❌ failed

Results for commit 4af133e.

For more details on these errors, see this check.
