fix(browse): recover a late-healthy detached daemon instead of a false "Server failed to start" (#1846)#1847
Conversation
…e "failed to start" (garrytan#1846) startServer polls /health for MAX_START_WAIT, then throws "Server failed to start within Ns" on timeout. But the server is spawned detached + unref'd, so it keeps booting independently of the CLI's poll loop. On a loaded machine the daemon can become healthy in the gap between the loop's final tick and the throw — the very next `browse status` shows it listening and healthy. The probe timed out; the launch did not. - Add a final readState() + isServerHealthy() re-check before the timeout throw, mirroring the post-loop recovery already in ensureServer(). A genuinely failed server stays unhealthy and falls through to the existing error report. - Make the startup budget overridable via BROWSE_START_TIMEOUT (ms) through a new pure, exported resolveStartTimeout() — matching the BROWSE_* tunable convention in server.ts — so hosts where even the platform ceiling is too low have an escape hatch. Complements garrytan#1732 (which only raised the constant): this removes the false-negative at the throw site itself, on any platform/load. Test: browse/test/cli-start-final-healthcheck.test.ts — behavioral coverage of resolveStartTimeout overrides + a static invariant that the final re-check sits between the poll loop and the throw. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
❌ This pull request has been removed from the merge queue because required statuses are not defined in
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
|
Independently reviewed against current The asymmetry the PR describes is real on The Collision check: no competing open PR for #1846 / |
|
/trunk merge |
|
These look ready from my side — the branch is clean/mergeable and an independent review confirmed the fix is correct and complete. The only red check is the Trunk merge-queue gate, which from what I can tell just needs a maintainer to enqueue it (no code/test failure). @garrytan whenever you have a moment, could these be added to the merge queue? Happy to rebase if anything is needed. Thanks! |
Summary
browse'sstartServer()polls/healthforMAX_START_WAIT, then throwsServer failed to start within Nson timeout. But the server is spawned detached +.unref()'d, so it keeps booting independently of the CLI's poll loop. On a loaded machine the daemon can become healthy in the gap between the loop's final tick and the throw — the very nextbrowse statusshows it listening and healthy. The probe timed out; the launch did not.Closes #1846.
What this changes
Final health re-check before the throw (the actual fix). Before the timeout
throwinstartServer, do one lastreadState()+isServerHealthy(); if healthy, return that state. This mirrors the post-loop recovery already present inensureServer()(cli.ts:482-483) —startServerwas simply missing the same guard. A genuinely failed server stays unhealthy and falls straight through to the existing startup-error-log report, so real failures are unaffected.BROWSE_START_TIMEOUTenv override.MAX_START_WAITwas a hardcoded magic number — the onlybrowsestartup tunable without aBROWSE_*escape hatch (unlikeBROWSE_PORT,BROWSE_IDLE_TIMEOUT, …). It's now produced by a small pure, exportedresolveStartTimeout(env)that honorsBROWSE_START_TIMEOUT(ms) and falls back to the platform default for absent/non-positive/unparseable values.Relationship to #1732
Complementary, not overlapping. #1732 only widens the constant (8s → 15s for non-Windows). This PR removes the false-negative at the throw site itself, so it holds at any budget/platform/load, and adds the env knob #1732 left out. The two changes touch different lines and compose cleanly.
Tests
browse/test/cli-start-final-healthcheck.test.ts:resolveStartTimeout— positive override honored;0/ negative / garbage / empty fall back to the platform default.readState()+isServerHealthy()re-check sits between the poll loop and the throw, returns the recovered state, and thatMAX_START_WAITis wired throughresolveStartTimeout().Existing
cli-*/restart-envtests stay green. (The unrelatedcommands.test.tsfailure on a fresh checkout is a missing-npx playwright installbrowser binary — reproduces identically onmain.)🤖 Generated with Claude Code