fix(training-agent): block app.listen() until tenant registry is ready #4062
Merged
The previous fix (#4060) triggered eager init at module load but didn't gate the HTTP listener on init completing. The 6-tenant registration burst takes 30-60s on a fresh Fly machine; the post-deploy smoke runs at ~T+10s and probes tenant routes during the warmup window, getting 500s every time. Five consecutive deploys failed the smoke before this PR; production was healthy minutes later in every case.

Fix: createTrainingAgentRouter() now returns { router, ready }. HTTPServer.start() awaits ready before app.listen(), so the listener doesn't bind until the registry is actually ready to serve. Real init bugs (#3854, #3869 class) now surface as a boot crash and roll the deploy back, instead of dribbling 500s at users until restart.

API change touches the 7 callsites that boot the router (5 integration tests, 1 manual test, 1 e2e script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 4, 2026
Summary
Follow-up to #4060. That PR triggered eager init at module load but didn't gate the HTTP listener on init completing — the 6-tenant registration burst takes 30–60s on a fresh Fly machine, the post-deploy smoke runs at ~T+10s during the warmup window, and tenant routes return 500. Five consecutive deploys failed the smoke before this PR; production was healthy minutes later in every case.
Why eager init alone wasn't enough
HTTPServer.start() was structured as: build app → mount routes (eager init kicks off in background) → app.listen(). The eager init promise was fire-and-forget, so the listener bound immediately while registration was still in flight. Smoke probes during that window get 500 because resolveByRequest returns null for tenants in pending health, and downstream the route handler ends up at Express's default error handler.

What changed
createTrainingAgentRouter() now returns { router, ready } instead of a bare Router. The ready Promise resolves when the registry is fully registered + validated.
HTTPServer.start() awaits ready before app.listen(). The HTTP server simply doesn't bind to the port until tenants are healthy.

Test plan
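One way to check the ordering locally is the self-contained sketch below. It is not the PR's code and not its actual task list: createGatedRouter, start, demo, the tenant names, and the 50 ms delay are all hypothetical stand-ins, and a plain function plays the role of the Express router and app.listen(). It only illustrates why a probe during warmup sees 500 when startup ignores ready, and 200 when startup awaits it.

```typescript
// Hypothetical stand-ins: none of these names come from the PR's codebase.
type Handler = (tenant: string) => number; // returns an HTTP status code

interface GatedRouter {
  router: Handler;
  ready: Promise<void>;
}

// Simulates a registry whose tenants only become healthy after async registration.
function createGatedRouter(tenants: string[], registerMs: number): GatedRouter {
  const healthy = new Set<string>();
  const ready = new Promise<void>((resolve) => {
    setTimeout(() => {
      tenants.forEach((t) => healthy.add(t));
      resolve();
    }, registerMs);
  });
  // 500 while a tenant is still pending, 200 once it is registered.
  const router: Handler = (tenant) => (healthy.has(tenant) ? 200 : 500);
  return { router, ready };
}

// Returning the handler stands in for app.listen(): once it is returned,
// callers can probe it. With awaitReady = false, init is fire-and-forget.
async function start(gated: GatedRouter, awaitReady: boolean): Promise<Handler> {
  if (awaitReady) {
    await gated.ready; // a rejection here would fail startup instead of serving 500s
  }
  return gated.router;
}

// Probes one tenant immediately after startup, with and without the gate.
async function demo(): Promise<{ ungated: number; gated: number }> {
  const tenants = ["tenant-a", "tenant-b"];
  const ungatedProbe = await start(createGatedRouter(tenants, 50), false);
  const ungated = ungatedProbe("tenant-a"); // probed during the warmup window
  const gatedProbe = await start(createGatedRouter(tenants, 50), true);
  const gated = gatedProbe("tenant-a"); // probed only after ready resolved
  return { ungated, gated };
}
```

Calling demo() resolves to { ungated: 500, gated: 200 }. In the real server the same ordering would mean smoke probes cannot reach tenant routes until registration has finished, because nothing is bound to the port before then.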
🤖 Generated with Claude Code