fix(training-agent): eager tenant registry init at server boot #4060
Merged
Every recent deploy failed the post-deploy smoke on /sales/mcp, /signals/mcp, /governance/mcp, /creative/mcp, /creative-builder/mcp, and /brand/mcp, which returned HTTP 500 during a ~16s window after rolling deploy completes, then healed on their own minutes later.

Root cause: RegistryHolder was lazy-initialized on first request. On a fresh Fly machine the 6-tenant registration burst takes 30-60s, longer than the smoke's 16s retry budget. The initial probe + 8s retry both land while register() calls are still in flight and return 500.

Pre-warm the registry inside mountTenantRoutes so init starts at server boot, not first request. Per-request handlers continue to await holder.get(), which reuses the in-flight or completed promise.

Plus two safety nets:
- Reset pendingInit on rejection so a transient init failure doesn't poison every subsequent request until machine restart.
- Eager-init errors are logged, not thrown, so a failed pre-warm doesn't crash the server.

Drops the unused req param from RegistryHolder.get().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
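The promise-reuse and reset-on-rejection behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in registry.ts: the Registry type, the constructor signature, and the init callback are assumptions made for the example; only the names RegistryHolder, get(), and pendingInit come from the description.

```typescript
// Hypothetical stand-in for the real registry type.
type Registry = { tenants: string[] };

class RegistryHolder {
  // Cached init promise: null until the first get(), or after a failed init.
  private pendingInit: Promise<Registry> | null = null;
  // Assumed init hook; the real code registers the six tenants here.
  private readonly init: () => Promise<Registry>;

  constructor(init: () => Promise<Registry>) {
    this.init = init;
  }

  // Returns the in-flight or completed init promise, starting init if none is cached.
  get(): Promise<Registry> {
    if (this.pendingInit === null) {
      const attempt: Promise<Registry> = this.init().catch((err: unknown) => {
        // Drop the rejected promise so the next caller retries
        // instead of receiving the same failure until restart.
        if (this.pendingInit === attempt) {
          this.pendingInit = null;
        }
        throw err;
      });
      this.pendingInit = attempt;
    }
    return this.pendingInit;
  }
}
```

Under these assumptions, concurrent callers share one registration burst, and a transient failure costs one failed request rather than every request until the machine restarts.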
bokelley added a commit that referenced this pull request May 4, 2026
The previous fix (#4060) triggered eager init at module load but didn't gate the HTTP listener on init completing. The 6-tenant registration burst takes 30-60s on a fresh Fly machine; the post-deploy smoke runs at ~T+10s and probes tenant routes during the warmup window, getting 500s every time. Five consecutive deploys failed the smoke before this PR; production was healthy minutes later in every case.

Fix: createTrainingAgentRouter() now returns { router, ready }. HTTPServer.start() awaits ready before app.listen(), so the listener doesn't bind until the registry is actually ready to serve. Real init bugs (#3854, #3869 class) now surface as a boot crash and roll the deploy back, instead of dribbling 500s at users until restart.

API change touches the 7 callsites that boot the router (5 integration tests, 1 manual test, 1 e2e script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
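The ordering this follow-up commit describes (await ready, then bind) can be illustrated with a small sketch. Everything here is a stand-in: createRouterBundle and startServer are hypothetical simplifications of createTrainingAgentRouter() and HTTPServer.start(), the router is a placeholder string, and the listen callback models app.listen(). Only the { router, ready } return shape is taken from the commit message.

```typescript
// The source only states that the factory returns { router, ready }.
interface RouterBundle<R> {
  router: R;
  ready: Promise<void>;
}

// Stand-in for createTrainingAgentRouter(); `registerTenants` is an assumed init hook.
function createRouterBundle(registerTenants: () => Promise<void>): RouterBundle<string> {
  // Init starts immediately; callers decide when to wait for it.
  const ready = registerTenants();
  return { router: "router-placeholder", ready };
}

// Stand-in for HTTPServer.start(); `listen` models app.listen().
async function startServer(
  registerTenants: () => Promise<void>,
  listen: (router: string) => void,
): Promise<void> {
  const { router, ready } = createRouterBundle(registerTenants);
  // If init rejects, this throws: boot fails and the listener never binds.
  await ready;
  listen(router);
}
```

With this shape, no probe can reach a tenant route during warmup because the port is not open yet, and an init failure rejects startServer() instead of turning into 500 responses.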
bokelley added a commit that referenced this pull request May 4, 2026
#4062)

The previous fix (#4060) triggered eager init at module load but didn't gate the HTTP listener on init completing. The 6-tenant registration burst takes 30-60s on a fresh Fly machine; the post-deploy smoke runs at ~T+10s and probes tenant routes during the warmup window, getting 500s every time. Five consecutive deploys failed the smoke before this PR; production was healthy minutes later in every case.

Fix: createTrainingAgentRouter() now returns { router, ready }. HTTPServer.start() awaits ready before app.listen(), so the listener doesn't bind until the registry is actually ready to serve. Real init bugs (#3854, #3869 class) now surface as a boot crash and roll the deploy back, instead of dribbling 500s at users until restart.

API change touches the 7 callsites that boot the router (5 integration tests, 1 manual test, 1 e2e script).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Every recent deploy has been failing the post-deploy smoke check on the training-agent tenant routes (/sales/mcp, /signals/mcp, /governance/mcp, /creative/mcp, /creative-builder/mcp, /brand/mcp), all returning HTTP 500 during the ~16s window after rolling deploy completes, then healing on their own minutes later.
5 of the last 5 deploys have shown this pattern. Production is healthy throughout (warm machines serving traffic), but the failure noise hides real regressions and forces operators to manually verify each deploy.
Root cause
RegistryHolder was lazy-initialized on first request (registry.ts:193). On a fresh Fly machine the 6-tenant registration burst takes 30–60s, longer than the smoke's 16s retry budget. The smoke's initial probe + 8s retry both land while register() calls are still in flight and return 500 (the init promise hasn't resolved yet, so the route handler can't get a registry to dispatch against), and the smoke gives up.
What changed
- mountTenantRoutes: triggers holder.get() once at mount time so the 6-tenant registration starts at server boot, not first request. Per-request handlers continue to await holder.get(), which now reuses the in-flight or completed promise from the eager call.
- pendingInit is reset on rejection so a transient init failure doesn't poison every subsequent request with the same rejected promise until machine restart. This is defense in depth: the existing code never reset, so a bad init at boot would have been permanently fatal until restart.
- Drops the unused req param from RegistryHolder.get(). The comment said it was vestigial; confirmed by grepping all callers.
What this catches that the smoke was meant to catch
The smoke step's comment names two prior incidents (#3854 in-memory task registry refused under NODE_ENV=production, #3869 noopJwksValidator threw under NODE_ENV=production). Both were deterministic init failures. With eager init, those would surface at server boot — visible in the Fly logs immediately, instead of dressed up as a smoke flake.
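As a rough sketch of how a deterministic init failure becomes visible at boot, the pre-warm can attach a logging handler to the eager holder.get() call while request handlers keep awaiting the same holder. The Holder interface, the logError callback, and the status-code return value below are assumptions for illustration; the real mountTenantRoutes mounts HTTP routes rather than returning a function.

```typescript
// Minimal holder shape assumed for this sketch.
interface Holder<T> {
  get(): Promise<T>;
}

type Logger = (message: string, err: unknown) => void;

// Hypothetical simplification of mountTenantRoutes: returns a handler that
// maps a request path to an HTTP status code.
function mountTenantRoutes<T>(
  holder: Holder<T>,
  logError: Logger,
): (path: string) => Promise<number> {
  // Pre-warm: start init at mount time. Failures are logged, not thrown,
  // so they show up in boot logs without crashing the server.
  holder.get().catch((err: unknown) => {
    logError("eager tenant registry init failed", err);
  });

  // Per-request handler: awaits the same holder as the pre-warm call.
  return async (_path: string): Promise<number> => {
    try {
      await holder.get();
      return 200;
    } catch {
      return 500;
    }
  };
}
```

In this sketch an init bug of the #3854 or #3869 kind produces a log line at mount time, before any request or smoke probe arrives.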
Test plan
- tests/unit/training-agent.test.ts + src/training-agent/)
- src/training-agent/tenants/tenant-smoke.test.ts)
- Tenant smoke failed: /<tenant>/mcp returned HTTP 500)
🤖 Generated with Claude Code