
fix(training-agent): eager tenant registry init at server boot#4060

Merged
bokelley merged 1 commit into main from bokelley/eager-tenant-registry-init
May 4, 2026
@bokelley bokelley commented May 4, 2026

Summary

Every recent deploy has failed the post-deploy smoke check on the training-agent tenant routes (/sales/mcp, /signals/mcp, /governance/mcp, /creative/mcp, /creative-builder/mcp, /brand/mcp). All six return HTTP 500 during the ~16s window after the rolling deploy completes, then heal on their own minutes later.

5 of the last 5 deploys have shown this pattern. Production is healthy throughout (warm machines serving traffic), but the failure noise hides real regressions and forces operators to manually verify each deploy.

Root cause

RegistryHolder was lazy-initialized on first request (registry.ts:193). On a fresh Fly machine the 6-tenant registration burst takes 30–60s, longer than the smoke's 16s retry budget. The smoke's initial probe and its 8s retry both land while register() calls are still in flight and return 500 (the init promise hasn't resolved yet, so the route handler has no registry to dispatch against), and the smoke gives up.
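
The timing argument above can be checked with a small model. This is only a sketch: `SmokeTiming` and `smokeCanPass` are hypothetical names, and it assumes the probes fire at roughly T+0 and T+8s relative to the first request, which is also the moment lazy init starts.

```typescript
// Hypothetical model of the smoke timing; the numbers come from the PR description.
interface SmokeTiming {
  probeOffsetsSec: number[]; // when each probe fires, relative to the first request
  initDurationSec: number; // how long init still needs once the first request arrives
}

// True if at least one probe lands after init has finished.
function smokeCanPass({ probeOffsetsSec, initDurationSec }: SmokeTiming): boolean {
  return probeOffsetsSec.some((offset) => offset >= initDurationSec);
}

// Lazy init, best case from the PR (30s): both probes land during warmup.
const lazyBestCase = smokeCanPass({ probeOffsetsSec: [0, 8], initDurationSec: 30 });
// Registry already initialized when the first probe arrives.
const alreadyWarm = smokeCanPass({ probeOffsetsSec: [0, 8], initDurationSec: 0 });

console.log(lazyBestCase, alreadyWarm); // prints: false true
```

Under these assumptions no retry schedule that fits in 16s can succeed while init is triggered by the first probe, which is why the fix moves the start of init rather than tuning the smoke.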

What changed

  • Eager init in mountTenantRoutes: triggers holder.get() once at mount time so the 6-tenant registration starts at server boot, not first request. Per-request handlers continue to await holder.get(), which now reuses the in-flight or completed promise from the eager call.
  • Reject-clear: pendingInit is reset on rejection so a transient init failure doesn't poison every subsequent request with the same rejected promise until machine restart. Defense in depth — the existing code never reset, so a bad init at boot would have been permanently fatal until restart.
  • Eager-init failure logged, not crashed: if the boot-time init throws, the error is logged and per-request init retries on the next call. Doesn't take down the server.
  • Drops unused req param from RegistryHolder.get(). The comment said it was vestigial; confirmed by grepping all callers.
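
The combination of eager init, reject-clear, and logged boot failure can be sketched roughly as follows. This is not the code from registry.ts: the `Registry` shape, the injected `init` function, and the `log` parameter are assumptions made so the example is self-contained; only the names `RegistryHolder`, `get()`, `pendingInit`, and `mountTenantRoutes` come from the description above.

```typescript
// Hypothetical stand-in: the real Registry type and init logic are not shown in this PR.
type Registry = { tenants: string[] };

class RegistryHolder {
  private pendingInit: Promise<Registry> | null = null;

  // `init` is injected here for illustration only.
  constructor(private readonly init: () => Promise<Registry>) {}

  // Returns the in-flight or completed init promise, starting init if none exists.
  get(): Promise<Registry> {
    if (this.pendingInit !== null) return this.pendingInit;
    const attempt = this.init();
    this.pendingInit = attempt;
    // Reject-clear: forget a failed attempt so the next call retries
    // instead of re-serving the same rejected promise.
    attempt.catch(() => {
      if (this.pendingInit === attempt) this.pendingInit = null;
    });
    return attempt;
  }
}

// Eager init: kick off registration at mount time. A failure is logged,
// not thrown, so per-request init can retry on the next call.
function mountTenantRoutes(holder: RegistryHolder, log: (msg: string) => void): void {
  holder.get().catch((err: unknown) => {
    log(`eager tenant registry init failed: ${String(err)}`);
  });
}

// Usage: requests that arrive later reuse the promise started at mount time.
const demoHolder = new RegistryHolder(async () => ({ tenants: ["sales", "signals"] }));
mountTenantRoutes(demoHolder, console.error);
demoHolder.get().then((registry) => console.log(registry.tenants.length));
```

The identity check before clearing guards against a late rejection wiping out a newer attempt that has already replaced it.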

What this catches that the smoke was meant to catch

The smoke step's comment names two prior incidents (#3854 in-memory task registry refused under NODE_ENV=production, #3869 noopJwksValidator threw under NODE_ENV=production). Both were deterministic init failures. With eager init, those would surface at server boot — visible in the Fly logs immediately, instead of dressed up as a smoke flake.

Test plan

  • 377 training-agent unit/integration tests pass (full suite under tests/unit/training-agent.test.ts + src/training-agent/)
  • Tenant smoke tests pass (src/training-agent/tenants/tenant-smoke.test.ts)
  • Typecheck clean
  • Precommit hook passed
  • After merge: deploy this PR and verify the post-deploy smoke step passes (no more Tenant smoke failed: /<tenant>/mcp returned HTTP 500)
  • After merge: subsequent unrelated deploys also pass smoke (confirms the fix is durable, not specific to this PR's image)

🤖 Generated with Claude Code

Every recent deploy failed the post-deploy smoke on /sales/mcp,
/signals/mcp, /governance/mcp, /creative/mcp, /creative-builder/mcp,
and /brand/mcp returning HTTP 500 during a ~16s window after rolling
deploy completes, then healing on their own minutes later.

Root cause: RegistryHolder was lazy-initialized on first request. On a
fresh Fly machine the 6-tenant registration burst takes 30-60s —
longer than the smoke's 16s retry budget. Initial probe + 8s retry
both land while register() calls are still in flight, return 500.

Pre-warm the registry inside mountTenantRoutes so init starts at
server boot, not first request. Per-request handlers continue to
await holder.get() which reuses the in-flight or completed promise.

Plus two safety nets:
- Reset pendingInit on rejection so a transient init failure doesn't
  poison every subsequent request until machine restart.
- Eager-init errors are logged, not thrown — doesn't crash the server.

Drops the unused req param from RegistryHolder.get().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit 805067b into main May 4, 2026
19 checks passed
@bokelley bokelley deleted the bokelley/eager-tenant-registry-init branch May 4, 2026 09:28
bokelley added a commit that referenced this pull request May 4, 2026
The previous fix (#4060) triggered eager init at module load but didn't
gate the HTTP listener on init completing. The 6-tenant registration
burst takes 30-60s on a fresh Fly machine; the post-deploy smoke runs
at ~T+10s and probes tenant routes during the warmup window, getting
500s every time. Five consecutive deploys failed the smoke before this
PR; production was healthy minutes later in every case.

Fix: createTrainingAgentRouter() now returns { router, ready }.
HTTPServer.start() awaits ready before app.listen(), so the listener
doesn't bind until the registry is actually ready to serve. Real init
bugs (#3854, #3869 class) now surface as a boot crash and roll the
deploy back, instead of dribbling 500s at users until restart.

API change touches the 7 callsites that boot the router (5 integration
tests, 1 manual test, 1 e2e script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
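
A rough sketch of the listener gating described in this commit message. It is not the actual implementation: the `Router` type, the injected `registerTenants` function, and the standalone `start` function (standing in for HTTPServer.start()) are assumptions for illustration; only the `{ router, ready }` return shape and the await-before-listen ordering come from the text.

```typescript
// Hypothetical router: maps a request path to an HTTP status code.
type Router = (path: string) => number;

function createTrainingAgentRouter(
  registerTenants: () => Promise<string[]>,
): { router: Router; ready: Promise<void> } {
  let tenants: string[] | null = null;
  // `ready` settles once the registration burst has finished (or failed).
  const ready = registerTenants().then((names) => {
    tenants = names;
  });
  const router: Router = (path) => {
    if (tenants === null) return 500; // registry not ready yet
    return tenants.some((name) => path === `/${name}/mcp`) ? 200 : 404;
  };
  return { router, ready };
}

// Stand-in for HTTPServer.start(): the listener binds only after `ready` resolves.
// If init rejects, start() rejects too, so the failure surfaces at boot.
async function start(
  registerTenants: () => Promise<string[]>,
  listen: (router: Router) => void,
): Promise<void> {
  const { router, ready } = createTrainingAgentRouter(registerTenants);
  await ready;
  listen(router);
}
```

With this ordering no probe can reach a tenant route before the registry exists, because nothing is listening until `ready` has resolved.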
bokelley added a commit that referenced this pull request May 4, 2026
#4062)