fix(training-agent): block app.listen() until tenant registry is ready #4062

Merged
bokelley merged 1 commit into main from bokelley/block-listen-on-tenant-init on May 4, 2026

@bokelley bokelley commented May 4, 2026

Summary

Follow-up to #4060. That PR triggered eager init at module load but did not gate the HTTP listener on init completing. The 6-tenant registration burst takes 30–60s on a fresh Fly machine, and the post-deploy smoke runs at ~T+10s, inside the warmup window, so tenant routes return 500. Five consecutive deploys failed the smoke before this PR; production was healthy minutes later in every case.

Why eager init alone wasn't enough

HTTPServer.start() was structured as: build app → mount routes (eager init kicks off in background) → app.listen(). The eager init promise was fire-and-forget, so the listener bound immediately while registration was still in flight. Smoke probes during that window got 500 because resolveByRequest returns null for tenants in pending health, and the route handler downstream then ends up in Express's default error handler.
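The race described above can be reproduced in miniature. The following is only a sketch under stated assumptions: `FakeRegistry`, `probeStatus`, and the two `firstProbe*` helpers are hypothetical stand-ins, not names from the codebase, and the slow registration burst is simulated with a single asynchronous yield rather than real network work.

```typescript
// Hypothetical stand-ins for illustration; none of these names are from the codebase.
type Health = "pending" | "healthy";

class FakeRegistry {
  private health: Health = "pending";

  // Yields once to simulate asynchronous registration work, then marks healthy.
  async init(): Promise<void> {
    await Promise.resolve();
    this.health = "healthy";
  }

  // Per the description above: no tenant resolves while health is pending.
  resolve(tenantId: string): string | null {
    return this.health === "healthy" ? tenantId : null;
  }
}

// Status an unauthenticated probe would see: 500 when the tenant cannot be
// resolved, 401 (auth-rejected) once it can.
function probeStatus(registry: FakeRegistry, tenantId: string): number {
  return registry.resolve(tenantId) === null ? 500 : 401;
}

// Old shape: init is fire-and-forget, so the first probe can land mid-warmup.
function firstProbeFireAndForget(registry: FakeRegistry): number {
  void registry.init(); // started, but not awaited
  return probeStatus(registry, "tenant-a");
}

// New shape: nothing is served until init has completed.
async function firstProbeGated(registry: FakeRegistry): Promise<number> {
  await registry.init();
  return probeStatus(registry, "tenant-a");
}
```

In this toy model the fire-and-forget path always probes before the registry is healthy, while the gated path cannot probe until it is. That ordering is the property the smoke depends on.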

What changed

  • createTrainingAgentRouter() now returns { router, ready } instead of a bare Router. The ready Promise resolves when the registry is fully registered + validated.
  • HTTPServer.start() awaits ready before app.listen(). The HTTP server simply doesn't bind to the port until tenants are healthy.
  • Real init bugs (#3854 in-memory task registry, #3869 noopJwksValidator under NODE_ENV=production) now surface as a boot crash, and the deploy rolls back. That is better than the silent 500s-until-restart state.
  • API change touches every test/script that boots the router: 5 integration tests, 1 manual test, 1 e2e script.
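The shape of the change can be sketched as follows. This is a minimal illustration, not the actual implementation: `createRouterSketch`, `startSketch`, the `RouterHandle` type, and the `initRegistry` and `listen` callbacks are assumed names standing in for createTrainingAgentRouter(), HTTPServer.start(), the registry init, and app.listen().

```typescript
// Hypothetical sketch of the { router, ready } contract; names are assumptions.
interface RouterHandle<R> {
  router: R;
  ready: Promise<void>; // resolves once the registry is registered and validated
}

// Stand-in for createTrainingAgentRouter(): init still starts eagerly,
// but the promise is handed back to the caller instead of being dropped.
function createRouterSketch(
  initRegistry: () => Promise<void>
): RouterHandle<{ name: string }> {
  const ready = initRegistry();
  return { router: { name: "training-agent" }, ready };
}

// Stand-in for HTTPServer.start(): the listen callback runs only after
// ready resolves. If ready rejects, the rejection propagates to the caller
// and listen is never called, which models the boot crash.
async function startSketch<R>(
  handle: RouterHandle<R>,
  listen: () => void
): Promise<void> {
  await handle.ready;
  listen();
}
```

Callers that previously received a bare Router would now destructure the result, for example `const { router, ready } = createRouterSketch(init)`, and then either await `ready` themselves or pass the handle to the server start routine.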

Test plan

  • 371 training-agent unit/integration tests pass
  • Typecheck clean
  • Precommit hook passed
  • After merge: deploy this PR, watch the post-deploy smoke step in the workflow run — should pass cleanly with all 6 tenant routes returning 401 (auth-rejected, not 500)
  • After merge: subsequent unrelated deploys also pass smoke (confirms the fix is durable, not specific to this image)
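The pass condition in the post-deploy item above (every tenant route returns 401 rather than 500) can be expressed as a small check. This is a sketch only: the real smoke step's criteria and route names are not shown in this PR, so `ProbeResult`, `smokeFailures`, and the example routes are hypothetical.

```typescript
// Hypothetical smoke evaluation; the real smoke step's logic is not shown in this PR.
interface ProbeResult {
  route: string;
  status: number;
}

// Returns one message per route that did not answer 401 (auth-rejected).
// An empty array means the smoke would pass under this assumed criterion.
function smokeFailures(results: ProbeResult[]): string[] {
  return results
    .filter((r) => r.status !== 401)
    .map((r) => `${r.route} returned ${r.status}, expected 401`);
}

// Example with made-up routes: one tenant still warming up.
const exampleFailures = smokeFailures([
  { route: "/tenant-a", status: 401 },
  { route: "/tenant-b", status: 500 },
]);
```

Treating 401 as the expected result reflects that an unauthenticated probe reaching a healthy tenant should be rejected by auth, whereas a 500 indicates the tenant was never resolved.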

🤖 Generated with Claude Code

@bokelley bokelley force-pushed the bokelley/block-listen-on-tenant-init branch from d75a5f5 to 703381b Compare May 4, 2026 09:56
The previous fix (#4060) triggered eager init at module load but didn't
gate the HTTP listener on init completing. The 6-tenant registration
burst takes 30-60s on a fresh Fly machine; the post-deploy smoke runs
at ~T+10s and probes tenant routes during the warmup window, getting
500s every time. Five consecutive deploys failed the smoke before this
PR; production was healthy minutes later in every case.

Fix: createTrainingAgentRouter() now returns { router, ready }.
HTTPServer.start() awaits ready before app.listen(), so the listener
doesn't bind until the registry is actually ready to serve. Real init
bugs (#3854, #3869 class) now surface as a boot crash and roll the
deploy back, instead of dribbling 500s at users until restart.

API change touches the 7 callsites that boot the router (5 integration
tests, 1 manual test, 1 e2e script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley force-pushed the bokelley/block-listen-on-tenant-init branch from 703381b to 403ed0e Compare May 4, 2026 10:06
@bokelley bokelley merged commit 29ca944 into main May 4, 2026
19 checks passed
@bokelley bokelley deleted the bokelley/block-listen-on-tenant-init branch May 4, 2026 10:09
bokelley added a commit that referenced this pull request May 4, 2026
bokelley added a commit that referenced this pull request May 4, 2026