fix(training-agent): wire PostgresStateStore (root cause of tenant init fast-reject)#4071
Merged
Merged
Conversation
…se timing Wraps `await holder.get()` in tenantMcpHandler with try/catch so init rejections are logged with full context and the route returns JSON-RPC 503 instead of HTML 500. The previous unhandled-rejection path produced an HTML body that the post-deploy smoke flagged but with no log entry tying the error to the rejected promise. Adds boot-phase timing in createRegistryHolder (create, config-build, per-tenant register, totals) so the next deploy reveals which phase is blowing the budget on a fresh Fly machine. Init runs ~14ms locally; slowness is environment-specific and was previously invisible. Per- tenant register failures log a stack before re-throwing so a single bad tenant doesn't hide behind Promise.all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor the rejection path to `.catch()` returning null so `registry` keeps its inferred `TenantRegistry` type instead of falling to `any` under strict mode. Add `Retry-After: 5` header to match the SDK's documented contract for pending tenants (`@adcp/sdk` tenant-registry.d.ts says: host transport should respond 503 + Retry-After). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-rejecting The diagnostic logging from #4067 confirmed every fresh Fly machine fails all six tenant `register()` calls within ~13ms with: createAdcpServer: in-memory state store refused outside {NODE_ENV=test, NODE_ENV=development} SDK 6.0.1 hard-refuses the module-singleton `InMemoryStateStore` for multi-tenant deployments. We never wired a non-default `stateStore`, so every `createAdcpServer` call inside `register()` threw, `Promise.all` rejected, holder.get() rejected, and tenant routes returned 503 indefinitely (smoke saw HTML 500 before #4067; 503 with Retry-After after). Adds `pickStateStore()` mirroring `pickTaskRegistry()`: PostgresStateStore in production, InMemoryStateStore in dev/test. Re-throws on prod-pool init failure rather than falling back to in-memory (would just re-trip the same SDK guard with extra confusion). Migration 466_adcp_state.sql is the SDK's `ADCP_STATE_MIGRATION` verbatim — idempotent, runs once via `release_command`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
bokelley
added a commit
that referenced
this pull request
May 4, 2026
…empt boot succeeds (#4072) The fix in #4071 unblocked deploys but exposed a boot-ordering race: `mountTenantRoutes()` runs before `initializeDatabase()`, so the eager `holder.get()` calls `getPool()` and throws "Database not initialized." The eager-mount catch resets `pendingInit`, the next request retries, and by then the pool is up — production heals within ~5s. Visible in the post-#4071 deploy logs: T+0 init starting → "Database not initialized" (eager fail) T+5 init starting → "Tenant registry initialized" totalMs=27 Wrap the pool in a lazy `PgQueryable` so `getPool()` runs at first query, not at construction. Construction always succeeds; by the time a tool touches `ctx.store`, the pool is initialized regardless of mount order. Eliminates one round-trip of boot retry and the noisy first error. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The diagnostic logging from #4067 caught it on first deploy:
All six tenant
register()calls fail within ~13ms of boot. SDK 6.0.1 hard-refuses the module-singletonInMemoryStateStorefor multi-tenant deployments — and we never wired a non-defaultstateStoreindefaultServerOptions. EverycreateAdcpServercall insideregister()threw,Promise.allrejected,holder.get()rejected, and tenant routes returned 503 indefinitely (HTML 500 before #4067).What changed
pickStateStore()inserver/src/training-agent/tenants/registry.tsmirrorspickTaskRegistry():PostgresStateStorein production,InMemoryStateStorein dev/test. Re-throws on prod-pool failure rather than falling back (in-memory fallback would just re-trip the SDK guard).stateStore: pickStateStore()intobuildDefaultServerOptions().466_adcp_state.sql— verbatim from the SDK'sADCP_STATE_MIGRATIONconstant. Idempotent (CREATE TABLE IF NOT EXISTS+ADD COLUMN IF NOT EXISTS); runs once via Fly'srelease_commandbefore machines roll.Test plan
npx tsc -p server/tsconfig.json --noEmitcleantenant-smoke.test.ts+sync-accounts-gates.test.ts14/14 passcode-reviewerandpython-expert— no blockersTenant registry initializedwithtotalMs< 1sRollback
If the migration succeeds but the wiring trips: revert this PR (the registry change), then
DROP TABLE adcp_statevia psql. Reverting the registry without dropping the table is also safe — the table just sits empty.🤖 Generated with Claude Code