fix(training-agent): surface tenant registry init failures + boot-phase timing #4067
Merged
…se timing

Wraps `await holder.get()` in tenantMcpHandler with try/catch so init rejections are logged with full context and the route returns JSON-RPC 503 instead of HTML 500. The previous unhandled-rejection path produced an HTML body that the post-deploy smoke flagged but with no log entry tying the error to the rejected promise.

Adds boot-phase timing in createRegistryHolder (create, config-build, per-tenant register, totals) so the next deploy reveals which phase is blowing the budget on a fresh Fly machine. Init runs ~14ms locally; slowness is environment-specific and was previously invisible. Per-tenant register failures log a stack before re-throwing so a single bad tenant doesn't hide behind Promise.all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
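The boot-phase timing and per-tenant failure logging described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `Logger`, `Registry`, `timed`, and `initRegistry` are hypothetical names, and the real `createRegistryHolder` also times a config-build phase that is omitted here.

```typescript
// Hypothetical shapes for illustration; the real registry and logger differ.
type LogFn = (fields: Record<string, unknown>, msg: string) => void;
interface Logger { info: LogFn; error: LogFn; }
interface Registry { register(tenantId: string): Promise<void>; }

// Run one async phase and report its elapsed wall-clock time in milliseconds.
async function timed<T>(fn: () => Promise<T>): Promise<{ value: T; ms: number }> {
  const start = Date.now();
  const value = await fn();
  return { value, ms: Date.now() - start };
}

async function initRegistry(
  createRegistry: () => Promise<Registry>,
  tenantIds: string[],
  log: Logger,
): Promise<Registry> {
  const totalStart = Date.now();
  const created = await timed(createRegistry);
  const registry = created.value;

  const perTenantMs: Record<string, number> = {};
  await Promise.all(
    tenantIds.map(async (tenantId) => {
      const start = Date.now();
      try {
        await registry.register(tenantId);
        perTenantMs[tenantId] = Date.now() - start;
      } catch (err) {
        // Log here, before re-throwing: Promise.all only surfaces the first
        // rejection, so without this the failing tenant would stay anonymous.
        const e = err instanceof Error ? err : new Error(String(err));
        log.error(
          { tenantId, errMessage: e.message, errStack: e.stack, elapsedMs: Date.now() - start },
          "Tenant register failed",
        );
        throw err;
      }
    }),
  );

  log.info(
    { createMs: created.ms, perTenantMs, totalMs: Date.now() - totalStart },
    "Tenant registry initialized",
  );
  return registry;
}
```

Passing the logger in as a parameter keeps the sketch independent of any particular logging library; with pino, the same `(fields, msg)` call order applies.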
Refactor the rejection path to `.catch()` returning null so `registry` keeps its inferred `TenantRegistry` type instead of falling to `any` under strict mode. Add `Retry-After: 5` header to match the SDK's documented contract for pending tenants (`@adcp/sdk` tenant-registry.d.ts says: host transport should respond 503 + Retry-After).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
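A rough sketch of the `.catch()`-returns-null pattern with a 503 + `Retry-After: 5` response is shown below. The interfaces and the `handleTenantRequest` name are hypothetical stand-ins for the Express response and the PR's `tenantMcpHandler`; the JSON-RPC error code `-32000` (the start of the implementation-defined server-error range) and the message strings are assumptions, not values taken from the PR.

```typescript
// Hypothetical minimal shapes; in the PR these are the SDK registry and an Express response.
interface TenantRegistry { tenantFor(host: string): string | undefined; }
interface RegistryHolder { get(): Promise<TenantRegistry>; }
interface JsonResponse {
  status(code: number): JsonResponse;
  set(name: string, value: string): JsonResponse;
  json(body: unknown): void;
}
type LogError = (fields: Record<string, unknown>, msg: string) => void;

async function handleTenantRequest(
  holder: RegistryHolder,
  host: string,
  res: JsonResponse,
  logError: LogError,
): Promise<void> {
  // `.catch()` returning null keeps `registry` typed as `TenantRegistry | null`;
  // a `let` assigned inside try/catch would need an explicit annotation instead.
  const registry = await holder.get().catch((err: unknown) => {
    const e = err instanceof Error ? err : new Error(String(err));
    logError(
      { errName: e.name, errMessage: e.message, errStack: e.stack, host },
      "tenant registry init rejected",
    );
    return null;
  });

  if (registry === null) {
    // JSON body plus Retry-After, so clients see a retryable 503 instead of
    // the framework's default HTML 500 page.
    res.status(503).set("Retry-After", "5").json({
      jsonrpc: "2.0",
      error: { code: -32000, message: "Tenant registry not ready" },
      id: null,
    });
    return;
  }

  // Real request dispatch is elided; this only echoes the resolved tenant.
  res.status(200).json({ tenantId: registry.tenantFor(host) ?? null });
}
```

Because the rejection is converted to `null` at the point of the `await`, nothing escapes the handler as an unhandled rejection, which is what previously triggered the HTML error page.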
5 tasks
bokelley added a commit that referenced this pull request May 4, 2026
…it fast-reject) (#4071)

* fix(training-agent): surface tenant registry init failures + boot-phase timing

  Wraps `await holder.get()` in tenantMcpHandler with try/catch so init rejections are logged with full context and the route returns JSON-RPC 503 instead of HTML 500. The previous unhandled-rejection path produced an HTML body that the post-deploy smoke flagged but with no log entry tying the error to the rejected promise.

  Adds boot-phase timing in createRegistryHolder (create, config-build, per-tenant register, totals) so the next deploy reveals which phase is blowing the budget on a fresh Fly machine. Init runs ~14ms locally; slowness is environment-specific and was previously invisible. Per-tenant register failures log a stack before re-throwing so a single bad tenant doesn't hide behind Promise.all.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: keep registry typed + Retry-After: 5 on warmup 503

  Refactor the rejection path to `.catch()` returning null so `registry` keeps its inferred `TenantRegistry` type instead of falling to `any` under strict mode. Add `Retry-After: 5` header to match the SDK's documented contract for pending tenants (`@adcp/sdk` tenant-registry.d.ts says: host transport should respond 503 + Retry-After).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(training-agent): wire PostgresStateStore so tenant init stops fast-rejecting

  The diagnostic logging from #4067 confirmed every fresh Fly machine fails all six tenant `register()` calls within ~13ms with:

      createAdcpServer: in-memory state store refused outside {NODE_ENV=test, NODE_ENV=development}

  SDK 6.0.1 hard-refuses the module-singleton `InMemoryStateStore` for multi-tenant deployments. We never wired a non-default `stateStore`, so every `createAdcpServer` call inside `register()` threw, `Promise.all` rejected, holder.get() rejected, and tenant routes returned 503 indefinitely (smoke saw HTML 500 before #4067; 503 with Retry-After after).

  Adds `pickStateStore()` mirroring `pickTaskRegistry()`: PostgresStateStore in production, InMemoryStateStore in dev/test. Re-throws on prod-pool init failure rather than falling back to in-memory (would just re-trip the same SDK guard with extra confusion).

  Migration 466_adcp_state.sql is the SDK's `ADCP_STATE_MIGRATION` verbatim: idempotent, runs once via `release_command`.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
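The selection rule described for `pickStateStore()` might look something like the sketch below. The `StateStore` interface and the factory parameters are hypothetical; the real function constructs the SDK's `PostgresStateStore` and `InMemoryStateStore`, whose constructors are not shown in this thread. Treating only `test` and `development` as in-memory environments is an assumption that mirrors the SDK guard message quoted above.

```typescript
// Hypothetical stand-in for the SDK's state store types.
interface StateStore { readonly kind: "postgres" | "memory"; }

function pickStateStore(
  nodeEnv: string | undefined,
  makePostgresStore: () => StateStore,
  makeInMemoryStore: () => StateStore,
  logError: (msg: string, err: unknown) => void,
): StateStore {
  // In-memory is only acceptable where the SDK guard allows it.
  if (nodeEnv === "test" || nodeEnv === "development") {
    return makeInMemoryStore();
  }
  try {
    return makePostgresStore();
  } catch (err) {
    // Deliberately no in-memory fallback: outside dev/test it would only
    // re-trip the same SDK guard and obscure the original failure.
    logError("state store init failed", err);
    throw err;
  }
}
```

Re-throwing keeps the first, most informative error at the top of the deploy log instead of replacing it with the guard's refusal message.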
This was referenced May 4, 2026
bokelley added a commit that referenced this pull request May 4, 2026
…tually uses Postgres on cold boot (#4073)

Same boot-ordering issue the state-store fix in #4072 closed: `mountTenantRoutes()` runs before `initializeDatabase()`, so `getPool()` at construction threw "Database not initialized." `pickTaskRegistry`'s try/catch silently downgraded to the in-memory registry, so every cold-booted production machine has been running with `InMemoryTaskRegistry` since #463 shipped, defeating the whole point of `adcp_decisioning_tasks`. Buyer creates a media buy on machine A, polls on machine B, sees "task not found" with ~50% probability.

Surfaced by #4067's diagnostic logging: the post-#4072 deploy log showed "Database not initialized" at pickTaskRegistry (registry.js:124) immediately followed by the success path (state-store now lazy, so init completed, but task registry was already in-memory).

Wrap the pool in the same lazy `PgQueryable` adapter `pickStateStore` uses. `getPool()` runs at first query; the Postgres backend now actually gets used in production.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
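The lazy adapter idea can be illustrated with a small sketch. The `PgQueryable` shape below is an assumed minimal interface (a single `query` method), and `lazyQueryable` is a hypothetical name; the adapter in the PR may differ in detail.

```typescript
// Assumed minimal shape of a queryable Postgres handle.
interface PgQueryable {
  query(text: string, values?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Defer the pool lookup until the first query, so constructing the adapter
// before the database is initialized cannot throw.
function lazyQueryable(getPool: () => PgQueryable): PgQueryable {
  return {
    async query(text: string, values?: unknown[]): Promise<{ rows: unknown[] }> {
      // Inside an async method, a throwing getPool() becomes a rejected
      // promise for this one query rather than a construction-time failure.
      return getPool().query(text, values);
    },
  };
}
```

With this shape, route mounting can build the task registry before `initializeDatabase()` has run; only a query issued before initialization would fail, and it fails loudly instead of silently downgrading the backend.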
Summary
The post-deploy smoke has been failing on `/sales/mcp` … `/brand/mcp` for several deploys with HTTP 500 + HTML body. Two prior attempts (#4060 eager init, #4062 block listen) did not move the needle; #4062 was reverted because the listener didn't bind within Fly's `--wait-timeout 300s`.

The smoke's HTML body is the giveaway: that's Express 5's default error handler, which fires only when an unhandled rejection escapes the route handler. Looking at `tenantMcpHandler`, `await holder.get()` sits outside the try/catch, so any tenant-registry init rejection becomes an HTML 500 with no JSON body, no log entry, and no clue why.

This PR adds the diagnostic surface we should have had from the start:

- Wrap `await holder.get()` in `try/catch`. Log the rejection with `errMessage`, `errName`, `errStack`, `errCause`, `tenantId`, `host`. Return JSON-RPC 503 instead of HTML 500.
- A `.catch()` so the boot-time rejection is also logged with full error metadata, not just `{ err }` (which pino serializes to only a few fields).
- Boot-phase timing in `createRegistryHolder`: registry construction, config build, per-tenant `register()` elapsed, and aggregate totals. Per-tenant register failures log a stack before re-throwing so a single bad tenant doesn't hide behind `Promise.all`.

Init runs ~14ms locally (visible in the new log). The next failed deploy will tell us which phase is blowing the budget on a fresh Fly machine: DNS, JWKS, `buildServer` schema work, or something else.

Test plan

- `npx tsc -p server/tsconfig.json --noEmit` passes
- `npx vitest run server/src/training-agent/tenants/tenant-smoke.test.ts`: 3/3 pass
- Check `flyctl logs --app adcp-docs --no-tail` for either `Tenant registry initialized` (with phase timings) or `Tenant register failed` / `tenant registry init rejected — request failed before dispatch` (with stack)

🤖 Generated with Claude Code
npx tsc -p server/tsconfig.json --noEmitpassesnpx vitest run server/src/training-agent/tenants/tenant-smoke.test.ts— 3/3 passflyctl logs --app adcp-docs --no-tailfor eitherTenant registry initialized(with phase timings) orTenant register failed/tenant registry init rejected — request failed before dispatch(with stack)🤖 Generated with Claude Code