
fix(training-agent): surface tenant registry init failures + boot-phase timing #4067

Merged
bokelley merged 2 commits into main from bokelley/deploy-smoke-investigation
May 4, 2026
Conversation

@bokelley (Contributor) commented May 4, 2026

Summary

The post-deploy smoke has been failing on /sales/mcp/brand/mcp for several deploys with HTTP 500 + HTML body. Two prior attempts (#4060 eager init, #4062 block listen) did not move the needle — #4062 was reverted because the listener didn't bind within Fly's --wait-timeout 300s.

The smoke's HTML body is the giveaway: that's Express 5's default error handler, which fires only when an unhandled rejection escapes the route handler. Looking at tenantMcpHandler, await holder.get() sits outside the try/catch — so any tenant-registry init rejection becomes an HTML 500 with no JSON body, no log entry, and no clue why.

This PR adds the diagnostic surface we should have had from the start:

  • Wrap await holder.get() in try/catch. Log the rejection with errMessage, errName, errStack, errCause, tenantId, host. Return JSON-RPC 503 instead of HTML 500.
  • Widen the eager-mount .catch() so the boot-time rejection is also logged with full error metadata, not just { err } (which pino-serializes to a few fields).
  • Add boot-phase timing in createRegistryHolder — registry construction, config build, per-tenant register() elapsed, and aggregate totals. Per-tenant register failures log a stack before re-throwing so a single bad tenant doesn't hide behind Promise.all.
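The first bullet can be sketched roughly like this (names such as `guardedRegistryLookup` and the log-field shape are illustrative stand-ins; the real code lives in `tenantMcpHandler` and uses Express 5's `req`/`res`):

```typescript
// Hypothetical sketch: catch the registry-init rejection instead of
// letting it escape to Express's default HTML error handler.
type JsonRpcError = {
  status: number;
  headers: Record<string, string>;
  body: { jsonrpc: "2.0"; id: null; error: { code: number; message: string } };
};

async function guardedRegistryLookup(
  get: () => Promise<unknown>,
  log: (fields: Record<string, unknown>) => void,
  tenantId: string,
  host: string
): Promise<unknown> {
  try {
    return await get();
  } catch (err) {
    const e = err as Error & { cause?: unknown };
    // Log full error metadata so the rejection is tied to a log entry.
    log({
      msg: "tenant registry init rejected - request failed before dispatch",
      errMessage: e.message,
      errName: e.name,
      errStack: e.stack,
      errCause: e.cause,
      tenantId,
      host,
    });
    // JSON-RPC 503 instead of Express's HTML 500.
    const rejection: JsonRpcError = {
      status: 503,
      headers: { "Retry-After": "5" },
      body: {
        jsonrpc: "2.0",
        id: null,
        error: { code: -32000, message: "tenant registry initializing" },
      },
    };
    return rejection;
  }
}
```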

Init runs ~14ms locally (visible in the new log). The next failed deploy will tell us which phase is blowing the budget on a fresh Fly machine — DNS, JWKS, buildServer schema work, or something else.
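The per-tenant timing and fail-loud-before-rethrow behavior might look something like this sketch (the tenant shape and `register()` signature are assumptions, not the real `createRegistryHolder` API):

```typescript
// Hypothetical sketch of boot-phase timing around Promise.all-driven
// tenant registration.
async function timedRegisterAll(
  tenants: { id: string; register: () => Promise<void> }[],
  log: (fields: Record<string, unknown>) => void
): Promise<void> {
  const bootStart = Date.now();
  await Promise.all(
    tenants.map(async (t) => {
      const start = Date.now();
      try {
        await t.register();
        log({ msg: "tenant registered", tenantId: t.id, elapsedMs: Date.now() - start });
      } catch (err) {
        // Log the stack BEFORE re-throwing, so a single bad tenant
        // doesn't hide behind Promise.all's first-rejection semantics.
        log({
          msg: "Tenant register failed",
          tenantId: t.id,
          elapsedMs: Date.now() - start,
          errStack: (err as Error).stack,
        });
        throw err;
      }
    })
  );
  log({ msg: "Tenant registry initialized", totalMs: Date.now() - bootStart });
}
```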

Test plan

  • npx tsc -p server/tsconfig.json --noEmit passes
  • npx vitest run server/src/training-agent/tenants/tenant-smoke.test.ts — 3/3 pass
  • After merge: watch the next deploy's flyctl logs --app adcp-docs --no-tail for either Tenant registry initialized (with phase timings) or Tenant register failed / tenant registry init rejected — request failed before dispatch (with stack)

🤖 Generated with Claude Code

bokelley and others added 2 commits May 4, 2026 07:01
…se timing

Wraps `await holder.get()` in tenantMcpHandler with try/catch so init
rejections are logged with full context and the route returns JSON-RPC
503 instead of HTML 500. The previous unhandled-rejection path produced
an HTML body that the post-deploy smoke flagged but with no log entry
tying the error to the rejected promise.

Adds boot-phase timing in createRegistryHolder (create, config-build,
per-tenant register, totals) so the next deploy reveals which phase is
blowing the budget on a fresh Fly machine. Init runs ~14ms locally;
slowness is environment-specific and was previously invisible. Per-
tenant register failures log a stack before re-throwing so a single
bad tenant doesn't hide behind Promise.all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor the rejection path to `.catch()` returning null so `registry`
keeps its inferred `TenantRegistry` type instead of falling to `any`
under strict mode. Add `Retry-After: 5` header to match the SDK's
documented contract for pending tenants
(`@adcp/sdk` tenant-registry.d.ts says: host transport should respond
503 + Retry-After).
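A minimal illustration of the typed rejection path described above (`TenantRegistry` here is a local stand-in for the SDK type, and `getRegistryOrNull` is a hypothetical name):

```typescript
// Sketch: awaiting p.catch(() => null) lets TypeScript infer
// TenantRegistry | null directly, instead of routing through a
// separately-declared variable whose type can degrade under strict mode.
interface TenantRegistry {
  lookup(host: string): unknown;
}

async function getRegistryOrNull(
  holderGet: () => Promise<TenantRegistry>
): Promise<TenantRegistry | null> {
  const registry = await holderGet().catch(() => null);
  return registry; // inferred as TenantRegistry | null
}
```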

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit fa69e06 into main May 4, 2026
19 checks passed
@bokelley bokelley deleted the bokelley/deploy-smoke-investigation branch May 4, 2026 11:30
bokelley added a commit that referenced this pull request May 4, 2026
…it fast-reject) (#4071)

* fix(training-agent): surface tenant registry init failures + boot-phase timing

Wraps `await holder.get()` in tenantMcpHandler with try/catch so init
rejections are logged with full context and the route returns JSON-RPC
503 instead of HTML 500. The previous unhandled-rejection path produced
an HTML body that the post-deploy smoke flagged but with no log entry
tying the error to the rejected promise.

Adds boot-phase timing in createRegistryHolder (create, config-build,
per-tenant register, totals) so the next deploy reveals which phase is
blowing the budget on a fresh Fly machine. Init runs ~14ms locally;
slowness is environment-specific and was previously invisible. Per-
tenant register failures log a stack before re-throwing so a single
bad tenant doesn't hide behind Promise.all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: keep registry typed + Retry-After: 5 on warmup 503

Refactor the rejection path to `.catch()` returning null so `registry`
keeps its inferred `TenantRegistry` type instead of falling to `any`
under strict mode. Add `Retry-After: 5` header to match the SDK's
documented contract for pending tenants
(`@adcp/sdk` tenant-registry.d.ts says: host transport should respond
503 + Retry-After).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(training-agent): wire PostgresStateStore so tenant init stops fast-rejecting

The diagnostic logging from #4067 confirmed every fresh Fly machine
fails all six tenant `register()` calls within ~13ms with:

  createAdcpServer: in-memory state store refused outside
  {NODE_ENV=test, NODE_ENV=development}

SDK 6.0.1 hard-refuses the module-singleton `InMemoryStateStore` for
multi-tenant deployments. We never wired a non-default `stateStore`,
so every `createAdcpServer` call inside `register()` threw,
`Promise.all` rejected, holder.get() rejected, and tenant routes
returned 503 indefinitely (smoke saw HTML 500 before #4067; 503 with
Retry-After after).

Adds `pickStateStore()` mirroring `pickTaskRegistry()`: PostgresStateStore
in production, InMemoryStateStore in dev/test. Re-throws on prod-pool
init failure rather than falling back to in-memory (would just re-trip
the same SDK guard with extra confusion).
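The selection logic described above could be sketched like this (the function and return values are stand-ins; the real `pickStateStore()` constructs actual `PostgresStateStore`/`InMemoryStateStore` instances from the SDK):

```typescript
// Hypothetical sketch of the env-based state-store choice. SDK 6.0.1
// refuses the in-memory store outside test/development, so production
// must get Postgres -- there is no safe in-memory fallback.
type StateStoreKind = "postgres" | "in-memory";

function pickStateStoreKind(nodeEnv: string | undefined): StateStoreKind {
  if (nodeEnv === "test" || nodeEnv === "development") return "in-memory";
  // Anything else (production, staging, unset) takes the Postgres path;
  // a pool-init failure there should re-throw rather than downgrade.
  return "postgres";
}
```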

Migration 466_adcp_state.sql is the SDK's `ADCP_STATE_MIGRATION`
verbatim — idempotent, runs once via `release_command`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request May 4, 2026
…tually uses Postgres on cold boot (#4073)

Same boot-ordering issue the state-store fix in #4072 closed:
`mountTenantRoutes()` runs before `initializeDatabase()`, so
`getPool()` at construction threw "Database not initialized."
`pickTaskRegistry`'s try/catch silently downgraded to the in-memory
registry — every cold-booted production machine has been running with
`InMemoryTaskRegistry` since #463 shipped, defeating the whole point
of `adcp_decisioning_tasks`. Buyer creates a media buy on machine A,
polls on machine B, sees "task not found" with ~50% probability.

Surfaced by #4067's diagnostic logging — the post-#4072 deploy log
showed "Database not initialized" at pickTaskRegistry (registry.js:124)
immediately followed by the success path (state-store now lazy, so
init completed, but task registry was already in-memory).

Wrap the pool in the same lazy `PgQueryable` adapter `pickStateStore`
uses. `getPool()` runs at first query; the Postgres backend now
actually gets used in production.
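The lazy-adapter pattern is simple to sketch (interface and function names here are illustrative, not the real `PgQueryable`):

```typescript
// Hypothetical sketch: defer getPool() until the first query so that
// construction order relative to initializeDatabase() no longer matters.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

function lazyQueryable(getPool: () => Queryable): Queryable {
  return {
    // getPool() runs at first query, not at adapter construction, so a
    // pool initialized later in boot is still picked up.
    query: async (sql, params) => getPool().query(sql, params),
  };
}
```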

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
