Skip to content

fix(training-agent): wire PostgresStateStore (root cause of tenant init fast-reject)#4071

Merged
bokelley merged 3 commits into
mainfrom
bokelley/wire-postgres-state-store
May 4, 2026
Merged

fix(training-agent): wire PostgresStateStore (root cause of tenant init fast-reject)#4071
bokelley merged 3 commits into
mainfrom
bokelley/wire-postgres-state-store

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

@bokelley bokelley commented May 4, 2026

Summary

The diagnostic logging from #4067 caught it on first deploy:

createAdcpServer: in-memory state store refused outside
{NODE_ENV=test, NODE_ENV=development}

All six tenant register() calls fail within ~13ms of boot. SDK 6.0.1 hard-refuses the module-singleton InMemoryStateStore for multi-tenant deployments — and we never wired a non-default stateStore in defaultServerOptions. Every createAdcpServer call inside register() threw, Promise.all rejected, holder.get() rejected, and tenant routes returned 503 indefinitely (HTML 500 before #4067).

What changed

  1. pickStateStore() in server/src/training-agent/tenants/registry.ts mirrors pickTaskRegistry(): PostgresStateStore in production, InMemoryStateStore in dev/test. Re-throws on prod-pool failure rather than falling back (in-memory fallback would just re-trip the SDK guard).
  2. Wires stateStore: pickStateStore() into buildDefaultServerOptions().
  3. Migration 466_adcp_state.sql — verbatim from the SDK's ADCP_STATE_MIGRATION constant. Idempotent (CREATE TABLE IF NOT EXISTS + ADD COLUMN IF NOT EXISTS); runs once via Fly's release_command before machines roll.

Test plan

  • npx tsc -p server/tsconfig.json --noEmit clean
  • tenant-smoke.test.ts + sync-accounts-gates.test.ts 14/14 pass
  • Training-agent integration suite 25/25 pass
  • Reviewed by code-reviewer and python-expert — no blockers
  • After merge: deploy completes; smoke returns 200/401/403 (not 503); fly logs show Tenant registry initialized with totalMs < 1s

Rollback

If the migration succeeds but the wiring trips: revert this PR (the registry change), then DROP TABLE adcp_state via psql. Reverting the registry without dropping the table is also safe — the table just sits empty.

🤖 Generated with Claude Code

bokelley and others added 3 commits May 4, 2026 07:01
…se timing

Wraps `await holder.get()` in tenantMcpHandler with try/catch so init
rejections are logged with full context and the route returns JSON-RPC
503 instead of HTML 500. The previous unhandled-rejection path produced
an HTML body that the post-deploy smoke flagged but with no log entry
tying the error to the rejected promise.

Adds boot-phase timing in createRegistryHolder (create, config-build,
per-tenant register, totals) so the next deploy reveals which phase is
blowing the budget on a fresh Fly machine. Init runs ~14ms locally;
slowness is environment-specific and was previously invisible. Per-
tenant register failures log a stack before re-throwing so a single
bad tenant doesn't hide behind Promise.all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor the rejection path to `.catch()` returning null so `registry`
keeps its inferred `TenantRegistry` type instead of falling to `any`
under strict mode. Add `Retry-After: 5` header to match the SDK's
documented contract for pending tenants
(`@adcp/sdk` tenant-registry.d.ts says: host transport should respond
503 + Retry-After).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-rejecting

The diagnostic logging from #4067 confirmed every fresh Fly machine
fails all six tenant `register()` calls within ~13ms with:

  createAdcpServer: in-memory state store refused outside
  {NODE_ENV=test, NODE_ENV=development}

SDK 6.0.1 hard-refuses the module-singleton `InMemoryStateStore` for
multi-tenant deployments. We never wired a non-default `stateStore`,
so every `createAdcpServer` call inside `register()` threw,
`Promise.all` rejected, holder.get() rejected, and tenant routes
returned 503 indefinitely (smoke saw HTML 500 before #4067; 503 with
Retry-After after).

Adds `pickStateStore()` mirroring `pickTaskRegistry()`: PostgresStateStore
in production, InMemoryStateStore in dev/test. Re-throws on prod-pool
init failure rather than falling back to in-memory (would just re-trip
the same SDK guard with extra confusion).

Migration 466_adcp_state.sql is the SDK's `ADCP_STATE_MIGRATION`
verbatim — idempotent, runs once via `release_command`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit a43f78f into main May 4, 2026
20 checks passed
@bokelley bokelley deleted the bokelley/wire-postgres-state-store branch May 4, 2026 11:46
bokelley added a commit that referenced this pull request May 4, 2026
…empt boot succeeds (#4072)

The fix in #4071 unblocked deploys but exposed a boot-ordering race:
`mountTenantRoutes()` runs before `initializeDatabase()`, so the eager
`holder.get()` calls `getPool()` and throws "Database not initialized."
The eager-mount catch resets `pendingInit`, the next request retries,
and by then the pool is up — production heals within ~5s. Visible in
the post-#4071 deploy logs:

  T+0  init starting → "Database not initialized" (eager fail)
  T+5  init starting → "Tenant registry initialized" totalMs=27

Wrap the pool in a lazy `PgQueryable` so `getPool()` runs at first
query, not at construction. Construction always succeeds; by the time
a tool touches `ctx.store`, the pool is initialized regardless of
mount order. Eliminates one round-trip of boot retry and the noisy
first error.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant