
fix(training-agent): wire Postgres task registry — production hotfix #3854

Merged
bokelley merged 1 commit into main from bokelley/training-agent-postgres-task-registry
May 2, 2026
Conversation

Contributor

@bokelley bokelley commented May 2, 2026

Summary

Production hotfix for the multi-tenant migration (#3713). Per-tenant POSTs returned HTTP 500 — SDK 6.0 refuses the default in-memory task registry under NODE_ENV=production, and we never passed an explicit taskRegistry.

Symptom

After #3713 + #3714 deployed:

  • GET /sales/mcp → 405 (handler registered, doesn't hit registry)
  • GET /.well-known/adagents.json → 200 with all 6 tenants in _training_agent_tenants
  • POST /mcp (legacy alias) → 200, full tools list (uses v5 path, no SDK 6.0 task registry)
  • POST /signals/mcp → 500 Internal Server Error
  • POST /<any-tenant>/mcp → 500 Internal Server Error

Fix

Wire createPostgresTaskRegistry({ pool: getPool() }) into the tenant registry's default server options for production; test/dev fall back to createInMemoryTaskRegistry(), since the test harness never initializes the Postgres pool.

This is the right fix independent of the SDK guard. The AAO app runs multiple Fly machines — an in-memory task registry is process-local. A buyer who creates a media buy on machine A and polls on machine B would otherwise see task-not-found on ~50% of async-flow polls.

Production fallback is fail-loud-with-warning: if the pool is missing or migration 463 hasn't run, log an error and fall back to in-memory rather than booting broken. Single-instance deployments keep working with degraded multi-machine semantics; multi-instance deployments get a loud signal to fix the deploy.
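The selection-and-fallback logic described above might look roughly like this. A minimal sketch: the SDK factory names and getPool come from this PR, but the wrapper function and the stub types are illustrative, not the actual implementation.

```typescript
// Illustrative stand-ins for the SDK factories and the pool accessor.
type TaskRegistry = { kind: "postgres" | "in-memory" };
const createPostgresTaskRegistry = (opts: { pool: unknown }): TaskRegistry =>
  ({ kind: "postgres" });
const createInMemoryTaskRegistry = (): TaskRegistry =>
  ({ kind: "in-memory" });

// Hypothetical wrapper showing the env split and the fail-loud fallback.
function selectTaskRegistry(env: string, getPool: () => unknown): TaskRegistry {
  if (env !== "production") {
    // Test/dev: the harness never initializes the pool — getPool() would
    // throw before initializeDatabase(), so stay in-memory.
    return createInMemoryTaskRegistry();
  }
  try {
    return createPostgresTaskRegistry({ pool: getPool() });
  } catch (err) {
    // Fail loud, then degrade: single-instance deploys keep working;
    // multi-instance deploys see this error and fix the deploy.
    console.error(
      "task registry: postgres unavailable, falling back to in-memory",
      err,
    );
    return createInMemoryTaskRegistry();
  }
}
```

The catch-and-warn branch is what keeps a misconfigured single-machine deploy serving traffic instead of crashing at first request.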

Migration

server/src/db/migrations/463_adcp_decisioning_tasks.sql — the SDK-shipped DDL from getDecisioningTaskRegistryMigration() verbatim. Idempotent (CREATE TABLE IF NOT EXISTS).
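For reference, the shape of the table this migration creates is roughly the following. Column types and index names here are guesses; the actual file is the verbatim SDK output.

```sql
-- Approximate shape of migration 463 (illustrative, not the shipped file).
CREATE TABLE IF NOT EXISTS adcp_decisioning_tasks (
  task_id    TEXT PRIMARY KEY,
  account_id TEXT NOT NULL,
  status     TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  CONSTRAINT adcp_decisioning_tasks_valid_status
    CHECK (status IN ('submitted', 'working', 'completed', 'failed'))
);

CREATE INDEX IF NOT EXISTS adcp_decisioning_tasks_account_id_idx
  ON adcp_decisioning_tasks (account_id);
CREATE INDEX IF NOT EXISTS adcp_decisioning_tasks_status_created_at_idx
  ON adcp_decisioning_tasks (status, created_at);
```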

Test plan

  • TypeScript clean
  • 382/382 tests passing (5 test files: demo-key, webhooks, legacy-mcp, tool-catalog-drift, training-agent unit)
  • Storyboard sanity: signals 59/61 clean, sales 57/61 clean — equal to or above the post-migration baseline
  • After deploy: curl -s -X POST https://test-agent.adcontextprotocol.org/signals/mcp -H "Authorization: Bearer 1v8tAhASaUYYp4odoQ1PnMpdqNaMiTrCRqYo9OJp6IQ" -H "Content-Type: application/json" -H "Accept: application/json, text/event-stream" -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' returns 200 with a tools array including get_signals and activate_signal

🤖 Generated with Claude Code

Per-tenant POSTs returned HTTP 500 in production after the multi-tenant
migration (#3713) deployed. SDK 6.0 refuses its default in-memory task
registry under NODE_ENV=production, and we never passed an explicit
`taskRegistry` — `createAdcpServerFromPlatform` threw at the first
request, so every per-tenant POST bubbled up as an Internal Server
Error. Legacy /mcp
was unaffected (uses the v5 `createTrainingAgentServer` path which
doesn't go through the SDK 6.0 task registry).

Wire `createPostgresTaskRegistry({ pool: getPool() })` into the tenant
registry's default server options. Test/dev fall back to
`createInMemoryTaskRegistry()` because the test harness doesn't
initialize the postgres pool — `getPool()` throws before
`initializeDatabase()`. Production fallback is fail-loud-with-warning:
if the pool is missing or migration 463 hasn't run, log error and fall
back to in-memory rather than booting broken; a single-instance
deployment then keeps working with degraded multi-machine semantics.

Postgres-backed registry is also independently the correct choice —
the AAO app runs multiple Fly machines, and the in-memory registry is
process-local. A buyer creating a media buy on machine A and polling
on machine B would otherwise see task-not-found on ~50% of polls.

Migration 463 is the SDK-shipped DDL from
`getDecisioningTaskRegistryMigration()` verbatim:
- `adcp_decisioning_tasks` table with `task_id` PK,
  `account_id`/`status`/`created_at`/`updated_at` columns
- `adcp_decisioning_tasks_valid_status` CHECK enumerating the four
  framework-written states (submitted/working/completed/failed)
- Indexes on `account_id` and `(status, created_at)`

382/382 tests passing. Storyboard sanity: signals 59/61 clean, sales
57/61 clean — both equal to or above the post-migration baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit 9c8cba9 into main May 2, 2026
20 checks passed
@bokelley bokelley deleted the bokelley/training-agent-postgres-task-registry branch May 2, 2026 17:44
bokelley added a commit that referenced this pull request May 2, 2026
Hotfix #2 after the multi-tenant migration. PR #3854 fixed the registry
init crash (postgres task registry), but production still returned 404
"Tenant 'signals' is not registered" on per-tenant POSTs. Cause: a
`NODE_ENV=production`-gated guard in `noopJwksValidator` threw at
validation time, the SDK marked every tenant `disabled`, and
`resolveByRequest` returned null for every lookup.

The guard was added in the round-1 review fixes (#3713) on the theory
that an adopter might accidentally import the no-op into a production
tenant registry that should be enforcing JWKS validation. But the only
consumer of this file is the training agent's production deployment,
which uses the no-op by design — brand.json is mounted at
/api/training-agent/.well-known/brand.json, not host root, so the SDK's
default validator can't reach it. The guard fired in production and
broke the deployment it was meant to protect.

Removing the guard. Doc-comment updated to note why the no-op is
intentional and what the failure mode was.

382/382 tests passing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request May 2, 2026
Code review feedback on #3879:
- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing
  the deploy. Omit the Authorization header entirely when the secret
  isn't configured — we accept 401 as healthy anyway.
- Single curl through Cloudflare during rolling-restart settle can blip.
  Retry once with 8s backoff. Both real bug modes (#3854, #3869) are
  deterministic, so a healthy tenant won't pass-on-retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request May 2, 2026
* ci(deploy): smoke training-agent tenants after rolling restart

Closes #3878. Two recent hotfixes (#3854, #3869) shipped clean through CI
but caused production outages because the failure modes were
NODE_ENV=production-gated: createInMemoryTaskRegistry throwing on init,
and noopJwksValidator throwing under a now-removed prod guard that left
every tenant disabled. Both surfaced as 404 or 5xx on per-tenant POSTs —
failure modes that happen before auth, so an unauthenticated tools/list
is sufficient to detect them. 401 is treated as healthy; only 404, 5xx,
and timeouts are deploy-fatal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(deploy): retry once + drop bearer when secret unset

Code review feedback on #3879:
- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing
  the deploy. Omit the Authorization header entirely when the secret
  isn't configured — we accept 401 as healthy anyway.
- Single curl through Cloudflare during rolling-restart settle can blip.
  Retry once with 8s backoff. Both real bug modes (#3854, #3869) are
  deterministic, so a healthy tenant won't pass-on-retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
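The smoke probe described in the two commits above might be sketched like this. Everything here is hypothetical — function names, flags, and retry structure are illustrative, not the actual CI script from #3879 — but it follows the stated rules: 401 is healthy (both real failure modes fire before auth), only 404, 5xx, and timeouts are deploy-fatal, the Authorization header is omitted entirely when no secret is configured, and a fatal result gets one retry with an 8-second backoff.

```shell
#!/bin/sh
# Hypothetical post-deploy tenant probe sketch.

# 404, 5xx, and curl failure/timeout (reported as "000") are deploy-fatal;
# anything else — including 401 — counts as healthy.
classify_status() {
  case "$1" in
    404|000|5??) echo fatal ;;
    *)           echo healthy ;;
  esac
}

probe_tenant() {
  url="$1" secret="$2" attempt=0
  while :; do
    # Build curl args positionally so the Authorization header is omitted
    # entirely when no secret is set — strict-bearer servers could 400 on
    # a placeholder like "Bearer unset".
    set -- -s -o /dev/null -w '%{http_code}' --max-time 10 \
      -X POST "$url" \
      -H 'Content-Type: application/json' \
      -H 'Accept: application/json, text/event-stream' \
      -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
    [ -n "$secret" ] && set -- "$@" -H "Authorization: Bearer $secret"
    status=$(curl "$@")
    if [ "$(classify_status "$status")" = healthy ]; then
      echo healthy; return 0
    fi
    # Retry once with 8s backoff to ride out rolling-restart blips; the
    # real bug modes were deterministic, so they won't pass-on-retry.
    attempt=$((attempt + 1))
    if [ "$attempt" -ge 2 ]; then
      echo fatal; return 1
    fi
    sleep 8
  done
}
```

Usage would be one probe_tenant call per tenant path, failing the deploy if any returns fatal.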