
fix(training-agent): wire Postgres task registry — production hotfix #3854

Merged
bokelley merged 1 commit into main from bokelley/training-agent-postgres-task-registry
May 2, 2026
Conversation

Contributor

@bokelley bokelley commented May 2, 2026

Summary

Production hotfix for the multi-tenant migration (#3713). Per-tenant POSTs returned HTTP 500 — SDK 6.0 refuses the default in-memory task registry under NODE_ENV=production, and we never passed an explicit taskRegistry.

Symptom

After #3713 + #3714 deployed:

  • GET /sales/mcp → 405 (handler registered, doesn't hit registry)
  • GET /.well-known/adagents.json → 200 with all 6 tenants in _training_agent_tenants
  • POST /mcp (legacy alias) → 200, full tools list (uses v5 path, no SDK 6.0 task registry)
  • POST /signals/mcp → 500 Internal Server Error
  • POST /<any-tenant>/mcp → 500 Internal Server Error

Fix

Wire createPostgresTaskRegistry({ pool: getPool() }) into the tenant registry's default server options for production; test/dev fall back to createInMemoryTaskRegistry(), since the test harness never initializes the Postgres pool.

This is the right fix independent of the SDK guard. The AAO app runs multiple Fly machines — an in-memory task registry is process-local. A buyer who creates a media buy on machine A and polls on machine B would otherwise see task-not-found on ~50% of async-flow polls.

Production fallback is fail-loud-with-warning: if the pool is missing or migration 463 hasn't run, log an error and fall back to in-memory rather than booting broken. Single-instance deployments keep working with degraded multi-machine semantics; multi-instance deployments get a loud signal to fix the deploy.
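The selection-and-fallback logic described above might look roughly like this. A minimal sketch: the SDK factory names and getPool come from this PR, but the wrapper function and the stub types are illustrative, not the actual implementation.

```typescript
// Illustrative stand-ins for the SDK factories and the pool accessor.
type TaskRegistry = { kind: "postgres" | "in-memory" };
const createPostgresTaskRegistry = (opts: { pool: unknown }): TaskRegistry =>
  ({ kind: "postgres" });
const createInMemoryTaskRegistry = (): TaskRegistry =>
  ({ kind: "in-memory" });

// Hypothetical wrapper showing the env split and the fail-loud fallback.
function selectTaskRegistry(env: string, getPool: () => unknown): TaskRegistry {
  if (env !== "production") {
    // Test/dev: the harness never initializes the pool — getPool() would
    // throw before initializeDatabase(), so stay in-memory.
    return createInMemoryTaskRegistry();
  }
  try {
    return createPostgresTaskRegistry({ pool: getPool() });
  } catch (err) {
    // Fail loud, then degrade: single-instance deploys keep working;
    // multi-instance deploys see this error and fix the deploy.
    console.error(
      "task registry: postgres unavailable, falling back to in-memory",
      err,
    );
    return createInMemoryTaskRegistry();
  }
}
```

The catch-and-warn branch is what keeps a misconfigured single-machine deploy serving traffic instead of crashing at first request.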

Migration

server/src/db/migrations/463_adcp_decisioning_tasks.sql — the SDK-shipped DDL from getDecisioningTaskRegistryMigration() verbatim. Idempotent (CREATE TABLE IF NOT EXISTS).
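For reference, the shape of the table this migration creates is roughly the following. Column types and index names here are guesses; the actual file is the verbatim SDK output.

```sql
-- Approximate shape of migration 463 (illustrative, not the shipped file).
CREATE TABLE IF NOT EXISTS adcp_decisioning_tasks (
  task_id    TEXT PRIMARY KEY,
  account_id TEXT NOT NULL,
  status     TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  CONSTRAINT adcp_decisioning_tasks_valid_status
    CHECK (status IN ('submitted', 'working', 'completed', 'failed'))
);

CREATE INDEX IF NOT EXISTS adcp_decisioning_tasks_account_id_idx
  ON adcp_decisioning_tasks (account_id);
CREATE INDEX IF NOT EXISTS adcp_decisioning_tasks_status_created_at_idx
  ON adcp_decisioning_tasks (status, created_at);
```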

Test plan

  • TypeScript clean
  • 382/382 tests passing (5 test files: demo-key, webhooks, legacy-mcp, tool-catalog-drift, training-agent unit)
  • Storyboard sanity: signals 59/61 clean, sales 57/61 clean — equal to or above the post-migration baseline
  • After deploy: curl -s -X POST https://test-agent.adcontextprotocol.org/signals/mcp -H "Authorization: Bearer 1v8tAhASaUYYp4odoQ1PnMpdqNaMiTrCRqYo9OJp6IQ" -H "Content-Type: application/json" -H "Accept: application/json, text/event-stream" -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' returns 200 with a tools array including get_signals and activate_signal

🤖 Generated with Claude Code

Per-tenant POSTs returned HTTP 500 in production after the multi-tenant
migration (#3713) deployed. SDK 6.0 refuses its default in-memory task
registry under NODE_ENV=production, and we never passed an explicit
`taskRegistry` — `createAdcpServerFromPlatform` threw at the first
request, so every per-tenant POST bubbled up as an Internal Server
Error. Legacy /mcp
was unaffected (uses the v5 `createTrainingAgentServer` path which
doesn't go through the SDK 6.0 task registry).

Wire `createPostgresTaskRegistry({ pool: getPool() })` into the tenant
registry's default server options. Test/dev fall back to
`createInMemoryTaskRegistry()` because the test harness doesn't
initialize the postgres pool — `getPool()` throws before
`initializeDatabase()`. Production fallback is fail-loud-with-warning:
if the pool is missing or migration 463 hasn't run, log error and fall
back to in-memory rather than booting broken; a single-instance
deployment then keeps working with degraded multi-machine semantics.

Postgres-backed registry is also independently the correct choice —
the AAO app runs multiple Fly machines, and the in-memory registry is
process-local. A buyer creating a media buy on machine A and polling
on machine B would otherwise see task-not-found on ~50% of polls.

Migration 463 is the SDK-shipped DDL from
`getDecisioningTaskRegistryMigration()` verbatim:
- `adcp_decisioning_tasks` table with `task_id` PK,
  `account_id`/`status`/`created_at`/`updated_at` columns
- `adcp_decisioning_tasks_valid_status` CHECK enumerating the four
  framework-written states (submitted/working/completed/failed)
- Indexes on `account_id` and `(status, created_at)`

382/382 tests passing. Storyboard sanity: signals 59/61 clean, sales
57/61 clean — both equal to or above the post-migration baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit 9c8cba9 into main May 2, 2026
20 checks passed
@bokelley bokelley deleted the bokelley/training-agent-postgres-task-registry branch May 2, 2026 17:44
bokelley added a commit that referenced this pull request May 2, 2026
Hotfix #2 after the multi-tenant migration. PR #3854 fixed the registry
init crash (postgres task registry), but production still returned 404
"Tenant 'signals' is not registered" on per-tenant POSTs. Cause: a
`NODE_ENV=production`-gated guard in `noopJwksValidator` threw at
validation time, the SDK marked every tenant `disabled`, and
`resolveByRequest` returned null for every lookup.

The guard was added in the round-1 review fixes (#3713) on the theory
that an adopter might accidentally import the no-op into a production
tenant registry that should be enforcing JWKS validation. But the only
consumer of this file is the training agent's production deployment,
which uses the no-op by design — brand.json is mounted at
/api/training-agent/.well-known/brand.json, not host root, so the SDK's
default validator can't reach it. The guard fired in production and
broke the deployment it was meant to protect.

Removing the guard. Doc-comment updated to note why the no-op is
intentional and what the failure mode was.

382/382 tests passing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request May 2, 2026
Code review feedback on #3879:
- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing
  the deploy. Omit the Authorization header entirely when the secret
  isn't configured — we accept 401 as healthy anyway.
- Single curl through Cloudflare during rolling-restart settle can blip.
  Retry once with 8s backoff. Both real bug modes (#3854, #3869) are
  deterministic, so a healthy tenant won't pass-on-retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request May 2, 2026
* ci(deploy): smoke training-agent tenants after rolling restart

Closes #3878. Two recent hotfixes (#3854, #3869) shipped clean through CI
but caused production outages because the failure modes were
NODE_ENV=production-gated: createInMemoryTaskRegistry throwing on init,
and noopJwksValidator throwing under a now-removed prod guard that left
every tenant disabled. Both surfaced as 404 or 5xx on per-tenant POSTs —
failure modes that happen before auth, so an unauthenticated tools/list
is sufficient to detect them. 401 is treated as healthy; only 404, 5xx,
and timeouts are deploy-fatal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(deploy): retry once + drop bearer when secret unset

Code review feedback on #3879:
- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing
  the deploy. Omit the Authorization header entirely when the secret
  isn't configured — we accept 401 as healthy anyway.
- Single curl through Cloudflare during rolling-restart settle can blip.
  Retry once with 8s backoff. Both real bug modes (#3854, #3869) are
  deterministic, so a healthy tenant won't pass-on-retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
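The smoke probe described in the two commits above might be sketched like this. Everything here is hypothetical — function names, flags, and retry structure are illustrative, not the actual CI script from #3879 — but it follows the stated rules: 401 is healthy (both real failure modes fire before auth), only 404, 5xx, and timeouts are deploy-fatal, the Authorization header is omitted entirely when no secret is configured, and a fatal result gets one retry with an 8-second backoff.

```shell
#!/bin/sh
# Hypothetical post-deploy tenant probe sketch.

# 404, 5xx, and curl failure/timeout (reported as "000") are deploy-fatal;
# anything else — including 401 — counts as healthy.
classify_status() {
  case "$1" in
    404|000|5??) echo fatal ;;
    *)           echo healthy ;;
  esac
}

probe_tenant() {
  url="$1" secret="$2" attempt=0
  while :; do
    # Build curl args positionally so the Authorization header is omitted
    # entirely when no secret is set — strict-bearer servers could 400 on
    # a placeholder like "Bearer unset".
    set -- -s -o /dev/null -w '%{http_code}' --max-time 10 \
      -X POST "$url" \
      -H 'Content-Type: application/json' \
      -H 'Accept: application/json, text/event-stream' \
      -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
    [ -n "$secret" ] && set -- "$@" -H "Authorization: Bearer $secret"
    status=$(curl "$@")
    if [ "$(classify_status "$status")" = healthy ]; then
      echo healthy; return 0
    fi
    # Retry once with 8s backoff to ride out rolling-restart blips; the
    # real bug modes were deterministic, so they won't pass-on-retry.
    attempt=$((attempt + 1))
    if [ "$attempt" -ge 2 ]; then
      echo fatal; return 1
    fi
    sleep 8
  done
}
```

Usage would be one probe_tenant call per tenant path, failing the deploy if any returns fatal.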