Conversation
Per-tenant POSTs returned HTTP 500 in production after the multi-tenant migration (#3713) deployed. SDK 6.0 refuses its default in-memory task registry under NODE_ENV=production, and we never passed an explicit `taskRegistry` — `createAdcpServerFromPlatform` threw at first request, and every per-tenant POST bubbled up as Internal Server Error. Legacy /mcp was unaffected (it uses the v5 `createTrainingAgentServer` path, which doesn't go through the SDK 6.0 task registry).

Fix: wire `createPostgresTaskRegistry({ pool: getPool() })` into the tenant registry's default server options. Test/dev fall back to `createInMemoryTaskRegistry()` because the test harness doesn't initialize the postgres pool — `getPool()` throws before `initializeDatabase()`. Production fallback is fail-loud-with-warning: if the pool is missing or migration 463 hasn't run, log an error and fall back to in-memory rather than booting broken; a single-instance deployment then keeps working with degraded multi-machine semantics.

A postgres-backed registry is also independently the correct choice — the AAO app runs multiple Fly machines, and the in-memory registry is process-local. A buyer creating a media buy on machine A and polling on machine B would otherwise see task-not-found on ~50% of polls.

Migration 463 is the SDK-shipped DDL from `getDecisioningTaskRegistryMigration()` verbatim:

- `adcp_decisioning_tasks` table with `task_id` PK and `account_id`/`status`/`created_at`/`updated_at` columns
- `adcp_decisioning_tasks_valid_status` CHECK enumerating the four framework-written states (submitted/working/completed/failed)
- Indexes on `account_id` and `(status, created_at)`

382/382 tests passing. Storyboard sanity: signals 59/61 clean, sales 57/61 clean — both at or above the post-migration baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
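The selection-plus-fallback behavior described above could look roughly like the sketch below. This is illustrative only: `selectTaskRegistry` and the stub factories are stand-ins for the real SDK and app helpers, not their actual APIs.

```typescript
// Minimal sketch of the registry-selection logic, assuming stand-in
// factories. The real createPostgresTaskRegistry/createInMemoryTaskRegistry
// come from the SDK; getPool is the app's pg pool accessor.

type TaskRegistry = { kind: "postgres" | "in-memory" };

const createInMemoryTaskRegistry = (): TaskRegistry => ({ kind: "in-memory" });
const createPostgresTaskRegistry = (_opts: { pool: unknown }): TaskRegistry => ({
  kind: "postgres",
});

function selectTaskRegistry(
  env: string | undefined,
  getPool: () => unknown,
): TaskRegistry {
  if (env !== "production") {
    // Test/dev: the harness never initializes the pg pool, so getPool()
    // would throw before initializeDatabase() runs.
    return createInMemoryTaskRegistry();
  }
  try {
    return createPostgresTaskRegistry({ pool: getPool() });
  } catch (err) {
    // Fail loud, keep booting: a single-instance deployment still works,
    // with degraded multi-machine semantics.
    console.error("falling back to in-memory task registry:", err);
    return createInMemoryTaskRegistry();
  }
}
```

The fail-loud branch is the judgment call: it trades silent multi-machine breakage for a logged warning and a still-booting process.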
bokelley added a commit that referenced this pull request on May 2, 2026
Hotfix #2 after the multi-tenant migration. PR #3854 fixed the registry init crash (postgres task registry), but production still returned 404 "Tenant 'signals' is not registered" on per-tenant POSTs.

Cause: a `NODE_ENV=production`-gated guard in `noopJwksValidator` threw at validation time, the SDK marked every tenant `disabled`, and `resolveByRequest` returned null for every lookup. The guard was added in the round-1 review fixes (#3713) on the theory that an adopter might accidentally import the no-op into a production tenant registry that should be enforcing JWKS validation. But the only consumer of this file is the training agent's production deployment, which uses the no-op by design — brand.json is mounted at /api/training-agent/.well-known/brand.json, not host root, so the SDK's default validator can't reach it. The guard fired in production and broke the deployment it was meant to protect.

Removing the guard. The doc-comment is updated to note why the no-op is intentional and what the failure mode was.

382/382 tests passing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
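A minimal sketch of what the guard-free no-op might look like. The `JwksValidator` shape here is an assumption for illustration, not the SDK's actual interface.

```typescript
// Assumed validator shape; the real SDK type may differ.
type JwksValidator = (token: string) => Promise<{ valid: boolean }>;

// Intentionally a no-op: brand.json is mounted at
// /api/training-agent/.well-known/brand.json rather than host root, so the
// SDK's default JWKS validator can never reach it. A NODE_ENV=production
// guard here previously threw at validation time, which made the SDK mark
// every tenant disabled and return 404 on all per-tenant lookups.
const noopJwksValidator: JwksValidator = async (_token) => ({ valid: true });
```

The doc-comment carries the load here: it has to stop the next reviewer from re-adding the guard that caused the outage.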
bokelley added a commit that referenced this pull request on May 2, 2026
Code review feedback on #3879:

- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing the deploy. Omit the Authorization header entirely when the secret isn't configured — we accept 401 as healthy anyway.
- A single curl through Cloudflare during rolling-restart settle can blip. Retry once with 8s backoff.

Both real bug modes (#3854, #3869) are deterministic, so a broken tenant won't falsely pass on retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request on May 2, 2026
* ci(deploy): smoke training-agent tenants after rolling restart

Closes #3878. Two recent hotfixes (#3854, #3869) shipped clean through CI but caused production outages because the failure modes were NODE_ENV=production-gated: createInMemoryTaskRegistry throwing on init, and noopJwksValidator throwing under a now-removed prod guard that left every tenant disabled. Both surfaced as 404 or 5xx on per-tenant POSTs — failure modes that happen before auth, so an unauthenticated tools/list is sufficient to detect them. 401 is treated as healthy; only 404, 5xx, and timeouts are deploy-fatal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(deploy): retry once + drop bearer when secret unset

Code review feedback on #3879:

- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing the deploy. Omit the Authorization header entirely when the secret isn't configured — we accept 401 as healthy anyway.
- A single curl through Cloudflare during rolling-restart settle can blip. Retry once with 8s backoff.

Both real bug modes (#3854, #3869) are deterministic, so a broken tenant won't falsely pass on retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
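The smoke check's decision logic can be sketched in TypeScript (illustrative names throughout; the actual check is a curl loop in the deploy workflow, not Node code):

```typescript
// Sketch of the smoke-check rules, assuming hypothetical helper names.

function buildHeaders(secret: string | undefined): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // Omit Authorization entirely when the secret isn't configured; a
  // strict-bearer-format server would 400 on "Bearer unset", and 401 is
  // accepted as healthy anyway.
  if (secret) headers["Authorization"] = `Bearer ${secret}`;
  return headers;
}

// 401 means the tenant resolved and auth ran, so the tenant is healthy.
// Only 404, 5xx, and timeouts (status undefined) are deploy-fatal.
function isHealthy(status: number | undefined): boolean {
  if (status === undefined) return false;
  if (status === 404 || status >= 500) return false;
  return true;
}

async function smokeTenant(url: string, secret?: string): Promise<boolean> {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: buildHeaders(secret),
        body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "tools/list", params: {} }),
      });
      if (isHealthy(res.status)) return true;
    } catch {
      // Network blip through Cloudflare during rolling-restart settle.
    }
    // Retry once with 8s backoff; the real bug modes are deterministic,
    // so a genuinely broken tenant fails both attempts.
    if (attempt === 0) await new Promise((r) => setTimeout(r, 8000));
  }
  return false;
}
```

Separating the pure classification (`isHealthy`, `buildHeaders`) from the network loop keeps the deploy-fatal rules easy to audit.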
Summary
Production hotfix for the multi-tenant migration (#3713). Per-tenant POSTs returned HTTP 500 — SDK 6.0 refuses the default in-memory task registry under NODE_ENV=production, and we never passed an explicit `taskRegistry`.

Symptom
After #3713 + #3714 deployed:
- `GET /sales/mcp` → 405 (handler registered, doesn't hit registry)
- `GET /.well-known/adagents.json` → 200 with all 6 tenants in `_training_agent_tenants`
- `POST /mcp` (legacy alias) → 200, full tools list (uses v5 path, no SDK 6.0 task registry)
- `POST /signals/mcp` → 500 Internal Server Error
- `POST /<any-tenant>/mcp` → 500 Internal Server Error

Fix
`createPostgresTaskRegistry({ pool: getPool() })` for production; `createInMemoryTaskRegistry()` for test/dev.

This is the right fix independent of the SDK guard. The AAO app runs multiple Fly machines — an in-memory task registry is process-local. A buyer who creates a media buy on machine A and polls on machine B would otherwise see task-not-found on ~50% of async-flow polls.
Production fallback is fail-loud-with-warning: if the pool is missing or migration 463 hasn't run, log an error and fall back to in-memory. Single-instance deployments keep working; multi-instance deployments get a loud error flagging the deploy for a fix.
Migration
`server/src/db/migrations/463_adcp_decisioning_tasks.sql` — the SDK-shipped DDL from `getDecisioningTaskRegistryMigration()`, verbatim. Idempotent (`CREATE TABLE IF NOT EXISTS`).

Test plan
```
curl -s -X POST https://test-agent.adcontextprotocol.org/signals/mcp \
  -H "Authorization: Bearer 1v8tAhASaUYYp4odoQ1PnMpdqNaMiTrCRqYo9OJp6IQ" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
```

Returns 200 with a tools array including `get_signals` and `activate_signal`.

🤖 Generated with Claude Code