Skip to content

fix(training-agent): drop noopJwksValidator production guard#3869

Merged
bokelley merged 1 commit intomainfrom
bokelley/training-agent-drop-noop-jwks-prod-guard
May 2, 2026
Merged

fix(training-agent): drop noopJwksValidator production guard#3869
bokelley merged 1 commit intomainfrom
bokelley/training-agent-drop-noop-jwks-prod-guard

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

@bokelley bokelley commented May 2, 2026

Summary

Hotfix #2 after the multi-tenant migration. After #3854 fixed the registry-init crash (postgres task registry), production smoke still failed:

POST /signals/mcp → HTTP 404
{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Tenant 'signals' is not registered"}}

Cause

A NODE_ENV=production-gated guard in noopJwksValidator (added in the round-1 review fixes for #3713) threw at validation time. The SDK caught the throw, marked every tenant disabled, and resolveByRequest returned null for every lookup.

Fix

Remove the guard. It was overprotective: reviewers added it on the theory that an adopter might accidentally import the no-op into a production registry that should be enforcing JWKS validation — but the only consumer of this file is the training agent's production deployment, which uses the no-op by design (brand.json is mounted at /api/training-agent/.well-known/brand.json, not host root, so the SDK's default validator can't reach it).

The guard fired in production and broke the deployment it was meant to protect.

After deploy

Smoke that should now return 200:

curl -s -X POST https://test-agent.adcontextprotocol.org/signals/mcp \
  -H "Authorization: Bearer 1v8tAhASaUYYp4odoQ1PnMpdqNaMiTrCRqYo9OJp6IQ" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'

Test plan

  • TypeScript clean
  • 382/382 tests passing
  • After deploy: per-tenant POST returns 200 with the right tools

🤖 Generated with Claude Code

Hotfix #2 after the multi-tenant migration. PR #3854 fixed the registry
init crash (postgres task registry), but production still returned 404
"Tenant 'signals' is not registered" on per-tenant POSTs. Cause: a
`NODE_ENV=production`-gated guard in `noopJwksValidator` threw at
validation time, the SDK marked every tenant `disabled`, and
`resolveByRequest` returned null for every lookup.

The guard was added in the round-1 review fixes (#3713) on the theory
that an adopter might accidentally import the no-op into a production
tenant registry that should be enforcing JWKS validation. But the only
consumer of this file is the training agent's production deployment,
which uses the no-op by design — brand.json is mounted at
/api/training-agent/.well-known/brand.json, not host root, so the SDK's
default validator can't reach it. The guard fired in production and
broke the deployment it was meant to protect.

Removing the guard. Doc-comment updated to note why the no-op is
intentional and what the failure mode was.

382/382 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bokelley bokelley merged commit 2268acf into main May 2, 2026
19 checks passed
@bokelley bokelley deleted the bokelley/training-agent-drop-noop-jwks-prod-guard branch May 2, 2026 18:39
bokelley added a commit that referenced this pull request May 2, 2026
Code review feedback on #3879:
- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing
  the deploy. Omit the Authorization header entirely when the secret
  isn't configured — we accept 401 as healthy anyway.
- Single curl through Cloudflare during rolling-restart settle can blip.
  Retry once with 8s backoff. Both real bug modes (#3854, #3869) are
  deterministic, so a healthy tenant won't pass-on-retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley added a commit that referenced this pull request May 2, 2026
* ci(deploy): smoke training-agent tenants after rolling restart

Closes #3878. Two recent hotfixes (#3854, #3869) shipped clean through CI
but caused production outages because the failure modes were
NODE_ENV=production-gated: createInMemoryTaskRegistry throwing on init,
and noopJwksValidator throwing under a now-removed prod guard that left
every tenant disabled. Both surfaced as 404 or 5xx on per-tenant POSTs —
failure modes that happen before auth, so an unauthenticated tools/list
is sufficient to detect them. 401 is treated as healthy; only 404, 5xx,
and timeouts are deploy-fatal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(deploy): retry once + drop bearer when secret unset

Code review feedback on #3879:
- Strict-bearer-format servers could 400 on `Bearer unset`, false-failing
  the deploy. Omit the Authorization header entirely when the secret
  isn't configured — we accept 401 as healthy anyway.
- Single curl through Cloudflare during rolling-restart settle can blip.
  Retry once with 8s backoff. Both real bug modes (#3854, #3869) are
  deterministic, so a healthy tenant won't pass-on-retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant