fix(training-agent): drop noopJwksValidator production guard#3869
Merged
fix(training-agent): drop noopJwksValidator production guard#3869
Conversation
Hotfix #2 after the multi-tenant migration. PR #3854 fixed the registry init crash (postgres task registry), but production still returned 404 "Tenant 'signals' is not registered" on per-tenant POSTs. Cause: a `NODE_ENV=production`-gated guard in `noopJwksValidator` threw at validation time, the SDK marked every tenant `disabled`, and `resolveByRequest` returned null for every lookup. The guard was added in the round-1 review fixes (#3713) on the theory that an adopter might accidentally import the no-op into a production tenant registry that should be enforcing JWKS validation. But the only consumer of this file is the training agent's production deployment, which uses the no-op by design — brand.json is mounted at /api/training-agent/.well-known/brand.json, not host root, so the SDK's default validator can't reach it. The guard fired in production and broke the deployment it was meant to protect. Removing the guard. Doc-comment updated to note why the no-op is intentional and what the failure mode was. 382/382 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 2, 2026
bokelley
added a commit
that referenced
this pull request
May 2, 2026
Code review feedback on #3879: - Strict-bearer-format servers could 400 on `Bearer unset`, false-failing the deploy. Omit the Authorization header entirely when the secret isn't configured — we accept 401 as healthy anyway. - Single curl through Cloudflare during rolling-restart settle can blip. Retry once with 8s backoff. Both real bug modes (#3854, #3869) are deterministic, so a healthy tenant won't pass-on-retry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bokelley
added a commit
that referenced
this pull request
May 2, 2026
* ci(deploy): smoke training-agent tenants after rolling restart Closes #3878. Two recent hotfixes (#3854, #3869) shipped clean through CI but caused production outages because the failure modes were NODE_ENV=production-gated: createInMemoryTaskRegistry throwing on init, and noopJwksValidator throwing under a now-removed prod guard that left every tenant disabled. Both surfaced as 404 or 5xx on per-tenant POSTs — failure modes that happen before auth, so an unauthenticated tools/list is sufficient to detect them. 401 is treated as healthy; only 404, 5xx, and timeouts are deploy-fatal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(deploy): retry once + drop bearer when secret unset Code review feedback on #3879: - Strict-bearer-format servers could 400 on `Bearer unset`, false-failing the deploy. Omit the Authorization header entirely when the secret isn't configured — we accept 401 as healthy anyway. - Single curl through Cloudflare during rolling-restart settle can blip. Retry once with 8s backoff. Both real bug modes (#3854, #3869) are deterministic, so a healthy tenant won't pass-on-retry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hotfix #2 after the multi-tenant migration. After #3854 fixed the registry-init crash (postgres task registry), production smoke still failed:
Cause
A
NODE_ENV=production-gated guard innoopJwksValidator(added in the round-1 review fixes for #3713) threw at validation time. The SDK caught the throw, marked every tenantdisabled, andresolveByRequestreturned null for every lookup.Fix
Remove the guard. It was overprotective: reviewers added it on the theory that an adopter might accidentally import the no-op into a production registry that should be enforcing JWKS validation — but the only consumer of this file is the training agent's production deployment, which uses the no-op by design (brand.json is mounted at
/api/training-agent/.well-known/brand.json, not host root, so the SDK's default validator can't reach it).The guard fired in production and broke the deployment it was meant to protect.
After deploy
Smoke that should now return 200:
Test plan
🤖 Generated with Claude Code