Refactor: Improve MCPServer retry logic and transient error detection #329

Merged
teemow merged 5 commits into main from fix/upgrade-mcp-oauth-cross-client-scope-fix
Jan 31, 2026

@teemow teemow commented Jan 31, 2026

Summary

  • Retry Logic: Added background retry loop for failed MCPServers with exponential backoff
  • Transient Error Detection: Extended to include HTTP 5xx errors (500-511)
  • Max Concurrent Retries: Added limit of 5 concurrent retries to prevent thundering herd
  • Comprehensive Tests: Added unit tests for orchestrator retry methods

Changes

Orchestrator Retry Logic

  • New retryFailedMCPServers() background task that periodically checks for failed servers
  • shouldAttemptRetry() checks service state and backoff expiration
  • MaxConcurrentRetries = 5 prevents overwhelming upstream services
  • Proper shutdown handling with WaitGroup for in-flight retries
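
A minimal sketch of how one retry pass could combine these pieces; it is not the muster implementation. `shouldAttemptRetry`, `MaxConcurrentRetries`, and the use of a `WaitGroup` come from the list above, while the `service` type, `retryPass`, `backoff`, and the backoff bounds are hypothetical. In the real orchestrator a background task would presumably trigger such a pass periodically.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// MaxConcurrentRetries is the limit named in the PR.
const MaxConcurrentRetries = 5

// Backoff bounds are illustrative assumptions, not values from the PR.
const (
	baseBackoff = 1 * time.Second
	maxBackoff  = 1 * time.Minute
)

// service is a hypothetical stand-in for a failed MCPServer.
type service struct {
	name      string
	failed    bool
	attempts  int
	nextRetry time.Time // end of the current backoff window
}

// backoff doubles the delay per failed attempt, capped at maxBackoff.
func backoff(attempts int) time.Duration {
	d := baseBackoff
	for i := 0; i < attempts && d < maxBackoff; i++ {
		d *= 2
	}
	if d > maxBackoff {
		d = maxBackoff
	}
	return d
}

// shouldAttemptRetry checks service state and backoff expiration.
func shouldAttemptRetry(s *service, now time.Time) bool {
	return s.failed && !now.Before(s.nextRetry)
}

// retryPass restarts at most MaxConcurrentRetries eligible services and waits
// for the in-flight restarts before returning how many it started.
func retryPass(ctx context.Context, services []*service, restart func(context.Context, *service) error) int {
	var wg sync.WaitGroup
	now := time.Now()
	started := 0
	for _, s := range services {
		if ctx.Err() != nil || started >= MaxConcurrentRetries {
			break // shutting down, or the concurrency budget is used up
		}
		if !shouldAttemptRetry(s, now) {
			continue
		}
		started++
		wg.Add(1)
		go func(s *service) {
			defer wg.Done()
			if err := restart(ctx, s); err != nil {
				s.attempts++
				s.nextRetry = now.Add(backoff(s.attempts))
				return
			}
			s.failed, s.attempts = false, 0
		}(s)
	}
	wg.Wait() // clean shutdown: no restart goroutine outlives the pass
	return started
}

// demoRetryPass runs one pass over eight failed services with a restart
// function that always succeeds.
func demoRetryPass() (started, stillFailed int) {
	services := make([]*service, 8)
	for i := range services {
		services[i] = &service{name: fmt.Sprintf("mcp-%d", i), failed: true}
	}
	started = retryPass(context.Background(), services,
		func(context.Context, *service) error { return nil })
	for _, s := range services {
		if s.failed {
			stillFailed++
		}
	}
	return started, stillFailed
}

func main() {
	started, stillFailed := demoRetryPass()
	fmt.Printf("started=%d stillFailed=%d\n", started, stillFailed)
}
```

With eight failed services, only five restarts start in the first pass; the remaining three stay failed until a later pass, which is the effect the concurrency limit is meant to have.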

Transient Error Detection

  • HTTP 5xx status codes (500-511) are detected by looping over the range, which keeps the matching code DRY
  • Added descriptive patterns for common 5xx error messages
  • Updated comment documentation to reflect 500-511 range
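
As a rough illustration of the loop-based approach, the sketch below generates one pattern per status code in 500-511 and matches case-insensitively against the error text. The function names, the `http <code>` / `status <code>` pattern shapes, and the descriptive phrases are assumptions for illustration; the actual patterns in the PR may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// transientPatterns is built once at package initialization.
var transientPatterns = buildTransientPatterns()

// buildTransientPatterns returns lowercase substrings that mark an error as
// transient. The phrase list and pattern shapes are illustrative assumptions.
func buildTransientPatterns() []string {
	patterns := []string{
		"internal server error",
		"bad gateway",
		"service unavailable",
		"gateway timeout",
	}
	// One loop covers the whole 500-511 range instead of a hand-written list.
	for code := 500; code <= 511; code++ {
		patterns = append(patterns,
			fmt.Sprintf("http %d", code),
			fmt.Sprintf("status %d", code),
		)
	}
	return patterns
}

// isTransientMessage reports whether msg contains any transient pattern,
// ignoring case.
func isTransientMessage(msg string) bool {
	msg = strings.ToLower(msg)
	for _, p := range transientPatterns {
		if strings.Contains(msg, p) {
			return true
		}
	}
	return false
}

// isTransientError applies the same check to an error. Wrapped errors work
// because the wrapper's message includes the inner message.
func isTransientError(err error) bool {
	return err != nil && isTransientMessage(err.Error())
}

func main() {
	inner := errors.New("HTTP 503 Service Unavailable")
	wrapped := fmt.Errorf("connect to upstream: %w", inner)
	fmt.Println(isTransientError(wrapped))
	fmt.Println(isTransientError(errors.New("HTTP 404 Not Found")))
}
```

Substring matching is simple but loose: a message such as `status 5110` would also match `status 511`, so a stricter implementation might prefer inspecting a typed status code where one is available.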

Code Quality Improvements (Code Review)

  • Replaced time.Sleep in test with proper channel synchronization (no race conditions)
  • Added compile-time interface verification for mock types (var _ Interface = (*Mock)(nil))
  • Simplified HTTP 5xx pattern matching using loop (DRY principle)
  • Added test for restart error handling path
  • Extracted RestartGracePeriod as named constant with documentation
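
Two of these items, the compile-time interface check and channel-based synchronization in place of `time.Sleep`, can be sketched as follows. The `Restarter` interface, `mockRestarter`, and `waitForCall` are hypothetical names chosen for this example, not types from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Restarter is a hypothetical interface the orchestrator might depend on.
type Restarter interface {
	Restart() error
}

// mockRestarter signals on called when Restart runs and returns err.
type mockRestarter struct {
	called chan struct{} // must be buffered with capacity 1
	err    error
}

// Compile-time check: the build fails if *mockRestarter ever stops
// satisfying Restarter.
var _ Restarter = (*mockRestarter)(nil)

func (m *mockRestarter) Restart() error {
	select {
	case m.called <- struct{}{}: // first signal is stored in the buffer
	default: // later calls do not block
	}
	return m.err
}

// waitForCall blocks until the mock was invoked or the timeout elapses,
// so the caller never has to guess a sleep duration.
func waitForCall(called <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-called:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demoSync calls Restart from another goroutine and waits for the signal.
func demoSync() (observed bool, restartErr error) {
	m := &mockRestarter{
		called: make(chan struct{}, 1),
		err:    errors.New("restart failed"),
	}
	errCh := make(chan error, 1)
	go func() { errCh <- m.Restart() }()
	observed = waitForCall(m.called, time.Second)
	return observed, <-errCh
}

func main() {
	observed, err := demoSync()
	fmt.Println(observed, err)
}
```

The timeout in `waitForCall` only bounds the failure case; on the success path the test proceeds as soon as the mock is called.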

Documentation

  • Documented the 200ms grace period in Restart() method explaining why it's needed
  • Added RestartGracePeriod constant with detailed rationale
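
For orientation, a named grace period between stop and start might look like the sketch below. Only the constant name and the 200ms value come from the PR; the `restart` helper, its `stop`/`start` hooks, and the context-aware wait (the PR describes a plain sleep) are assumptions made for this example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// RestartGracePeriod is the 200ms pause named in the PR. The rationale is
// documented on the real constant and is not reproduced here.
const RestartGracePeriod = 200 * time.Millisecond

// restart stops the service, waits out the grace period, then starts it
// again. stop and start are hypothetical hooks.
func restart(ctx context.Context, stop, start func() error) error {
	if err := stop(); err != nil {
		return fmt.Errorf("stop before restart: %w", err)
	}
	timer := time.NewTimer(RestartGracePeriod)
	defer timer.Stop()
	select {
	case <-timer.C: // grace period elapsed
	case <-ctx.Done(): // shutdown requested while waiting
		return ctx.Err()
	}
	if err := start(); err != nil {
		return fmt.Errorf("start after restart: %w", err)
	}
	return nil
}

// demoRestart reports whether start ran at least RestartGracePeriod after stop.
func demoRestart() (gapOK bool, err error) {
	var stopped, started time.Time
	err = restart(context.Background(),
		func() error { stopped = time.Now(); return nil },
		func() error { started = time.Now(); return nil },
	)
	return started.Sub(stopped) >= RestartGracePeriod, err
}

func main() {
	gapOK, err := demoRestart()
	fmt.Println(gapOK, err)
}
```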

Test plan

  • All existing tests pass (make test)
  • All 168 BDD scenarios pass (muster test --parallel 50)
  • New retry tests cover:
    • shouldAttemptRetry with various service states and backoff conditions
    • attemptReconnectFailedServers with eligible/ineligible services
    • MaxConcurrentRetries enforcement
    • Graceful shutdown of retry loop
    • Restart error handling path
  • HTTP 5xx tests updated for complete 500-511 coverage
  • No linting errors

…ging

This upgrade fixes an issue where cross-client audience scopes configured
in the Dex provider were being ignored when clients requested specific
OAuth scopes.

The fix in mcp-oauth v0.2.56 ensures that mandatory scopes like
`audience:server:client_id:dex-k8s-authenticator` are always merged
into client-requested scopes, enabling proper SSO token forwarding
for Kubernetes OIDC authentication.
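
The merge behavior described above can be pictured with a small sketch. `mergeScopes` is a hypothetical helper written for this example and is not the mcp-oauth API; it only illustrates that mandatory scopes end up in the result even when the client did not request them.

```go
package main

import "fmt"

// mergeScopes returns the client-requested scopes followed by any mandatory
// scopes that were not requested, without duplicates or empty entries.
// Hypothetical helper; not part of mcp-oauth.
func mergeScopes(requested, mandatory []string) []string {
	seen := make(map[string]bool, len(requested)+len(mandatory))
	merged := make([]string, 0, len(requested)+len(mandatory))
	for _, list := range [][]string{requested, mandatory} {
		for _, scope := range list {
			if scope == "" || seen[scope] {
				continue
			}
			seen[scope] = true
			merged = append(merged, scope)
		}
	}
	return merged
}

func main() {
	requested := []string{"openid", "email"}
	mandatory := []string{"audience:server:client_id:dex-k8s-authenticator", "openid"}
	fmt.Println(mergeScopes(requested, mandatory))
}
```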

Related: giantswarm/mcp-oauth#203
- Add sync.WaitGroup to track in-flight restart goroutines for clean shutdown
- Extract shouldAttemptRetry helper method for better readability
- Move RetryInterval constant to file-level with other constants
- Consolidate HTTP 5xx patterns and add coverage for 501, 507-509 status codes
- Add context cancellation check before restart attempts
- Expand test coverage for HTTP error detection including mixed case and wrapped errors
…-cross-client-scope-fix

* origin/main:
  Fix: Upgrade mcp-oauth to v0.2.56 for cross-client audience scope merging (#328)
- Add unit tests for orchestrator retry methods (shouldAttemptRetry,
  attemptReconnectFailedServers, retryFailedMCPServers shutdown)
- Add MaxConcurrentRetries (5) to prevent thundering herd on retry
- Add HTTP 505/506 status codes to transient error detection
- Document the grace period sleep in Restart method
@teemow teemow requested a review from a team as a code owner January 31, 2026 08:07
…r handling

- Replace time.Sleep in test with proper channel synchronization
- Add compile-time interface verification for mock types
- Simplify HTTP 5xx pattern matching using loop (DRY)
- Add test for the restart error handling path
- Extract RestartGracePeriod as named constant with documentation
- Update comment to reflect HTTP 5xx range 500-511
@teemow teemow merged commit 2fda4b5 into main Jan 31, 2026
6 of 9 checks passed
@teemow teemow deleted the fix/upgrade-mcp-oauth-cross-client-scope-fix branch January 31, 2026 09:07