Resilience Guide
OmniRoute has three distinct but related resilience mechanisms. Each has a different scope and purpose. Keep them separate when debugging routing behavior.
Source: diagrams/resilience-3layers.mmd
Provider circuit breaker
Scope: entire provider (e.g., glm, openai, anthropic).
Purpose: stop sending traffic to a provider that is repeatedly failing at the upstream/service level.
Implementation:
- Core class: src/shared/utils/circuitBreaker.ts
- Wiring: src/sse/handlers/chatHelpers.ts, src/sse/handlers/chat.ts
- Status API: GET /api/monitoring/health
- Reset API: POST /api/resilience/reset
- Wrappers: open-sse/services/accountFallback.ts
- DB table: domain_circuit_breakers
States:
- CLOSED: normal traffic allowed
- OPEN: provider temporarily blocked; combo routing skips it
- HALF_OPEN: reset timeout elapsed; probe request allowed
Defaults (open-sse/config/constants.ts):
| Class | Threshold | Reset timeout |
|---|---|---|
| OAuth | 3 failures | 60s |
| API-key | 5 failures | 30s |
| Local | 2 failures | 15s |
Trip codes: only provider-level statuses [408, 500, 502, 503, 504]. Do NOT trip for account-level errors (most 401/403/429 — those belong to cooldown or lockout).
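The defaults table and the trip-code rule can be written down as plain data plus one predicate. This is a minimal sketch, not the contents of open-sse/config/constants.ts: the names `BREAKER_DEFAULTS`, `PROVIDER_LEVEL_STATUSES`, and `countsTowardBreaker` are hypothetical, and only the numbers are taken from the text above.

```typescript
// Hypothetical names; the values mirror the defaults table and trip-code list above.
type ProviderClass = "oauth" | "apiKey" | "local";

interface BreakerDefaults {
  failureThreshold: number; // failures before the breaker opens
  resetTimeoutMs: number; // how long the breaker stays OPEN
}

const BREAKER_DEFAULTS: Record<ProviderClass, BreakerDefaults> = {
  oauth: { failureThreshold: 3, resetTimeoutMs: 60000 },
  apiKey: { failureThreshold: 5, resetTimeoutMs: 30000 },
  local: { failureThreshold: 2, resetTimeoutMs: 15000 },
};

// Only provider-level statuses count toward the breaker.
const PROVIDER_LEVEL_STATUSES: ReadonlySet<number> = new Set([408, 500, 502, 503, 504]);

function countsTowardBreaker(status: number): boolean {
  return PROVIDER_LEVEL_STATUSES.has(status);
}
```

With this predicate, a 503 counts toward the provider breaker, while a 401, 403, or 429 is left for the connection cooldown or model lockout layers to handle.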
Lazy recovery: when OPEN expires, getStatus(), canExecute(), getRetryAfterMs() refresh state to HALF_OPEN. No background timer needed.
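Lazy recovery means the OPEN to HALF_OPEN transition happens on read rather than on a timer. The sketch below illustrates the idea under stated assumptions. It is not the code in src/shared/utils/circuitBreaker.ts: the class name, constructor shape, `recordFailure()`, `recordSuccess()`, and the injectable clock are hypothetical. Only the three read methods and the state names come from the text above.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class LazyCircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Every read path calls this first, so callers never observe an expired OPEN state.
  private refresh(): void {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN";
    }
  }

  getStatus(): BreakerState {
    this.refresh();
    return this.state;
  }

  canExecute(): boolean {
    this.refresh();
    return this.state !== "OPEN";
  }

  getRetryAfterMs(): number {
    this.refresh();
    return this.state === "OPEN" ? this.openedAt + this.resetTimeoutMs - this.now() : 0;
  }

  // Hypothetical write path: a failed probe in HALF_OPEN reopens immediately.
  recordFailure(): void {
    this.refresh();
    if (this.state === "OPEN") return; // already blocked; do not extend the window
    this.failures += 1;
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }

  // Hypothetical write path: any success closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }
}
```

Because the transition lives in `refresh()`, any code that reads the private `state` field directly would keep seeing OPEN after the reset window has passed.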
Connection cooldown
Scope: single provider connection/account/key.
Purpose: skip one bad key while other connections for the same provider keep serving.
Implementation:
- Mark unavailable: src/sse/services/auth.ts::markAccountUnavailable()
- Selection: getProviderCredentials* in same file
- Cooldown calc: open-sse/services/accountFallback.ts::checkFallbackError()
- Settings: src/lib/resilience/settings.ts
Fields per connection:
- rateLimitedUntil: timestamp until cooldown expires
- testStatus: "unavailable"
- lastError, lastErrorType, errorCode
- backoffLevel: exponential backoff counter
Default cooldowns:
- OAuth base: 5s
- API-key base: 3s
- API-key 429: prefers upstream Retry-After/reset headers/parseable reset text
- Backoff: baseCooldownMs * 2 ** failureIndex
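The cooldown rules above reduce to a small calculation. The sketch below is illustrative only and is not the logic of checkFallbackError(): the function names are hypothetical, `failureIndex` is assumed to start at 0, and only the standard Retry-After header (seconds or an HTTP date) is parsed. Provider-specific reset headers and reset text are left out.

```typescript
// Base cooldowns from the list above, in milliseconds.
const BASE_COOLDOWN_MS = { oauth: 5000, apiKey: 3000 } as const;

// Backoff formula from the list above: baseCooldownMs * 2 ** failureIndex.
function backoffCooldownMs(baseCooldownMs: number, failureIndex: number): number {
  return baseCooldownMs * 2 ** failureIndex;
}

// Retry-After is either delay-seconds or an HTTP date. Returns undefined when unparseable.
function parseRetryAfterMs(header: string | undefined, nowMs: number): number | undefined {
  if (!header) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs);
}

// For an API-key 429, prefer the upstream hint and fall back to exponential backoff.
function apiKey429CooldownMs(retryAfter: string | undefined, failureIndex: number, nowMs: number): number {
  const upstream = parseRetryAfterMs(retryAfter, nowMs);
  return upstream ?? backoffCooldownMs(BASE_COOLDOWN_MS.apiKey, failureIndex);
}
```

Under these assumptions an API-key connection backs off 3s, 6s, 12s, and so on, while an upstream `Retry-After: 7` yields a 7s cooldown regardless of the failure index. The source does not mention a maximum cooldown, so none is applied here.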
Anti-thundering-herd guard: prevents concurrent failures from over-extending cooldown or double-incrementing backoffLevel.
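One way such a guard could work is to ignore failures that arrive while a cooldown is already active, so that a burst of concurrent errors is counted once. The source does not describe the actual mechanism, so treat this as a sketch of the idea with hypothetical names (`ConnectionCooldown`, `applyFailure`), not as OmniRoute's implementation.

```typescript
interface ConnectionCooldown {
  rateLimitedUntil: number | null; // epoch milliseconds, or null when no cooldown is set
  backoffLevel: number;
}

// Returns true when the failure was recorded, false when it was absorbed by the guard.
function applyFailure(conn: ConnectionCooldown, baseCooldownMs: number, nowMs: number): boolean {
  // A cooldown is already active, so a concurrent request has recorded this outage.
  if (conn.rateLimitedUntil !== null && conn.rateLimitedUntil > nowMs) {
    return false;
  }
  conn.rateLimitedUntil = nowMs + baseCooldownMs * 2 ** conn.backoffLevel;
  conn.backoffLevel += 1;
  return true;
}
```

With a 3s base, ten requests failing within the same second leave the connection with one 3s cooldown and a backoffLevel of 1, not a cooldown of several minutes and a backoffLevel of 10.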
Terminal states (NOT cooldowns):
- banned
- expired
- credits_exhausted
These persist until credentials change or an operator resets them. Do not overwrite terminal states with transient cooldown state.
Lazy recovery: when rateLimitedUntil is past, connection becomes eligible again. On successful use, clearAccountError() clears all error fields.
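The interaction between transient cooldowns, terminal states, and lazy recovery can be sketched as follows. The field names come from the list above, but several details are assumptions: that terminal states are stored in `testStatus`, that a healthy connection has `testStatus` `"active"`, and that clearing errors also resets `backoffLevel`. `isTerminal`, `isEligible`, and `markCooldown` are hypothetical helpers, and this `clearAccountError` is a stand-in for the real function.

```typescript
// Assumption: terminal states live in testStatus. The source does not say where they are stored.
const TERMINAL_STATUSES: ReadonlySet<string> = new Set(["banned", "expired", "credits_exhausted"]);

interface ConnectionState {
  testStatus: string;
  rateLimitedUntil: number | null; // epoch milliseconds
  lastError: string | null;
  lastErrorType: string | null;
  errorCode: number | null;
  backoffLevel: number;
}

function isTerminal(conn: ConnectionState): boolean {
  return TERMINAL_STATUSES.has(conn.testStatus);
}

// Lazy recovery: eligibility is computed from the timestamp at selection time.
function isEligible(conn: ConnectionState, nowMs: number): boolean {
  if (isTerminal(conn)) return false;
  return conn.rateLimitedUntil === null || conn.rateLimitedUntil <= nowMs;
}

// Transient cooldown state never overwrites a terminal state.
function markCooldown(conn: ConnectionState, untilMs: number, error: string): boolean {
  if (isTerminal(conn)) return false;
  conn.testStatus = "unavailable";
  conn.rateLimitedUntil = untilMs;
  conn.lastError = error;
  return true;
}

// Stand-in for the real clearAccountError(): resets every error field after a success.
function clearAccountError(conn: ConnectionState): void {
  conn.testStatus = "active"; // assumed healthy value
  conn.rateLimitedUntil = null;
  conn.lastError = null;
  conn.lastErrorType = null;
  conn.errorCode = null;
  conn.backoffLevel = 0;
}
```

Note that a connection with an expired `rateLimitedUntil` is eligible even while `testStatus` still reads `"unavailable"`; the status is only cleaned up after the next successful use.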
Model lockout
Scope: provider + connection + model triple.
Purpose: avoid disabling a whole connection when only one model is unavailable or quota-limited.
Examples:
- Per-model quota providers returning 429
- Local providers returning 404 for one missing model
- Provider-specific mode/model permission failures (e.g., Grok modes)
Implementation: open-sse/services/accountFallback.ts — lockModel(), clearModelLock(), getAllModelLockouts().
UI: Settings → Model Cooldowns (src/app/(dashboard)/dashboard/settings/components/ModelCooldownsCard.tsx)
Lists active lockouts with: provider, connection, model, reason, expiresAt. Operators can manually re-enable a model from the card.
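A lockout table keyed by the provider + connection + model triple might look like the sketch below. The function names `lockModel`, `clearModelLock`, and `getAllModelLockouts` and the listed fields come from the text above, but their signatures here are guesses. `isModelLocked`, the in-memory `Map`, and lazy expiry on read are assumptions made for illustration.

```typescript
interface ModelLock {
  provider: string;
  connection: string;
  model: string;
  reason: string;
  expiresAt: number; // epoch milliseconds
}

const modelLocks = new Map<string, ModelLock>();

// JSON encoding avoids key collisions when names contain separator characters.
function lockKey(provider: string, connection: string, model: string): string {
  return JSON.stringify([provider, connection, model]);
}

function lockModel(lock: ModelLock): void {
  modelLocks.set(lockKey(lock.provider, lock.connection, lock.model), lock);
}

// Manual re-enable. Returns false when there was nothing to clear.
function clearModelLock(provider: string, connection: string, model: string): boolean {
  return modelLocks.delete(lockKey(provider, connection, model));
}

// Hypothetical read path: expired locks are dropped when they are looked up.
function isModelLocked(provider: string, connection: string, model: string, nowMs: number): boolean {
  const key = lockKey(provider, connection, model);
  const lock = modelLocks.get(key);
  if (!lock) return false;
  if (lock.expiresAt <= nowMs) {
    modelLocks.delete(key);
    return false;
  }
  return true;
}

function getAllModelLockouts(nowMs: number): ModelLock[] {
  const active: ModelLock[] = [];
  modelLocks.forEach((lock, key) => {
    if (lock.expiresAt <= nowMs) modelLocks.delete(key);
    else active.push(lock);
  });
  return active;
}
```

Locking one model leaves every other model on the same connection selectable, which is the point of keeping this layer separate from the connection cooldown.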
REST API:
- GET /api/resilience/model-cooldowns: list active lockouts
- DELETE /api/resilience/model-cooldowns: manual re-enable. Body: {provider, connection, model}. Auth: management.
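A client could assemble the manual re-enable call as sketched below. The path, method, and body fields come from the list above. Everything else is an assumption: the helper name `buildReenableRequest`, the JSON content type, and the way management credentials are passed (the source only says "Auth: management", so the caller supplies whatever headers the deployment expects).

```typescript
interface ModelCooldownRef {
  provider: string;
  connection: string;
  model: string;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the request without sending it, so it can be inspected or passed to fetch().
function buildReenableRequest(
  baseUrl: string,
  ref: ModelCooldownRef,
  authHeaders: Record<string, string>,
): PreparedRequest {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/api/resilience/model-cooldowns",
    init: {
      method: "DELETE",
      headers: { "Content-Type": "application/json", ...authHeaders },
      body: JSON.stringify(ref),
    },
  };
}
```

Usage would be along the lines of `const { url, init } = buildReenableRequest(baseUrl, ref, authHeaders); await fetch(url, init);`, where `baseUrl` and `authHeaders` depend on the installation.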
- 14 routing strategies (priority, weighted, round-robin, context-relay, fill-first, p2c, random, least-used, cost-optimized, reset-aware, strict-random, auto, lkgp, context-optimized) — see AUTO-COMBO.md.
- Reset-aware routing (v3.8.0) — prioritizes connections by quota reset time.
- Background mode degradation: Responses API background: true degraded to sync with warning.
- Dynamic tool limit detection: backs off providers when tool count limits hit.
- All keys for a provider skipped → check both circuit breaker state AND each connection's rateLimitedUntil/testStatus.
- Provider permanently excluded after reset window → code reading raw state instead of getStatus()/canExecute().
- One key fails, others should work → prefer connection cooldown over circuit breaker.
- Only one model fails → prefer model lockout over connection cooldown.
- Only one model fails → prefer model lockout over connection cooldown.
- State should self-recover but doesn't → check for future timestamp + read path that refreshes expired state. Permanent statuses require manual changes.
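The "prefer the narrowest layer" rule in this checklist can be condensed into a small decision function. This is a debugging aid sketched under assumptions, not routing code from the project: `narrowestLayer` and the `modelScoped` flag are hypothetical, and deciding whether a failure is model-specific is left to the caller.

```typescript
type ResilienceLayer = "circuit-breaker" | "connection-cooldown" | "model-lockout";

interface FailureInfo {
  status: number;
  modelScoped: boolean; // hypothetical flag: true when only this model is affected
}

// Same status list as the circuit breaker trip codes.
const PROVIDER_STATUSES: ReadonlySet<number> = new Set([408, 500, 502, 503, 504]);

// Picks the narrowest layer that explains the failure.
function narrowestLayer(failure: FailureInfo): ResilienceLayer {
  if (failure.modelScoped) return "model-lockout";
  if (PROVIDER_STATUSES.has(failure.status)) return "circuit-breaker";
  return "connection-cooldown";
}
```

Checking the model-scoped case first is a design choice of this sketch: it keeps a per-model 429 or 404 from disabling a connection that can still serve other models.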
Provider-specific stealth (JA3/JA4, CCH, obfuscation) is separately documented — see STEALTH_GUIDE.md.
- Architecture Guide — System architecture and internals
- User Guide — Providers, combos, CLI integration
- Auto-Combo Engine — 6-factor scoring, mode packs