-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Monitoring Guide
TL;DR: OmniRoute ships with built-in health monitoring, provider autopilot, quota tracking, and observability hooks. This guide covers the dashboard, alerts, and troubleshooting.
Sources:
-
src/lib/monitoring/observability.ts— observability snapshot -
src/lib/monitoring/comboHealthAutopilot.ts— combo health autopilot -
src/lib/monitoring/providerHealthAutopilot.ts— provider autopilot -
src/lib/monitoring/providerHealthMatrix.ts— provider health matrix -
src/lib/localHealthCheck.ts— local health check -
src/lib/tokenHealthCheck.ts— token refresh health -
src/lib/proxyHealth.ts— proxy health cache (covered in PROXY_GUIDE.md)
OmniRoute has 3 layers of monitoring:
┌──────────────────────────────────────────────────────────────┐
│ Layer 1: System Health (server-level) │
│ ├─ localHealthCheck.ts — DB, ports, native deps │
│ ├─ db/healthCheck.ts — integrity, FK, orphaned artifacts │
│ └─ Dashboard: /dashboard/health │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: Provider Health (per-provider resilience) │
│ ├─ providerHealthAutopilot.ts — circuit breaker, cooldowns │
│ ├─ providerHealthMatrix.ts — health scores by provider/model │
│ └─ Dashboard: /dashboard/providers │
├──────────────────────────────────────────────────────────────┤
│ Layer 3: Live Observability (runtime snapshots) │
│ ├─ observability.ts — circuit breakers, sessions, quota │
│ ├─ tokenHealthCheck.ts — OAuth token refresh health │
│ └─ MCP tools: omniroute_get_health, omniroute_get_session_snapshot │
└──────────────────────────────────────────────────────────────┘
The top-level health dashboard shows:
| Section | What it shows |
|---|---|
| Server status | Uptime, version, port, active connections |
| Database | Connection, integrity, WAL size, recent migrations |
| Provider summary | Active count, healthy count, breaker open count |
| Quota monitors | Active sessions, alerting, exhausted |
| Recent errors | Last 10 errors with stack traces |
| Resource usage | Memory, CPU, heap pressure indicator |
Per-provider dashboard:
| Column | Description |
|---|---|
| Provider | Provider ID + display name |
| Health | Green/yellow/red status |
| Circuit | Open/closed/half-open state |
| Connections | Count of connections, last refresh |
| Models | Available models, health per model |
| Cost | Today's cost, 7-day trend |
| Errors | Last 24h error count, top error class |
Click a provider to see:
- Recent requests with latency breakdown
- Per-connection health scores
- Per-model lockouts
- Autopilot recommendations
For each API key:
- Current usage vs limit (progress bar)
- Quota trend (30-day chart)
- Next reset time
- Alert history
Per-combo:
- Strategy + targets
- Health per target
- Recent fallback events
- Success rate (24h, 7d, 30d)
Note: Only
GET /api/monitoring/healthis exposed as a REST endpoint. All other monitoring data (provider health, autopilot issues, quota monitors, token health, latency) is accessed via the MCP toolobservability_snapshotor the dashboard pages — there are no dedicated REST routes for these.
GET /api/monitoring/healthResponse:
{
"status": "healthy",
"version": "3.8.16",
"uptime": 123456,
"checks": {
"database": { "status": "pass", "latency_ms": 2 },
"writeable": { "status": "pass" },
"integrity": { "status": "pass", "result": "ok" },
"foreign_keys": { "status": "pass", "violations": 0 },
"heap_pressure": { "status": "pass", "usage_mb": 142, "threshold_mb": 512 },
"active_sessions": 12,
"providers": {
"total": 7,
"healthy": 6,
"degraded": 1,
"down": 0
}
}
}No REST endpoint. Provider health data is available via the MCP tool
observability_snapshotor the dashboard/dashboard/providerspage.
No REST endpoint. Per-provider detail is available via the dashboard
/dashboard/providerspage.
The providerHealthAutopilot.ts module is a self-healing system that:
- Detects provider issues (circuit open, cooldowns, lockouts, quota warnings)
- Generates recommended actions to resolve them
- Optionally auto-executes low-risk actions
| Issue kind | Severity | Example condition |
|---|---|---|
provider_circuit_open |
critical | Circuit breaker open after 5 failures |
provider_circuit_half_open |
warning | Circuit testing recovery |
connection_cooldown |
warning | Connection in cooldown after 429 |
stale_connection_error |
warning | Last refresh failed 30+ minutes ago |
terminal_connection_error |
critical | OAuth revoked, key invalid |
inactive_connection |
info | Connection disabled in settings |
model_lockout |
warning | Specific model in quarantine |
quota_monitor_warning |
warning | Quota at 80%+ usage |
| Action | Risk | Description |
|---|---|---|
clear_provider_breaker |
medium | Reset the circuit breaker to closed |
clear_connection_cooldown |
low | Remove cooldown from a connection |
clear_stale_connection_error |
low | Clear stale error flag |
clear_model_lockout |
low | Re-enable a quarantined model |
reactivate_connection |
medium | Re-enable a deactivated connection |
deactivate_connection |
high | Disable a problematic connection |
No REST endpoint. Autopilot issues are available via the MCP tool
observability_snapshotor the dashboard. The autopilot runs internally; its behavior is configured via the settings DB (per-connectionautopilotModefield), not environment variables —grep -rnfor an autopilot-mode env var returns zero hits.
The autopilot operates in manual mode by default — it detects issues and generates recommended actions, but does not auto-apply them. Actions can be applied via the dashboard.
comboHealthAutopilot.ts is the combo-specific equivalent of the provider autopilot. It:
- Detects unhealthy combos
- Recommends target reordering
- Suggests disabling broken targets
- Auto-removes dead targets after N failures
Combo "always-on" (priority strategy)
├─ Target 1: openai/gpt-5 (healthy)
├─ Target 2: anthropic/claude-opus-4-6 (⚠️ model lockout until 14:00)
└─ Target 3: kiro/claude-sonnet-4-5 (healthy)
Recommended action: Reorder — move kiro above anthropic until lockout expires
observability.ts exposes per-session quota monitors for subscription providers (Claude Code, Codex, GitHub Copilot):
interface QuotaMonitorSnapshot {
sessionId: string;
provider: string;
accountId: string;
status: "starting" | "idle" | "healthy" | "warning" | "exhausted" | "error";
lastQuotaPercent: number | null; // 0-100
lastQuotaUsed: number | null;
lastQuotaTotal: number | null;
lastResetAt: string | null;
nextPollAt: string | null;
totalPolls: number;
totalAlerts: number;
consecutiveFailures: number;
}| Status | When | UI action |
|---|---|---|
starting |
Initial poll in progress | Spinner |
idle |
No recent activity | Hidden from dashboard |
healthy |
Quota > 50% remaining | Green dot |
warning |
Quota < 50% remaining | Yellow alert |
exhausted |
Quota = 0% | Red block, route to next provider |
error |
Polling failed | Red dot, retry soon |
No REST endpoint. Quota monitor data is available via the MCP tool
observability_snapshotor the dashboard.
The MCP tool observability_snapshot returns a complete system snapshot for AI agents:
{
"circuitBreakers": [
{
"name": "openai",
"state": "closed",
"failureCount": 0,
"lastFailureTime": null,
"retryAfterMs": null
}
],
"sessions": [
{
"sessionId": "sess-123",
"createdAt": 1234567890,
"lastActive": 1234567999,
"requestCount": 42,
"connectionId": "conn-456",
"ageMs": 109
}
],
"quotaMonitors": { /* see above */ },
"uptime": 12345,
"version": "3.8.16"
}Agents use this to make routing decisions — for example, "if openai's circuit is open, route to anthropic first".
OAuth providers (Claude Code, GitHub Copilot, Cursor) need periodic token refresh. src/lib/tokenHealthCheck.ts runs a background scheduler:
-
Sweep tick: every 60 seconds (sweep in
TICK_MS = 60 * 1000atsrc/lib/tokenHealthCheck.ts:30) -
Per-connection health check interval: default 60 minutes (
DEFAULT_HEALTH_CHECK_INTERVAL_MIN = 60); configurable via the settings DB - Pre-emptive refresh on 401: handled by the per-connection interceptor
interface TokenHealth {
connectionId: string;
provider: string;
status: "valid" | "expiring_soon" | "expired" | "refresh_failed";
expiresAt: string;
lastRefresh: string;
nextRefresh: string;
consecutiveFailures: number;
}Token health check configuration is handled internally by tokenHealthCheck.ts.
No REST endpoint. Token health data is available via the dashboard or the MCP tool
observability_snapshot.
OmniRoute supports 3 alert channels:
| Channel | Setup | Use case |
|---|---|---|
| Dashboard banner | Always on | In-app notifications |
| Webhook | Configure URL | Slack, Discord, PagerDuty |
| Log | Default | For external log aggregation |
Note: Webhook alerting configuration is handled via the dashboard Settings page. See the Settings UI for webhook URL, event filtering, and payload customization.
| Alert | When | Default severity |
|---|---|---|
provider_circuit_open |
Circuit opens | critical |
provider_circuit_half_open |
Circuit testing recovery | info |
quota_warning |
Quota at 80%+ | warning |
quota_exhausted |
Quota at 100% | critical |
token_refresh_failed |
3+ consecutive refresh failures | warning |
token_expired |
Token past expiry | critical |
combo_target_unhealthy |
Combo target in cooldown for 1h+ | warning |
db_integrity_warning |
FK violations > 0 | warning |
heap_pressure |
Heap usage > 80% of threshold | warning |
| Metric | Type | Source |
|---|---|---|
request_count |
counter | services/usage.ts |
request_latency_ms |
histogram | services/usage.ts |
tokens_consumed |
counter | services/usage.ts |
cost_usd |
counter | services/usage.ts |
provider_errors |
counter | services/errorClassifier.ts |
circuit_state_changes |
counter | services/resilience.ts |
cache_hits |
counter | services/signatureCache.ts |
compression_savings |
histogram | services/compression/stats.ts |
quota_used |
gauge | services/quotaMonitor.ts |
memory_used_mb |
gauge | observability.ts |
No REST endpoint. Latency percentile data is available via the dashboard
/dashboard/healthpage. Prometheus/OpenTelemetry export is planned for v3.9.
Planned for v3.9: native export to Prometheus, OpenTelemetry, Datadog.
For now, scrape /api/monitoring/health with any HTTP-based monitoring system (Prometheus blackbox exporter, Datadog HTTP check, etc.).
Note: Webhook alerting is configured through the dashboard Settings page — there are no dedicated webhook env vars (
grep -rnreturns zero hits). See the Settings UI for webhook URL, event filtering, and payload customization.
Webhook alerting uses the same Settings UI flow as Slack. Discord accepts the same JSON payload shape.
Webhook alerting uses the same Settings UI flow. PagerDuty Events API v2 routing keys are configured in the Settings UI.
Any HTTP endpoint that accepts POST with JSON body will work. Configure the URL in the Settings UI.
Create a ~/.omniroute/dashboard.json:
{
"health": {
"sections": [
"server_status",
"database",
"providers",
"quota_monitors",
"recent_errors"
],
"refresh_interval_ms": 5000
}
}{
"health": {
"pinned_providers": ["openai", "anthropic"]
}
}- Check the autopilot issues — maybe a model is locked out
- Look at recent errors for the specific error class
- Try the connection test in the provider card
- Check if the provider is rate-limited at upstream (not visible locally)
- 429 means the provider says you've used your quota
- OmniRoute's quota tracking may be stale — the provider's truth is upstream
- Quota data refreshes automatically via the internal quota monitor
- Check combo health dashboard for target ordering issues
- Look at fallback events — maybe the combo is exhausting too quickly
- Verify the strategy matches your use case (priority vs round-robin vs auto)
- Run
sqlite3 ~/.omniroute/storage.sqlite "PRAGMA integrity_check;" - If "ok" — false alarm, the health check is being too strict
- If anything else — stop OmniRoute and follow the disaster recovery guide
# Check current heap
node -e "console.log(process.memoryUsage())"
# Trigger manual GC (if --expose-gc)
node --expose-gc -e "global.gc(); console.log(process.memoryUsage())"
# Reduce concurrent requests (set via the dashboard Settings page, not an env var)
# There is no `MAX_CONCURRENT_REQUESTS` env var — configure it in Settings → Concurrency.- USAGE_QUOTA_GUIDE.md — usage & cost tracking
- DATABASE_GUIDE.md — DB schema + health
- PROXY_GUIDE.md — proxy health (separate cache)
- ARCHITECTURE.md — system architecture
- RESILIENCE_GUIDE.md — circuit breaker details
- Source:
src/lib/monitoring/(4 files, 2121 LOC)
OmniRoute · Website · npm · Docker Hub
- Setup Guide
- User Guide
- Features
- Quick Start (Docker)
- Electron Desktop App
- Termux (Android)
- PWA Guide
- MCP Server
- A2A Server
- Agent Protocols
- OpenCode Plugin
- Webhooks
- Cloud Agents
- Skills
- Memory
- Evals
- Gamification
- Guardrails
- Compliance
- Error Sanitization
- Public Credentials
- Route Guard Tiers
- Stealth Guide
- CLI Token Auth