Monitoring Guide

🌍 View in other languages

Monitoring & Observability Guide

TL;DR: OmniRoute ships with built-in health monitoring, provider autopilot, quota tracking, and observability hooks. This guide covers the dashboard, alerts, and troubleshooting.

Sources:

src/lib/monitoring/observability.ts — observability snapshot
src/lib/monitoring/comboHealthAutopilot.ts — combo health autopilot
src/lib/monitoring/providerHealthAutopilot.ts — provider autopilot
src/lib/monitoring/providerHealthMatrix.ts — provider health matrix
src/lib/localHealthCheck.ts — local health check
src/lib/tokenHealthCheck.ts — token refresh health
src/lib/proxyHealth.ts — proxy health cache (covered in PROXY_GUIDE.md)

Overview

OmniRoute has 3 layers of monitoring:

┌──────────────────────────────────────────────────────────────┐
│  Layer 1: System Health (server-level)                        │
│  ├─ localHealthCheck.ts — DB, ports, native deps              │
│  ├─ db/healthCheck.ts — integrity, FK, orphaned artifacts     │
│  └─ Dashboard: /dashboard/health                              │
├──────────────────────────────────────────────────────────────┤
│  Layer 2: Provider Health (per-provider resilience)            │
│  ├─ providerHealthAutopilot.ts — circuit breaker, cooldowns   │
│  ├─ providerHealthMatrix.ts — health scores by provider/model │
│  └─ Dashboard: /dashboard/providers                           │
├──────────────────────────────────────────────────────────────┤
│  Layer 3: Live Observability (runtime snapshots)               │
│  ├─ observability.ts — circuit breakers, sessions, quota       │
│  ├─ tokenHealthCheck.ts — OAuth token refresh health          │
│  └─ MCP tools: omniroute_get_health, omniroute_get_session_snapshot │
└──────────────────────────────────────────────────────────────┘

Dashboard Pages

`/dashboard/health` (System Health)

The top-level health dashboard shows:

Section	What it shows
Server status	Uptime, version, port, active connections
Database	Connection, integrity, WAL size, recent migrations
Provider summary	Active count, healthy count, breaker open count
Quota monitors	Active sessions, alerting, exhausted
Recent errors	Last 10 errors with stack traces
Resource usage	Memory, CPU, heap pressure indicator

`/dashboard/providers` (Provider Health)

Per-provider dashboard:

Column	Description
Provider	Provider ID + display name
Health	Green/yellow/red status
Circuit	Open/closed/half-open state
Connections	Count of connections, last refresh
Models	Available models, health per model
Cost	Today's cost, 7-day trend
Errors	Last 24h error count, top error class

Click a provider to see:

Recent requests with latency breakdown
Per-connection health scores
Per-model lockouts
Autopilot recommendations

`/dashboard/quota` (Quota Tracking)

For each API key:

Current usage vs limit (progress bar)
Quota trend (30-day chart)
Next reset time
Alert history

`/dashboard/combos` (Combo Health)

Per-combo:

Strategy + targets
Health per target
Recent fallback events
Success rate (24h, 7d, 30d)

Health Check API

Note: Only GET /api/monitoring/health is exposed as a REST endpoint. All other monitoring data (provider health, autopilot issues, quota monitors, token health, latency) is accessed via the MCP tool observability_snapshot or the dashboard pages — there are no dedicated REST routes for these.

System Health

GET /api/monitoring/health

Response:

{
  "status": "healthy",
  "version": "3.8.16",
  "uptime": 123456,
  "checks": {
    "database": { "status": "pass", "latency_ms": 2 },
    "writeable": { "status": "pass" },
    "integrity": { "status": "pass", "result": "ok" },
    "foreign_keys": { "status": "pass", "violations": 0 },
    "heap_pressure": { "status": "pass", "usage_mb": 142, "threshold_mb": 512 },
    "active_sessions": 12,
    "providers": {
      "total": 7,
      "healthy": 6,
      "degraded": 1,
      "down": 0
    }
  }
}

Provider Health

No REST endpoint. Provider health data is available via the MCP tool observability_snapshot or the dashboard /dashboard/providers page.

Provider Detail

No REST endpoint. Per-provider detail is available via the dashboard /dashboard/providers page.

Provider Health Autopilot

The providerHealthAutopilot.ts module is a self-healing system that:

Detects provider issues (circuit open, cooldowns, lockouts, quota warnings)
Generates recommended actions to resolve them
Optionally auto-executes low-risk actions

Issue Types Detected

Issue kind	Severity	Example condition
`provider_circuit_open`	critical	Circuit breaker open after 5 failures
`provider_circuit_half_open`	warning	Circuit testing recovery
`connection_cooldown`	warning	Connection in cooldown after 429
`stale_connection_error`	warning	Last refresh failed 30+ minutes ago
`terminal_connection_error`	critical	OAuth revoked, key invalid
`inactive_connection`	info	Connection disabled in settings
`model_lockout`	warning	Specific model in quarantine
`quota_monitor_warning`	warning	Quota at 80%+ usage

Action Types Generated

Action	Risk	Description
`clear_provider_breaker`	medium	Reset the circuit breaker to closed
`clear_connection_cooldown`	low	Remove cooldown from a connection
`clear_stale_connection_error`	low	Clear stale error flag
`clear_model_lockout`	low	Re-enable a quarantined model
`reactivate_connection`	medium	Re-enable a deactivated connection
`deactivate_connection`	high	Disable a problematic connection

API

No REST endpoint. Autopilot issues are available via the MCP tool observability_snapshot or the dashboard. The autopilot runs internally; its behavior is configured via the settings DB (per-connection autopilotMode field), not environment variables — grep -rn for an autopilot-mode env var returns zero hits.

Autopilot Mode

The autopilot operates in manual mode by default — it detects issues and generates recommended actions, but does not auto-apply them. Actions can be applied via the dashboard.

Combo Health Autopilot

comboHealthAutopilot.ts is the combo-specific equivalent of the provider autopilot. It:

Detects unhealthy combos
Recommends target reordering
Suggests disabling broken targets
Auto-removes dead targets after N failures

Combo Issue Examples

Combo "always-on" (priority strategy)
├─ Target 1: openai/gpt-5 (healthy)
├─ Target 2: anthropic/claude-opus-4-6 (⚠️ model lockout until 14:00)
└─ Target 3: kiro/claude-sonnet-4-5 (healthy)

Recommended action: Reorder — move kiro above anthropic until lockout expires

Quota Monitors

observability.ts exposes per-session quota monitors for subscription providers (Claude Code, Codex, GitHub Copilot):

interface QuotaMonitorSnapshot {
  sessionId: string;
  provider: string;
  accountId: string;
  status: "starting" | "idle" | "healthy" | "warning" | "exhausted" | "error";
  lastQuotaPercent: number | null;  // 0-100
  lastQuotaUsed: number | null;
  lastQuotaTotal: number | null;
  lastResetAt: string | null;
  nextPollAt: string | null;
  totalPolls: number;
  totalAlerts: number;
  consecutiveFailures: number;
}

Status Meanings

Status	When	UI action
`starting`	Initial poll in progress	Spinner
`idle`	No recent activity	Hidden from dashboard
`healthy`	Quota > 50% remaining	Green dot
`warning`	Quota < 50% remaining	Yellow alert
`exhausted`	Quota = 0%	Red block, route to next provider
`error`	Polling failed	Red dot, retry soon

API

No REST endpoint. Quota monitor data is available via the MCP tool observability_snapshot or the dashboard.

Observability Snapshot

The MCP tool observability_snapshot returns a complete system snapshot for AI agents:

{
  "circuitBreakers": [
    {
      "name": "openai",
      "state": "closed",
      "failureCount": 0,
      "lastFailureTime": null,
      "retryAfterMs": null
    }
  ],
  "sessions": [
    {
      "sessionId": "sess-123",
      "createdAt": 1234567890,
      "lastActive": 1234567999,
      "requestCount": 42,
      "connectionId": "conn-456",
      "ageMs": 109
    }
  ],
  "quotaMonitors": { /* see above */ },
  "uptime": 12345,
  "version": "3.8.16"
}

Agents use this to make routing decisions — for example, "if openai's circuit is open, route to anthropic first".

Token Health Check

OAuth providers (Claude Code, GitHub Copilot, Cursor) need periodic token refresh. src/lib/tokenHealthCheck.ts runs a background scheduler:

Sweep tick: every 60 seconds (sweep in TICK_MS = 60 * 1000 at src/lib/tokenHealthCheck.ts:30)
Per-connection health check interval: default 60 minutes (DEFAULT_HEALTH_CHECK_INTERVAL_MIN = 60); configurable via the settings DB
Pre-emptive refresh on 401: handled by the per-connection interceptor

Token Health Status

interface TokenHealth {
  connectionId: string;
  provider: string;
  status: "valid" | "expiring_soon" | "expired" | "refresh_failed";
  expiresAt: string;
  lastRefresh: string;
  nextRefresh: string;
  consecutiveFailures: number;
}

Configuration

Token health check configuration is handled internally by tokenHealthCheck.ts.

Token Health

No REST endpoint. Token health data is available via the dashboard or the MCP tool observability_snapshot.

Alerting

Built-in Channels

OmniRoute supports 3 alert channels:

Channel	Setup	Use case
Dashboard banner	Always on	In-app notifications
Webhook	Configure URL	Slack, Discord, PagerDuty
Log	Default	For external log aggregation

Webhook Configuration

Note: Webhook alerting configuration is handled via the dashboard Settings page. See the Settings UI for webhook URL, event filtering, and payload customization.

Alert Types

Alert	When	Default severity
`provider_circuit_open`	Circuit opens	critical
`provider_circuit_half_open`	Circuit testing recovery	info
`quota_warning`	Quota at 80%+	warning
`quota_exhausted`	Quota at 100%	critical
`token_refresh_failed`	3+ consecutive refresh failures	warning
`token_expired`	Token past expiry	critical
`combo_target_unhealthy`	Combo target in cooldown for 1h+	warning
`db_integrity_warning`	FK violations > 0	warning
`heap_pressure`	Heap usage > 80% of threshold	warning

Performance Metrics

Tracked Metrics

Metric	Type	Source
`request_count`	counter	`services/usage.ts`
`request_latency_ms`	histogram	`services/usage.ts`
`tokens_consumed`	counter	`services/usage.ts`
`cost_usd`	counter	`services/usage.ts`
`provider_errors`	counter	`services/errorClassifier.ts`
`circuit_state_changes`	counter	`services/resilience.ts`
`cache_hits`	counter	`services/signatureCache.ts`
`compression_savings`	histogram	`services/compression/stats.ts`
`quota_used`	gauge	`services/quotaMonitor.ts`
`memory_used_mb`	gauge	`observability.ts`

Latency Percentiles (p50/p95/p99)

No REST endpoint. Latency percentile data is available via the dashboard /dashboard/health page. Prometheus/OpenTelemetry export is planned for v3.9.

Prometheus / OpenTelemetry Export (Phase 2)

Planned for v3.9: native export to Prometheus, OpenTelemetry, Datadog.

For now, scrape /api/monitoring/health with any HTTP-based monitoring system (Prometheus blackbox exporter, Datadog HTTP check, etc.).

Alerting Recipes

Slack

Note: Webhook alerting is configured through the dashboard Settings page — there are no dedicated webhook env vars (grep -rn returns zero hits). See the Settings UI for webhook URL, event filtering, and payload customization.

Discord

Webhook alerting uses the same Settings UI flow as Slack. Discord accepts the same JSON payload shape.

PagerDuty

Webhook alerting uses the same Settings UI flow. PagerDuty Events API v2 routing keys are configured in the Settings UI.

Custom Webhook (JSON)

Any HTTP endpoint that accepts POST with JSON body will work. Configure the URL in the Settings UI.

Dashboard Configuration

Customize the Health Dashboard

Create a ~/.omniroute/dashboard.json:

{
  "health": {
    "sections": [
      "server_status",
      "database",
      "providers",
      "quota_monitors",
      "recent_errors"
    ],
    "refresh_interval_ms": 5000
  }
}

Pin a Provider to the Top

{
  "health": {
    "pinned_providers": ["openai", "anthropic"]
  }
}

Troubleshooting

"Provider says healthy but requests fail"

Check the autopilot issues — maybe a model is locked out
Look at recent errors for the specific error class
Try the connection test in the provider card
Check if the provider is rate-limited at upstream (not visible locally)

"Quota says healthy but I see 429s"

429 means the provider says you've used your quota
OmniRoute's quota tracking may be stale — the provider's truth is upstream
Quota data refreshes automatically via the internal quota monitor

"Combo is failing but all targets look healthy"

Check combo health dashboard for target ordering issues
Look at fallback events — maybe the combo is exhausting too quickly
Verify the strategy matches your use case (priority vs round-robin vs auto)

"Database health check is failing"

Run sqlite3 ~/.omniroute/storage.sqlite "PRAGMA integrity_check;"
If "ok" — false alarm, the health check is being too strict
If anything else — stop OmniRoute and follow the disaster recovery guide

"Memory heap pressure is critical"

# Check current heap
node -e "console.log(process.memoryUsage())"

# Trigger manual GC (if --expose-gc)
node --expose-gc -e "global.gc(); console.log(process.memoryUsage())"

# Reduce concurrent requests (set via the dashboard Settings page, not an env var)
# There is no `MAX_CONCURRENT_REQUESTS` env var — configure it in Settings → Concurrency.

Uh oh!

Monitoring Guide

Monitoring & Observability Guide

Overview

Dashboard Pages

/dashboard/health (System Health)

/dashboard/providers (Provider Health)

/dashboard/quota (Quota Tracking)

/dashboard/combos (Combo Health)

Health Check API

System Health

Provider Health

Provider Detail

Provider Health Autopilot

Issue Types Detected

Action Types Generated

API

Autopilot Mode

Combo Health Autopilot

Combo Issue Examples

Quota Monitors

Status Meanings

API

Observability Snapshot

Token Health Check

Token Health Status

Configuration

Token Health

Alerting

Built-in Channels

Webhook Configuration

Alert Types

Performance Metrics

Tracked Metrics

Latency Percentiles (p50/p95/p99)

Prometheus / OpenTelemetry Export (Phase 2)

Alerting Recipes

Slack

Discord

PagerDuty

Custom Webhook (JSON)

Dashboard Configuration

Customize the Health Dashboard

Pin a Provider to the Top

Troubleshooting

"Provider says healthy but requests fail"

"Quota says healthy but I see 429s"

"Combo is failing but all targets look healthy"

"Database health check is failing"

"Memory heap pressure is critical"

See Also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

`/dashboard/health` (System Health)

`/dashboard/providers` (Provider Health)

`/dashboard/quota` (Quota Tracking)

`/dashboard/combos` (Combo Health)