Incident Response
Status: Authoritative. The 71-pillar audit (L61) references this doc
for the Obs > 2.00 gate.
Owner: observability-circle (lead: security-circle lead).
SLOs: see docs/PERF_BUDGETS.md § 1 (top-level SLOs) and
ops/slos.yaml (machine-readable form, generated by the Bifrost team).
Disclosure policy: see SECURITY.md (vulnerability disclosure only,
separate flow).
This runbook is the operational playbook for non-security incidents:
outages, latency regressions, error-budget burn, and provider-side
failures. Vulnerability disclosure stays on SECURITY.md; do not route
those through this runbook.
| Sev | Definition | Examples | Page on | Resolve by |
|---|---|---|---|---|
| SEV-1 | User-visible outage; > 50 % of requests failing or > 2x SLO breach for 5 min. | Cluster down; auth layer broken; 5xx flood. | On-call P0 (immediate) | 4 h |
| SEV-2 | Significant degradation; 1.5–2x SLO breach for 15 min, or single-tenant impact. | Single provider down; p95 > 1.5x budget; rate-limit runaway. | On-call P1 (15 min) | 24 h |
| SEV-3 | Latent bug or near-miss; no current user impact but error budget at risk. | Memory leak trending up; circuit breaker tripping on one provider. | Slack #omniroute-ops (next standup) | 7 d |
| SEV-4 | Cosmetic / informational. | Log line noise; non-binding UI glitch. | Next weekly review | Next refactor cycle |
Burn-rate escalation (per docs/PERF_BUDGETS.md § 1): 6x for 5 min
is SEV-1; 2x for 1 h is SEV-2; sustained < 1x for 7 d demotes to SEV-3.
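The burn-rate thresholds above can be sketched as a small classifier. This is an illustration only, not the alerting rule that is actually deployed: the name `classify_burn_rate` and its inputs are hypothetical, and it assumes "6x for 5 min" and "2x for 1 h" mean a minimum burn rate held for at least that window. The authoritative values remain in docs/PERF_BUDGETS.md § 1 and ops/slos.yaml.

```python
from datetime import timedelta
from typing import Optional

# (minimum burn rate, minimum sustained window, resulting severity),
# checked from most to least severe. Values mirror the prose above.
BURN_RATE_RULES = [
    (6.0, timedelta(minutes=5), "SEV-1"),
    (2.0, timedelta(hours=1), "SEV-2"),
]


def classify_burn_rate(burn_rate: float, sustained_for: timedelta) -> Optional[str]:
    """Return the severity a burn maps to, or None if no rule fires."""
    for min_rate, min_window, severity in BURN_RATE_RULES:
        if burn_rate >= min_rate and sustained_for >= min_window:
            return severity
    # Demotion rule: a burn that stays below 1x for 7 days drops to SEV-3.
    if burn_rate < 1.0 and sustained_for >= timedelta(days=7):
        return "SEV-3"
    return None
```

Checking the rules from most to least severe means a 6x burn that has lasted an hour is reported as SEV-1 rather than SEV-2.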
| Source | Signal | Routing |
|---|---|---|
| Prometheus (/metrics) | Counter deltas (5xx, latency) | Alertmanager → PagerDuty |
| Grafana SLO dashboards | SLO burn-rate panels | Slack #omniroute-ops |
| Uptime probe (/api/health/ping) | 3 consecutive failures from 3 regions | Alertmanager → PagerDuty |
| Dependabot | New CVE in dependency | GitHub issue + Slack #security |
| User report (support@) | Manual triage | Slack #omniroute-triage |
| Error budget burn alert | slo_burn_rate > threshold | Alertmanager |
Prometheus and Alertmanager are configured in the deploy repo (see
docs/operations/DEPLOY.md once published; currently inline in
docker-compose.prod.yml).
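As one illustration of the routing table above, the uptime-probe rule ("3 consecutive failures from 3 regions") could be evaluated roughly as follows. The function name, the per-region result lists, and the reading that each of three regions must show three failures in a row are assumptions; the Alertmanager rule in the deploy repo may be written differently.

```python
from typing import Dict, List

CONSECUTIVE_FAILURES = 3  # failures in a row required per region
REGIONS_REQUIRED = 3      # regions that must agree before paging


def should_page(probe_results: Dict[str, List[bool]]) -> bool:
    """probe_results maps region -> probe outcomes, oldest first (True = healthy)."""
    failing_regions = 0
    for outcomes in probe_results.values():
        recent = outcomes[-CONSECUTIVE_FAILURES:]
        # A region counts only if it has enough samples and all of them failed.
        if len(recent) == CONSECUTIVE_FAILURES and not any(recent):
            failing_regions += 1
    return failing_regions >= REGIONS_REQUIRED
```

Requiring agreement across regions keeps a single flaky vantage point from paging the on-call.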
When paged, the on-call engineer runs this checklist verbatim. Do not skip steps; each is timed.
- 0:00 - Acknowledge the page in PagerDuty. Stops the escalation timer and notifies the secondary.
- 0:02 - Open the SLO dashboard and the incident channel (`#inc-YYYY-MM-DD-slug`). Post a single-line ack with the alert name and the time.
- 0:05 - Classify severity per § 1. If SEV-1 or SEV-2, declare the incident in the channel and tag `@incident-commander`.
- 0:08 - Capture the alert payload, the most recent deploy SHA, and the top 5 slow / erroring endpoints. Post to the channel.
- 0:12 - Decide: mitigate first, root-cause later. Choose one of:
  - Roll back to the last green deploy (`bin/rollback.sh vX.Y.Z`).
  - Failover to the healthy replicas (Caddy LB removes the bad replica automatically; verify with `curl /api/health/ping`).
  - Disable a broken provider connection via `PUT /api/providers/{id}` with `{ "isActive": false }` (one-line toggle; safe by default).
- 0:15 - Post the chosen mitigation in the channel. If the page is still firing after 5 more minutes, escalate to the secondary.
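Because each checklist step is timed, the offsets can be turned into wall-clock deadlines once the page time is known. The sketch below is a hypothetical helper, not an existing tool in this repo; the step summaries are abbreviated from the checklist above.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (minutes after the page, step summary), taken from the checklist above.
CHECKLIST = [
    (0, "Acknowledge the page in PagerDuty"),
    (2, "Open the SLO dashboard and the incident channel; post an ack"),
    (5, "Classify severity; declare SEV-1/SEV-2 and tag the incident commander"),
    (8, "Capture alert payload, latest deploy SHA, top 5 slow/erroring endpoints"),
    (12, "Decide on a mitigation: roll back, fail over, or disable a provider"),
    (15, "Post the chosen mitigation in the channel"),
]


def checklist_deadlines(paged_at: datetime) -> List[Tuple[datetime, str]]:
    """Absolute deadline for every step, in checklist order."""
    return [(paged_at + timedelta(minutes=m), step) for m, step in CHECKLIST]


def overdue_steps(paged_at: datetime, now: datetime, done: int) -> List[str]:
    """Steps past their deadline; `done` is the number of steps already completed."""
    return [step for due, step in checklist_deadlines(paged_at)[done:] if due <= now]
```

For example, six minutes into a page with two steps done, `overdue_steps` returns only the severity-classification step.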
- Send `PUT /api/providers/{id}` with `{ "isActive": false }`. This toggles the connection off in the registry; all routes re-resolve on next request.
- Verify p95 returns to budget within 5 min.
- If all providers for a model are down, disable the model (see `src/lib/a2a/skills/providerDiscovery.ts` for the disable path).
- Update the status page with a banner if the outage exceeds 15 min.
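The provider toggle in the first step can be sketched with the Python standard library. The path and the `isActive` payload come from the step above; the function name and the example provider id are hypothetical, and authentication is left out because this runbook does not describe how the endpoint is secured.

```python
import json
import urllib.request


def build_disable_request(base_url: str, provider_id: str) -> urllib.request.Request:
    """Build (but do not send) the PUT that switches a provider connection off."""
    body = json.dumps({"isActive": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/providers/{provider_id}",
        data=body,
        method="PUT",
        # Assumption: auth headers are omitted; add whatever the API requires.
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # "example-provider-id" is a placeholder, not a real connection id.
    req = build_disable_request("https://api.omniroute.dev", "example-provider-id")
    print(req.get_method(), req.full_url, req.data.decode())
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it lets the on-call paste the exact method, URL, and body into the incident channel before acting.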
- Check the most recent deploy (`/api/system/version` returns the running version).
- If p95 doubled vs the 7-day baseline, roll back to the prior SHA via `bin/rollback.sh`.
- If the regression is provider-side, see § 4.1.
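The "p95 doubled vs the 7-day baseline" test above can be made concrete as follows. This is a minimal sketch: it assumes raw latency samples are available for both windows and uses a nearest-rank percentile, whereas the production dashboards most likely compute p95 from Prometheus histograms. Both function names are hypothetical.

```python
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def should_roll_back(current_ms: Sequence[float], baseline_ms: Sequence[float]) -> bool:
    """True when the current p95 is at least double the baseline p95."""
    return p95(current_ms) >= 2 * p95(baseline_ms)
```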
- Check the authz-inventory endpoint: `curl https://api.omniroute.dev/api/settings/authz-inventory | jq`.
- If `policies_active` is empty, restore from the last good backup (`bin/restore-policies.sh <sha>`).
- Roll back if the cause is unclear.
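The `policies_active` check above can be scripted once the endpoint's JSON is in hand. The response shape is an assumption here: the sketch treats the body as a JSON object with a top-level `policies_active` field, which the runbook does not confirm. The function names are hypothetical.

```python
import json
from typing import Any, Mapping


def policies_missing(inventory: Mapping[str, Any]) -> bool:
    """True when `policies_active` is absent or empty in the inventory payload."""
    return not inventory.get("policies_active")


def next_action(raw_response: str) -> str:
    """Map the raw authz-inventory response to the runbook's next step."""
    inventory = json.loads(raw_response)
    if policies_missing(inventory):
        return "restore: bin/restore-policies.sh <sha>"
    return "policies present: roll back if the cause is still unclear"
```

Treating a missing key the same as an empty list errs toward restoring, which matches the runbook's bias toward mitigation first.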
- Stop the cluster (`docker compose -f docker-compose.prod.yml stop`); preventing further writes is more important than uptime.
- Snapshot the data volume (`bin/snapshot-data.sh`).
- Open a SEV-1; this is data-loss territory. Page the data-team.
- Restore from the last verified backup (see `docs/BACKUP.md` once published; currently the runbook is `bin/restore-data.sh <sha>`).
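The ordering of the commands above matters (stop, then snapshot, then restore), so a dry-run wrapper can help avoid running them out of sequence. The sketch below is hypothetical and not part of the repo; it only uses the commands named in the steps, defaults to printing rather than executing, and leaves the human steps (opening the SEV-1 and paging the data team) to the operator.

```python
import shlex
import subprocess
from typing import List


def data_loss_plan(backup_sha: str) -> List[List[str]]:
    """Commands from the steps above, in the order they should run."""
    return [
        ["docker", "compose", "-f", "docker-compose.prod.yml", "stop"],
        ["bin/snapshot-data.sh"],
        ["bin/restore-data.sh", backup_sha],
    ]


def run_plan(plan: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute only when dry_run is False."""
    rendered = []
    for cmd in plan:
        line = shlex.join(cmd)
        rendered.append(line)
        print(("DRY-RUN " if dry_run else "RUN ") + line)
        if not dry_run:
            # check=True aborts the plan on the first failing command.
            subprocess.run(cmd, check=True)
    return rendered
```

Calling `run_plan(data_loss_plan("<sha>"))` prints the three commands without touching the cluster; pass `dry_run=False` only after the SEV-1 is open.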
Stop. This is the SECURITY.md path, not this runbook. Page the
security on-call (@security-team); do not post details to
#omniroute-ops.
| Audience | Channel | Cadence | Owner |
|---|---|---|---|
| Engineering | #inc-YYYY-MM-DD-slug | Real-time | Incident commander |
| Status page | status.phenotype.dev | Every 30 min during SEV-1/2 | On-call |
| Customers (email) | announce@phenotype.dev | At SEV-1 start + resolution | Comms lead |
| Upstream providers | Direct contact | At SEV-1 start | Vendor mgmt |
| Postmortem | docs/postmortem/YYYY-MM-DD-slug.md | Within 5 business days | Incident commander |
Postmortem template is at docs/postmortem/TEMPLATE.md (forthcoming;
see ADR-024 for the cadence and ADR-029 for the postmortem convention).
| Role | Primary | Secondary | Rotation |
|---|---|---|---|
| Engineering on-call | security-circle lead | @open-sse | Weekly, Mon 09:00 PDT |
| Security on-call | @security-team | — | Weekly |
| Data on-call | @db-team | — | Weekly |
| Comms lead | @comms | — | As needed |
Handoff: every Monday 09:00 PDT, the outgoing on-call posts a
written handoff to the incoming in #omniroute-ops-handoff covering:
open SEV-3/4 items, scheduled maintenance windows, and any
in-flight mitigations.
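The Monday 09:00 PDT handoff time can be computed as below, for instance to schedule a reminder. The helper is hypothetical, and it assumes a fixed UTC-7 offset for PDT; it does not account for the switch to PST outside daylight-saving time.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")  # fixed offset; ignores the PST switch


def next_handoff(now: datetime) -> datetime:
    """Next Monday 09:00 PDT strictly after `now` (which must be timezone-aware)."""
    local = now.astimezone(PDT)
    candidate = local.replace(hour=9, minute=0, second=0, microsecond=0)
    candidate -= timedelta(days=local.weekday())  # back to this week's Monday
    if candidate <= local:
        candidate += timedelta(days=7)
    return candidate
```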
- Blameless. People did the best they could with the information they had. Focus on systems, signals, and decision points.
- Within 5 business days of resolution. File via `gh issue create --label postmortem --label SEV-1` (or `--label SEV-2`).
- Action items must be assigned, dated, and tracked in `docs/TECH_DEBT.md` (P0 < 30 d, P1 < 90 d per that doc's SLA).
- Mandatory attendees: incident commander, on-call, any engineer who touched the mitigation, and one person who was not involved (fresh-eyes review).
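The "within 5 business days of resolution" deadline above can be computed as follows. This is a sketch under stated assumptions: business days are taken to be Monday through Friday, public holidays are ignored, and the resolution day itself does not count. The function name is hypothetical.

```python
from datetime import date, timedelta


def postmortem_due(resolved_on: date, business_days: int = 5) -> date:
    """Date that falls `business_days` working days after resolution (Mon-Fri only)."""
    due = resolved_on
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return due
```

An incident resolved on Thursday 2026-06-18 would have its postmortem due on Thursday 2026-06-25.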
| Date | Reviewer | Change |
|---|---|---|
| 2026-06-18 | security-circle lead | Initial runbook; severity ladder + 15-min checklist + 4.1–4.5 mitigation runbooks. Closes 71-pillar audit L61 (1/3 → 2/3). |
| 2026-07-18 (planned) | observability-circle | Wire on-call rotation into PagerDuty schedule; add the postmortem template. |