Incident Response

github-actions[bot] edited this page Jun 25, 2026 · 1 revision

Incident Response Runbook — OmniRoute (2026-06-18)

Status: Authoritative. The 71-pillar audit (L61) references this doc for the Obs > 2.00 gate.
Owner: observability-circle (lead: security-circle lead).
SLOs: see docs/PERF_BUDGETS.md § 1 (top-level SLOs) and ops/slos.yaml (machine-readable form, generated by the Bifrost team).
Disclosure policy: see SECURITY.md (vulnerability disclosure only, separate flow).

This runbook is the operational playbook for non-security incidents: outages, latency regressions, error-budget burn, and provider-side failures. Vulnerability disclosure stays on SECURITY.md; do not route those through this runbook.


1. Severity ladder

| Sev | Definition | Examples | Page on | Resolve by |
| --- | --- | --- | --- | --- |
| SEV-1 | User-visible outage; > 50 % of requests failing or > 2x SLO breach for 5 min. | Cluster down; auth layer broken; 5xx flood. | On-call P0 (immediate) | 4 h |
| SEV-2 | Significant degradation; 1.5–2x SLO breach for 15 min, or single-tenant impact. | Single provider down; p95 > 1.5x budget; rate-limit runaway. | On-call P1 (15 min) | 24 h |
| SEV-3 | Latent bug or near-miss; no current user impact but error budget at risk. | Memory leak trending up; circuit breaker tripping on one provider. | Slack #omniroute-ops (next standup) | 7 d |
| SEV-4 | Cosmetic / informational. | Log line noise; non-binding UI glitch. | Next weekly review | Next refactor cycle |

Burn-rate escalation (per docs/PERF_BUDGETS.md § 1): 6x for 5 min is SEV-1; 2x for 1 h is SEV-2; sustained < 1x for 7 d demotes to SEV-3.
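The burn-rate rule above can be sketched as a small classifier. This is an illustration only: the function names are hypothetical, and treating the thresholds as inclusive ("at least 6x for at least 5 min") is an assumption, since docs/PERF_BUDGETS.md is the authoritative definition.

```python
from typing import Optional


def severity_from_burn_rate(burn_rate: float, sustained_minutes: float) -> Optional[str]:
    """Map an error-budget burn rate to a severity.

    Assumption: thresholds are inclusive. Returns None when neither
    paging threshold is met.
    """
    if burn_rate >= 6 and sustained_minutes >= 5:
        return "SEV-1"
    if burn_rate >= 2 and sustained_minutes >= 60:
        return "SEV-2"
    return None


def demotes_to_sev3(max_burn_rate: float, sustained_days: float) -> bool:
    """True when the burn rate stayed below 1x for at least 7 days."""
    return max_burn_rate < 1 and sustained_days >= 7
```

A 10x burn that has lasted only 2 minutes returns None here: it has not yet met the 5-minute window, so it does not page on the burn-rate rule alone.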


2. Detection sources

| Source | Signal | Routing |
| --- | --- | --- |
| Prometheus (/metrics) | Counter deltas (5xx, latency) | Alertmanager → PagerDuty |
| Grafana SLO dashboards | SLO burn-rate panels | Slack #omniroute-ops |
| Uptime probe (/api/health/ping) | 3 consecutive failures from 3 regions | Alertmanager → PagerDuty |
| Dependabot | New CVE in dependency | GitHub issue + Slack #security |
| User report (support@) | Manual triage | Slack #omniroute-triage |
| Error budget burn alert | slo_burn_rate > threshold | Alertmanager |

Prometheus and Alertmanager are configured in the deploy repo (see docs/operations/DEPLOY.md once published; currently inline in docker-compose.prod.yml).
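As an illustration of the uptime-probe signal, the sketch below evaluates "3 consecutive failures from 3 regions". The interpretation is an assumption: the alert fires when at least three regions each have their three most recent probes failing. The function name and input shape are hypothetical; the real rule lives in the Alertmanager configuration.

```python
from typing import Dict, List


def probe_alert_fires(
    results: Dict[str, List[bool]],
    consecutive: int = 3,
    regions: int = 3,
) -> bool:
    """Decide whether the uptime-probe alert should fire.

    `results` maps a region name to its probe outcomes, oldest first
    (True = success, False = failure).
    """
    failing_regions = 0
    for outcomes in results.values():
        tail = outcomes[-consecutive:]
        # A region counts only if it has enough samples and all of them failed.
        if len(tail) == consecutive and not any(tail):
            failing_regions += 1
    return failing_regions >= regions
```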


3. First-15-minutes checklist

When paged, the on-call engineer runs this checklist verbatim. Do not skip steps; each is timed.

  1. 0:00 — Acknowledge the page in PagerDuty. Stops the escalation timer and notifies the secondary.
  2. 0:02 — Open the SLO dashboard and the incident channel (#inc-YYYY-MM-DD-slug). Post a single-line ack with the alert name and the time.
  3. 0:05 — Classify severity per § 1. If SEV-1 or SEV-2, declare the incident in the channel and tag @incident-commander.
  4. 0:08 — Capture the alert payload, the most recent deploy SHA, and the top 5 slow / erroring endpoints. Post to the channel.
  5. 0:12 — Decide: mitigate first, root-cause later. Choose one of:
    • Roll back to the last green deploy (bin/rollback.sh vX.Y.Z).
    • Fail over to the healthy replicas (Caddy LB removes the bad replica automatically; verify with curl /api/health/ping).
    • Disable a broken provider connection via PUT /api/providers/{id} with { "isActive": false } (one-line toggle; safe by default).
  6. 0:15 — Post the chosen mitigation in the channel. If the page is still firing after 5 more minutes, escalate to the secondary.
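Because each step is timed from the moment of the page, the schedule can be turned into wall-clock deadlines. The sketch below is a hypothetical helper, not part of the OmniRoute tooling; the step labels are abbreviated from the checklist above.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# (minutes after the page, abbreviated step) taken from the checklist above.
CHECKLIST: List[Tuple[int, str]] = [
    (0, "Acknowledge the page in PagerDuty"),
    (2, "Open the SLO dashboard and the incident channel"),
    (5, "Classify severity"),
    (8, "Capture alert payload, deploy SHA, top 5 endpoints"),
    (12, "Choose a mitigation"),
    (15, "Post the chosen mitigation"),
]


def checklist_deadlines(paged_at: datetime) -> List[Tuple[datetime, str]]:
    """Return the wall-clock deadline for every checklist step."""
    return [(paged_at + timedelta(minutes=m), step) for m, step in CHECKLIST]


def escalation_due(paged_at: datetime) -> datetime:
    """Time to escalate to the secondary: 15 min checklist plus 5 more minutes."""
    return paged_at + timedelta(minutes=20)


if __name__ == "__main__":
    for deadline, step in checklist_deadlines(datetime.now(timezone.utc)):
        print(deadline.strftime("%H:%M"), step)
```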

4. Mitigation runbooks (per failure mode)

4.1 Provider outage (single provider down)

  1. PUT /api/providers/{id} with { "isActive": false } — toggles the connection off in the registry; all routes re-resolve on next request.
  2. Verify p95 returns to budget within 5 min.
  3. If all providers for a model are down, disable the model (see src/lib/a2a/skills/providerDiscovery.ts for the disable path).
  4. Update the status page with a banner if the outage exceeds 15 min.
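Step 1 can be scripted along these lines. This is a minimal sketch: the path and body come from the runbook, but the base URL, the bearer-token header, and the function name are assumptions, so check how the management API actually authenticates before relying on it.

```python
import json
import urllib.parse
import urllib.request


def build_disable_request(base_url: str, provider_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the PUT that toggles a provider off.

    Assumption: the API accepts a bearer token; the runbook does not
    specify the authentication scheme.
    """
    path_id = urllib.parse.quote(provider_id, safe="")
    body = json.dumps({"isActive": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/providers/{path_id}",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

To send it, pass the result to `urllib.request.urlopen(req, timeout=10)` and confirm a 2xx status before moving on to step 2.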

4.2 Cluster-wide latency regression

  1. Check the most recent deploy (/api/system/version returns the running version).
  2. If p95 doubled vs the 7-day baseline, roll back to the prior SHA via bin/rollback.sh.
  3. If the regression is provider-side, see § 4.1.
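The rollback trigger in step 2 reduces to a single comparison. In the sketch below, taking the 7-day baseline as the median of seven daily p95 values is an assumption, as are the function names; the runbook does not say how the baseline is computed.

```python
import statistics
from typing import Sequence


def seven_day_baseline(daily_p95_ms: Sequence[float]) -> float:
    """Assumed baseline: the median of one p95 value per day over 7 days."""
    if len(daily_p95_ms) != 7:
        raise ValueError("expected one p95 value per day for 7 days")
    return statistics.median(daily_p95_ms)


def should_roll_back(current_p95_ms: float, baseline_p95_ms: float, factor: float = 2.0) -> bool:
    """True when the current p95 is at least `factor` times the baseline."""
    if baseline_p95_ms <= 0:
        raise ValueError("baseline p95 must be positive")
    return current_p95_ms >= factor * baseline_p95_ms
```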

4.3 Auth layer broken (5xx on /v1/responses for all keys)

  1. Check the authz-inventory endpoint: curl https://api.omniroute.dev/api/settings/authz-inventory | jq.
  2. If policies_active is empty, restore from the last good backup (bin/restore-policies.sh <sha>).
  3. Roll back if the cause is unclear.
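The check in step 2 can be automated against the JSON returned by the authz-inventory endpoint. The response shape is an assumption: the sketch expects a top-level object with a `policies_active` key holding a list or map, and treats a missing key the same as an empty one.

```python
import json


def policies_missing(inventory_json: str) -> bool:
    """Return True when the inventory reports no active policies.

    Assumption: the payload is a JSON object with a `policies_active`
    field; anything else is treated as "missing" so the operator looks.
    """
    inventory = json.loads(inventory_json)
    if not isinstance(inventory, dict):
        return True
    return not inventory.get("policies_active")
```

A True result is the cue to run bin/restore-policies.sh with the last good SHA; a False result points toward step 3.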

4.4 Data-layer incident (sqlite corruption, audit log gap)

  1. Stop the cluster (docker compose -f docker-compose.prod.yml stop) — preventing further writes is more important than uptime.
  2. Snapshot the data volume (bin/snapshot-data.sh).
  3. Open a SEV-1; this is data-loss territory. Page the data on-call (@db-team, see § 6).
  4. Restore from the last verified backup (see docs/BACKUP.md once published; currently the runbook is bin/restore-data.sh <sha>).
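The ordering matters here: a restore must never run if the stop or the snapshot failed. The sketch below encodes that with a stop-at-first-failure runner. The commands are the ones named above; the function names and the injectable `runner` are hypothetical, and step 3 (declaring the SEV-1 and paging) remains a human action.

```python
import subprocess
from typing import Callable, List, Optional


def data_incident_commands(backup_sha: str) -> List[List[str]]:
    """Commands for steps 1, 2 and 4, in the order they must run."""
    if not backup_sha:
        raise ValueError("a verified backup SHA is required")
    return [
        ["docker", "compose", "-f", "docker-compose.prod.yml", "stop"],
        ["bin/snapshot-data.sh"],
        ["bin/restore-data.sh", backup_sha],
    ]


def run_until_failure(
    commands: List[List[str]],
    runner: Optional[Callable[[List[str]], int]] = None,
) -> int:
    """Run commands in order; stop at the first non-zero exit code.

    Returns the number of commands that completed successfully.
    """
    if runner is None:
        runner = lambda argv: subprocess.run(argv).returncode
    completed = 0
    for argv in commands:
        if runner(argv) != 0:
            break
        completed += 1
    return completed
```

If the return value is smaller than the number of commands, the sequence stopped early and the later steps were not attempted.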

4.5 Security incident (vulnerability disclosure)

Stop. This is the SECURITY.md path, not this runbook. Page the security on-call (@security-team); do not post details to #omniroute-ops.


5. Communication

| Audience | Channel | Cadence | Owner |
| --- | --- | --- | --- |
| Engineering | #inc-YYYY-MM-DD-slug | Real-time | Incident commander |
| Status page | status.phenotype.dev | Every 30 min during SEV-1/2 | On-call |
| Customers (email) | announce@phenotype.dev | At SEV-1 start + resolution | Comms lead |
| Upstream providers | Direct contact | At SEV-1 start | Vendor mgmt |
| Postmortem | docs/postmortem/YYYY-MM-DD-slug.md | Within 5 business days | Incident commander |

The postmortem template will live at docs/postmortem/TEMPLATE.md (not yet published; see ADR-024 for the cadence and ADR-029 for the postmortem convention).
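The incident channel and the postmortem file share the same YYYY-MM-DD-slug stem, so both can be derived from one title. The slug rule below (lowercase, runs of non-alphanumerics collapsed to a hyphen) is an assumption, and the helper names are hypothetical.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def incident_channel(day: date, title: str) -> str:
    """Channel name in the #inc-YYYY-MM-DD-slug form."""
    return f"#inc-{day.isoformat()}-{slugify(title)}"


def postmortem_path(day: date, title: str) -> str:
    """Postmortem file in the docs/postmortem/YYYY-MM-DD-slug.md form."""
    return f"docs/postmortem/{day.isoformat()}-{slugify(title)}.md"
```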


6. On-call rotation

| Role | Primary | Secondary | Rotation |
| --- | --- | --- | --- |
| Engineering on-call | security-circle lead | @open-sse | Weekly, Mon 09:00 PDT |
| Security on-call | @security-team | | Weekly |
| Data on-call | @db-team | | Weekly |
| Comms lead | @comms | | As needed |

Handoff: every Monday at 09:00 PDT, the outgoing on-call posts a written handoff for the incoming on-call in #omniroute-ops-handoff covering open SEV-3/4 items, scheduled maintenance windows, and any in-flight mitigations.
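A reminder bot needs the next handoff time. The sketch below computes it with a fixed UTC-7 offset, which is what "PDT" literally means; it is an assumption that the rotation does not shift to PST in winter, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for PDT (UTC-7). Assumption: no switch to PST in winter.
PDT = timezone(timedelta(hours=-7), "PDT")


def next_handoff(now: datetime) -> datetime:
    """Return the next Monday 09:00 PDT strictly after `now`.

    `now` must be timezone-aware.
    """
    local = now.astimezone(PDT)
    candidate = local.replace(hour=9, minute=0, second=0, microsecond=0)
    # weekday() is 0 for Monday, so this moves forward to the coming Monday.
    candidate += timedelta(days=(-local.weekday()) % 7)
    if candidate <= local:
        candidate += timedelta(days=7)
    return candidate
```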


7. Postmortem expectations

  • Blameless. People did the best they could with the information they had. Focus on systems, signals, and decision points.
  • Within 5 business days of resolution. File via gh issue create --label postmortem --label SEV-1 (or --label SEV-2).
  • Action items must be assigned, dated, and tracked in docs/TECH_DEBT.md (P0 < 30 d, P1 < 90 d per that doc's SLA).
  • Mandatory attendees: incident commander, on-call, any engineer who touched the mitigation, and one person who was not involved (fresh-eyes review).
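The two deadlines above can be computed mechanically. This is a sketch under stated assumptions: business days are Monday to Friday with no holiday calendar, the action-item dates are the outer bound of the "< 30 d" and "< 90 d" SLAs, and the helper names are hypothetical.

```python
from datetime import date, timedelta

# Upper bounds in calendar days, per the SLA quoted from docs/TECH_DEBT.md.
ACTION_ITEM_SLA_DAYS = {"P0": 30, "P1": 90}


def add_business_days(start: date, days: int) -> date:
    """Advance `days` business days (Mon-Fri); holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def postmortem_due(resolved_on: date) -> date:
    """Latest date for the postmortem: 5 business days after resolution."""
    return add_business_days(resolved_on, 5)


def action_item_bound(opened_on: date, priority: str) -> date:
    """Date by which a P0 or P1 action item must already be closed."""
    return opened_on + timedelta(days=ACTION_ITEM_SLA_DAYS[priority])
```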

8. Review log

| Date | Reviewer | Change |
| --- | --- | --- |
| 2026-06-18 | security-circle lead | Initial runbook; severity ladder + 15-min checklist + 4.1–4.5 mitigation runbooks. Closes 71-pillar audit L61 (1/3 → 2/3). |
| 2026-07-18 (planned) | observability-circle | Wire on-call rotation into PagerDuty schedule; add the postmortem template. |
