Incident Response

Incident Response Runbook — OmniRoute (2026-06-18)

Status: Authoritative. The 71-pillar audit (L61) references this doc for the Obs > 2.00 gate.
Owner: observability-circle (lead: security-circle lead).
SLOs: see docs/PERF_BUDGETS.md § 1 (top-level SLOs) and ops/slos.yaml (machine-readable form, generated by the Bifrost team).
Disclosure policy: see SECURITY.md (vulnerability disclosure only, separate flow).

This runbook is the operational playbook for non-security incidents: outages, latency regressions, error-budget burn, and provider-side failures. Vulnerability disclosure stays on SECURITY.md; do not route those through this runbook.

1. Severity ladder

Sev	Definition	Examples	Page on	Resolve by
SEV-1	User-visible outage; > 50 % of requests failing or > 2x SLO breach for 5 min.	Cluster down; auth layer broken; 5xx flood.	On-call P0 (immediate)	4 h
SEV-2	Significant degradation; 1.5–2x SLO breach for 15 min, or single-tenant impact.	Single provider down; p95 > 1.5x budget; rate-limit runaway.	On-call P1 (15 min)	24 h
SEV-3	Latent bug or near-miss; no current user impact but error budget at risk.	Memory leak trending up; circuit breaker tripping on one provider.	Slack `#omniroute-ops` (next standup)	7 d
SEV-4	Cosmetic / informational.	Log line noise; non-blocking UI glitch.	Next weekly review	Next refactor cycle

Burn-rate escalation (per docs/PERF_BUDGETS.md § 1): a 6x burn rate for 5 min is SEV-1; a 2x burn rate for 1 h is SEV-2; a burn rate sustained below 1x for 7 d demotes the incident to SEV-3.
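
The burn-rate rule above can be sketched as a small classifier. This is a minimal illustration, not the alerting configuration itself: the function name `classify_burn` and its parameters are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
from datetime import timedelta
from typing import Optional


def classify_burn(rate: float, sustained: timedelta) -> Optional[str]:
    """Map a burn rate (multiple of budget) and its duration to a severity.

    Returns None when no burn-rate rule matches; the caller then falls
    back to the definitions in the severity ladder.
    """
    # Assumption: thresholds are inclusive (>= 6x, >= 5 min, and so on).
    if rate >= 6.0 and sustained >= timedelta(minutes=5):
        return "SEV-1"
    if rate >= 2.0 and sustained >= timedelta(hours=1):
        return "SEV-2"
    if rate < 1.0 and sustained >= timedelta(days=7):
        return "SEV-3"
    return None
```

A 3x burn that has lasted only 10 minutes matches none of the three rules, so it returns None and is classified by the ladder instead.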

2. Detection sources

Source	Signal	Routing
Prometheus (`/metrics`)	Counter deltas (5xx, latency)	Alertmanager → PagerDuty
Grafana SLO dashboards	SLO burn-rate panels	Slack `#omniroute-ops`
Uptime probe (`/api/health/ping`)	3 consecutive failures from 3 regions	Alertmanager → PagerDuty
Dependabot	New CVE in dependency	GitHub issue + Slack `#security`
User report (support@)	Manual triage	Slack `#omniroute-triage`
Error budget burn alert	`slo_burn_rate > threshold`	Alertmanager

Prometheus and Alertmanager are configured in the deploy repo (see docs/operations/DEPLOY.md once published; currently inline in docker-compose.prod.yml).
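
The uptime-probe row in the table above ("3 consecutive failures from 3 regions") can be read in more than one way. The sketch below assumes it means that at least three regions each report three failed probes in a row; `should_page` and the region names in the usage are hypothetical, and the real rule lives in the Alertmanager configuration.

```python
from typing import Dict, List


def should_page(results: Dict[str, List[bool]], streak: int = 3, regions: int = 3) -> bool:
    """Decide whether the uptime probe should page.

    `results` maps a region name to its probe outcomes, oldest first,
    where True means the probe of /api/health/ping succeeded.
    """
    failing_regions = 0
    for history in results.values():
        recent = history[-streak:]
        # A region counts only if its last `streak` probes all failed.
        if len(recent) == streak and not any(recent):
            failing_regions += 1
    return failing_regions >= regions
```

Under this reading, one region recovering on its latest probe is enough to hold the page back.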

3. First-15-minutes checklist

When paged, the on-call engineer runs this checklist verbatim. Do not skip steps; each is timed.

0:00 — Acknowledge the page in PagerDuty. Stops the escalation timer and notifies the secondary.
0:02 — Open the SLO dashboard and the incident channel (#inc-YYYY-MM-DD-slug). Post a single-line ack with the alert name and the time.
0:05 — Classify severity per § 1. If SEV-1 or SEV-2, declare the incident in the channel and tag @incident-commander.
0:08 — Capture the alert payload, the most recent deploy SHA, and the top 5 slow / erroring endpoints. Post to the channel.
0:12 — Decide: mitigate first, root-cause later. Choose one of:
- Roll back to the last green deploy (bin/rollback.sh vX.Y.Z).
- Fail over to the healthy replicas (the Caddy LB removes the bad replica automatically; verify with curl /api/health/ping).
- Disable a broken provider connection via PUT /api/providers/{id} with { "isActive": false } (one-line toggle; safe by default).
0:15 — Post the chosen mitigation in the channel. If the page is still firing after 5 more minutes, escalate to the secondary.
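
The channel name and the single-line ack from the 0:02 step can be generated rather than typed under pressure. A minimal sketch follows; the helper names, the slug normalization, and the exact ack wording are assumptions for illustration only.

```python
import re
from datetime import datetime, timezone


def incident_channel(slug: str, when: datetime) -> str:
    """Build a channel name of the form #inc-YYYY-MM-DD-slug."""
    # Assumption: slugs are lowercase, with runs of other characters
    # collapsed to a single hyphen.
    clean = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
    return f"#inc-{when:%Y-%m-%d}-{clean}"


def ack_line(alert_name: str, when: datetime) -> str:
    """Single-line ack carrying the alert name and the time in UTC."""
    return f"ACK {alert_name} at {when.astimezone(timezone.utc):%H:%M} UTC"
```

For example, `incident_channel("Auth 5xx flood", datetime(2026, 6, 18, 14, 2, tzinfo=timezone.utc))` returns `#inc-2026-06-18-auth-5xx-flood`.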

4. Mitigation runbooks (per failure mode)

4.1 Provider outage (single provider down)

PUT /api/providers/{id} with { "isActive": false } — toggles the connection off in the registry; all routes re-resolve on next request.
Verify p95 returns to budget within 5 min.
If all providers for a model are down, disable the model (see src/lib/a2a/skills/providerDiscovery.ts for the disable path).
Update the status page with a banner if the outage exceeds 15 min.
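
The provider toggle in the first step can be scripted. The sketch below only builds the request; the base URL is borrowed from § 4.3, and the bearer-token header and the `build_disable_request` helper are assumptions, since this runbook does not describe how the endpoint authenticates.

```python
import json
import urllib.parse
import urllib.request


def build_disable_request(base_url: str, provider_id: str, token: str) -> urllib.request.Request:
    """Build PUT /api/providers/{id} with body {"isActive": false}."""
    body = json.dumps({"isActive": False}).encode("utf-8")
    path = "/api/providers/" + urllib.parse.quote(provider_id, safe="")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API accepts a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )
```

Sending it is one more call, `urllib.request.urlopen(req, timeout=10)`, after which the p95 check in the second step applies.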

4.2 Cluster-wide latency regression

Check the most recent deploy (/api/system/version returns the running version).
If p95 doubled vs the 7-day baseline, roll back to the prior SHA via bin/rollback.sh.
If the regression is provider-side, see § 4.1.
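
The rollback trigger in the second step is a simple comparison once both p95 values are known. The sketch below computes a nearest-rank p95 from raw latency samples and applies the "doubled vs baseline" test; in practice the values would come from the Prometheus dashboards, and the function names here are hypothetical.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # 1-based nearest rank: ceil(0.95 * n), in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_roll_back(current_p95_ms: float, baseline_p95_ms: float) -> bool:
    """True when the current p95 is at least double the 7-day baseline."""
    return current_p95_ms >= 2.0 * baseline_p95_ms
```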

4.3 Auth layer broken (5xx on /v1/responses for all keys)

Check the authz-inventory endpoint: curl https://api.omniroute.dev/api/settings/authz-inventory | jq.
If policies_active is empty, restore from the last good backup (bin/restore-policies.sh <sha>).
Roll back if the cause is unclear.
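
The `policies_active` check can be automated against the response body. The sketch below assumes the endpoint returns a JSON object with a top-level `policies_active` field; the runbook does not specify the response shape, so adjust the lookup if it differs.

```python
import json


def policies_missing(inventory_json: str) -> bool:
    """True when policies_active is absent or empty in the response body."""
    inventory = json.loads(inventory_json)
    # Assumption: policies_active is a top-level list or mapping.
    return not inventory.get("policies_active")
```

A True result points at the restore step; a False result with ongoing 5xx errors points at the rollback step.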

4.4 Data-layer incident (sqlite corruption, audit log gap)

Stop the cluster (docker compose -f docker-compose.prod.yml stop) — preventing further writes is more important than uptime.
Snapshot the data volume (bin/snapshot-data.sh).
Open a SEV-1; this is data-loss territory. Page the data on-call (@db-team, per § 6).
Restore from the last verified backup (see docs/BACKUP.md once published; currently the runbook is bin/restore-data.sh <sha>).
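
Before restoring, it can help to confirm whether a snapshot or backup file is itself intact. The sketch below uses SQLite's built-in `PRAGMA integrity_check` through Python's standard `sqlite3` module. The helper name is hypothetical, the file path is whatever the snapshot produced, and the file is opened read-only so the check cannot write to it.

```python
import sqlite3


def integrity_ok(db_path: str) -> bool:
    """Run SQLite's integrity check against a database file, read-only."""
    try:
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    except sqlite3.DatabaseError:
        return False  # the file could not be opened at all
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        return False  # for example "file is not a database"
    finally:
        conn.close()
    # A healthy database yields exactly one row containing "ok".
    return rows == [("ok",)]
```

Run it against the snapshot copy, never the live volume, so the stopped cluster's data stays untouched.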

4.5 Security incident (vulnerability disclosure)

Stop. This is the SECURITY.md path, not this runbook. Page the security on-call (@security-team); do not post details to #omniroute-ops.

5. Communication

Audience	Channel	Cadence	Owner
Engineering	`#inc-YYYY-MM-DD-slug`	Real-time	Incident commander
Status page	`status.phenotype.dev`	Every 30 min during SEV-1/2	On-call
Customers (email)	`announce@phenotype.dev`	At SEV-1 start + resolution	Comms lead
Upstream providers	Direct contact	At SEV-1 start	Vendor mgmt
Postmortem	`docs/postmortem/YYYY-MM-DD-slug.md`	Within 5 business days	Incident commander

Postmortem template is at docs/postmortem/TEMPLATE.md (forthcoming; see ADR-024 for the cadence and ADR-029 for the postmortem convention).

6. On-call rotation

Role	Primary	Secondary	Rotation
Engineering on-call	security-circle lead	@open-sse	Weekly, Mon 09:00 PDT
Security on-call	@security-team	—	Weekly
Data on-call	@db-team	—	Weekly
Comms lead	@comms	—	As needed

Handoff: every Monday at 09:00 PDT, the outgoing on-call posts a written handoff for the incoming on-call in #omniroute-ops-handoff covering open SEV-3/4 items, scheduled maintenance windows, and any in-flight mitigations.

7. Postmortem expectations

Blameless. People did the best they could with the information they had. Focus on systems, signals, and decision points.
Within 5 business days of resolution. File via gh issue create --label postmortem --label SEV-1 (or --label SEV-2).
Action items must be assigned, dated, and tracked in docs/TECH_DEBT.md (P0 < 30 d, P1 < 90 d per that doc's SLA).
Mandatory attendees: incident commander, on-call, any engineer who touched the mitigation, and one person who was not involved (fresh-eyes review).
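
The "within 5 business days of resolution" deadline can be computed rather than counted by hand. A minimal sketch, assuming business days are Monday to Friday and ignoring public holidays; `postmortem_due` is a hypothetical helper.

```python
from datetime import date, timedelta


def postmortem_due(resolved: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` weekdays after resolution."""
    due = resolved
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return due
```

An incident resolved on Thursday 2026-06-18 gives a due date of Thursday 2026-06-25.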

8. Review log

Date	Reviewer	Change
2026-06-18	security-circle lead	Initial runbook; severity ladder + 15-min checklist + 4.1–4.5 mitigation runbooks. Closes 71-pillar audit L61 (1/3 → 2/3).
2026-07-18 (planned)	observability-circle	Wire on-call rotation into PagerDuty schedule; add the postmortem template.
