Define SLOs and add burn-rate alerting for activity-apiserver #145

## Summary

Define latency and availability SLOs for the activity-apiserver and implement burn-rate alerts that catch real consumer-facing issues without noise.

## SLOs

**Alerting target: 99%** (7.2 hours/month error budget)
**North-star target: 99.9%** (tracked on dashboard, no alerting until system stabilizes)
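
As a quick check on the budget figures, here is a minimal sketch of the arithmetic. It assumes a 30-day month and a time-based view of the budget; the function name `error_budget_minutes` is illustrative and does not come from the codebase.

```python
from fractions import Fraction

# Assumption: the "month" behind the 7.2 hours figure is a 30-day window.
MINUTES_PER_30_DAYS = 30 * 24 * 60


def error_budget_minutes(slo: str, window_minutes: int = MINUTES_PER_30_DAYS) -> Fraction:
    """Return the allowed 'bad' minutes for an SLO given as a decimal string, e.g. "0.99".

    Fractions are used so the result is exact rather than subject to float rounding.
    """
    return (1 - Fraction(slo)) * window_minutes


alerting = error_budget_minutes("0.99")
north_star = error_budget_minutes("0.999")

print(f"99%   target: {float(alerting) / 60:.1f} hours per 30 days")
print(f"99.9% target: {float(north_star):.1f} minutes per 30 days")
```

Under these assumptions the 99% target allows 7.2 hours per month, and the 99.9% north-star allows 43.2 minutes.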

| Category | Resources | SLI | Alert target | North-star |
|----------|-----------|-----|-------------|------------|
| Metadata operations | `activitypolicies` GET/LIST/APPLY | p99 < 1s | 99% | 99.9% |
| Audit log queries | `auditlogqueries` POST | p99 < 3s | 99% | 99.9% |
| Activity queries | `activityqueries`, `activityfacetqueries` POST | p99 < 3s | 99% | 99.9% |
| Event queries | `eventqueries`, `eventfacetqueries` POST | p99 < 3s | 99% | 99.9% |
| Availability | All resources | Non-5xx responses | 99% | 99.9% |
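
One way to read the table as good/total ratios is sketched below: a request counts as "good" for latency when it completes under its category's threshold, and as "good" for availability when the response is not a 5xx. This is an illustration of the ratio form only. The `Request` type and the category keys are hypothetical, and the real SLIs would be derived from the apiserver's metrics rather than from in-memory records.

```python
from dataclasses import dataclass
from typing import Optional

# Latency thresholds in seconds, taken from the SLO table above.
# The category keys are hypothetical labels, not names from the codebase.
LATENCY_THRESHOLD_S = {
    "metadata": 1.0,
    "audit_log_queries": 3.0,
    "activity_queries": 3.0,
    "event_queries": 3.0,
}


@dataclass
class Request:
    category: str
    latency_s: float
    status: int


def latency_sli(requests: list[Request], category: str) -> Optional[float]:
    """Fraction of requests in `category` served under that category's threshold."""
    threshold = LATENCY_THRESHOLD_S[category]
    in_category = [r for r in requests if r.category == category]
    if not in_category:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(1 for r in in_category if r.latency_s < threshold)
    return good / len(in_category)


def availability_sli(requests: list[Request]) -> Optional[float]:
    """Fraction of all requests that did not return a 5xx status."""
    if not requests:
        return None
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return good / len(requests)


sample = [
    Request("metadata", 0.2, 200),
    Request("metadata", 1.5, 200),           # too slow for the 1s threshold
    Request("audit_log_queries", 2.0, 500),  # server error
    Request("audit_log_queries", 0.4, 200),
]
print(latency_sli(sample, "metadata"))
print(availability_sli(sample))
```

Each ratio would then be compared against the 99% alerting target and the 99.9% north-star.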

## Burn-rate alerts

| Urgency | Burn rate | Long window | Short window | Action |
|---------|-----------|-------------|--------------|--------|
| Page | 14.4x | 1 hour | 5 min | Immediate response |
| Ticket | 6x | 6 hours | 30 min | Investigate today |
| Low | 3x | 3 days | 6 hours | Investigate this week |
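
The table can be read as a multi-window rule: an alert fires only when the burn rate exceeds the threshold over both the long and the short window, where burn rate is the observed error ratio divided by the budgeted error ratio (1 minus the SLO). The sketch below illustrates that logic under the 99% alerting target. The names `BurnRateRule`, `burn_rate`, and `fires` are hypothetical, and the real rules would live in the alerting system rather than in Python.

```python
from dataclasses import dataclass

SLO = 0.99  # alerting target from this issue


@dataclass(frozen=True)
class BurnRateRule:
    urgency: str
    threshold: float       # burn-rate multiple
    long_window_h: float   # hours
    short_window_h: float  # hours


# Values copied from the burn-rate table above.
RULES = [
    BurnRateRule("page", 14.4, 1, 5 / 60),
    BurnRateRule("ticket", 6.0, 6, 0.5),
    BurnRateRule("low", 3.0, 72, 6),
]


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def fires(rule: BurnRateRule, long_error_ratio: float,
          short_error_ratio: float, slo: float = SLO) -> bool:
    """Fire only if both windows exceed the threshold.

    The short window lets the alert clear quickly once the problem stops.
    """
    return (burn_rate(long_error_ratio, slo) > rule.threshold
            and burn_rate(short_error_ratio, slo) > rule.threshold)


def budget_fraction_consumed(rule: BurnRateRule, period_h: float = 30 * 24) -> float:
    """Share of the budget spent if the threshold burn rate lasts the whole long window.

    Assumption: a 30-day budget period.
    """
    return rule.threshold * rule.long_window_h / period_h


for rule in RULES:
    print(f"{rule.urgency}: {budget_fraction_consumed(rule):.0%} of budget per long window")
```

With a 30-day budget period, sustaining each threshold for its full long window spends about 2% (page), 5% (ticket), and 30% (low) of the monthly budget.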

## What needs to happen

1. Add recording rules for per-category SLI metrics (good/total request ratios)
2. Add burn-rate alert rules for each SLO
3. Add Grafana dashboard panels showing error budget remaining and burn rate
4. Track 99.9% on dashboard as a north-star goal

## Why 99% to start

Recent incidents consumed far more than a 99.9% budget would allow (the March 24 latency spike alone used 558% of a 99.9% monthly budget). Starting at 99% gives a realistic error budget while the system is being hardened with rate limiting (#136), query timeouts (#137), and pipeline reliability improvements (#144). Once those are in place, tighten the alerting target to 99.5% and eventually 99.9%.
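
To put the 558% figure in terms of the other targets, the consumption can be rescaled by the ratio of the error budgets. This is a sketch of that arithmetic only: the helper name `rescale_budget_usage` is illustrative, and it assumes the same measurement window for every target.

```python
from fractions import Fraction


def rescale_budget_usage(used_pct: str, from_slo: str, to_slo: str) -> Fraction:
    """Convert budget usage measured against one SLO into the share of another SLO's budget.

    Both budgets are assumed to cover the same window, so only the ratio
    (1 - from_slo) / (1 - to_slo) matters.
    """
    return Fraction(used_pct) * (1 - Fraction(from_slo)) / (1 - Fraction(to_slo))


at_99 = rescale_budget_usage("558", "0.999", "0.99")
at_995 = rescale_budget_usage("558", "0.999", "0.995")

print(f"March 24 spike: {float(at_99):.1f}% of a 99% monthly budget")
print(f"March 24 spike: {float(at_995):.1f}% of a 99.5% monthly budget")
```

By this arithmetic the same incident would use 55.8% of a 99% monthly budget and 111.6% of a 99.5% budget, which fits the plan to tighten the target only after the hardening work lands.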
