## Summary
Define latency and availability SLOs for the activity-apiserver and implement burn-rate alerts that catch real consumer-facing issues without noise.
## SLOs

**Alerting target:** 99% (7.2 hours/month error budget)
**North-star target:** 99.9% (tracked on the dashboard, no alerting until the system stabilizes)
| Category | Resources | SLI | Alert target | North-star |
| --- | --- | --- | --- | --- |
| Metadata operations | `activitypolicies` GET/LIST/APPLY | p99 < 1s | 99% | 99.9% |
| Audit log queries | `auditlogqueries` POST | p99 < 3s | 99% | 99.9% |
| Activity queries | `activityqueries`, `activityfacetqueries` POST | p99 < 3s | 99% | 99.9% |
| Event queries | `eventqueries`, `eventfacetqueries` POST | p99 < 3s | 99% | 99.9% |
| Availability | All resources | Non-5xx responses | 99% | 99.9% |
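A minimal sketch of how the per-category good/total ratios could be recorded, assuming the server exposes the standard Kubernetes apiserver metrics (`apiserver_request_total`, `apiserver_request_duration_seconds_bucket`) with `resource`, `verb`, and `code` labels and a `job="activity-apiserver"` scrape label; the exact metric and label names depend on our instrumentation. Each rule would be duplicated for the other alert windows (30m, 1h, 6h, 3d):

```yaml
groups:
  - name: activity-apiserver-sli
    rules:
      # Availability SLI: share of non-5xx responses across all resources.
      - record: activity_apiserver:availability:ratio_rate5m
        expr: |
          sum(rate(apiserver_request_total{job="activity-apiserver",code!~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total{job="activity-apiserver"}[5m]))

      # Latency SLI for the activity-query category: share of POSTs that
      # complete within 3s (le="3" assumes the duration histogram has a
      # bucket boundary at 3s).
      - record: activity_apiserver:activityqueries_latency:ratio_rate5m
        expr: |
          sum(rate(apiserver_request_duration_seconds_bucket{job="activity-apiserver",resource=~"activityqueries|activityfacetqueries",verb="POST",le="3"}[5m]))
          /
          sum(rate(apiserver_request_duration_seconds_count{job="activity-apiserver",resource=~"activityqueries|activityfacetqueries",verb="POST"}[5m]))
```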
## Burn-rate alerts
| Urgency | Burn rate | Long window | Short window | Action |
| --- | --- | --- | --- | --- |
| Page | 14.4x | 1 hour | 5 min | Immediate response |
| Ticket | 6x | 6 hours | 30 min | Investigate today |
| Low | 3x | 3 days | 6 hours | Investigate this week |
## What needs to happen
- Add recording rules for per-category SLI metrics (good/total request ratios), as sketched above
- Add burn-rate alert rules for each SLO, as sketched above
- Add Grafana dashboard panels showing error budget remaining and burn rate (see the sketch after this list)
- Track 99.9% on the dashboard as a north-star goal
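For the error-budget panel, one option is to express the budget remaining as the fraction of the month's allowance still unspent; a sketch using the hypothetical availability recording rule above (averaging 5m ratios over 30 days approximates the true 30-day ratio when traffic is roughly uniform):

```promql
# Fraction of the 99% SLO's 30-day error budget remaining; negative
# values mean the budget is overspent.
1 - (
  (1 - avg_over_time(activity_apiserver:availability:ratio_rate5m[30d]))
  /
  (1 - 0.99)
)
```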
## Why 99% to start
Recent incidents consumed far more than a 99.9% budget would allow (the March 24 latency spike alone used 558% of a 99.9% monthly budget). Starting at 99% gives a realistic error budget while the system is being hardened with rate limiting (#136), query timeouts (#137), and pipeline reliability improvements (#144). Once those are in place, tighten to 99.5% and eventually 99.9%.