Define SLOs and add burn-rate alerting for activity-apiserver #145

@scotwells

Summary

Define latency and availability SLOs for the activity-apiserver and implement burn-rate alerts that catch real consumer-facing issues without noise.

SLOs

Alerting target: 99% (7.2 hours/month error budget)
North-star target: 99.9% (tracked on dashboard, no alerting until system stabilizes)
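The error-budget figure above follows directly from the target. A minimal sketch of the arithmetic, assuming a 30-day month (the helper name `error_budget_hours` is illustrative and not part of the activity-apiserver codebase):

```python
def error_budget_hours(target: float, days: int = 30) -> float:
    """Return the allowed 'bad' time, in hours, for an SLO target.

    `days` is the length of the SLO period; 30 is an assumption here.
    """
    return (1.0 - target) * days * 24


if __name__ == "__main__":
    for target in (0.99, 0.999):
        hours = error_budget_hours(target)
        print(f"{target:.1%} target: {hours:.2f} h ({hours * 60:.1f} min) per period")
```

Under that assumption, 99% corresponds to 7.2 hours per month and 99.9% to 43.2 minutes per month.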

| Category | Resources | SLI | Alert target | North-star |
|---|---|---|---|---|
| Metadata operations | activitypolicies | GET/LIST/APPLY p99 < 1s | 99% | 99.9% |
| Audit log queries | auditlogqueries | POST p99 < 3s | 99% | 99.9% |
| Activity queries | activityqueries, activityfacetqueries | POST p99 < 3s | 99% | 99.9% |
| Event queries | eventqueries, eventfacetqueries | POST p99 < 3s | 99% | 99.9% |
| Availability | All resources | Non-5xx responses | 99% | 99.9% |

Burn-rate alerts

| Urgency | Burn rate | Long window | Short window | Action |
|---|---|---|---|---|
| Page | 14.4x | 1 hour | 5 min | Immediate response |
| Ticket | 6x | 6 hours | 30 min | Investigate today |
| Low | 3x | 3 days | 6 hours | Investigate this week |
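To make the thresholds concrete, here is a rough sketch of how a multi-window burn-rate check is commonly evaluated: the burn rate is the observed error ratio divided by the budgeted error ratio, and an alert fires only when both the long and the short window exceed the threshold. The 30-day period, the `BurnRateAlert` type, and the function names are assumptions for illustration; the real rules would presumably live in Prometheus rather than Python.

```python
from dataclasses import dataclass

SLO_TARGET = 0.99        # alerting target from this issue
PERIOD_HOURS = 30 * 24   # assumption: 30-day SLO period


@dataclass(frozen=True)
class BurnRateAlert:
    urgency: str
    burn_rate: float
    long_window_hours: float
    short_window_hours: float


# Values copied from the burn-rate table above.
ALERTS = [
    BurnRateAlert("page", 14.4, 1, 5 / 60),
    BurnRateAlert("ticket", 6.0, 6, 0.5),
    BurnRateAlert("low", 3.0, 72, 6),
]


def burn_rate(error_ratio: float, target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - target)


def fires(alert: BurnRateAlert, long_error_ratio: float, short_error_ratio: float) -> bool:
    """Fire only if both windows burn at or above the alert's threshold."""
    return (
        burn_rate(long_error_ratio) >= alert.burn_rate
        and burn_rate(short_error_ratio) >= alert.burn_rate
    )


def budget_consumed(alert: BurnRateAlert) -> float:
    """Fraction of the period's budget spent if the threshold burn rate
    is sustained for the whole long window."""
    return alert.burn_rate * alert.long_window_hours / PERIOD_HOURS


if __name__ == "__main__":
    for alert in ALERTS:
        print(f"{alert.urgency}: {budget_consumed(alert):.0%} of budget per long window")
```

The short window is what lets an alert clear quickly once the problem stops: a past spike can keep the long-window ratio high, but the short-window ratio drops back under the threshold.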

What needs to happen

  1. Add recording rules for per-category SLI metrics (good/total request ratios)
  2. Add burn-rate alert rules for each SLO
  3. Add Grafana dashboard panels showing error budget remaining and burn rate
  4. Track 99.9% on dashboard as a north-star goal
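As an illustration of step 1, the per-category good/total ratios could be computed along these lines. This is a sketch under assumptions: the `Request` record shape and category keys are hypothetical, and treating "good" as "faster than the category's latency threshold" is one common reading of the latency SLIs in the table, not something the issue spells out. The actual recording rules would be written against the apiserver's request metrics.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Latency thresholds in seconds, taken from the SLO table.
LATENCY_THRESHOLDS = {
    "metadata": 1.0,
    "audit_log_queries": 3.0,
    "activity_queries": 3.0,
    "event_queries": 3.0,
}

# Resource-to-category mapping, taken from the SLO table.
RESOURCE_CATEGORY = {
    "activitypolicies": "metadata",
    "auditlogqueries": "audit_log_queries",
    "activityqueries": "activity_queries",
    "activityfacetqueries": "activity_queries",
    "eventqueries": "event_queries",
    "eventfacetqueries": "event_queries",
}


@dataclass(frozen=True)
class Request:  # hypothetical record shape, for illustration only
    resource: str
    duration_seconds: float
    status_code: int


def latency_sli(requests: Iterable[Request], category: str) -> Optional[float]:
    """Fraction of a category's requests faster than its threshold."""
    in_category = [r for r in requests if RESOURCE_CATEGORY.get(r.resource) == category]
    if not in_category:
        return None  # no traffic, so the ratio is undefined
    threshold = LATENCY_THRESHOLDS[category]
    good = sum(1 for r in in_category if r.duration_seconds < threshold)
    return good / len(in_category)


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Fraction of all requests that did not return a 5xx status."""
    all_requests = list(requests)
    if not all_requests:
        return None
    good = sum(1 for r in all_requests if r.status_code < 500)
    return good / len(all_requests)
```

Returning `None` for an empty window keeps "no traffic" distinct from "0% good", which matters when the ratio feeds a burn-rate calculation.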

Why 99% to start

Recent incidents consumed far more than a 99.9% budget would allow (the March 24 latency spike alone used 558% of a 99.9% monthly budget). Starting at 99% gives a realistic error budget while the system is being hardened with rate limiting (#136), query timeouts (#137), and pipeline reliability improvements (#144). Once those are in place, tighten to 99.5% and eventually 99.9%.
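For a sense of scale, the 558% figure can be converted into minutes and compared against the looser targets. A rough worked calculation, assuming a 30-day budget period (the variable and function names are illustrative):

```python
PERIOD_MINUTES = 30 * 24 * 60  # assumption: 30-day budget period


def budget_minutes(target: float) -> float:
    """Error budget in minutes for a given SLO target over the period."""
    return (1.0 - target) * PERIOD_MINUTES


# 558% of the 99.9% budget, as reported for the March 24 latency spike.
incident_minutes = 5.58 * budget_minutes(0.999)

for target in (0.999, 0.995, 0.99):
    share = incident_minutes / budget_minutes(target)
    print(f"{target:.1%}: incident uses {share:.1%} of the monthly budget")
```

Under that assumption the spike amounts to roughly 241 minutes of budget, which is about 56% of a 99% monthly budget and about 112% of a 99.5% one, so a single incident of that size would still exhaust the intermediate target on its own.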
