metrics: alert based on SLIs #39

Closed
2 tasks done
cirocosta opened this issue Apr 26, 2019 · 3 comments

Comments

@cirocosta
Member

cirocosta commented Apr 26, 2019

Hey,

At the moment, our SLIs (see Datadog SLI Dashboard) are displayed on a monitor in the office, but that doesn't mean the operators (Test Pilot Pair) end up acting on problems that show up there, as the pair would need to keep watching it.

sli-hurt

By tying those SLIs to PagerDuty, we can treat degradations in those numbers as triggers for action.

To make it obvious that an alert has fired, we should also update how the colors are set there, so that anyone can see at a glance that there's an active incident.

As for which thresholds to use, let's start with a daily SLO of 95% (which allows 1h12m of downtime per day) and adjust from there.
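The 1h12m figure follows from the objective itself: the budget is the fraction of the window that may be spent below target. A minimal Python sketch of that arithmetic (the name `allowed_downtime` is made up for illustration and is not part of any tooling mentioned here):

```python
from datetime import timedelta

def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Time within `window` that may be spent failing without breaching the SLO."""
    return window * (1 - slo_percent / 100)

# 95% over one day leaves 5% of 24h as budget.
daily_budget = allowed_downtime(95, timedelta(days=1))
print(daily_budget)
```

Raising the objective later only means changing `slo_percent`; for instance, 99% daily would shrink the budget to roughly 14 minutes.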

Acceptance criteria

  • When a daily SLI is hurt, a page is sent to PagerDuty.
  • Datadog colors reflect our objectives.
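Both criteria come down to the same comparison of the measured SLI against the objective. The sketch below is only an illustration of that decision, not the actual Datadog monitor configuration (which isn't shown in this issue); the function name `evaluate`, the two-color scheme, and the returned field names are all assumptions.

```python
def evaluate(sli_percent: float, slo_percent: float = 95.0) -> dict:
    """Decide the dashboard color and whether to page for one SLI reading.

    Hypothetical helper: a reading below the objective turns the widget red
    and raises a page; anything at or above the objective stays green.
    """
    breached = sli_percent < slo_percent
    return {"color": "red" if breached else "green", "page": breached}

print(evaluate(93.5))  # below the 95% objective
print(evaluate(99.2))  # within the objective
```

Driving the color and the page from the same condition keeps the dashboard and the alert from disagreeing about whether there's an incident.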

Thanks!

@scottietremendous

We should have follow-up issues to continually raise the SLOs.

@cirocosta
Member Author

@pivotal-bin-ju 🙌

Screen Shot 2019-04-29 at 4 50 53 PM
Screen Shot 2019-04-29 at 4 51 07 PM

@cirocosta
Member Author

cirocosta commented May 1, 2019

Whoops, it turns out that we were evaluating over a 1m interval (which is effectively a 99.93% daily SLO).
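One way to read the 99.93% figure: if a single bad 1-minute evaluation is enough to alert, the tolerated downtime is about one minute out of the 1440 in a day. A small Python sketch of that reading (`implied_daily_slo` is a hypothetical name):

```python
MINUTES_PER_DAY = 24 * 60

def implied_daily_slo(eval_window_minutes: float) -> float:
    """Daily availability (%) implied when one bad evaluation window already alerts."""
    return 100 * (1 - eval_window_minutes / MINUTES_PER_DAY)

print(f"{implied_daily_slo(1):.2f}%")  # 99.93%
```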

We're now also showing error budgets, which tell us how much time we have left to experiment while still aiming at that 99% monthly:

Screen Shot 2019-05-01 at 10 00 52 AM
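For reference, an error budget of this kind can be computed as the allowed downtime for the window minus what has already been spent. The sketch below assumes a 30-day month and uses a made-up figure for downtime so far; `remaining_budget` is a hypothetical helper, not how the Datadog widget is implemented.

```python
from datetime import timedelta

def remaining_budget(slo_percent: float, window: timedelta,
                     downtime_so_far: timedelta) -> timedelta:
    """Error budget left in `window`; negative once the SLO is breached."""
    total = window * (1 - slo_percent / 100)
    return total - downtime_so_far

month = timedelta(days=30)       # assumption: 30-day month
spent = timedelta(hours=2)       # hypothetical downtime so far
left = remaining_budget(99, month, spent)
print(left)
```

With a 99% monthly objective the total budget is about 7h12m, so two hours of downtime would leave a little over five hours.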
