
[question] For a service, one check goes into error, then another a little later #281

Open
blysik opened this issue Oct 23, 2015 · 6 comments


@blysik

blysik commented Oct 23, 2015

Hi,

Just a question on what the behavior is supposed to be.

  1. Create a service and assign it two checks: graphite and http.
  2. The http check goes into error, and an alert is triggered (importance of 'Error').
  3. A few moments later, the graphite check fails (importance of 'Critical'), but no alert appears to be triggered.

Shouldn't another alert be triggered for 3?

@xinity
Contributor

xinity commented Oct 23, 2015

Nope,

Alerts are triggered per service, not per check. So if an alert has already been triggered, Cabot will not trigger another one, even if another check fails, until the alert_notification time is reached.

@dbuxton please fix me if I'm wrong ;-)
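
In other words, the suppression described above is purely time-based per service. A minimal sketch of that behaviour, for illustration only (this is not Cabot's actual code; the status strings and the 10-minute interval are assumptions):

```python
# Sketch of per-service alert suppression (illustration only, not Cabot's code).
from datetime import datetime, timedelta

ALERT_INTERVAL = timedelta(minutes=10)  # assumed value

def should_alert(overall_status, last_alert_sent, now):
    """Alerts are per service: once one has gone out, nothing more is sent
    until ALERT_INTERVAL elapses, no matter which other check fails."""
    if overall_status == "PASSING":
        return False
    if last_alert_sent is None:
        return True
    return now - last_alert_sent > ALERT_INTERVAL

# The scenario from the issue: http fails at 18:53 (alert sent), graphite
# fails two minutes later -> still inside ALERT_INTERVAL, so no new alert.
print(should_alert("CRITICAL", datetime(2015, 10, 23, 18, 53),
                   now=datetime(2015, 10, 23, 18, 55)))  # False
```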


@blysik
Author

blysik commented Oct 23, 2015

Wouldn't that be a problem if the first check was just a warning, and the next check was critical?

@xinity
Contributor

xinity commented Oct 23, 2015

It depends. IMHO it might be interesting to have an escalation of a service's failure state, like DEFCON levels in war movies ;-)

But I don't think this is easy to implement, nor do I think it would be widely used.

@blysik
Author

blysik commented Oct 23, 2015

I think, as currently designed, critical errors might go unnoticed.

  1. A low-priority check on a service fails, which generates a warning.
  2. An alert goes out to the Ops team.
  3. The Ops team sees it's a low-priority check and ignores it until morning.
  4. A more severe check fails, but no alert gets sent.
  5. The Ops team doesn't know about it.

@dbuxton
Contributor

dbuxton commented Oct 23, 2015

At the moment we just track the last alert sent as a timestamp (Service.last_alert_sent - see the line

elif self.overall_status in (self.CRITICAL_STATUS, self.ERROR_STATUS):

) - we don't track what kind of alert that was.

It would be easy to also track Service.last_alert_sent_overall_status and compare that to the current status to ensure that this issue doesn't occur.

Happy to merge anything that does this; I too think this is a big potential problem. However, it won't silence alerts until morning @blysik, just for ALERT_INTERVAL.
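
A minimal sketch of that suggestion, for illustration only (not a merged implementation; the severity ordering and status strings are assumptions):

```python
# Sketch of tracking the status that was last alerted on, so an escalation
# (e.g. WARNING -> CRITICAL) re-alerts immediately instead of waiting for
# ALERT_INTERVAL. Illustration only; names and ordering are assumptions.
from datetime import datetime, timedelta

SEVERITY = ["PASSING", "WARNING", "ERROR", "CRITICAL"]  # assumed, low to high

def should_alert(overall_status, last_alert_sent, last_alert_sent_overall_status,
                 now, alert_interval=timedelta(minutes=10)):
    if overall_status == "PASSING":
        return False
    if last_alert_sent is None:
        return True
    if now - last_alert_sent > alert_interval:
        return True
    # Re-alert early only if the service has got worse since the last alert.
    return (SEVERITY.index(overall_status) >
            SEVERITY.index(last_alert_sent_overall_status))

# A WARNING alert went out at 18:53; at 18:55 the service is CRITICAL,
# so a new alert is sent even though ALERT_INTERVAL has not elapsed.
print(should_alert("CRITICAL", datetime(2015, 10, 23, 18, 53), "WARNING",
                   now=datetime(2015, 10, 23, 18, 55)))  # True
```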

@blysik
Author

blysik commented Oct 23, 2015

Aha! ALERT_INTERVAL. Okay, so not as bad.
