
[question] For a service, one check goes into error, then another a little later #281

Open
blysik opened this issue Oct 23, 2015 · 6 comments


@blysik

blysik commented Oct 23, 2015

Hi,

Just a question on what the behavior is supposed to be.

  1. Create a service and assign it two checks: graphite and http.
  2. The http check goes into error, and an alert is triggered (importance of 'Error').
  3. A few moments later, the graphite check fails (importance of 'Critical'), but no alert appears to be triggered.

Shouldn't another alert be triggered for 3?

@xinity
Contributor

xinity commented Oct 23, 2015

Nope,

Alerts are triggered per service, not per check. So if an alert has already been triggered, Cabot will not trigger another one, even if another check fails, until the alert_notification time is reached.

@dbuxton please fix me if I'm wrong ;-)
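
In other words, the suppression described above is purely time-based per service. A minimal sketch of that behaviour, for illustration only (this is not Cabot's actual code; the status strings and the 10-minute interval are assumptions):

```python
# Sketch of per-service alert suppression (illustration only, not Cabot's code).
from datetime import datetime, timedelta

ALERT_INTERVAL = timedelta(minutes=10)  # assumed value

def should_alert(overall_status, last_alert_sent, now):
    """Alerts are per service: once one has gone out, nothing more is sent
    until ALERT_INTERVAL elapses, no matter which other check fails."""
    if overall_status == "PASSING":
        return False
    if last_alert_sent is None:
        return True
    return now - last_alert_sent > ALERT_INTERVAL

# The scenario from the issue: http fails at 18:53 (alert sent), graphite
# fails two minutes later -> still inside ALERT_INTERVAL, so no new alert.
print(should_alert("CRITICAL", datetime(2015, 10, 23, 18, 53),
                   now=datetime(2015, 10, 23, 18, 55)))  # False
```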


@blysik
Author

blysik commented Oct 23, 2015

Wouldn't that be a problem if the first check was just a warning, and the next check was critical?

@xinity
Contributor

xinity commented Oct 23, 2015

It depends. IMHO it might be interesting to have an escalation of a service's failure state, like DEFCON levels in war movies ;-)

But I don't think this is easy to implement, nor do I think it would be widely used.

@blysik
Author

blysik commented Oct 23, 2015

I think, as currently designed, critical errors might go unnoticed.

  1. A low-priority check on a service fails, which generates a warning.
  2. An alert goes out to the Ops team.
  3. The Ops team sees it's a low-priority check and ignores it until morning.
  4. A more severe check fails, but no alert gets sent.
  5. The Ops team doesn't know about it.

@dbuxton
Contributor

dbuxton commented Oct 23, 2015

At the moment we just track the last alert sent as a timestamp (Service.last_alert_sent - see the line

elif self.overall_status in (self.CRITICAL_STATUS, self.ERROR_STATUS):

) - we don't track what kind of alert that was.

It would be easy to also track Service.last_alert_sent_overall_status and compare that to the current status to ensure that this issue doesn't occur.

Happy to merge anything that does this; I too think this is a big potential problem. However, it won't silence alerts until morning @blysik, just for ALERT_INTERVAL.
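
A minimal sketch of that suggestion, for illustration only (not a merged implementation; the severity ordering and status strings are assumptions):

```python
# Sketch of tracking the status that was last alerted on, so an escalation
# (e.g. WARNING -> CRITICAL) re-alerts immediately instead of waiting for
# ALERT_INTERVAL. Illustration only; names and ordering are assumptions.
from datetime import datetime, timedelta

SEVERITY = ["PASSING", "WARNING", "ERROR", "CRITICAL"]  # assumed, low to high

def should_alert(overall_status, last_alert_sent, last_alert_sent_overall_status,
                 now, alert_interval=timedelta(minutes=10)):
    if overall_status == "PASSING":
        return False
    if last_alert_sent is None:
        return True
    if now - last_alert_sent > alert_interval:
        return True
    # Re-alert early only if the service has got worse since the last alert.
    return (SEVERITY.index(overall_status) >
            SEVERITY.index(last_alert_sent_overall_status))

# A WARNING alert went out at 18:53; at 18:55 the service is CRITICAL,
# so a new alert is sent even though ALERT_INTERVAL has not elapsed.
print(should_alert("CRITICAL", datetime(2015, 10, 23, 18, 53), "WARNING",
                   now=datetime(2015, 10, 23, 18, 55)))  # True
```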

@blysik
Author

blysik commented Oct 23, 2015

Aha! ALERT_INTERVAL. Okay, so not as bad.
