Alert muting #46034

mikecote · 2019-09-18T15:59:38Z

Muting is designed to opt out of executing the actions for a given alert or alert instance. There's been discussion about how exactly this should work, so I have created this discuss issue for tracking purposes.

Questions

1. When a user mutes an alert or alert instance, does the muting ever expire?
2. Does it make sense to add alert level and alert instance level muting? Alert level would behave as a "mute all" including new instances that would exist in the future.
3. When a muted alert instance stops firing, should the instance be unmuted?
4. What happens to muted alert instances when unmuting an alert? If it works like unmute all, we should clear out all the individually muted instances?

elasticmachine · 2019-09-18T15:59:39Z

Pinging @elastic/kibana-stack-services

mikecote · 2019-09-18T16:00:31Z

cc @peterschretlen @alexfrancoeur

pmuellr · 2019-09-18T16:45:51Z

Some quick takes:

When a user mutes an alert or alert instance, does the muting ever expire?

Seems easy to allow an alert muting to never expire, but allowing alert-instance muting to never expire seems potentially problematic, depending on how alert-type implementors use alert instances. E.g., I believe alert instance data is deleted if, on a turn of the alert-type function, no scheduleAction() is called for that alert instance. So they could get deleted, and then potentially re-appear later, but would have lost all their state and look like new alert instances. But if we keep all of them around, how do we clean them up, because they could easily grow unbounded?

Is there a difference between a mute with an expiration, and a throttle? What about "snooze"?

Does it make sense to add alert level and alert instance level muting? Alert level would behave as a "mute all" including new instances that would exist in the future.

Ya, I think we need both, long term. I could totally see not having alert-instance level muting till a little later, if we at least had alert level muting.

hmmm ... after writing down ^^^ I'm realizing we need to have some pretty crisp (and hopefully concise) human-readable description of all these things, alerts/alert-instances, muting, throttling, snoozing, describing how it works at a high level anyway. Maybe we do already?

mikecote · 2019-09-18T18:55:36Z

So they could get deleted, and then potentially re-appear later, but would have lost all their state

In the design so far (#43712), the muted alert instance ids get saved with the alert in an array. With this approach, the problem doesn't arise when scheduleAction() isn't called: the array will exist as long as the alert exists.
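To make the array-based design concrete, here is a minimal sketch. The type and function names (AlertMuteState, muteInstance, unmuteInstance, shouldExecuteActions) and the field names are assumptions for illustration only; they are not taken from #43712.

```typescript
// Hypothetical shape: mute state stored on the alert itself, so it survives
// even when an instance's own state is dropped after it stops firing.
interface AlertMuteState {
  muteAll: boolean;
  mutedInstanceIds: string[];
}

// Add an instance id to the muted list (no duplicates).
function muteInstance(state: AlertMuteState, instanceId: string): AlertMuteState {
  if (state.mutedInstanceIds.indexOf(instanceId) !== -1) {
    return state;
  }
  return {
    muteAll: state.muteAll,
    mutedInstanceIds: state.mutedInstanceIds.concat(instanceId),
  };
}

// Remove an instance id from the muted list.
function unmuteInstance(state: AlertMuteState, instanceId: string): AlertMuteState {
  return {
    muteAll: state.muteAll,
    mutedInstanceIds: state.mutedInstanceIds.filter((id) => id !== instanceId),
  };
}

// Actions run only when neither the alert nor the instance is muted.
function shouldExecuteActions(state: AlertMuteState, instanceId: string): boolean {
  return !state.muteAll && state.mutedInstanceIds.indexOf(instanceId) === -1;
}
```

Because the list lives on the alert rather than on the instance, nothing here depends on whether scheduleAction() was called on a given turn.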

But if we keep all of them around, how do we clean them up, because they could easily grow unbounded?

I'm not sure on this one. So far the design would be to unmute via the UI; otherwise the array will keep growing. I'm curious what others think on this point.

Is there a difference between a mute with an expiration, and a throttle? What about "snooze"?

Snooze and mute are about the same. From what I hear snooze will just be "mute for x period".

The difference between mute and throttle would be: 1) mute applies even if group changes. 2) mute is a throttle without expiry (so far). 3) the "notifications" view would still show events for muted alerts. Just the actions wouldn't be executed.
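A small sketch of the first two differences, under stated assumptions: "group" is taken to mean the action group an instance last scheduled, and a throttle is modeled as a time window that resets when that group changes. LastActionRun, isThrottled, and shouldRunActions are hypothetical names, not the actual implementation.

```typescript
// Hypothetical record of the last time actions ran for an alert instance.
interface LastActionRun {
  group: string; // group that was scheduled at that time
  at: number; // epoch milliseconds
}

// Throttle (as assumed here): suppress only while the group is unchanged
// and the throttle window has not yet elapsed.
function isThrottled(
  last: LastActionRun | undefined,
  group: string,
  throttleMs: number,
  now: number
): boolean {
  return last !== undefined && last.group === group && now - last.at < throttleMs;
}

// Mute: suppress regardless of group changes or elapsed time.
function shouldRunActions(
  muted: boolean,
  last: LastActionRun | undefined,
  group: string,
  throttleMs: number,
  now: number
): boolean {
  if (muted) {
    return false;
  }
  return !isThrottled(last, group, throttleMs, now);
}
```

The third difference (events still appearing in the "notifications" view) is not modeled; the sketch only decides whether actions execute.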

I could totally see not having alert-instance level muting till a little later; if we at least had alert level muting.

This was requested by the solution teams (can't recall which one, maybe stack monitoring?) where they want to mute a problematic server out of the cluster until it's fixed.

I'm realizing we need to have some pretty crisp (and hopefully concise) human-readable description of all these things, alerts/alert-instances, muting, throttling, snoozing, describing how it works at a high level anyway. Maybe we do already?

We do! I will send you the link to the glossary.

pmuellr · 2019-09-18T22:12:19Z

But if we keep all of them around, how do we clean them up, because they could easily grow unbounded?

I'm not sure on this one. So far the design would be to unmute via the UI; otherwise the array will keep growing. I'm curious what others think on this point.

I think Clint mentioned "mute with a timeout", which I think would make sense for cleaning up the garbage automatically. We'd stick a time in there with the alertInstanceId, and can clean them up the next time they're read/written. So, alert instances would only have a snooze (mute with time), not a mute.
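A rough sketch of that idea, assuming the expiry is stored as epoch milliseconds keyed by alertInstanceId. SnoozedInstances, pruneExpired, snoozeInstance, and isSnoozed are hypothetical names used only for illustration.

```typescript
// Hypothetical storage: alertInstanceId -> time (epoch ms) the snooze ends.
type SnoozedInstances = Record<string, number>;

// Drop every entry whose snooze has already ended.
function pruneExpired(snoozed: SnoozedInstances, now: number): SnoozedInstances {
  const kept: SnoozedInstances = {};
  for (const id of Object.keys(snoozed)) {
    const until = snoozed[id];
    if (until !== undefined && until > now) {
      kept[id] = until;
    }
  }
  return kept;
}

// Write path: clean up expired entries, then record the new snooze.
function snoozeInstance(
  snoozed: SnoozedInstances,
  instanceId: string,
  durationMs: number,
  now: number
): SnoozedInstances {
  const next = pruneExpired(snoozed, now);
  next[instanceId] = now + durationMs;
  return next;
}

// Read path: an instance is snoozed only until its recorded time.
function isSnoozed(snoozed: SnoozedInstances, instanceId: string, now: number): boolean {
  const until = snoozed[instanceId];
  return until !== undefined && until > now;
}
```

With this shape the list cleans itself up as a side effect of normal reads and writes, so no separate garbage-collection task is needed.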

I could totally see not having alert-instance level muting till a little later; if we at least had alert level muting.

This was requested by the solution teams (can't recall which one, maybe stack monitoring?) where they want to mute a problematic server out of the cluster until it's fixed.

Ya, that's a great use case! thx for fixing that for me :-)

peterschretlen · 2019-09-19T16:56:36Z

I think muting is a state, and changing the state over time should be considered separately; it fits into enhancements like:

snoozing (mute for x period)
maintenance window (mute between time X and time Y)

As for alert-level or alert-instance-level, my opinion is we need to support both; there are use cases for each:

The desire to mute (and other operations) at the instance level is a recurring comment/request with watches, so I think we should include support for it.
At the alert level, sometimes an alert is just too noisy when first deployed, or something else changes that makes an alert temporarily noisy, and you want to mute the whole thing.

Having an unbounded list of muted instances is a concern. If it's really problematic, we could place limits on the number of instances that can be muted. I think the expected usage would be you have a handful of instances muted that you are working on. At a certain point, enough things are firing that you mute the entire alert.

Through testing we could probably find some reasonable number to limit it to. If there are 1M instances in the muted list, it feels like the system isn't being used correctly.
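For illustration, a cap on the muted list could look like the sketch below. The limit of 1000 is a placeholder, since no number was settled on, and MAX_MUTED_INSTANCES, MuteAttempt, and muteWithLimit are hypothetical names.

```typescript
// Placeholder value only; a real limit would come out of testing.
const MAX_MUTED_INSTANCES = 1000;

// Result of a mute attempt; `reason` is set only when the attempt is rejected.
interface MuteAttempt {
  ok: boolean;
  mutedInstanceIds: string[];
  reason?: string;
}

function muteWithLimit(
  mutedInstanceIds: string[],
  instanceId: string,
  limit: number = MAX_MUTED_INSTANCES
): MuteAttempt {
  // Muting an already-muted instance is a no-op and never hits the limit.
  if (mutedInstanceIds.indexOf(instanceId) !== -1) {
    return { ok: true, mutedInstanceIds: mutedInstanceIds };
  }
  // Reject explicitly instead of silently dropping older entries.
  if (mutedInstanceIds.length >= limit) {
    return {
      ok: false,
      mutedInstanceIds: mutedInstanceIds,
      reason: `limit of ${limit} muted instances reached; consider muting the whole alert`,
    };
  }
  return { ok: true, mutedInstanceIds: mutedInstanceIds.concat(instanceId) };
}
```

Rejecting the request with a reason, rather than culling entries, keeps the point where the limit was crossed visible to the caller.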

pmuellr · 2019-09-20T01:18:25Z

Having an unbounded list of muted instances is a concern. If it's really problematic, we could place limits on the number of instances that can be muted.

I think that would be very confusing, difficult to diagnose when you crossed the line. Would much prefer to say alert instances can only be snoozed (muted for some fixed amount of time) and not indefinitely muted. There's some potential for abuse there, such as setting snoozes on alert instances for a century, but at least we'd have an explicit record they did that (the date would end up in the alert document). It would be difficult to diagnose issues around instances culled because of a limit; some clues would only show up in the kibana logs (assuming hitting limits would end up logging some info).

mikecote · 2019-09-20T11:43:22Z

I've added a 3rd question that goes hand in hand with the unbounded concerns. I'm not sure how we can bind it to the alert instance state if it clears up after not firing. Unless we don't clean them up because they're muted.

When a muted alert instance stops firing, should the instance be unmuted?

peterschretlen · 2019-09-20T13:42:31Z

I think that would be very confusing, difficult to diagnose when you crossed the line.

If the system degrades or falls over when crossing the line though - wouldn't that also be confusing and hard to diagnose? Snoozing might help but still leaves the system vulnerable in my opinion.

Perhaps we figure out what the limits are first, then figure out the best way to address them? Are we talking 100 instances, 1k, 100k? Whether it's changing the operation from mute to snooze, or imposing some hard cap, or something else, it can likely be done at a later point rather than trying to build it in up front.

Alert instances in general, and the state/operations on them, feel like an area where we'll hit practical limits pretty quickly, so maybe it warrants a separate discussion. Personally I'd much rather take an alerting system into production that has safeguards in place and enforces limits. GCP alerting limits, for example, have an alert-instance equivalent: a "Simultaneously open incidents per alerting policy" cap of 5000.

peterschretlen · 2019-09-20T13:51:25Z

When a muted alert instance stops firing, should the instance be unmuted?

I would say no. If you've set the state on an instance it should stay until you change the state.

For example, on a noisy threshold alert where the value fluctuates above and below the threshold, you'd want the muting state to persist because you know the alert instance is going to fire again.

mikecote · 2019-09-23T11:58:01Z

Added a 4th question

What happens to muted alert instances when unmuting an alert? If it works like unmute all, we should clear out all the individually muted instances?

bmcconaghy · 2019-09-23T18:35:33Z

I think what makes sense is that it clears all muting state, similar to what happens in a table when I select a couple of rows, I click select all, then I unclick select all.

peterschretlen · 2019-09-24T19:07:17Z

I think what makes sense is that it clears all muting state, similar to what happens in a table when I select a couple of rows, I click select all, then I unclick select all.

++, and maybe we consider calling it "mute-all" and "unmute-all" at the alert level? That clarifies the expected behaviour.

mikecote · 2019-09-24T19:17:42Z

and maybe we consider calling it "mute-all" and "unmute-all" at the alert level? That clarifies the expected behaviour.

I like that. I can rename the APIs (and all terminology) to be more explicit too: /_mute_all and /_unmute_all.
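The "select all / unselect all" behaviour agreed on above can be sketched as follows. MuteAllState, muteAllInstances, unmuteAllInstances, and isMuted are hypothetical names; leaving the per-instance list untouched on mute-all is an assumption, since the thread only specifies what unmute-all does.

```typescript
// Hypothetical mute state kept on the alert.
interface MuteAllState {
  muteAll: boolean;
  mutedInstanceIds: string[];
}

// Mute-all: every instance, including future ones, counts as muted.
// Assumption: the individually muted ids are left as they are.
function muteAllInstances(state: MuteAllState): MuteAllState {
  return { muteAll: true, mutedInstanceIds: state.mutedInstanceIds };
}

// Unmute-all: clears all muting state, including individually muted ids.
function unmuteAllInstances(_state: MuteAllState): MuteAllState {
  return { muteAll: false, mutedInstanceIds: [] };
}

function isMuted(state: MuteAllState, instanceId: string): boolean {
  return state.muteAll || state.mutedInstanceIds.indexOf(instanceId) !== -1;
}
```

After unmute-all, an instance that had been muted individually fires its actions again, which matches the table analogy of selecting a few rows, clicking select all, then unclicking it.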

mikecote added discuss Feature:Alerting Team:Stack Services labels Sep 18, 2019

peterschretlen mentioned this issue Sep 20, 2019

Alert muting & throttling #40023

Closed

mikecote mentioned this issue Sep 24, 2019

Add muting support for alerts #43712

Merged

mikecote closed this as completed in #43712 Sep 30, 2019
