Alert muting #46034

Closed
mikecote opened this issue Sep 18, 2019 · 14 comments · Fixed by #43712

Comments

@mikecote
Contributor

mikecote commented Sep 18, 2019

Muting is designed to opt out of executing the actions for a given alert or alert instance. There's been discussion about how exactly this should work, so I have created this discussion issue for tracking purposes.

Questions

  1. When a user mutes an alert or alert instance, does the muting ever expire?
  2. Does it make sense to add alert level and alert instance level muting? Alert level would behave as a "mute all" including new instances that would exist in the future.
  3. When a muted alert instance stops firing, should the instance be unmuted?
  4. What happens to muted alert instances when unmuting an alert? If it works like unmute all, should we clear out all the individually muted instances?
@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@mikecote
Contributor Author

cc @peterschretlen @alexfrancoeur

@pmuellr
Member

pmuellr commented Sep 18, 2019

Some quick takes:

When a user mutes an alert or alert instance, does the muting ever expire?

Seems easy to allow an alert muting to never expire, but allowing alert-instance muting to never expire seems potentially problematic, depending on how alert-type implementors use alert instances. E.g., I believe alert instance data is deleted if, on a turn of the alert-type function, no scheduleAction() is called for an alert instance. So they could get deleted, and then potentially re-appear later, but would have lost all their state, and look like new alert instances. But if we keep all of them around, how do we clean them up, because they could easily go unbounded?

Is there a difference between a mute with an expiration, and a throttle? What about "snooze"?

Does it make sense to add alert level and alert instance level muting? Alert level would behave as a "mute all" including new instances that would exist in the future.

Ya, I think we need both. Long term. I could totally see not having alert-instance level muting till a little later; if we at least had alert level muting.

hmmm ... after writing down ^^^ I'm realizing we need to have some pretty crisp (and hopefully concise) human-readable description of all these things, alerts/alert-instances, muting, throttling, snoozing, describing how it works at a high level anyway. Maybe we do already?

@mikecote
Contributor Author

So they could get deleted, and then potentially re-appear later, but would have lost all their state

My current design (so far in #43712) saves the muted alert instance ids with the alert in an array. With this approach, the problem doesn't arise when scheduleAction() isn't called: the array exists as long as the alert exists.
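
To make the idea concrete, here is a minimal TypeScript sketch of keeping muted instance ids on the alert itself. The `MutedAlert` shape, the `mutedInstanceIds` field name, and the helper functions are assumptions for illustration, not the actual code in #43712.

```typescript
// Hypothetical alert shape; the field names are illustrative only.
interface MutedAlert {
  id: string;
  mutedInstanceIds: string[];
}

// Add an instance id to the muted list, ignoring duplicates.
function muteInstance(alert: MutedAlert, instanceId: string): MutedAlert {
  if (alert.mutedInstanceIds.indexOf(instanceId) !== -1) return alert;
  return { ...alert, mutedInstanceIds: [...alert.mutedInstanceIds, instanceId] };
}

// Remove an instance id from the muted list.
function unmuteInstance(alert: MutedAlert, instanceId: string): MutedAlert {
  return {
    ...alert,
    mutedInstanceIds: alert.mutedInstanceIds.filter((id) => id !== instanceId),
  };
}

// Actions run only for instances that are not muted. The lookup uses the alert,
// not per-instance state, so it still works after that state has been deleted.
function shouldExecuteActions(alert: MutedAlert, instanceId: string): boolean {
  return alert.mutedInstanceIds.indexOf(instanceId) === -1;
}
```

Because the list lives on the alert, an instance that stops firing and later reappears would still be treated as muted in this sketch.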

But if we keep all of them around, how do we clean them up, because they could easily go unbounded?

I'm not sure on this one. So far the design would be to unmute via UI otherwise the array will keep growing. I'm curious what others think on this point.

Is there a difference between a mute with an expiration, and a throttle? What about "snooze"?

Snooze and mute are about the same. From what I hear snooze will just be "mute for x period".

The difference between mute and throttle would be: 1) mute applies even if group changes. 2) mute is a throttle without expiry (so far). 3) the "notifications" view would still show events for muted alerts. Just the actions wouldn't be executed.
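
A rough TypeScript sketch of how those differences could play out when deciding whether to run actions. Everything here is hypothetical: the `ActionDecisionInput` fields and the rule that a group change bypasses the throttle are assumptions inferred from point 1, not the real implementation.

```typescript
// Hypothetical inputs to the decision; all names are illustrative.
interface ActionDecisionInput {
  muted: boolean;
  throttleMs: number | null; // null means no throttle is configured
  lastExecutedAt: number | null; // epoch ms of the last action execution
  lastGroup: string | null;
  currentGroup: string;
  now: number; // epoch ms
}

function shouldRunActions(input: ActionDecisionInput): boolean {
  // Mute has no expiry and applies even if the group changes.
  if (input.muted) return false;
  // No throttle configured, or nothing executed yet: run the actions.
  if (input.throttleMs === null || input.lastExecutedAt === null) return true;
  // Assumption: a group change bypasses the throttle.
  if (input.lastGroup !== input.currentGroup) return true;
  // Otherwise wait until the throttle period has elapsed.
  return input.now - input.lastExecutedAt >= input.throttleMs;
}
```

In this sketch the "notifications" view is untouched; only the action execution is suppressed.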

I could totally see not having alert-instance level muting till a little later; if we at least had alert level muting.

This was requested by the solution teams (can't recall which one, maybe stack monitoring?) where they want to mute a problematic server out of the cluster until it's fixed.

I'm realizing we need to have some pretty crisp (and hopefully concise) human-readable description of all these things, alerts/alert-instances, muting, throttling, snoozing, describing how it works at a high level anyway. Maybe we do already?

We do! I will send you the link to the glossary.

@pmuellr
Member

pmuellr commented Sep 18, 2019

But if we keep all of them around, how do we clean them up, because they could easily go unbounded?

I'm not sure on this one. So far the design would be to unmute via UI otherwise the array will keep growing. I'm curious what others think on this point.

I think Clint mentioned "mute with a timeout", which I think would make sense for cleaning up the garbage automatically. We'd stick a time in there with the alertInstanceId, and can clean them up the next time they're read/written. So, alert instances would only have a snooze (mute with time), not a mute.
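
A small TypeScript sketch of that idea, under stated assumptions: the `SnoozedInstance` shape, the `snoozedUntil` field, and both helpers are hypothetical names, and timestamps are epoch milliseconds.

```typescript
// Hypothetical entry: an instance id plus the time (epoch ms) its snooze ends.
interface SnoozedInstance {
  alertInstanceId: string;
  snoozedUntil: number;
}

// Drop expired entries; intended to run whenever the alert is read or written.
function pruneExpiredSnoozes(entries: SnoozedInstance[], now: number): SnoozedInstance[] {
  return entries.filter((entry) => entry.snoozedUntil > now);
}

// An instance counts as snoozed only while an unexpired entry exists for it.
function isSnoozed(entries: SnoozedInstance[], alertInstanceId: string, now: number): boolean {
  return entries.some(
    (entry) => entry.alertInstanceId === alertInstanceId && entry.snoozedUntil > now
  );
}
```

Pruning on read/write keeps the list from growing without a separate cleanup task, at the cost of stale entries lingering on alerts that are never touched.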

I could totally see not having alert-instance level muting till a little later; if we at least had alert level muting.

This was requested by the solution teams (can't recall which one, maybe stack monitoring?) where they want to mute a problematic server out of the cluster until it's fixed.

Ya, that's a great use case! thx for fixing that for me :-)

@peterschretlen
Contributor

peterschretlen commented Sep 19, 2019

I think muting is a state, and changing the state over time should be considered separately; it fits into enhancements like:

  • snoozing (mute for x period)
  • maintenance window (mute between time X and time Y)

As for alert-level or alert-instance-level, my opinion is we need to support both; there are use cases for each:

  • The desire to mute (and other operations) at the instance level is a recurring comment/request with watches, so I think we should include support for it.
  • At the alert level, sometimes an alert is just too noisy when first deployed, or something else changes that makes it temporarily noisy, and you want to mute the whole thing.

Having an unbounded list of muted instances is a concern. If it's really problematic, we could place limits on the number of instances that can be muted. I think the expected usage would be you have a handful of instances muted that you are working on. At a certain point enough things are firing you mute the entire alert.

Through testing we could probably find a reasonable number to limit it to. If there are 1M instances in the muted list, it feels like the system isn't being used correctly.
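
As a sketch of what such a limit might look like in TypeScript: the default of 1000 is an arbitrary placeholder, not a tested number, and `muteWithLimit` and `MuteAttempt` are hypothetical names. The attempt is rejected with an explicit reason rather than silently dropping entries.

```typescript
// Arbitrary placeholder; a real value would come out of testing.
const DEFAULT_MUTED_INSTANCE_LIMIT = 1000;

// Hypothetical result type: the (possibly unchanged) list plus a rejection reason.
interface MuteAttempt {
  accepted: boolean;
  mutedInstanceIds: string[];
  reason?: string;
}

function muteWithLimit(
  mutedInstanceIds: string[],
  instanceId: string,
  limit: number = DEFAULT_MUTED_INSTANCE_LIMIT
): MuteAttempt {
  // Already muted: nothing to add, so the limit is not relevant.
  if (mutedInstanceIds.indexOf(instanceId) !== -1) {
    return { accepted: true, mutedInstanceIds };
  }
  // At the cap: reject explicitly instead of culling existing entries.
  if (mutedInstanceIds.length >= limit) {
    return {
      accepted: false,
      mutedInstanceIds,
      reason: `Limit of ${limit} muted instances reached; consider muting the whole alert.`,
    };
  }
  return { accepted: true, mutedInstanceIds: [...mutedInstanceIds, instanceId] };
}
```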

@pmuellr
Member

pmuellr commented Sep 20, 2019

Having an unbounded list of muted instances is a concern. If it's really problematic, we could place limits on the number of instances that can be muted.

I think that would be very confusing, difficult to diagnose when you crossed the line. I would much prefer to say alert instances can only be snoozed (muted for some fixed amount of time) and not indefinitely muted. There's some potential for abuse there, such as setting snoozes on alert instances for a century, but at least we'd have an explicit record they did that (the date would end up in the alert document). It would be difficult to diagnose issues around culled instances because of a limit: some clues would only show up in the Kibana logs (assuming hitting limits would end up logging some info).

@mikecote
Contributor Author

mikecote commented Sep 20, 2019

I've added a 3rd question that goes hand in hand with the unbounded concerns. I'm not sure how we can bind it to the alert instance state if it clears up after not firing. Unless we don't clean them up because they're muted.

  3. When a muted alert instance stops firing, should the instance be unmuted?

@peterschretlen
Contributor

I think that would be very confusing, difficult to diagnose when you crossed the line.

If the system degrades or falls over when crossing the line though - wouldn't that also be confusing and hard to diagnose? Snoozing might help but still leaves the system vulnerable in my opinion.

Perhaps we figure out what the limits are first, then figure out the best way to address them? Are we talking 100 instances, 1k, 100k? Whether it's changing the operation from mute to snooze, or imposing some hard cap, or something else, it can likely be done at a later point rather than trying to build it in up front.

Alert instances in general, and the state and operations on them, feel like an area where we'll hit practical limits pretty quickly, so maybe this warrants a separate discussion. Personally, I'd much rather take an alerting system into production that has safeguards in place and enforces limits. GCP alerting limits, for example, have an alert-instance equivalent: a "Simultaneously open incidents per alerting policy" cap of 5000.

@peterschretlen
Contributor

  3. When a muted alert instance stops firing, should the instance be unmuted?

I would say no. If you've set the state on an instance it should stay until you change the state.

For example, on a noisy threshold alert where the value fluctuates above and below the threshold, you'd want the muting state to persist because you know the alert instance is going to fire again.

@mikecote
Contributor Author

mikecote commented Sep 23, 2019

Added a 4th question

  4. What happens to muted alert instances when unmuting an alert? If it works like unmute all, should we clear out all the individually muted instances?

@bmcconaghy
Contributor

I think what makes sense is that it clears all muting state, similar to what happens in a table when I select a couple of rows, I click select all, then I unclick select all.

@peterschretlen
Contributor

I think what makes sense is that it clears all muting state, similar to what happens in a table when I select a couple of rows, I click select all, then I unclick select all.

++, and maybe we consider calling it "mute-all" and "unmute-all" at the alert level? That clarifies the expected behaviour.

@mikecote
Contributor Author

and maybe we consider calling it "mute-all" and "unmute-all" at the alert level? That clarifies the expected behaviour.

I like that. I can rename the APIs (and all terminology) to be more explicit too: /_mute_all and /_unmute_all.
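
For reference, a minimal TypeScript sketch of the "select all / unselect all" semantics agreed on in this thread. The `AlertMuteState` shape and the function names are assumptions for illustration; only the `/_mute_all` and `/_unmute_all` names come from the discussion, and no route handling is shown.

```typescript
// Hypothetical mute state kept on the alert.
interface AlertMuteState {
  muteAll: boolean;
  mutedInstanceIds: string[];
}

// What a /_mute_all call might do: mute every instance, including future ones.
function applyMuteAll(state: AlertMuteState): AlertMuteState {
  return { ...state, muteAll: true };
}

// What a /_unmute_all call might do: clear all muting state,
// including the individually muted instances.
function applyUnmuteAll(_state: AlertMuteState): AlertMuteState {
  return { muteAll: false, mutedInstanceIds: [] };
}

// An instance is muted if the whole alert is muted or it is listed individually.
function isInstanceMuted(state: AlertMuteState, instanceId: string): boolean {
  return state.muteAll || state.mutedInstanceIds.indexOf(instanceId) !== -1;
}
```

With these semantics, muting a couple of instances, then muting all, then unmuting all leaves nothing muted.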

5 participants