Send recovery alerts #288

Open

jasonrhaas opened this issue Oct 26, 2015 · 37 comments

Comments

@jasonrhaas
Contributor

It is common for server monitoring tools to send a "resolve" message when the problem that triggered the alert has recovered. It would be nice if there were something in ElastAlert that would send a message out if the next query did not yield the same alert.

For example, if I get a "flatline" alert, I fix the problem, and ElastAlert no longer alerts on that issue, it should send a "recovery" message out to tell whoever is listening that the issue is resolved.

More specifically, I'm using the PagerDuty API to track ElastAlert alerts, and would like to make use of the "resolve events" API.

https://developer.pagerduty.com/documentation/integration/events/resolve

@zetsub0u
Contributor

Hi, I talked with the guys on IRC about something similar I wanted to implement: basically extending all the alerts that occur over a period of time (frequency, flatline, spike, etc.) with something like the EventWindow, but maybe an AlertWindow, which starts on the first alert and tracks it until the alert "expires" (i.e., no more alerts after x time).
I don't know if I'm going to be able to work on this anytime soon, but I just wanted to comment and give a 👍

@bolshoy
Contributor

bolshoy commented Oct 27, 2015

@jasonrhaas I have a simple implementation for this; it has been running for two weeks already and looks fine.
The logic is simple: if there is no alert for the rule in the current run, check in the writeback index whether we have an alert stored for the previous run of this rule. If there was an alert, try to resolve it. The resolver is implemented by the corresponding alerter. See https://github.com/rounds/elastalert/commit/efc449295636bddee913f4bd3c61d3a857e1d339
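
For readers skimming the thread, here is a minimal sketch of that logic, assuming a hypothetical `maybe_resolve()` helper and a per-alerter `resolve()` method; none of these names are actual ElastAlert internals, and the writeback query is only illustrative.

```python
# Hypothetical sketch of the described flow; the helper names and the
# writeback-index query are assumptions, not real ElastAlert APIs.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster


def maybe_resolve(rule, alerter, run_start, had_matches):
    """If the current run produced no matches, look for an alert written
    during the previous run of this rule and ask the alerter to resolve it."""
    if had_matches:
        return  # still firing, nothing to resolve

    previous_window = {
        "bool": {
            "filter": [
                {"term": {"rule_name": rule["name"]}},
                {"range": {"alert_time": {
                    "gte": (run_start - rule["run_every"]).isoformat(),
                    "lt": run_start.isoformat(),
                }}},
            ]
        }
    }
    res = es.search(index="elastalert_status", query=previous_window, size=1)
    for hit in res["hits"]["hits"]:
        # resolve() would be implemented by each alerter (PagerDuty, Sensu, ...)
        alerter.resolve(hit["_source"])
```

The per-alerter `resolve()` would be the important piece: for PagerDuty it would send a resolve event with the original incident key, for Sensu a resolving check result.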

@jasonrhaas
Contributor Author

Thanks @bolshoy for sharing. Something like this would be a really nice addition to ElastAlert. It could be another YAML option that is set per alert, like resolve_alert: true or something to that effect.

@bolshoy
Contributor

bolshoy commented Oct 27, 2015

@jasonrhaas right, this switch might be needed. In our case, alerts are sent to Sensu, and it always needs resolving. Hopefully I'll have some spare time soon and I'll try to create a proper PR.

@fiunchinho
Contributor

Any news on this?

@tomwganem

Would really like this to be implemented.

@eravion

eravion commented Sep 16, 2016

+1 :)

@iainlbc

iainlbc commented Sep 19, 2016

+1

@Mormaii

Mormaii commented Sep 21, 2016

+1

@bean5
Contributor

bean5 commented Sep 26, 2016

For flatline, I agree that this is entirely possible, and it is something I was looking to do. I would foresee, though, that for other rule types doing this becomes more complex, specifically when query_key is set.

I think one way of doing this is to allow any rule to specify a flatline rule whose time bounds must be after the first rule fires. But I don't understand all the internals of ElastAlert and am certainly still learning about ES. If anyone has a better way to do this, please chime in here.

@bean5
Contributor

bean5 commented Sep 26, 2016

Another way to do this would be to make rules dependent on, or blocked by, other rules. Or just define rules as being able to auto-resolve any other rule upon firing.

@bobbyhubbard

@bolshoy Did you make any progress on a pull? Maybe one of us could work on a pull based on your latest?

@bolshoy
Contributor

bolshoy commented Jan 24, 2017

@bobbyhubbard I stopped working on this altogether; we're using Prometheus instead.

@JC1738

JC1738 commented Feb 8, 2017

This would definitely be a nice feature. I thought about using a second rule that would act as the clearing alert, though that would be difficult to maintain.

@tkuther

tkuther commented Mar 7, 2017

I would love to see this, too. Currently I'm doing a flatline/frequency ping-pong with a command alert to Alerta.

@supernomad

So I just slammed into this brick wall myself, and had a thought about how this could be possible.

Essentially, ElastAlert triggers only when a match is made on the query. Could it also be told to trigger on the reverse, when no match is made? This would require being able to set fields to different values, for instance a status field in the post data of an HTTP POST or victorops_message_type in VictorOps. This would also obviously require a switch to turn the functionality on or off.

Any thoughts on the above?

@pblasquez
Contributor

pblasquez commented Aug 22, 2017

If the status were configurable for compatible outputs, e.g. for PagerDuty, setting a new variable 'pagerduty_status' to one of ['trigger', 'resolve', 'acknowledge'] with a default of 'trigger', it would at least cover cases where the status could be set explicitly by query with a separate rule.

I know it's not a global solution, but it would be welcome for the outputs where it is possible. It is already possible to do things this way for the JIRA output.
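
As a sketch of what that per-rule status could look like in a PagerDuty-style alerter, using the v1 Events API and the 'pagerduty_status' option name proposed here (which is not necessarily what was merged):

```python
# 'pagerduty_status' is the option name proposed in this thread, used as an
# assumption; the endpoint and payload fields are the PagerDuty v1 Events API.
import requests

PD_V1_EVENTS = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"


def send_pagerduty_event(rule, description, incident_key):
    payload = {
        "service_key": rule["pagerduty_service_key"],
        # one of 'trigger', 'resolve', 'acknowledge'; defaulting to 'trigger'
        # keeps existing rules behaving as before
        "event_type": rule.get("pagerduty_status", "trigger"),
        "description": description,
        "incident_key": incident_key,
    }
    resp = requests.post(PD_V1_EVENTS, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()
```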

@bean5
Contributor

bean5 commented Aug 22, 2017

@pblasquez: That would be a simple way to make it work. I agree that we should default to trigger for backwards compatibility. This is a low-cost way to achieve what this issue wants, although to some it may be a work-around rather than an actual feature.

The only catch would be to make sure that the correct PagerDuty API version is used, as I assume version one does not allow resolving. That is just an assumption, though.

Opening #1304 to do this.

@bean5
Contributor

bean5 commented Aug 24, 2017

@jasonrhaas: Do you think your idea should only apply to the flatline rule?

@jasonrhaas
Contributor Author

@bean5 My idea was to have it apply to any rule that triggers an alert. If you have used DataDog before, it is similar to that.

@bean5
Contributor

bean5 commented Aug 25, 2017

@pblasquez #1304 was accepted, so you should be able to do what you proposed. Although I wrote a test case for the code, I did not actually use it against PagerDuty, so it may be buggy. The PagerDuty API seemed to indicate that both their API versions should accept it. Let me know if it doesn't work for you.

@jasonrhaas I implemented what @pblasquez proposed. It works for PagerDuty use-cases, which is what you mentioned in your first post here. That being said, you had the other idea of resolve_alert: true. I can definitely understand how your idea applies to flatline since resolving/triggering in such a case makes sense. Same goes for Any rules. But for my typical use cases for rules like whitelist/blacklist, I'd easily run into cases where there is an offending document that occurs just once. For such use cases, not occurring anymore does not mean resolved--it just means that it only occurred once, but still requires RCA. I suppose there could be cases where no longer occurring implies resolution, but I would use flatline for those. (I would make flatline resolve instead of the heartbeat case where flatline = trigger.) Note: I have not used DataDog before.

Perhaps the best way for me to understand is to ask:

  • Do you have a case where you would use an alert (excluding Any/flatline) that you would like to auto-resolve?
  • Can this issue be resolved? Do you think we should move your resolve_alert: true to another issue along with an example use case?

@Qmando
Member

Qmando commented Aug 25, 2017

I've had a couple thoughts about this for a while, here's what I imagine:

  • Alerters will need to implement a new resolve_alert function; it would probably be able to optionally take a query_key value to differentiate alerts from the same rule.

  • Each rule type will need to define a function to determine if an alert can be resolved. This function could maybe take a time differential from the last alert for that rule/query_key. For some rule types this is easy: for flatline, if the event window is above the threshold, resolve; for frequency, if the event window is below num_events and maybe timeframe has elapsed since the last alert, resolve. We might be able to make a default that's just "if we've gone some amount of time without any alert, send a resolve" to work with other, harder-to-reason-about rule types.

  • For added customization, maybe we could make an alerter that just resolves other alerts, so that you could create a custom resolve rule.

  • Alerters without a defined resolve function would be no-ops, and it would probably default to off even for those that do implement one. For plain alerters like email, you might be able to define a "resolve_alert_text" or something. (A rough sketch of these interfaces follows below.)
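
An entirely hypothetical sketch of those interfaces (neither resolve_alert nor should_resolve exists in ElastAlert today):

```python
# Hypothetical interfaces only; these method names are not part of ElastAlert.
from datetime import timedelta


class Alerter:
    def alert(self, matches):
        raise NotImplementedError

    def resolve_alert(self, query_key_value=None):
        """Resolve a previously sent alert; a no-op unless the alerter overrides it."""
        pass


class FrequencyRule:
    def __init__(self, num_events, timeframe):
        self.num_events = num_events
        self.timeframe = timeframe

    def should_resolve(self, current_count, time_since_last_alert):
        """Resolve once the window has dropped below num_events and a full
        timeframe has elapsed since the last alert, as suggested above."""
        return (current_count < self.num_events
                and time_since_last_alert >= self.timeframe)


# Example: a rule alerting on 50 events/hour resolves after a quiet 2 hours
# with only 3 events in the window.
rule = FrequencyRule(num_events=50, timeframe=timedelta(hours=1))
print(rule.should_resolve(current_count=3,
                          time_since_last_alert=timedelta(hours=2)))  # True
```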

@bean5
Contributor

bean5 commented Aug 25, 2017

So you have already put a decent amount of thought into auto-resolving in general. Given the size of it, I think it merits its own issue. This one can be closed (we have the PagerDuty work-around in place), right?

I'm not quite sure what you mean by "differentiate alerts from the same rule" because in this project a rule triggers an alert, a 1-1 mapping.

@Qmando
Member

Qmando commented Aug 25, 2017

By that, I basically meant if you are using query_key, just like how silence stashes are created per query_key value. For example, take a flatline alert with query_key on hostname: if host1 goes flat, then host2 goes flat, but only host2 comes back, you don't want to resolve the host1 flatline.
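
A toy sketch of that bookkeeping, keyed the same way silence stashes are (rule name plus query_key value); all names here are invented for illustration:

```python
# Toy per-query_key state tracker; every name here is invented for illustration.
active_alerts = {}  # "rule_name.query_key_value" -> payload sent to the alerter


def alert_key(rule_name, query_key_value=None):
    return f"{rule_name}.{query_key_value}" if query_key_value else rule_name


def on_alert(rule_name, query_key_value, payload):
    active_alerts[alert_key(rule_name, query_key_value)] = payload


def on_recovery(rule_name, query_key_value, alerter):
    """Resolve only the key that recovered (host2), leaving host1 still firing."""
    payload = active_alerts.pop(alert_key(rule_name, query_key_value), None)
    if payload is not None:
        alerter.resolve(payload)
```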

@bean5
Contributor

bean5 commented Aug 25, 2017

Oh, I agree on that. Definitely. For non-PagerDuty alerters, that may be sufficient, and for a first version it should be.

When it comes to PagerDuty, unless you differentiate incidents by their title, they are considered the same incident. At least I think it is by title. So even if ElastAlert considers them two events, unless precautions are appropriately taken (differentiating by title), resolving one incident will actually resolve the other. Perhaps we could append query_key to the title of PagerDuty events automatically, or when a rule indicates to do so? I think this project has a way to do that manually, via arguments, but it would be nice to have it done automatically as a matter of process. This is a PD-related consideration only, so leaving it for follow-up work seems appropriate.

@Qmando
Member

Qmando commented Aug 25, 2017

As of b301f2a, you can set a custom PagerDuty incident title, possibly using query_key. The default doesn't use query_key, though, as the email and JIRA subjects do; it probably should.

@pblasquez
Contributor

Yes, you set 'pagerduty_incident_key' using 'pagerduty_incident_key_args'.

It is up to the user to keep things sufficiently specific so they can target the same incident key for resolution.
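
For example, a hypothetical trigger/resolve rule pair (shown as loaded Python dicts rather than YAML rule files) would need identical incident-key settings so the resolve lands on the right incident; the exact substitution syntax for the args should be checked against the ElastAlert docs:

```python
# Hypothetical pair of rules; option values are illustrative and the key
# substitution syntax should be verified against the ElastAlert docs.
trigger_rule = {
    "name": "disk_full",
    "pagerduty_status": "trigger",                # option proposed in this thread
    "pagerduty_incident_key": "disk_full--{0}",
    "pagerduty_incident_key_args": ["hostname"],
}

resolve_rule = {
    "name": "disk_full_recovered",
    "pagerduty_status": "resolve",
    # identical key template and args, so the same host maps to the same incident
    "pagerduty_incident_key": "disk_full--{0}",
    "pagerduty_incident_key_args": ["hostname"],
}
```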

@bean5
Contributor

bean5 commented Aug 26, 2017

Awesome, so the support is there for PagerDuty; I knew that at one time. What remains, if anything, is to add support for 'auto_resolve', right? Should we close this ticket and open one for that, or leave this open? I can go either way.

@meltingrobot

Would still love to get alerts closed automagically with VictorOps.

@Atem18

Atem18 commented Jun 14, 2019

Hi, any news about this? Should we create the recovery alert manually?

@Qmando
Member

Qmando commented Jun 14, 2019

@Atem18

I probably wouldn't get your hopes up too much for this. It's a fair amount of work to implement in a generic way, and we unfortunately aren't really doing work on new features right now.

For some alert types, you can implement this by creating a second alert which is an inverse of the original. For example, with the jira alerter, you can transition issues to closed. Other types might not be so easy.

If you give some specifics I might be able to help guide you.

@ahbrown1

If this feature is still dead in the water (generic alert recovery), I may have to implement something super hacky and ugly and stuff all the logic in an enhancement module, with the ElastAlert config file doing the upfront stuff, like just handling the index and query_key.
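
For anyone taking that route, the match-enhancement hook itself looks roughly like this. BaseEnhancement and process(match) are real ElastAlert extension points, but enhancements only run when a match fires, so the actual "no longer matching, send resolve" bookkeeping would still have to live in your own external code; everything stored below is hypothetical.

```python
# BaseEnhancement/process(match) are ElastAlert's documented extension points;
# what this enhancement records is a hypothetical example.
from elastalert.enhancements import BaseEnhancement


class TrackForRecovery(BaseEnhancement):
    """Stamp each match with enough context (rule name, query_key value) that
    an external resolver can later send the matching 'resolve' event."""

    def process(self, match):
        query_key = self.rule.get("query_key")
        match["recovery_tracking"] = {
            "rule_name": self.rule["name"],
            "query_key_value": match.get(query_key) if query_key else None,
        }
```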

@Qmando
Member

Qmando commented Sep 16, 2019

Someone implemented this for a few alerters: https://github.com/Yelp/elastalert/pull/2446/files

I haven't looked through all of it, but maybe it's a good place to start.

@nsano-rururu
Contributor

I'm taking inventory of open issues. I think this problem has been solved. If it has been resolved, please close it.

@meltingrobot

@nsano-rururu I read through the changelog, and I do not see anywhere that a resolve-alert feature was added. I do not think this was ever fixed.

@diogokiss

Any news on this issue? It would be really helpful to have this feature implemented. :-/

@aclowkey

Perhaps this issue should be moved to https://github.com/jertel/elastalert2, since this repo is no longer maintained?
