Send recovery alerts #288

Open

jasonrhaas opened this issue Oct 26, 2015 · 37 comments

Comments

@jasonrhaas
Contributor

It is common for server monitoring tools to send a "resolve" message when the problem that triggered the alert has recovered. It would be nice if there were something in ElastAlert that would send a message out if the next query did not yield the same alert.

For example, if I get a "flatline" alert, I fix the problem, and ElastAlert no longer alerts on that issue, it should send a "recovery" message out to tell whoever is listening that the issue is resolved.

More specifically, I'm using the PagerDuty API to track ElastAlert alerts, and would like to make use of the "resolve events" API.

https://developer.pagerduty.com/documentation/integration/events/resolve

@zetsub0u
Contributor

Hi, I talked with the guys on IRC about something similar I wanted to implement: basically extending all the alerts that occur over a period of time (frequency, flatline, spike, etc.) with something like the EventWindow, but maybe an AlertWindow, which starts on the first alert and tracks it until the alert "expires" (i.e., no more alerts after x time).
I don't know if I'm going to be able to work on this anytime soon, but I just wanted to comment and give a 👍

@bolshoy
Contributor

bolshoy commented Oct 27, 2015

@jasonrhaas I have a simple implementation for this; it has been running for two weeks already and looks fine.
The logic is simple: if there is no alert for the rule in the current run, check in the writeback index whether we have an alert stored for the previous run of this rule. If there was an alert, try to resolve it. The resolver is implemented by the corresponding alerter. See https://github.com/rounds/elastalert/commit/efc449295636bddee913f4bd3c61d3a857e1d339
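
For readers skimming the thread, here is a minimal sketch of that logic, assuming a hypothetical `maybe_resolve()` helper and a per-alerter `resolve()` method; none of these names are actual ElastAlert internals, and the writeback query is only illustrative.

```python
# Hypothetical sketch of the described flow; the helper names and the
# writeback-index query are assumptions, not real ElastAlert APIs.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster


def maybe_resolve(rule, alerter, run_start, had_matches):
    """If the current run produced no matches, look for an alert written
    during the previous run of this rule and ask the alerter to resolve it."""
    if had_matches:
        return  # still firing, nothing to resolve

    previous_window = {
        "bool": {
            "filter": [
                {"term": {"rule_name": rule["name"]}},
                {"range": {"alert_time": {
                    "gte": (run_start - rule["run_every"]).isoformat(),
                    "lt": run_start.isoformat(),
                }}},
            ]
        }
    }
    res = es.search(index="elastalert_status", query=previous_window, size=1)
    for hit in res["hits"]["hits"]:
        # resolve() would be implemented by each alerter (PagerDuty, Sensu, ...)
        alerter.resolve(hit["_source"])
```

The per-alerter `resolve()` would be the important piece: for PagerDuty it would send a resolve event with the original incident key, for Sensu a resolving check result.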

@jasonrhaas
Contributor Author

Thanks @bolshoy for sharing. Something like this would be a really nice addition to ElastAlert. It could be another YAML option that is set per alert, like resolve_alert: true or something to that effect.

@bolshoy
Contributor

bolshoy commented Oct 27, 2015

@jasonrhaas right, this switch might be needed. In our case, alerts are sent to Sensu, and it always needs resolving. Hopefully I'll have some spare time soon and I'll try to create a proper PR.

@fiunchinho
Contributor

Any news on this?

@tomwganem

Would really like this to be implemented.

@eravion

eravion commented Sep 16, 2016

+1 :)

@iainlbc

iainlbc commented Sep 19, 2016

+1

@Mormaii

Mormaii commented Sep 21, 2016

+1

@bean5
Contributor

bean5 commented Sep 26, 2016

For flatline, I agree that this is entirely possible, and it is something I was looking to do. I would foresee, though, that for other rule types doing this becomes more complex, specifically when query_key is set.

I think one way of doing this is to allow any rule to specify a flatline rule whose time bounds must be after the first rule fires. But I don't understand all the internals of ElastAlert and am certainly still learning about ES. If anyone has a better way to do this, please chime in here.

@bean5
Contributor

bean5 commented Sep 26, 2016

Another way to do this would be to make rules dependent on, or blocked by, other rules. Or just define rules as being able to auto-resolve any other rule upon firing.

@bobbyhubbard

@bolshoy Did you make any progress on a pull? Maybe one of us could work on a pull based on your latest?

@bolshoy
Contributor

bolshoy commented Jan 24, 2017

@bobbyhubbard I stopped working on this altogether; we're using Prometheus instead.

@JC1738

JC1738 commented Feb 8, 2017

This would definitely be a nice feature. I thought about using a second rule that would act as the clearing alert, though that would be difficult to maintain.

@tkuther

tkuther commented Mar 7, 2017

I would love to see this, too. Currently I'm doing a flatline/frequency ping-pong with a command alert to Alerta.

@supernomad

So I just slammed into this brick wall myself, and had a thought about how this could be possible.

Essentially, ElastAlert triggers only when a match is made on the query. Could it also be told to trigger on the reverse, when no match is made? This would require being able to set fields to different values, for instance a status field in the post data of an HTTP POST or victorops_message_type in VictorOps. This would also obviously require a switch to turn the functionality on or off.

Any thoughts on the above?

@pblasquez
Contributor

pblasquez commented Aug 22, 2017

If the status were configurable for compatible outputs, e.g. for PagerDuty, setting a new variable 'pagerduty_status' to one of ['trigger', 'resolve', 'acknowledge'] with a default of 'trigger', it would at least cover cases where the status could be set explicitly by query with a separate rule.

I know it's not a global solution, but it would be welcome for the outputs where it is possible. It is already possible to do things this way for the JIRA output.
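
As a sketch of what that per-rule status could look like in a PagerDuty-style alerter, using the v1 Events API and the 'pagerduty_status' option name proposed here (which is not necessarily what was merged):

```python
# 'pagerduty_status' is the option name proposed in this thread, used as an
# assumption; the endpoint and payload fields are the PagerDuty v1 Events API.
import requests

PD_V1_EVENTS = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"


def send_pagerduty_event(rule, description, incident_key):
    payload = {
        "service_key": rule["pagerduty_service_key"],
        # one of 'trigger', 'resolve', 'acknowledge'; defaulting to 'trigger'
        # keeps existing rules behaving as before
        "event_type": rule.get("pagerduty_status", "trigger"),
        "description": description,
        "incident_key": incident_key,
    }
    resp = requests.post(PD_V1_EVENTS, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()
```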

@bean5
Contributor

bean5 commented Aug 22, 2017

@pblasquez: That would be a simple way to make it work. I agree that we should default to trigger for backwards compatibility. This is a low-cost way to achieve what this issue wants, although to some it may be a work-around rather than an actual feature.

The only catch would be to make sure that the correct PagerDuty API version is used, as I assume version one does not allow resolving. That is just an assumption, though.

Opening #1304 to do this.

@bean5
Contributor

bean5 commented Aug 24, 2017

@jasonrhaas: Do you think your idea should only apply to the flatline rule?

@jasonrhaas
Contributor Author

@bean5 My idea was to have it apply to any rule that triggers an alert. If you have used DataDog before, it is similar to that.

@bean5
Contributor

bean5 commented Aug 25, 2017

@pblasquez #1304 was accepted, so you should be able to do what you proposed. Although I wrote a test case for the code, I did not actually use it against PagerDuty, so it may be buggy. The PagerDuty API seemed to indicate that both their API versions should accept it. Let me know if it doesn't work for you.

@jasonrhaas I implemented what @pblasquez proposed. It works for PagerDuty use-cases, which is what you mentioned in your first post here. That being said, you had the other idea of resolve_alert: true. I can definitely understand how your idea applies to flatline since resolving/triggering in such a case makes sense. Same goes for Any rules. But for my typical use cases for rules like whitelist/blacklist, I'd easily run into cases where there is an offending document that occurs just once. For such use cases, not occurring anymore does not mean resolved--it just means that it only occurred once, but still requires RCA. I suppose there could be cases where no longer occurring implies resolution, but I would use flatline for those. (I would make flatline resolve instead of the heartbeat case where flatline = trigger.) Note: I have not used DataDog before.

Perhaps the best way for me to understand is to ask:

  • Do you have a case where you would use an alert (excluding Any/flatline) that you would like to auto-resolve?
  • Can this issue be resolved? Do you think we should move your resolve_alert: true to another issue along with an example use case?

@Qmando
Member

Qmando commented Aug 25, 2017

I've had a couple thoughts about this for a while, here's what I imagine:

  • Alerters will need to implement a new resolve_alert function; it would probably be able to optionally take a query_key value to differentiate alerts from the same rule.

  • Each rule type will need to define a function to determine if an alert can be resolved. This function could maybe take a time differential from the last alert for that rule/query_key. For some rule types this is easy: for flatline, if the event window is above the threshold, resolve; for frequency, if the event window is below num_events and maybe timeframe has elapsed since the last alert, resolve. We might be able to make a default that's just "if we've gone some amount of time without any alert, send a resolve" to work with other, harder-to-reason-about rule types.

  • For added customization, maybe we could make an alerter that just resolves other alerts, so that you could create a custom resolve rule.

  • Alerters without a defined resolve function would be no-ops, and it would probably default to off even for those that do implement one. For plain alerters like email, you might be able to define a "resolve_alert_text" or something. (A rough sketch of these interfaces follows below.)
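
An entirely hypothetical sketch of those interfaces (neither resolve_alert nor should_resolve exists in ElastAlert today):

```python
# Hypothetical interfaces only; these method names are not part of ElastAlert.
from datetime import timedelta


class Alerter:
    def alert(self, matches):
        raise NotImplementedError

    def resolve_alert(self, query_key_value=None):
        """Resolve a previously sent alert; a no-op unless the alerter overrides it."""
        pass


class FrequencyRule:
    def __init__(self, num_events, timeframe):
        self.num_events = num_events
        self.timeframe = timeframe

    def should_resolve(self, current_count, time_since_last_alert):
        """Resolve once the window has dropped below num_events and a full
        timeframe has elapsed since the last alert, as suggested above."""
        return (current_count < self.num_events
                and time_since_last_alert >= self.timeframe)


# Example: a rule alerting on 50 events/hour resolves after a quiet 2 hours
# with only 3 events in the window.
rule = FrequencyRule(num_events=50, timeframe=timedelta(hours=1))
print(rule.should_resolve(current_count=3,
                          time_since_last_alert=timedelta(hours=2)))  # True
```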

@bean5
Contributor

bean5 commented Aug 25, 2017

So you have already put a decent amount of thought into auto-resolving in general. Given the size of it, I think it merits its own issue. This one can be closed (we have the PagerDuty work-around in place), right?

I'm not quite sure what you mean by "differentiate alerts from the same rule" because in this project a rule triggers an alert, a 1-1 mapping.

@Qmando
Member

Qmando commented Aug 25, 2017

By that, I basically meant if you are using query_key, just like how silence stashes are created per query_key value. For example, take a flatline alert with query_key on hostname: if host1 goes flat, then host2 goes flat, but only host2 comes back, you don't want to resolve the host1 flatline.
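
A toy sketch of that bookkeeping, keyed the same way silence stashes are (rule name plus query_key value); all names here are invented for illustration:

```python
# Toy per-query_key state tracker; every name here is invented for illustration.
active_alerts = {}  # "rule_name.query_key_value" -> payload sent to the alerter


def alert_key(rule_name, query_key_value=None):
    return f"{rule_name}.{query_key_value}" if query_key_value else rule_name


def on_alert(rule_name, query_key_value, payload):
    active_alerts[alert_key(rule_name, query_key_value)] = payload


def on_recovery(rule_name, query_key_value, alerter):
    """Resolve only the key that recovered (host2), leaving host1 still firing."""
    payload = active_alerts.pop(alert_key(rule_name, query_key_value), None)
    if payload is not None:
        alerter.resolve(payload)
```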

@bean5
Contributor

bean5 commented Aug 25, 2017

Oh, I agree on that. Definitely. For non-PagerDuty alerters, that may be sufficient, and for a first version it should be.

When it comes to PagerDuty, unless you differentiate incidents by their title, they are considered the same incident. At least I think it is by title. So even if ElastAlert considers them two events, unless precautions are appropriately taken (differentiating by title), resolving one incident will actually resolve the other. Perhaps we could append query_key to the title of PagerDuty events automatically, or when a rule indicates to do so? I think this project has a way to do that manually, via arguments, but it would be nice to have it done automatically as a matter of process. This is a PD-related consideration only, so leaving it for follow-up work seems appropriate.

@Qmando
Member

Qmando commented Aug 25, 2017

As of b301f2a, you can set a custom PagerDuty incident title, possibly using query_key. The default doesn't use query_key, though, as the email and JIRA subjects do; it probably should.

@pblasquez
Contributor

Yes, you set 'pagerduty_incident_key' using 'pagerduty_incident_key_args'.

It is up to the user to keep things sufficiently specific so they can target the same incident key for resolution.
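
For example, a hypothetical trigger/resolve rule pair (shown as loaded Python dicts rather than YAML rule files) would need identical incident-key settings so the resolve lands on the right incident; the exact substitution syntax for the args should be checked against the ElastAlert docs:

```python
# Hypothetical pair of rules; option values are illustrative and the key
# substitution syntax should be verified against the ElastAlert docs.
trigger_rule = {
    "name": "disk_full",
    "pagerduty_status": "trigger",                # option proposed in this thread
    "pagerduty_incident_key": "disk_full--{0}",
    "pagerduty_incident_key_args": ["hostname"],
}

resolve_rule = {
    "name": "disk_full_recovered",
    "pagerduty_status": "resolve",
    # identical key template and args, so the same host maps to the same incident
    "pagerduty_incident_key": "disk_full--{0}",
    "pagerduty_incident_key_args": ["hostname"],
}
```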

@bean5
Contributor

bean5 commented Aug 26, 2017

Awesome, so the support is there for PagerDuty; I knew that at one time. What remains, if anything, is to add support for 'auto_resolve', right? Should we close this ticket and open one for that, or leave this open? I can go either way.

@meltingrobot

Would still love to get alerts closed automagically with VictorOps.

@Atem18

Atem18 commented Jun 14, 2019

Hi, any news about this? Should we create the recovery alert manually?

@Qmando
Member

Qmando commented Jun 14, 2019

@Atem18

I probably wouldn't get your hopes up too much for this. It's a fair amount of work to implement in a generic way, and we unfortunately aren't really doing work on new features right now.

For some alert types, you can implement this by creating a second alert which is an inverse of the original. For example, with the jira alerter, you can transition issues to closed. Other types might not be so easy.

If you give some specifics I might be able to help guide you.

@ahbrown1

If this feature is still dead in the water (generic alert recovery), I may have to implement something super hacky and ugly and stuff all the logic in an enhancement module, with the ElastAlert config file doing the upfront stuff, like just handling the index and query_key.
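
For anyone taking that route, the match-enhancement hook itself looks roughly like this. BaseEnhancement and process(match) are real ElastAlert extension points, but enhancements only run when a match fires, so the actual "no longer matching, send resolve" bookkeeping would still have to live in your own external code; everything stored below is hypothetical.

```python
# BaseEnhancement/process(match) are ElastAlert's documented extension points;
# what this enhancement records is a hypothetical example.
from elastalert.enhancements import BaseEnhancement


class TrackForRecovery(BaseEnhancement):
    """Stamp each match with enough context (rule name, query_key value) that
    an external resolver can later send the matching 'resolve' event."""

    def process(self, match):
        query_key = self.rule.get("query_key")
        match["recovery_tracking"] = {
            "rule_name": self.rule["name"],
            "query_key_value": match.get(query_key) if query_key else None,
        }
```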

@Qmando
Member

Qmando commented Sep 16, 2019

Someone implemented this for a few alerters: https://github.com/Yelp/elastalert/pull/2446/files

I haven't looked through all of it, but maybe it's a good place to start.

@nsano-rururu
Contributor

I'm taking inventory of open issues. I think this problem has been solved. If it has been resolved, please close it.

@meltingrobot

@nsano-rururu I read through the changelog, and I do not see anywhere that a resolve-alert feature was added. I do not think this was ever fixed.

@diogokiss

Any news on this issue? It would be really helpful to have this feature implemented. :-/

@aclowkey

Perhaps this issue should be moved to https://github.com/jertel/elastalert2, since this repo is no longer maintained?
