
[Discuss] Should the framework drop accumulated alerts (and their related notifications) when a rule run times out? #139237

Open · mikecote opened this issue Aug 22, 2022 · 4 comments
Labels: discuss · Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework) · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote (Contributor)

The alerting framework currently takes an all-or-nothing approach to persisting alerts and sending notifications: if a rule run times out, all accumulated alerts are dropped and nothing is persisted for that run.

I'm opening this issue to discuss whether the framework should do something with the accumulated alerts when it encounters a timeout. I'm wondering if we can take what we learned from alert circuit breakers and apply the same logic on timeout that we apply when the alert circuit breaker is hit 🤔

cc @shanisagiv1

@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr (Member) commented Aug 22, 2022

Is there some context on why we did it this way (the way it works today)?

I guess one thing is that if a rule times out after handling only some of its queries, it may have marked some alerts as still active, but would have marked others active too had it not timed out. Those other alerts will now recover, which we probably don't want. It almost feels like you could accept active alerts from a timed-out run, but shouldn't process recovered alerts from such runs; those alerts would remain active (or in some other non-recovered state, like maybe-active?)
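
A rough sketch of what that reconciliation could look like (all of these names are hypothetical, not the framework's real API):

```ts
// Hypothetical post-run reconciliation illustrating "accept active,
// skip recoveries" on timeout. None of these names are the real
// alerting framework API.
interface Alert {
  id: string;
  status: 'active' | 'recovered';
}

function reconcileAlerts(
  previouslyActive: Map<string, Alert>,
  reportedThisRun: Map<string, Alert>,
  runTimedOut: boolean
): Map<string, Alert> {
  const result = new Map(reportedThisRun);
  for (const [id, alert] of previouslyActive) {
    if (result.has(id)) continue;
    if (!runTimedOut) {
      // Normal run: previously-active alerts that weren't reported again recover.
      result.set(id, { ...alert, status: 'recovered' });
    } else {
      // Timed-out run: do NOT recover alerts that simply weren't reported --
      // the run may never have reached the queries that would have
      // re-confirmed them. Keep them active (or a 'maybe-active' state).
      result.set(id, { ...alert, status: 'active' });
    }
  }
  return result;
}
```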

@ymao1 (Contributor) commented Aug 22, 2022

If I recall correctly, we did it this way to avoid timed-out runs possibly overwriting the state of runs that complete normally. Even though we cancel the ES queries and provide rule type executors with services they can check to see whether the run has been cancelled, we don't have 100% adoption, so we still can't guarantee that execution completely stops when a run is cancelled. It could be that a rule runs the ES query but then takes 10 minutes to post-process the results; the run times out during that window, but the rule keeps running. The next execution of the rule gets picked up and finishes within the timeout, updating the task manager state with its latest results. Then the previous execution finally finishes. If we processed those alerts and persisted them to the task state, we would overwrite the state from a newer execution and send outdated notifications.

Ideally, when we know for sure that cancelling a rule 100% stops the execution, we could look into doing something like what the alert circuit breaker does.
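
For reference, the opt-in cancellation check described above looks roughly like this inside a rule type executor (a sketch only; `shouldStopExecution` stands in for the cancellation check the services expose, and the query/post-processing helpers are placeholders):

```ts
// Sketch of the cooperative-cancellation pattern described above. The
// executor has to check the services between steps; if it doesn't, the
// run keeps going even after the framework has timed it out.
type RuleServices = { shouldStopExecution: () => boolean };

declare function runEsQuery(): Promise<unknown[]>;
declare function postProcess(hits: unknown[]): void;

async function executor({ services }: { services: RuleServices }): Promise<void> {
  const hits = await runEsQuery(); // the framework can cancel the query itself

  // Without this check, post-processing can run for minutes past the
  // timeout and race with a newer execution's task-state writes.
  if (services.shouldStopExecution()) {
    return;
  }

  postProcess(hits);
}
```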

@mikecote (Contributor · Author)

Yeah, I think it'll take time to be 100% sure that cancelling a rule stops the execution. But I wonder if we should do anything with the alerts that did get reported before the timeout occurred.

For example, with a 5-minute timeout:
minute 0 - start rule execution
minute 0 - start query
minute 4 - query returns partial alerts
minute 4 - report alert A, B, C to the platform
minute 4 - start a second query
minute 5 - timeout error!
...

I wonder if the framework should do something with alerts A, B, and C, which were reported before the timeout occurred. Hopefully my example is clear; happy to make some diagrams or discuss synchronously.
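
One possible shape for that, continuing the timeline above (hypothetical names again; today the framework drops everything on timeout):

```ts
// Hypothetical partial-flush handler for the timeline above. At minute 5,
// reportedSoFar would be ['A', 'B', 'C']. Today these are dropped; the
// open question is whether to persist and notify for them instead.
declare function persistAlerts(alertIds: string[]): Promise<void>;
declare function scheduleNotifications(alertIds: string[]): Promise<void>;

async function onRuleTimeout(reportedSoFar: string[]): Promise<void> {
  if (reportedSoFar.length > 0) {
    await persistAlerts(reportedSoFar);
    await scheduleNotifications(reportedSoFar);
  }
  // Recoveries would still be skipped: a timed-out run never confirmed
  // which previously-active alerts are actually gone.
}
```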
