
[Discuss] Should the framework drop accumulated alerts (and their related notifications) when a rule run times out? #139237

Open · mikecote opened this issue Aug 22, 2022 · 4 comments
Labels: discuss · Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework) · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote (Contributor)

The alerting framework currently takes an all-or-nothing approach to persisting alerts and sending notifications: if a rule run times out, all accumulated alerts are dropped and nothing is persisted for that run.

I'm opening this issue to discuss whether the framework should do something with the accumulated alerts when it encounters a timeout. I'm wondering if we can take what we learned from alert circuit breakers and apply the same logic on timeout that we apply when the alert circuit breaker is hit 🤔

cc @shanisagiv1

@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr (Member) commented Aug 22, 2022

Is there some context on why we did it this way (the way it works today)?

I guess one thing is that if a rule times out after handling only some of its queries, it may have marked some alerts as still active, but would have marked others active too had it not timed out. Those other alerts will now recover, which we probably don't want. It almost feels like you could accept active alerts from a timed-out run, but shouldn't process recovered alerts from such runs; those alerts would remain active (or in some other non-recovered state, like maybe-active?)
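
A rough sketch of what that reconciliation could look like (all of these names are hypothetical, not the framework's real API):

```ts
// Hypothetical post-run reconciliation illustrating "accept active,
// skip recoveries" on timeout. None of these names are the real
// alerting framework API.
interface Alert {
  id: string;
  status: 'active' | 'recovered';
}

function reconcileAlerts(
  previouslyActive: Map<string, Alert>,
  reportedThisRun: Map<string, Alert>,
  runTimedOut: boolean
): Map<string, Alert> {
  const result = new Map(reportedThisRun);
  for (const [id, alert] of previouslyActive) {
    if (result.has(id)) continue;
    if (!runTimedOut) {
      // Normal run: previously-active alerts that weren't reported again recover.
      result.set(id, { ...alert, status: 'recovered' });
    } else {
      // Timed-out run: do NOT recover alerts that simply weren't reported --
      // the run may never have reached the queries that would have
      // re-confirmed them. Keep them active (or a 'maybe-active' state).
      result.set(id, { ...alert, status: 'active' });
    }
  }
  return result;
}
```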

@ymao1 (Contributor) commented Aug 22, 2022

If I recall correctly, we did it this way to avoid timed-out runs possibly overwriting the state of runs that complete normally. Even though we cancel the ES queries and provide rule type executors with services they can check to see whether the run has been cancelled, we don't have 100% adoption, so we still can't guarantee that execution completely stops when a run is cancelled. It could be that a rule runs the ES query but then takes 10 minutes to post-process the results; the run times out during that window, but the rule keeps running. The next execution of the rule gets picked up and finishes within the timeout, updating the task manager state with its latest results. Then the previous execution finally finishes. If we processed those alerts and persisted them to the task state, we would overwrite the state from a newer execution and send outdated notifications.

Ideally, when we know for sure that cancelling a rule 100% stops the execution, we could look into doing something like what the alert circuit breaker does.
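
For reference, the opt-in cancellation check described above looks roughly like this inside a rule type executor (a sketch only; `shouldStopExecution` stands in for the cancellation check the services expose, and the query/post-processing helpers are placeholders):

```ts
// Sketch of the cooperative-cancellation pattern described above. The
// executor has to check the services between steps; if it doesn't, the
// run keeps going even after the framework has timed it out.
type RuleServices = { shouldStopExecution: () => boolean };

declare function runEsQuery(): Promise<unknown[]>;
declare function postProcess(hits: unknown[]): void;

async function executor({ services }: { services: RuleServices }): Promise<void> {
  const hits = await runEsQuery(); // the framework can cancel the query itself

  // Without this check, post-processing can run for minutes past the
  // timeout and race with a newer execution's task-state writes.
  if (services.shouldStopExecution()) {
    return;
  }

  postProcess(hits);
}
```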

@mikecote (Contributor · Author)

Yeah, I think it'll take time to be 100% sure that cancelling a rule stops the execution. But I wonder if we should do anything with the alerts that did get reported before the timeout occurred.

For example, with a 5-minute timeout:
minute 0 - start rule execution
minute 0 - start query
minute 4 - query returns partial alerts
minute 4 - report alert A, B, C to the platform
minute 4 - start a second query
minute 5 - timeout error!
...

I wonder if the framework should do something with alerts A, B, and C, which were reported before the timeout occurred. Hopefully my example is clear; happy to make some diagrams or discuss synchronously.
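
One possible shape for that, continuing the timeline above (hypothetical names again; today the framework drops everything on timeout):

```ts
// Hypothetical partial-flush handler for the timeline above. At minute 5,
// reportedSoFar would be ['A', 'B', 'C']. Today these are dropped; the
// open question is whether to persist and notify for them instead.
declare function persistAlerts(alertIds: string[]): Promise<void>;
declare function scheduleNotifications(alertIds: string[]): Promise<void>;

async function onRuleTimeout(reportedSoFar: string[]): Promise<void> {
  if (reportedSoFar.length > 0) {
    await persistAlerts(reportedSoFar);
    await scheduleNotifications(reportedSoFar);
  }
  // Recoveries would still be skipped: a timed-out run never confirmed
  // which previously-active alerts are actually gone.
}
```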
