[Alerting] Add cancel to alert and action tasks #64148

Closed
gmmorris opened this issue Apr 22, 2020 · 7 comments · Fixed by #120853
Labels
estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
Feature:Alerting
Feature:Task Manager
impact:critical (This issue should be addressed immediately due to a critical level of impact on the product.)
resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility)
Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

Follow up from #64075 (comment)

We should consider:

  1. Adding a cancel implementation to Alert and Action tasks.
  2. Making a cancel implementation a required field on task definitions.
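
For context, a minimal sketch of what a cancellable task definition could look like, assuming Task Manager's `createTaskRunner` pattern; the task type id and the `runRule` helper are illustrative placeholders, not code that exists in the repo:

```ts
// Rough sketch only: wire an AbortController into the task runner so that
// Task Manager's cancel hook can actually interrupt in-flight work.
taskManager.registerTaskDefinitions({
  'alerting:example-rule-type': {
    title: 'Example alerting task',
    createTaskRunner: ({ taskInstance }) => {
      const controller = new AbortController();
      return {
        async run() {
          // Thread the abort signal down into the executor so long-running
          // queries can be interrupted, not just abandoned.
          return runRule(taskInstance.params, { abortSignal: controller.signal });
        },
        async cancel() {
          // Task Manager would call this when the task times out or is cancelled.
          controller.abort();
        },
      };
    },
  },
});
```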
@gmmorris gmmorris added Feature:Alerting, Feature:Task Manager, and Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams) labels Apr 22, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@jasonrhodes
Member

Adding a comment here: in observability, we have a problem where a composite agg query with a huge number of pages to page through can run for a very long time (days, sometimes) inside a rule execution. In these situations, the user doesn't really have a way out besides restarting Kibana, which would kill all of their currently running rules.

Disabling the specific rule is also not enough, because the existing execution would keep running. We could check in the executor whether the rule has been disabled since the execution started (if that context were accessible), but we figure that in many if not most cases the user doesn't want disabling the rule to have this effect.

In other words, it seems like the only way to give users a way out of these situations is to give them control over cancelling individual running tasks. But we're open to any other ideas for how to solve this problem (outside of how to prevent it in the first place, which we are also looking into separately).
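
To make that concrete, here is a rough sketch (not the actual rule executor) of how a paginated composite aggregation could check a cancellation signal between pages; the index, field names, and client wiring are all illustrative:

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative only: page through a composite aggregation, checking an abort
// signal between pages so a cancel/timeout can stop the loop early instead of
// letting it run for days.
async function pageCompositeAgg(es: Client, signal: AbortSignal): Promise<void> {
  let afterKey: Record<string, any> | undefined;
  do {
    if (signal.aborted) {
      // Execution was cancelled (or timed out); stop paging.
      return;
    }
    const resp = await es.search(
      {
        index: 'metrics-*',
        size: 0,
        aggs: {
          groups: {
            composite: {
              size: 1000,
              after: afterKey,
              sources: [{ host: { terms: { field: 'host.name' } } }],
            },
          },
        },
      },
      { signal } // also aborts the in-flight search request
    );
    const groups = (resp.aggregations as any)?.groups;
    // ...evaluate groups.buckets against the rule's conditions...
    afterKey = groups?.after_key;
  } while (afterKey !== undefined);
}
```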

@mikecote
Contributor

mikecote commented Jul 8, 2021

@jasonrhodes, I'm thinking there may be a story here around leveraging task manager timeouts. Task manager already has this concept and calls the task's cancel function whenever the task times out. The current gap is that the alerting framework doesn't pass this capability through to rules, which can be done.

One thing also worth revisiting is how timeouts are handled for recurring tasks: the timeout is whichever is greater, 5m or the schedule interval (e.g. 1h).

Thoughts?
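
If it helps, a tiny sketch of the recurring-task timeout rule described above (the constant names are just for illustration):

```ts
// Effective timeout is whichever is greater: the 5m default or the schedule interval.
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

const effectiveTimeoutMs = (scheduleIntervalMs: number): number =>
  Math.max(DEFAULT_TIMEOUT_MS, scheduleIntervalMs);

effectiveTimeoutMs(60 * 60 * 1000); // 1h schedule -> 3,600,000 ms (1h timeout)
effectiveTimeoutMs(60 * 1000);      // 1m schedule -> 300,000 ms (5m timeout)
```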

@jasonrhodes
Member

I think it's difficult because it's hard to tell the difference between "a long query that is taking a while because it's running against frozen indices or cold tier storage, and that's fine, we planned for this" and "a query that is taking way too long and the user wants to make it stop". That's what makes automated heuristics seem like they'll be tough here. But maybe some kind of default timeout that can be adjusted by users who expect to be making very long queries would help?
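
One possible shape for that, purely as a sketch: a per-rule-type timeout that can be raised where long queries are expected. The `ruleTaskTimeout` field and the `alerting.registerType` call below are assumptions for illustration, not a confirmed API (a real rule type registration requires more fields):

```ts
// Sketch only: let rule types that expect long queries (frozen indices, cold
// tier) declare a larger execution budget while the default stays short.
alerting.registerType({
  id: 'example.long-running-rule',      // hypothetical rule type id
  name: 'Example long-running rule',
  ruleTaskTimeout: '30m',               // assumed knob; default could remain 5m
  executor: async ({ services, params }) => {
    // ...run the (potentially slow) query and evaluate results...
  },
});
```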

@gmmorris gmmorris added loe:medium (Medium Level of Effort) and loe:large (Large Level of Effort) and removed loe:medium (Medium Level of Effort) labels Jul 14, 2021
@mikecote
Contributor

Thanks for the feedback, @jasonrhodes! It is indeed hard to distinguish between expected and unexpected query times in those cases.

From the sounds of it, research is needed to determine what approach should be taken to solve this problem. Regarding capacity, the alerting team won't be able to look into this soon, but if you feel the change is best done at the platform level, we are open to someone from the O11y team doing the research and coming up with a proposal and implementation.

@mikecote mikecote added loe:needs-research (This issue requires some research before it can be worked on or estimated) and removed loe:large (Large Level of Effort) labels Jul 20, 2021
@gmmorris gmmorris added resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility) and estimate:needs-research (Estimated as too large and requires research to break down into workable issues) labels Aug 13, 2021
@gmmorris gmmorris removed the loe:needs-research (This issue requires some research before it can be worked on or estimated) label Sep 2, 2021
@gmmorris gmmorris added the impact:critical (This issue should be addressed immediately due to a critical level of impact on the product.) label Sep 16, 2021
@YulNaumenko
Contributor

@gmmorris the alert task cancel function was added by PR #114289 and the action task cancel function by PR #120853.
Do you think we can close this issue once the last PR is merged?

@mikecote mikecote removed this from Backlog in Kibana Alerting Jan 6, 2022
@gmmorris
Contributor Author

@YulNaumenko sounds like it :)
