[Alerting] Add cancel to alert and action tasks #64148

Closed
gmmorris opened this issue Apr 22, 2020 · 7 comments · Fixed by #120853
Labels
estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
Feature:Alerting
Feature:Task Manager
impact:critical (This issue should be addressed immediately due to a critical level of impact on the product.)
resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility)
Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

Follow up from #64075 (comment)

We should consider:

  1. Adding a cancel implementation to Alert and Action tasks.
  2. Making a cancel implementation a required field on task definitions.
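
For context, a minimal sketch of what a cancellable task definition could look like, assuming Task Manager's `createTaskRunner` pattern; the task type id and the `runRule` helper are illustrative placeholders, not code that exists in the repo:

```ts
// Rough sketch only: wire an AbortController into the task runner so that
// Task Manager's cancel hook can actually interrupt in-flight work.
taskManager.registerTaskDefinitions({
  'alerting:example-rule-type': {
    title: 'Example alerting task',
    createTaskRunner: ({ taskInstance }) => {
      const controller = new AbortController();
      return {
        async run() {
          // Thread the abort signal down into the executor so long-running
          // queries can be interrupted, not just abandoned.
          return runRule(taskInstance.params, { abortSignal: controller.signal });
        },
        async cancel() {
          // Task Manager would call this when the task times out or is cancelled.
          controller.abort();
        },
      };
    },
  },
});
```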
@gmmorris gmmorris added Feature:Alerting, Feature:Task Manager, and Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams) labels Apr 22, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@jasonrhodes
Member

Adding a comment here: in observability, we have a problem where a composite agg query with a huge number of pages to page through can run for a very long time (days, sometimes) inside a rule execution. In these situations, the user doesn't really have a way out besides restarting Kibana, which would kill all of their currently running rules.

Disabling the specific rule is also not enough, because the existing execution would keep running. We could check in the executor whether the rule has been disabled since the execution started (if that context were accessible), but we figure that in many if not most cases the user doesn't want disabling the rule to have this effect.

In other words, it seems like the only way to give users a way out of these situations is to give them control over cancelling individual running tasks. But we're open to any other ideas for how to solve this problem (outside of how to prevent it in the first place, which we are also looking into separately).
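
To make that concrete, here is a rough sketch (not the actual rule executor) of how a paginated composite aggregation could check a cancellation signal between pages; the index, field names, and client wiring are all illustrative:

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative only: page through a composite aggregation, checking an abort
// signal between pages so a cancel/timeout can stop the loop early instead of
// letting it run for days.
async function pageCompositeAgg(es: Client, signal: AbortSignal): Promise<void> {
  let afterKey: Record<string, any> | undefined;
  do {
    if (signal.aborted) {
      // Execution was cancelled (or timed out); stop paging.
      return;
    }
    const resp = await es.search(
      {
        index: 'metrics-*',
        size: 0,
        aggs: {
          groups: {
            composite: {
              size: 1000,
              after: afterKey,
              sources: [{ host: { terms: { field: 'host.name' } } }],
            },
          },
        },
      },
      { signal } // also aborts the in-flight search request
    );
    const groups = (resp.aggregations as any)?.groups;
    // ...evaluate groups.buckets against the rule's conditions...
    afterKey = groups?.after_key;
  } while (afterKey !== undefined);
}
```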

@mikecote
Contributor

mikecote commented Jul 8, 2021

@jasonrhodes, I'm thinking there may be a story here around leveraging task manager timeouts. Task manager already has this concept and calls the task's cancel function whenever the task times out. The current gap is that the alerting framework doesn't pass this capability through to rules, which can be done.

One thing also worth revisiting is how timeouts are handled for recurring tasks: the timeout is whichever is greater, 5m or the schedule interval (e.g. 1h).

Thoughts?
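
If it helps, a tiny sketch of the recurring-task timeout rule described above (the constant names are just for illustration):

```ts
// Effective timeout is whichever is greater: the 5m default or the schedule interval.
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

const effectiveTimeoutMs = (scheduleIntervalMs: number): number =>
  Math.max(DEFAULT_TIMEOUT_MS, scheduleIntervalMs);

effectiveTimeoutMs(60 * 60 * 1000); // 1h schedule -> 3,600,000 ms (1h timeout)
effectiveTimeoutMs(60 * 1000);      // 1m schedule -> 300,000 ms (5m timeout)
```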

@jasonrhodes
Member

I think it's difficult because it's hard to tell the difference between "a long query that is taking a while because it's running against frozen indices or cold tier storage, and that's fine, we planned for this" and "a query that is taking way too long and the user wants to make it stop". That's what makes automated heuristics seem like they'll be tough here. But maybe some kind of default timeout that can be adjusted by users who expect to be making very long queries would help?
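
One possible shape for that, purely as a sketch: a per-rule-type timeout that can be raised where long queries are expected. The `ruleTaskTimeout` field and the `alerting.registerType` call below are assumptions for illustration, not a confirmed API (a real rule type registration requires more fields):

```ts
// Sketch only: let rule types that expect long queries (frozen indices, cold
// tier) declare a larger execution budget while the default stays short.
alerting.registerType({
  id: 'example.long-running-rule',      // hypothetical rule type id
  name: 'Example long-running rule',
  ruleTaskTimeout: '30m',               // assumed knob; default could remain 5m
  executor: async ({ services, params }) => {
    // ...run the (potentially slow) query and evaluate results...
  },
});
```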

@gmmorris gmmorris added loe:medium (Medium Level of Effort) and loe:large (Large Level of Effort) and removed loe:medium (Medium Level of Effort) labels Jul 14, 2021
@mikecote
Contributor

Thanks for the feedback, @jasonrhodes! It is indeed hard to distinguish between expected and unexpected query times in those cases.

From the sounds of it, research is needed to determine what approach should be taken to solve this problem. Regarding capacity, the alerting team won't be able to look into this soon, but if you feel the change is best done at the platform level, we are open to someone from the O11y team doing the research and coming up with a proposal and implementation.

@mikecote mikecote added loe:needs-research (This issue requires some research before it can be worked on or estimated) and removed loe:large (Large Level of Effort) labels Jul 20, 2021
@gmmorris gmmorris added resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility) and estimate:needs-research (Estimated as too large and requires research to break down into workable issues) labels Aug 13, 2021
@gmmorris gmmorris removed the loe:needs-research (This issue requires some research before it can be worked on or estimated) label Sep 2, 2021
@gmmorris gmmorris added the impact:critical (This issue should be addressed immediately due to a critical level of impact on the product.) label Sep 16, 2021
@YulNaumenko
Contributor

@gmmorris the alert task cancel function was added by PR #114289 and the action task cancel function by PR #120853.
Do you think we can close this issue once the last PR is merged?

@mikecote mikecote removed this from Backlog in Kibana Alerting Jan 6, 2022
@gmmorris
Contributor Author

@YulNaumenko sounds like it :)
