[Task manager][discuss] force implementation of cancel()? Or don't reschedule non-cancellable tasks until complete #95985
Labels
discuss
estimate:needs-research
Estimated as too large and requires research to break down into workable issues
Feature:Task Manager
impact:high
Addressing this issue will have a high level of impact on the quality/strength of our product.
resilience
Issues related to Platform resilience in terms of scale, performance & backwards compatibility
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
Task manager allows tasks to "cancel" themselves, if they run over their timeout, by calling a `cancel()` method in their runner. Tasks do not have to implement this though.

We're seeing a situation with a customer that involves a task manager task which looks like it has overlapping executions - in theory they should not be overlapping.

The task invocations are overlapping because the task exceeds its timeout, but does not implement the `cancel()` method. Whether the task implements the method or not, task manager makes the task schedulable again. And because task manager can't actually STOP the task from running (it's just a function invocation, which you don't have control over like you do in Java with `thread.kill()` or whatever), the task keeps running. And so, overlapping task executions.

I repro'd this by hacking / instrumenting `x-pack/plugins/alerting/server/health/task.ts` as so:
hacked/instrumented alerting health task, changes marked with `/*!*/`
This changes the task so the interval is `5s`, the timeout is `2s`, and the task will run for at least `10s`.

You can see the overlap in the log messages.
It seems to stabilize pretty quickly, never running more than 3 of these tasks at a time, after an hour or so of running. Not sure whether it might slowly grow, or whether other combinations of timeouts, intervals, and actual execution times would end up with a non-constant number of overlapping executions.
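For a rough feel of how the overlap count behaves, here's a toy model in TypeScript. It assumes (and this is a simplification, not how task manager actually claims and schedules tasks) that a new run starts at a fixed gap regardless of whether earlier runs have finished, and that every run takes the same amount of time. `peakOverlap` and its parameters are made-up names for illustration only.

```typescript
// Toy model, NOT task manager's real scheduling logic (assumption):
// a new run starts every `startGapMs`, whether or not earlier runs finished,
// and every run lasts exactly `runMs`.
function peakOverlap(startGapMs: number, runMs: number, horizonMs: number): number {
  const events: Array<[number, number]> = []; // [time, +1 for start | -1 for end]
  for (let start = 0; start < horizonMs; start += startGapMs) {
    events.push([start, +1]);
    events.push([start + runMs, -1]);
  }
  // At the same instant, process ends before starts so that back-to-back
  // runs are not counted as overlapping.
  events.sort((a, b) => a[0] - b[0] || a[1] - b[1]);

  let running = 0;
  let peak = 0;
  for (const [, delta] of events) {
    running += delta;
    peak = Math.max(peak, running);
  }
  return peak;
}

// One hour of runs starting every 5s.
console.log(peakOverlap(5000, 10000, 3600000)); // 2: runs of exactly 10s
console.log(peakOverlap(5000, 10500, 3600000)); // 3: runs slightly over 10s
```

Under that assumption the peak is bounded by roughly the run time divided by the gap between starts, so it would not keep growing as long as the run time itself stays bounded. Runs a bit over `10s` at a `5s` gap give 3, which lines up with what the repro showed, though real claiming and polling behavior may well differ.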
what can we do?
One thing we could do is FORCE task implementors to implement a `cancel()` method, by making it a required method. We should obviously provide a nice sample for this. Also note: we need to implement `cancel()` on alerts and actions: issue #64148 (I suspect this won't end up being a "nice sample").

Another option would be to continue to allow non-cancellable tasks, but not reschedule them until the original async task actually completes - I think that could be possible. We might want to do this even if we require a `cancel()` call.

@mikecote pointed out this related PR: #83682
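As a rough sketch of the two options above: a runner that cooperatively honors `cancel()`, and a guard that won't start a new run while the previous one is still executing. Everything here (`CancellableRunner`, `createSteppedRunner`, `InFlightGuard`, `sleep`, `demo`) is a hypothetical name invented for illustration; none of it is task manager's actual API, and a real `cancel()` for alerts and actions would likely be a lot messier.

```typescript
// Hypothetical shape for illustration, NOT task manager's actual types.
interface CancellableRunner {
  run(): Promise<void>;
  cancel(): Promise<void>;
}

// Option 1: a runner that works in small steps and checks a flag between
// steps, so that cancel() makes run() return at the next checkpoint.
function createSteppedRunner(
  steps: Array<() => Promise<void>>,
  onStepDone: () => void = () => {}
): CancellableRunner {
  let cancelled = false;
  return {
    async run() {
      for (const step of steps) {
        if (cancelled) return; // cooperative cancellation checkpoint
        await step();
        onStepDone();
      }
    },
    async cancel() {
      cancelled = true;
    },
  };
}

// Option 2: refuse to start a new run while the previous one is still
// executing, whether or not it has timed out or been asked to cancel.
class InFlightGuard {
  private inFlight = false;

  // Returns false, without starting anything, if a run is still in flight.
  tryRun(runner: CancellableRunner): boolean {
    if (this.inFlight) return false;
    this.inFlight = true;
    const release = () => {
      this.inFlight = false;
    };
    runner.run().then(release, release); // release on success or failure
    return true;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function demo(): Promise<void> {
  let stepsDone = 0;
  const makeRunner = () =>
    createSteppedRunner([() => sleep(50), () => sleep(50), () => sleep(50)], () => stepsDone++);
  const guard = new InFlightGuard();
  const first = makeRunner();

  console.log(guard.tryRun(first)); // true: nothing in flight yet
  console.log(guard.tryRun(makeRunner())); // false: first run still executing
  await first.cancel(); // e.g. called once the timeout is exceeded
  await sleep(200);
  console.log(stepsDone); // 1: the in-progress step finished, the rest were skipped
  console.log(guard.tryRun(makeRunner())); // true: first run has returned
}

demo();
```

Note the cancellation is only as prompt as the longest step: a step that is already running finishes before the flag is checked. The guard covers the case where a task never reaches a checkpoint at all, which is why the two options could reasonably be combined.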