Improve task manager warning log messaging when all workers are used #54920
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Beyond some counts and task execution time stats, I'd like to see the number of "queued" tasks. When running alerting under stress with just 10 workers, where every worker was busy running an action (server-log), I think the actions were getting "backed up" in the queue. I was tracking when the actions were queued to run (alert executor time) and when the action was actually run (action executor time), and was seeing delays on the scale of minutes. So even just a count of the number of docs in the task manager saved object store would be useful.
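(For illustration of the kind of count being discussed: a minimal sketch that queries the task store directly with the 7.x Elasticsearch JS client. The `.kibana_task_manager` index name and the `task.status` / `task.runAt` fields are assumptions about how the task saved objects are laid out, not a supported API.)

```ts
import { Client } from '@elastic/elasticsearch';

// Rough ad-hoc count of "queued" work: tasks that are already due to run
// but have not been claimed by a worker yet.
async function countQueuedTasks(es: Client): Promise<number> {
  const { body } = await es.count({
    index: '.kibana_task_manager', // assumed index for task saved objects
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },         // not claimed/running
            { range: { 'task.runAt': { lte: 'now' } } }, // already past its scheduled time
          ],
        },
      },
    },
  });
  return body.count;
}
```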
I was thinking a route to get stats would be useful as well. At least in dev mode. :-)
I'm guessing folks will be annoyed with messages from TM in the console on a regular interval. Seems like something we could turn into a config option |
hmm, Task Manager doesn't have a concept of queued tasks, as it asks for as many tasks as it has capacity to handle, rather than counting how many there are that it doesn't have capacity to handle. We could track the lead time (we do this in the perf tests), showing the average/p90 time between execution time and the scheduled runAt, so you'd get a sense of how much tasks get backed up.
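(A sketch of that lead-time stat, assuming you can capture the scheduled `runAt` and the actual start time for each task run; the names here are illustrative, not task manager APIs.)

```ts
interface TaskRunSample {
  runAt: Date;     // when the task was scheduled to run
  startedAt: Date; // when a worker actually started it
}

// Lead time = how far behind its scheduled runAt a task actually started.
function leadTimeStats(samples: TaskRunSample[]): { avgMs: number; p90Ms: number } | undefined {
  if (samples.length === 0) return undefined;

  const leadTimes = samples
    .map((s) => s.startedAt.getTime() - s.runAt.getTime())
    .sort((a, b) => a - b);

  const avgMs = leadTimes.reduce((sum, ms) => sum + ms, 0) / leadTimes.length;
  const p90Ms = leadTimes[Math.min(leadTimes.length - 1, Math.floor(leadTimes.length * 0.9))];

  return { avgMs, p90Ms };
}
```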
we could provide a separate CLI that displays this rather than spamming the log
ya, figured, since we don't really need that count for anything today. It's a fairly important number though; I think it would be fine to pay for a separate query for ad-hoc purposes (HTTP API), or an hourly-or-so stats summary for logging/telemetry.
++ to this, as an API to be called ad hoc, and/or an optional monitor you can turn on in config. Agree that the "queued" number lets you make an informed decision about whether you need more workers. If the queue spikes but then goes down, you know there are enough workers; the problem is that the load distribution is too spiky, so there will be delayed execution (which may be fine). But if the queue keeps growing to the point where it approaches the number of tasks in the system, you know you need more.
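(A sketch of what the ad-hoc API could look like as a Kibana plugin route; the `/api/task_manager/_stats` path, the `TaskManagerStats` shape, and the `getStats` helper are hypothetical, and the `IRouter` import path may differ between Kibana versions.)

```ts
import { IRouter } from 'src/core/server';

// Hypothetical stats payload; getStats() would gather whatever task manager
// can cheaply expose (queued/overdue counts, lead-time percentiles, ...).
interface TaskManagerStats {
  queued: number;
  avgLeadTimeMs: number;
  p90LeadTimeMs: number;
}

export function registerStatsRoute(
  router: IRouter,
  getStats: () => Promise<TaskManagerStats>
) {
  router.get(
    { path: '/api/task_manager/_stats', validate: false },
    async (context, request, response) => response.ok({ body: await getStats() })
  );
}
```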
I feel like we need to do something for 7.6 here - SIEM users that use the alerts/signals feature will likely end up seeing this message, and so then update their config to try to make it go away.
Yeah, that's simple enough. |
PR: #55334 |
Hello, looking at the comments above, my understanding is that the error message "Task Manager has skipped Claiming Ownership of available tasks at it has ran out Available Workers." is just a warning and will be removed, since the suggested solution of increasing the workers could cause problems. I see the same message on my Kibana; could you please advise what should be done? Thanks in advance.
@securigeek - this is just a diagnostic message from the Task Manager plugin. It's an indication that Task Manager is busy running tasks, up to its max worker limit, so it didn't ask for more tasks to execute. It asks for tasks on an interval, so as the tasks it's working on complete, it will start asking for more tasks to execute on subsequent rounds of the interval. You are correct that you might be able to avoid seeing this message by increasing the max worker setting, but as you suggest ... this could cause problems. For now, ignore the message. I'll create a PR to disable it. Thanks for reporting!
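(A simplified model of the behavior described in this comment, not the actual task manager implementation: each poll interval, claim at most as many tasks as there are free workers, and skip claiming entirely when none are free.)

```ts
interface Task {
  run(): Promise<void>;
}

// Simplified model of the claim/run loop; names are illustrative.
class TaskPool {
  private running = 0;

  constructor(
    private readonly maxWorkers: number,
    private readonly logger: { debug(msg: string): void }
  ) {}

  get availableWorkers(): number {
    return this.maxWorkers - this.running;
  }

  // Called on every poll interval.
  async pollForWork(claimTasks: (count: number) => Promise<Task[]>): Promise<void> {
    if (this.availableWorkers === 0) {
      // This is the situation the (now debug-level) message describes:
      // every worker is busy, so skip claiming and try again next interval.
      this.logger.debug('All workers are in use; skipping task claiming this interval');
      return;
    }
    const tasks = await claimTasks(this.availableWorkers);
    for (const task of tasks) {
      void this.run(task); // fire and forget; completion frees the worker
    }
  }

  private async run(task: Task): Promise<void> {
    this.running++;
    try {
      await task.run();
    } finally {
      this.running--;
    }
  }
}
```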
…ug (#62083) resolves #54920 The scenario this message warns about is actually something that we should expect to happen in a busy system - and we back off as we should. But there's no reason it needs to be info-logged, especially as it seems somewhat frightening. Changing to debug, so in case we do need this for diagnostic purposes, we can still get it via logging config.
From #54697 (comment)
The message from TM that gets logged when all the workers are in use seems ... not great.
I don't really have a great list of concrete alternatives, but will discuss some ideas below
ideas:
I think I like the first. The default might be 10 minutes; for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
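(A sketch of that first idea, throttling the message to at most once per configurable interval; `warnIntervalMs` and `createThrottledLogger` are made-up names to show the shape, not existing task manager settings.)

```ts
// Hypothetical throttle for the "out of workers" message: log it at most
// once per interval, regardless of how often the condition is hit.
function createThrottledLogger(
  logger: { warn(msg: string): void },
  warnIntervalMs: number = 10 * 60 * 1000 // default 10 minutes; maybe 30s-1m in dev
) {
  let lastLoggedAt = 0;
  return (message: string) => {
    const now = Date.now();
    if (now - lastLoggedAt >= warnIntervalMs) {
      lastLoggedAt = now;
      logger.warn(message);
    }
  };
}

// Usage sketch:
// const logNoWorkers = createThrottledLogger(logger, config.warnIntervalMs);
// logNoWorkers('Task Manager has no available workers; skipping claim cycle');
```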