Failed Task Manager task documents are never cleaned up bloating the index #79977
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Initial feelz: we should have a recurring task that deletes old failed tasks. Is that the main source of "garbage" in the index? I'd think we could do something like delete them if > 1 week old (maybe 2, that seems kinda standard). Maybe run that once/day?
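The retention idea in the comment above could be sketched roughly as follows. This is only an illustration under assumptions: the `TaskDoc` shape, the `updatedAt` field and the `selectExpiredFailedTasks` helper are hypothetical names, not Task Manager's actual schema or API, and the one-week window is just the figure floated in the comment.

```typescript
// Hypothetical shape of a task document; the real Task Manager schema may differ.
interface TaskDoc {
  id: string;
  status: 'idle' | 'running' | 'failed';
  // Assumed field: when the task last changed state, in epoch milliseconds.
  updatedAt: number;
}

// One week, the retention window suggested in the comment.
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Returns the ids of failed tasks whose last update is older than the retention window.
// Tasks in any other status are never selected, regardless of age.
function selectExpiredFailedTasks(
  tasks: TaskDoc[],
  nowMs: number,
  retentionMs: number = ONE_WEEK_MS
): string[] {
  return tasks
    .filter((task) => task.status === 'failed' && nowMs - task.updatedAt > retentionMs)
    .map((task) => task.id);
}
```

A once-a-day recurring task would call something like this with the current time and then delete the returned ids. Scheduling and the delete call are left out because they depend on Task Manager internals.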
Yeah that was my initial gut feeling as well but we should do some research...
With #90888 opened, this problem won't be as common for action execution tasks once that work lands.
Linking with #138344
We have the following cleanup logic at https://github.com/elastic/kibana/blob/main/x-pack/plugins/task_manager/server/saved_objects/index.ts#L29-L72. I'd imagine we'd use very similar, if not the same, logic to periodically clean up outside of migrations as well.
As part of #147237 we would no longer be performing a reindex during upgrade migrations. This means that
From #152223 (comment), let's make sure to cleanup
…52841)

Part of #79977 (step 1 and 3).

In this PR, I'm making Task Manager remove tasks instead of updating them with `status: failed` whenever a task is out of attempts. I've also added an optional `cleanup` hook to the task runner that can be defined if additional cleanup is necessary whenever a task has been deleted (ex: delete `action_task_params`).

## To verify an ad-hoc task that always fails

1. With this PR codebase, modify an action to always throw an error
2. Create an alerting rule that will invoke the action once
3. See the action fail three times
4. Observe the task SO is deleted (search by task type / action type) alongside the action_task_params SO

## To verify Kibana crashing on the last ad-hoc task attempt

1. With this PR codebase, modify an action to always throw an error (similar to scenario above) but also add a delay of 10s before the error is thrown (`await new Promise((resolve) => setTimeout(resolve, 10000));`) and a log message before the delay begins
2. Create an alerting rule that will invoke the action once
3. See the action fail twice
4. On the third run, crash Kibana while the action is waiting for the 10s delay, this will cause the action to still be marked as running while it no longer is
5. Restart Kibana
6. Wait 5-10m until the task's retryAt is overdue
7. Observe the task getting deleted and the action_task_params getting deleted

## To verify recurring tasks that continuously fail

1. With this PR codebase, modify a rule type to always throw an error when it runs
2. Create an alerting rule of that type (with a short interval)
3. Observe the rule continuously running and not getting trapped into the PR changes

Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
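The behaviour this PR describes (remove the task when it is out of attempts, then run an optional `cleanup` hook) might look roughly like the sketch below. Everything here is an assumption made for illustration: `Task`, `TaskDefinition`, `TaskStore` and `runOnce` are hypothetical names and do not reflect Task Manager's real types or runner.

```typescript
// All names below are hypothetical; this is not Task Manager's actual API.
interface Task {
  id: string;
  attempts: number;
}

interface TaskDefinition {
  maxAttempts: number;
  run: (task: Task) => Promise<void>;
  // Optional hook, invoked after the task document has been removed
  // (for example, to delete a related action_task_params object).
  cleanup?: (task: Task) => Promise<void>;
}

interface TaskStore {
  update: (task: Task) => Promise<void>;
  remove: (id: string) => Promise<void>;
}

type Outcome = 'succeeded' | 'retry' | 'deleted';

// Runs a task once. On failure, the task is either kept for another attempt
// or, when out of attempts, removed instead of being marked as failed.
async function runOnce(task: Task, def: TaskDefinition, store: TaskStore): Promise<Outcome> {
  const attempted: Task = { ...task, attempts: task.attempts + 1 };
  try {
    await def.run(attempted);
    return 'succeeded';
  } catch {
    if (attempted.attempts >= def.maxAttempts) {
      await store.remove(attempted.id);
      if (def.cleanup) {
        await def.cleanup(attempted);
      }
      return 'deleted';
    }
    // Still has attempts left: persist the new attempt count for a later retry.
    await store.update(attempted);
    return 'retry';
  }
}
```

Keeping `cleanup` on the task definition means the plugin that owns the task type decides what extra state to remove, while the decision of when a task is finished stays inside the runner.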
Part of #79977 (step 2). Resolves #79977.

In this PR, I'm removing the recurring task defined by the actions plugin that removes unused `action_task_params` SOs. With the #152841 PR, tasks will no longer get marked as failed and we have a migration script (`excludeOnUpgrade`) that removes all tasks and action_task_params that are leftover during the migration https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94.

~~NOTE: I will hold off merging this PR until #152841 is merged.~~ (merged)

## To verify

Not much to test here, but on a Kibana from `main` there will be this task type running in the background and moving to this PR will cause the task to get deleted because it is part of the `REMOVED_TYPES` array in Task Manager.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
As we prepare to make Task Manager replaceable with a queue and scheduler service, we should limit the internals of Task Manager that are exposed to downstream plugins. We've made numerous changes over time to clean up failed task documents (#109655 & #96971), and now we should change Task Manager to handle this first-hand instead of leaving it to the downstream plugins (cleanup code).
As part of this effort, we should:
NOTE: We also need to find a way to still delete `action_task_params` when the task gets deleted. We should find a way to make this happen without having to handle this determination logic in the actions plugin. Some options include:
- references that can cascade delete

Original description
Should we consider cleaning up the index after a while?
Should we rely on the Event Log to keep track of failed Tasks which would allow us to then purge them from the index?
With the upcoming TM observability story (#77456) which executes scheduled queries against the whole index, it might be worth considering the potential size of this index.