
Failed Task Manager task documents are never cleaned up bloating the index #79977

Closed

gmmorris opened this issue Oct 8, 2020 · 8 comments · Fixed by #151873
Labels

  • discuss
  • estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
  • Feature:Task Manager
  • impact:medium (Addressing this issue will have a medium level of impact on the quality/strength of our product.)
  • resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

gmmorris commented Oct 8, 2020

As we prepare to make Task Manager replaceable with a queue and scheduler service, we should limit the internals of Task Manager that are exposed to downstream plugins. We've made numerous changes over time to clean up failed task documents (#109655 & #96971), and now we should change Task Manager to handle this firsthand instead of leaving the cleanup code to downstream plugins.

As part of this effort, we should:

NOTE: We also need to find a way to still delete action_task_params when the task gets deleted. We should find a way to make this happen without having to handle this determination logic in the actions plugin. Some options include:

  • A cleanup hook that the actions plugin can use to delete downstream SOs (see the sketch below)
  • Cascade-delete functionality based on SO references
  • Making the task SO capable of storing the API key encrypted
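
A minimal sketch of what the first option might look like; the `cleanup` hook name matches the one eventually added in #152841, but the interfaces below are illustrative assumptions, not the real Task Manager API:

```ts
// Illustrative shapes only; not the actual Task Manager interfaces.
interface SavedObjectsClientLike {
  delete(type: string, id: string): Promise<void>;
}

interface TaskRunner {
  run(): Promise<void>;
  // Called by Task Manager after it deletes the task document, giving the
  // owning plugin a chance to remove downstream saved objects.
  cleanup?(): Promise<void>;
}

// The actions plugin could use the hook to cascade-delete the
// action_task_params SO tied to the task.
function createActionExecutionTaskRunner(
  soClient: SavedObjectsClientLike,
  actionTaskParamsId: string
): TaskRunner {
  return {
    run: async () => {
      // ... load action_task_params, run the action ...
    },
    cleanup: async () => {
      await soClient.delete('action_task_params', actionTaskParamsId);
    },
  };
}
```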
Original description

Should we consider cleaning up the index after a while?
Should we rely on the Event Log to keep track of failed Tasks, which would allow us to then purge them from the index?

With the upcoming TM observability story (#77456), which executes scheduled queries against the whole index, it's worth considering the potential size of this index.

@gmmorris gmmorris added the Feature:Task Manager and Team:ResponseOps labels on Oct 8, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Oct 8, 2020

Initial feelz: we should have a recurring task that deletes old failed tasks. Is that the main source of "garbage" in the index? I'd think we could do something like delete them if > 1 week old (maybe 2, that seems kinda standard). Maybe run that once/day?
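
A rough sketch of what that recurring cleanup could look like, assuming direct Elasticsearch access; the index name and task fields match Task Manager's documents, but the wiring and retention window are illustrative:

```ts
import type { Client } from '@elastic/elasticsearch';

// Delete failed task documents older than a retention window.
// A one-week retention is assumed here, per the suggestion above.
async function deleteOldFailedTasks(es: Client, retention = '7d'): Promise<void> {
  await es.deleteByQuery({
    index: '.kibana_task_manager',
    query: {
      bool: {
        filter: [
          { term: { 'task.status': 'failed' } },
          { range: { 'task.runAt': { lte: `now-${retention}` } } },
        ],
      },
    },
    conflicts: 'proceed', // skip docs that changed concurrently instead of aborting
  });
}
```

Registered as a recurring task with a `1d` interval, this would match the once-a-day cadence suggested above.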

@gmmorris
Contributor Author

gmmorris commented Oct 9, 2020

> Initial feelz: we should have a recurring task that deletes old failed tasks. Is that the main source of "garbage" in the index? I'd think we could do something like delete them if > 1 week old (maybe 2, that seems kinda standard). Maybe run that once/day?

Yeah that was my initial gut feeling as well but we should do some research...

@mikecote
Contributor

With #90888 opened, this problem won't be as common for action execution tasks.

@gmmorris gmmorris added the loe:needs-research and resilience labels on Jul 14, 2021
@gmmorris gmmorris added the estimate:needs-research label on Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research label on Sep 2, 2021
@gmmorris gmmorris added the impact:medium label on Sep 16, 2021
@kobelb kobelb added the needs-team label on Jan 31, 2022
@botelastic botelastic bot removed the needs-team label on Jan 31, 2022
@mikecote
Contributor

Linking with #138344

@rudolf
Contributor

rudolf commented Jan 30, 2023

We have the following cleanup logic for tasks and action_task_params:

https://github.com/elastic/kibana/blob/main/x-pack/plugins/task_manager/server/saved_objects/index.ts#L29-L72
https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L82-L94

I'd imagine we'd use very similar, if not the same, logic to periodically clean up outside of migrations as well.
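
For reference, the pattern in those linked files looks roughly like this; a condensed sketch, not a verbatim copy of the registration code (the actual exclusion queries in the linked files are more involved):

```ts
import type { SavedObjectsType } from '@kbn/core/server';

export const taskSavedObjectType: SavedObjectsType = {
  name: 'task',
  hidden: true,
  namespaceType: 'agnostic',
  mappings: { properties: {} }, // real mappings elided for brevity
  // Returns a query matching documents to drop during an upgrade
  // migration instead of carrying them over to the new index.
  excludeOnUpgrade: async () => ({
    bool: {
      must: [
        { term: { type: 'task' } },
        { terms: { 'task.status': ['failed', 'unrecognized'] } },
      ],
    },
  }),
};
```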

@rudolf
Contributor

rudolf commented Jan 30, 2023

As part of #147237, we would no longer perform a reindex during upgrade migrations. This means that excludeOnUpgrade would no longer be "free": instead of just leaving behind the SOs we wanted to exclude, we would now have to actively delete them to clean up. As a result, cleaning up action_task_params with excludeOnUpgrade can actually increase the upgrade downtime instead of decreasing it.

@mikecote
Contributor

mikecote commented Mar 2, 2023

From #152223 (comment), let's make sure to clean up isRetryableBasedOnAttempts.

mikecote added a commit that referenced this issue Mar 27, 2023
…52841)

Part of #79977 (step 1 and 3).

In this PR, I'm making Task Manager remove tasks instead of updating
them with `status: failed` whenever a task is out of attempts. I've also
added an optional `cleanup` hook to the task runner that can be defined
if additional cleanup is necessary whenever a task has been deleted
(e.g. deleting `action_task_params`).
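
A simplified sketch of the behavior change described above; the store interface and field names are illustrative, not the actual Task Manager internals:

```ts
interface ConcreteTask {
  id: string;
  attempts: number;
  maxAttempts: number;
  schedule?: { interval: string }; // present only for recurring tasks
}

interface TaskStoreLike {
  remove(id: string): Promise<void>;
  update(task: ConcreteTask): Promise<void>;
}

async function handleFailedRun(
  task: ConcreteTask,
  store: TaskStoreLike,
  cleanup?: () => Promise<void>
): Promise<void> {
  const outOfAttempts = task.attempts >= task.maxAttempts;
  if (!task.schedule && outOfAttempts) {
    // Ad-hoc task with no retries left: delete it instead of persisting
    // it with `status: failed`, then let the owning plugin clean up
    // downstream SOs such as action_task_params.
    await store.remove(task.id);
    await cleanup?.();
  } else {
    // Recurring tasks, and tasks with attempts left, keep retrying.
    await store.update({ ...task, attempts: task.attempts + 1 });
  }
}
```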

## To verify an ad-hoc task that always fails

1. With this PR codebase, modify an action to always throw an error
2. Create an alerting rule that will invoke the action once
3. See the action fail three times
4. Observe the task SO is deleted (search by task type / action type)
alongside the action_task_params SO

## To verify Kibana crashing on the last ad-hoc task attempt

1. With this PR codebase, modify an action to always throw an error
(similar to the scenario above), but also add a 10s delay before the
error is thrown (`await new Promise((resolve) => setTimeout(resolve, 10000));`)
and a log message before the delay begins
2. Create an alerting rule that will invoke the action once
3. See the action fail twice
4. On the third run, crash Kibana while the action is waiting out the
10s delay; this will leave the task marked as running even though it
no longer is
5. Restart Kibana
6. Wait 5-10m until the task's retryAt is overdue
7. Observe the task and the action_task_params getting deleted

## To verify recurring tasks that continuously fail

1. With this PR codebase, modify a rule type to always throw an error
when it runs
2. Create an alerting rule of that type (with a short interval)
3. Observe the rule continuously running and not being affected by the
PR changes (recurring tasks should not be deleted)

Flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
mikecote added a commit that referenced this issue Mar 29, 2023
Part of #79977 (step 2).
Resolves #79977.

In this PR, I'm removing the recurring task defined by the actions
plugin that removes unused `action_task_params` SOs. With the
#152841 PR, tasks will no longer get marked as failed, and we have a
migration hook (`excludeOnUpgrade`) that removes all tasks and
action_task_params that are left over during the migration:
https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94.

~~NOTE: I will hold off merging this PR until
#152841 is merged.~~ (merged)

## To verify

Not much to test here, but on a Kibana from `main` this task type will
be running in the background; switching to this PR will cause the task
to get deleted because its type is now part of the `REMOVED_TYPES`
array in Task Manager.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
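
Roughly how the `REMOVED_TYPES` mechanism mentioned in the commit above works; the array shape is a sketch, and the task type name shown is assumed rather than confirmed by this thread:

```ts
// Sketch: task types listed here are treated as removed. Task Manager
// marks their documents as unrecognized and deletes them rather than
// attempting to run them.
export const REMOVED_TYPES: string[] = [
  // ...previously removed task types...
  'cleanup_failed_action_executions', // assumed name of the actions cleanup task
];

export function isRemovedTaskType(taskType: string): boolean {
  return REMOVED_TYPES.includes(taskType);
}
```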
jgowdyelastic pushed a commit to jgowdyelastic/kibana that referenced this issue Mar 30, 2023