Create task to cleanup action execution failures #96971

Merged: 19 commits into elastic:master on Apr 20, 2021

Conversation

mikecote (Contributor) commented on Apr 13, 2021:

Resolves #96577

In this PR, I'm creating a task that runs periodically to clean up action execution failures and their related action_task_params. As discussed here, there are two intervals used: a short one when there are more failures to clean up, and a longer one when there aren't any failures to clean up at a given time.

The configuration is unofficially available under xpack.actions.cleanupFailedExecutionsTask to enable/disable, change intervals and page size.
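For illustration, here is a hypothetical TypeScript shape for that config block. Only the root key (xpack.actions.cleanupFailedExecutionsTask) and the enable/interval/page-size knobs come from this description and the diff excerpts in this thread; the individual property names below are assumptions, not the ones merged in the PR.

```ts
// Hypothetical shape of the resolved config; property names are illustrative.
interface CleanupFailedExecutionsTaskConfig {
  enabled: boolean; // turn the cleanup task on or off
  cleanupInterval: string; // e.g. '5m' while a backlog of failures remains
  idleInterval: string; // e.g. '1h' once there is nothing left to clean up
  pageSize: number; // failed executions fetched and deleted per run
}
```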

@mikecote mikecote added the v8.0.0, release_note:skip, Feature:Actions, Team:ResponseOps, and v7.13.0 labels on Apr 13, 2021
@mikecote mikecote self-assigned this Apr 13, 2021
@@ -50,6 +50,11 @@ export const configSchema = schema.object({
rejectUnauthorized: schema.boolean({ defaultValue: true }),
maxResponseContentLength: schema.byteSize({ defaultValue: '1mb' }),
responseTimeout: schema.duration({ defaultValue: '60s' }),
cleanupFailedExecutionsTask: schema.object({
enabled: schema.boolean({ defaultValue: true }),
interval: schema.duration({ defaultValue: '15s' }),
Member commented:

15 seconds? Is that just so you can test this? :-)

I was thinking more like 12 or 24 hours ... maybe an hour?

mikecote (Contributor, Author) replied on Apr 19, 2021:

Yeah, this is a mistake, though I was thinking something small, like 5m. I'm trying to figure out how long it would take to clean up environments for customers with 400,000 SOs (saved objects).

1h processing 100 at a time = ~167 days 🤔
5m processing 100 at a time = ~14 days

It feels like there are two modes: one to clean up a large amount of data (short interval) and one to clean up executions that slipped through (long interval). In theory, if I fix the task to always be successful, this feature would only be needed for the former (cleaning up a large amount of data) and would never be necessary again.

Member replied:

Ya, that kinda seems right - long interval that always runs, and a special one that cleans up a large backlog. Could this be two separate tasks? Maybe we'd have the large backlog task be disabled, and then enable it if we see > 1000 docs that need to be cleaned. Once that task went under that threshold (1000 in this case), it would disable itself.

mikecote (Contributor, Author) replied on Apr 19, 2021:

> Could this be two separate tasks?

We can return different intervals after each run. I could return, for example, 5m if there are still docs left to clean up after a run, or 1h / 12h / 24h if there are none. This way, we only have one task to worry about. What are your thoughts?
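A minimal sketch of that idea, assuming a simplified task runner contract; the interfaces below are stand-ins, not Kibana Task Manager's real types, and the interval values are the ones floated in this thread.

```ts
// Simplified stand-in for a Task Manager run result; real Kibana types differ.
interface RunResult {
  state: Record<string, unknown>;
  schedule: { interval: string };
}

async function runCleanup(
  state: Record<string, unknown>,
  cleanupOnePage: () => Promise<{ remaining: number }>
): Promise<RunResult> {
  const { remaining } = await cleanupOnePage();
  // Come back quickly (e.g. 5m) while failures remain; back off (e.g. 1h) otherwise.
  return {
    state,
    schedule: { interval: remaining > 0 ? '5m' : '1h' },
  };
}
```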

Member replied:

> We can return different intervals after each run.

That should work out well. We should probably add a counter for the number of runs to set in alerting health, so we can tell from a report if it appears to be firing too often, or too infrequently.

I'm wondering if we should log an INFO for the long runs, indicating how many things have been deleted. Probably overkill - I did notice some debug logging which is probably good enough

mikecote (Contributor, Author) replied:

> That should work out well. We should probably add a counter for the number of runs to set in alerting health, so we can tell from a report if it appears to be firing too often, or too infrequently.

I capture runs and total_cleaned_up within the task manager state. Do you think that would suffice?

> I'm wondering if we should log an INFO for the long runs, indicating how many things have been deleted. Probably overkill - I did notice some debug logging which is probably good enough

I fear if the user has 400k docs to delete that it would be 4000 info logs (one every 5m) and flood their logs over time.
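For reference, a small sketch of the counters mentioned above; the runs and total_cleaned_up field names come from the comment, while the helper itself is hypothetical.

```ts
// Counters kept in task state across runs (field names per the comment above).
interface CleanupTaskState {
  runs: number;
  total_cleaned_up: number;
}

// Hypothetical helper: fold one run's result into the accumulated state.
function updateCleanupState(prev: CleanupTaskState, removedThisRun: number): CleanupTaskState {
  return {
    runs: prev.runs + 1,
    total_cleaned_up: prev.total_cleaned_up + removedThisRun,
  };
}
```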

mikecote (Contributor, Author) commented:

@elasticmachine merge upstream

@mikecote mikecote added the release_note:fix label and removed the release_note:skip label on Apr 20, 2021
@mikecote mikecote marked this pull request as ready for review April 20, 2021 12:46
@mikecote mikecote requested a review from a team as a code owner April 20, 2021 12:46
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

chrisronline (Contributor) commented:

Will this PR be somewhat moot based on the work done for #90888? I'm not sure if we know the full direction yet but I'm worried these changes might need to all be backed out in the near future, unless I'm missing something?

mikecote (Contributor, Author) replied:

> Will this PR be somewhat moot based on the work done for #90888? I'm not sure if we know the full direction yet but I'm worried these changes might need to all be backed out in the near future, unless I'm missing something?

@chrisronline The assumption is correct. This PR is the short-term fix to clean up existing objects on existing setups. The work done in #90888 will prevent creating future objects that need to be cleaned up. At some point, we should be able to revert this code.

ymao1 (Contributor) left a comment:

LGTM! Verified by building up a bunch of bad email actions, then switching to this PR and watching them all disappear!

logger.debug(
`Removing ${result.saved_objects.length} of ${result.total} failed execution task(s)`
);
const cleanupResult = await cleanupTasks({
Contributor commented:

nit: should we check for result.saved_objects.length > 0 and skip calling cleanupTasks if no failed tasks are found?

mikecote (Contributor, Author) replied:

The bulk delete request fails when there isn't anything to delete. I've moved the check to https://github.com/elastic/kibana/pull/96971/files#diff-676722bd4c174cfa74b5a0f77c8a5687e081ff306bc2376dbc2cdfa87b199333R16, but I'm happy to add another here.
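A sketch of the kind of guard being discussed, assuming a hypothetical bulkDelete stand-in for the bulk request the task issues; the point is simply to skip the call when the page is empty.

```ts
// Hypothetical cleanup helper: skip the bulk request entirely when there is
// nothing to delete, since an empty bulk body would be rejected.
async function cleanupTasks(
  tasks: Array<{ id: string }>,
  bulkDelete: (ids: string[]) => Promise<void>
): Promise<{ deleted: number }> {
  if (tasks.length === 0) {
    return { deleted: 0 };
  }
  await bulkDelete(tasks.map((task) => task.id));
  return { deleted: tasks.length };
}
```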

ymao1 (Contributor) commented on Apr 20, 2021:

> The configuration is unofficially available under xpack.actions.cleanupFailedExecutionsTask to enable/disable, change intervals and page size.

Do we still need to allow-list these settings in cloud, even if they're not officially documented, in case there is a cloud deployment that needs to change them?

mikecote (Contributor, Author) replied:

> Do we still need to allow-list these settings in cloud, even if they're not officially documented, in case there is a cloud deployment that needs to change them?

I think it would be worth it just in case a customer needs to change a value on short notice.

pmuellr (Member) left a comment:

LGTM - a few non-critical comments / questions

@@ -3,7 +3,7 @@
"version": "1.0.0",
"kibanaVersion": "kibana",
"configPath": ["xpack"],
"requiredPlugins": ["taskManager", "features", "actions", "alerting", "encryptedSavedObjects"],
"requiredPlugins": ["taskManager", "features", "actions", "alerting", "encryptedSavedObjects", "actions"],
Member commented:

"actions" was already in the list, don't think it needs to be added here.

mikecote (Contributor, Author) replied:

🤦 I guess I really realllly wanted to make sure the plugin was required 😄 Fixed in 4b82a46.

actionTypeRegistry
.list()
.map((actionType) =>
nodeBuilder.is('task.attributes.taskType', `actions:${actionType.id}`)
Member commented:

I guess this means we won't clean up anything but failed actions? Seems ok for now, but I suspect we'll end up with some other task types in the future which also leave tombstones.

mikecote (Contributor, Author) replied:

Yeah, agreed. I felt safe doing this for actions to begin with since they are tracked in the event log, while other ad-hoc tasks aren't (yet).
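To illustrate the scoping, here is a plain-string version of the filter that the snippet above builds with nodeBuilder; the status clause and exact field paths are assumptions for illustration, not the PR's exact query.

```ts
// Illustrative only: a KQL-style filter limiting cleanup to failed action
// tasks, one clause per registered action type.
function buildFailedActionTaskFilter(actionTypeIds: string[]): string {
  const taskTypeClauses = actionTypeIds
    .map((id) => `task.attributes.taskType: "actions:${id}"`)
    .join(' or ');
  return `task.attributes.status: "failed" and (${taskTypeClauses})`;
}

// Example: buildFailedActionTaskFilter(['.email', '.index']) only matches
// failed runs of those action types, leaving other task types untouched.
```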

@@ -50,6 +50,12 @@ export const configSchema = schema.object({
rejectUnauthorized: schema.boolean({ defaultValue: true }),
maxResponseContentLength: schema.byteSize({ defaultValue: '1mb' }),
responseTimeout: schema.duration({ defaultValue: '60s' }),
cleanupFailedExecutionsTask: schema.object({
Member commented:

Hmm, doesn't this mean you HAVE to include this property (though {} would be fine, since all the sub-properties have defaults)? Or is config-schema smart enough to realize all the sub-properties are optional, so the parent is also essentially optional?

mikecote (Contributor, Author) replied:

I did a local test. It's smart enough to fill in the default sub-properties automatically without requiring the root property to be defined.
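A minimal sketch of that local test, assuming the behaviour described above (config-schema cascades the nested defaults when the parent key is omitted).

```ts
import { schema } from '@kbn/config-schema';

const configSchema = schema.object({
  cleanupFailedExecutionsTask: schema.object({
    enabled: schema.boolean({ defaultValue: true }),
    interval: schema.duration({ defaultValue: '5m' }),
  }),
});

// No cleanupFailedExecutionsTask key in the raw config, yet validation
// succeeds and the nested defaults are filled in.
const config = configSchema.validate({});
console.log(config.cleanupFailedExecutionsTask.enabled); // true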

kibanamachine (Contributor) commented:

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

API count

id            before  after  diff
taskManager   41      43     +2

API count missing comments

id            before  after  diff
taskManager   16      18     +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @mikecote

@mikecote mikecote added the auto-backport label on Apr 20, 2021
@mikecote mikecote merged commit 0507ac5 into elastic:master Apr 20, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Apr 20, 2021
* Initial commit

* Add tests and support for concurrency

* Ability to disable functionality, use bulk APIs

* Fix type check

* Fix jest tests

* Cleanup

* Cleanup pt2

* Add unit tests

* Fix type check

* Fixes

* Update test failures

* Split schedule between cleanup and idle

* Add functional tests

* Add one more test

* Cleanup repeated code

* Remove duplicate actions plugin requirement

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine (Contributor) commented:

💚 Backport successful

Branch: 7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Apr 21, 2021
madirey pushed a commit to madirey/kibana that referenced this pull request May 11, 2021
Labels
auto-backport (Deprecated: Automatically backport this PR after it's merged), Feature:Actions, release_note:fix, Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams), v7.13.0, v8.0.0
Development

Successfully merging this pull request may close these issues.

[Actions] A build up of failed actions can cause issues when upgrading Kibana
6 participants