[Task Manager] Evenly distribute bulk-enabled alerting rules #172742

Zacqary · 2023-12-06T22:25:10Z

Summary

~~Closes~~ Part of #171980

When bulkEnableing more than 1 task, adds a random delay to each subsequent task's runAt and scheduledAt to more evenly distribute their execution times. This offset is a maximum of 5 minutes, or the task's interval, whichever is shorter.

As per Slack discussion with @mikecote, this is a random distribution of execution times instead of a predictable, algorithmic offset. We believe that a random distribution will do a better job of avoiding spikes than anything more directed.

Checklist

Unit or functional tests were updated or added to match the most common scenarios

apmmachine · 2023-12-06T22:25:24Z

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
/oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-12-19T20:10:24Z

Pinging @elastic/response-ops (Team:ResponseOps)

kibana-ci · 2023-12-19T21:19:07Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 5cd0021

Failed CI Steps

Test Failures

[job] [logs] Investigations - Security Solution Cypress Tests #7 / [ESS] Save Timeline Prompts "before each" hook for "should NOT prompt when navigating with a changed and unsaved timeline from the page where timeline is disabled" "before each" hook for "should NOT prompt when navigating with a changed and unsaved timeline from the page where timeline is disabled"

Metrics [docs]

✅ unchanged

History

💛 Build #182316 was flaky 067b2cf

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

mikecote · 2023-12-20T16:06:51Z

I learned a few things as I started reviewing this PR, we'll need to address these before we benefit from this change:

Security solution bulk enable still calls the alerting enable API , one rule at a time (this prevents the bulkEnable flow to be utilized)
After using our bulkEnableRules API, the underlying code currently schedules the underlying tasks one-by-one whenever it's missing (including the flow of installing then enabling the prebuild security rules)
The bulk enable API floods Elasticsearch with requests trying to migrate legacy actions [ERROR][plugins.alerting] migrateLegacyActions(): Failed to migrate legacy actions for SIEM rule c390e56e-c493-48a8-8d9c-b67ae0f7a072: connect ECONNRESET 127.0.0.1:9200 - Local: unknown:unknown, Remote: unknown:unknown

Happy to brainstorm if needed.

cc @kobelb

Zacqary · 2023-12-20T16:39:40Z

@mikecote Should we merge this PR but change the definition of done on the issue? I feel like this change is at least part of the solution.

mikecote · 2023-12-20T17:07:57Z

@mikecote Should we merge this PR but change the definition of done on the issue? I feel like this change is at least part of the solution.

Yes we can continue with this PR as part of the solution. I'll continue my review in it's current state but I think we'll need to hold off closing the #171980 issue until we fix the last two bullet points that I mentioned earlier, and we'll need to ask or work with Security Solution to move to use our bulk enable API. cc @XavierM

I think the main goal @kobelb wants solved is when enabling > 1,000 security solution rules that we don't make the K8s autoscaler go crazy. I think it would need the bullet points I mentioned earlier to be solved.

mikecote

Changes LGTM! Tested locally by creating some rules in stack management and using the bulk enable / disable API to see the calculations apply on enable.

resolves: #171980 This is a follow-on issue of: #172742 Above issue randomises `runAt` of the bulk enabled rules. And creates new tasks (by using `scheduleTask` for each one of them) if they don't have any. But, as we create the tasks already enabled `bulkEnable` method of the TM skips them. This PR replaces scheduleTask with bulkSchedule and creates task as disabled, so `bulkEnable` can pick them up. ## To verify: Add Security Solutions' [prebuilt detection rules](http://localhost:5601/app/security/rules/management?sourcerer=(default:(id:security-solution-default,selectedPatterns:!()))&timeline=(activeTab:query,graphEventId:%27%27,isOpen:!f)) and bulk enable them after installing all.

resolves: elastic#171980 This is a follow-on issue of: elastic#172742 Above issue randomises `runAt` of the bulk enabled rules. And creates new tasks (by using `scheduleTask` for each one of them) if they don't have any. But, as we create the tasks already enabled `bulkEnable` method of the TM skips them. This PR replaces scheduleTask with bulkSchedule and creates task as disabled, so `bulkEnable` can pick them up. ## To verify: Add Security Solutions' [prebuilt detection rules](http://localhost:5601/app/security/rules/management?sourcerer=(default:(id:security-solution-default,selectedPatterns:!()))&timeline=(activeTab:query,graphEventId:%27%27,isOpen:!f)) and bulk enable them after installing all.

[RAM] Add initial bulk enable task distributor function

067b2cf

Zacqary added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Dec 6, 2023

Zacqary mentioned this pull request Dec 6, 2023

Evenly distribute bulk-enabled alerting rules #171980

Closed

Zacqary added the v8.13.0 label Dec 18, 2023

Zacqary added 3 commits December 19, 2023 11:06

Switch distribution algorithm to random

9a72d81

Move random offset function outside bulkEnable

7e39817

Add test for offset

fcc2323

Zacqary marked this pull request as ready for review December 19, 2023 20:10

Zacqary requested a review from a team as a code owner December 19, 2023 20:10

Zacqary added the release_note:enhancement label Dec 19, 2023

Merge remote-tracking branch 'upstream/main' into 171980-bulk-distribute

5cd0021

mikecote approved these changes Dec 20, 2023

View reviewed changes

Zacqary merged commit 00a2f49 into elastic:main Dec 20, 2023
38 checks passed

kibanamachine added the backport:skip This commit does not require backporting label Dec 20, 2023

ersin-erdal mentioned this pull request Jan 15, 2024

Evenly distribute bulk-enabled alerting rules #174656

Merged

banderror mentioned this pull request Mar 25, 2024

[Security Solution] Bulk enabling many rules can result in flooding Task Manager #171350

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Task Manager] Evenly distribute bulk-enabled alerting rules #172742

[Task Manager] Evenly distribute bulk-enabled alerting rules #172742

Zacqary commented Dec 6, 2023 •

edited

apmmachine commented Dec 6, 2023

elasticmachine commented Dec 19, 2023

kibana-ci commented Dec 19, 2023

mikecote commented Dec 20, 2023

Zacqary commented Dec 20, 2023

mikecote commented Dec 20, 2023

mikecote left a comment

[Task Manager] Evenly distribute bulk-enabled alerting rules #172742

[Task Manager] Evenly distribute bulk-enabled alerting rules #172742

Conversation

Zacqary commented Dec 6, 2023 • edited

Summary

Checklist

apmmachine commented Dec 6, 2023

🤖 GitHub comments

elasticmachine commented Dec 19, 2023

kibana-ci commented Dec 19, 2023

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

Metrics [docs]

History

mikecote commented Dec 20, 2023

Zacqary commented Dec 20, 2023

mikecote commented Dec 20, 2023

mikecote left a comment

Choose a reason for hiding this comment

Zacqary commented Dec 6, 2023 •

edited