Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] Evenly distribute bulk-enabled alerting rules #172742

Merged
merged 5 commits into from Dec 20, 2023

Conversation

Zacqary
Copy link
Contributor

@Zacqary Zacqary commented Dec 6, 2023

Summary

Closes Part of #171980

When bulkEnableing more than 1 task, adds a random delay to each subsequent task's runAt and scheduledAt to more evenly distribute their execution times. This offset is a maximum of 5 minutes, or the task's interval, whichever is shorter.

As per Slack discussion with @mikecote, this is a random distribution of execution times instead of a predictable, algorithmic offset. We believe that a random distribution will do a better job of avoiding spikes than anything more directed.

Checklist

@Zacqary Zacqary added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Dec 6, 2023
@apmmachine
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@Zacqary Zacqary marked this pull request as ready for review December 19, 2023 20:10
@Zacqary Zacqary requested a review from a team as a code owner December 19, 2023 20:10
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Investigations - Security Solution Cypress Tests #7 / [ESS] Save Timeline Prompts "before each" hook for "should NOT prompt when navigating with a changed and unsaved timeline from the page where timeline is disabled" "before each" hook for "should NOT prompt when navigating with a changed and unsaved timeline from the page where timeline is disabled"

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mikecote
Copy link
Contributor

I learned a few things as I started reviewing this PR, we'll need to address these before we benefit from this change:

  • Security solution bulk enable still calls the alerting enable API , one rule at a time (this prevents the bulkEnable flow to be utilized)
  • After using our bulkEnableRules API, the underlying code currently schedules the underlying tasks one-by-one whenever it's missing (including the flow of installing then enabling the prebuild security rules)
  • The bulk enable API floods Elasticsearch with requests trying to migrate legacy actions [ERROR][plugins.alerting] migrateLegacyActions(): Failed to migrate legacy actions for SIEM rule c390e56e-c493-48a8-8d9c-b67ae0f7a072: connect ECONNRESET 127.0.0.1:9200 - Local: unknown:unknown, Remote: unknown:unknown

Happy to brainstorm if needed.

cc @kobelb

@Zacqary
Copy link
Contributor Author

Zacqary commented Dec 20, 2023

@mikecote Should we merge this PR but change the definition of done on the issue? I feel like this change is at least part of the solution.

@mikecote
Copy link
Contributor

@mikecote Should we merge this PR but change the definition of done on the issue? I feel like this change is at least part of the solution.

Yes we can continue with this PR as part of the solution. I'll continue my review in it's current state but I think we'll need to hold off closing the #171980 issue until we fix the last two bullet points that I mentioned earlier, and we'll need to ask or work with Security Solution to move to use our bulk enable API. cc @XavierM

I think the main goal @kobelb wants solved is when enabling > 1,000 security solution rules that we don't make the K8s autoscaler go crazy. I think it would need the bullet points I mentioned earlier to be solved.

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM! Tested locally by creating some rules in stack management and using the bulk enable / disable API to see the calculations apply on enable.

@Zacqary Zacqary merged commit 00a2f49 into elastic:main Dec 20, 2023
38 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Dec 20, 2023
ersin-erdal added a commit that referenced this pull request Jan 17, 2024
resolves: #171980

This is a follow-on issue of:
#172742
Above issue randomises `runAt` of the bulk enabled rules. And creates
new tasks (by using `scheduleTask` for each one of them) if they don't
have any.
But, as we create the tasks already enabled `bulkEnable` method of the
TM skips them.

This PR replaces scheduleTask with bulkSchedule and creates task as
disabled, so `bulkEnable` can pick them up.

## To verify:

Add Security Solutions' [prebuilt detection
rules](http://localhost:5601/app/security/rules/management?sourcerer=(default:(id:security-solution-default,selectedPatterns:!()))&timeline=(activeTab:query,graphEventId:%27%27,isOpen:!f))
and bulk enable them after installing all.
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
resolves: elastic#171980

This is a follow-on issue of:
elastic#172742
Above issue randomises `runAt` of the bulk enabled rules. And creates
new tasks (by using `scheduleTask` for each one of them) if they don't
have any.
But, as we create the tasks already enabled `bulkEnable` method of the
TM skips them.

This PR replaces scheduleTask with bulkSchedule and creates task as
disabled, so `bulkEnable` can pick them up.

## To verify:

Add Security Solutions' [prebuilt detection
rules](http://localhost:5601/app/security/rules/management?sourcerer=(default:(id:security-solution-default,selectedPatterns:!()))&timeline=(activeTab:query,graphEventId:%27%27,isOpen:!f))
and bulk enable them after installing all.
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
resolves: elastic#171980

This is a follow-on issue of:
elastic#172742
Above issue randomises `runAt` of the bulk enabled rules. And creates
new tasks (by using `scheduleTask` for each one of them) if they don't
have any.
But, as we create the tasks already enabled `bulkEnable` method of the
TM skips them.

This PR replaces scheduleTask with bulkSchedule and creates task as
disabled, so `bulkEnable` can pick them up.

## To verify:

Add Security Solutions' [prebuilt detection
rules](http://localhost:5601/app/security/rules/management?sourcerer=(default:(id:security-solution-default,selectedPatterns:!()))&timeline=(activeTab:query,graphEventId:%27%27,isOpen:!f))
and bulk enable them after installing all.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:enhancement Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants