[Task Manager] Evenly distribute bulk-enabled alerting rules #172742
Pinging @elastic/response-ops (Team:ResponseOps)
I learned a few things as I started reviewing this PR. We'll need to address these before we benefit from this change:
Happy to brainstorm if needed. cc @kobelb
@mikecote Should we merge this PR but change the definition of done on the issue? I feel like this change is at least part of the solution.
Yes, we can continue with this PR as part of the solution. I'll continue my review in its current state, but I think we'll need to hold off closing the #171980 issue until we fix the last two bullet points that I mentioned earlier, and we'll need to ask or work with Security Solution to move to our bulk enable API. cc @XavierM I think the main goal @kobelb wants solved is that enabling more than 1,000 Security Solution rules doesn't make the K8s autoscaler go crazy. I think that requires solving the bullet points I mentioned earlier.
Changes LGTM! Tested locally by creating some rules in stack management and using the bulk enable / disable API to see the calculations apply on enable.
resolves: #171980

This is a follow-on to #172742, which randomises the `runAt` of bulk-enabled rules and creates new tasks (by calling `scheduleTask` for each one) for rules that don't have any. But because those tasks are created already enabled, the Task Manager's `bulkEnable` method skips them. This PR replaces `scheduleTask` with `bulkSchedule` and creates the tasks as disabled, so `bulkEnable` can pick them up.

## To verify:
Add Security Solution's [prebuilt detection rules](http://localhost:5601/app/security/rules/management?sourcerer=(default:(id:security-solution-default,selectedPatterns:!()))&timeline=(activeTab:query,graphEventId:%27%27,isOpen:!f)), install them all, and then bulk enable them.
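To illustrate why a task created as already enabled is skipped, here is a minimal in-memory sketch. It is not the Kibana Task Manager API: `InMemoryTaskStore`, `StoredTask`, and the method signatures are hypothetical and only borrow the names `bulkSchedule` and `bulkEnable` from the description above. It assumes that `bulkEnable` acts only on tasks that are currently disabled.

```typescript
// Hypothetical task shape, reduced to the one field that matters here.
interface StoredTask {
  id: string;
  enabled: boolean;
}

// Hypothetical stand-in for a task store; not the real Task Manager.
class InMemoryTaskStore {
  private tasks = new Map<string, StoredTask>();

  // Stores all given tasks in one call, keeping their enabled flag as passed.
  bulkSchedule(tasks: StoredTask[]): void {
    for (const task of tasks) {
      this.tasks.set(task.id, { ...task });
    }
  }

  // Enables the given tasks and returns the ids it actually changed.
  // Assumption: tasks that are already enabled are skipped, so any
  // adjustment applied at enable time never reaches them.
  bulkEnable(ids: string[]): string[] {
    const updated: string[] = [];
    for (const id of ids) {
      const task = this.tasks.get(id);
      if (!task || task.enabled) {
        continue;
      }
      task.enabled = true;
      updated.push(id);
    }
    return updated;
  }
}

const store = new InMemoryTaskStore();
store.bulkSchedule([
  { id: 'created-enabled', enabled: true },
  { id: 'created-disabled', enabled: false },
]);
const picked = store.bulkEnable(['created-enabled', 'created-disabled']);
console.log(picked);
```

In this model only the task created as disabled is returned by `bulkEnable`, which mirrors the reason given above for scheduling tasks as disabled first.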
Summary
Part of #171980

When `bulkEnable`ing more than 1 task, adds a random delay to each subsequent task's `runAt` and `scheduledAt` to more evenly distribute their execution times. This offset is a maximum of 5 minutes, or the task's interval, whichever is shorter.

As per Slack discussion with @mikecote, this is a random distribution of execution times instead of a predictable, algorithmic offset. We believe that a random distribution will do a better job of avoiding spikes than anything more directed.
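The offset rule described above can be sketched as follows. This is a simplified illustration, not the code in this PR: `TaskLike`, `spreadRunAt`, `MAX_OFFSET_MS`, and the injectable `random` parameter are hypothetical names, and it assumes the first task in the batch is left without a delay while each subsequent task gets an independent random offset.

```typescript
// Upper bound on the random delay: 5 minutes, per the summary above.
const MAX_OFFSET_MS = 5 * 60 * 1000;

// Hypothetical, reduced task shape for illustration only.
interface TaskLike {
  id: string;
  intervalMs: number;
  runAt: Date;
  scheduledAt: Date;
}

// Returns copies of the tasks with runAt and scheduledAt spread out.
// `random` is injectable so the behaviour can be tested deterministically.
function spreadRunAt(
  tasks: TaskLike[],
  now: Date = new Date(),
  random: () => number = Math.random
): TaskLike[] {
  return tasks.map((task, index) => {
    // The delay is capped at 5 minutes or the task's own interval,
    // whichever is shorter.
    const cap = Math.min(MAX_OFFSET_MS, task.intervalMs);
    // Assumption: the first task keeps the base time; later ones are delayed.
    const offsetMs = index === 0 ? 0 : Math.floor(random() * cap);
    const when = new Date(now.getTime() + offsetMs);
    return { ...task, runAt: when, scheduledAt: when };
  });
}

const base = new Date();
const spread = spreadRunAt(
  [
    { id: 'rule-1', intervalMs: 60_000, runAt: base, scheduledAt: base },
    { id: 'rule-2', intervalMs: 60_000, runAt: base, scheduledAt: base },
    { id: 'rule-3', intervalMs: 3_600_000, runAt: base, scheduledAt: base },
  ],
  base
);
for (const task of spread) {
  console.log(task.id, task.runAt.getTime() - base.getTime());
}
```

Under these assumptions, a rule with a 1 minute interval is delayed by less than 1 minute, and a rule with a 1 hour interval is delayed by less than 5 minutes.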
Checklist