[PoC] Attaining 10x alerting throughput (32,000 rules per minute) #182394
Draft: mikecote wants to merge 70 commits into elastic:main from mikecote:task-manager-32k (+3,976 −552)
mikecote changed the title from "Task Manager 32k" to "[PoC] Attaining 10x alerting throughput (32,000 rules per minute)" on Jul 3, 2024
In this PoC, I made the improvements listed below to raise the alerting scalability ceiling (rules per minute) by at least 10x. The scenario used is creating ES Query rules that run every minute against sample indices and never detect alerts.
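As a rough illustration of what the headline numbers imply, the arithmetic can be sketched as below. Only the 32,000 rules per minute target, the 10x factor, and the one-minute rule interval come from this PR; the function names are hypothetical and exist only for this example.

```typescript
// Figures taken from the PR title and scenario; the helpers are illustrative only.
const targetRulesPerMinute = 32_000; // from the PR title
const scaleFactor = 10;              // "10x" from the PR title

// Baseline implied by "10x": target divided by the scale factor.
function impliedBaselinePerMinute(target: number, factor: number): number {
  return target / factor;
}

// Each rule runs once per minute in the scenario, so rules per minute
// equals rule executions per minute; divide by 60 for a per-second rate.
function executionsPerSecond(rulesPerMinute: number): number {
  return rulesPerMinute / 60;
}

const baseline = impliedBaselinePerMinute(targetRulesPerMinute, scaleFactor);
const perSecond = executionsPerSecond(targetRulesPerMinute);
console.log(`implied baseline: ${baseline} rules/min`);
console.log(`target rate: ${perSecond.toFixed(1)} executions/s`);
```

In other words, the target corresponds to a little over 500 rule executions per second across the whole cluster, which is the load the changes listed below have to absorb.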
List of improvements
- [...] `.kibana_task_manager` and `.kibana_alerting_cases` index configurations to `3` shards
- [...] `50` from `10`
- [...] `500ms` from `3s`
- [...] `360` partitions
- [...] `360` partitions in a round-robin manner
- [...] `xpack.alerting.maxScheduledPerMinute` to `1000000` to increase the upper bound limit
- [...] `mget` as the task-claiming strategy
- [...] `dataViews` and `searchSourceClient` alerting rule executor services when not necessary (Lazy load dataViews and wrappedSearchSourceClient services when running alerting rules #184322)
- [...] `claiming` phase of tasks (Make the mget task claimer skip the `claiming` phase and update the task document directly to `running` #184739)

Test scenario
- [...] `1m` interval

Notes
- [...] `_has_privileges` API calls to Elasticsearch, we need to set the `.security` index settings to have `auto_expand_replicas: 0-all` so not only one node is capable of performing the requests
- [...] (`xpack.security.authc.api_key.cache.max_keys: 50000`)

Conclusion
These optimizations have shown that we can attain a 10x scale with the alerting system. However, during further testing, I was able to push the limits even further, attaining much more than 10x in various ES and Kibana configurations, confirming that this approach will break the horizontal scalability ceiling that we previously had.
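The list of improvements mentions 360 partitions handed out in a round-robin manner. A minimal sketch of that idea follows, assuming each node can see the same list of node ids and that a task's partition is derived from its id. The helper names `assignPartitions` and `partitionForTask`, and the simple string hash, are hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical sketch: distribute partition ids 0..partitionCount-1 across nodes.
// Node ids are de-duplicated and sorted so every node computes the same result.
function assignPartitions(
  nodeIds: string[],
  partitionCount: number = 360
): Record<string, number[]> {
  const nodes = Array.from(new Set(nodeIds)).sort();
  if (nodes.length === 0) {
    throw new Error("at least one node is required");
  }
  const assignment: Record<string, number[]> = {};
  for (const id of nodes) {
    assignment[id] = [];
  }
  // Round-robin: partition p goes to node number (p mod node count).
  for (let p = 0; p < partitionCount; p++) {
    assignment[nodes[p % nodes.length]].push(p);
  }
  return assignment;
}

// Hypothetical: map a task id to a partition with a small deterministic hash.
// The real hashing scheme is not described in this PR text.
function partitionForTask(taskId: string, partitionCount: number = 360): number {
  let hash = 0;
  for (let i = 0; i < taskId.length; i++) {
    hash = (hash * 31 + taskId.charCodeAt(i)) >>> 0; // keep within unsigned 32 bits
  }
  return hash % partitionCount;
}

// Usage: with three nodes, each node owns 120 of the 360 partitions and
// only needs to consider tasks whose partition falls in its own subset.
const owned = assignPartitions(["kibana-a", "kibana-b", "kibana-c"]);
const taskPartition = partitionForTask("example-task-id");
const claimable = owned["kibana-a"].includes(taskPartition);
console.log(owned["kibana-a"].length, taskPartition, claimable);
```

Because every node works on a disjoint subset of partitions, nodes do not compete for the same task documents, which is what lets throughput grow as nodes are added.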