[PoC] Attaining 10x alerting throughput (32,000 rules per minute) #182394

Draft: wants to merge 70 commits into main

mikecote (Contributor) commented May 2, 2024

In this PoC, I made the improvements listed below to raise the alerting scalability ceiling (rules per minute) by at least 10x. The test scenario creates ES Query rules that run every minute against sample indices and are written so that they never detect alerts.

List of improvements

Test scenario

  • Elasticsearch Query rule
  • 1m interval
  • Queries either one of the sample indices or the event log index
  • Rule is written so that it detects no alerts after running the query
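
As a rough illustration of the scenario above, the sketch below builds a request body for Kibana's create-rule API (POST /api/alerting/rule) for an Elasticsearch query rule on a 1m schedule. The helper name `buildNoAlertEsQueryRule` is hypothetical, and the PoC does not say how its rules avoid detecting alerts; an unreachably high threshold is only one assumed way to do it. Field names should be checked against the Kibana version in use.

```typescript
// Hypothetical helper, not taken from the PoC: shapes a create-rule request body
// for an Elasticsearch query rule that runs every minute.
interface EsQueryRuleBody {
  name: string;
  rule_type_id: string;
  consumer: string;
  schedule: { interval: string };
  actions: unknown[];
  params: {
    searchType: string;
    index: string[];
    timeField: string;
    esQuery: string; // the query DSL is passed as a JSON string
    size: number;
    timeWindowSize: number;
    timeWindowUnit: string;
    thresholdComparator: string;
    threshold: number[];
  };
}

function buildNoAlertEsQueryRule(name: string, index: string, timeField: string): EsQueryRuleBody {
  return {
    name,
    rule_type_id: '.es-query',
    consumer: 'alerts',
    schedule: { interval: '1m' }, // 1m interval, as in the test scenario
    actions: [],
    params: {
      searchType: 'esQuery',
      index: [index],
      timeField,
      esQuery: JSON.stringify({ query: { match_all: {} } }),
      size: 0,
      timeWindowSize: 1,
      timeWindowUnit: 'm',
      // Assumption: a threshold far above any realistic hit count, so the
      // query still runs every minute but the rule never detects an alert.
      thresholdComparator: '>',
      threshold: [1_000_000_000_000],
    },
  };
}

// Example: a rule against one of the Kibana sample data indices.
const exampleRule = buildNoAlertEsQueryRule('poc-rule-1', 'kibana_sample_data_logs', 'timestamp');
console.log(JSON.stringify(exampleRule, null, 2));
```

Generating many such bodies with distinct names and posting them to Kibana would approximate the load described here; the actual rule definitions used in the PoC may differ.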

Notes

  • When search queries encounter HTTP 429 or 504 errors, increase the number of replicas on the affected index.
  • When _has_privileges API calls to Elasticsearch are slow or failing, set auto_expand_replicas: 0-all on the .security index so that more than one node can serve these requests.
  • The API key cache size must be raised above the number of alerting rules (xpack.security.authc.api_key.cache.max_keys: 50000).
  • If 429 errors are observed when updating a task or alerting rule saved object, inspect the Elasticsearch node involved and ensure it is not also hosting other primary shards for task or rule saved objects.
  • Rendezvous hashing did not seem to distribute the partitions evenly across Kibana instances; the round-robin method may be sufficient for now.
  • When pushing the boundaries, some Elasticsearch queries occasionally time out even though the majority return within seconds. Moving Elasticsearch to a multi-zone deployment seems to solve the issue.
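
To make the partition-distribution note concrete, here is a minimal sketch contrasting rendezvous (highest-random-weight) hashing with a plain round-robin assignment. It is not the PoC's implementation: the FNV-1a hash, the single-owner-per-partition model, the 256-partition count, and the node names are all assumptions chosen for illustration.

```typescript
// FNV-1a 32-bit string hash. A simple stand-in; the PoC may use a different hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Rendezvous hashing: each partition goes to the node with the highest
// hash(node, partition) score. Balance depends entirely on the hash spread.
function assignRendezvous(partitions: number[], nodes: string[]): Map<string, number[]> {
  if (nodes.length === 0) throw new Error('at least one node is required');
  const result = new Map<string, number[]>();
  for (const node of nodes) result.set(node, []);
  for (const partition of partitions) {
    let best = nodes[0];
    let bestScore = -1;
    for (const node of nodes) {
      const score = fnv1a(`${node}:${partition}`);
      if (score > bestScore) {
        best = node;
        bestScore = score;
      }
    }
    result.get(best)!.push(partition);
  }
  return result;
}

// Round-robin: nodes are sorted so every instance derives the same order,
// then partitions are dealt out in turn. Sizes differ by at most one.
function assignRoundRobin(partitions: number[], nodes: string[]): Map<string, number[]> {
  if (nodes.length === 0) throw new Error('at least one node is required');
  const sorted = [...nodes].sort();
  const result = new Map<string, number[]>();
  for (const node of sorted) result.set(node, []);
  partitions.forEach((partition, i) => {
    result.get(sorted[i % sorted.length])!.push(partition);
  });
  return result;
}

// Compare how many partitions each (hypothetical) Kibana node ends up owning.
const exampleNodes = ['kibana-a', 'kibana-b', 'kibana-c', 'kibana-d'];
const examplePartitions = Array.from({ length: 256 }, (_, i) => i);
for (const [label, assignment] of [
  ['rendezvous', assignRendezvous(examplePartitions, exampleNodes)],
  ['round-robin', assignRoundRobin(examplePartitions, exampleNodes)],
] as const) {
  const sizes = exampleNodes.map((node) => `${node}=${assignment.get(node)!.length}`);
  console.log(label, sizes.join(' '));
}
```

The trade-off in this sketch: round-robin guarantees near-equal partition counts but reshuffles most partitions when the node list changes, whereas rendezvous hashing only moves the partitions of a departed node but offers no guarantee of an even split for a small number of nodes.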

Conclusion

These optimizations show that the alerting system can attain 10x scale. During further testing, I was able to push the limits further still, attaining much more than 10x in various ES and Kibana configurations, which confirms that this approach breaks the horizontal scalability ceiling we previously had.

[Screenshot 2024-07-04 at 12 19 42 PM]

@mikecote added the ci:cloud-deploy label (Create or update a Cloud deployment) on May 2, 2024
mikecote commented May 2, 2024

/ci

mikecote commented May 2, 2024

/ci

@mikecote added the ci:cloud-redeploy label (Always create a new Cloud deployment) and removed the ci:cloud-deploy label (Create or update a Cloud deployment) on May 3, 2024
mikecote commented May 3, 2024

/ci

mikecote commented May 3, 2024

/ci

mikecote commented May 3, 2024

/ci

@mikecote added the ci:cloud-deploy label (Create or update a Cloud deployment) and removed the ci:cloud-redeploy label (Always create a new Cloud deployment) on May 3, 2024
mikecote commented May 3, 2024

/ci

mikecote commented May 3, 2024

/ci

mikecote commented May 3, 2024

/ci

mikecote commented May 6, 2024

/ci

mikecote commented May 7, 2024

/ci

@mikecote
/ci

mikecote commented Jul 2, 2024

/ci

@kibana-ci
Collaborator

💔 Build Failed

Failed CI Steps

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

mikecote commented Jul 3, 2024

/ci

@mikecote changed the title from "Task Manager 32k" to "[PoC] Attaining 10x alerting throughput (32,000 rules per minute)" on Jul 3, 2024
@mikecote added the ci:cloud-redeploy label (Always create a new Cloud deployment) and removed the ci:cloud-deploy label (Create or update a Cloud deployment) on Jul 3, 2024
mikecote commented Jul 3, 2024

/ci

@mikecote added the ci:cloud-deploy and ci:cloud-redeploy labels and removed the ci:cloud-redeploy and ci:cloud-deploy labels on Jul 3, 2024
@mikecote closed this on Jul 4, 2024
@mikecote reopened this on Jul 4, 2024
mikecote commented Jul 4, 2024

/ci

elasticmachine commented Jul 4, 2024

💔 Build Failed

Failed CI Steps

History

@mikecote added the ci:cloud-deploy and ci:cloud-redeploy labels and removed the ci:cloud-redeploy and ci:cloud-deploy labels on Jul 4, 2024
Labels
ci:cloud-deploy Create or update a Cloud deployment
4 participants