[Security Solution] Threshold rule performance fixes #113587

Closed
4 of 5 tasks
madirey opened this issue Oct 1, 2021 · 11 comments · Fixed by #131088
Labels
8.3 candidate bug Fixes for quality problems that affect the customer experience Feature:Detection Rules Anything related to Security Solution's Detection Rules impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. performance sdh-linked Team:Detection Alerts Security Detection Alerts Area Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.

Comments

@madirey
Contributor

madirey commented Oct 1, 2021

There are 2 known issues which can impact performance of a threshold rule over large indices. I've outlined some steps below that should be addressed ASAP:

@madirey madirey added bug Fixes for quality problems that affect the customer experience performance triage_needed Feature:Detection Rules Anything related to Security Solution's Detection Rules Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team labels Oct 1, 2021
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@MadameSheema
Member

@peluja1012 @banderror can you please add this bug to a project (if needed), add an impact and remove the triage label? Thanks!

@peluja1012 peluja1012 added Team:Detection Alerts Security Detection Alerts Area Team impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. and removed Team:Detection Rule Management Security Detection Rule Management Team triage_needed labels Oct 11, 2021
@marshallmain marshallmain added 8.2 candidate considered, but not committed, for 8.2 release and removed 8.1 candidate labels Feb 9, 2022
@MindyRS MindyRS added the Team:Detections and Resp Security Detection Response Team label Feb 23, 2022
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@marshallmain marshallmain added impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. and removed impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. labels Mar 29, 2022
@madirey
Contributor Author

madirey commented Mar 31, 2022

Performance of 3, 4, 5, and 10 threshold aggregation fields:

[Images: perf1, perf2, perf3]

@madirey
Contributor Author

madirey commented Mar 31, 2022

From the above, the only metric meaningfully affected by raising the maximum number of aggregation fields is the average query time. The current max (3) results in a maximum query time of 4.5 seconds. This increases to 6 seconds when 4 fields are used, and to about 8 seconds for both 5 and 10 fields. Index time is not affected.

Tests were run using a modified version of kbn-alert-load against stack version 8.1.2, using 50 threshold rules running every minute over approximately 600,000 total events. Metrics were collected for 5 minutes.

@madirey
Contributor Author

madirey commented Mar 31, 2022

The query time increases almost linearly for lower numbers of threshold fields, but the effect appears to level off as more fields are added: the query time for 5 and 10 threshold fields is nearly identical. So there is a cost associated with increasing the number of allowed threshold fields, but if we DO wish to increase it, perhaps we shouldn't limit the value at all.
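To make the "levels off" observation concrete, here is a small sketch that computes the extra query time per added field from the approximate figures quoted in the comments above. The numbers are rounded values taken from this thread, not new measurements, and `marginal_cost` is a hypothetical helper name.

```python
from typing import Dict, Tuple

# Approximate query times in seconds, keyed by number of threshold fields,
# as quoted in this thread (rounded, not re-measured).
query_time_s: Dict[int, float] = {3: 4.5, 4: 6.0, 5: 8.0, 10: 8.0}


def marginal_cost(times: Dict[int, float]) -> Dict[Tuple[int, int], float]:
    """Return the extra seconds per added field between consecutive data points."""
    points = sorted(times.items())
    return {
        (lo, hi): (t_hi - t_lo) / (hi - lo)
        for (lo, t_lo), (hi, t_hi) in zip(points, points[1:])
    }


costs = marginal_cost(query_time_s)
for (lo, hi), per_field in costs.items():
    print(f"{lo} -> {hi} fields: {per_field:.2f} s per added field")
```

With these inputs the per-field cost is positive between 3 and 5 fields and zero between 5 and 10, which is the flattening described above.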

@marshallmain
Contributor

@madirey is it possible to get numbers for a multi_terms based threshold implementation for comparison?

@marshallmain marshallmain added 8.3 candidate and removed 8.2 candidate considered, but not committed, for 8.2 release labels Apr 7, 2022
@madirey
Contributor Author

madirey commented Apr 11, 2022

multi_terms is consistently about 10x slower than a nested terms agg. The documentation notes that multi_terms will be slower than nested terms aggregations except in a few specific cases. We can expect our searches to return many results, so it's not surprising that multi_terms, with its execution hint of map, does not perform as well.
[Image]
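For readers unfamiliar with the two shapes being compared, here is a minimal sketch of how the request bodies differ. This is not the rule executor's actual code: the helper names, the aggregation names (`threshold_0`, `threshold_multi`), and the default `size` are assumptions for illustration. Only the `terms` and `multi_terms` aggregation syntax comes from Elasticsearch.

```python
from typing import Any, Dict, List


def nested_terms_agg(fields: List[str], size: int = 10000) -> Dict[str, Any]:
    """Build one terms aggregation per field, each nested inside the previous one."""
    if not fields:
        raise ValueError("at least one field is required")
    aggs: Dict[str, Any] = {}
    # Build from the innermost field outwards so each level wraps the next.
    for depth in range(len(fields) - 1, -1, -1):
        level: Dict[str, Any] = {"terms": {"field": fields[depth], "size": size}}
        if aggs:
            level["aggs"] = aggs
        aggs = {f"threshold_{depth}": level}  # hypothetical aggregation name
    return aggs


def multi_terms_agg(fields: List[str], size: int = 10000) -> Dict[str, Any]:
    """Build a single multi_terms aggregation keyed on the tuple of all fields."""
    if len(fields) < 2:
        raise ValueError("multi_terms needs at least two fields")
    return {
        "threshold_multi": {  # hypothetical aggregation name
            "multi_terms": {
                "terms": [{"field": f} for f in fields],
                "size": size,
            }
        }
    }
```

The nested form returns a tree of buckets (one level per field), while multi_terms returns a flat list of buckets keyed by the combination of field values.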

@madirey
Contributor Author

madirey commented Apr 26, 2022

composite aggs are consistently slower than our nested terms aggs by about 3x. I'm trying an optimization that was recommended by the ES team to see if that helps. multi_terms likely won't fit our use case... it's great for getting the top N buckets, but we'd like to get ALL buckets.

That said, our current implementation (nested terms) is limited in that we can only get 10k top-level buckets (the max is 65k, so we could bump this up). This means we could get significantly fewer than that once the cardinality of the child buckets is multiplied in. We mitigate this by sorting the buckets by cardinality (desc), but this could still result in false negatives. The advantage of composite is that we could page over all the results and not miss anything. (But again, how useful is it going to be for an operator to have to sort through >10k alerts...?)
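The paging behaviour of composite can be sketched roughly as follows. This is an illustration under assumptions, not the implementation that was benchmarked: `search` is a hypothetical callable that takes a request body and returns the parsed Elasticsearch response, and `threshold_buckets` is a made-up aggregation name. The `sources`, `after`, and `after_key` fields are the standard composite aggregation request and response fields.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def composite_agg(
    fields: List[str], page_size: int, after: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build a composite aggregation with one terms source per field."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{f: {"terms": {"field": f}}} for f in fields],
    }
    if after is not None:
        # Resume from the last bucket key of the previous page.
        composite["after"] = after
    return {"threshold_buckets": {"composite": composite}}


def iter_buckets(
    search: SearchFn, fields: List[str], page_size: int = 1000
) -> Iterator[Dict[str, Any]]:
    """Yield every composite bucket, following after_key until the results run out."""
    after: Optional[Dict[str, Any]] = None
    while True:
        response = search({"size": 0, "aggs": composite_agg(fields, page_size, after)})
        agg = response["aggregations"]["threshold_buckets"]
        buckets = agg.get("buckets", [])
        yield from buckets
        after = agg.get("after_key")
        if not buckets or after is None:
            return
```

Each page costs one round trip, which is one plausible reason the approach is slower end to end, but no bucket is dropped by a top-level size cap.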

@madirey
Contributor Author

madirey commented Apr 27, 2022

The optimization mentioned above actually increased the rule runtime by about 3x. So far, it looks like nested terms aggs are by far the best option. Conclusion: we could consider increasing the number of threshold fields allowed, at the expense of potentially higher heap usage on the ES cluster.

@madirey
Contributor Author

madirey commented Apr 27, 2022

@marshallmain @spong ^^


6 participants