
Add circuit breaker for aggregation memory limit #67474

Closed
williamrandolph opened this issue Jan 13, 2021 · 6 comments
Labels: :Analytics/Aggregations · >enhancement · Team:Analytics

Comments

@williamrandolph (Contributor)

Add a circuit breaker that limits the amount of memory an aggregation can use. For example, breaker.search.aggregation_memory.limit.

This should work better than max_buckets, as a memory-based limit would adapt to each request's actual cost rather than enforcing a fixed bucket count.

This issue was first raised as part of #62457.
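For illustration, here is a minimal sketch of the kind of per-request accounting such a breaker implies. All names here (the class, the exception message, the way the setting would be wired in) are hypothetical, not Elasticsearch's actual breaker API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only; not Elasticsearch's actual breaker API.
final class PerRequestAggMemoryBreaker {
    private final long limitBytes;                 // e.g. the proposed breaker.search.aggregation_memory.limit
    private final AtomicLong usedBytes = new AtomicLong();

    PerRequestAggMemoryBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Reserve bytes for this request; fail fast if this request alone would exceed its limit.
    void addEstimateAndMaybeBreak(long bytes) {
        long newUsed = usedBytes.addAndGet(bytes);
        if (newUsed > limitBytes) {
            usedBytes.addAndGet(-bytes);           // roll back the reservation
            throw new IllegalStateException("aggregation memory for this request ["
                + newUsed + " bytes] would exceed the limit [" + limitBytes + " bytes]");
        }
    }

    // Release bytes when aggregation data structures are freed.
    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }
}
```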

elasticmachine added the Team:Analytics label Jan 13, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (Team:Analytics)

@not-napoleon (Member)

(CC @lanerjo since you opened the original issue this is based on)

I believe we already have a circuit breaker for aggregation memory usage, as of 7.7: #46751. Does that meet your needs? If not, can you clarify what you're looking for specifically?

@maosuhan

@not-napoleon The request circuit breaker is more of a final safeguard and is slow to react. What's more, once the breaker trips it rejects all requests, no matter whether they are small or large.

In most cases, users need a rapid response from ES. If a request is too large, the user still wants to receive the rejection message quickly. If breaker.search.aggregation_memory.limit applied to a per-request context, a large request would be rejected quickly without impacting other, smaller requests.
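To make the contrast concrete, here is a toy comparison (arbitrary numbers, hypothetical names) of a node-wide check, which starts rejecting every request once total usage is near the limit, versus a per-request check, which judges each request only on its own footprint:

```java
// Toy illustration with arbitrary numbers; not real Elasticsearch accounting.
final class BreakerComparison {
    public static void main(String[] args) {
        long nodeLimit = 1_000;      // shared node-wide budget
        long nodeUsed = 950;         // large queries already consumed most of it
        long smallRequest = 100;     // footprint of an incoming small request
        long perRequestLimit = 500;  // hypothetical per-request limit

        // Node-wide breaker: once the node is near its limit, even small requests fail.
        System.out.println("node-wide breaker rejects small request: "
                + (nodeUsed + smallRequest > nodeLimit));   // true

        // Per-request breaker: the small request is judged only on its own usage.
        System.out.println("per-request breaker rejects small request: "
                + (smallRequest > perRequestLimit));        // false
    }
}
```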

@williamrandolph would this be a cluster-level setting that applies to each request context, like search.max_buckets?

@not-napoleon (Member)

I see. Essentially, you're asking for a way to cap a large aggregation's memory usage without exhausting the available cluster resources. So for example, you might want to say that one aggregation can never use more than 10% of memory, or something like that (just making up a number, that's not a recommendation). This would then kill large requests while allowing smaller requests to complete, which is not the case with the current request circuit breaker.

I'm not sure this will give a rejection message faster than the current approach though. In general, aggregation memory usage is highest during the reduce phase, which happens fairly late in the request life cycle. Adding a per request breaker doesn't really change that.

@maosuhan commented Mar 2, 2021

@not-napoleon Thanks for reading my comment and replying. I think you've got my point exactly.
An aggregation memory limit could apply not only to the reduce phase on the coordinator but also to the aggregation phase on the data nodes.

Data node breaker trip

In our production environment we run ES nodes with 30GB heaps. For a high-cardinality query, if we rely on the circuit breaker it takes 90 seconds before the parent circuit breaker trips (real memory exceeds 29.4GB). But if we rely on search.max_buckets, the exception is raised within a few seconds.
As you can see, response time is roughly proportional to search.max_buckets:

search.max_buckets    response time
10000                 511ms
100000                1173ms
1000000               9569ms
2000000               23000ms
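The bucket limit fails fast because the count is checked as buckets are created during collection, so an oversized aggregation is rejected long before heap pressure builds up. A rough sketch of that kind of check (illustrative names, not Elasticsearch internals):

```java
// Illustrative sketch; names are not Elasticsearch's internal API.
final class BucketLimitCheck {
    private final int maxBuckets;   // analogous to search.max_buckets
    private int bucketCount;

    BucketLimitCheck(int maxBuckets) {
        this.maxBuckets = maxBuckets;
    }

    // Called each time an aggregator allocates a new bucket; trips immediately
    // once the limit is crossed, instead of waiting for heap pressure.
    void onNewBucket() {
        if (++bucketCount > maxBuckets) {
            throw new IllegalStateException("too many buckets: limit ["
                + maxBuckets + "] exceeded");
        }
    }
}
```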

Coordinator breaker trip

As you said, aggregation memory usage is highest during the reduce phase, which happens fairly late in the request life cycle, and adding a per-request breaker doesn't really change that. According to our tests, however, the largest memory consumer is the buffered shard responses from all shards, especially when batched_reduce_size is not tuned. #67478 is already trying to fix this issue, but it is not a per-request breaker.
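As a sketch of the idea behind batched reduction: instead of buffering every shard response before a single final reduce, the coordinator can merge partial results every N responses, bounding how many responses sit in memory at once. The code below is a toy model of that pattern, not the actual implementation behind batched_reduce_size:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of batched partial reduction; not Elasticsearch's implementation.
final class BatchedReducer {
    private final int batchSize;                  // plays the role of batched_reduce_size
    private final List<long[]> buffer = new ArrayList<>();
    private long[] partial = new long[0];         // stand-in for a partially reduced agg result

    BatchedReducer(int batchSize) {
        this.batchSize = batchSize;
    }

    // Buffer a shard response; once the batch is full, fold it into the
    // partial result and free the buffered responses.
    void onShardResponse(long[] shardResult) {
        buffer.add(shardResult);
        if (buffer.size() >= batchSize) {
            partial = reduce(partial, buffer);
            buffer.clear();
        }
    }

    // Final reduce over whatever is still buffered.
    long[] finish() {
        long[] result = reduce(partial, buffer);
        buffer.clear();
        return result;
    }

    // Toy reduce: element-wise sum, standing in for merging aggregation trees.
    private static long[] reduce(long[] acc, List<long[]> results) {
        int len = acc.length;
        for (long[] r : results) {
            len = Math.max(len, r.length);
        }
        long[] out = new long[len];
        System.arraycopy(acc, 0, out, 0, acc.length);
        for (long[] r : results) {
            for (int i = 0; i < r.length; i++) {
                out[i] += r[i];
            }
        }
        return out;
    }
}
```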

And if we could break only the big queries and let small queries run smoothly, that would be a benefit too.

@not-napoleon (Member)

Thanks for taking the time to reply. We discussed this as a team, and currently we do not want to head in this direction. It's not at all clear that killing large queries in favor of smaller queries is a good general policy. In our opinion, the existing aggregation circuit breaker covers the majority of cases.

If we were to take up something like this in the future, I think it would have to come from a different direction. Elasticsearch doesn't currently support (or have a plan to support) per-user usage limiting, but I could see that as a more reasonable direction to take something like this.

Overall, we want to support the ability to run very large queries as needed, and don't think killing a query just because it is large, when we otherwise might be able to serve it, is a good policy. The assumption that small queries are more important than large ones doesn't play out in all cases.
