Improved Request Circuit Breaker #11070
Comments
Could the circuit breaker be enhanced with a threshold so that any query that would be sent to more than X shards is blocked? We have a similar issue where some search requests hit every shard, and we would like to block such requests from executing.
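For what it's worth, a minimal sketch of what such a guard could look like on the coordinating node, after shard resolution but before fan-out. Everything here (`ShardCountGuard`, `maxShardsPerSearch`) is hypothetical, not an existing Elasticsearch API; I believe later releases added a soft limit in this spirit via the `action.search.shard_count.limit` cluster setting.

```java
// Hypothetical sketch, not Elasticsearch code: reject a search before
// fan-out when it would touch more shards than a configured threshold.
public final class ShardCountGuard {

    private final int maxShardsPerSearch; // e.g. sourced from a cluster setting

    public ShardCountGuard(int maxShardsPerSearch) {
        this.maxShardsPerSearch = maxShardsPerSearch;
    }

    /** Throws if the search resolved to more shards than the limit allows. */
    public void checkShardCount(int resolvedShardCount) {
        if (maxShardsPerSearch > 0 && resolvedShardCount > maxShardsPerSearch) {
            throw new IllegalStateException("search would hit ["
                    + resolvedShardCount + "] shards, above the limit of ["
                    + maxShardsPerSearch + "]");
        }
    }
}
```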
We have a similar request: an end user asked for circuit-breaker functionality at the node level, i.e. limiting the cumulative memory used by all queries on a node.
So, for the cardinality aggregation and request-level semantics: the circuit breaker (even though it's called the "request breaker") has no notion of the request itself. Instead, it is charged at allocation time by the per-request data structures themselves (the big arrays that aggregations use, for example). Multiple queries using those structures therefore all draw against the same node-wide budget, with no per-request attribution.
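To illustrate the accounting model being described here (which also shows the node-level accumulation the previous comment asks about), a stripped-down sketch. This is a simplification, not the actual Elasticsearch source; the method names mirror the real `CircuitBreaker` interface, but the body is invented.

```java
import java.util.concurrent.atomic.AtomicLong;

// Stripped-down sketch of the accounting described above (a simplification,
// not the actual Elasticsearch source): the "request" breaker is a node-wide
// byte counter that per-request data structures charge at allocation time.
// There is no per-request ledger, so all in-flight queries draw on one budget.
final class RequestBreakerSketch {

    private final AtomicLong used = new AtomicLong();
    private final long limitBytes;

    RequestBreakerSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Charge an allocation (e.g. a growing big array) and trip if over limit. */
    void addEstimateBytesAndMaybeBreak(long bytes, String label) {
        long newUsed = used.addAndGet(bytes);
        if (newUsed > limitBytes) {
            used.addAndGet(-bytes); // roll back the failed reservation
            throw new IllegalStateException("[request] data for [" + label
                    + "] would be larger than limit of [" + limitBytes + "] bytes");
        }
    }

    /** Adjust usage without tripping; pass negative bytes on release. */
    void addWithoutBreaking(long bytes) {
        used.addAndGet(bytes);
    }
}
```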
Does anyone know what the outlook for this feature is?
Fixed by #19394
In certain circumstances, the request circuit breaker does not block requests that are individually fine but holistically a problem. For example, if you have an aggregation on a very high-cardinality field and you allow the `shard_size` to become `Integer.MAX_VALUE` (either directly, or indirectly by setting it to `0`), then you can create a lot of CPU and network congestion (this is documented behavior).

On a per-request basis, this may be caught and safely blocked. However, for requests that manage to sneak in under the request threshold, I have come across scenarios where multiple in-flight requests together crash the node that handles them.
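For concreteness, this is roughly the request shape being described, written against the 5.x-era Java API (package paths and builders differ across versions, and the field name and sizes are invented):

```java
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// The kind of request being described (field name and sizes are made up):
// a terms aggregation whose shard_size is effectively unbounded, so every
// shard may ship back its entire term list to the coordinating node.
public class UnboundedShardSizeExample {
    public static void main(String[] args) {
        TermsAggregationBuilder terms = AggregationBuilders.terms("by_user")
                .field("user_id")              // assume a very high-cardinality field
                .size(10)                      // the final size looks harmless...
                .shardSize(Integer.MAX_VALUE); // ...but per-shard results do not

        SearchSourceBuilder source = new SearchSourceBuilder().aggregation(terms);
        System.out.println(source); // renders the JSON body of the search
    }
}
```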
In particular, I have seen a client node forced into OOM conditions due to parallel aggregations across a lot of shards. In this case, an individual shard response was only ~70 MB, but there were many shards, and other aggregations were in flight at the same time. Eventually the memory pressure became too much, causing the client node (in this case) to drop out due to OOM. I suspect that a similar problem could surface if a data node were forced to handle the initial request.
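Some back-of-the-envelope arithmetic makes the failure mode concrete. Only the ~70 MB per-shard figure comes from the report above; the shard and concurrency counts are assumptions.

```java
// Back-of-the-envelope arithmetic for the OOM above. Only the ~70 MB
// per-shard figure comes from the report; shard and concurrency counts
// are assumed. Partial shard responses are buffered on the coordinating
// node until the reduce phase, so memory scales with both factors.
public class CoordinatorMemoryEstimate {
    public static void main(String[] args) {
        long perShardResponseBytes = 70L * 1024 * 1024; // ~70 MB, as observed
        int shards = 50;                                // assumed shard count
        int concurrentRequests = 4;                     // assumed in-flight aggregations

        long buffered = perShardResponseBytes * shards * concurrentRequests;
        System.out.printf("~%.1f GB buffered before reduce%n", buffered / 1e9);
        // prints ~14.7 GB -- comfortably beyond a typical client-node heap
    }
}
```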
This is certainly not an easy problem to catch, nor will the solution to it be easy, but hopefully we can figure something out to combat the issue.