
Coordinate node memory checking during accumulating shard result response #47806

Open
wants to merge 5 commits into main
Conversation

howardhuanghua
Contributor

About this PR

Currently we don't have circuit breaker checking on the coordinate node. There are two main memory consumption parts on the coordinate node:

  1. Accumulating shard results from other data nodes.
    This PR adds memory checking for this part.
  2. Aggregation reduction phase.
    ES has search.max_buckets to limit memory, and we introduced a more convenient setting in #46751 (Bucket aggregation circuit breaker optimization) to enhance this part.

This PR focuses on the first part.
The coordinate node receives search result responses from other data nodes; these are handled in InboundHandler.java#handleResponse. We can get the exact message length in bytes before deserialization and add that length to the request breaker directly. The main purpose is to check the parent circuit breaker so that a continuous stream of incoming responses cannot blow up the coordinate node's memory. We only check large responses (>1kb) to avoid excessive breaker checking. The added request breaker bytes are released in the subsequent finally block, since it is hard to trace every point where the memory is reclaimed.
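A minimal sketch of that flow (simplified; the variable and method names stand in for the corresponding InboundHandler state and are not the exact diff):

```java
// Sketch only, not the exact PR diff: check the breaker with the raw message
// length before deserialization and release it again in the finally block.
long bytesNeedToRelease = 0;
try {
    if (messageLengthBytes > 1024) {
        // large responses only, to avoid checking the breaker for every tiny message
        breaker.addEstimateBytesAndMaybeBreak(messageLengthBytes, "<transport_response>");
        bytesNeedToRelease = messageLengthBytes;
    }
    final T response = handler.read(stream); // deserialization happens here
    handler.handleResponse(response);
} finally {
    // released unconditionally; we cannot trace the real lifetime of the deserialized object
    if (bytesNeedToRelease > 0) {
        breaker.addWithoutBreaking(-bytesNeedToRelease);
    }
}
```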

More Considerations

Besides the implementation provided in this PR, we also have some other considerations:

1. Introduce a memory expansion ratio to estimate response object size based on received message length before deserialization.

After message deserialization, the real memory size of the response object is much larger than the original received message length in bytes. Taking aggregation buckets as an example, we have run some tests on the response object memory expansion ratio, and here are the results:

| Aggregation Type | Expansion Ratio (after deserialization / before deserialization) |
| --- | --- |
| 10-chars string terms agg | 19.5 |
| 20-chars string terms agg | 11.2 |
| 30-chars string terms agg | 8.2 |
| 50-chars string terms agg | 5.5 |
| 2-digits long terms agg | 14.7 |
| 5-digits long terms agg | 15 |
| 10-digits long terms agg | 15.2 |

We can see that the memory expansion ratio of string terms aggregations is about 10-20, and long terms aggregations stay stable at around 15.
So we could use a default expansion ratio such as 10 to estimate the response object size before deserialization, and make this ratio configurable for different data analysis scenarios.
Without accounting for this expansion, it is easy to exhaust the coordinate node's memory.
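A rough sketch of how such an estimate might be applied (the ratio value and its configurability are illustrative assumptions, not an existing Elasticsearch setting):

```java
// Hypothetical: scale the raw message length by a configurable expansion ratio
// so the breaker accounts for the estimated deserialized size, not the wire size.
double expansionRatio = 10.0; // assumed default, would be configurable per scenario
long estimatedObjectBytes = (long) (messageLengthBytes * expansionRatio);
breaker.addEstimateBytesAndMaybeBreak(estimatedObjectBytes, "<transport_response_estimate>");
```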

2. Release transport response object size from request breaker after object GC.

The current PR releases the added bytes in the subsequent finally block; however, the object has not been garbage collected at that stage. For the aggregation case, I tried releasing it in SearchPhaseController.java#reducedQueryPhase(), right after the aggregation results are consumed in result.consumeAggs(). But we still could not guarantee the added bytes are always released, especially in some exception cases. So we simply release them in the finally block; the main purpose is to make sure we have enough free memory, as checked through the parent breaker.

3. Defer deserializing response messages until all the shard results have been received.

We could hold all the shard results before deserialization; if the request breaker has already tripped during accumulation, we could fail the request directly without deserializing any of them.
If we deserialize each response immediately, all the deserialized objects are kept in memory until all the results have been received, which wastes memory.
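A rough sketch of this idea (the class and method names here are illustrative placeholders, not Elasticsearch code):

```java
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.common.breaker.CircuitBreaker;

// Sketch: buffer the raw shard responses and deserialize them only once every
// shard has replied; a tripped breaker fails the search before any deserialization.
class DeferredShardResponseBuffer {
    private final List<byte[]> rawShardResponses = new ArrayList<>();
    private final int expectedShards;
    private final CircuitBreaker breaker;

    DeferredShardResponseBuffer(int expectedShards, CircuitBreaker breaker) {
        this.expectedShards = expectedShards;
        this.breaker = breaker;
    }

    void onRawResponse(byte[] message) {
        // account the raw bytes first; this is where the parent breaker can trip
        breaker.addEstimateBytesAndMaybeBreak(message.length, "<buffered_shard_response>");
        rawShardResponses.add(message);
        if (rawShardResponses.size() == expectedShards) {
            // all shards have replied: deserialize everything together and move on
            rawShardResponses.forEach(this::deserializeAndConsume);
        }
    }

    private void deserializeAndConsume(byte[] raw) {
        // placeholder for the actual deserialization and next-phase handoff
    }
}
```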

We have tried several implementations of the considerations above; the current PR is the simplest one, and I hope we can discuss the details further. Please help to evaluate these ideas, thanks a lot.

@howardhuanghua howardhuanghua changed the title coordinate node memory checking during accumulating shard result response Coordinate node memory checking during accumulating shard result response Oct 11, 2019
@alpar-t alpar-t added the :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload label Oct 18, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Circuit Breakers)

@howardhuanghua
Contributor Author

Hi Adrien @jpountz , would you please help to review this PR? Thanks a lot.

```java
final T response;
long bytesNeedToRelease = 0;
CircuitBreaker breaker = circuitBreakerService.getBreaker(CircuitBreaker.REQUEST);
```

FYI the relevant circuit breaker is IN_FLIGHT_REQUESTS

@jpountz
Contributor

jpountz commented Oct 25, 2019

I'm not deep into how transport handles circuit breakers, and transport tends to be a bit tricky as there are some requests that should never fail to keep the transport stable. @tbrooks8 Could you help review this one?

@Tim-Brooks Tim-Brooks left a comment
Contributor

This is IMO not the correct approach for a cohesive solution on circuit breaking on responses.

The InboundHandler, which is a networking layer component, should be agnostic to how specific transport messages operate. That is lost when we start using instanceof checks and casting to verify what message we are receiving.

We have discussed circuit breaking on responses and I think it is something we want to do (#42318). However, it is obviously more tricky because we do not have the concept of a "response context" (some indicator telling us when we are done). This solution only works for the response in question because the Executor is SAME. If we dispatched, this would not really do much.

If we want to circuit break on this response, I would suggest one of two approaches:

  1. Put the circuit breaker in SearchTransportService and have this response do whatever it wants in regards to circuit breaking.

  2. Make this a team-discuss topic for the infra team and add a default method canTripCircuitBreaker to TransportResponseHandler. By default it will return false and this specific implementation can override it and return true. The network layer can then circuit break based on this method agnostic to response type.

Number 2 would obviously be the basis of the future circuit breaking on response work (although it obviously only would work with synchronous response handling for now).
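For illustration, option 2 might look roughly like this (a simplified sketch, not the actual Elasticsearch interface definition):

```java
// Sketch of option 2: the network layer consults a default method instead of
// inspecting the concrete response type with instanceof checks.
public interface TransportResponseHandler<T> {

    void handleResponse(T response);

    // Most handlers keep the current behaviour and never trip the breaker.
    // The shard-result handler used by SearchTransportService could override
    // this and return true to opt in to response circuit breaking.
    default boolean canTripCircuitBreaker() {
        return false;
    }
}
```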

@howardhuanghua
Contributor Author

Hi @tbrooks8, thanks for the comment. I have updated the PR to use canTripCircuitBreaker to decide whether to trigger circuit breaker checking, and also changed the REQUEST breaker to IN_FLIGHT_REQUESTS based on Adrien's suggestion. Would you please help to review again?

By the way, what do you think about memory expansion after shard-level result deserialization? Should we use the estimated response object size, as I described in the comment above?

> 1. Introduce a memory expansion ratio to estimate response object size based on received message length before deserialization.

(The expansion ratio test results are in the table in the PR description above.)

@Tim-Brooks
Contributor

@howardhuanghua I was out last week. I will take a look at this soon. I will also engage some other people about this to ensure we are all in agreement about the default method approach.

@howardhuanghua
Contributor Author

Thank you @tbrooks8. If you have free time, please help to check this approach again. :)

@Tim-Brooks
Contributor

I brought this up in the distributed sync today to discuss. To give some context, I mainly focus on the networking area of Elasticsearch, which is why @jpountz pinged me on this topic.

So the first thing is that as I mentioned we have discussed adding response circuit breaking before and it is something we are interested in. We definitely accumulate data at the network level and it would be nice to avoid buffering a ton of unaccounted data. But getting the maximum value out of that work requires additional work to the transport protocol (which is in progress but seems unlikely to be ready in the short-term).

This particular PR seems to be addressing a specific concern - accumulating a significant amount of memory on the coordinator node. The reason that the networking-oriented solution (IN_FLIGHT_REQUESTS) might be inadequate for the desired results is that it depends on the responses from other nodes all occurring at the same time. A lot of the response handlers here (my understanding) will simply set the results in some array and then the execution will be complete (immediately releasing the circuit breaker).

I guess it was not clear that this would provide much value for the accumulation concern? Is the expectation here that we would always be depending on the JMX memory stats circuit breaker to save us? And the hope is that we could get some type of essentially "deserialization estimate" prior to performing the deserialization?

On the topic of deserialization being larger than the network bytes, once again probably the best approach there is to have some type of TransportResponseHandler method which provides the actual deserialization size estimate. But that is also a level of circuit breaking logic that we have never done before. I want to ping @elastic/es-core-infra / @danielmitterdorfer here as I know that Daniel might have some thoughts about this.

I am personally okay with the additional response size circuit breaking. I just don't know how much value it provides in relation to what exists at the network level currently (plus the fact that releasing is clearly broken at the moment since it does not support async; it completely depends on the JMX real memory stuff). And maybe this should just be some internal logic inside the accumulation handler.

```java
if (handler.canTripCircuitBreaker() && messageLengthBytes > 1024) {
    // the main purpose is to check memory before deserialization for large size of response
    bytesNeedToRelease = messageLengthBytes;
    breaker.addEstimateBytesAndMaybeBreak(messageLengthBytes, "<transport_response>");
```

On this particular approach (the network level approach), you will need to catch the circuit breaking exception and handle it. With this code path it is going to appear as a transport serialization exception.

I also think that we would want the approach to literally surround the execution. I understand you are primarily concerned about the increase in size from deserialization, but as a generic approach we would want to wrap the call to the response handler.
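A sketch of that suggestion, extending the flow sketched in the PR description above (the exception handling shown is an assumption about how a breaker trip might be surfaced, not code from this PR):

```java
// Sketch: surround the whole handler execution and handle the breaker trip
// explicitly so it does not surface as a transport deserialization failure.
long bytesNeedToRelease = 0;
try {
    if (handler.canTripCircuitBreaker() && messageLengthBytes > 1024) {
        breaker.addEstimateBytesAndMaybeBreak(messageLengthBytes, "<transport_response>");
        bytesNeedToRelease = messageLengthBytes;
    }
    final T response = handler.read(stream);
    handler.handleResponse(response); // wrap the execution, not just the deserialization
} catch (CircuitBreakingException e) {
    // report the trip through the handler instead of failing the transport channel
    handler.handleException(new RemoteTransportException("response circuit breaker tripped", e));
} finally {
    if (bytesNeedToRelease > 0) {
        breaker.addWithoutBreaking(-bytesNeedToRelease);
    }
}
```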

@elasticcla

Hi @howardhuanghua, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@howardhuanghua
Contributor Author

Hi @tbrooks8, thank you for your patient explanations.

> I guess it was not clear that this would provide much value for the accumulation concern? Is the expectation here that we would always be depending on the JMX memory stats circuit breaker to save us? And the hope is that we could get some type of essentially "deserialization estimate" prior to performing the deserialization?

Yes, you are right: this PR does not check the memory of the accumulated response results. The main purpose is to stop continuous deserialization of shard-level results and discard those responses if the total JVM heap is about to be used up (based on checking the parent circuit breaker). For more accurate prediction, we hope to add the estimated deserialized size to the breaker, since after deserialization the result size is several times the network message size in bytes.

The memory buffered at the network layer is one big part of memory usage on the coordinate node; another big part is the bucket reduce phase, for which I have already opened #46751. This PR could discard the continuous shard-level responses to protect the coordinate node heap.

Shard-level search response results are accumulated in memory by the following logic:

```java
void consumeResult(Result result) {
    assert results.get(result.getShardIndex()) == null : "shardIndex: " + result.getShardIndex() + " is already set";
    results.set(result.getShardIndex(), result);
}
```

This memory is not released until all the shard results have been received.

I previously tried adding the accumulated result memory to the REQUEST breaker in ArraySearchPhaseResults.java#consumeResult; the main issues are:

  1. The response result has already been deserialized, so we would be accounting real memory usage, not a prediction.
  2. It is hard to estimate the response result object size (Java has no sizeof to calculate an object's size; the expansion ratios in the test results above were calculated with external tools).
  3. It is hard to release the added size from the breaker completely, so we could potentially leak breaker memory accounting.
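For reference, that abandoned attempt would have looked roughly like this (a sketch only; estimateSizeInBytes is a hypothetical helper, not an existing method):

```java
// Sketch of the abandoned approach: account the already-deserialized result
// when it is stored, and release it again once it has been consumed.
void consumeResult(Result result) {
    long estimatedBytes = estimateSizeInBytes(result); // issue 2: Java has no reliable sizeof
    // issue 1: this accounts real usage after deserialization, it is not a prediction
    breaker.addEstimateBytesAndMaybeBreak(estimatedBytes, "<shard_result>");
    results.set(result.getShardIndex(), result);
    // issue 3: releasing estimatedBytes on every success and failure path proved error prone
}
```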

> We definitely accumulate data at the network level and it would be nice to avoid buffering a ton of unaccounted data.

One idea to optimize this a little: could we accumulate only the raw network messages without deserializing them? After all the shard results for a single search request have arrived on the coordinate node, we would deserialize them together and continue with the next phase. We could stop accumulating the original messages if the parent breaker has already been triggered. This would save much more memory compared to buffering deserialized result objects. I have described this in the comment above, in the following section:
3. Defer deserializing response messages until all the shard results have been received.

And I have updated the commit based on your suggestion; please let me know if something is wrong.
I know the coordinate node's circuit breaker is being continuously optimized; we also have problems in this area and are trying to improve it. We hope we can contribute to this part together with you, thank you. :)

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:14