
Change the default batched_reduce_size of search requests #51857

Open
jimczi opened this issue Feb 4, 2020 · 12 comments
Labels
:Analytics/Aggregations, >enhancement, :Search/Search, Team:Analytics, Team:Search

Comments

@jimczi (Contributor) commented Feb 4, 2020

Today we execute a partial reduce of search requests after we have buffered at least 512 shard search results. This default (users can change the value with batched_reduce_size=N) seems quite high and can cause memory issues for queries that target a large number of shards. We also want to use the partial reduce to speed up the search on subsequent shard search requests (#51852), but users won't see the benefit unless they lower the batched reduce size explicitly. Partial (and final) reduces are usually very fast, so I am opening this issue to pick a saner default that could save memory on the coordinating node and speed up sorted queries on time-based indices (queries that can target a lot of shards). We have plenty of options, so here's a non-exhaustive list:

  1. Reduce the default to 5-10. That could slightly increase the overall latency, but the benefits are non-negligible on time-based indices.
  2. Reduce the default only for queries that can use the partial reduce to speed up subsequent shard searches (sorted queries on time-based indices).
  3. Change the threshold to check the size of the buffered results rather than the absolute number of shards. We could also trigger a partial reduce based on inactivity (if no shard responded in the last N seconds/minutes).
  4. Keep the default as it is :(

I am curious to hear your thoughts on these options.
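
For context, batched_reduce_size is a regular search-request parameter, so callers can already lower it per request. A minimal sketch, assuming the 7.x Java high-level REST client; the `logs-*` index pattern and the value 16 are purely illustrative:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class BatchedReduceSizeExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Hypothetical time-based indices; any index pattern works.
            SearchRequest request = new SearchRequest("logs-*");
            request.source(new SearchSourceBuilder().size(0));
            // Trigger a partial reduce after every 16 buffered shard results
            // instead of the current default of 512.
            request.setBatchedReduceSize(16);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            System.out.println(response.status());
        }
    }
}
```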

jimczi added the discuss, :Analytics/Aggregations, and :Search/Search labels Feb 4, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine (Collaborator)

Pinging @elastic/es-search (:Search/Search)

@polyfractal (Contributor)

Wow yeah, 512 is high :)

I like the idea of triggering a reduce based on a timeout, but that seems potentially tricky to get right. And as you said, reductions should be pretty fast, so a lowish 5-50 default seems reasonable to me.

@jpountz (Contributor) commented Feb 5, 2020

I like the idea of just decreasing it to 5 or 10. It's very simple, and most requests that hit a single index will keep doing a single reduction. There are some downsides too, but the benefits outweigh the downsides in my opinion.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Mar 6, 2020
This change optimizes the merge of terms aggregations by removing
the priority queue that was used to collect all the buckets during
a non-final reduction. We don't need to keep the result sorted since
the merge of buckets in a subsequent reduce can modify the order.
I wrote a small micro-benchmark to test the change and the speedups
are significant for small merge buffer sizes:

````
########## Master:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt     Score     Error  Units
TermsReduceBenchmark.reduceTopHits             5          10000         1000        1000  avgt   10  2459,690 ± 198,682  ms/op
TermsReduceBenchmark.reduceTopHits            16          10000         1000        1000  avgt   10  1030,620 ±  91,544  ms/op
TermsReduceBenchmark.reduceTopHits            32          10000         1000        1000  avgt   10   558,608 ±  44,915  ms/op
TermsReduceBenchmark.reduceTopHits           128          10000         1000        1000  avgt   10   287,333 ±   8,342  ms/op
TermsReduceBenchmark.reduceTopHits           512          10000         1000        1000  avgt   10   257,325 ±  54,515  ms/op

########## Patch:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt    Score    Error  Units
TermsReduceBenchmark.reduceTopHits             5          10000         1000        1000  avgt   10  805,611 ± 14,630  ms/op
TermsReduceBenchmark.reduceTopHits            16          10000         1000        1000  avgt   10  378,851 ± 17,929  ms/op
TermsReduceBenchmark.reduceTopHits            32          10000         1000        1000  avgt   10  261,094 ± 10,176  ms/op
TermsReduceBenchmark.reduceTopHits           128          10000         1000        1000  avgt   10  241,051 ± 19,558  ms/op
TermsReduceBenchmark.reduceTopHits           512          10000         1000        1000  avgt   10  231,643 ±  6,170  ms/op
````

The code for the benchmark can be found [here](). It seems to be up to 3x faster for terms aggregations
that return 10,000 unique terms (1000 terms per shard). For a cardinality of 100,000 terms, this patch is up to 5x faster:

````
########## Patch:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt      Score     Error  Units
TermsReduceBenchmark.reduceTopHits             5         100000         1000        1000  avgt   10  12791,083 ± 397,128  ms/op
TermsReduceBenchmark.reduceTopHits            16         100000         1000        1000  avgt   10   3974,939 ± 324,617  ms/op
TermsReduceBenchmark.reduceTopHits            32         100000         1000        1000  avgt   10   2186,285 ± 267,124  ms/op
TermsReduceBenchmark.reduceTopHits           128         100000         1000        1000  avgt   10    914,657 ± 160,784  ms/op
TermsReduceBenchmark.reduceTopHits           512         100000         1000        1000  avgt   10    604,198 ± 145,457  ms/op

########## Master:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt      Score     Error  Units
TermsReduceBenchmark.reduceTopHits             5         100000         1000        1000  avgt   10  60696,107 ± 929,944  ms/op
TermsReduceBenchmark.reduceTopHits            16         100000         1000        1000  avgt   10  16292,894 ± 783,398  ms/op
TermsReduceBenchmark.reduceTopHits            32         100000         1000        1000  avgt   10   7705,444 ±  77,588  ms/op
TermsReduceBenchmark.reduceTopHits           128         100000         1000        1000  avgt   10   2156,685 ±  88,795  ms/op
TermsReduceBenchmark.reduceTopHits           512         100000         1000        1000  avgt   10    760,273 ±  53,738  ms/op
````

The merge of buckets can also be optimized. Currently we use a hash map to merge buckets coming from different shards,
which can be costly if the number of unique terms is high. Instead, we could always sort each shard's terms results
by key and perform a merge sort to reduce the results. This would save memory and make the merge complexity more linear
on the coordinating node, at the expense of an additional sort on the shards. I plan to test this possible
optimization in a follow-up.

Relates elastic#51857
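
To illustrate the idea in the commit message (a self-contained sketch, not the actual Elasticsearch code): a non-final reduce only needs to combine buckets with equal keys, and the sort-and-truncate work the priority queue provided can be deferred to the final reduce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermsReduceSketch {
    record Bucket(String key, long docCount) {}

    /** Merge buckets with equal keys; sort and truncate only on the final reduce. */
    static List<Bucket> reduce(List<List<Bucket>> shardResults, int topN, boolean finalReduce) {
        Map<String, Long> merged = new LinkedHashMap<>();
        for (List<Bucket> shard : shardResults) {
            for (Bucket b : shard) {
                merged.merge(b.key(), b.docCount(), Long::sum);
            }
        }
        List<Bucket> out = new ArrayList<>(merged.size());
        merged.forEach((k, v) -> out.add(new Bucket(k, v)));
        if (finalReduce) {
            // Only the final reduce needs the old priority-queue/top-N behavior.
            out.sort(Comparator.comparingLong(Bucket::docCount).reversed());
            return new ArrayList<>(out.subList(0, Math.min(topN, out.size())));
        }
        return out; // partial reduce: keep everything, order does not matter yet
    }
}
```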
jimczi added a commit that referenced this issue Mar 10, 2020
jimczi added a commit that referenced this issue Mar 10, 2020
@jimczi (Contributor, Author) commented Mar 27, 2020

We've made good progress on this issue recently, so apologies for not updating earlier.

While discussing the proposed change offline, we decided to start with benchmarks to evaluate the effect of changing the default value.
We started with the terms aggregation and opened #53216 to fix a performance bug.
In the meantime @nik9000 has changed the incremental reduce to lazily deserialize shard aggregation results on partial reduce.
This has two main advantages:

  • We can now evaluate more precisely the memory cost of keeping shard aggregation results in memory. This is important if we want to allow partial reduces to be executed when a memory threshold is reached.
  • We create the Java representation of the aggregation tree only during the partial reduce. This makes these Java objects short-lived, which should help the overall garbage collection on the node (see the sketch below).
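
A minimal sketch of the lazy-deserialization idea (illustrative types, not the actual change): buffer the serialized shard responses as raw bytes, so their memory cost is directly measurable, and materialize the aggregation objects only when a reduce runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class LazyShardResults<A> {
    private final List<byte[]> serialized = new ArrayList<>();
    private final Function<byte[], A> deserialize;

    LazyShardResults(Function<byte[], A> deserialize) {
        this.deserialize = deserialize;
    }

    /** Buffer the raw bytes; memory cost is simply the sum of their lengths. */
    void consume(byte[] shardResponse) {
        serialized.add(shardResponse);
    }

    long bufferedBytes() {
        return serialized.stream().mapToLong(b -> b.length).sum();
    }

    /** Materialize the aggregation trees only when a (partial) reduce fires,
     *  so the deserialized objects stay short-lived. */
    List<A> drainForReduce() {
        List<A> aggs = new ArrayList<>(serialized.size());
        for (byte[] bytes : serialized) {
            aggs.add(deserialize.apply(bytes));
        }
        serialized.clear();
        return aggs;
    }
}
```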

In order to effectively reduce the default value without impacting users, we think that more work is needed:

  • Perform a real merge-sort reduce on terms aggregations. This should reduce the cost of running an incremental reduce.
  • Provide micro-benchmarks for date_histogram and histogram aggregations with different batched_reduce_size values.

Once these tasks are completed, we will resume the discussion with the additional information in order to make a decision based on real measurements.

@nik9000 (Member) commented Apr 3, 2020

I wonder if the number should be different for searches with aggs vs searches without them. Aggs are the most expensive thing to reduce but they are also the biggest thing to keep around.

@jpountz (Contributor) commented Apr 7, 2020

My inclination would be to try hard to have the same number all the time. I've been hit by moving parts like that a couple of times when digging into performance issues.

rjernst added the Team:Analytics and Team:Search labels May 4, 2020
@hackerwin7 commented Jun 9, 2020

Are there any plans to support parallel reduce on the coordinator?
IMO, if a search request targets an extremely large number of shards and the batched reduce size is relatively low, subsequent shard query results may block on the synchronized consumeInternal().
How about supporting parallel reduce for this?

@hackerwin7 commented Jun 19, 2020

I tested a huge aggs case: with batched size = 5, the query's took time is extremely high compared to the default batched size.

@jimczi (Contributor, Author) commented Jun 19, 2020

> How about supporting parallel reduce for this?

The plan is to move to a thread pool in order to limit the number of partial reduces we perform in parallel. The synchronization will be removed, but we'll continue to limit the number of partial reduces executed in parallel for a single search request to 1.

> With batched size = 5, the query's took time is extremely high compared to the default batched size.

We're making improvements to the partial reduce, but this is expected to be slower. We haven't settled on the default yet, and 5 might be too small. We'll test the performance before making the change, but first we need to clean up the synchronization and move the operation to a thread pool. Maybe you can share your results for comparison?
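
A self-contained sketch of that scheme (illustrative, not the actual implementation): partial reduces for a single request are chained so that at most one runs at a time, while the shared pool still serves many requests concurrently.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

/** At most one partial reduce in flight per search request, forked to a shared pool. */
final class SequentialReducer {
    private final ExecutorService searchPool;
    private final ConcurrentLinkedQueue<Runnable> pending = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean running = new AtomicBoolean();

    SequentialReducer(ExecutorService searchPool) {
        this.searchPool = searchPool;
    }

    /** Called from the network thread when enough shard results are buffered. */
    void enqueuePartialReduce(Runnable reduceTask) {
        pending.add(reduceTask);
        drain();
    }

    private void drain() {
        while (running.compareAndSet(false, true)) {
            Runnable next = pending.poll();
            if (next == null) {
                running.set(false);
                if (pending.isEmpty()) {
                    return;
                }
                continue; // a task slipped in; try to claim again
            }
            searchPool.execute(() -> {
                try {
                    next.run(); // runs off the network thread
                } finally {
                    running.set(false);
                    drain(); // pick up any reduce queued meanwhile
                }
            });
            return;
        }
    }
}
```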

@hackerwin7 commented Jun 19, 2020

@jimczi
I have implemented a patch for parallel reduce on the master branch.
Here are my test results:

Mem stats

| JVM | default | batched=5 | parallel |
| --- | --- | --- | --- |
| Eden | increase 8.4 G | increase more than Eden capacity | increase more than Eden capacity |
| Old | increase 1.6 G | increase 200 MB | increase 6 G |
| YGC count | 0 | 3 | 15 |
| FGC count | 0 | 0 | 0 |

Search stats

| Query | default | batched=5 | parallel |
| --- | --- | --- | --- |
| Huge-aggs-query | 14.184 s | 142.341 s | 37.681 s |

It seems that a higher reduce count causes more memory cost.

@hackerwin7

> The plan is to move to a thread pool in order to limit the number of partial reduces we perform in parallel. The synchronization will be removed, but we'll continue to limit the number of partial reduces executed in parallel for a single search request to 1.

I have a similar patch for this.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 23, 2020
This change forks the execution of partial
reduces on the coordinating node to the search thread pool.
It also ensures that partial reduces are executed sequentially
and asynchronously, both to limit the memory and CPU that a
single search request can use and to avoid blocking a
network thread.
If a partial reduce fails with an exception, the search
request is cancelled and the reporting of the error is
delayed to the start of the fetch phase (when the final
reduce is performed). This ensures that we clean up the
in-flight search requests before returning an error to
the user.

Closes elastic#53411
Relates elastic#51857
jimczi added a commit that referenced this issue Jul 28, 2020
jimczi added a commit that referenced this issue Jul 28, 2020
jimczi added a commit to jimczi/elasticsearch that referenced this issue Sep 1, 2020
Today, the terms aggregation reduces multiple aggregations at once, using a map
to group the same buckets together. This operation can be costly since it requires
looking up every bucket in a global map with no particular order.
This commit changes how term buckets are sorted by shards and partial reduces so
that results can be reduced with a merge-sort strategy.
For bwc, results are merged with the legacy code if any of the aggregations use
a different sort (i.e. it was returned by a node on an earlier version).

Relates elastic#51857
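
A self-contained sketch of the merge-sort strategy (assuming each shard now returns its buckets pre-sorted by key): a k-way merge over the sorted lists combines equal keys without a global hash map. Note the heap here holds one cursor per shard, not every bucket as in the old top-N queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MergeSortReduceSketch {
    record Bucket(String key, long docCount) {}

    /** Cursor over one shard's key-sorted bucket list. */
    private record Cursor(List<Bucket> buckets, int pos) {
        Bucket head() { return buckets.get(pos); }
    }

    /** K-way merge of per-shard bucket lists that are pre-sorted by key. */
    static List<Bucket> reduce(List<List<Bucket>> sortedShardResults) {
        PriorityQueue<Cursor> pq = new PriorityQueue<>(
            (a, b) -> a.head().key().compareTo(b.head().key()));
        for (List<Bucket> shard : sortedShardResults) {
            if (!shard.isEmpty()) pq.add(new Cursor(shard, 0));
        }
        List<Bucket> reduced = new ArrayList<>();
        while (!pq.isEmpty()) {
            Cursor top = pq.poll();
            Bucket head = top.head();
            // Equal keys come out of the heap consecutively, so they can be
            // merged into the last output bucket instead of a global map.
            if (!reduced.isEmpty() && reduced.get(reduced.size() - 1).key().equals(head.key())) {
                Bucket prev = reduced.remove(reduced.size() - 1);
                reduced.add(new Bucket(head.key(), prev.docCount() + head.docCount()));
            } else {
                reduced.add(head);
            }
            if (top.pos() + 1 < top.buckets().size()) {
                pq.add(new Cursor(top.buckets(), top.pos() + 1));
            }
        }
        return reduced; // sorted by key, equal keys merged
    }
}
```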
jimczi added a commit that referenced this issue Sep 4, 2020
jimczi added a commit that referenced this issue Sep 7, 2020
jimczi added a commit to jimczi/elasticsearch that referenced this issue Sep 10, 2020
This commit allows the coordinating node to account for the memory used to perform partial and final reduces of
aggregations in the request circuit breaker. The search coordinator adds the memory that it used to save
and reduce the results of shard aggregations to the request circuit breaker. Before any partial or final
reduce, the memory needed to reduce the aggregations is estimated and a CircuitBreakingException is thrown
if it exceeds the maximum memory allowed by this breaker.
This size is estimated as roughly 1.5 times the size of the serialized aggregations that need to be reduced.
This estimation can be completely off for some aggregations, but it is corrected with the real size after
the reduce completes.
If the reduce is successful, we update the circuit breaker to remove the size of the source aggregations
and replace the estimation with the serialized size of the newly reduced result.

As a follow-up we could trigger partial reduces based on the memory accounted in the circuit breaker instead
of relying on a static number of shard responses. A simpler follow-up that could be done in the meantime is
to [reduce the default batch reduce size](elastic#51857) of blocking
search requests to a saner number.

Closes elastic#37182
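
A simplified sketch of the accounting scheme (a toy breaker stands in for Elasticsearch's request circuit breaker; the 1.5x estimate comes from the commit message above):

```java
import java.util.concurrent.atomic.AtomicLong;

final class ReduceMemoryAccounting {
    /** Toy stand-in for the request circuit breaker. */
    static final class Breaker {
        private final long limitBytes;
        private final AtomicLong used = new AtomicLong();

        Breaker(long limitBytes) { this.limitBytes = limitBytes; }

        void addEstimateAndMaybeBreak(long bytes) {
            if (used.addAndGet(bytes) > limitBytes) {
                used.addAndGet(-bytes); // roll back before failing
                throw new IllegalStateException("circuit breaking: reduce would exceed limit");
            }
        }

        void adjust(long deltaBytes) { used.addAndGet(deltaBytes); }
    }

    static byte[] reduceWithAccounting(Breaker breaker, long serializedSourcesBytes) {
        // The serialized shard responses are accounted as they are buffered.
        breaker.addEstimateAndMaybeBreak(serializedSourcesBytes);
        // Before reducing, reserve ~1.5x the serialized size of the inputs.
        long estimate = serializedSourcesBytes + serializedSourcesBytes / 2;
        breaker.addEstimateAndMaybeBreak(estimate);
        byte[] reduced = doReduce(); // the actual partial/final reduce
        // On success, drop the sources and the estimate; keep the real reduced size.
        breaker.adjust(reduced.length - estimate - serializedSourcesBytes);
        return reduced;
    }

    private static byte[] doReduce() { return new byte[128]; } // placeholder
}
```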
jimczi added a commit that referenced this issue Sep 24, 2020
jimczi added a commit that referenced this issue Sep 24, 2020
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this issue May 2, 2024