
Speed up partial reduce of terms aggregations #53216

Merged
jimczi merged 4 commits into elastic:master from terms_partial_reduce_optim on Mar 10, 2020

Conversation

jimczi
Contributor

@jimczi jimczi commented Mar 6, 2020

This change optimizes the merge of terms aggregations by removing the priority queue
that was used to collect all the buckets during a non-final reduction. We don't need
to keep the result sorted, since the merge of buckets in a subsequent reduce can modify
the order anyway. I wrote a small micro-benchmark to test the change, and the speedups
are significant for small merge buffer sizes:

````
########## Master:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt     Score     Error  Units
TermsReduceBenchmark.reduceTopHits             5          10000         1000        1000  avgt   10  2459,690 ± 198,682  ms/op
TermsReduceBenchmark.reduceTopHits            16          10000         1000        1000  avgt   10  1030,620 ±  91,544  ms/op
TermsReduceBenchmark.reduceTopHits            32          10000         1000        1000  avgt   10   558,608 ±  44,915  ms/op
TermsReduceBenchmark.reduceTopHits           128          10000         1000        1000  avgt   10   287,333 ±   8,342  ms/op
TermsReduceBenchmark.reduceTopHits           512          10000         1000        1000  avgt   10   257,325 ±  54,515  ms/op

########## Patch:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt    Score    Error  Units
TermsReduceBenchmark.reduceTopHits             5          10000         1000        1000  avgt   10  805,611 ± 14,630  ms/op
TermsReduceBenchmark.reduceTopHits            16          10000         1000        1000  avgt   10  378,851 ± 17,929  ms/op
TermsReduceBenchmark.reduceTopHits            32          10000         1000        1000  avgt   10  261,094 ± 10,176  ms/op
TermsReduceBenchmark.reduceTopHits           128          10000         1000        1000  avgt   10  241,051 ± 19,558  ms/op
TermsReduceBenchmark.reduceTopHits           512          10000         1000        1000  avgt   10  231,643 ±  6,170  ms/op
````
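
Conceptually, the change amounts to something like the following. This is a toy sketch with illustrative types, not the actual `InternalTerms` reduce code: on a partial (non-final) reduce, buckets are simply collected into a list, and only the final reduce pays for the sort and the truncation to the requested size.

````java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative stand-in for the real bucket class.
record Bucket(String key, long docCount) {}

class PartialReduceSketch {
    /**
     * Reduce already-merged buckets. On a partial reduce we keep every
     * bucket and skip sorting: a later reduce may merge more buckets into
     * them and change the order anyway. Only the final reduce sorts and
     * truncates to the requested top-N size.
     */
    static List<Bucket> reduce(List<Bucket> merged, boolean finalReduce, int topN) {
        if (finalReduce == false) {
            return merged; // keep all buckets, unsorted, no priority queue
        }
        List<Bucket> sorted = new ArrayList<>(merged);
        sorted.sort(Comparator.comparingLong(Bucket::docCount).reversed());
        return sorted.subList(0, Math.min(topN, sorted.size()));
    }
}
````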

The code for the benchmark can be found here.
It seems to be up to 3x faster for terms aggregations that return 10,000 unique terms (1000 terms per shard).
For a cardinality of 100,000 terms, this patch is up to 5x faster:

````
########## Patch:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt      Score     Error  Units
TermsReduceBenchmark.reduceTopHits             5         100000         1000        1000  avgt   10  12791,083 ± 397,128  ms/op
TermsReduceBenchmark.reduceTopHits            16         100000         1000        1000  avgt   10   3974,939 ± 324,617  ms/op
TermsReduceBenchmark.reduceTopHits            32         100000         1000        1000  avgt   10   2186,285 ± 267,124  ms/op
TermsReduceBenchmark.reduceTopHits           128         100000         1000        1000  avgt   10    914,657 ± 160,784  ms/op
TermsReduceBenchmark.reduceTopHits           512         100000         1000        1000  avgt   10    604,198 ± 145,457  ms/op

########## Master:
Benchmark                           (bufferSize)  (cardinality)  (numShards)  (topNSize)  Mode  Cnt      Score     Error  Units
TermsReduceBenchmark.reduceTopHits             5         100000         1000        1000  avgt   10  60696,107 ± 929,944  ms/op
TermsReduceBenchmark.reduceTopHits            16         100000         1000        1000  avgt   10  16292,894 ± 783,398  ms/op
TermsReduceBenchmark.reduceTopHits            32         100000         1000        1000  avgt   10   7705,444 ±  77,588  ms/op
TermsReduceBenchmark.reduceTopHits           128         100000         1000        1000  avgt   10   2156,685 ±  88,795  ms/op
TermsReduceBenchmark.reduceTopHits           512         100000         1000        1000  avgt   10    760,273 ±  53,738  ms/op
````
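
The numbers above are JMH average-time results. As a rough idea of the harness shape, here is a hypothetical, self-contained JMH skeleton in which the real aggregation classes are replaced by a toy map-based reduce; `TermsReduceBenchmarkSketch` and its fields are illustrative, not the actual benchmark code:

````java
import java.util.*;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
public class TermsReduceBenchmarkSketch {
    @Param({"5", "16", "32", "128", "512"})
    int bufferSize;

    @Param({"10000", "100000"})
    int cardinality;

    final int numShards = 1000;
    List<Map<String, Long>> shardResults;

    @Setup
    public void setup() {
        // Simulate numShards shard responses of 1000 terms each,
        // drawn from `cardinality` unique keys.
        Random random = new Random(42);
        shardResults = new ArrayList<>();
        for (int s = 0; s < numShards; s++) {
            Map<String, Long> shard = new HashMap<>();
            for (int i = 0; i < 1000; i++) {
                shard.put("term" + random.nextInt(cardinality), 1L + random.nextInt(100));
            }
            shardResults.add(shard);
        }
    }

    @Benchmark
    public Map<String, Long> reduceTopHits() {
        // Partially reduce in batches of bufferSize, then merge the partials.
        List<Map<String, Long>> pending = new ArrayList<>(shardResults);
        while (pending.size() > 1) {
            List<Map<String, Long>> next = new ArrayList<>();
            for (int i = 0; i < pending.size(); i += bufferSize) {
                Map<String, Long> merged = new HashMap<>();
                for (Map<String, Long> part : pending.subList(i, Math.min(i + bufferSize, pending.size()))) {
                    part.forEach((k, v) -> merged.merge(k, v, Long::sum));
                }
                next.add(merged);
            }
            pending = next;
        }
        return pending.get(0);
    }
}
````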

The merge of buckets can also be optimized. Currently we use a hash map to merge buckets coming from
different shards, which can be costly when the number of unique terms is high. Instead, we could always
sort the shard-level terms results by key and perform a merge sort to reduce them. This would save memory
and make the merge complexity linear on the coordinating node, at the expense of an additional sort on
the shards. I plan to test this possible optimization in a follow-up; a sketch of the idea follows.
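
A sketch of that follow-up idea, assuming every shard result arrives sorted by key (again with toy types, not the real API): a k-way merge combines equal keys as it streams, and the heap it uses is bounded by the number of input lists rather than by the number of unique terms.

````java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge-sort reduce over shard results pre-sorted by key.
class MergeSortReduceSketch {
    record Bucket(String key, long docCount) {}

    static List<Bucket> reduceByKey(List<List<Bucket>> sortedShardResults) {
        record Head(Bucket bucket, Iterator<Bucket> rest) {}
        // Heap of one head per input list, ordered by key.
        PriorityQueue<Head> pq = new PriorityQueue<>(
            (a, b) -> a.bucket().key().compareTo(b.bucket().key()));
        for (List<Bucket> shard : sortedShardResults) {
            Iterator<Bucket> it = shard.iterator();
            if (it.hasNext()) pq.add(new Head(it.next(), it));
        }
        List<Bucket> reduced = new ArrayList<>();
        while (pq.isEmpty() == false) {
            Head head = pq.poll();
            Bucket b = head.bucket();
            if (reduced.isEmpty() == false && reduced.get(reduced.size() - 1).key().equals(b.key())) {
                // Same key as the previous output bucket: combine doc counts.
                Bucket prev = reduced.remove(reduced.size() - 1);
                reduced.add(new Bucket(b.key(), prev.docCount() + b.docCount()));
            } else {
                reduced.add(b);
            }
            if (head.rest().hasNext()) pq.add(new Head(head.rest().next(), head.rest()));
        }
        return reduced; // still sorted by key, ready for another merge
    }
}
````

Since the output stays sorted by key, another partial reduce can consume it the same way, and pruning becomes possible whenever the final sort is also by key.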

Relates #51857

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@nik9000 nik9000 left a comment
Member


LGTM.

Could you add a comment about why we need to keep all the buckets until final reduction? I could see wanting the priority queue/throwing away buckets in some cases. Like, say, you are sorting on key. But if you are sorting on doc count or something then you can't throw them away. And we don't have any way to expose that to this code. I think. Is that right?

@jimczi
Contributor Author

jimczi commented Mar 6, 2020

> Like, say, you are sorting on key. But if you are sorting on doc count or something then you can't throw them away. And we don't have any way to expose that to this code. I think. Is that right?

Yes, ideally the intermediate buckets (shard and partial reduce) should be sorted by key so that we can perform a merge sort on reduce and prune on partial reduce if the final sort is also by key. That's the follow-up I mentioned in the description.
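
In code terms, the condition discussed here could be reduced to a guard like the following (hypothetical names, not the real `BucketOrder` API):

````java
class PruneRule {
    enum ReduceOrder { KEY_ASC, KEY_DESC, COUNT_DESC }

    // Hypothetical guard: buckets may only be discarded during a partial
    // reduce when the final sort matches the key order the intermediate
    // results are sorted by. Under a doc-count sort, a bucket with a low
    // count on one slice can still end up in the global top N after
    // merging, so every bucket must be kept until the final reduce.
    static boolean canPruneOnPartialReduce(ReduceOrder finalOrder) {
        return finalOrder == ReduceOrder.KEY_ASC || finalOrder == ReduceOrder.KEY_DESC;
    }
}
````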

@nik9000
Member

nik9000 commented Mar 6, 2020

> Yes, ideally the intermediate buckets (shard and partial reduce) should be sorted by key so that we can perform a merge sort on reduce and prune on partial reduce if the final sort is also by key. That's the follow-up I mentioned in the description.

👍

@jimczi jimczi merged commit f153f19 into elastic:master Mar 10, 2020
@jimczi jimczi deleted the terms_partial_reduce_optim branch March 10, 2020 12:24
jimczi added a commit that referenced this pull request Mar 10, 2020