Cache buckets to speed up BytesRefHash#sort #12784

Merged
merged 10 commits into from
Nov 10, 2023

Conversation

gf2121
Contributor

@gf2121 gf2121 commented Nov 8, 2023

Following #12775, this PR tries another approach to speeding up BytesRefHash#sort.
The idea is that, since this map already has spare ints, we can cache each term's bucket while building the histogram and reuse it during reorder. I checked this approach on an Intel chip and saw a ~30% speedup. I'll check an M2 chip and wikimedium data tomorrow.
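
To make the idea concrete, here is a minimal sketch of a single radix pass that caches buckets. It is not the code from this PR: the class name `CachedBucketPass`, the helper `bucketOf`, and the separate `cache` array (a stand-in for the spare ints mentioned above) are all assumptions for illustration, and the reorder step is written out-of-place for brevity. The point it shows is that term bytes are read only while building the histogram; reorder consults the cached bucket instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; names and layout are hypothetical, not Lucene's.
class CachedBucketPass {
  // Bucket 0 means "term ends before position k"; buckets 1..256 hold byte values 0..255.
  static final int BUCKETS = 257;

  static int bucketOf(byte[] term, int k) {
    return k < term.length ? (term[k] & 0xFF) + 1 : 0;
  }

  /**
   * One stable counting-sort pass on byte k. ids[i] is an index into terms;
   * cache is scratch space with the same length as ids.
   */
  static void pass(byte[][] terms, int[] ids, int k, int[] cache) {
    int[] histogram = new int[BUCKETS];
    // Build histogram: the only loop that touches term bytes.
    for (int i = 0; i < ids.length; i++) {
      int b = bucketOf(terms[ids[i]], k);
      cache[i] = b; // remember the bucket for the reorder step
      histogram[b]++;
    }
    // Prefix sums turn counts into the first output slot of each bucket.
    int[] next = new int[BUCKETS];
    for (int b = 1; b < BUCKETS; b++) {
      next[b] = next[b - 1] + histogram[b - 1];
    }
    // Reorder: reads cached buckets and never dereferences a term.
    int[] out = new int[ids.length];
    for (int i = 0; i < ids.length; i++) {
      out[next[cache[i]]++] = ids[i];
    }
    System.arraycopy(out, 0, ids, 0, ids.length);
  }

  public static void main(String[] args) {
    byte[][] terms = {
      "pear".getBytes(StandardCharsets.UTF_8),
      "apple".getBytes(StandardCharsets.UTF_8),
      "fig".getBytes(StandardCharsets.UTF_8),
      "plum".getBytes(StandardCharsets.UTF_8)
    };
    int[] ids = {0, 1, 2, 3};
    pass(terms, ids, 0, new int[ids.length]);
    System.out.println(Arrays.toString(ids)); // ids grouped by first byte: [1, 2, 0, 3]
  }
}
```

In a hash where every bucket lookup means chasing a pointer into a byte pool, skipping the second lookup is plausibly where the reorder time in the logs below goes.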

BASELINE:  sort 5169965 terms, build histogram took: 1968ms, reorder took: 2132ms, total took: 5470ms.
BASELINE:  sort 5169965 terms, build histogram took: 1975ms, reorder took: 2133ms, total took: 5526ms.
BASELINE:  sort 5169965 terms, build histogram took: 1999ms, reorder took: 2157ms, total took: 5573ms.
BASELINE:  sort 5169965 terms, build histogram took: 1955ms, reorder took: 2138ms, total took: 5446ms.
BASELINE:  sort 5169965 terms, build histogram took: 1990ms, reorder took: 2161ms, total took: 5528ms.
BASELINE:  sort 5169965 terms, build histogram took: 1997ms, reorder took: 2175ms, total took: 5571ms.
BASELINE:  sort 5169965 terms, build histogram took: 2004ms, reorder took: 2119ms, total took: 5477ms.
BASELINE:  sort 5169965 terms, build histogram took: 1978ms, reorder took: 2155ms, total took: 5501ms.
BASELINE:  sort 5169965 terms, build histogram took: 2015ms, reorder took: 2169ms, total took: 5572ms.
BASELINE:  sort 5169965 terms, build histogram took: 1941ms, reorder took: 2138ms, total took: 5400ms.
BASELINE:  sort 5169965 terms, build histogram took: 2000ms, reorder took: 2155ms, total took: 5558ms.

CANDIDATE:  sort 5169965 terms, build histogram took: 1996ms, reorder took: 133ms, total took: 3734ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 1989ms, reorder took: 142ms, total took: 3655ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 2031ms, reorder took: 155ms, total took: 3762ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 2016ms, reorder took: 145ms, total took: 3739ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 1994ms, reorder took: 142ms, total took: 3667ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 2010ms, reorder took: 140ms, total took: 3651ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 2021ms, reorder took: 154ms, total took: 3731ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 2019ms, reorder took: 144ms, total took: 3727ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 2064ms, reorder took: 138ms, total took: 3784ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 2043ms, reorder took: 142ms, total took: 3727ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 1964ms, reorder took: 140ms, total took: 3630ms.

@gf2121
Contributor Author

gf2121 commented Nov 9, 2023

Even faster than the original approach on M2:

BASELINE:  sort 5169965 terms, build histogram took: 489ms, reorder took: 1359ms, total took: 2381ms.
BASELINE:  sort 5169965 terms, build histogram took: 449ms, reorder took: 1290ms, total took: 2249ms.
BASELINE:  sort 5169965 terms, build histogram took: 458ms, reorder took: 1279ms, total took: 2238ms.
BASELINE:  sort 5169965 terms, build histogram took: 462ms, reorder took: 1302ms, total took: 2260ms.
BASELINE:  sort 5169965 terms, build histogram took: 455ms, reorder took: 1282ms, total took: 2239ms.
BASELINE:  sort 5169965 terms, build histogram took: 449ms, reorder took: 1327ms, total took: 2282ms.
BASELINE:  sort 5169965 terms, build histogram took: 474ms, reorder took: 1320ms, total took: 2303ms.
BASELINE:  sort 5169965 terms, build histogram took: 448ms, reorder took: 1260ms, total took: 2227ms.
BASELINE:  sort 5169965 terms, build histogram took: 448ms, reorder took: 1307ms, total took: 2258ms.
BASELINE:  sort 5169965 terms, build histogram took: 464ms, reorder took: 1313ms, total took: 2280ms.
BASELINE:  sort 5169965 terms, build histogram took: 444ms, reorder took: 1294ms, total took: 2261ms.

CANDIDATE:  sort 5169965 terms, build histogram took: 549ms, reorder took: 84ms, total took: 1287ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 535ms, reorder took: 83ms, total took: 1205ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 531ms, reorder took: 65ms, total took: 1197ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 552ms, reorder took: 70ms, total took: 1217ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 576ms, reorder took: 73ms, total took: 1255ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 551ms, reorder took: 66ms, total took: 1207ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 551ms, reorder took: 68ms, total took: 1208ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 553ms, reorder took: 85ms, total took: 1240ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 562ms, reorder took: 66ms, total took: 1235ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 557ms, reorder took: 76ms, total took: 1233ms.
CANDIDATE:  sort 5169965 terms, build histogram took: 541ms, reorder took: 66ms, total took: 1204ms.

@gf2121
Contributor Author

gf2121 commented Nov 9, 2023

As "reorder" gets faster, I'm considering lowering the fallback threshold and letting radix sort do more of the work.
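
For readers unfamiliar with the fallback, the sketch below shows the general shape of an MSB radix sort that hands small ranges to a comparison sort. It is an assumption-laden illustration, not Lucene's sorter: `ThresholdRadixSort`, `bucketOf`, and the `fallbackThreshold` parameter are hypothetical names, and the fallback here is simply `Arrays.sort` with an unsigned byte comparator. A lower threshold means more ranges go through the histogram and reorder steps rather than through comparisons.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical, not Lucene's.
class ThresholdRadixSort {
  // Bucket 0: term ends at position k. Buckets 1..256: unsigned byte value + 1.
  static int bucketOf(byte[] term, int k) {
    return k < term.length ? (term[k] & 0xFF) + 1 : 0;
  }

  /** Sorts terms[from, to) in unsigned byte order, assuming they share their first k bytes. */
  static void sort(byte[][] terms, int from, int to, int k, int fallbackThreshold) {
    int len = to - from;
    if (len <= 1) {
      return;
    }
    if (len < fallbackThreshold) {
      // Small range: fall back to a comparison sort.
      Arrays.sort(terms, from, to, (a, b) -> Arrays.compareUnsigned(a, b));
      return;
    }
    int[] cache = new int[len];
    int[] start = new int[258];
    for (int i = 0; i < len; i++) {
      int b = bucketOf(terms[from + i], k);
      cache[i] = b;
      start[b + 1]++; // histogram, shifted by one slot for the prefix sum
    }
    for (int b = 0; b < 257; b++) {
      start[b + 1] += start[b]; // start[b] is now the first slot of bucket b
    }
    int[] next = Arrays.copyOf(start, 257);
    byte[][] out = new byte[len][];
    for (int i = 0; i < len; i++) {
      out[next[cache[i]]++] = terms[from + i]; // reorder using cached buckets
    }
    System.arraycopy(out, 0, terms, from, len);
    // Bucket 0 holds terms of length k, which are all equal; recurse into the rest.
    for (int b = 1; b <= 256; b++) {
      sort(terms, from + start[b], from + start[b + 1], k + 1, fallbackThreshold);
    }
  }

  public static void main(String[] args) {
    String[] words = {"radix", "sort", "bucket", "cache", "reorder", "bucket"};
    byte[][] terms = new byte[words.length][];
    for (int i = 0; i < words.length; i++) {
      terms[i] = words[i].getBytes(StandardCharsets.UTF_8);
    }
    sort(terms, 0, terms.length, 0, 2); // threshold 2 forces radix passes everywhere
    for (byte[] t : terms) {
      System.out.println(new String(t, StandardCharsets.UTF_8));
    }
  }
}
```

The result is the same for any threshold; only the split of work between radix passes and comparisons changes, which is what the runs below measure.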
 
Benchmark result

MAC Intel
BASELINE:  sort 5169965 terms, build histogram took: 1969ms, reorder took: 2103ms, total took: 5524ms.
BASELINE:  sort 5169965 terms, build histogram took: 1975ms, reorder took: 2118ms, total took: 5435ms.
BASELINE:  sort 5169965 terms, build histogram took: 1977ms, reorder took: 2140ms, total took: 5441ms.
BASELINE:  sort 5169965 terms, build histogram took: 1982ms, reorder took: 2168ms, total took: 5452ms.
BASELINE:  sort 5169965 terms, build histogram took: 1987ms, reorder took: 2164ms, total took: 5475ms.
BASELINE:  sort 5169965 terms, build histogram took: 1963ms, reorder took: 2157ms, total took: 5456ms.
BASELINE:  sort 5169965 terms, build histogram took: 1979ms, reorder took: 2162ms, total took: 5452ms.
BASELINE:  sort 5169965 terms, build histogram took: 1984ms, reorder took: 2172ms, total took: 5493ms.
BASELINE:  sort 5169965 terms, build histogram took: 1986ms, reorder took: 2166ms, total took: 5476ms.
BASELINE:  sort 5169965 terms, build histogram took: 2016ms, reorder took: 2287ms, total took: 5660ms.
BASELINE:  sort 5169965 terms, build histogram took: 1761ms, reorder took: 2109ms, total took: 5183ms.

CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2018ms, reorder took: 125ms, total took: 3760ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 1998ms, reorder took: 118ms, total took: 3630ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 1987ms, reorder took: 115ms, total took: 3596ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2001ms, reorder took: 111ms, total took: 3622ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2017ms, reorder took: 110ms, total took: 3610ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2021ms, reorder took: 118ms, total took: 3712ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2006ms, reorder took: 108ms, total took: 3634ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2015ms, reorder took: 106ms, total took: 3620ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2006ms, reorder took: 101ms, total took: 3611ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 1993ms, reorder took: 115ms, total took: 3609ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 2024ms, reorder took: 106ms, total took: 3633ms.

CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2972ms, reorder took: 199ms, total took: 3288ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2941ms, reorder took: 211ms, total took: 3279ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2991ms, reorder took: 178ms, total took: 3283ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2932ms, reorder took: 195ms, total took: 3233ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2954ms, reorder took: 202ms, total took: 3259ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2938ms, reorder took: 201ms, total took: 3249ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2949ms, reorder took: 195ms, total took: 3295ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2961ms, reorder took: 208ms, total took: 3274ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2965ms, reorder took: 201ms, total took: 3257ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2957ms, reorder took: 188ms, total took: 3285ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 2975ms, reorder took: 182ms, total took: 3291ms.

MAC M2
BASELINE:  sort 5169965 terms, build histogram took: 478ms, reorder took: 1332ms, total took: 2334ms.
BASELINE:  sort 5169965 terms, build histogram took: 483ms, reorder took: 1351ms, total took: 2333ms.
BASELINE:  sort 5169965 terms, build histogram took: 462ms, reorder took: 1319ms, total took: 2284ms.
BASELINE:  sort 5169965 terms, build histogram took: 463ms, reorder took: 1272ms, total took: 2246ms.
BASELINE:  sort 5169965 terms, build histogram took: 466ms, reorder took: 1285ms, total took: 2257ms.
BASELINE:  sort 5169965 terms, build histogram took: 471ms, reorder took: 1318ms, total took: 2289ms.
BASELINE:  sort 5169965 terms, build histogram took: 465ms, reorder took: 1322ms, total took: 2291ms.
BASELINE:  sort 5169965 terms, build histogram took: 449ms, reorder took: 1288ms, total took: 2234ms.
BASELINE:  sort 5169965 terms, build histogram took: 448ms, reorder took: 1302ms, total took: 2257ms.
BASELINE:  sort 5169965 terms, build histogram took: 453ms, reorder took: 1305ms, total took: 2263ms.
BASELINE:  sort 5169965 terms, build histogram took: 456ms, reorder took: 1270ms, total took: 2229ms.

CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 586ms, reorder took: 79ms, total took: 1343ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 551ms, reorder took: 62ms, total took: 1212ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 588ms, reorder took: 72ms, total took: 1253ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 561ms, reorder took: 74ms, total took: 1227ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 572ms, reorder took: 67ms, total took: 1260ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 539ms, reorder took: 63ms, total took: 1208ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 542ms, reorder took: 74ms, total took: 1209ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 546ms, reorder took: 78ms, total took: 1222ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 532ms, reorder took: 70ms, total took: 1195ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 536ms, reorder took: 81ms, total took: 1224ms.
CANDIDATE( fallback threshold = 100 ):  sort 5169965 terms, build histogram took: 548ms, reorder took: 66ms, total took: 1207ms.

CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 660ms, reorder took: 108ms, total took: 822ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 645ms, reorder took: 120ms, total took: 815ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 648ms, reorder took: 124ms, total took: 829ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 654ms, reorder took: 145ms, total took: 835ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 652ms, reorder took: 137ms, total took: 837ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 631ms, reorder took: 144ms, total took: 816ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 654ms, reorder took: 120ms, total took: 821ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 684ms, reorder took: 111ms, total took: 841ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 652ms, reorder took: 146ms, total took: 836ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 658ms, reorder took: 127ms, total took: 837ms.
CANDIDATE( fallback threshold = 50 ):  sort 5169965 terms, build histogram took: 673ms, reorder took: 123ms, total took: 838ms.

@gf2121
Contributor Author

gf2121 commented Nov 9, 2023

I tried this approach with wikimedium10m. The summed sort time decreased by ~60% on M2 and ~30% on Intel.
 
Benchmark result

MAC Intel

| Flush round | Sort field name | Sort term count | Baseline (ms) | Candidate (ms) | Diff |
|---|---|---|---|---|---|
| 1 | date | 45925 | 59 | 35 | -40.68% |
| 1 | groupend | 1 | 0 | 0 | n/a |
| 1 | titleTokenized | 33250 | 19 | 16 | -15.79% |
| 1 | id | 989620 | 840 | 575 | -31.55% |
| 1 | body | 3444798 | 5512 | 3739 | -32.17% |
| 1 | title | 46261 | 31 | 29 | -6.45% |
| 2 | date | 84517 | 107 | 86 | -19.63% |
| 2 | groupend | 1 | 0 | 0 | n/a |
| 2 | titleTokenized | 50134 | 34 | 30 | -11.76% |
| 2 | id | 1059900 | 1280 | 877 | -31.48% |
| 2 | body | 3210318 | 5126 | 3464 | -32.42% |
| 2 | title | 85009 | 71 | 62 | -12.68% |
| 3 | date | 86188 | 127 | 92 | -27.56% |
| 3 | groupend | 1 | 0 | 0 | n/a |
| 3 | titleTokenized | 60502 | 46 | 43 | -6.52% |
| 3 | id | 1011210 | 1079 | 764 | -29.19% |
| 3 | body | 3389689 | 5777 | 3570 | -38.20% |
| 3 | title | 86756 | 73 | 59 | -19.18% |
| 4 | date | 123222 | 211 | 158 | -25.12% |
| 4 | groupend | 1 | 0 | 0 | n/a |
| 4 | titleTokenized | 84573 | 71 | 75 | 5.63% |
| 4 | id | 1030670 | 1264 | 819 | -35.21% |
| 4 | body | 3442602 | 5663 | 3770 | -33.43% |
| 4 | title | 124156 | 121 | 108 | -10.74% |
| 5 | date | 150957 | 356 | 158 | -55.62% |
| 5 | groupend | 1 | 0 | 0 | n/a |
| 5 | titleTokenized | 100910 | 92 | 79 | -14.13% |
| 5 | id | 1045550 | 2117 | 959 | -54.70% |
| 5 | body | 3267300 | 5607 | 3521 | -37.20% |
| 5 | title | 152061 | 170 | 147 | -13.53% |
| 6 | date | 162374 | 227 | 219 | -3.52% |
| 6 | groupend | 1 | 0 | 0 | n/a |
| 6 | titleTokenized | 107464 | 87 | 95 | 9.20% |
| 6 | id | 1057430 | 1057 | 813 | -23.08% |
| 6 | body | 3140161 | 5013 | 3424 | -31.70% |
| 6 | title | 163347 | 176 | 151 | -14.20% |
| 7 | date | 177647 | 257 | 230 | -10.51% |
| 7 | groupend | 1 | 1 | 0 | -100.00% |
| 7 | titleTokenized | 120037 | 86 | 124 | 44.19% |
| 7 | id | 1062920 | 1214 | 1008 | -16.97% |
| 7 | body | 3173313 | 5082 | 3566 | -29.83% |
| 7 | title | 178886 | 190 | 201 | 5.79% |
| 8 | date | 199744 | 277 | 227 | -18.05% |
| 8 | groupend | 1 | 0 | 0 | n/a |
| 8 | titleTokenized | 133801 | 112 | 111 | -0.89% |
| 8 | id | 1072970 | 1059 | 792 | -25.21% |
| 8 | body | 3090582 | 4854 | 3247 | -33.11% |
| 8 | title | 201130 | 221 | 216 | -2.26% |
| 9 | date | 209298 | 300 | 280 | -6.67% |
| 9 | groupend | 1 | 0 | 0 | n/a |
| 9 | titleTokenized | 142697 | 123 | 124 | 0.81% |
| 9 | id | 1084580 | 1367 | 923 | -32.48% |
| 9 | body | 3032046 | 4977 | 3190 | -35.91% |
| 9 | title | 210958 | 231 | 239 | 3.46% |
| 10 | date | 119014 | 147 | 129 | -12.24% |
| 10 | groupend | 1 | 0 | 0 | n/a |
| 10 | titleTokenized | 93308 | 72 | 74 | 2.78% |
| 10 | id | 585150 | 504 | 409 | -18.85% |
| 10 | body | 1909153 | 2536 | 1753 | -30.88% |
| 10 | title | 119806 | 120 | 117 | -2.50% |
| sort took sum | | | 66143 | 44897 | -32.12% |

MAC M2

| Flush round | Sort field name | Sort term count | Baseline (ms) | Candidate (ms) | Diff |
|---|---|---|---|---|---|
| 1 | date | 45925 | 32 | 15 | -53.13% |
| 1 | groupend | 1 | 0 | 0 | n/a |
| 1 | titleTokenized | 33250 | 10 | 6 | -40.00% |
| 1 | id | 989620 | 356 | 167 | -53.09% |
| 1 | body | 3444798 | 1991 | 698 | -64.94% |
| 1 | title | 46261 | 10 | 7 | -30.00% |
| 2 | date | 84517 | 52 | 21 | -59.62% |
| 2 | groupend | 1 | 0 | 0 | n/a |
| 2 | titleTokenized | 50134 | 15 | 7 | -53.33% |
| 2 | id | 1059900 | 680 | 212 | -68.82% |
| 2 | body | 3210318 | 1730 | 660 | -61.85% |
| 2 | title | 85009 | 22 | 13 | -40.91% |
| 3 | date | 86188 | 37 | 21 | -43.24% |
| 3 | groupend | 1 | 0 | 0 | n/a |
| 3 | titleTokenized | 60502 | 14 | 9 | -35.71% |
| 3 | id | 1011210 | 397 | 185 | -53.40% |
| 3 | body | 3389689 | 1865 | 692 | -62.90% |
| 3 | title | 86756 | 24 | 15 | -37.50% |
| 4 | date | 123222 | 76 | 39 | -48.68% |
| 4 | groupend | 1 | 0 | 0 | n/a |
| 4 | titleTokenized | 84573 | 23 | 16 | -30.43% |
| 4 | id | 1030670 | 533 | 205 | -61.54% |
| 4 | body | 3442602 | 1985 | 717 | -63.88% |
| 4 | title | 124156 | 41 | 24 | -41.46% |
| 5 | date | 150957 | 85 | 38 | -55.29% |
| 5 | groupend | 1 | 0 | 0 | n/a |
| 5 | titleTokenized | 100910 | 37 | 16 | -56.76% |
| 5 | id | 1045550 | 595 | 209 | -64.87% |
| 5 | body | 3267300 | 2031 | 689 | -66.08% |
| 5 | title | 152061 | 58 | 32 | -44.83% |
| 6 | date | 162374 | 77 | 41 | -46.75% |
| 6 | groupend | 1 | 0 | 0 | n/a |
| 6 | titleTokenized | 107464 | 30 | 17 | -43.33% |
| 6 | id | 1057430 | 385 | 194 | -49.61% |
| 6 | body | 3140161 | 1834 | 633 | -65.49% |
| 6 | title | 163347 | 68 | 36 | -47.06% |
| 7 | date | 177647 | 116 | 45 | -61.21% |
| 7 | groupend | 1 | 0 | 0 | n/a |
| 7 | titleTokenized | 120037 | 59 | 18 | -69.49% |
| 7 | id | 1062920 | 652 | 211 | -67.64% |
| 7 | body | 3173313 | 2373 | 638 | -73.11% |
| 7 | title | 178886 | 79 | 37 | -53.16% |
| 8 | date | 199744 | 106 | 50 | -52.83% |
| 8 | groupend | 1 | 0 | 0 | n/a |
| 8 | titleTokenized | 133801 | 42 | 21 | -50.00% |
| 8 | id | 1072970 | 409 | 193 | -52.81% |
| 8 | body | 3090582 | 1817 | 641 | -64.72% |
| 8 | title | 201130 | 84 | 45 | -46.43% |
| 9 | date | 209298 | 104 | 62 | -40.38% |
| 9 | groupend | 1 | 0 | 0 | n/a |
| 9 | titleTokenized | 142697 | 50 | 25 | -50.00% |
| 9 | id | 1084580 | 571 | 225 | -60.60% |
| 9 | body | 3032046 | 1642 | 631 | -61.57% |
| 9 | title | 210958 | 92 | 46 | -50.00% |
| 10 | date | 119014 | 47 | 30 | -36.17% |
| 10 | groupend | 1 | 0 | 0 | n/a |
| 10 | titleTokenized | 93308 | 26 | 16 | -38.46% |
| 10 | id | 585150 | 208 | 111 | -46.63% |
| 10 | body | 1909153 | 918 | 380 | -58.61% |
| 10 | title | 119806 | 50 | 27 | -46.00% |
| sort took sum | | | 24538 | 9086 | -62.97% |

@gf2121 gf2121 requested a review from jpountz November 9, 2023 14:14
Contributor

@jpountz jpountz left a comment


This is a fantastic speedup. The change looks correct to me. I would not have expected caching buckets to help that much; it's great you thought of trying it out.

@gf2121 gf2121 changed the title from "Speed up BytesRefHash#sort (another approach)" to "Cache buckets to speed up BytesRefHash#sort" on Nov 10, 2023
@gf2121
Contributor Author

gf2121 commented Nov 10, 2023

Thanks for the review, @jpountz!

I'll merge this and close #12775.

@gf2121 gf2121 merged commit d458356 into apache:main Nov 10, 2023
4 checks passed
@mikemccand
Member

Did we see any bump in the nightly benchmarks? This should make the initial segment flush faster when there are many terms in an inverted field?

@gf2121
Contributor Author

gf2121 commented Nov 13, 2023

Thanks for checking in, @mikemccand!

> Did we see any bump in nightly benchmarks?

I would expect this change to be more likely to help when flushing a high-cardinality StringField, or in other places that take advantage of BytesRefHash#sort, such as DeletedTerms.

However, I did not expect this change to produce a bump in the nightly benchmarks. If you look at the CPU profile of the nightly indexing, you will find that most of the CPU is spent tokenizing and deduplicating terms before flushing. After deduplication the term count is much smaller, so the sort is not the bottleneck for indexing speed (the sort only uses ~1% of CPU).

A small difference is visible in the flame graph: reorder no longer shows up. But the share of time spent sorting is too low to make a noticeable end-to-end difference for nightly indexing.
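
As a rough back-of-the-envelope check (an Amdahl's-law style estimate, not a measurement): let p be the fraction of indexing time spent in the sort and r the fraction by which the sort time shrinks. Taking p ≈ 0.01 from the profile and r ≈ 0.63 from the M2 wikimedium10m numbers, and assuming that CPU share maps directly onto elapsed time and that the nightly workload sees a similar reduction, gives:

```latex
T_{\text{after}} = T_{\text{before}}\,(1 - p\,r), \qquad
p \approx 0.01,\quad r \approx 0.63
\;\Longrightarrow\;
1 - p\,r \approx 1 - 0.0063 = 0.9937
```

That is an end-to-end gain of roughly 0.6%, which would be hard to tell apart from run-to-run noise.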

Before this patch
image

After this patch
image
