
[Feature] Optimize SortMergeReader: use loser tree to reduce comparisons #833

Merged
merged 9 commits into apache:master on Apr 27, 2023

Conversation

@liming30 (Contributor) commented Apr 4, 2023

Purpose

[Feature] Optimize SortMergeReader: use loser tree to reduce comparisons (#741)

Tests

  • Unit Tests: the existing SortMergeReaderTestBase verifies the correctness of SortMergeReader.
  • Benchmark: I created a JMH benchmark to compare the performance of the different implementations; here is the benchmark code.

From the benchmark results, there is almost no difference between the loser tree and the min-heap when the number of RecordReaders and the data volume are small. As the number of RecordReaders and the data volume increase, the loser tree gives roughly a 10% performance improvement.

Benchmark                               (readersNum)  (recordNum)  Mode  Cnt     Score    Error  Units
MergeReaderBenchmark.group:min-heap            2         1000  avgt   10     0.172 ±  0.017  ms/op
MergeReaderBenchmark.group:loser-tree          2         1000  avgt   10     0.172 ±  0.015  ms/op
MergeReaderBenchmark.group:min-heap            2        10000  avgt   10     1.829 ±  0.247  ms/op
MergeReaderBenchmark.group:loser-tree          2        10000  avgt   10     1.633 ±  0.217  ms/op
MergeReaderBenchmark.group:min-heap            2       100000  avgt   10    17.664 ±  1.678  ms/op
MergeReaderBenchmark.group:loser-tree          2       100000  avgt   10    17.355 ±  1.238  ms/op
MergeReaderBenchmark.group:min-heap            5         1000  avgt   10     0.615 ±  0.095  ms/op
MergeReaderBenchmark.group:loser-tree          5         1000  avgt   10     0.616 ±  0.092  ms/op
MergeReaderBenchmark.group:min-heap            5        10000  avgt   10     6.207 ±  0.261  ms/op
MergeReaderBenchmark.group:loser-tree          5        10000  avgt   10     6.169 ±  0.221  ms/op
MergeReaderBenchmark.group:min-heap            5       100000  avgt   10    76.777 ±  6.675  ms/op
MergeReaderBenchmark.group:loser-tree          5       100000  avgt   10    61.951 ±  5.007  ms/op
MergeReaderBenchmark.group:min-heap           10         1000  avgt   10     1.719 ±  0.127  ms/op
MergeReaderBenchmark.group:loser-tree         10         1000  avgt   10     1.474 ±  0.109  ms/op
MergeReaderBenchmark.group:min-heap           10        10000  avgt   10    20.064 ±  2.266  ms/op
MergeReaderBenchmark.group:loser-tree         10        10000  avgt   10    17.543 ±  1.530  ms/op
MergeReaderBenchmark.group:min-heap           10       100000  avgt   10   186.422 ± 13.484  ms/op
MergeReaderBenchmark.group:loser-tree         10       100000  avgt   10   165.942 ± 11.544  ms/op
MergeReaderBenchmark.group:min-heap           20         1000  avgt   10     3.686 ±  0.084  ms/op
MergeReaderBenchmark.group:loser-tree         20         1000  avgt   10     3.333 ±  0.079  ms/op
MergeReaderBenchmark.group:min-heap           20        10000  avgt   10    40.124 ±  0.496  ms/op
MergeReaderBenchmark.group:loser-tree         20        10000  avgt   10    35.157 ±  0.239  ms/op
MergeReaderBenchmark.group:min-heap           20       100000  avgt   10   405.549 ± 10.104  ms/op
MergeReaderBenchmark.group:loser-tree         20       100000  avgt   10   361.460 ±  8.872  ms/op
MergeReaderBenchmark.group:min-heap           50         1000  avgt   10    11.097 ±  1.025  ms/op
MergeReaderBenchmark.group:loser-tree         50         1000  avgt   10     9.721 ±  0.975  ms/op
MergeReaderBenchmark.group:min-heap           50        10000  avgt   10   155.610 ± 18.069  ms/op
MergeReaderBenchmark.group:loser-tree         50        10000  avgt   10   131.680 ± 13.984  ms/op
MergeReaderBenchmark.group:min-heap           50       100000  avgt   10  1394.615 ± 97.327  ms/op
MergeReaderBenchmark.group:loser-tree         50       100000  avgt   10  1283.037 ± 84.270  ms/op

API and Format

No

Documentation

A variant of the loser tree is introduced to reduce the number of comparisons in SortMergeReader.

Unlike a traditional loser tree, since RecordReader and MergeFunction may reuse objects, a RecordReader can only be advanced after all records with the same key in the entire tree have been processed.
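To make the discussion easier to follow, here is a minimal, self-contained sketch of a classic loser-tree k-way merge. This is an illustration only, with hypothetical names (`LoserTreeSketch`, int keys, `Long.MAX_VALUE` as an exhausted-run sentinel); it deliberately omits the key-equality states that Paimon's variant adds to cope with object reuse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Minimal classic loser-tree k-way merge, for illustration only.
 * Not Paimon's LoserTree, which adds same-key states for object reuse.
 * Assumes at least one input run.
 */
public class LoserTreeSketch {
    private final int[][] runs; // sorted input runs
    private final int[] pos;    // cursor into each run
    private final int[] tree;   // tree[1..k-1]: losers of internal matches; tree[0]: overall winner
    private final int k;

    public LoserTreeSketch(int[][] runs) {
        this.runs = runs;
        this.k = runs.length;
        this.pos = new int[k];
        this.tree = new int[Math.max(k, 1)];
        Arrays.fill(tree, -1);        // -1 marks an empty internal node
        for (int i = 0; i < k; i++) {
            adjust(i);                // play each leaf up the tree once
        }
    }

    /** Key at a leaf; Long.MAX_VALUE acts as +infinity once a run is exhausted. */
    private long key(int leaf) {
        return pos[leaf] < runs[leaf].length ? runs[leaf][pos[leaf]] : Long.MAX_VALUE;
    }

    /** Replay matches on the path from {@code leaf} to the root: O(log k) comparisons. */
    private void adjust(int leaf) {
        int winner = leaf;
        for (int node = (leaf + k) / 2; node > 0; node /= 2) {
            if (tree[node] == -1) {   // empty slot: only happens during initialization
                tree[node] = winner;
                return;
            }
            if (key(tree[node]) < key(winner)) {
                int tmp = winner;     // the smaller key wins and keeps climbing;
                winner = tree[node];  // the loser stays at this node
                tree[node] = tmp;
            }
        }
        tree[0] = winner;
    }

    /** Merge all runs into one sorted list. */
    public List<Integer> merge() {
        List<Integer> out = new ArrayList<>();
        while (key(tree[0]) != Long.MAX_VALUE) {
            int w = tree[0];
            out.add(runs[w][pos[w]++]); // emit and advance the winner only,
            adjust(w);                  // then replay just the winner's path
        }
        return out;
    }
}
```

This is where the savings over a min-heap come from: after the winner advances, only its root path is replayed with one comparison per level, whereas a binary-heap sift-down needs up to two comparisons per level.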

@liming30 changed the title from [Feature] Optimize SortMergeReader: use loser tree to reduce comparisons (#741) to [Feature] Optimize SortMergeReader: use loser tree to reduce comparisons on Apr 4, 2023
@JingsongLi (Contributor):

Thanks @liming30 !
It seems to me that the loser tree mainly reduces the number of comparisons. Can we add some tests with a String key, so that each comparison is more expensive?

@liming30 (Contributor, Author) commented Apr 6, 2023

Hi @JingsongLi, I changed the key in the benchmark to a String of 128 bytes. Now the loser tree shows a significant performance improvement over the min-heap (30%–50%). As the number of RecordReaders and the amount of data increase, the improvement becomes more pronounced.

Benchmark                             (readersNum)  (recordNum)  Mode  Cnt     Score     Error  Units
MergeReaderBenchmark.group:loserTree             2         1000  avgt   10     0.495 ±   0.282  ms/op
MergeReaderBenchmark.group:minHeap               2         1000  avgt   10     0.777 ±   0.462  ms/op
MergeReaderBenchmark.group:loserTree             2        10000  avgt   10     5.461 ±   1.191  ms/op
MergeReaderBenchmark.group:minHeap               2        10000  avgt   10     8.405 ±   1.735  ms/op
MergeReaderBenchmark.group:loserTree             2       100000  avgt   10    60.266 ±   4.047  ms/op
MergeReaderBenchmark.group:minHeap               2       100000  avgt   10    91.507 ±   5.479  ms/op
MergeReaderBenchmark.group:loserTree             5         1000  avgt   10     1.752 ±   0.239  ms/op
MergeReaderBenchmark.group:minHeap               5         1000  avgt   10     2.502 ±   0.329  ms/op
MergeReaderBenchmark.group:loserTree             5        10000  avgt   10    19.769 ±   3.273  ms/op
MergeReaderBenchmark.group:minHeap               5        10000  avgt   10    28.785 ±   4.636  ms/op
MergeReaderBenchmark.group:loserTree             5       100000  avgt   10   216.575 ±  33.923  ms/op
MergeReaderBenchmark.group:minHeap               5       100000  avgt   10   378.799 ±  58.567  ms/op
MergeReaderBenchmark.group:loserTree            10         1000  avgt   10     5.360 ±   0.293  ms/op
MergeReaderBenchmark.group:minHeap              10         1000  avgt   10     8.919 ±   0.470  ms/op
MergeReaderBenchmark.group:loserTree            10        10000  avgt   10    54.107 ±   5.948  ms/op
MergeReaderBenchmark.group:minHeap              10        10000  avgt   10    87.113 ±   9.001  ms/op
MergeReaderBenchmark.group:loserTree            10       100000  avgt   10   495.890 ±  20.231  ms/op
MergeReaderBenchmark.group:minHeap              10       100000  avgt   10   792.621 ±  34.431  ms/op
MergeReaderBenchmark.group:loserTree            20         1000  avgt   10    18.246 ±   9.235  ms/op
MergeReaderBenchmark.group:minHeap              20         1000  avgt   10    29.491 ±  15.529  ms/op
MergeReaderBenchmark.group:loserTree            20        10000  avgt   10   252.389 ±  63.124  ms/op
MergeReaderBenchmark.group:minHeap              20        10000  avgt   10   410.268 ± 106.181  ms/op
MergeReaderBenchmark.group:loserTree            20       100000  avgt   10  1797.648 ± 598.206  ms/op
MergeReaderBenchmark.group:minHeap              20       100000  avgt   10  2802.025 ± 742.304  ms/op
MergeReaderBenchmark.group:loserTree            50         1000  avgt   10    33.583 ±   1.122  ms/op
MergeReaderBenchmark.group:minHeap              50         1000  avgt   10    64.716 ±   2.202  ms/op
MergeReaderBenchmark.group:loserTree            50        10000  avgt   10   351.078 ±  36.385  ms/op
MergeReaderBenchmark.group:minHeap              50        10000  avgt   10   707.821 ±  77.199  ms/op

@tsreaper (Contributor) left a comment:

Hi @liming30 !

Thanks for the contribution. It's a clever implementation with state transitions that utilizes several properties of the loser tree. I've gone through the algorithm several times and can't find an error.

However, to help maintain the code in the future, I'd like to see detailed Javadoc for this algorithm. The documentation should include the following:

  1. What each state represents. The most confusing part to me is that WINNER_WITH_SAME_KEY indicates that the leaf wins with the same key as the previous global winner, whereas LOSER_WITH_SAME_KEY only indicates that the leaf loses its last local comparison with the same key.
  2. The meaning of firstSameKeyIndex, and why we can iterate through all KeyValues with the same key by following this pointer. At first I thought firstSameKeyIndex formed a linked list, but it does not.
  3. A detailed explanation of each step of your implementation. It seems to me that it consists of three steps: initializing; moving all KeyValues with the same smallest key to the top through several adjust() calls; and then moving them to the top through several adjust() calls again to advance their readers.
  4. A proof of the implementation. The most important parts are why adjust() works, and why we only need to check several, but not all, combinations of states in the three adjustWithXXLoserKey methods.

Of course, if you have any reference links for the implementation or proof, you can add them to the Javadoc to help with the explanation.
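For readers following along, the states named in point 1 could be sketched as an enum like the one below. The names WINNER_WITH_SAME_KEY and LOSER_WITH_SAME_KEY and their descriptions come from the review above; the two plain states and all Javadoc wording are assumptions for illustration, not Paimon's actual enum.

```java
/**
 * Hypothetical sketch of the per-leaf states discussed in the review.
 * Only the *_WITH_SAME_KEY names and meanings are taken from the review text;
 * the rest is assumed for illustration.
 */
public enum LeafState {
    /** The leaf is the current global winner (assumed baseline state). */
    WINNER,
    /** The leaf wins with the same key as the previous global winner. */
    WINNER_WITH_SAME_KEY,
    /** The leaf lost its last local comparison (assumed baseline state). */
    LOSER,
    /** The leaf lost its last local comparison, but with an equal key. */
    LOSER_WITH_SAME_KEY
}
```

The asymmetry the reviewer points out is visible here: the WINNER_WITH_SAME_KEY state is defined relative to the previous global winner, while LOSER_WITH_SAME_KEY is defined relative only to the leaf's last local match.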

@JingsongLi (Contributor):

> (quoting @liming30's String-key benchmark results above)

This result is exciting!

@liming30 (Contributor, Author):

@tsreaper Thanks for your help reviewing the code. I created a Google doc describing the design and implementation of this code; please help review it. If there is no problem with the content, I can attach the core idea of the document and a link to it in the code comments. Thanks.

@JingsongLi (Contributor) left a comment:

Can you add some unit tests for LoserTree?

@tsreaper (Contributor):

> (quoting @liming30's Google-doc comment above)

Thanks for your detailed document. I've left some comments in the document, please resolve them.

@liming30 (Contributor, Author):

> Thanks for your detailed document. I've left some comments in the document, please resolve them.

@tsreaper @JingsongLi Thanks for your review; I have replied to the related comments in the doc.

At the same time, in the new commit I modified part of the implementation to perform the state transition from the perspective of the local winner, making the code more concise and easier to understand.

@liming30 (Contributor, Author):

@tsreaper @JingsongLi Hi, I have addressed your comments. Are there any other problems?

@tsreaper (Contributor) left a comment:

Please also add an option in CoreOptions for choosing between the loser tree and the heap. We can use the loser tree by default, and users then have the ability to fall back to the heap if they encounter any problem.

@liming30 force-pushed the optimize-reader branch 3 times, most recently from 1efef9e to b2a1b84, on April 24, 2023 16:37
@liming30 (Contributor, Author):

> Please also add an option in CoreOptions for using loser tree or heap. We can use loser tree by default and users also have the ability to fall back to heap if they encounter any problem.

@tsreaper I added the sort-engine configuration, which uses loser-tree by default. The original min-heap implementation is retained in the code and renamed to SortMergeReaderWithMinHeap.
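For readers who want the fallback described above, usage might look like the fragment below. This is a hedged sketch: the option name `sort-engine` and the values `loser-tree`/`min-heap` follow this PR's description, but the exact syntax should be verified against Paimon's CoreOptions documentation.

```sql
-- Hypothetical example: fall back to the original min-heap implementation
-- for one table (the default per this PR is 'loser-tree').
CREATE TABLE t (
    k STRING,
    v BIGINT,
    PRIMARY KEY (k) NOT ENFORCED
) WITH (
    'sort-engine' = 'min-heap'
);
```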

@liming30 requested a review from tsreaper on April 25, 2023 07:13
@liming30 requested a review from tsreaper on April 26, 2023 07:26
@tsreaper merged commit 5edf3a9 into apache:master on Apr 27, 2023
7 checks passed
@liming30 deleted the optimize-reader branch on April 27, 2023 11:57
ZhangChaoming pushed a commit to ZhangChaoming/incubator-paimon that referenced this pull request May 5, 2023