@FangYongs
Contributor

@FangYongs FangYongs commented Oct 10, 2022

Currently, SortMergeReader compares and sorts its readers after reading one batch from each of them, to ensure that records are emitted in the correct order. The readers are created from a list of SortedRuns, and their key ranges may be disjoint. We can compare the minKey and maxKey of each file in the SortedRun list and divide the readers into multiple regions. When a region contains only one reader, that reader can read data directly, without any comparing or sorting.

So the main changes are as follows:

  1. Add a SortedRegionDataRecordReader class, which creates a reader carrying the minKey and maxKey of each file in a SortedRun.
  2. Add a RecordReaderSubRegion class, which holds a list of SortedRegionDataRecordReaders and is created from one SortedRun.
  3. Add RecordReaderRegionManager to divide a list of RecordReaderSubRegions into multiple RecordReaderRegions. Each RecordReaderRegion manages its own RecordReaderSubRegion list, and the key ranges of different RecordReaderRegions are disjoint.
  4. Create a SortMergeReader for each RecordReaderRegion, which avoids comparisons across different RecordReaderRegions. If a RecordReaderRegion has only one reader, that reader is used directly.
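
The region division described above can be illustrated with a minimal sketch. This is not the code from this PR: `KeyRange`, `RegionSketch`, `divide`, and `needsMerge` are hypothetical names, keys are plain ints, and the files of all SortedRuns are flattened into one list instead of being grouped into sub-regions first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one file of a SortedRun: only its key range matters here.
final class KeyRange {
    final int minKey;
    final int maxKey;

    KeyRange(int minKey, int maxKey) {
        this.minKey = minKey;
        this.maxKey = maxKey;
    }

    @Override
    public String toString() {
        return "[" + minKey + ", " + maxKey + "]";
    }
}

final class RegionSketch {

    // Groups key ranges into regions whose overall key intervals are pairwise disjoint.
    static List<List<KeyRange>> divide(List<KeyRange> ranges) {
        List<KeyRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((KeyRange r) -> r.minKey));

        List<List<KeyRange>> regions = new ArrayList<>();
        List<KeyRange> current = new ArrayList<>();
        int regionMax = 0; // only meaningful while `current` is non-empty

        for (KeyRange r : sorted) {
            // A range starting after everything seen so far opens a new region.
            if (!current.isEmpty() && r.minKey > regionMax) {
                regions.add(current);
                current = new ArrayList<>();
            }
            regionMax = current.isEmpty() ? r.maxKey : Math.max(regionMax, r.maxKey);
            current.add(r);
        }
        if (!current.isEmpty()) {
            regions.add(current);
        }
        return regions;
    }

    // A region with a single reader needs no compare-and-sort step.
    static boolean needsMerge(List<KeyRange> region) {
        return region.size() > 1;
    }

    public static void main(String[] args) {
        List<KeyRange> files = Arrays.asList(
                new KeyRange(1, 10), new KeyRange(5, 20),
                new KeyRange(30, 40),
                new KeyRange(50, 60), new KeyRange(55, 58));
        for (List<KeyRange> region : divide(files)) {
            System.out.println(region + (needsMerge(region) ? " -> sort-merge" : " -> read directly"));
        }
    }
}
```

A range whose minKey equals the current region's maximum key is kept in the same region, since the same key may appear in both files and still has to be merged.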

Test cases RecordReaderRegionTest and RecordReaderRegionManagerTest are added to test the new classes. SortMergeReader and related classes are tested in MergeTreeTest.

@FangYongs
Contributor Author

FangYongs commented Oct 10, 2022

Hi @JingsongLi, I tried to fix FLINK-27958; the main changes are described above. Could you help review the implementation and code when you're free? Thanks.

@JingsongLi
Contributor

CC: @tsreaper

@JingsongLi
Contributor

Hi @zjureel, can you run some benchmarks to verify the improvement?

@FangYongs
Contributor Author

FangYongs commented Oct 13, 2022

> Hi @zjureel, can you run some benchmarks to verify the improvement?

Hi @JingsongLi, that's a good idea and I like it. I see there is a flink-table-store-benchmark project in flink-table-store that sets up a Flink cluster, runs a query in the cluster, and collects some metrics. I propose adding a new micro-benchmark project, flink-table-store-micro-benchmarks, to flink-table-store, and then adding micro benchmarks there for the core operations, such as read, write, and compaction throughput. We can create a view for the micro benchmarks. The flink-table-store-micro-benchmarks project would play the same role for flink-table-store that flink-benchmarks plays for Flink. What do you think? Hope to hear from you, thanks.

@JingsongLi
Contributor

+1 to flink-table-store-micro-benchmarks

@FangYongs FangYongs force-pushed the FLINK_27958_batch_maxKey_in_SortMergeReader branch from 3a03dea to bd050f3 Compare October 31, 2022 07:37
@FangYongs FangYongs closed this Feb 11, 2023