ARROW-10796: [C++] Implement optimized RecordBatch sorting #8890
Closed
pitrou wants to merge 2 commits into apache:master
b87ef64 to 661487a
pitrou (Member, Author):
cc @kou
661487a to 07611a8
kou (Member) approved these changes on Dec 11, 2020 and left a comment:
+1
Great!
left-to-right radix sort
Wow!
    [&](uint64_t index) { return !array.IsNull(index); });
    // Sort all nulls by second and following sort keys
    // TODO: could we instead run an independent sort from the second key on
    // this slice?
kou (Member):
Like ConcreteRecordBatchColumnSorter's next_column_?
It would work.
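The idea discussed here (partition out the nulls, then sort that slice independently starting from the next key) can be sketched as follows. This is only an illustration, not the code from this PR: `Column`, `SortRange` and `SortIndices` are hypothetical names, `std::optional` stands in for Arrow's validity bitmap, and ascending order with nulls last is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-in for a nullable Arrow column (not an Arrow type).
using Column = std::vector<std::optional<int64_t>>;

// Sort indices[begin, end) by columns[key], then columns[key + 1], and so on.
// Assumed ordering: ascending, nulls last.
void SortRange(const std::vector<Column>& columns, size_t key,
               std::vector<uint64_t>& indices, size_t begin, size_t end) {
  if (key >= columns.size() || end - begin < 2) return;
  const Column& col = columns[key];
  auto first = indices.begin() + static_cast<std::ptrdiff_t>(begin);
  auto last = indices.begin() + static_cast<std::ptrdiff_t>(end);

  // Move non-null rows to the front, keeping their relative order.
  auto nulls_begin = std::stable_partition(
      first, last, [&](uint64_t i) { return col[i].has_value(); });

  // Sort the non-null rows on the current key.
  std::stable_sort(first, nulls_begin, [&](uint64_t l, uint64_t r) {
    return *col[l] < *col[r];
  });

  // Each run of equal values is sorted independently on the next key.
  auto run = first;
  while (run != nulls_begin) {
    auto run_end = std::find_if(run, nulls_begin, [&](uint64_t i) {
      return *col[i] != *col[*run];
    });
    SortRange(columns, key + 1, indices,
              static_cast<size_t>(run - indices.begin()),
              static_cast<size_t>(run_end - indices.begin()));
    run = run_end;
  }

  // The null slice is one more run: sort it independently from the next key.
  SortRange(columns, key + 1, indices,
            static_cast<size_t>(nulls_begin - indices.begin()), end);
}

// Return the row indices of `columns` in sorted order.
std::vector<uint64_t> SortIndices(const std::vector<Column>& columns) {
  assert(!columns.empty());
  std::vector<uint64_t> indices(columns[0].size());
  std::iota(indices.begin(), indices.end(), uint64_t{0});
  SortRange(columns, 0, indices, 0, indices.size());
  return indices;
}
```

Treating the null slice as just another run of "equal" values means it needs no dedicated multi-key comparator; it reuses the same per-column recursion as every other run.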
    private:
    // TODO instead of resolving chunks for each column independently, we could
    // split the table into RecordBatches and pay the cost of chunked indexing
    // at the first column only.
kou (Member):
Can we always do it?
My understanding is that each chunked array in a table can have a different number of chunks. For example, this table is valid:
a: [[0, 1], [2, 3, 4]]
b: [[10], [11, 12], [13], [14]]
I'm not sure we can split the table into record batches efficiently.
pitrou (Member, Author):
TableBatchReader can be used for that.
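To make the concern about mismatched chunk layouts concrete, here is a rough sketch of how aligned batch boundaries could be derived from per-column chunk lengths. It does not use Arrow and makes no claim about how TableBatchReader is implemented; `AlignedBatchLengths` is a hypothetical helper, and it assumes every column has the same total length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Given the chunk lengths of each column, return the lengths of consecutive
// row ranges in which every column stays inside a single chunk.
// Assumption: all columns have the same total number of rows.
std::vector<int64_t> AlignedBatchLengths(
    const std::vector<std::vector<int64_t>>& chunk_lengths) {
  const size_t ncols = chunk_lengths.size();
  std::vector<size_t> next(ncols, 0);        // next chunk to load per column
  std::vector<int64_t> remaining(ncols, 0);  // rows left in the current chunk
  std::vector<int64_t> out;
  if (ncols == 0) return out;

  for (;;) {
    int64_t step = -1;
    for (size_t c = 0; c < ncols; ++c) {
      // Load the next non-empty chunk when the current one is used up.
      while (remaining[c] == 0) {
        if (next[c] == chunk_lengths[c].size()) return out;  // column exhausted
        remaining[c] = chunk_lengths[c][next[c]++];
      }
      if (step < 0 || remaining[c] < step) step = remaining[c];
    }
    // The batch ends at the nearest chunk boundary of any column.
    out.push_back(step);
    for (auto& r : remaining) r -= step;
  }
}
```

For the table above, the chunk lengths are `{2, 3}` for `a` and `{1, 2, 1, 1}` for `b`; the union of chunk boundaries is 1, 2, 3, 4, 5, so this sketch yields five one-row batches. That is the worst case kou is pointing at: aligned batches are always possible, but they can be very small when the chunk layouts differ.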
Add two RecordBatch sorting implementations:
Both implementations benefit from direct indexed access into the contiguous RecordBatch columns (as opposed to table sorting, which must index into the chunks).
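The cost difference mentioned here can be illustrated with a small sketch. With a contiguous column, a logical row index is used directly; with a chunked column, each access first has to resolve the index to a chunk and an offset. The names below (`ChunkLocation`, `ResolveChunk`, `GetChunked`) are hypothetical and plain vectors stand in for Arrow arrays; Arrow's own chunk resolution may work differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ChunkLocation {
  size_t chunk;    // which chunk holds the row
  int64_t offset;  // position of the row inside that chunk
};

// offsets[i] is the logical index of the first row of chunk i, and
// offsets.back() is the total number of rows.
ChunkLocation ResolveChunk(const std::vector<int64_t>& offsets, int64_t index) {
  assert(index >= 0 && index < offsets.back());
  // Binary search: first offset strictly greater than index, minus one.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), index);
  size_t chunk = static_cast<size_t>(it - offsets.begin()) - 1;
  return {chunk, index - offsets[chunk]};
}

// Chunked access: one binary search per lookup, then an indexed read.
int64_t GetChunked(const std::vector<std::vector<int64_t>>& chunks,
                   const std::vector<int64_t>& offsets, int64_t index) {
  ChunkLocation loc = ResolveChunk(offsets, index);
  return chunks[loc.chunk][static_cast<size_t>(loc.offset)];
}

// Contiguous access: the logical index is the physical index.
int64_t GetContiguous(const std::vector<int64_t>& values, int64_t index) {
  return values[static_cast<size_t>(index)];
}
```

A sort performs many such lookups per comparison, so skipping the resolution step on every access is where the contiguous case gains.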
Add some RecordBatch-sorting benchmarks.
Also add and improve tests, and fix a bug related to the sorting of NaNs and nulls.
Benchmarks (changes less than 10% in absolute value not shown):