GH-29887: [C++] Implement dictionary array sorting #35280
Force-pushed from 94181ab to e347db5.
A few questions but looks good overall
Force-pushed from c81b0d4 to bfd81c0.
I did some refactoring/reorganizing of the benchmarks so the new ones should hopefully make more sense. The original intent should still be intact (although I changed some of the string length parameters to be consistent with the existing non-dictionary string benchmarks).
Very neat, thanks. A couple comments and questions below.
Just for the record, it's a pity that the all-or-mostly-null case is a bit slower:
This can be left to another issue or PR, though. Overall the improvement is very nice, especially for strings.
Force-pushed from 502fe15 to 7d1c591.
After some ad-hoc testing, the benchmarks' array lengths/sizes should now be essentially the same between the dict and non-dict versions. For strings, the dict versions slightly overshoot the others (in terms of bytes) due to some null overlap between the index/value arrays. I also added some special-casing for the 100% null benchmarks - primarily to ensure that the input
Thanks for the update @benibus. Looks mostly good to me, just one more suggestion. Also, can you rebase?
Draft sort indices on array dictionary
Update sort indices
Update sorter
Address feedback
Address feedback
Add tests and benchmarks
Some nits re:benchmarking
Some nits
Use the faster algorithm (rank-then-take-then-sort)
Remove unused code
More TODOs
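For readers unfamiliar with the "rank-then-take-then-sort" idea named in the commit list, here is a minimal sketch of one reading of it: rank the dictionary values once, map each index to its rank, then sort positions by those integer ranks. It uses only the standard library. `DictArray` and `SortIndicesDict` are hypothetical names for illustration, not Arrow types or the code in this PR, and placing nulls last is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded array (not an Arrow type):
// logical value i is dictionary[*indices[i]]; an empty optional is a null.
struct DictArray {
  std::vector<std::string> dictionary;
  std::vector<std::optional<int32_t>> indices;
};

// Returns positions of `arr` in ascending logical order, nulls last (assumed).
std::vector<int64_t> SortIndicesDict(const DictArray& arr) {
  const size_t dict_size = arr.dictionary.size();

  // 1. Rank: sort the dictionary entries once and give each a dense rank,
  //    so equal values share the same rank.
  std::vector<int32_t> order(dict_size);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](int32_t a, int32_t b) {
    return arr.dictionary[a] < arr.dictionary[b];
  });
  std::vector<int32_t> rank(dict_size);
  int32_t current = 0;
  for (size_t i = 0; i < order.size(); ++i) {
    if (i > 0 && arr.dictionary[order[i]] != arr.dictionary[order[i - 1]]) {
      ++current;
    }
    rank[order[i]] = current;
  }

  // 2. Take: replace every index by the rank of the value it points to.
  //    Nulls get a sentinel rank larger than any real rank.
  const int32_t null_rank = static_cast<int32_t>(dict_size);
  std::vector<int32_t> keys(arr.indices.size());
  for (size_t i = 0; i < arr.indices.size(); ++i) {
    keys[i] = arr.indices[i].has_value() ? rank[*arr.indices[i]] : null_rank;
  }

  // 3. Sort: stable-sort the positions by their integer keys. Only integers
  //    are compared here, never the (possibly long) string values.
  std::vector<int64_t> positions(keys.size());
  std::iota(positions.begin(), positions.end(), 0);
  std::stable_sort(positions.begin(), positions.end(),
                   [&](int64_t a, int64_t b) { return keys[a] < keys[b]; });
  return positions;
}
```

The point of this arrangement is that string comparisons happen only while sorting the dictionary, which is usually much smaller than the index array. That would be consistent with the larger gains reported for strings in this thread.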
Alters ArraySortFunc to allow for error propagation and passing through the caller's ExecContext to use in additional kernels
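To illustrate the shape of that change, the sketch below shows a sort callback that returns a status and receives the caller's context, so that failures can be propagated and nested operations can reuse the same context. Every name here (`Status`, `ExecContext`, `ArraySortFunc`, `SortInts`) is a simplified, hypothetical stand-in written for this example; the real Arrow definitions differ and are not reproduced.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT the Arrow classes.
struct Status {
  bool ok_flag = true;
  std::string message;
  static Status OK() { return Status{}; }
  static Status Invalid(std::string msg) {
    Status st;
    st.ok_flag = false;
    st.message = std::move(msg);
    return st;
  }
  bool ok() const { return ok_flag; }
};

struct ExecContext {
  int kernels_invoked = 0;  // illustrative bookkeeping only
};

// Assumed shape of a sort callback after the change: it reports failure via
// the returned Status and is handed the caller's context.
using ArraySortFunc = std::function<Status(
    const std::vector<int>& values, std::vector<int64_t>* indices_out,
    ExecContext* ctx)>;

// Example implementation: an argsort that can fail and that uses the context.
Status SortInts(const std::vector<int>& values,
                std::vector<int64_t>* indices_out, ExecContext* ctx) {
  if (ctx == nullptr || indices_out == nullptr) {
    return Status::Invalid("context and output are required");
  }
  indices_out->resize(values.size());
  std::iota(indices_out->begin(), indices_out->end(), 0);
  std::stable_sort(indices_out->begin(), indices_out->end(),
                   [&](int64_t a, int64_t b) { return values[a] < values[b]; });
  ++ctx->kernels_invoked;  // a nested kernel call would reuse `ctx` here
  return Status::OK();
}
```

A caller holding an `ArraySortFunc` can then check the returned status and return it upward instead of aborting or ignoring the failure.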
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Force-pushed from 7d1c591 to b628024.
I will note that the string benchmark numbers are lower with the latest benchmarking changes (which reduce the actual string array length), but that reflects the non-trivial fixed costs in the sorting routine:
Benchmark runs are scheduled for baseline = f3500f6 and contender = 6ceb12f. 6ceb12f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions.
Rationale for this change
Sorting for DictionaryArrays is not currently supported.
What changes are included in this PR?
- Dictionary support in the array_sort_indices kernel
- Changes the ArraySortFunc definition to return an error status and accept the caller's ExecContext as an argument
Are these changes tested?
Yes
Are there any user-facing changes?
Yes
Notes
This picks up where #13334 left off. Those commits have been squashed and included in this PR.