-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14314: [C++] Sorting dictionary array not implemented #13334
Conversation
c4b8128
to
69ef9cf
Compare
@pitrou This is a first draft. I got a couple of questions:
|
We should support all indices types, that is all integer types.
I'm not sure I understand the question. Which type would you cast the values to? (and/or can give you a concrete example?) |
Sure, for example, if we have a dictionary d, we'll have an indices array, a values array, and a dictionary type with indices and values types.
And, I want to know whether it's possible to cast indices into an array of indices_type or not. The main reason I want to cast the indices array is to use And another question (OOT), can we build and array from a dictionary array? Something like this:
|
Ah, yes, of course. Just use |
You can, for example by using the take function ( |
At the end, to avoid doing cast, I'm using stoull to get the index. Should I add test with other types of values on dictionary? |
You certainly don't want to go through |
Got it, so which is the best way of getting the index? |
You would probably have to cast the indices to the concrete array type :-) |
@pitrou I cast index array to concrete type :) |
auto indices_array = | ||
CallFunction("array_sort_indices", {values}, &options).ValueOrDie().make_array(); | ||
|
||
const auto& indices_values = checked_cast<const UInt64Array&>(*indices_array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail if index_type
is not UInt64Type
. You should instead do the appropriate cast in SortInternal
above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should work because it is sorting values array, and as far as I know the output array is of UInt64 type. Probably I should change indices_array
to values_array
to avoid confusion.
@ArianaVillegas I'm sorry for the delay (was on a trip then got sick :-(). I posted a couple comments; I don't know if they make sense to you, feel free to ask for more guidance if needed! |
d805e07
to
b4b0bd4
Compare
@pitrou do you know why the output of
And, if NullPartitionResult is needed, why it is not assigned on ArraySortIndices? |
@westonpace I'll be great if you can review this PR too. |
I'll take a look next week (but feel free to merge in the meantime) |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not an expert in this code but here are a few thoughts.
18ced04
to
6cf1b28
Compare
These Ruby failures may be related but I can't really understand them. @kou do you mind taking a look? |
Sure. |
6e6b909
to
a452b76
Compare
I keep getting this error. It looks like the AMD64 MacOS tests are failing, any ideas? @pitrou @westonpace @kou
|
@ArianaVillegas Yes, this is an unrelated issue that we are hoping to fix. |
@ArianaVillegas Ok, I took a lot and then pushed an update. Things I added:
Here are the benchmark results here:
We can see that there is a speed increase on a dict of strings (compared to a plain strings array), but not on a dict of integers. This is already nice, I'll try to see if there's a way to be better still. |
Hmm, I think making it better requires a Rank variant that preserves nulls in the output instead of giving them a numeric rank. Shouldn't be very hard, but probably for a separate JIRA and PR still. Edit: this probably can be done internally instead, as it's really simple. |
Ok, so I implemented the algorithm which was originally envisioned in https://issues.apache.org/jira/browse/ARROW-14314?focusedCommentId=17445377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17445377 The performance numbers are indeed quite better now:
We see that sorting a dictionary array is now much faster (potentially 10x faster) than sorting an equivalent dense array, if the dictionary size is small (which is a very common case). |
cc @cyb70289 |
I'm making this a draft because the FIXMEs will require a couple internal API changes to address. |
### Rationale for this change Sorting for `DictionaryArray`s is not currently supported. ### What changes are included in this PR? - Adds support for dictionaries in the `array_sort_indices` kernel - Adds tests and benchmarks - Alters the internal `ArraySortFunc` definition to return an error status and accept the caller's `ExecContext` as an argument ### Are these changes tested? Yes ### Are there any user-facing changes? Yes ### Notes This picks up where #13334 left off. Those commits have been squashed and included in this PR. * Closes: #29887 Lead-authored-by: benibus <bpharks@gmx.com> Co-authored-by: Ariana Villegas <ariana.villegas@utec.edu.pe> Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
This has been superceded by #35280, closing. |
Sort dictionary array
Implementation: