[C++] Sorting dictionary array not implemented #29887
Comments
Antoine Pitrou / @pitrou:
|
Antoine Pitrou / @pitrou: |
Ariana Villegas / @ArianaVillegas:
|
Antoine Pitrou / @pitrou:
|
Antoine Pitrou / @pitrou: values: ['c', 'a', 'b', 'b']
indices: [0, 1, 3, 2, 3, 0]
Should generate the following sort indices:
sort_indices: [1, 2, 3, 4, 0, 5] (note "2, 3, 4", not "3, 2, 4") |
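The expected output above can be reproduced with a small pure-Python sketch. This is only an illustration of the stability requirement when the dictionary holds duplicate values, not the Arrow C++ implementation; the function name `dictionary_sort_indices` is hypothetical. It assumes equal dictionary values should share one rank, so that ties are broken by position in the input.

```python
def dictionary_sort_indices(values, indices):
    """Return stable sort indices for a dictionary-encoded sequence.

    Hypothetical helper for illustration only (no nulls handled here).
    """
    # Dense-rank the dictionary values: duplicates such as the two 'b'
    # entries get the same rank, so they compare equal during the sort.
    distinct = sorted(set(values))
    rank_of_value = {value: rank for rank, value in enumerate(distinct)}

    # Translate every dictionary index into the rank of the value it refers to.
    ranks = [rank_of_value[values[i]] for i in indices]

    # sorted() is stable, so positions with equal ranks keep their input order.
    return sorted(range(len(indices)), key=ranks.__getitem__)


values = ['c', 'a', 'b', 'b']
indices = [0, 1, 3, 2, 3, 0]
sort_indices = dictionary_sort_indices(values, indices)
print(sort_indices)
```

Sorting on the raw indices of a dictionary sorted with duplicates left distinct would instead place position 3 before position 2, which is the "3, 2, 4" ordering the comment warns about.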
Ariana Villegas / @ArianaVillegas: |
Antoine Pitrou / @pitrou: |
Ariana Villegas / @ArianaVillegas: @pitrou In that case, I think we can do something like this:
|
Antoine Pitrou / @pitrou: The remaining problem is what happens with the following dictionary:
values: ['a', null, 'b', 'c']
indices: [0, null, 1, 0, 2, 3]
It should also get |
Ariana Villegas / @ArianaVillegas:
But to avoid that, we can move the nulls from the values into the indices, so the problem will look like this:
values: ['a', null, 'b', 'c']
indices: [0, null, null, 0, 2, 3]
@pitrou, btw, why do we allow nulls in values? Wouldn't it be easier to have them only in indices? |
We can indeed. Another possibility is to partition nulls away first, then work on non-null values (partitioning is how the sorting implementation already deals with null values for other data types). That might be a bit faster as well.
Probably for compatibility with various data sources. |
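The partitioning idea can be sketched in pure Python as follows. This is a rough illustration, not the kernel's actual code: the function name `sort_indices_nulls_last` is hypothetical, `None` stands in for null, and placing nulls at the end is an assumption made for the example. A position counts as null if either its index is null or it points at a null dictionary value, so both encodings from the comments above give the same result.

```python
def sort_indices_nulls_last(values, indices):
    """Partition logical nulls away, then stably sort the remaining positions.

    Hypothetical helper for illustration only.
    """
    def is_null(pos):
        idx = indices[pos]
        # Null index, or a valid index referring to a null dictionary value.
        return idx is None or values[idx] is None

    positions = range(len(indices))
    non_null = [p for p in positions if not is_null(p)]
    nulls = [p for p in positions if is_null(p)]

    # list.sort() is stable, so equal values keep their input order.
    non_null.sort(key=lambda p: values[indices[p]])

    # Assumption: nulls are placed after all non-null values.
    return non_null + nulls


values = ['a', None, 'b', 'c']
with_null_in_values = sort_indices_nulls_last(values, [0, None, 1, 0, 2, 3])
with_null_in_indices = sort_indices_nulls_last(values, [0, None, None, 0, 2, 3])
print(with_null_in_values, with_null_in_indices)
```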
Ariana Villegas / @ArianaVillegas:
|
Antoine Pitrou / @pitrou: |
Ariana Villegas / @ArianaVillegas: |
Christian Lorentzen: values/dict = ["c", "a", "b"] Will this be sorted in ascending order as My hope is for A) such that the order of the dict in the DictionaryArray has a meaning. |
### Rationale for this change

Sorting for `DictionaryArray`s is not currently supported.

### What changes are included in this PR?

- Adds support for dictionaries in the `array_sort_indices` kernel
- Adds tests and benchmarks
- Alters the internal `ArraySortFunc` definition to return an error status and accept the caller's `ExecContext` as an argument

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

### Notes

This picks up where #13334 left off. Those commits have been squashed and included in this PR.

* Closes: #29887

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ariana Villegas <ariana.villegas@utec.edu.pe>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
This did not implement sorting on dictionary chunked arrays, or in tables. Is that planned, or should I open a feature request? |
Not necessarily "planned" but it probably should be for consistency's sake. Opening a feature request sounds good. |
With pyarrow version 14:
import pyarrow as pa
dict_array = pa.DictionaryArray.from_arrays([2, 0, 2, 1, 0], ["c", "a", "b"])
dict_array.dictionary_decode() # ["b", "c", "b", "a", "c"]
dict_array.sort().dictionary_decode() # ["a", "b", "b", "c", "c"]
So it is just the alphanumeric sort order of the values, not the order given by the dictionary. I find that a bit unfortunate and can't find any discussion of it. |
This is what most people would expect. What did you expect here? |
As the dictionary is specified as
With pandas ordered categoricals:
import pandas as pd
s = pd.Series(pd.Categorical(["b", "c", "b", "a", "c"], categories=["c", "a", "b"], ordered=True))
s # b c b a c
s.sort_values() # c c a b b
Use case: Being able to specify the sort order can make plots or reported tables much easier to create. |
In this case, you could probably create a new column with just the dictionary indices and sort on that column. |
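A minimal pure-Python sketch of that workaround, using the same data as the pyarrow example above: sorting positions by their dictionary index, rather than by the decoded value, yields the dictionary's own order. The variable names are illustrative only. In pyarrow the analogous steps would presumably go through the array's `indices` and a sort on those, but that is not shown or tested here.

```python
dictionary = ["c", "a", "b"]
indices = [2, 0, 2, 1, 0]

# Sort positions by dictionary index instead of by decoded value.
# sorted() is stable, so equal indices keep their input order.
order = sorted(range(len(indices)), key=indices.__getitem__)

# Decode in the new order: the result follows the dictionary order c < a < b.
sorted_by_dictionary_order = [dictionary[indices[i]] for i in order]
print(order, sorted_by_dictionary_order)
```

This gives the same ordering as the pandas ordered categorical in the earlier comment.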
cc @jorisvandenbossche FYI ^^ |
That's what I'm actually doing in a package of mine (with polars as df). Being able to specify the order as a user would just be much more convenient and less code. That's all I'm saying. |
From R, taking the stock mtcars dataset and giving it a dictionary type column:
Reporter: Neal Richardson / @nealrichardson
Assignee: Ariana Villegas / @ArianaVillegas
PRs and other links:
Note: This issue was originally created as ARROW-14314. Please see the migration documentation for further details.