Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Sorting dictionary array not implemented #29887

Closed
asfimport opened this issue Oct 13, 2021 · 23 comments · Fixed by #35280
Closed

[C++] Sorting dictionary array not implemented #29887

asfimport opened this issue Oct 13, 2021 · 23 comments · Fixed by #35280

Comments

@asfimport
Copy link

From R, taking the stock mtcars dataset and giving it a dictionary type column:

mtcars %>% 
  mutate(cyl = as.factor(cyl)) %>% 
  Table$create() %>% 
  arrange(cyl) %>% 
  collect()

Error: Type error: Sorting not supported for type dictionary<values=string, indices=int8, ordered=0>
../src/arrow/compute/kernels/vector_array_sort.cc:427  VisitTypeInline(type, this)
../src/arrow/compute/kernels/vector_sort.cc:148  GetArraySorter(*physical_type_)
../src/arrow/compute/kernels/vector_sort.cc:1206  sorter.Sort()
../src/arrow/compute/api_vector.cc:259  CallFunction("sort_indices", {datum}, &options, ctx)
../src/arrow/compute/exec/order_by_impl.cc:53  SortIndices(table, options_, ctx_)
../src/arrow/compute/exec/sink_node.cc:292  impl_->DoFinish()
../src/arrow/compute/exec/exec_plan.cc:297  iterator_.Next()
../src/arrow/record_batch.cc:318  ReadNext(&batch)
../src/arrow/record_batch.cc:329  ReadAll(&batches)

Reporter: Neal Richardson / @nealrichardson
Assignee: Ariana Villegas / @ArianaVillegas

PRs and other links:

Note: This issue was originally created as ARROW-14314. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
It should be possible to do this efficiently:

  • generate an array of ordinals from the dictionary values: for example ["c", "a", "b"] would generate [2, 0, 1], while ["c", "a", "b", "b"] would generate [3, 0, 1, 1] (notice that equal values should give the same ordinal, for stability)

  • generate a dense array by indexing the computed ordinals with the generated indices

  • sort the dense array

    One complication is to deal with nulls inside the dictionary values, so that stability is enforced for them as well.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
cc @ArianaVillegas if you like playing with algorithms.

@asfimport
Copy link
Author

Ariana Villegas / @ArianaVillegas:
@pitrou  I have a couple of questions about this issue:

  • Currently, how are nulls handled in a dictionary array?

  • Can we do something like this?

    • Given the following dictionary:
      original_array: ['a', 'c', 'b', 'b', 'a']
      
      values: ['a', 'c', 'b']
      indices: [0, 1, 2, 2, 0]
    • Get sort_indices of values, replace indices and values
      values_sorted_idx: [0, 2, 1]
      
      indices: [0, 2, 1, 1, 0]
      values: ['a', 'b', 'c']
    • And finally, sort the indices

      indices: [0, 0, 1, 1, 2]

      And, if I remember correctly, current sort_indices handles nulls and send them to the end of the array. Please let me know if I'm misunderstanding something.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
@ArianaVillegas Nulls in a dictionary array can be represented in two different ways:

  • nulls in the dictionary values, e.g.:

    values: ['a', null, 'b', 'c']
    indices: [0, 1, 1, 0, 2, 3]
  • nulls in the dictionary indices, e.g.:

    values: ['a', 'b', 'c']
    indices: [0, null, null, 0, 1, 2]

    Also, it can be a mixture of both, such as:

    values: ['a', null, 'b', 'c']
    indices: [0, 1, null, 0, 2, 3]

    (all three examples have the same value once decoded)

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
@ArianaVillegas Sorting the dictionary values is not exactly enough, because you must care about stability as well.
For example this input:

values: ['c', 'a', 'b', 'b']
indices: [0, 1, 3, 2, 3, 0]

Should generate the following sort indices:

sort_indices: [1, 2, 3, 4, 0, 5]

(note "2, 3, 4", not "3, 2, 4)

@asfimport
Copy link
Author

Ariana Villegas / @ArianaVillegas:
@pitrou, maybe this is a naive question, but why sort_indices: [1, 3, 2, 4, 0, 5] is wrong and [1, 2, 3, 4, 0, 5] is right? 

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Because the sort must be stable. The values at indices [2, 3, 4] all have the value "b", so they must stay in the same order.

@asfimport
Copy link
Author

Ariana Villegas / @ArianaVillegas:
Ok, I got it.

@pitrou In that case, I think we can do something like this:

  • Given the following dictionary:

    values: ['c', 'a', 'b', 'b']
    indices: [0, 1, 3, 2, 3, 0]
  • Get sort_idx from values and transform it to give the same idx to same values

    values_sort_idx = [1, 2, 3, 0]
    transformed_sort_idx = [3, 0, 1, 1]
  • Get sort_idx from transformed indices

    transformed_indices = [3, 0, 1, 1, 1, 3]
    sort_indices = [1, 2, 3, 4, 0, 5]

    With nulls, it will work similarly:

  • Given the following dictionary:

    values: ['a', null, 'b', 'c']
    indices: [0, 1, null, 0, 2, 3]
  • Get sort_idx from values and transform it to give the same idx to same values

    values_sort_idx = [0, 2, 3, 1]
    transformed_sort_idx = [0, 3, 1, 2]
  • Get sort_idx from transformed indices

    transformed_indices = [0, 3, null, 0, 1, 2]
    sort_indices = [0, 3, 4, 5, 1, 2]

     

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
IIUC, transformed_indices = take(transformed_sort_idx, indices) and sort_indices = sort_indices(transformed_indices) ?

The remaining problem is, what happens with the following dictionary:

values: ['a', null, 'b', 'c']
indices: [0, null, 1, 0, 2, 3]

It should also get sort_indices = [0, 3, 4, 5, 1, 2] as an answer. Does it?

@asfimport
Copy link
Author

Ariana Villegas / @ArianaVillegas:

transformed_indices = take(transformed_sort_idx, indices) and sort_indices = sort_indices(transformed_indices) ?
Yes.
It should also get sort_indices = [0, 3, 4, 5, 1, 2] as an answer. Does it?
Right now, it doesn't, it result is: [0, 3, 4, 5, 2, 1]

But to avoid that, we can replace null values into indices, so the problem will look like this:

values: ['a', null, 'b', 'c']

indices: [0, null, null, 0, 2, 3]

 

@pitrou, btw, why do we allow nulls in values? Shouldn't it be easier to have them only in indices?

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:

But to avoid that, we can replace null values into indices, so the problem will look like this:

We can indeed. Another possibility is to partition nulls away first, then work on non-null values (partitioning is how the sorting implementation already deals with null values for other data types). That might be a bit faster as well.

btw, why do we allow nulls in values? Shouldn't it be easier to have them only in indices?

Probably for compatibility with various data sources.

@asfimport
Copy link
Author

Ariana Villegas / @ArianaVillegas:
Where is the sorting implementation? 

We can indeed. Another possibility is to partition nulls away first, then work on non-null values (partitioning is how the sorting implementation already deals with null values for other data types). That might be a bit faster as well.
Nice, so we'll first partition nulls away and just care about non-nulls values, and at the end just merge both.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
You can find the sorting implementations in arrow/compute/kernels/vector_sort* as well as vector_array_sort.cc.
Specifically, the current null partitioning implementations are in vector_sort_internal.h, but none of them deals with dictionary arrays, so you'll have to add your own.

@asfimport
Copy link
Author

Ariana Villegas / @ArianaVillegas:
Ok, I'll try to implement our idea and let you know as soon as I have something you can review.

@asfimport
Copy link
Author

Christian Lorentzen:
Hi
I recently stumbled over this missing feature.
It is not 100% clear to me if the intent is to sort by values or by the indices. For example

values/dict = ["c", "a", "b"]
indices = [2, 0, 2, 1, 0]
decoded = ["b", "c", "b", "a", "c"]

Will this be sorted in ascending order as
A) orderd by dictionary
indices = [0, 0, 1, 2, 2]
decoded = ["c", "c", "a", "b", "b"]
B) orderd by decoded values
indices = [1, 2, 2, 0, 0]
decoded = ["a", b", "b", "c", "c"]

My hope is for A) such that the order of the dict in the DictionaryArray has a meaning.

@asfimport asfimport added this to the 11.0.0 milestone Jan 11, 2023
@raulcd raulcd removed this from the 11.0.0 milestone Jan 11, 2023
pitrou pushed a commit that referenced this issue May 22, 2023
### Rationale for this change

Sorting for `DictionaryArray`s is not currently supported.

### What changes are included in this PR?

- Adds support for dictionaries in the `array_sort_indices` kernel
- Adds tests and benchmarks
- Alters the internal `ArraySortFunc` definition to return an error status and accept the caller's `ExecContext` as an argument

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

### Notes

This picks up where #13334 left off. Those commits have been squashed and included in this PR.
* Closes: #29887

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ariana Villegas <ariana.villegas@utec.edu.pe>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 13.0.0 milestone May 22, 2023
@coady
Copy link

coady commented Jul 19, 2023

This did not implement sorting on dictionary chunked arrays, or in tables. Is that planned, or should I open a feature request?

@benibus
Copy link
Collaborator

benibus commented Jul 19, 2023

Not necessarily "planned" but it probably should be for consistency's sake. Opening a feature request sounds good.

@lorentzenchr
Copy link
Contributor

With pyarrow version 14

import pyarrow as pa

dict_array = pa.DictionaryArray.from_arrays([2, 0, 2, 1, 0], ["c", "a", "b"])
dict_array.dictionary_decode()  # ["b", "c", "b", "a", "c"]

dict_array.sort().dictionary_decode()  # ["a", "b", "b", "c", "c"]

So it is just the alpha numerical sort order, not the order as given by the dictionary. I find that a bit unfortunate and don't find any discussion of it.

@pitrou
Copy link
Member

pitrou commented Nov 3, 2023

This is what most people would expect. What did you expect here?

@lorentzenchr
Copy link
Contributor

As the dictionary is specified as ["c", "a", "b"], I would have expected to preserve that ordering resulting in a decoded array ["c", "c", "a", "b", "b"], or to have the option to do so.

With pandas ordered categoricals:

import pandas as pd

s = pd.Series(pd.Categorical(["b", "c", "b", "a", "c"], categories=["c", "a", "b"], ordered=True))
s  # b c b a c

s.sort_values()  # c c a b b

Use case: Being able to specify the sort order can make plots or reported tables much easier to create.

@pitrou
Copy link
Member

pitrou commented Nov 3, 2023

In this case, you could probably create a new column with just the dictionary indices and sort on that column.

@pitrou
Copy link
Member

pitrou commented Nov 3, 2023

cc @jorisvandenbossche FYI ^^

@lorentzenchr
Copy link
Contributor

In this case, you could probably create a new column with just the dictionary indices and sort on that column.

That's what I'm actually doing in a package of mine (with polars as df). Being able to specify the order as a user would just be much more convenient and less code. That's all I'm saying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants