ARROW-14314: [C++] Sorting dictionary array not implemented #13334

Closed
wants to merge 11 commits into from

Conversation

ArianaVillegas
Contributor

@ArianaVillegas ArianaVillegas commented Jun 7, 2022

Sort dictionary array

Implementation:

  • Split null and non-null values
  • Get the sorted indices array of the dictionary values
  • Assign a new index to each distinct dictionary value
  • Sort the non-null values based on the previously obtained indices
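
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `DictColumn` and `SortDictIndices` are hypothetical names, nulls are placed last, and the dictionary values are assumed to be unique and non-null.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary array: indices[i] selects an entry
// of values; std::nullopt marks a null slot.
struct DictColumn {
  std::vector<std::string> values;
  std::vector<std::optional<uint32_t>> indices;
};

// Returns row positions ordered by decoded value, nulls last.
// Assumes dictionary values are unique and non-null.
std::vector<uint64_t> SortDictIndices(const DictColumn& col) {
  // Step 1: split null and non-null rows, preserving their relative order.
  std::vector<uint64_t> rows(col.indices.size());
  std::iota(rows.begin(), rows.end(), 0);
  auto nulls_begin = std::stable_partition(
      rows.begin(), rows.end(),
      [&](uint64_t r) { return col.indices[r].has_value(); });

  // Step 2: sorted indices of the dictionary values.
  std::vector<uint32_t> order(col.values.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
    return col.values[a] < col.values[b];
  });

  // Step 3: rank[d] is the position of dictionary entry d in sorted order.
  std::vector<uint32_t> rank(col.values.size());
  for (uint32_t pos = 0; pos < order.size(); ++pos) rank[order[pos]] = pos;

  // Step 4: sort the non-null rows by the rank of the entry they point to.
  std::stable_sort(rows.begin(), nulls_begin, [&](uint64_t a, uint64_t b) {
    return rank[*col.indices[a]] < rank[*col.indices[b]];
  });
  return rows;
}
```

The point of the design is that string comparisons happen only while sorting the dictionary (step 2); the per-row sort in step 4 compares small integers.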

@ArianaVillegas
Contributor Author

@pitrou This is a first draft. I have a couple of questions:

  • Should we be able to deduce the type of the indices array? Or is it OK to work with UInt64Array?
  • Is there a way to cast the dictionary values based on the dictionary type?

@pitrou
Member

pitrou commented Jun 9, 2022

> @pitrou This is a first draft. I have a couple of questions:
>
> * Should we be able to deduce the type of the indices array? Or is it OK to work with UInt64Array?

We should support all index types, that is, all integer types.

> * Is there a way to cast the dictionary values based on the dictionary type?

I'm not sure I understand the question. Which type would you cast the values to? (And/or can you give a concrete example?)

@ArianaVillegas
Contributor Author

> I'm not sure I understand the question. Which type would you cast the values to? (And/or can you give a concrete example?)

Sure, for example, if we have a dictionary d, we'll have an indices array, a values array, and a dictionary type with indices and values types.

indices = dict_array.indices();
values = dict_array.dictionary();
dict_type = dict_array.dict_type();
indices_type = dict_type->index_type();
values_type = dict_type->value_type();

And I want to know whether it's possible to cast indices into an array of indices_type. The main reason I want to cast the indices array is to use GetView. Or can I get a uint64_t from a Scalar?

And another question (OOT): can we build an array from a dictionary array? Something like this:

values = [ b, null, c, a]
indices = [0, null, 1, 0, 2, 3]

final_array = [b, null, null, b, c, a]

@pitrou
Member

pitrou commented Jun 9, 2022

> And I want to know whether it's possible to cast indices into an array of indices_type.

Ah, yes, of course. Just use checked_cast with the right concrete array type.

@pitrou
Member

pitrou commented Jun 9, 2022

> And another question (OOT): can we build an array from a dictionary array?

You can, for example by using the take function (take(values, indices)).
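
To make the semantics concrete, here is a rough standard-library sketch of what take(values, indices) computes for the example given earlier in the thread. `Slot` and `TakeSketch` are hypothetical names for illustration only; this is not Arrow's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// A nullable string slot; std::nullopt stands for a null entry.
using Slot = std::optional<std::string>;

// Hypothetical helper mirroring the effect of take(values, indices):
// output[i] is values[indices[i]], and a null index yields a null output.
std::vector<Slot> TakeSketch(const std::vector<Slot>& values,
                             const std::vector<std::optional<int32_t>>& indices) {
  std::vector<Slot> out;
  out.reserve(indices.size());
  for (const auto& idx : indices) {
    if (idx.has_value()) {
      // The selected dictionary value may itself be null; copy it as-is.
      out.push_back(values[static_cast<std::size_t>(*idx)]);
    } else {
      out.push_back(std::nullopt);
    }
  }
  return out;
}
```

With values = [b, null, c, a] and indices = [0, null, 1, 0, 2, 3], this yields [b, null, null, b, c, a]: the second output is null because the index is null, the third because the dictionary entry is null.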

@ArianaVillegas
Contributor Author

In the end, to avoid the cast, I'm using stoull to get the index. Should I add tests with other dictionary value types?

@pitrou
Member

pitrou commented Jun 9, 2022

You certainly don't want to go through GetScalar or string conversion as that will exhibit very bad performance.

@ArianaVillegas
Contributor Author

Got it, so which is the best way of getting the index?

@pitrou
Member

pitrou commented Jun 9, 2022

You would probably have to cast the indices to the concrete array type :-)
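
The general pattern behind this advice is to dispatch once on the runtime index type, then run a typed loop over raw integers, rather than converting each element through a scalar or a string. The sketch below shows that pattern without Arrow: `IndexType`, `ErasedIndices`, `WidenTyped` and `Widen` are all hypothetical names, and in Arrow the equivalent of the `static_cast` would be a `checked_cast` to the concrete array class, as suggested above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical type-erased index buffer, standing in for an array whose
// concrete integer type is only known at runtime.
enum class IndexType { kInt8, kInt16, kInt32, kInt64 };

struct ErasedIndices {
  IndexType type;
  const void* data;
  int64_t length;
};

// Typed inner loop: one cast up front, then plain integer loads.
// Dictionary indices are assumed to be non-negative.
template <typename T>
std::vector<uint64_t> WidenTyped(const ErasedIndices& in) {
  const T* typed = static_cast<const T*>(in.data);
  std::vector<uint64_t> out(static_cast<std::size_t>(in.length));
  for (int64_t i = 0; i < in.length; ++i) {
    out[static_cast<std::size_t>(i)] = static_cast<uint64_t>(typed[i]);
  }
  return out;
}

// Runtime dispatch: pick the typed loop that matches the stored index type.
std::vector<uint64_t> Widen(const ErasedIndices& in) {
  switch (in.type) {
    case IndexType::kInt8:  return WidenTyped<int8_t>(in);
    case IndexType::kInt16: return WidenTyped<int16_t>(in);
    case IndexType::kInt32: return WidenTyped<int32_t>(in);
    case IndexType::kInt64: return WidenTyped<int64_t>(in);
  }
  throw std::invalid_argument("unsupported index type");
}
```

The switch runs once per array, so its cost does not grow with the number of elements, unlike a per-element scalar or string conversion.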

@ArianaVillegas
Contributor Author

@pitrou I cast index array to concrete type :)

@ArianaVillegas ArianaVillegas marked this pull request as ready for review June 21, 2022 15:18
auto indices_array =
CallFunction("array_sort_indices", {values}, &options).ValueOrDie().make_array();

const auto& indices_values = checked_cast<const UInt64Array&>(*indices_array);
Member

This will fail if index_type is not UInt64Type. You should instead do the appropriate cast in SortInternal above.

Contributor Author

I think this should work because it is sorting values array, and as far as I know the output array is of UInt64 type. Probably I should change indices_array to values_array to avoid confusion.

@pitrou
Member

pitrou commented Jun 30, 2022

@ArianaVillegas I'm sorry for the delay (was on a trip then got sick :-(). I posted a couple comments; I don't know if they make sense to you, feel free to ask for more guidance if needed!

@ArianaVillegas
Contributor Author

ArianaVillegas commented Jul 4, 2022

@pitrou do you know why the sorter's operator() returns a NullPartitionResult when the sorted indices are already written through indices_begin and indices_end? Why not just return a Status instead?

Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
                                         const Array& array, int64_t offset,
                                         const ArraySortOptions& options)

And, if NullPartitionResult is needed, why is it not assigned in ArraySortIndices?

@ArianaVillegas
Contributor Author

@westonpace It'd be great if you could review this PR too.

@westonpace westonpace self-requested a review July 8, 2022 23:07
@westonpace
Member

> @westonpace It'd be great if you could review this PR too.

I'll take a look next week (but feel free to merge in the meantime)

@westonpace
Member

> @pitrou do you know why the sorter's operator() returns a NullPartitionResult when the sorted indices are already written through indices_begin and indices_end? Why not just return a Status instead?
>
> Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
>                                        const Array& array, int64_t offset,
>                                        const ArraySortOptions& options)
>
> And, if NullPartitionResult is needed, why is it not assigned in ArraySortIndices?

@ArianaVillegas

GetArraySorter is also used in RankInternal

Member

@westonpace westonpace left a comment

I am not an expert in this code but here are a few thoughts.

cpp/src/arrow/compute/kernels/vector_array_sort.cc (3 review comments, outdated and resolved)
@westonpace
Member

These Ruby failures may be related but I can't really understand them. @kou do you mind taking a look?

@kou
Member

kou commented Jul 15, 2022

Sure.

@ArianaVillegas ArianaVillegas force-pushed the ARROW-14314 branch 2 times, most recently from 6e6b909 to a452b76 Compare July 18, 2022 23:03
@ArianaVillegas
Contributor Author

I keep getting this error. It looks like the AMD64 MacOS tests are failing, any ideas? @pitrou @westonpace @kou

Undefined symbols for architecture x86_64:
  "google::protobuf::internal::InternalMetadata::~InternalMetadata()"

@pitrou
Member

pitrou commented Jul 19, 2022

@ArianaVillegas Yes, this is an unrelated issue that we are hoping to fix.

@pitrou
Member

pitrou commented Jul 21, 2022

@ArianaVillegas Ok, I took a look and then pushed an update. Things I added:

  • a random-generated test that results are identical when sorting a dict array and sorting the decoded dict array
  • some benchmarks

Here are the benchmark results:

  • on integers:
ArraySortIndicesInt64Narrow/32768/10000      11851 ns        11849 ns        57862 bytes_per_second=2.57561G/s items_per_second=345.692M/s null_percent=0.01 size=32.768k
ArraySortIndicesInt64Narrow/32768/100        11985 ns        11982 ns        58758 bytes_per_second=2.54692G/s items_per_second=341.842M/s null_percent=1 size=32.768k
ArraySortIndicesInt64Narrow/32768/10         13707 ns        13704 ns        50653 bytes_per_second=2.22694G/s items_per_second=298.894M/s null_percent=10 size=32.768k
ArraySortIndicesInt64Narrow/32768/2          23804 ns        23800 ns        28934 bytes_per_second=1.28226G/s items_per_second=172.102M/s null_percent=50 size=32.768k
ArraySortIndicesInt64Narrow/32768/1           6120 ns         6119 ns       114497 bytes_per_second=4.98747G/s items_per_second=669.407M/s null_percent=100 size=32.768k
ArraySortIndicesInt64Narrow/32768/0          11855 ns        11853 ns        59640 bytes_per_second=2.5747G/s items_per_second=345.57M/s null_percent=0 size=32.768k
ArraySortIndicesInt64Narrow/1048576/100     398029 ns       397942 ns         1771 bytes_per_second=2.45404G/s items_per_second=329.375M/s null_percent=1 size=1048.58k
ArraySortIndicesInt64Narrow/8388608/100    5079866 ns      5075697 ns          137 bytes_per_second=1.5392G/s items_per_second=206.588M/s null_percent=1 size=8.38861M
ArraySortIndicesInt64Wide/32768/10000       182747 ns       182712 ns         3817 bytes_per_second=171.034M/s items_per_second=22.4178M/s null_percent=0.01 size=32.768k
ArraySortIndicesInt64Wide/32768/100         186532 ns       186503 ns         3749 bytes_per_second=167.557M/s items_per_second=21.9621M/s null_percent=1 size=32.768k
ArraySortIndicesInt64Wide/32768/10          177480 ns       177452 ns         3917 bytes_per_second=176.104M/s items_per_second=23.0824M/s null_percent=10 size=32.768k
ArraySortIndicesInt64Wide/32768/2           109780 ns       109753 ns         6371 bytes_per_second=284.731M/s items_per_second=37.3202M/s null_percent=50 size=32.768k
ArraySortIndicesInt64Wide/32768/1             6090 ns         6089 ns       115057 bytes_per_second=5.01168G/s items_per_second=672.656M/s null_percent=100 size=32.768k
ArraySortIndicesInt64Wide/32768/0           182501 ns       182462 ns         3823 bytes_per_second=171.269M/s items_per_second=22.4486M/s null_percent=0 size=32.768k
ArraySortIndicesInt64Wide/1048576/100      9275293 ns      9273955 ns           75 bytes_per_second=107.829M/s items_per_second=14.1333M/s null_percent=1 size=1048.58k
ArraySortIndicesInt64Wide/8388608/100     97555560 ns     97519573 ns            7 bytes_per_second=82.0348M/s items_per_second=10.7525M/s null_percent=1 size=8.38861M
ArraySortIndicesInt64Dict/32768/10000       147982 ns       147952 ns         4726 bytes_per_second=211.218M/s items_per_second=27.6847M/s null_percent=0.01 size=32.768k
ArraySortIndicesInt64Dict/32768/100         187246 ns       187208 ns         3699 bytes_per_second=166.927M/s items_per_second=21.8795M/s null_percent=1 size=32.768k
ArraySortIndicesInt64Dict/32768/10          164064 ns       164029 ns         4218 bytes_per_second=190.515M/s items_per_second=24.9711M/s null_percent=10 size=32.768k
ArraySortIndicesInt64Dict/32768/2           111406 ns       111383 ns         6205 bytes_per_second=280.564M/s items_per_second=36.774M/s null_percent=50 size=32.768k
ArraySortIndicesInt64Dict/32768/1            68672 ns        68657 ns         9926 bytes_per_second=455.164M/s items_per_second=59.6593M/s null_percent=100 size=32.768k
ArraySortIndicesInt64Dict/32768/0           145594 ns       145571 ns         4766 bytes_per_second=214.671M/s items_per_second=28.1374M/s null_percent=0 size=32.768k
ArraySortIndicesInt64Dict/1048576/100      6942861 ns      6941856 ns          101 bytes_per_second=144.054M/s items_per_second=18.8814M/s null_percent=1 size=1048.58k
ArraySortIndicesInt64Dict/8388608/100     64016223 ns     63989906 ns           11 bytes_per_second=125.02M/s items_per_second=16.3866M/s null_percent=1 size=8.38861M
  • on strings:
ArraySortIndicesStrings/32768/10000         368405 ns       368350 ns         1898 items_per_second=88.9589M/s null_percent=0.01 size=32.768k
ArraySortIndicesStrings/32768/100           366529 ns       366463 ns         1906 items_per_second=89.417M/s null_percent=1 size=32.768k
ArraySortIndicesStrings/32768/10            340724 ns       340671 ns         2051 items_per_second=96.1866M/s null_percent=10 size=32.768k
ArraySortIndicesStrings/32768/2             193334 ns       193303 ns         3606 items_per_second=169.517M/s null_percent=50 size=32.768k
ArraySortIndicesStrings/32768/1               6175 ns         6174 ns       113451 items_per_second=5.30736G/s null_percent=100 size=32.768k
ArraySortIndicesStrings/32768/0             363407 ns       363341 ns         1926 items_per_second=90.1852M/s null_percent=0 size=32.768k
ArraySortIndicesStrings/1048576/100       16882256 ns     16879894 ns           41 items_per_second=62.1198M/s null_percent=1 size=1048.58k
ArraySortIndicesStrings/8388608/100      211687626 ns    211590267 ns            3 items_per_second=39.6455M/s null_percent=1 size=8.38861M
ArraySortIndicesStringsDict/32768/10000     191436 ns       191343 ns         3691 items_per_second=171.253M/s null_percent=0.01 size=32.768k
ArraySortIndicesStringsDict/32768/100       188732 ns       188644 ns         3619 items_per_second=173.703M/s null_percent=1 size=32.768k
ArraySortIndicesStringsDict/32768/10        162775 ns       162744 ns         4286 items_per_second=201.346M/s null_percent=10 size=32.768k
ArraySortIndicesStringsDict/32768/2         111711 ns       111677 ns         6137 items_per_second=293.417M/s null_percent=50 size=32.768k
ArraySortIndicesStringsDict/32768/1          71330 ns        71316 ns         9638 items_per_second=459.479M/s null_percent=100 size=32.768k
ArraySortIndicesStringsDict/32768/0         146030 ns       146000 ns         4770 items_per_second=224.438M/s null_percent=0 size=32.768k
ArraySortIndicesStringsDict/1048576/100    6975856 ns      6974757 ns          100 items_per_second=150.339M/s null_percent=1 size=1048.58k
ArraySortIndicesStringsDict/8388608/100   64342819 ns     64317553 ns           11 items_per_second=130.425M/s null_percent=1 size=8.38861M

We can see that there is a speed increase on a dict of strings (compared to a plain strings array), but not on a dict of integers. This is already nice, I'll try to see if there's a way to be better still.

@pitrou
Member

pitrou commented Jul 21, 2022

> We can see that there is a speed increase on a dict of strings (compared to a plain strings array), but not on a dict of integers. This is already nice, I'll try to see if there's a way to be better still.

Hmm, I think making it better requires a Rank variant that preserves nulls in the output instead of giving them a numeric rank. Shouldn't be very hard, but probably for a separate JIRA and PR still.

Edit: this probably can be done internally instead, as it's really simple.

@pitrou
Member

pitrou commented Jul 21, 2022

Ok, so I implemented the algorithm which was originally envisioned in https://issues.apache.org/jira/browse/ARROW-14314?focusedCommentId=17445377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17445377

The performance numbers are indeed quite better now:

  • on integers:
ArraySortIndicesInt64Narrow/32768/10000      12096 ns        12093 ns        58332 bytes_per_second=2.52347G/s items_per_second=338.694M/s null_percent=0.01 size=32.768k
ArraySortIndicesInt64Narrow/32768/100        12100 ns        12098 ns        57680 bytes_per_second=2.5226G/s items_per_second=338.578M/s null_percent=1 size=32.768k
ArraySortIndicesInt64Narrow/32768/10         13387 ns        13384 ns        50920 bytes_per_second=2.28012G/s items_per_second=306.032M/s null_percent=10 size=32.768k
ArraySortIndicesInt64Narrow/32768/2          23478 ns        23473 ns        28378 bytes_per_second=1.3001G/s items_per_second=174.496M/s null_percent=50 size=32.768k
ArraySortIndicesInt64Narrow/32768/1           6422 ns         6420 ns       109333 bytes_per_second=4.75339G/s items_per_second=637.989M/s null_percent=100 size=32.768k
ArraySortIndicesInt64Narrow/32768/0          12042 ns        12040 ns        57830 bytes_per_second=2.53471G/s items_per_second=340.203M/s null_percent=0 size=32.768k
ArraySortIndicesInt64Narrow/1048576/100     400335 ns       400263 ns         1721 bytes_per_second=2.4398G/s items_per_second=327.465M/s null_percent=1 size=1048.58k
ArraySortIndicesInt64Narrow/8388608/100    5230098 ns      5226130 ns          134 bytes_per_second=1.49489G/s items_per_second=200.641M/s null_percent=1 size=8.38861M

ArraySortIndicesInt64Wide/32768/10000       196031 ns       195992 ns         3553 bytes_per_second=159.446M/s items_per_second=20.8989M/s null_percent=0.01 size=32.768k
ArraySortIndicesInt64Wide/32768/100         199523 ns       199488 ns         3493 bytes_per_second=156.651M/s items_per_second=20.5325M/s null_percent=1 size=32.768k
ArraySortIndicesInt64Wide/32768/10          189211 ns       189180 ns         3692 bytes_per_second=165.187M/s items_per_second=21.6513M/s null_percent=10 size=32.768k
ArraySortIndicesInt64Wide/32768/2           117105 ns       117085 ns         5907 bytes_per_second=266.899M/s items_per_second=34.983M/s null_percent=50 size=32.768k
ArraySortIndicesInt64Wide/32768/1             6399 ns         6398 ns       109441 bytes_per_second=4.76962G/s items_per_second=640.168M/s null_percent=100 size=32.768k
ArraySortIndicesInt64Wide/32768/0           195661 ns       195629 ns         3571 bytes_per_second=159.741M/s items_per_second=20.9376M/s null_percent=0 size=32.768k
ArraySortIndicesInt64Wide/1048576/100      9867426 ns      9865950 ns           71 bytes_per_second=101.359M/s items_per_second=13.2853M/s null_percent=1 size=1048.58k
ArraySortIndicesInt64Wide/8388608/100    104777802 ns    104736080 ns            7 bytes_per_second=76.3825M/s items_per_second=10.0116M/s null_percent=1 size=8.38861M

ArraySortIndicesInt64Dict/32768/10000        21406 ns        21401 ns        32998 bytes_per_second=1.42602G/s items_per_second=191.397M/s null_percent=0.01 size=32.768k
ArraySortIndicesInt64Dict/32768/100          23328 ns        23320 ns        29746 bytes_per_second=1.30864G/s items_per_second=175.643M/s null_percent=1 size=32.768k
ArraySortIndicesInt64Dict/32768/10           32835 ns        32828 ns        20994 bytes_per_second=951.943M/s items_per_second=124.773M/s null_percent=10 size=32.768k
ArraySortIndicesInt64Dict/32768/2            63866 ns        63851 ns        10790 bytes_per_second=489.417M/s items_per_second=64.1489M/s null_percent=50 size=32.768k
ArraySortIndicesInt64Dict/32768/1            51498 ns        51488 ns        13377 bytes_per_second=606.943M/s items_per_second=79.5532M/s null_percent=100 size=32.768k
ArraySortIndicesInt64Dict/32768/0            20983 ns        20978 ns        33428 bytes_per_second=1.45471G/s items_per_second=195.248M/s null_percent=0 size=32.768k
ArraySortIndicesInt64Dict/1048576/100       618425 ns       618258 ns         1150 bytes_per_second=1.57954G/s items_per_second=212.002M/s null_percent=1 size=1048.58k
ArraySortIndicesInt64Dict/8388608/100      9134358 ns      9127689 ns           77 bytes_per_second=876.454M/s items_per_second=114.879M/s null_percent=1 size=8.38861M
  • on strings:
ArraySortIndicesStrings/32768/10000         380932 ns       380866 ns         1833 items_per_second=10.7544M/s null_percent=0.01 size=4.096k
ArraySortIndicesStrings/32768/100           381913 ns       381849 ns         1831 items_per_second=10.7267M/s null_percent=1 size=4.096k
ArraySortIndicesStrings/32768/10            353224 ns       353159 ns         1978 items_per_second=11.5982M/s null_percent=10 size=4.096k
ArraySortIndicesStrings/32768/2             198191 ns       198155 ns         3485 items_per_second=20.6707M/s null_percent=50 size=4.096k
ArraySortIndicesStrings/32768/1               6516 ns         6514 ns       107935 items_per_second=628.767M/s null_percent=100 size=4.096k
ArraySortIndicesStrings/32768/0             376185 ns       376122 ns         1866 items_per_second=10.8901M/s null_percent=0 size=4.096k
ArraySortIndicesStrings/1048576/100       17554424 ns     17551747 ns           40 items_per_second=7.46775M/s null_percent=1 size=131.072k
ArraySortIndicesStrings/8388608/100      218960374 ns    218848644 ns            3 items_per_second=4.79133M/s null_percent=1 size=1048.58k

ArraySortIndicesStringsDict/32768/10000      21847 ns        21842 ns        32054 items_per_second=187.525M/s null_percent=0.01 size=4.096k
ArraySortIndicesStringsDict/32768/100        23700 ns        23695 ns        29360 items_per_second=172.865M/s null_percent=1 size=4.096k
ArraySortIndicesStringsDict/32768/10         33740 ns        33732 ns        20298 items_per_second=121.428M/s null_percent=10 size=4.096k
ArraySortIndicesStringsDict/32768/2          64616 ns        64602 ns        10581 items_per_second=63.4031M/s null_percent=50 size=4.096k
ArraySortIndicesStringsDict/32768/1          53657 ns        53645 ns        12715 items_per_second=76.3539M/s null_percent=100 size=4.096k
ArraySortIndicesStringsDict/32768/0          21505 ns        21500 ns        32619 items_per_second=190.512M/s null_percent=0 size=4.096k
ArraySortIndicesStringsDict/1048576/100     629995 ns       629868 ns         1113 items_per_second=208.094M/s null_percent=1 size=131.072k
ArraySortIndicesStringsDict/8388608/100    9055816 ns      9045824 ns           77 items_per_second=115.918M/s null_percent=1 size=1048.58k

We see that sorting a dictionary array is now much faster (potentially 10x faster) than sorting an equivalent dense array, if the dictionary size is small (which is a very common case).
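
One plausible reason a small dictionary helps so much: once every row is replaced by the rank of its dictionary value, the keys are small integers in a narrow range, and they can be ordered in linear time with a counting sort. The sketch below illustrates that idea only. It is an assumption about the general technique, not a copy of the linked JIRA comment or of the kernel in this PR, and `CountingSortByRank` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ranks[i] is the precomputed rank of row i's dictionary value, in
// [0, num_ranks). Returns row positions in ascending rank order; rows with
// equal rank keep their original relative order (the sort is stable).
std::vector<uint64_t> CountingSortByRank(const std::vector<uint32_t>& ranks,
                                         uint32_t num_ranks) {
  // offsets[r + 1] first counts the rows with rank r ...
  std::vector<uint64_t> offsets(static_cast<std::size_t>(num_ranks) + 1, 0);
  for (uint32_t r : ranks) ++offsets[static_cast<std::size_t>(r) + 1];
  // ... then a prefix sum turns offsets[r] into the first output slot of rank r.
  for (std::size_t r = 1; r < offsets.size(); ++r) offsets[r] += offsets[r - 1];

  // Scatter each row into the next free slot of its rank.
  std::vector<uint64_t> out(ranks.size());
  for (std::size_t row = 0; row < ranks.size(); ++row) {
    out[offsets[ranks[row]]++] = row;
  }
  return out;
}
```

The cost is proportional to the number of rows plus the number of distinct ranks, with no value comparisons at all, which fits the observation that the gain depends on the dictionary being small.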

@pitrou
Member

pitrou commented Jul 21, 2022

cc @cyb70289

@pitrou pitrou marked this pull request as draft July 21, 2022 16:47
@pitrou
Member

pitrou commented Jul 21, 2022

I'm making this a draft because the FIXMEs will require a couple internal API changes to address.

pitrou pushed a commit that referenced this pull request May 22, 2023
### Rationale for this change

Sorting for `DictionaryArray`s is not currently supported.

### What changes are included in this PR?

- Adds support for dictionaries in the `array_sort_indices` kernel
- Adds tests and benchmarks
- Alters the internal `ArraySortFunc` definition to return an error status and accept the caller's `ExecContext` as an argument

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

### Notes

This picks up where #13334 left off. Those commits have been squashed and included in this PR.
* Closes: #29887

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ariana Villegas <ariana.villegas@utec.edu.pe>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou
Member

pitrou commented Jun 1, 2023

This has been superseded by #35280, closing.

@pitrou pitrou closed this Jun 1, 2023