Conversation
FYI @Dandandan -- thanks for the suggestion about avoiding the per-row sort kernel, seems quite effective.
-fn array_sort_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
+pub fn array_sort_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
Claude flagged this:
4. array_sort_inner made pub just for benchmarks
A solution which I think could work:
- [...] restructuring the benchmark to use the public ArraySort UDF instead
pub(crate) should work, no?
let values_start = offsets[0].as_usize();
let total_values = offsets[row_count].as_usize() - values_start;

let converter = RowConverter::new(vec![SortField::new_with_options(
Why use a RowConverter for this one? I think the path in arrow-rs is something like:
- partition on nulls
- create the indices (begin...end)
- sort by the strings (using the indices)
- Add the nulls if it starts with nulls
- Add the values (using take)
- Add the nulls if it ends with nulls
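For illustration, the steps above can be sketched std-only (not the arrow-rs code itself -- `Option<&str>` slices stand in for Arrow arrays, and the function name is hypothetical):

```rust
/// Sort one row's values: partition nulls to one side, sort row indices
/// by the strings they point at, then gather the values `take`-style.
fn sort_row(values: &[Option<&str>], nulls_first: bool) -> Vec<Option<String>> {
    // 1. Partition on nulls: keep only the indices of non-null values.
    let mut idx: Vec<usize> = (0..values.len())
        .filter(|&i| values[i].is_some())
        .collect();
    let null_count = values.len() - idx.len();

    // 2./3. Sort the indices by comparing the underlying strings.
    idx.sort_unstable_by(|&a, &b| values[a].cmp(&values[b]));

    // 4.-6. Reassemble: nulls on the requested side, values gathered
    // through the sorted indices.
    let mut out: Vec<Option<String>> = Vec::with_capacity(values.len());
    if nulls_first {
        out.extend(std::iter::repeat(None).take(null_count));
    }
    out.extend(idx.iter().map(|&i| values[i].map(|s| s.to_string())));
    if !nulls_first {
        out.extend(std::iter::repeat(None).take(null_count));
    }
    out
}
```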
I briefly considered something like that, but I figured that all the pointer chasing would be pretty expensive. You're right that it's worth comparing though.
Here's a quick Claude-generated version -- lmk if you had something else in mind.
Benchmarking it against the RowConverter approach: RowConverter wins for medium-sized arrays (20 elements) and larger, and loses to the index-based comparison approach for small arrays:
┌─────────────┬──────────┬─────────────────┬─────────────────┐
│ Benchmark │ main │ RowConverter │ make_comparator │
├─────────────┼──────────┼─────────────────┼─────────────────┤
│ string/5 │ 2.12 ms │ 727 µs (-66%) │ 608 µs (-71%) │
├─────────────┼──────────┼─────────────────┼─────────────────┤
│ string/20 │ 5.94 ms │ 4.42 ms (-26%) │ 4.76 ms (-20%) │
├─────────────┼──────────┼─────────────────┼─────────────────┤
│ string/100 │ 26.8 ms │ 22.6 ms (-16%) │ 25.1 ms (-6%) │
├─────────────┼──────────┼─────────────────┼─────────────────┤
│ string/1000 │ 404.9 ms │ 293.1 ms (-28%) │ 403.9 ms (~0%) │
└─────────────┴──────────┴─────────────────┴─────────────────┘
Not sure offhand what typical real-world workloads look like; lmk if you have a view.
The Claude code looks close to what I had in mind, except for the make_comparator and some non-null specialization / Vec::push calls which cannot be inlined and generate slow, branchy code.
I guess for strings you could also apply the same tricks as the arrow kernels: create an inlined key / small string for fast comparisons.
RowConverter is fine of course, but it has higher fixed overhead upfront and higher space usage as well, so for single columns I think type specialization always wins.
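The inlined-key trick mentioned above can be sketched roughly like this (a hand-rolled illustration, not the arrow kernel code): compare a fixed 8-byte prefix as a single integer first, and only touch the full strings on a tie:

```rust
use std::cmp::Ordering;

/// Pack the first (up to) 8 bytes of `s` into a u64 whose integer
/// order matches the byte-wise order of the zero-padded prefix.
fn prefix_key(s: &str) -> u64 {
    let bytes = s.as_bytes();
    let mut buf = [0u8; 8];
    let n = bytes.len().min(8);
    buf[..n].copy_from_slice(&bytes[..n]);
    // Big-endian: integer comparison == lexicographic byte comparison.
    u64::from_be_bytes(buf)
}

/// Cheap integer compare first; fall back to the full string
/// comparison only when the 8-byte prefixes are equal.
fn cmp_with_prefix(a: &str, b: &str) -> Ordering {
    prefix_key(a).cmp(&prefix_key(b)).then_with(|| a.cmp(b))
}
```

Zero-padding is safe here: a shorter string's padded prefix sorts before any longer string sharing those bytes, and genuinely equal prefixes fall through to the full comparison.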
Woah, great!
Which issue does this PR close?

Closes #21005 (`array_sort`).

Rationale for this change

The previous `array_sort` implementation called the Arrow sort kernel for every row and then used `concat` to produce the final result. This was quite inefficient. Instead, we employ three different techniques depending on the input:

(1) For arrays of primitive types without null elements, we copy all values into a single `Vec`, sort each row's slice of the `Vec` in place, and then wrap the `Vec` in a `GenericListArray`.

(2) For arrays of primitive types with null elements, we use a similar approach, but with some extra bookkeeping to place null elements in the right position and to construct the null buffer.
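A minimal std-only sketch of the in-place per-row sort from technique (1) -- the `values`/`offsets` names are hypothetical, and `i64` stands in for any primitive type:

```rust
/// Sort each row's slice of the flattened child values in place.
/// `offsets[i]..offsets[i + 1]` delimits row `i`, mirroring a
/// `GenericListArray`'s offset buffer.
fn sort_rows_in_place(values: &mut [i64], offsets: &[usize]) {
    for row in offsets.windows(2) {
        values[row[0]..row[1]].sort_unstable();
    }
}
```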
(3) For arrays of non-primitive types, we use `RowConverter` to convert the entire input into the row format in one call, sort row indices by comparing the encoded row values, and then use a single `take()` to construct the result of the sort.

Benchmarks (8192 rows, vs main):
What changes are included in this PR?
`array_sort` benchmark

Are these changes tested?
No.
Are there any user-facing changes?
No.