
perf: Optimize array_sort() #21083

Open

neilconway wants to merge 3 commits into apache:main from neilconway:neilc/array-sort-custom-kernel

Conversation

@neilconway
Contributor

Which issue does this PR close?

Rationale for this change

The previous array_sort implementation called the Arrow sort kernel for every row, and then used concat to produce the final results. This was quite inefficient. Instead, we employ three different techniques depending on the input:

(1) For arrays of primitive types without null elements, we copy all values into a single Vec, sort each row's slice of the Vec in place, and then wrap the Vec in a GenericListArray.

(2) For arrays of primitive types with null elements, we use a similar approach, but with some extra bookkeeping to place null elements in the right position and construct the null buffer.

(3) For arrays of non-primitive types, we use RowConverter to convert the entire input into the row format in one call, sort row indices by comparing the encoded row values, and then use a single take() to construct the result of the sort.
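Technique (1) can be sketched in a few lines of plain Rust. This is a minimal illustration with i32 values and i32 offsets; the function name and signature are illustrative, not the PR's actual code:

```rust
// Sketch of technique (1): each adjacent pair of offsets delimits one row's
// slice of the flat values buffer, so sorting each slice in place sorts
// every array in a single pass over the buffer.
fn sort_rows_in_place(values: &mut [i32], offsets: &[i32]) {
    for win in offsets.windows(2) {
        let (start, end) = (win[0] as usize, win[1] as usize);
        values[start..end].sort_unstable();
    }
}

fn main() {
    // Two rows: [3, 1, 2] and [5, 4], stored in one flat buffer.
    let mut values = vec![3, 1, 2, 5, 4];
    let offsets = vec![0, 3, 5];
    sort_rows_in_place(&mut values, &offsets);
    println!("{:?}", values); // [1, 2, 3, 4, 5]
}
```

The key point is that no per-row allocation or concat is needed: the sorted rows land exactly where the unsorted rows were, so the buffer can be handed to GenericListArray as-is.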

Benchmarks (8192 rows, vs main):

int32/5 elements:          886 µs →  57 µs  (-94%)
int32/20 elements:        1.64 ms → 846 µs  (-48%)
int32/100 elements:       4.03 ms → 3.22 ms (-20%)
int32_null_elements/5:    1.17 ms → 168 µs  (-86%)
int32_null_elements/1000: 47.2 ms → 44.1 ms  (-7%)
string/5 elements:        2.12 ms → 727 µs  (-66%)
string/1000 elements:      405 ms → 293 ms  (-28%)

What changes are included in this PR?

  • New array_sort benchmark
  • Extended unit test coverage
  • Improved docs
  • Implemented the optimizations described above

Are these changes tested?

No.

Are there any user-facing changes?

No.

@github-actions bot added labels on Mar 20, 2026: documentation (Improvements or additions to documentation), sqllogictest (SQL Logic Tests (.slt)), functions (Changes to functions implementation)
@neilconway
Contributor Author

FYI @Dandandan -- thanks for the suggestion about avoiding the per-row sort kernel, seems quite effective.

}

-fn array_sort_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
+pub fn array_sort_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor

Claude flagged this:
4. array_sort_inner made pub just for benchmarks

A solution which I think could work:

  • [...] restructuring the benchmark to use the public ArraySort UDF instead

Contributor Author

pub(crate) should work, no?

let values_start = offsets[0].as_usize();
let total_values = offsets[row_count].as_usize() - values_start;
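That offset arithmetic can be illustrated with a tiny self-contained example (the offsets here are hypothetical, e.g. for a sliced array whose values buffer does not start at 0):

```rust
fn main() {
    // Hypothetical i32 offsets for a 3-row list array: row i's values live at
    // values[offsets[i]..offsets[i + 1]] in the flat child buffer.
    let offsets: Vec<i32> = vec![2, 4, 4, 9];
    let row_count = offsets.len() - 1;
    let values_start = offsets[0] as usize;
    let total_values = offsets[row_count] as usize - values_start;
    assert_eq!(values_start, 2); // sliced array: child buffer starts at index 2
    assert_eq!(total_values, 7); // rows of length 2, 0, and 5
}
```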

let converter = RowConverter::new(vec![SortField::new_with_options(
@Dandandan
Contributor
Mar 20, 2026

Why use a RowConverter for this one? I think the path in arrow-rs is something like:

  • partition on nulls
  • create the indices (begin...end)
  • sort by the strings (using the indices)
  • Add the nulls if it starts with nulls
  • Add the values (using take)
  • Add the nulls if it ends with nulls
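The non-null core of the steps above can be sketched in plain Rust, using string slices in place of a StringArray; argsort_take is an illustrative name, not an arrow-rs API:

```rust
// Sketch of the index-based sort path for one row of non-null strings.
fn argsort_take(values: &[&str]) -> Vec<String> {
    // "create the indices (begin...end)"
    let mut idx: Vec<usize> = (0..values.len()).collect();
    // "sort by the strings (using the indices)": compare the values the
    // indices point at, without moving the strings themselves.
    idx.sort_unstable_by(|&a, &b| values[a].cmp(values[b]));
    // "Add the values (using take)": gather in sorted index order.
    idx.into_iter().map(|i| values[i].to_string()).collect()
}

fn main() {
    let sorted = argsort_take(&["pear", "apple", "mango"]);
    println!("{:?}", sorted); // ["apple", "mango", "pear"]
}
```

In the real kernel the nulls would first be partitioned out, the take would be the arrow take() kernel, and the null run would be prepended or appended depending on the sort options.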

Contributor Author

I briefly considered something like that, but I figured that all the pointer chasing would be pretty expensive. You're right that it's worth comparing though.

Here's a quick Claude-generated version -- lmk if you had something else in mind.

Benchmarking it against the RowConverter approach: RowConverter wins for medium-sized arrays (20 elements) and larger, but loses to the index-based comparison approach for small arrays:

  ┌─────────────┬──────────┬─────────────────┬─────────────────┐
  │  Benchmark  │   main   │  RowConverter   │ make_comparator │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/5    │ 2.12 ms  │ 727 µs (-66%)   │ 608 µs (-71%)   │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/20   │ 5.94 ms  │ 4.42 ms (-26%)  │ 4.76 ms (-20%)  │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/100  │ 26.8 ms  │ 22.6 ms (-16%)  │ 25.1 ms (-6%)   │
  ├─────────────┼──────────┼─────────────────┼─────────────────┤
  │ string/1000 │ 404.9 ms │ 293.1 ms (-28%) │ 403.9 ms (~0%)  │
  └─────────────┴──────────┴─────────────────┴─────────────────┘

Not sure offhand what typical real-world workloads look like; lmk if you have a view.

Contributor

The Claude code looks close to what I had in mind, except for the make_comparator and some missing non-null specialization / the Vec::push, which cannot be inlined and generates slow, branchy code.

I guess for strings you could also use the same tricks as the arrow kernels: create an inlined key / small string as well for fast comparisons.

RowConverter is fine of course, but it has some higher fixed overhead upfront and some higher space usage as well, so for single columns I think type specialization always wins.

@Dandandan
Contributor

Woah, great!



Development

Successfully merging this pull request may close these issues:

  • Optimize array_sort with nested-list-aware sort kernel
  • Optimize array_sort
