use sort_unstable_by in primitive sorting #552

jimexist · 2021-07-14T16:12:17Z

Which issue does this PR close?

use sort_unstable_by in primitive sorting

Closes #553

Rationale for this change

less memory usage
likely faster
when a limit is given, the k-select path is already unstable, so the unlimited path should be consistent with it
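
To make the consistency argument concrete, here is a minimal sketch of a limited sort built from a k-select step followed by an unstable sort of the selected prefix. It uses only the standard library (`select_nth_unstable_by`, `sort_unstable_by`); the helper name `sort_to_indices_limited` and its signature are hypothetical and are not taken from the arrow kernel, whose limit path may be implemented differently.

```rust
use std::cmp::Ordering;

/// Hypothetical helper (not the arrow API): indices of the `limit` smallest
/// values, ordered by value. The order among equal values is unspecified.
fn sort_to_indices_limited(values: &[i32], limit: usize) -> Vec<u32> {
    let mut indices: Vec<u32> = (0..values.len() as u32).collect();
    let limit = limit.min(indices.len());
    // Compare two indices by the values they point at.
    let cmp = |a: &u32, b: &u32| -> Ordering { values[*a as usize].cmp(&values[*b as usize]) };
    if limit > 0 && limit < indices.len() {
        // k-select: afterwards the first `limit` slots hold the smallest
        // elements, in arbitrary order. This step is inherently unstable.
        indices.select_nth_unstable_by(limit - 1, cmp);
    }
    indices.truncate(limit);
    // Only the selected prefix needs a full sort.
    indices.sort_unstable_by(cmp);
    indices
}
```

Calling `sort_to_indices_limited(&[5, 1, 4, 1, 3], 3)` yields indices whose values are `1, 1, 3`, but which of the two `1`s comes first is not guaranteed. Because the selection step already discards the original order of ties, a stable sort on the unlimited path would give a guarantee that the limited path cannot match.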

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter · 2021-07-14T16:28:14Z

Codecov Report

Merging #552 (8418f1c) into master (fc78af6) will not change coverage.
The diff coverage is 90.90%.

@@           Coverage Diff           @@
##           master     #552   +/-   ##
=======================================
  Coverage   82.47%   82.47%           
=======================================
  Files         167      167           
  Lines       46142    46142           
=======================================
  Hits        38056    38056           
  Misses       8086     8086

Impacted Files	Coverage Δ
arrow/src/compute/kernels/sort.rs	`94.15% <90.90%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
jimexist · 2021-07-14T16:30:34Z

sort 2^10               time:   [110.68 us 111.64 us 112.55 us]
                        change: [-14.710% -13.112% -11.406%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

sort 2^12               time:   [516.32 us 519.97 us 523.58 us]
                        change: [-10.932% -9.5913% -8.2390%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

sort nulls 2^10         time:   [88.568 us 89.197 us 89.907 us]
                        change: [-20.378% -18.966% -17.484%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

sort nulls 2^12         time:   [372.34 us 375.56 us 378.70 us]
                        change: [-25.293% -24.019% -22.641%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

bool sort 2^12          time:   [184.53 us 185.82 us 187.17 us]
                        change: [-60.022% -59.238% -58.451%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

bool sort nulls 2^12    time:   [210.24 us 213.40 us 216.47 us]
                        change: [-54.063% -53.122% -52.165%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

sort 2^12 limit 10      time:   [61.881 us 62.244 us 62.614 us]
                        change: [-8.6258% -6.5015% -4.3522%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

sort 2^12 limit 100     time:   [67.314 us 67.729 us 68.205 us]
                        change: [-5.5755% -3.5509% -1.4872%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

sort 2^12 limit 1000    time:   [178.07 us 179.37 us 180.82 us]
                        change: [-2.1034% +0.0621% +2.3026%] (p = 0.96 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe

sort 2^12 limit 2^12    time:   [459.51 us 462.14 us 464.90 us]
                        change: [-25.238% -23.036% -21.081%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

sort nulls 2^12 limit 10
                        time:   [117.28 us 118.29 us 119.29 us]
                        change: [+7.7135% +9.3966% +11.186%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

sort nulls 2^12 limit 100
                        time:   [106.53 us 107.89 us 109.50 us]
                        change: [-6.4134% -4.0486% -1.6589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

sort nulls 2^12 limit 1000
                        time:   [113.71 us 114.30 us 114.92 us]
                        change: [-5.9266% -4.5141% -3.0736%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

sort nulls 2^12 limit 2^12
                        time:   [336.84 us 340.39 us 344.08 us]
                        change: [-31.546% -30.224% -28.822%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

jorgecarleitao · 2021-07-19T06:42:35Z

I think that this is backward incompatible, since the sorted indices now differ between implementations. It is a semantic, not an API, backward incompatibility. See https://stackoverflow.com/questions/1517793/what-is-stability-in-sorting-algorithms-and-why-is-it-important if you are interested in why it is important.
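
A small illustration of the semantic difference, assuming a simplified sort-to-indices routine over a plain `&[i32]` (the function names here are hypothetical stand-ins, not the arrow kernel): with `sort_by`, indices of equal values keep their original relative order, while `sort_unstable_by` only promises that the values end up sorted.

```rust
/// Hypothetical stand-in for a stable sort-to-indices routine.
fn stable_indices(values: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    // `sort_by` is stable: ties keep their original relative order.
    idx.sort_by(|a, b| values[*a].cmp(&values[*b]));
    idx
}

/// Hypothetical stand-in for the unstable variant.
fn unstable_indices(values: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    // `sort_unstable_by` may reorder ties; only the value order is guaranteed.
    idx.sort_unstable_by(|a, b| values[*a].cmp(&values[*b]));
    idx
}
```

For `values = [2, 1, 2, 1]`, `stable_indices` always returns `[1, 3, 0, 2]`. `unstable_indices` returns some permutation whose values read `1, 1, 2, 2`, for example `[3, 1, 0, 2]`, so callers that relied on the tie order can observe a change even though no signature changed.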

jimexist · 2021-07-19T07:11:59Z

> I think that this is backward incompatible, since the sorted indices now differ between implementations. It is a semantic, not an API, backward incompatibility. See https://stackoverflow.com/questions/1517793/what-is-stability-in-sorting-algorithms-and-why-is-it-important if you are interested in why it is important.

Agreed that it's an API change. It is also an improvement in consistency, because the sort with a limit was already unstable. If needed, we can add a stable sort alternative later.
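
One possible shape for such a stable alternative, sketched here as an assumption rather than anything proposed in this PR: keep `sort_unstable_by`, but break ties on the original index. Since no two elements then compare equal, the result is deterministic and matches what a stable sort would produce. The function name is hypothetical.

```rust
/// Hypothetical sketch: stable ordering recovered from an unstable sort by
/// using the original index as a tie-breaker.
fn stable_indices_via_tiebreak(values: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    idx.sort_unstable_by(|a, b| {
        values[*a]
            .cmp(&values[*b])
            // Equal values fall back to comparing their positions.
            .then_with(|| a.cmp(b))
    });
    idx
}
```

The trade-off is an extra comparison on every tie, which may give back part of the speedup measured above on inputs with many repeated values.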

alamb

The change looks good to me. I think we should also update the docstring so it no longer says the sort is stable:

/// Sort the `ArrayRef` using `SortOptions`.
///
/// Performs a stable sort on values and indices. Nulls are ordered according to the `nulls_first` flag in `options`.
/// Floats are sorted using IEEE 754 totalOrder
///
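
For readers unfamiliar with the two ordering rules this docstring mentions, the following is a rough sketch of null placement plus IEEE 754 totalOrder, using `Option<f64>` in place of a nullable array and the standard library's `f64::total_cmp`. The function name and the `nulls_first: bool` parameter are illustrative assumptions; the thread does not show how the arrow kernel implements either rule.

```rust
use std::cmp::Ordering;

/// Hypothetical sketch: sort nullable floats in place, with `None` standing in
/// for null. Non-null values are ordered by IEEE 754 totalOrder.
fn sort_nullable_floats(values: &mut [Option<f64>], nulls_first: bool) {
    values.sort_unstable_by(|a, b| match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // `total_cmp` orders -0.0 before +0.0 and gives NaNs a fixed position.
        (Some(x), Some(y)) => x.total_cmp(y),
    });
}
```

With `nulls_first = true`, the input `[Some(0.0), None, Some(-0.0), Some(-1.5)]` becomes `[None, Some(-1.5), Some(-0.0), Some(0.0)]`.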

alamb · 2021-07-19T19:33:16Z

Thanks @jimexist

alamb · 2021-07-20T18:04:09Z

I will update the docstring in a follow-on PR; I am trying to keep the merge queue down.

alamb · 2021-07-20T18:04:30Z

Thanks again @jimexist

alamb · 2021-07-20T18:07:36Z

Doc update proposed in #572

use sort_unstable_by in primitive sorting

8418f1c

github-actions bot added the arrow label Jul 14, 2021

jorgecarleitao approved these changes Jul 17, 2021

Dandandan approved these changes Jul 18, 2021

nevi-me approved these changes Jul 19, 2021

jorgecarleitao added the api-change label Jul 19, 2021

alamb reviewed Jul 19, 2021

alamb merged commit 99ae88c into apache:master Jul 20, 2021

alamb mentioned this pull request Jul 20, 2021

Update sort kernel docs to note it is unstable #572

Merged

alamb added the cherry-picked label Jul 20, 2021

alamb mentioned this pull request Jul 20, 2021

Cherry pick use sort_unstable_by in primitive sorting to active_release #573

Closed

jimexist deleted the use-unstable-sort branch July 21, 2021 01:33

alamb removed the cherry-picked label Jul 21, 2021

waynexia mentioned this pull request Jul 27, 2021

Sort binary #569

Merged

This was referenced Sep 9, 2021

Rework the python bindings using conversion traits from arrow-rs apache/datafusion#873

Merged

Update DataFusion to arrow 6.0 apache/datafusion#984

Merged

alamb mentioned this pull request Apr 10, 2022

fix: Sort with a lot of repetition values apache/datafusion#2182

Merged

alamb mentioned this pull request Jul 22, 2023

Unstable (lex)sort_to_indices #4545

Open
