ARROW-7924: [Rust] Add sort for float types #7193

nevi-me · 2020-05-15T16:36:17Z

This relaxes the trait bound from `std::cmp::Ord` to `std::cmp::PartialOrd` to enable sorting by floats.

github-actions · 2020-05-15T16:46:36Z

https://issues.apache.org/jira/browse/ARROW-7924

andygrove · 2020-05-15T18:36:54Z

rust/arrow/src/compute/kernels/sort.rs

@@ -149,9 +151,9 @@ where
        .collect::<Vec<(u32, T::Native)>>();
    let mut nulls = null_indices;
    if !options.descending {
-        valids.sort_by_key(|a| a.1);
+        valids.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());


Won't this panic if the vector contains any NaN values? partial_cmp would return None in that case.

Hmm, do we ever store NaN, or do we represent it as a null? I've never tried creating an array with NaN.
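To make the concern concrete, here is a minimal sketch using plain `f64` values rather than Arrow arrays. The helper name `compare` is hypothetical and is not part of the PR. It returns the `Option` that `partial_cmp` produces, so the `None` case that `unwrap()` would panic on is visible.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: the same comparison as in the diff above,
/// but returning the Option instead of unwrapping it.
fn compare(a: f64, b: f64) -> Option<Ordering> {
    a.partial_cmp(&b)
}

/// Hypothetical helper: true if the slice holds at least one NaN,
/// i.e. at least one value for which `compare` can return None.
fn contains_nan(values: &[f64]) -> bool {
    values.iter().any(|v| v.is_nan())
}
```

`compare` returns `None` whenever either argument is NaN, including when both are, so calling `unwrap()` on that result inside `sort_by` panics as soon as a NaN takes part in a comparison.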

andygrove · 2020-05-15T18:39:58Z

rust/arrow/src/compute/kernels/sort.rs

+                Some(2.225),
+                Some(-1.01),
+                Some(-0.05),
+                None,


Could you add f64::NAN to the test values?

I can look at this in more detail over the weekend, but I think having a special impl for f32/f64 is probably the way to go. Then you can add specific checks for f32::NAN and f64::NAN before calling partial_cmp, and we'll need to decide whether NaN comes before or after valid numbers.

PTAL at the solution that I've just pushed. Given that the unwrap() will only fail if there's a NaN, I've replaced it with a default of std::cmp::Ordering::Greater to treat NaN as the highest value. In the descending sort path, it ends up being inverted to the lowest value.

I suppose a better approach might be to let the nulls_first sort option drive the behaviour, as follows:

ascending, nulls last: Ordering::Greater placing NaNs before the first null

ascending, nulls first: Ordering::Less placing NaNs after the last null

descending, nulls last: Ordering::Greater

descending, nulls first: Ordering::Less

... so the ordering is determined by the null behaviour (it helped to write it out 😄)

The problem with this approach is that the NaN could be a, or b, or both, so this comparison is now inconsistent (it returns the same Ordering whichever side holds the NaN) and the result is effectively non-deterministic. I think implementing this specifically for Float32Array and Float64Array and checking for NaN on both values is the only way we can handle this correctly.
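A small sketch of that inconsistency, and of a comparison that checks both sides, using plain `f64` values. Both function names are hypothetical, and placing NaN after all numbers is an assumption made for illustration; the thread leaves that choice open.

```rust
use std::cmp::Ordering;

/// The fallback comparison discussed above: any comparison involving
/// NaN yields Greater, regardless of which side the NaN is on.
fn cmp_with_fallback(a: f64, b: f64) -> Ordering {
    a.partial_cmp(&b).unwrap_or(Ordering::Greater)
}

/// Hypothetical NaN-aware comparison (assumption: NaN sorts after
/// every number, and two NaNs compare as equal).
fn cmp_nan_last(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so partial_cmp always returns Some here.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}
```

With the fallback, both `cmp_with_fallback(f64::NAN, 1.0)` and `cmp_with_fallback(1.0, f64::NAN)` return `Greater`, so each value claims to be greater than the other. `cmp_nan_last` returns `Greater` for the first and `Less` for the second, which gives `sort_by` a consistent order to work with.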

PTAL at my changes. I check for NaN values when I perform the range partition, so that there are no NaNs left among the values when they are sorted.
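The idea can be sketched outside of Arrow with a slice of `Option<f64>`, where `None` stands in for a null slot. The function name `sort_to_indices_sketch` is hypothetical, and putting the null and NaN indices last is an assumption for illustration only; it is not the kernel's actual code.

```rust
/// Hypothetical stand-in for the kernel: returns indices ordered by
/// ascending value, with null (None) and NaN indices placed last.
fn sort_to_indices_sketch(values: &[Option<f64>]) -> Vec<usize> {
    // Partition the index range: real numbers on one side,
    // nulls and NaNs together on the other.
    let (mut valids, rest): (Vec<usize>, Vec<usize>) = (0..values.len())
        .partition(|&i| matches!(values[i], Some(v) if !v.is_nan()));

    // No NaN remains among `valids`, so partial_cmp cannot return None.
    valids.sort_by(|&a, &b| {
        let (x, y) = (values[a].unwrap(), values[b].unwrap());
        x.partial_cmp(&y).unwrap()
    });

    // Append the null/NaN indices in their original order.
    valids.extend(rest);
    valids
}
```

Because the NaN check happens once per element during the partition, the comparison closure itself stays a plain `partial_cmp` with an `unwrap()` that can no longer fail.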

houqp · 2020-05-31T23:26:06Z

rust/arrow/src/compute/kernels/sort.rs

-    let (v, n): (Vec<usize>, Vec<usize>) =
-        range.partition(|index| values.is_valid(*index));
+    // perform a custom range partition for floats, to account for NaN
+    let (v, n): (Vec<usize>, Vec<usize>) = if values.data_type() == &DataType::Float32 {


looks like using match would be cleaner here?
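For illustration, a `match` version of the type check could look roughly like this. The `DataType` enum below is a reduced, hypothetical stand-in for the Arrow type, and `needs_nan_partition` is an invented name, so this is only a sketch of the shape being suggested.

```rust
/// Hypothetical, reduced stand-in for arrow's DataType enum.
#[derive(Debug, PartialEq)]
enum DataType {
    Float32,
    Float64,
    Int32,
}

/// Hypothetical helper: true for the types whose values can be NaN
/// and therefore need the custom range partition.
fn needs_nan_partition(data_type: &DataType) -> bool {
    match data_type {
        DataType::Float32 | DataType::Float64 => true,
        _ => false,
    }
}
```

A single `match` keeps the Float32 and Float64 arms next to each other and leaves one fallback arm for every other type, instead of a chain of `if`/`else if` equality checks.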

houqp · 2020-06-01T00:33:03Z

rust/arrow/src/compute/kernels/sort.rs

@@ -149,9 +167,13 @@ where
        .collect::<Vec<(u32, T::Native)>>();
    let mut nulls = null_indices;
    if !options.descending {
-        valids.sort_by_key(|a| a.1);
+        valids.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or_else(|| Ordering::Greater));


I am kind of curious: have you run any benchmarks to see whether changing to partial_cmp with unwrap_or_else results in any performance difference?

I haven't. I can try it out when I get a chance, as I'm also curious what the difference would be.


andygrove

LGTM overall although I am concerned there could be some edge cases around NaNs still. I think we can address that later if it comes up though.

wesm · 2020-07-09T21:10:25Z

Can this be merged or are there more changes to make?

nevi-me requested review from andygrove and paddyhoran May 15, 2020 16:36

nevi-me added the Component: Rust label May 15, 2020

nevi-me force-pushed the ARROW-7924 branch from 9641f6a to e44cbaf Compare May 15, 2020 16:52

nevi-me requested review from paddyhoran and removed request for paddyhoran May 15, 2020 16:52

andygrove reviewed May 15, 2020


houqp mentioned this pull request May 25, 2020

ARROW-8931: [Rust] add lexical sort support to arrow compute kernel #7265

Closed

nevi-me force-pushed the ARROW-7924 branch from 1be70a4 to d552b35 Compare May 31, 2020 01:29

houqp reviewed May 31, 2020


houqp reviewed Jun 1, 2020


nevi-me added 4 commits July 1, 2020 22:11

ARROW-7924: [Rust] Add sort for float types

962317c

This relaxes the trait bound of `std::cmp::Ord` to `std::cmp::PartialOrd` to enable sorting by floats

add missing Union datatype match

09d8d29

handle float NAN in sort

3e69902

address NaN floats in sort

db7e5fc

NaN gets partitioned with nulls before valid values are sorted

nevi-me force-pushed the ARROW-7924 branch from d552b35 to db7e5fc Compare July 1, 2020 20:14

fix test failures due to changing sort order

41bdb8c

andygrove approved these changes Jul 6, 2020


nevi-me closed this in 0304d20 Jul 10, 2020

asfimport mentioned this pull request Jul 10, 2020

[Rust] Add sort for float types #24144

Closed


nevi-me commented May 15, 2020

github-actions bot commented May 15, 2020

andygrove May 15, 2020

nevi-me May 15, 2020

andygrove May 15, 2020

andygrove May 15, 2020

nevi-me May 15, 2020

andygrove May 15, 2020

nevi-me May 31, 2020

houqp May 31, 2020

houqp Jun 1, 2020

nevi-me Jul 2, 2020

andygrove left a comment

wesm commented Jul 9, 2020
