ARROW-7924: [Rust] Add sort for float types #7193

Closed
wants to merge 5 commits into from

Contributor

@nevi-me nevi-me commented May 15, 2020

This relaxes the trait bound of std::cmp::Ord to std::cmp::PartialOrd to enable sorting by floats

@nevi-me nevi-me requested review from paddyhoran and removed request for paddyhoran May 15, 2020 16:52
@@ -149,9 +151,9 @@ where
         .collect::<Vec<(u32, T::Native)>>();
     let mut nulls = null_indices;
     if !options.descending {
-        valids.sort_by_key(|a| a.1);
+        valids.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
Member

Won't this panic if the vector contains any NaN values? partial_cmp would return None in that case.
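
To make the concern concrete, here is a minimal std-only sketch. The helper names `compare` and `sort_unwrap` are made up for illustration and are not part of the PR. Any comparison involving NaN yields `None`, so the `unwrap()` panics as soon as the sort compares a NaN with anything.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: the value a `sort_by` closure would get from `partial_cmp`.
fn compare(a: f64, b: f64) -> Option<Ordering> {
    a.partial_cmp(&b)
}

/// Hypothetical helper mirroring the comparator in the diff above.
/// Panics if the slice has at least two elements and any of them is NaN.
fn sort_unwrap(values: &mut [f64]) {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
}
```

Calling `sort_unwrap` on a slice such as `[1.0, f64::NAN]` panics, while a NaN-free slice sorts as expected.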

Contributor Author

Hmm, do we ever store NaN, or do we represent it as a null? I've never tried creating an array with NaN

Some(2.225),
Some(-1.01),
Some(-0.05),
None,
Member

Could you add f64::NAN to the test values?

Member

I can look at this in more detail over the weekend, but I think a special impl for f32/f64 is probably the way to go. Then you can add specific checks for NaN (f32::NAN / f64::NAN) before calling partial_cmp, and we'll need to decide whether NaN comes before or after valid numbers.
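
A rough sketch of what such a float-specific comparator could look like, using only std. The names `cmp_nan_last` and `sort_nan_last` are hypothetical, and placing NaN after all other values is an assumption made for the example, since that ordering is still undecided here. Note that the check has to use `is_nan()`, because `x == f64::NAN` is always false.

```rust
use std::cmp::Ordering;

/// Hypothetical comparator: orders NaN after every other value (an assumed policy).
fn cmp_nan_last(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so partial_cmp cannot return None here.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}

/// Hypothetical helper: sorts ascending with all NaNs at the end.
fn sort_nan_last(values: &mut [f64]) {
    values.sort_by(|a, b| cmp_nan_last(*a, *b));
}
```

Flipping the two mixed arms would place NaN first instead.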

Contributor Author

PTAL at the solution that I've just pushed. Given that the unwrap() will only fail if there's a NaN, I've replaced it with a default of std::cmp::Ordering::Greater to treat NaN as the highest value. In the descending sort path, it ends up being inverted to the lowest value.

I suppose a better approach might be to let the nulls_first sort option drive the behaviour, with the below sort options:

  • ascending, nulls last: Ordering::Greater placing NaNs before the first null
  • ascending, nulls first: Ordering::Less placing NaNs after the last null
  • descending, nulls last: Ordering::Greater
  • descending, nulls first: Ordering::Less

... so the ordering would be determined by the null behaviour (it helped to write it out 😄)

Member

The problem with this approach is that the NaN could be a, or b, or both so this comparison is now non-deterministic and inconsistent. I think implementing this specifically for Float32Array and Float64Array and checking for NaN on both values is the only way we can handle this correctly.
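
The inconsistency can be seen with a small std-only sketch (the function name `cmp_fallback_greater` is made up for illustration). With a fixed fallback, the comparator claims both that NaN is greater than 1.0 and that 1.0 is greater than NaN, so it is not a total order. The `sort_by` documentation states that the resulting order is unspecified in that case.

```rust
use std::cmp::Ordering;

/// Hypothetical helper mirroring the fallback approach:
/// any comparison involving NaN, on either side, reports `Greater`.
fn cmp_fallback_greater(a: f64, b: f64) -> Ordering {
    a.partial_cmp(&b).unwrap_or(Ordering::Greater)
}
```

Here `cmp_fallback_greater(f64::NAN, 1.0)` and `cmp_fallback_greater(1.0, f64::NAN)` both return `Ordering::Greater`, which is why checking `is_nan()` on both operands is needed for a consistent result.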

Contributor Author

PTAL at my changes. I now check for NaN values when I perform the range partition, so that NaNs are grouped with the nulls and none remain among the values being sorted.
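
As an illustration of that idea only, here is a self-contained sketch in which a slice of `Option<f64>` stands in for an Arrow float array. The stand-in type and the function names `partition_indices` and `sort_indices` are assumptions for the example, not the PR's code, and nulls are placed last for simplicity.

```rust
/// Splits indices into (sortable, null-or-NaN).
/// `Option<f64>` is a stand-in for an Arrow float array with a validity bitmap.
fn partition_indices(values: &[Option<f64>]) -> (Vec<usize>, Vec<usize>) {
    (0..values.len()).partition(|&i| matches!(values[i], Some(x) if !x.is_nan()))
}

/// Returns indices in ascending value order, with nulls and NaNs at the end.
fn sort_indices(values: &[Option<f64>]) -> Vec<usize> {
    let (mut valid, nulls) = partition_indices(values);
    // No remaining value is null or NaN, so neither unwrap can fail.
    valid.sort_by(|&a, &b| {
        values[a]
            .unwrap()
            .partial_cmp(&values[b].unwrap())
            .unwrap()
    });
    valid.extend(nulls);
    valid
}
```

Because NaNs never reach the comparator, the plain `partial_cmp(...).unwrap()` is safe in this arrangement.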

-    let (v, n): (Vec<usize>, Vec<usize>) =
-        range.partition(|index| values.is_valid(*index));
+    // perform a custom range partition for floats, to account for NaN
+    let (v, n): (Vec<usize>, Vec<usize>) = if values.data_type() == &DataType::Float32 {
Member

Looks like using match would be cleaner here?
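
A sketch of how a `match` could replace the chained `if`/`else if` on the data type. To stay self-contained it uses a made-up `Column` enum rather than Arrow's `DataType` and array downcasting, so the type and the function name `partition_column` are assumptions for illustration only.

```rust
/// Stand-in for a typed Arrow array; not an Arrow type.
enum Column {
    Int32(Vec<Option<i32>>),
    Float32(Vec<Option<f32>>),
    Float64(Vec<Option<f64>>),
}

/// Splits indices into (sortable, null-or-NaN), with one arm per type.
fn partition_column(column: &Column) -> (Vec<usize>, Vec<usize>) {
    match column {
        // Float arms also treat NaN as not sortable.
        Column::Float32(v) => {
            (0..v.len()).partition(|&i| matches!(v[i], Some(x) if !x.is_nan()))
        }
        Column::Float64(v) => {
            (0..v.len()).partition(|&i| matches!(v[i], Some(x) if !x.is_nan()))
        }
        // Integer types only need the validity check.
        Column::Int32(v) => (0..v.len()).partition(|&i| v[i].is_some()),
    }
}
```

Each type gets its own arm, and the compiler flags any variant that is left unhandled.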

@@ -149,9 +167,13 @@ where
         .collect::<Vec<(u32, T::Native)>>();
     let mut nulls = null_indices;
     if !options.descending {
-        valids.sort_by_key(|a| a.1);
+        valids.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or_else(|| Ordering::Greater));
Member

I am kind of curious: have you done any benchmarking to see whether changing to partial_cmp with unwrap_or_else results in any performance difference?
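
One rough way to get a first impression is a std-only timing sketch like the one below. It is not a rigorous benchmark: it times a single run on integer data (since `sort_by_key` needs `Ord`), and the function names and the simple LCG data generator are assumptions for illustration. A proper measurement would use a benchmark harness such as Criterion.

```rust
use std::cmp::Ordering;
use std::time::{Duration, Instant};

/// Deterministic pseudo-random (index, value) pairs; a simple LCG avoids external crates.
fn make_pairs(n: usize) -> Vec<(u32, i64)> {
    let mut state: u64 = 42;
    (0..n as u32)
        .map(|i| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (i, (state >> 33) as i64)
        })
        .collect()
}

/// Times one run of each comparator and reports whether both produced the same order.
fn time_sorts(n: usize) -> (Duration, Duration, bool) {
    let mut by_key = make_pairs(n);
    let mut by_partial = by_key.clone();

    let start = Instant::now();
    by_key.sort_by_key(|a| a.1);
    let key_time = start.elapsed();

    let start = Instant::now();
    by_partial.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(Ordering::Greater));
    let partial_time = start.elapsed();

    (key_time, partial_time, by_key == by_partial)
}
```

Comparing the two returned durations across several sizes and repeated runs, in a release build, would give a first indication of any difference.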

Contributor Author

I haven't. I can try it out when I get a chance, as I'm also curious about what the difference would be.

This relaxes the trait bound of `std::cmp::Ord` to `std::cmp::PartialOrd` to enable sorting by floats
NaN gets partitioned with nulls before valid values are sorted
Member

@andygrove andygrove left a comment

LGTM overall although I am concerned there could be some edge cases around NaNs still. I think we can address that later if it comes up though.

@wesm
Member

wesm commented Jul 9, 2020

Can this be merged or are there more changes to make?
