ARROW-7924: [Rust] Add sort for float types #7193
@@ -149,9 +151,9 @@ where
.collect::<Vec<(u32, T::Native)>>();
let mut nulls = null_indices;
if !options.descending {
valids.sort_by_key(|a| a.1);
valids.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
Won't this panic if the vector contains any `NaN` values? `partial_cmp` would return `None` in that case.
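To make the concern concrete, here is a minimal sketch using only the standard library. The helper name `compare` is hypothetical and is not part of the PR; it returns the `Option` that the reviewed line would unwrap.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: the same comparison as the reviewed line,
/// but returning the `Option` instead of unwrapping it.
fn compare(a: f64, b: f64) -> Option<Ordering> {
    // `partial_cmp` yields `None` whenever either operand is NaN,
    // so calling `.unwrap()` on the result would panic in that case.
    a.partial_cmp(&b)
}
```

For ordinary values, `compare(1.0, 2.0)` is `Some(Ordering::Less)`; with `f64::NAN` on either side the result is `None`.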
Hmm, do we ever store NaN, or do we represent it as a null? I've never tried creating an array with NaN
Some(2.225),
Some(-1.01),
Some(-0.05),
None,
Could you add `f64::NAN` to the test values?
I can look at this in more detail over the weekend, but I think having a special impl for f32/f64 is probably the way to go. Then you can add specific checks for `f32::NAN` and `f64::NAN` before calling `partial_cmp`, and we'll need to decide whether `NaN` comes before or after valid numbers.
PTAL at the solution that I've just pushed. Given that the `unwrap()` will only fail if there's a `NAN`, I've replaced it with a default of `std::cmp::Ordering::Greater` to treat `NAN` as the highest value. In the descending sort path, it ends up being inverted to the lowest value.
I suppose a better approach might be to let the `nulls_first` sort option drive the behaviour, with the below sort options:
- ascending, nulls last: `Ordering::Greater`, placing NaNs before the first null
- ascending, nulls first: `Ordering::Less`, placing NaNs after the last null
- descending, nulls last: `Ordering::Greater`
- descending, nulls first: `Ordering::Less`
... so the ordering is determined by the null behaviour (it helped to write it out 😄)
The problem with this approach is that the NaN could be a, or b, or both so this comparison is now non-deterministic and inconsistent. I think implementing this specifically for Float32Array and Float64Array and checking for NaN on both values is the only way we can handle this correctly.
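The inconsistency can be sketched with two standalone comparators. Both function names are hypothetical and neither is taken from the PR: `fallback_cmp` mirrors the `Ordering::Greater` default discussed above, and `nan_last_cmp` is one possible way to check both operands, assuming NaN should sort after every number.

```rust
use std::cmp::Ordering;

/// Mirrors the fallback under discussion: any comparison involving NaN
/// reports `Greater`, regardless of which side the NaN is on.
fn fallback_cmp(a: f64, b: f64) -> Ordering {
    a.partial_cmp(&b).unwrap_or(Ordering::Greater)
}

/// Hypothetical NaN-aware comparator: inspects both operands so the
/// answer stays consistent when the arguments are swapped.
/// Assumption: NaN sorts after all non-NaN values.
fn nan_last_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither operand is NaN, so `partial_cmp` cannot return `None`.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}
```

With `fallback_cmp`, both `(NaN, 1.0)` and `(1.0, NaN)` compare as `Greater`, so each value claims to be larger than the other, which is not a valid ordering for `sort_by`. `nan_last_cmp` returns `Greater` and `Less` respectively for the same two calls.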
PTAL at my changes. I now check for NaN values when I perform the range partition, so that no NaNs or nulls remain among the values being sorted.
let (v, n): (Vec<usize>, Vec<usize>) =
range.partition(|index| values.is_valid(*index));
// perform a custom range partition for floats, to account for NaN
let (v, n): (Vec<usize>, Vec<usize>) = if values.data_type() == &DataType::Float32 {
Looks like using `match` would be cleaner here?
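As a rough illustration of the `match` suggestion, the sketch below uses a hypothetical `Column` enum as a stand-in for a typed Arrow array (with `None` modelling a null slot). It does not use the real Arrow API, and `partition_indices` is an invented name.

```rust
/// Hypothetical stand-in for a typed array; `None` models a null slot.
enum Column {
    Float32(Vec<Option<f32>>),
    Float64(Vec<Option<f64>>),
    Int32(Vec<Option<i32>>),
}

/// Splits row indices into (valid, null-like).
/// For float columns NaN is grouped with the nulls; other types only
/// check whether the slot holds a value.
fn partition_indices(column: &Column) -> (Vec<usize>, Vec<usize>) {
    match column {
        Column::Float32(v) => {
            (0..v.len()).partition(|&i| matches!(v[i], Some(x) if !x.is_nan()))
        }
        Column::Float64(v) => {
            (0..v.len()).partition(|&i| matches!(v[i], Some(x) if !x.is_nan()))
        }
        Column::Int32(v) => (0..v.len()).partition(|&i| v[i].is_some()),
    }
}
```

Compared with chained `if`/`else if` checks on the data type, each case sits on its own arm and the non-float fallback is explicit.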
@@ -149,9 +167,13 @@ where
.collect::<Vec<(u32, T::Native)>>();
let mut nulls = null_indices;
if !options.descending {
valids.sort_by_key(|a| a.1);
valids.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or_else(|| Ordering::Greater));
I am kind of curious: have you done any benchmarks to see whether changing to `partial_cmp` with `unwrap_or_else` makes any performance difference?
I haven't. I can try it out when I get a chance, as I'm also curious what the difference would be.
This relaxes the trait bound of `std::cmp::Ord` to `std::cmp::PartialOrd` to enable sorting by floats
NaN gets partitioned with nulls before valid values are sorted
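The approach described in these two commit messages can be sketched roughly as follows. This is a simplified standalone version over `&[Option<f64>]`, not the PR's actual code: the function name and the `descending` / `nulls_first` parameters are assumptions modelled on the sort options mentioned in the discussion.

```rust
/// Hypothetical sketch: returns the indices that would sort `values`.
/// `None` models a null; NaN is treated like a null.
fn sort_to_indices(values: &[Option<f64>], descending: bool, nulls_first: bool) -> Vec<usize> {
    let mut valids: Vec<(usize, f64)> = Vec::new();
    let mut nulls: Vec<usize> = Vec::new();
    // Partition first, so NaN never reaches the comparator.
    for (i, v) in values.iter().enumerate() {
        match v {
            Some(x) if !x.is_nan() => valids.push((i, *x)),
            _ => nulls.push(i),
        }
    }
    // Only `PartialOrd` is needed; `unwrap` cannot fail because NaN was removed.
    if descending {
        valids.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    } else {
        valids.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    }
    let sorted = valids.into_iter().map(|(i, _)| i);
    if nulls_first {
        nulls.into_iter().chain(sorted).collect()
    } else {
        sorted.chain(nulls).collect()
    }
}
```

Because NaN and null slots are removed before sorting, the comparator only ever sees values for which `partial_cmp` is defined, which sidesteps the question of where NaN belongs relative to ordinary numbers.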
LGTM overall although I am concerned there could be some edge cases around NaNs still. I think we can address that later if it comes up though.
Can this be merged or are there more changes to make?