-
Notifications
You must be signed in to change notification settings - Fork 662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distinct kernels (#960) (#4438) #4716
Conversation
|
||
let r_v = r.as_any_dictionary_opt(); | ||
let r = r_v.map(|x| x.values().as_ref()).unwrap_or(r); | ||
let r_t = r.data_type(); | ||
|
||
if l_t != r_t || l_t.is_nested() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically stricter than the previous logic, as this hasn't been released yet I think this is fine
2c13ff3
to
a2c1cfe
Compare
I've integrated this into apache/datafusion#7282 for additional test verification |
@@ -311,7 +411,7 @@ fn apply_op<T: ArrayOrd>( | |||
(Some(l_s), Some(r_s)) => { | |||
let a = l.value(l_s); | |||
let b = r.value(r_s); | |||
std::iter::once(op(a, b)).collect() | |||
std::iter::once(op(a, b) ^ neg).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a pre-existing bug, that resulted in incorrect results when comparing scalars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also please add a documentation comment explaining how to interpret the neg
argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @tustvold
I went over the logic and tests carefully. There are a few comments I suggested that would add clarity to what IS DISTINCT
means but I don't think they are required
@@ -437,6 +437,15 @@ impl<Ptr: std::borrow::Borrow<Option<bool>>> FromIterator<Ptr> for BooleanArray | |||
} | |||
} | |||
|
|||
impl From<BooleanBuffer> for BooleanArray { | |||
fn from(values: BooleanBuffer) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a nice UX improvement -- I had been looking for that when working on group by in DataFusion
@@ -129,7 +134,35 @@ pub fn gt_eq(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<BooleanArray, ArrowErro | |||
compare_op(Op::GreaterEqual, lhs, rhs) | |||
} | |||
|
|||
/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`] | |
/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]. `IS DISTINCT` | |
/// similar to `NotEq`, differing in null handling. Two operands are considered DISTINCT | |
/// if they have a different value or if one of them is NULL and the other isn't. | |
/// The result of `IS DISTINCT FROM` is never NULL. |
compare_op(Op::Distinct, lhs, rhs) | ||
} | ||
|
||
/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`] | |
/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]. `IS NOT DISTINCT` | |
/// similar to `Eq`, differing in null handling. Two operands are considered NOT DISTINCT | |
/// if they have the same value or if both of them are NULL. | |
/// The result of `IS NOT DISTINCT FROM` is never NULL. |
|
||
let c = |((l, r), n)| ((l ^ r) | (l & r & n)); | ||
let buffer = l.zip(r).zip(ne).map(c).collect(); | ||
BooleanBuffer::new(buffer, 0, len).into() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for no null buffer -- not sure if that is worth calling out in a comment or not
let l = nulls.inner().bit_chunks().iter_padded(); | ||
let ne = values.bit_chunks().iter_padded(); | ||
let c = |(l, n)| u64::not(l) | n; | ||
let buffer = l.zip(ne).map(c).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to understand why this clause didn't mirror the Op::NotDistinct
clause
NotDistinct
can simply use https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#impl-BitOr%3C%26BooleanBuffer%3E-for-%26BooleanBuffer
but It seems it is because there is no equivalent of https://doc.rust-lang.org/nightly/core/ops/trait.Neg.html for BooleanBuffer and even if there were that would result in allocating a temporary buffer that might wasteful
Though now that I write this I wonder if performance could be improved here by deferring / reusing the buffer allocated by values()
. Maybe as some future optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BooleanBuffer does implement https://doc.rust-lang.org/std/ops/trait.Not.html
However, as you surmised, computing the mask as is done here is marginally better for performance.
Deferring the buffer allocated by values would result in non-trivial additional codegen, and would likely confuse LLVMs already temperamental vectorisation,
Reusing the buffer is potentially worth exploring. I suspect that in most cases the performance gain would be marginal at best, but I haven't profiled this
@@ -311,7 +411,7 @@ fn apply_op<T: ArrayOrd>( | |||
(Some(l_s), Some(r_s)) => { | |||
let a = l.value(l_s); | |||
let b = r.value(r_s); | |||
std::iter::once(op(a, b)).collect() | |||
std::iter::once(op(a, b) ^ neg).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also please add a documentation comment explaining how to interpret the neg
argument?
} | ||
None => values_ne, | ||
}) | ||
Ok(distinct(&v1, &v2)?.values().clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
Which issue does this PR close?
Closes #960
Closes #4438
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?