Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distinct kernels (#960) (#4438) #4716

Merged
merged 7 commits into from
Aug 21, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Aug 18, 2023

Which issue does this PR close?

Closes #960
Closes #4438

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 18, 2023

let r_v = r.as_any_dictionary_opt();
let r = r_v.map(|x| x.values().as_ref()).unwrap_or(r);
let r_t = r.data_type();

if l_t != r_t || l_t.is_nested() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically stricter than the previous logic, as this hasn't been released yet I think this is fine

@tustvold tustvold marked this pull request as ready for review August 18, 2023 19:18
@tustvold
Copy link
Contributor Author

I've integrated this into apache/datafusion#7282 for additional test verification

@@ -311,7 +411,7 @@ fn apply_op<T: ArrayOrd>(
(Some(l_s), Some(r_s)) => {
let a = l.value(l_s);
let b = r.value(r_s);
std::iter::once(op(a, b)).collect()
std::iter::once(op(a, b) ^ neg).collect()
Copy link
Contributor Author

@tustvold tustvold Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a pre-existing bug, that resulted in incorrect results when comparing scalars

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please add a documentation comment explaining how to interpret the neg argument?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @tustvold

I went over the logic and tests carefully. There are a few comments I suggested that would add clarity to what IS DISTINCT means but I don't think they are required

@@ -437,6 +437,15 @@ impl<Ptr: std::borrow::Borrow<Option<bool>>> FromIterator<Ptr> for BooleanArray
}
}

impl From<BooleanBuffer> for BooleanArray {
fn from(values: BooleanBuffer) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice UX improvement -- I had been looking for that when working on group by in DataFusion

@@ -129,7 +134,35 @@ pub fn gt_eq(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<BooleanArray, ArrowErro
compare_op(Op::GreaterEqual, lhs, rhs)
}

/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]
/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]. `IS DISTINCT`
/// similar to `NotEq`, differing in null handling. Two operands are considered DISTINCT
/// if they have a different value or if one of them is NULL and the other isn't.
/// The result of `IS DISTINCT FROM` is never NULL.

compare_op(Op::Distinct, lhs, rhs)
}

/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]
/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]. `IS NOT DISTINCT`
/// similar to `Eq`, differing in null handling. Two operands are considered NOT DISTINCT
/// if they have the same value or if both of them are NULL.
/// The result of `IS NOT DISTINCT FROM` is never NULL.


let c = |((l, r), n)| ((l ^ r) | (l & r & n));
let buffer = l.zip(r).zip(ne).map(c).collect();
BooleanBuffer::new(buffer, 0, len).into()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for no null buffer -- not sure if that is worth calling out in a comment or not

Comment on lines +271 to +274
let l = nulls.inner().bit_chunks().iter_padded();
let ne = values.bit_chunks().iter_padded();
let c = |(l, n)| u64::not(l) | n;
let buffer = l.zip(ne).map(c).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand why this clause didn't mirror the Op::NotDistinct clause

NotDistinct can simply use https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#impl-BitOr%3C%26BooleanBuffer%3E-for-%26BooleanBuffer

but It seems it is because there is no equivalent of https://doc.rust-lang.org/nightly/core/ops/trait.Neg.html for BooleanBuffer and even if there were that would result in allocating a temporary buffer that might wasteful

Though now that I write this I wonder if performance could be improved here by deferring / reusing the buffer allocated by values(). Maybe as some future optimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BooleanBuffer does implement https://doc.rust-lang.org/std/ops/trait.Not.html

However, as you surmised, computing the mask as is done here is marginally better for performance.

Deferring the buffer allocated by values would result in non-trivial additional codegen, and would likely confuse LLVMs already temperamental vectorisation,

Reusing the buffer is potentially worth exploring. I suspect that in most cases the performance gain would be marginal at best, but I haven't profiled this

@@ -311,7 +411,7 @@ fn apply_op<T: ArrayOrd>(
(Some(l_s), Some(r_s)) => {
let a = l.value(l_s);
let b = r.value(r_s);
std::iter::once(op(a, b)).collect()
std::iter::once(op(a, b) ^ neg).collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please add a documentation comment explaining how to interpret the neg argument?

}
None => values_ne,
})
Ok(distinct(&v1, &v2)?.values().clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@tustvold tustvold merged commit bce0b41 into apache:master Aug 21, 2023
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Equality kernel where null==null gives true Consider adding is_distinct_from kernels
2 participants