Add distinct kernels (#960) (#4438) #4716

tustvold · 2023-08-18T17:33:40Z

Which issue does this PR close?

Closes #960
Closes #4438

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2023-08-18T17:34:27Z

arrow-ord/src/cmp.rs


    let r_v = r.as_any_dictionary_opt();
    let r = r_v.map(|x| x.values().as_ref()).unwrap_or(r);
+    let r_t = r.data_type();
+
+    if l_t != r_t || l_t.is_nested() {


This is technically stricter than the previous logic. As this hasn't been released yet, I think this is fine.

tustvold · 2023-08-18T19:19:57Z

I've integrated this into apache/datafusion#7282 for additional test verification

tustvold · 2023-08-18T20:30:49Z

arrow-ord/src/cmp.rs

@@ -311,7 +411,7 @@ fn apply_op<T: ArrayOrd>(
        (Some(l_s), Some(r_s)) => {
            let a = l.value(l_s);
            let b = r.value(r_s);
-            std::iter::once(op(a, b)).collect()
+            std::iter::once(op(a, b) ^ neg).collect()


This was a pre-existing bug that resulted in incorrect results when comparing scalars.

Can we also please add a documentation comment explaining how to interpret the neg argument?

alamb

Looks good to me -- thank you @tustvold

I went over the logic and tests carefully. I suggested a few comments that would add clarity to what IS DISTINCT means, but I don't think they are required.

alamb · 2023-08-21T12:15:40Z

arrow-array/src/array/boolean_array.rs

@@ -437,6 +437,15 @@ impl<Ptr: std::borrow::Borrow<Option<bool>>> FromIterator<Ptr> for BooleanArray
    }
 }

+impl From<BooleanBuffer> for BooleanArray {
+    fn from(values: BooleanBuffer) -> Self {


this is a nice UX improvement -- I had been looking for that when working on group by in DataFusion

alamb · 2023-08-21T12:20:58Z

arrow-ord/src/cmp.rs

@@ -129,7 +134,35 @@ pub fn gt_eq(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<BooleanArray, ArrowErro
    compare_op(Op::GreaterEqual, lhs, rhs)
 }

+/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]


Suggested change

-/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]
+/// Perform `left IS DISTINCT FROM right` operation on two [`Datum`]. `IS DISTINCT`
+/// similar to `NotEq`, differing in null handling. Two operands are considered DISTINCT
+/// if they have a different value or if one of them is NULL and the other isn't.
+/// The result of `IS DISTINCT FROM` is never NULL.

alamb · 2023-08-21T12:21:55Z

arrow-ord/src/cmp.rs

+    compare_op(Op::Distinct, lhs, rhs)
+}
+
+/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]


Suggested change

-/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]
+/// Perform `left IS NOT DISTINCT FROM right` operation on two [`Datum`]. `IS NOT DISTINCT`
+/// similar to `Eq`, differing in null handling. Two operands are considered NOT DISTINCT
+/// if they have the same value or if both of them are NULL.
+/// The result of `IS NOT DISTINCT FROM` is never NULL.
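For readers less familiar with the SQL semantics described in the two suggested doc comments, here is a minimal per-value sketch. It models a nullable value as `Option<T>` and is not the arrow kernel itself; the function names `is_distinct`, `is_not_distinct`, and `neq` are hypothetical and exist only for this illustration.

```rust
/// `IS DISTINCT FROM`: a null-safe "not equal" whose result is never NULL.
fn is_distinct<T: PartialEq>(l: Option<T>, r: Option<T>) -> bool {
    match (l, r) {
        (Some(a), Some(b)) => a != b, // both non-null: compare the values
        (None, None) => false,        // NULL is not distinct from NULL
        _ => true,                    // exactly one side is NULL
    }
}

/// `IS NOT DISTINCT FROM`: the exact negation of `is_distinct`.
fn is_not_distinct<T: PartialEq>(l: Option<T>, r: Option<T>) -> bool {
    !is_distinct(l, r)
}

/// Ordinary SQL-style `!=` for contrast: the result is NULL (`None`)
/// as soon as either operand is NULL.
fn neq<T: PartialEq>(l: Option<T>, r: Option<T>) -> Option<bool> {
    Some(l? != r?)
}
```

The contrast with `neq` is the point: the plain comparison propagates NULL, whereas both distinct variants always return a definite boolean.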

alamb · 2023-08-21T12:26:13Z

arrow-ord/src/cmp.rs

+
+                    let c = |((l, r), n)| ((l ^ r) | (l & r & n));
+                    let buffer = l.zip(r).zip(ne).map(c).collect();
+                    BooleanBuffer::new(buffer, 0, len).into()


💯 for no null buffer -- not sure if that is worth calling out in a comment or not
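A small sketch of why the closure above needs no null buffer. It assumes, since the diff excerpt does not show the definitions, that `l` and `r` are validity bitmasks (1 = non-null) and that `n` is the "values differ" mask. The names `distinct_chunk` and `distinct_mask` are hypothetical.

```rust
/// Computes `IS DISTINCT FROM` for 64 rows at a time.
///
/// Assumed bit meanings (one bit per row):
/// - `l_valid` / `r_valid`: 1 where the left / right value is non-null
/// - `ne`: 1 where the underlying values differ (arbitrary on null rows)
fn distinct_chunk(l_valid: u64, r_valid: u64, ne: u64) -> u64 {
    // Distinct when exactly one side is null, or when both sides are
    // non-null and the values differ. Every output bit is well defined,
    // so the result needs no validity bitmap of its own.
    (l_valid ^ r_valid) | (l_valid & r_valid & ne)
}

/// Applies `distinct_chunk` across whole bitmaps, mirroring the
/// zip / map / collect shape of the diff.
fn distinct_mask(l_valid: &[u64], r_valid: &[u64], ne: &[u64]) -> Vec<u64> {
    l_valid
        .iter()
        .zip(r_valid)
        .zip(ne)
        .map(|((&l, &r), &n)| distinct_chunk(l, r, n))
        .collect()
}
```

Take four rows with bit 0 as the first row: row 0 has two different non-null values, row 1 has two equal non-null values, row 2 is null only on the right, and row 3 is null on both sides. The `ne` bit on row 3 is garbage and is masked out by `l & r`.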

alamb · 2023-08-21T12:33:50Z

arrow-ord/src/cmp.rs

+                        let l = nulls.inner().bit_chunks().iter_padded();
+                        let ne = values.bit_chunks().iter_padded();
+                        let c = |(l, n)| u64::not(l) | n;
+                        let buffer = l.zip(ne).map(c).collect();


It took me a while to understand why this clause didn't mirror the Op::NotDistinct clause.

NotDistinct can simply use https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#impl-BitOr%3C%26BooleanBuffer%3E-for-%26BooleanBuffer

but it seems it is because there is no equivalent of https://doc.rust-lang.org/nightly/core/ops/trait.Neg.html for BooleanBuffer, and even if there were, that would result in allocating a temporary buffer that might be wasteful.

Though now that I write this I wonder if performance could be improved here by deferring / reusing the buffer allocated by values(). Maybe as some future optimization.

BooleanBuffer does implement https://doc.rust-lang.org/std/ops/trait.Not.html

However, as you surmised, computing the mask as is done here is marginally better for performance.

Deferring the buffer allocated by values would result in non-trivial additional codegen, and would likely confuse LLVM's already temperamental vectorisation.

Reusing the buffer is potentially worth exploring. I suspect that in most cases the performance gain would be marginal at best, but I haven't profiled this.
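The asymmetry discussed in this thread can be sketched on plain `u64` chunks. This assumes the clause handles the case where only one operand has nulls, that `l` is that operand's validity mask (1 = non-null), and that `n` is the "values differ" mask. Both function names are hypothetical.

```rust
/// `IS DISTINCT FROM` when only one operand has nulls: a row is distinct
/// if that operand is null there, or if the values differ.
/// The NOT is fused into the same pass, so no temporary inverted
/// validity bitmap is materialised.
fn distinct_one_sided(valid: &[u64], ne: &[u64]) -> Vec<u64> {
    valid.iter().zip(ne).map(|(&v, &n)| !v | n).collect()
}

/// `IS NOT DISTINCT FROM` in the same situation: a row matches only if
/// the nullable operand is non-null and the values are equal. This is a
/// plain AND of two masks, which is why it can reuse an existing
/// bitwise operator on whole buffers.
fn not_distinct_one_sided(valid: &[u64], eq: &[u64]) -> Vec<u64> {
    valid.iter().zip(eq).map(|(&v, &e)| v & e).collect()
}
```

Note that `!v` also sets the padding bits beyond the logical length, so a caller has to truncate the result to the real row count. In the diff this appears to be what `BooleanBuffer::new(buffer, 0, len)` takes care of.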

alamb · 2023-08-21T12:35:05Z

arrow-ord/src/cmp.rs

@@ -311,7 +411,7 @@ fn apply_op<T: ArrayOrd>(
        (Some(l_s), Some(r_s)) => {
            let a = l.value(l_s);
            let b = r.value(r_s);
-            std::iter::once(op(a, b)).collect()
+            std::iter::once(op(a, b) ^ neg).collect()


Can we also please add a documentation comment explaining how to interpret the neg argument?
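One way to read the `neg` argument, sketched on a simplified model rather than the real `apply_op`: every result of `op` is XORed with `neg`, so passing `neg = true` inverts the comparison and lets a single `op` serve both a predicate and its negation. The `Operand` enum and the `i32`-only signature are assumptions made for this illustration.

```rust
/// Hypothetical stand-in for "an array or a single scalar value".
#[derive(Clone, Copy)]
enum Operand<'a> {
    Scalar(i32),
    Array(&'a [i32]),
}

/// Applies `op` pairwise. When `neg` is true every result is inverted.
fn apply_op(
    l: Operand<'_>,
    r: Operand<'_>,
    neg: bool,
    op: impl Fn(i32, i32) -> bool,
) -> Vec<bool> {
    use Operand::*;
    match (l, r) {
        // Scalar against scalar: the branch the diff fixes. Without the
        // `^ neg` here, a negated comparison of two scalars returned the
        // un-negated result.
        (Scalar(a), Scalar(b)) => vec![op(a, b) ^ neg],
        (Scalar(a), Array(r)) => r.iter().map(|&b| op(a, b) ^ neg).collect(),
        (Array(l), Scalar(b)) => l.iter().map(|&a| op(a, b) ^ neg).collect(),
        (Array(l), Array(r)) => l.iter().zip(r).map(|(&a, &b)| op(a, b) ^ neg).collect(),
    }
}
```

With an equality closure as `op`, `neg = false` yields `eq` and `neg = true` yields `neq`, for every combination of scalar and array operands.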

alamb · 2023-08-21T12:35:55Z

arrow-ord/src/partition.rs

-        }
-        None => values_ne,
-    })
+    Ok(distinct(&v1, &v2)?.values().clone())


github-actions bot added the arrow label (Changes to the arrow crate) Aug 18, 2023

Add distinct kernels (apache#960) (apache#4438)

a2c1cfe

tustvold force-pushed the distinct-experiments branch from 2c13ff3 to a2c1cfe Compare August 18, 2023 17:35

tustvold added 2 commits August 18, 2023 19:08

Fixes

d55bf0c

Add tests

52fa67c

tustvold marked this pull request as ready for review August 18, 2023 19:18

tustvold added 3 commits August 18, 2023 20:55

Handle NullArray

55f5839

Fix comparisons between scalar and empty array

2aff33d

Clippy

e5f0784

tustvold mentioned this pull request Aug 21, 2023

Is Distinct From Incorrectly Handles Masked Nulls apache/datafusion#7332

Closed

alamb approved these changes Aug 21, 2023

Review feedback

055124b

tustvold merged commit bce0b41 into apache:master Aug 21, 2023
25 checks passed

tustvold mentioned this pull request Aug 21, 2023

Equality kernel where null==null gives true #4438

Closed
