
Add scalar comparison kernels for DictionaryArray #984

Closed

Conversation

matthewmturner
Contributor

Which issue does this PR close?

Closes #869

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 27, 2021
@matthewmturner
Contributor Author

@alamb I started the work on this. Would you be able to give it a quick look when you get the chance to make sure it's going in the right direction?

let values = $left
.values()
.as_any()
.downcast_ref::<StringArray>()

Contributor

The values array can be anything (not always a StringArray) -- perhaps this would be a good place to use the dyn_XX kernels -- to compare the values array with $right.

Contributor Author

From my understanding of the dyn kernels, those can't be used when comparing to a constant, right?

Contributor

@alamb alamb Nov 29, 2021

🤔 yes you are correct -- we would need to add dyn_XX_lit type kernels, but that seems a bit overkill for this PR

Contributor Author

If the primary use case for this PR was comparing a dict array to a constant, then maybe it makes sense for me to do a separate PR for that first and then come back to this?

Contributor

If the primary use case for this PR was comparing a dict array to a constant, then maybe it makes sense for me to do a separate PR for that first and then come back to this?

I think focusing on the use case of comparing a dict array to a constant is the best choice for now.

Contributor Author

Ok! Will start with that.

Contributor Author

@alamb I've been reviewing this but I think I might be missing something. My understanding is that my code above is for getting the dictionary values, which can be of any type (of course, above I'm only handling StringArray).

        let values = $left
            .values()
            .as_any()
            .downcast_ref::<StringArray>()
            .unwrap()

But then you mention using the new dyn_xx kernels / creating dyn_xx_lit kernels. Since there's no actual compute being done here, what would the dyn kernels be used for? Or were you referring to using the kernels to replace more than just that section of code?

To me it looks like I need a macro to downcast DictionaryArray.values() into whatever type the values are, and then I could use something like dyn_xx_lit on that in order to get the comparison results. Is this roughly what you had in mind?

Contributor

I am very sorry for confusing this conversation with mentioning dyn_xx_lit.

What I was (inarticulately) trying to say was that once you have eq_dict_scalar (and you will likely also need eq_dict_scalar_utf8), we will end up with several different ways to compare an array to a scalar, depending on the array type.

So I was thinking ahead to adding functions like

fn eq_scalar_utf8_dyn(array: &dyn Array, right: &str) -> Result<BooleanArray> {
  // do dispatch to the right kernel based on type of array
}

But definitely not for this PR

Comment on lines 219 to 222
let comparison = (0..$left.len()).map(|i| unsafe {
let key = $left.keys().value_unchecked(i).to_usize().unwrap();
$op(values.value_unchecked(key), $right)
});

Contributor

I think one of the main points of this ticket is to avoid the call here to values.value_unchecked.

I like to think about the goal by asking "what would happen with a DictionaryArray with 1,000,000 entries but a dictionary of size 1?" -- the way you have this PR, I think we would call $op 1,000,000 times. The idea is to call $op 1 time.

So the pattern I think we are looking for, at least for the constant kernels, is:

In pseudo code:

let values = dict_array.values();
let comparison_result_on_values = apply_op_to_values();
let result = dict_array.keys().iter().map(|index| comparison_result_on_values[index]).collect()
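
As a concrete illustration of that pattern, here is a minimal sketch (not code from this PR; it assumes a string dictionary, the existing eq_utf8_scalar kernel, and in-range non-negative keys):

use arrow::array::{Array, BooleanArray, DictionaryArray, StringArray};
use arrow::compute::kernels::comparison::eq_utf8_scalar;
use arrow::datatypes::Int8Type;
use arrow::error::Result;

/// Compare every logical element of a string dictionary to `right`,
/// running the comparison only once per distinct dictionary value.
fn eq_dict_utf8_scalar_sketch(
    left: &DictionaryArray<Int8Type>,
    right: &str,
) -> Result<BooleanArray> {
    // 1. compare the (small) dictionary values array to the constant
    let values = left
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("dictionary values were not strings");
    let values_eq = eq_utf8_scalar(values, right)?;

    // 2. map each key through the per-value comparison result
    Ok(left
        .keys()
        .iter()
        .map(|key| key.map(|k| values_eq.value(k as usize)))
        .collect())
}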

Contributor Author

Makes sense, thanks for explanation. I am looking into this.

Contributor Author

@alamb I'm struggling with the second step in your pseudocode given that my understanding is that the values could be of any ArrowPrimitiveType. Would you be able to provide guidance on how to handle that? I've been playing with different macros and iteration options on the underlying buffers, but I feel like I'm missing some fundamental understanding about how to work with a dynamic data type like this or how to use ArrayData.

Contributor

@alamb alamb Dec 2, 2021

🤔

Yes, this is definitely tricky. Maybe take a step back and think about the use case: comparing DictionaryArrays to literals.

For example, if you look at the comparison kernels (for eq), https://docs.rs/arrow/6.3.0/arrow/compute/kernels/comparison/index.html, we find:

eq_scalar
eq_bool_scalar
eq_utf8_scalar

With each being typed based on the type of scalar (because the arrays are typed)

The issue with a DictionaryArray is that it could have numbers, bools, strings, etc., so we can't have a single entrypoint as we do with other types of arrays.

So I am thinking we would need something like

eq_dict_scalar // numeric 
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings

where each of those kernels would be able to downcast the array appropriately.

However, having three functions for each dict kernel seems somewhat crazy.

That is where my dyn idea was coming from. If we are going to add three new kernels for each operator (eq, lt, etc) we could perhaps add

eq_dyn_scalar // numeric 
eq_dyn_bool_scalar // boolean
eq_dyn_utf8_scalar // strings

etc

Which handle DictionaryArray as well as dispatching to the other eq_scalar, eq_bool_scalar, eq_utf8_scalar as appropriate.

Does that make sense? I can try and sketch out the interface this weekend sometime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Yes, it does make sense. I think I was trying to do too much in my macros / functions, which was causing my confusion. I think if I can get one of the below to work that should give me my baseline to do the rest.

eq_dict_scalar // numeric 
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings

@matthewmturner
Contributor Author

matthewmturner commented Dec 7, 2021

@alamb this took me much longer than it should have, and I'm still not even sure if it's idiomatic Rust / Arrow, but I think I might be close to getting the eq_dict_scalar kernel. This process definitely helped a lot with learning about Rust traits / macros and how Arrow uses them.

One thing I noticed is that there doesn't seem to be a way to create a DictionaryArray from a vec of scalars like you can with a vec of str slices. Is that expected? I also don't see a builder method I can use to construct the DictionaryArray. I guess the main use case for DictionaryArray is with strings and not scalars?

Either way, would you be able to see if this update is going in the right direction? If so, then I can extend to utf8 kernels etc.

UPDATE:
Actually I found PrimitiveDictionaryBuilder, which I think will work.
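
For reference, building a numeric dictionary with that builder might look roughly like this (a sketch assuming the arrow 6.x builder API shown in the test later in this thread; the function name is made up):

use arrow::array::{DictionaryArray, PrimitiveBuilder, PrimitiveDictionaryBuilder};
use arrow::datatypes::{UInt32Type, UInt8Type};

fn build_numeric_dictionary() -> DictionaryArray<UInt8Type> {
    // u8 keys indexing into a dictionary of u32 values
    let keys = PrimitiveBuilder::<UInt8Type>::new(3);
    let values = PrimitiveBuilder::<UInt32Type>::new(2);
    let mut builder = PrimitiveDictionaryBuilder::new(keys, values);
    builder.append(123).unwrap();
    builder.append_null().unwrap();
    builder.append(223).unwrap();
    builder.finish()
}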

@alamb
Contributor

alamb commented Dec 8, 2021

One thing I noticed is that there doesn't seem to be a way to create a DictionaryArray from a vec of scalars like you can with a vec of str slices.

You can create DictionaryArrays from &str using FromIter as shown in https://docs.rs/arrow/6.3.0/arrow/array/struct.DictionaryArray.html

use arrow::array::{DictionaryArray, Int8Array};
use arrow::datatypes::Int8Type;
let test = vec!["a", "a", "b", "c"];
let array : DictionaryArray<Int8Type> = test.iter().map(|&x| if x == "b" {None} else {Some(x)}).collect();
assert_eq!(array.keys(), &Int8Array::from(vec![Some(0), Some(0), None, Some(1)]));

There isn't an equivalent syntax for other scalar types (e.g. u32, u64, etc) though we could add them

@alamb
Contributor

alamb commented Dec 8, 2021

The type system is very tricky -- let me see if I can sketch something out to help

@matthewmturner
Contributor Author

matthewmturner commented Dec 8, 2021

@alamb I made some updates and the scalar test is now passing! I think I did it right... maybe some stylistic changes to be made though. If you're okay with this implementation then I can add the other ops for scalar.

I had started on the utf8_scalar macro but I'm still trying to figure out if I can reuse the compare_dict_op_scalar macro or if I need to make another one that downcasts to GenericStringArray.

@codecov-commenter

Codecov Report

Merging #984 (b5f04c5) into master (9703f98) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #984      +/-   ##
==========================================
- Coverage   82.32%   82.31%   -0.01%     
==========================================
  Files         168      168              
  Lines       48717    49051     +334     
==========================================
+ Hits        40107    40378     +271     
- Misses       8610     8673      +63     
Impacted Files Coverage Δ
arrow/src/compute/kernels/comparison.rs 93.39% <100.00%> (+0.45%) ⬆️
arrow/src/datatypes/native.rs 72.91% <0.00%> (-1.56%) ⬇️
arrow/src/array/array_string.rs 97.08% <0.00%> (-0.83%) ⬇️
parquet/src/record/reader.rs 89.83% <0.00%> (-0.63%) ⬇️
parquet/src/schema/printer.rs 72.47% <0.00%> (-0.55%) ⬇️
arrow/src/util/display.rs 20.00% <0.00%> (-0.39%) ⬇️
arrow/src/array/transform/mod.rs 85.10% <0.00%> (-0.38%) ⬇️
arrow/src/datatypes/schema.rs 66.66% <0.00%> (-0.27%) ⬇️
arrow/src/array/array.rs 85.45% <0.00%> (-0.26%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


// Safety:
// `i < $left.len()`
let comparison: Vec<bool> = (0..array.len())

Contributor

I didn't think the values() (the dictionary size) has to be the same as the size of the overall array 🤔

Contributor Author

If you're referring to the safety comment, I just hadn't removed that yet.

Comment on lines 1261 to 1271
pub fn eq_dict_scalar<T>(
left: &DictionaryArray<T>,
right: T::Native,
) -> Result<BooleanArray>
where
T: ArrowNumericType,
{
#[cfg(not(feature = "simd"))]
println!("{}", std::any::type_name::<T>());
return compare_dict_op_scalar!(left, T, right, |a, b| a == b);
}

Contributor

@matthewmturner this is what I was trying to say.

I think the way you have this function with a single T generic parameter means one could not compare a DictionaryArray<Int8> (aka that has keys / indexes of Int8) that had values of type DataType::UInt16.

Here is a sketch of how this might work:

/// Perform `left == right` operation on a [`DictionaryArray`] and a numeric scalar value.
pub fn eq_dict_scalar<T, K>(
    left: &DictionaryArray<K>,
    right: T::Native,
) -> Result<BooleanArray>
where
    T: ArrowNumericType,
    K: ArrowNumericType,
{
    // compare to the dictionary values (e.g. if the dictionary is {A,
    // B} and the keys are {1,0,1,1} that represents the values B, A,
    // B, B).
    //
    // So we compare just the dictionary {A, B} values to `right` and
    // then look up the per-value result for each key below.
    //
    // TODO macro-ize this

    let dictionary_comparison = match left.values().data_type() {
        DataType::Int8 => {
            eq_scalar(as_primitive_array::<T>(left.values()), right)
        }
        // TODO fill in Int16, Int32, etc
        _ => unimplemented!("Should error: dictionary did not store values of type T")
    }?;

    // Required for safety below
    assert_eq!(dictionary_comparison.len(), left.values().len());

    // Now, look up the dictionary for each output
    let result: BooleanArray = left.keys()
        .iter()
        .map(|key| {
            // figure out how the dictionary element at this index
            // compared to the scalar
            key.map(|key| {
                // safety: the original array's indices were valid
                // `(0 .. left.values().len())` and dictionary_comparison
                // is the same size, checked above
                unsafe {
                    // it would be nice to avoid checking the conversion each time
                    let key = key.to_usize().expect("Dictionary index not usize");
                    dictionary_comparison.value_unchecked(key)
                }
            })
        })
        .collect();

    Ok(result)
}

Contributor Author

Thanks much for putting this together and the explanation. I'll work on implementing it!

@yordan-pavlov
Contributor

I wonder if a SIMD implementation could be done as well

@alamb
Contributor

alamb commented Dec 8, 2021

I wonder if a SIMD implementation could be done as well

If we take the approach I tried to sketch in #984 (comment), I think the comparisons themselves could be vectorized (ideally by falling back to the existing vectorized kernel).

@matthewmturner
Contributor Author

matthewmturner commented Dec 10, 2021

@alamb I've started updating to your proposed approach, but now when testing I get an error that a type annotation is needed for type parameter T. I'm still playing around with it but ran out of time for today. Does that mean that the scalar needs to be cast to an ArrowPrimitiveType that is then used here:

 eq_scalar(as_primitive_array::<T>(left.values()), right)

My naive view is that we are casting to a PrimitiveArray using an ArrowPrimitiveType that comes from the scalar value we want to compare. But what if the values in our PrimitiveArray were a larger type than the scalar (i.e. values were i64 vs a scalar of i32) - shouldn't we cast the scalar to the larger type in that case?

Or said differently, if we're using eq_scalar under the hood, which requires the PrimitiveArray and scalar to be of the same type, why aren't we enforcing the same with DictionaryArrays if it's ultimately being passed to a PrimitiveArray and using eq_scalar?

@yordan-pavlov
Contributor

i think the comparisons themselves could be vectorized (ideally by falling back to the existing vectorized kernel)

@alamb I thought the benefit of dictionary comparison would be that it would enable vectorization of key comparison (especially useful when values are strings), since the number of keys would usually be much larger than the number of values (although vectorization of value comparison is a nice fallback in case the dictionary isn't sorted and so a binary search wouldn't work). In your proposed approach vectorization of key comparison could still happen, but is left to the compiler (instead of using explicit vectorization from existing comparison kernels). Or have I misunderstood?

@alamb
Contributor

alamb commented Dec 12, 2021

@alamb I thought the benefit of dictionary comparison would be that it would enable vectorization of key comparison (especially useful when values are strings), since the number of keys would usually be much larger than the number of values

Yes that is my understanding @yordan-pavlov

In your proposed approach vectorization of key comparison could still happen, but is left to the compiler (instead of using explicit vectorization from existing comparison kernels).

I guess I was imagining that the key comparisons would be exactly as vectorized as the existing comparison kernels; Specifically, in my sketch above there is a call to eq_scalar (aka an existing comparison kernel)

            eq_scalar(as_primitive_array::<T>(left.values()), right)

@alamb
Contributor

alamb commented Dec 12, 2021

@matthewmturner I plan to answer your question in more detail tomorrow.

@alamb
Contributor

alamb commented Dec 13, 2021

@alamb I've started updating to your proposed approach, but now when testing I get an error that a type annotation is needed for type parameter T. I'm still playing around with it but ran out of time for today. Does that mean that the scalar needs to be cast to an ArrowPrimitiveType that is then used here:

@matthewmturner -- I was able to make your branch compile using this change, but that seems quite non-ideal

diff --git a/arrow/src/compute/kernels/comparison.rs b/arrow/src/compute/kernels/comparison.rs
index f98e15d549..59d960220d 100644
--- a/arrow/src/compute/kernels/comparison.rs
+++ b/arrow/src/compute/kernels/comparison.rs
@@ -2135,7 +2135,7 @@ mod tests {
         builder.append_null().unwrap();
         builder.append(223).unwrap();
         let array = builder.finish();
-        let a_eq = eq_dict_scalar(&array, 123).unwrap();
+        let a_eq = eq_dict_scalar::<UInt8Type, UInt8Type>(&array, 123).unwrap();
         assert_eq!(
             a_eq,
             BooleanArray::from(vec![Some(true), None, Some(false)])

I had been hoping something like this would work

        let a_eq = eq_dict_scalar(&array, 123u8).unwrap();

(as in ensure that the 123 was typed as a u8)

But the compiler still says it needs type annotations.

So while the direct usability of eq_dict_scalar might be pretty low, I think it could then be used to implement eq_dyn_scalar (which I know I keep going on about)

@tustvold or @carols10cents do you have any ideas how we might avoid having to add type annotations to the call of eq_dict_scalar?

@alamb
Contributor

alamb commented Dec 13, 2021

I am still thinking about this one

@alamb
Contributor

alamb commented Dec 13, 2021

So I messed around with this some more. The key thing I kept hitting was that the comparisons to the dictionary values effectively needed to be "dynamic"

Here is one approach that we might be able to use (and skip the specialized eq_dict_scalar entirely).

/// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive values
pub fn eq_dyn_scalar<T>(
    left: &dyn Array,
    right: T::Native,
) -> Result<BooleanArray>
where
    T: ArrowNumericType,
{
    #[cfg(not(feature = "simd"))]
    println!("{}", std::any::type_name::<T>());
    match left.data_type() {
        DataType::UInt8 => {
            // horrible (?) way to get a u8
            let right: u8 = right.to_usize()
                .and_then(|right| right.try_into().ok())
                .ok_or_else(|| ArrowError::ComputeError(format!("Can not convert {:?} to u8 for comparison with UInt8Array", right)))?;
            eq_scalar::<UInt8Type>(as_primitive::<UInt8Type>(left), right)
        }
        DataType::UInt16 => {
            // horrible (?) way to get a u16
            let right: u16 = right.to_usize()
                .and_then(|right| right.try_into().ok())
                .ok_or_else(|| ArrowError::ComputeError(format!("Can not convert {:?} to u16 for comparison with UInt16Array", right)))?;
            eq_scalar::<UInt16Type>(as_primitive::<UInt16Type>(left), right)
        }
        // TODO other primitive array types
        DataType::Dictionary(key_type, value_type) => {
            match key_type.as_ref() {
                DataType::UInt8 => {
                    let left = as_dictionary::<UInt8Type>(left);
                    unpack_dict_comparison(left, eq_dyn_scalar::<T>(left.values().as_ref(), right)?)
                }
                // TODO fill out the rest of the key types here
                _ => todo!()
            }
        }
        // TODO macroize / fill out rest of primitive dispatch
        _ => todo!()
    }
}

The downside is you still need type annotations at the callsite:

    #[test]
    fn test_dict_eq_scalar() {
...
        let array = builder.finish();
        // still need the UInt8Type annotations
        let a_eq = eq_dyn_scalar::<UInt8Type>(&array, 123u8).unwrap();
        assert_eq!(
            a_eq,
            BooleanArray::from(vec![Some(true), None, Some(false)])
        );
    }

@alamb
Contributor

alamb commented Dec 13, 2021

The other thing I can think of, which seems a bit of a hack but might be ok, would be to take something like impl TryInto<i128>, which covers all current native types.

/// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive values
pub fn eq_dyn_scalar<T>(
    left: &dyn Array,
    right: T
) -> Result<BooleanArray>
where
    T : TryInto<i128> + Copy + std::fmt::Debug
{
    let right: i128 = right
        .try_into()
        .map_err(|_| ArrowError::ComputeError(format!("Can not convert scalar {:?} to i128", right)))?;
    match left.data_type() {
        DataType::UInt8 => {
            let right: u8 = right
                .try_into()
                .map_err(|_| ArrowError::ComputeError(format!("Can not convert {:?} to u8 for comparison with UInt8Array", right)))?;
            eq_scalar::<UInt8Type>(as_primitive::<UInt8Type>(left), right)
        }
        DataType::UInt16 => {
            let right: u16 = right
                .try_into()
                .map_err(|_| ArrowError::ComputeError(format!("Can not convert {:?} to u16 for comparison with UInt16Array", right)))?;
            eq_scalar::<UInt16Type>(as_primitive::<UInt16Type>(left), right)
        }
        // TODO macroize + fill out the other primitive array types here

        DataType::Dictionary(key_type, value_type) => {
            match key_type.as_ref() {
                DataType::UInt8 => {
                    let left = as_dictionary::<UInt8Type>(left);
                    unpack_dict_comparison(left, eq_dyn_scalar(left.values().as_ref(), right)?)
                }
                // TODO fill out the rest of the key types here
                _ => todo!()
            }
        }
        _ => todo!()
    }
}

/// unpacks the results of comparing left.values (as a boolean)
///
/// TODO add example
///
fn unpack_dict_comparison<K>(
    left: &DictionaryArray<K>,
    dict_comparison: BooleanArray,
) -> Result<BooleanArray>
where
    K: ArrowNumericType,
{
    assert_eq!(dict_comparison.len(), left.values().len());

    let result: BooleanArray = left
        .keys()
        .iter()
        .map(|key| {
            key.map(|key| unsafe {
                // safety lengths were verified above
                let key = key.to_usize().expect("Dictionary index not usize");
                dict_comparison.value_unchecked(key)
            })
        })
        .collect();

    Ok(result)
}

Which then finally allows a call to eq_dyn_scalar without type annotations:

    #[test]
    fn test_dict_eq_scalar() {
        ...
        let array = builder.finish();
        // YAY! No type annotations!
        let a_eq = eq_dyn_scalar(&array, 123).unwrap();
        assert_eq!(
            a_eq,
            BooleanArray::from(vec![Some(true), None, Some(false)])
        );
    }

@alamb alamb changed the title Add comparison kernels for DictionaryArray Add scalar comparison kernels for DictionaryArray Dec 13, 2021
@jorgecarleitao
Member

IMO we need a dyn Scalar and a cmp_scalar(array: &dyn Array, scalar: &dyn Scalar) -> BooleanArray that dispatches to specific implementations based on the array's datatype. We then use cmp_scalar(dict.values().as_ref(), scalar) and clone the indices. Inspiration for trait Scalar here: https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar . I hope it helps.
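
A very rough sketch of that shape (purely illustrative -- neither Scalar nor cmp_scalar exists in arrow-rs today, and arrow2's actual trait differs):

use std::any::Any;
use arrow::array::{Array, BooleanArray};
use arrow::datatypes::DataType;
use arrow::error::Result;

// Hypothetical dynamically typed scalar, loosely inspired by arrow2's `trait Scalar`
trait Scalar: Any {
    fn data_type(&self) -> &DataType;
    fn is_valid(&self) -> bool;
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical entry point: dispatch on the array's DataType and downcast `scalar`
// to the matching concrete scalar. For a DictionaryArray it would compare the values
// array to `scalar` and then map the keys through that result.
fn cmp_scalar(array: &dyn Array, scalar: &dyn Scalar) -> Result<BooleanArray> {
    match (array.data_type(), scalar.data_type()) {
        // (DataType::Utf8, DataType::Utf8) => downcast both and call eq_utf8_scalar,
        // (DataType::Dictionary(..), _)    => recurse on the values array, clone the indices,
        _ => unimplemented!("sketch only"),
    }
}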

@alamb
Contributor

alamb commented Dec 13, 2021

IMO we need a dyn Scalar and a cmp_scalar(array: &dyn Array, scalar: &dyn Scalar) -> BooleanArray that dispatches to specific implementations based on the array's datatype. We then use cmp_scalar(dict.values().as_ref(), scalar) and clone the indices. Inspiration for trait Scalar here: https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar . I hope it helps.

@jorgecarleitao -- Indeed, this is an excellent idea. 🤔

Following the existing pattern in the comparison kernels, we will likely end up with several kernels such as

eq_dyn_lit(left: &dyn Array, right: &dyn Array)
eq_utf8_dyn_lit(left: &dyn Array, right: impl AsRef<str>)
eq_bool_dyn_lit(left: &dyn Array, right: bool)

Which might be nice as the user's Rust code could leave the scalar right strongly typed, but it would still have to call the correct kernel. It also has the upside that it wouldn't require an owned String for a Utf8 constant, but we can probably make a dyn Scalar work like that too with some finagling.

@jorgecarleitao when you say "inspiration" do you mean it would be ok to incorporate https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar into arrow-rs? I have somewhat lost track of what was needed IP clearance wise. Maybe since arrow2 is apache licensed, it is fine?

@alamb
Contributor

alamb commented Dec 13, 2021

Perhaps one possibility would be to have a trait that maps in the reverse direction? Something like playground

@shepmaster this is a very cool idea -- in fact I think that is much better than taking an impl Into<i128> as I suggested in #984 (comment)
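
The linked playground isn't reproduced here, but the "reverse direction" idea -- mapping a Rust native scalar type back to the Arrow type it belongs to -- might be sketched roughly like this (the trait name matches the IntoArrowNumericType discussed later; the exact shape is a guess):

use arrow::datatypes::{ArrowNativeType, ArrowNumericType, UInt16Type, UInt8Type};

/// Hypothetical trait: the reverse of `ArrowPrimitiveType::Native` -- given a Rust
/// native scalar type, name the Arrow type whose native representation it is.
trait IntoArrowNumericType: ArrowNativeType {
    type ArrowType: ArrowNumericType<Native = Self>;
}

impl IntoArrowNumericType for u8 {
    type ArrowType = UInt8Type;
}

impl IntoArrowNumericType for u16 {
    type ArrowType = UInt16Type;
}

// A kernel taking `right: T` where `T: IntoArrowNumericType` could then recover the
// Arrow type as `T::ArrowType`, so the caller would not need a turbofish annotation.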

@jorgecarleitao
Member

jorgecarleitao commented Dec 13, 2021

For reference (for anyone not following the gist), a DictionaryArray is essentially

struct DictionaryArray<K> {
    indices: PrimitiveArray<K>,
    values: Arc<dyn Array>,
}

that is itemwise represented as

indices
     .into_iter()
     .map(|optional_index| optional_index.map(|index| "values[index]"))

where "values[index]" represents the index of values.

@alamb 's insight is that we can compare a dictionary with a scalar of the same (dynamic) type as values by comparing values with that scalar (ignoring indices).

@jorgecarleitao
Member

"inspiration" because

  • there is DataFusion's scalar API based on enums, which we could consider instead
  • in arrow2 there is a smaller number of physical types than arrow (one per physical type as opposed to one per logical type), so it can't be copied as is
  • the scalar API is not IP-cleared; it was added to arrow2 post IP clearance :/

@alamb
Contributor

alamb commented Dec 13, 2021

the scalar API it is not IP-cleared it was added to arrow2 post IP clearance :/

Right -- I don't understand the implications of this statement (like if code is apache licensed, that means it can be incorporated into other projects, right?). So I don't really understand the IP clearance need but I remember it was complicated 🤷

On the topic of scalars -- there is a recent related mailing list discussion of adding "Constant Array" support to arrow which perhaps could serve the same / similar purpose: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

@alippai
Contributor

alippai commented Dec 13, 2021

@alamb my understanding is that any meaningful chunk of code with any license needs to be cleared if it's not original creation of the dev written for the Apache Arrow project. I believe @jorgecarleitao referred to this: that trait is not IP cleared yet, thus it needs the legal process :/

@matthewmturner
Contributor Author

@alamb @jorgecarleitao I'm not familiar enough with the rest of the code base, but would adding a dyn Scalar trait mean having to update everywhere Rust native scalars are currently used with that trait? Of course this is assuming legal issues are resolved. On that point, independent from the context of this PR, is IP clearance expected to refresh anytime soon?

The approach from @shepmaster seemed quite nice as well; however, it would still require users to set the type of the scalar - is that acceptable?

@alamb
Contributor

alamb commented Dec 14, 2021

but would adding a dyn Scalar trait mean having to update everywhere rust native scalars are currently used with that trait?

One approach could be to remove all the Rust native scalars and replace them with dyn Scalar, but another approach is to have both (keep the existing kernels with different signatures, and add a new entirely dynamic kernel eq_scalar(&dyn Array, &dyn Scalar)).

I think keeping with the existing style of separate kernels for separate Rust native types is probably the way to go for this PR.

Of course this is assuming legal issues are resolved. On that point, independent from the context of this PR, is IP clearance expected to refresh anytime soon?

I defer to @jorgecarleitao on his plans, as he is the primary author of most of arrow2

The approach from @shepmaster seemed quite nice as well; however, it would still require users to set the type of the scalar - is that acceptable?

I think it is acceptable (and a good idea) personally. It would result in perhaps something like:

/// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive numeric values
pub fn eq_dyn_scalar<T>(
    left: &dyn Array,
    right: T
) -> Result<BooleanArray>
where
    T : IntoArrowNumericType
{
...
}

?

@matthewmturner
Contributor Author

@alamb ok! I can open a new issue for adding the IntoArrowNumericType trait, which I think we would want to do separately from this PR. That would also give others an opportunity to provide an opinion and give a little time to see if there is a path forward on the IP clearance.

@alamb
Contributor

alamb commented Dec 14, 2021

@alamb ok! I can open a new issue for adding the IntoArrowNumericType trait, which I think we would want to do separately from this PR. That would also give others an opportunity to provide an opinion and give a little time to see if there is a path forward on the IP clearance.

Sounds good @matthewmturner .

In terms of Scalar / ScalarValue and IP clearance, here is what I would suggest:

  1. We implement the kernels that dynamically dispatch on the array type (eq_dyn_scalar, eq_utf8_dyn_lit and eq_bool_dyn_lit) -- I can definitely find time to help here
  2. We then file a follow-on ticket / issue for implementing fully dynamic kernels that dispatch both on array type and on scalar type

What can I do to be most helpful for this project?

@alamb
Contributor

alamb commented Dec 14, 2021

BTW for some context, the comparison kernels for dictionaries are very important (though not critically urgent) for our use case in IOx, as we have large amounts of dictionary string data.

Thus, given you are working on this, I can reallocate some non-trivial amount of my time to help. Thank you for all the effort so far @matthewmturner

cc @pauldix @jacobmarble

@jorgecarleitao
Member

Yes, what @alippai wrote. Given the number of differences that we would need to apply, re-writing it shouldn't be very difficult. My comment was really just to point out that dyn Scalar is an option here. I think we discussed this some time ago with @nevi-me (the idea of a Scalar API to help us write generic code).

@matthewmturner
Contributor Author

@alamb thank you for the additional context and all the guidance you've provided so far.

Just to expand on and summarize your plan to make sure I'm not missing any steps, can you confirm these are the steps?

  1. New issue / PR to implement IntoArrowNumericType
  2. Use 'IntoArrowNumericType' to create the different underlying xx_dict_scalar functions / macros.
  3. We create the dynamic array kernels (i.e. eq_dyn_scalar) for scalar values, using the new dict kernels created in step 2 for DictionaryArray (this PR)
  4. We create issue for implementing the Scalar api
  5. We create issue for creating dynamic array and scalar kernels

I'm happy to, and have enjoyed, working on this, but I don't want to slow down anything on the IOx side as a result of me getting up to speed on the different pieces here. If you're okay with it (and aligned on the steps above), we could continue as we have been on this, and I'll start on the first step and ping you as I have questions / for guidance. If this isn't moving at a quick enough pace then we can work on a plan for how you can assist on the implementation to speed it up?

@nevi-me
Contributor

nevi-me commented Dec 14, 2021

It's been a while, but if I recall correctly, the idea was for compute kernels in arrow to follow a similar approach to C++ where we have a Datum (or whatever we call it), which could likely be an enum of the same datatype. Within that Datum, we'd have Scalar, Array, ...

The C++ impl has a ChunkedArray which is effectively Vec<Array>, but we never went with that.

It would have been desirable to move Scalar to arrow at the time, but that's fortunately been done in arrow2.
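
For readers unfamiliar with the C++ Datum, a rough (purely hypothetical) shape in arrow-rs terms would be something like:

use arrow::array::ArrayRef;

// Hypothetical Datum-style kernel input, loosely following the C++ design; none of
// these names exist in arrow-rs. `ScalarValue` stands in for whatever scalar
// abstraction (trait object or enum) is eventually adopted.
enum ScalarValue {
    UInt8(Option<u8>),
    Utf8(Option<String>),
    Boolean(Option<bool>),
    // ... one variant per data type
}

enum Datum {
    Scalar(ScalarValue),
    Array(ArrayRef),
    ChunkedArray(Vec<ArrayRef>), // the C++ ChunkedArray equivalent, not modelled in arrow-rs
}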

@alamb
Contributor

alamb commented Dec 15, 2021

Hi @matthewmturner -- I will respond tomorrow - I ran out of time today

@alamb
Contributor

alamb commented Dec 16, 2021

Just to expand on and summarize your plan to make sure I'm not missing any steps, can you confirm these are the steps?

Yes I think those are the steps I was proposing. Thank you @matthewmturner

If you're okay with it (and aligned on the steps above), we could continue as we have been on this, and I'll start on the first step and ping you as I have questions / for guidance. If this isn't moving at a quick enough pace then we can work on a plan for how you can assist on the implementation to speed it up?

Sounds like a good plan! Thank you!

@alamb
Contributor

alamb commented Dec 16, 2021

It's been a while, but if I recall correctly, the idea was for compute kernels in arrow to follow a similar approach to C++ where we have a Datum (or whatever we call it), which could likely be an enum of the same datatype. Within that Datum, we'd have Scalar, Array, ...

That is a great idea @nevi-me -- I couldn't find an existing ticket so I wrote a new one here: #1047

Maybe we can work towards that vision, starting with the equality kernels (we'll need all the pieces and tests, even when we have a fully dynamic dispatch, I suspect)

@alamb
Contributor

alamb commented Dec 16, 2021

@matthewmturner let me know if it would help to file a ticket for IntoArrowNumericType

@matthewmturner
Contributor Author

@matthewmturner let me know if it would help to file a ticket for IntoArrowNumericType

@alamb Sure that would be great!

@matthewmturner
Contributor Author

@shepmaster one quick question on your proposal.

What would be the difference between that approach and implementing the From trait (https://doc.rust-lang.org/std/convert/trait.From.html) for ArrowNumericType on each of the Rust native types? I guess that would change the signatures / how we call the functions to something like the following:

fn eq_dict_scalar<K, T>(left: &DictionaryArray<K>, right: T)
where
    K: ArrowNumericType,
    T: ArrowNumericType,
{
    todo!()
}

// ----

fn usage<K>(d: &DictionaryArray<K>)
where
    K: ArrowNumericType,
{
    eq_dict_scalar(d, 8.into());
    eq_dict_scalar(d, -64.into());
}

@alamb FYI

@shepmaster
Contributor

If it makes sense to use From, then there should be no material difference. You can still take a generic T: Into<ArrowNumericType> to avoid forcing the caller to do extra typing.

However, I don’t think that path is good as you’d still need to provide explicit types: what is the concrete type you are converting into?

@matthewmturner
Contributor Author

However, I don’t think that path is good as you’d still need to provide explicit types: what is the concrete type you are converting into?

Actually, I think I'm mistaken and mixing up types and traits. I don't think we need to do a conversion. Sorry about that.

@alamb
Contributor

alamb commented Dec 20, 2021

@matthewmturner - I filed #1068 to track the IntoArrowNumericType trait

Before we spend time creating polished PRs for IntoArrowNumericType I think we should try and spike out a PR showing how the API would work with eq_dyn_scalar

Once we had that basic framework in place then we could fill out the implementation in individual PRs.

If it would help, I could try and take a crack at creating such a PR (likely based on this one)

@bkmgit
Contributor

bkmgit commented Dec 28, 2021

One area that still needs to be determined in C++ is interfaces for custom comparisons:

Maybe worth planning for this as well.

@alamb
Contributor

alamb commented Dec 29, 2021

Maybe worth planning for this as well.

Thanks @bkmgit -- perhaps you can file a ticket for the item ?

@alamb
Contributor

alamb commented Dec 29, 2021

I think we have had all the discussion on this ticket and have the follow-up items tracked in other tickets / issues. Thus closing this PR down.

@alamb alamb closed this Dec 29, 2021
Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Add native comparison kernel support for DictionaryArray
9 participants