
Add scalar comparison kernels for DictionaryArray #984

Closed

Conversation

matthewmturner
Contributor

Which issue does this PR close?

Closes #869

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 27, 2021
@matthewmturner
Contributor Author

@alamb I started the work on this. Would you be able to give it a quick look when you get the chance to make sure it's going in the right direction?

let values = $left
.values()
.as_any()
.downcast_ref::<StringArray>()

Contributor

The values array can be anything (not always a StringArray) -- perhaps this would be a good place to use the dyn_XX kernels -- to compare the values array with $right.

Contributor Author

From my understanding of the dyn kernels, those can't be used when comparing to a constant, right?

Contributor

@alamb alamb Nov 29, 2021

🤔 yes you are correct -- we would need to add dyn_XX_lit type kernels, but that seems a bit overkill for this PR

Contributor Author

If the primary use case for this PR was comparing a dict array to a constant, then maybe it makes sense for me to do a separate PR for that first and then come back to this?

Contributor

If the primary use case for this PR was comparing a dict array to a constant, then maybe it makes sense for me to do a separate PR for that first and then come back to this?

I think focusing on the use case of comparing a dict array to a constant is the best choice for now.

Contributor Author

Ok! Will start with that.

Contributor Author

@alamb I've been reviewing this but I think I might be missing something. My understanding is that my code above is for getting the dictionary values, which can be of any type (of course, above I'm only handling StringArray).

        let values = $left
            .values()
            .as_any()
            .downcast_ref::<StringArray>()
            .unwrap()

But then you mention using the new dyn_xx kernels / creating dyn_xx_lit kernels. Since there's no actual compute being done here, what would the dyn kernels be used for? Or were you referring to using the kernels to replace more than just that section of code?

To me it looks like I need a macro to downcast DictionaryArray.values() into whatever type the values are, and then I could use something like dyn_xx_lit on that in order to get the comparison results. Is this roughly what you had in mind?

Contributor

I am very sorry for confusing this conversation with mentioning dyn_xx_lit.

What I was (inarticulately) trying to say was that once you have eq_dict_scalar (and you will likely also need eq_dict_scalar_utf8), we will end up with several different ways to compare an array to a scalar, depending on the array type.

So I was thinking ahead to adding functions like

fn eq_scalar_utf8_dyn(array: &dyn Array, right: &str) -> Result<BooleanArray> {
  // do dispatch to the right kernel based on type of array
}

But definitely not for this PR

Comment on lines 219 to 222
let comparison = (0..$left.len()).map(|i| unsafe {
let key = $left.keys().value_unchecked(i).to_usize().unwrap();
$op(values.value_unchecked(key), $right)
});

Contributor

I think one of the main points of this ticket is to avoid the call here to values.value_unchecked.

I like to think about the goal by asking "what would happen with a DictionaryArray with 1,000,000 entries but a dictionary of size 1?" -- the way you have this PR, I think we would call $op 1,000,000 times. The idea is to call $op 1 time.

So the pattern I think we are looking for, at least for the constant kernels, is:

In pseudo code:

let values = dict_array.values();
let comparison_result_on_values = apply_op_to_values();
let result = dict_array.keys().iter().map(|index| comparison_result_on_values[index]).collect()
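
As a concrete illustration of that pattern, here is a minimal sketch (not code from this PR; it assumes a string dictionary, the existing eq_utf8_scalar kernel, and in-range non-negative keys):

use arrow::array::{Array, BooleanArray, DictionaryArray, StringArray};
use arrow::compute::kernels::comparison::eq_utf8_scalar;
use arrow::datatypes::Int8Type;
use arrow::error::Result;

/// Compare every logical element of a string dictionary to `right`,
/// running the comparison only once per distinct dictionary value.
fn eq_dict_utf8_scalar_sketch(
    left: &DictionaryArray<Int8Type>,
    right: &str,
) -> Result<BooleanArray> {
    // 1. compare the (small) dictionary values array to the constant
    let values = left
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("dictionary values were not strings");
    let values_eq = eq_utf8_scalar(values, right)?;

    // 2. map each key through the per-value comparison result
    Ok(left
        .keys()
        .iter()
        .map(|key| key.map(|k| values_eq.value(k as usize)))
        .collect())
}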

Contributor Author

Makes sense, thanks for explanation. I am looking into this.

Contributor Author

@alamb I'm struggling with the second step in your pseudocode given that my understanding is that the values could be of any ArrowPrimitiveType. Would you be able to provide guidance on how to handle that? I've been playing with different macros and iteration options on the underlying buffers, but I feel like I'm missing some fundamental understanding about how to work with a dynamic data type like this or how to use ArrayData.

Contributor

@alamb alamb Dec 2, 2021

🤔

Yes, this is definitely tricky. Maybe take a step back and think about the use case: comparing DictionaryArrays to literals.

For example, if you look at the comparison kernels (for eq), https://docs.rs/arrow/6.3.0/arrow/compute/kernels/comparison/index.html, we find:

eq_scalar
eq_bool_scalar
eq_utf8_scalar

With each being typed based on the type of scalar (because the arrays are typed)

The issue with a DictionaryArray is that it could have numbers, bools, strings, etc., so we can't have a single entrypoint as we do with other types of arrays.

So I am thinking we would need something like

eq_dict_scalar // numeric 
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings

where each of those kernels would be able to downcast the array appropriately.

However, having three functions for each dict kernel seems somewhat crazy.

That is where my dyn idea was coming from. If we are going to add three new kernels for each operator (eq, lt, etc) we could perhaps add

eq_dyn_scalar // numeric 
eq_dyn_bool_scalar // boolean
eq_dyn_utf8_scalar // strings

etc

Which handle DictionaryArray as well as dispatching to the other eq_scalar, eq_bool_scalar, eq_utf8_scalar as appropriate.

Does that make sense? I can try and sketch out the interface this weekend sometime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Yes, it does make sense. I think I was trying to do too much in my macros / functions, which was causing my confusion. I think if I can get one of the below to work that should give me my baseline to do the rest.

eq_dict_scalar // numeric 
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings

@matthewmturner
Contributor Author

matthewmturner commented Dec 7, 2021

@alamb this took me much longer than it should have, and I'm still not even sure if it's idiomatic Rust / Arrow, but I think I might be close to getting the eq_dict_scalar kernel. This process definitely helped a lot with learning about Rust traits / macros and how Arrow uses them.

One thing I noticed is that there doesn't seem to be a way to create a DictionaryArray from a vec of scalars like you can with a vec of str slices. Is that expected? I also don't see a builder method I can use to construct the DictionaryArray. I guess the main use case for DictionaryArray is with strings and not scalars?

Either way, would you be able to see if this update is going in the right direction? If so, then I can extend to utf8 kernels etc.

UPDATE:
Actually I found PrimitiveDictionaryBuilder, which I think will work.
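
For reference, building a numeric dictionary with that builder might look roughly like this (a sketch assuming the arrow 6.x builder API shown in the test later in this thread; the function name is made up):

use arrow::array::{DictionaryArray, PrimitiveBuilder, PrimitiveDictionaryBuilder};
use arrow::datatypes::{UInt32Type, UInt8Type};

fn build_numeric_dictionary() -> DictionaryArray<UInt8Type> {
    // u8 keys indexing into a dictionary of u32 values
    let keys = PrimitiveBuilder::<UInt8Type>::new(3);
    let values = PrimitiveBuilder::<UInt32Type>::new(2);
    let mut builder = PrimitiveDictionaryBuilder::new(keys, values);
    builder.append(123).unwrap();
    builder.append_null().unwrap();
    builder.append(223).unwrap();
    builder.finish()
}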

@alamb
Contributor

alamb commented Dec 8, 2021

One thing I noticed is that there doesn't seem to be a way to create a DictionaryArray from a vec of scalars like you can with a vec of str slices.

You can create DictionaryArrays from &str using FromIter as shown in https://docs.rs/arrow/6.3.0/arrow/array/struct.DictionaryArray.html

use arrow::array::{DictionaryArray, Int8Array};
use arrow::datatypes::Int8Type;
let test = vec!["a", "a", "b", "c"];
let array : DictionaryArray<Int8Type> = test.iter().map(|&x| if x == "b" {None} else {Some(x)}).collect();
assert_eq!(array.keys(), &Int8Array::from(vec![Some(0), Some(0), None, Some(1)]));

There isn't an equivalent syntax for other scalar types (e.g. u32, u64, etc) though we could add them

@alamb
Contributor

alamb commented Dec 8, 2021

The type system is very tricky -- let me see if I can sketch something out to help

@matthewmturner
Contributor Author

matthewmturner commented Dec 8, 2021

@alamb I made some updates and the scalar test is now passing! I think I did it right... maybe some stylistic changes to be made though. If you're okay with this implementation then I can add the other ops for scalar.

I had started on the utf8_scalar macro but I'm still trying to figure out if I can reuse the compare_dict_op_scalar macro or if I need to make another one that downcasts to GenericStringArray.

@codecov-commenter

Codecov Report

Merging #984 (b5f04c5) into master (9703f98) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #984      +/-   ##
==========================================
- Coverage   82.32%   82.31%   -0.01%     
==========================================
  Files         168      168              
  Lines       48717    49051     +334     
==========================================
+ Hits        40107    40378     +271     
- Misses       8610     8673      +63     
Impacted Files Coverage Δ
arrow/src/compute/kernels/comparison.rs 93.39% <100.00%> (+0.45%) ⬆️
arrow/src/datatypes/native.rs 72.91% <0.00%> (-1.56%) ⬇️
arrow/src/array/array_string.rs 97.08% <0.00%> (-0.83%) ⬇️
parquet/src/record/reader.rs 89.83% <0.00%> (-0.63%) ⬇️
parquet/src/schema/printer.rs 72.47% <0.00%> (-0.55%) ⬇️
arrow/src/util/display.rs 20.00% <0.00%> (-0.39%) ⬇️
arrow/src/array/transform/mod.rs 85.10% <0.00%> (-0.38%) ⬇️
arrow/src/datatypes/schema.rs 66.66% <0.00%> (-0.27%) ⬇️
arrow/src/array/array.rs 85.45% <0.00%> (-0.26%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


// Safety:
// `i < $left.len()`
let comparison: Vec<bool> = (0..array.len())

Contributor

I didn't think the values() (the dictionary size) has to be the same as the size of the overall array 🤔

Contributor Author

If you're referring to the safety comment, I just hadn't removed that yet.

Comment on lines 1261 to 1271
pub fn eq_dict_scalar<T>(
left: &DictionaryArray<T>,
right: T::Native,
) -> Result<BooleanArray>
where
T: ArrowNumericType,
{
#[cfg(not(feature = "simd"))]
println!("{}", std::any::type_name::<T>());
return compare_dict_op_scalar!(left, T, right, |a, b| a == b);
}

Contributor

@matthewmturner this is what I was trying to say.

I think the way you have this function with a single T generic parameter means one could not compare a DictionaryArray<Int8> (aka that has keys / indexes of Int8) that had values of type DataType::UInt16.

Here is a sketch of how this might work:

/// Perform `left == right` operation on a [`DictionaryArray`] and a numeric scalar value.
pub fn eq_dict_scalar<T, K>(
    left: &DictionaryArray<K>,
    right: T::Native,
) -> Result<BooleanArray>
where
    T: ArrowNumericType,
    K: ArrowNumericType,
{
    // compare to the dictionary values (e.g. if the dictionary is {A,
    // B} and the keys are {1,0,1,1} that represents the values B, A,
    // B, B).
    //
    // So we compare just the dictionary {A, B} values to `right` and
    // then look up the per-value result for each key below.
    //
    // TODO macro-ize this

    let dictionary_comparison = match left.values().data_type() {
        DataType::Int8 => {
            eq_scalar(as_primitive_array::<T>(left.values()), right)
        }
        // TODO fill in Int16, Int32, etc
        _ => unimplemented!("Should error: dictionary did not store values of type T")
    }?;

    // Required for safety below
    assert_eq!(dictionary_comparison.len(), left.values().len());

    // Now, look up the dictionary for each output
    let result: BooleanArray = left.keys()
        .iter()
        .map(|key| {
            // figure out how the dictionary element at this index
            // compared to the scalar
            key.map(|key| {
                // safety: the original array's indices were valid
                // `(0 .. left.values().len())` and dictionary_comparison
                // is the same size, checked above
                unsafe {
                    // it would be nice to avoid checking the conversion each time
                    let key = key.to_usize().expect("Dictionary index not usize");
                    dictionary_comparison.value_unchecked(key)
                }
            })
        })
        .collect();

    Ok(result)
}

Contributor Author

Thanks much for putting this together and the explanation. I'll work on implementing it!

@yordan-pavlov
Contributor

I wonder if a SIMD implementation could be done as well

@alamb
Contributor

alamb commented Dec 8, 2021

I wonder if a SIMD implementation could be done as well

If we take the approach I tried to sketch in #984 (comment), I think the comparisons themselves could be vectorized (ideally by falling back to the existing vectorized kernel).

@matthewmturner
Contributor Author

matthewmturner commented Dec 10, 2021

@alamb I've started updating to your proposed approach, but now when testing I get an error that a type annotation is needed for type parameter T. I'm still playing around with it but ran out of time for today. Does that mean that the scalar needs to be cast to an ArrowPrimitiveType that is then used here:

 eq_scalar(as_primitive_array::<T>(left.values()), right)

My naive view is that we are casting to a PrimitiveArray using an ArrowPrimitiveType that comes from the scalar value we want to compare. But what if the values in our PrimitiveArray were a larger type than the scalar (i.e. values were i64 vs a scalar of i32) - shouldn't we cast the scalar to the larger type in that case?

Or said differently, if we're using eq_scalar under the hood, which requires the PrimitiveArray and scalar to be of the same type, why aren't we enforcing the same with DictionaryArrays if it's ultimately being passed to a PrimitiveArray and using eq_scalar?

@yordan-pavlov
Contributor

i think the comparisons themselves could be vectorized (ideally by falling back to the existing vectorized kernel)

@alamb I thought the benefit of dictionary comparison would be that it would enable vectorization of key comparison (especially useful when values are strings), since the number of keys would usually be much larger than the number of values (although vectorization of value comparison is a nice fallback in case the dictionary isn't sorted and so a binary search wouldn't work). In your proposed approach vectorization of key comparison could still happen, but is left to the compiler (instead of using explicit vectorization from existing comparison kernels). Or have I misunderstood?

@alamb
Contributor

alamb commented Dec 12, 2021

@alamb I thought the benefit of dictionary comparison would be that it would enable vectorization of key comparison (especially useful when values are strings), since the number of keys would usually be much larger than the number of values

Yes that is my understanding @yordan-pavlov

In your proposed approach vectorization of key comparison could still happen, but is left to the compiler (instead of using explicit vectorization from existing comparison kernels).

I guess I was imagining that the key comparisons would be exactly as vectorized as the existing comparison kernels; Specifically, in my sketch above there is a call to eq_scalar (aka an existing comparison kernel)

            eq_scalar(as_primitive_array::<T>(left.values()), right)

@alamb
Contributor

alamb commented Dec 12, 2021

@matthewmturner I plan to answer your question in more detail tomorrow.

@alamb
Contributor

alamb commented Dec 13, 2021

@alamb I've started updating to your proposed approach, but now when testing I get an error that a type annotation is needed for type parameter T. I'm still playing around with it but ran out of time for today. Does that mean that the scalar needs to be cast to an ArrowPrimitiveType that is then used here:

@matthewmturner -- I was able to make your branch compile using this change, but that seems quite non-ideal

diff --git a/arrow/src/compute/kernels/comparison.rs b/arrow/src/compute/kernels/comparison.rs
index f98e15d549..59d960220d 100644
--- a/arrow/src/compute/kernels/comparison.rs
+++ b/arrow/src/compute/kernels/comparison.rs
@@ -2135,7 +2135,7 @@ mod tests {
         builder.append_null().unwrap();
         builder.append(223).unwrap();
         let array = builder.finish();
-        let a_eq = eq_dict_scalar(&array, 123).unwrap();
+        let a_eq = eq_dict_scalar::<UInt8Type, UInt8Type>(&array, 123).unwrap();
         assert_eq!(
             a_eq,
             BooleanArray::from(vec![Some(true), None, Some(false)])

I had been hoping something like this would work

        let a_eq = eq_dict_scalar(&array, 123u8).unwrap();

(as in ensure that the 123 was typed as a u8)

But the compiler still says it needs type annotations.

So while the direct usability of eq_dict_scalar might be pretty low, I think it could then be used to implement eq_dyn_scalar (which I know I keep going on about)

@tustvold or @carols10cents do you have any ideas how we might avoid having to add type annotations to the call of eq_dict_scalar?

@alamb
Contributor

alamb commented Dec 13, 2021

I am still thinking about this one

@alamb
Contributor

alamb commented Dec 13, 2021

So I messed around with this some more. The key thing I kept hitting was that the comparisons to the dictionary values effectively needed to be "dynamic"

Here is one approach that we might be able to use (and skip the specialized eq_dict_scalar entirely).

/// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive values
pub fn eq_dyn_scalar<T>(
    left: &dyn Array,
    right: T::Native,
) -> Result<BooleanArray>
where
    T: ArrowNumericType,
{
    #[cfg(not(feature = "simd"))]
    println!("{}", std::any::type_name::<T>());
    match left.data_type() {
        DataType::UInt8 => {
            // horrible (?) way to get a u8
            let right: u8 = right.to_usize()
                .and_then(|right| right.try_into().ok())
                .ok_or_else(|| ArrowError::ComputeError(format!("Can not convert {:?} to u8 for comparison with UInt8Array", right)))?;
            eq_scalar::<UInt8Type>(as_primitive::<UInt8Type>(left), right)
        }
        DataType::UInt16 => {
            // horrible (?) way to get a u16
            let right: u16 = right.to_usize()
                .and_then(|right| right.try_into().ok())
                .ok_or_else(|| ArrowError::ComputeError(format!("Can not convert {:?} to u16 for comparison with UInt16Array", right)))?;
            eq_scalar::<UInt16Type>(as_primitive::<UInt16Type>(left), right)
        }
        // TODO other primitive array types
        DataType::Dictionary(key_type, value_type) => {
            match key_type.as_ref() {
                DataType::UInt8 => {
                    let left = as_dictionary::<UInt8Type>(left);
                    unpack_dict_comparison(left, eq_dyn_scalar::<T>(left.values().as_ref(), right)?)
                }
                // TODO fill out the rest of the key types here
                _ => todo!()
            }
        }
        // TODO macroize / fill out rest of primitive dispatch
        _ => todo!()
    }
}

The downside is you still need type annotations at the callsite:

    #[test]
    fn test_dict_eq_scalar() {
...
        let array = builder.finish();
        // still need the UInt8Type annotations
        let a_eq = eq_dyn_scalar::<UInt8Type>(&array, 123u8).unwrap();
        assert_eq!(
            a_eq,
            BooleanArray::from(vec![Some(true), None, Some(false)])
        );
    }

@alamb
Contributor

alamb commented Dec 13, 2021

The other thing I can think of, which seems a bit of a hack but might be ok, would be to take something like impl TryInto<i128>, which covers all current native types.

/// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive values
pub fn eq_dyn_scalar<T>(
    left: &dyn Array,
    right: T
) -> Result<BooleanArray>
where
    T : TryInto<i128> + Copy + std::fmt::Debug
{
    let right: i128 = right
        .try_into()
        .map_err(|_| ArrowError::ComputeError(format!("Can not convert scalar {:?} to i128", right)))?;
    match left.data_type() {
        DataType::UInt8 => {
            let right: u8 = right
                .try_into()
                .map_err(|_| ArrowError::ComputeError(format!("Can not convert {:?} to u8 for comparison with UInt8Array", right)))?;
            eq_scalar::<UInt8Type>(as_primitive::<UInt8Type>(left), right)
        }
        DataType::UInt16 => {
            let right: u16 = right
                .try_into()
                .map_err(|_| ArrowError::ComputeError(format!("Can not convert {:?} to u16 for comparison with UInt16Array", right)))?;
            eq_scalar::<UInt16Type>(as_primitive::<UInt16Type>(left), right)
        }
        // TODO macroize + fill out the other primitive array types here

        DataType::Dictionary(key_type, value_type) => {
            match key_type.as_ref() {
                DataType::UInt8 => {
                    let left = as_dictionary::<UInt8Type>(left);
                    unpack_dict_comparison(left, eq_dyn_scalar(left.values().as_ref(), right)?)
                }
                // TODO fill out the rest of the key types here
                _ => todo!()
            }
        }
        _ => todo!()
    }
}

/// unpacks the results of comparing left.values (as a boolean)
///
/// TODO add example
///
fn unpack_dict_comparison<K>(
    left: &DictionaryArray<K>,
    dict_comparison: BooleanArray,
) -> Result<BooleanArray>
where
    K: ArrowNumericType,
{
    assert_eq!(dict_comparison.len(), left.values().len());

    let result: BooleanArray = left
        .keys()
        .iter()
        .map(|key| {
            key.map(|key| unsafe {
                // safety lengths were verified above
                let key = key.to_usize().expect("Dictionary index not usize");
                dict_comparison.value_unchecked(key)
            })
        })
        .collect();

    Ok(result)
}

Which then finally allows a call to eq_dyn_scalar without type annotations:

    #[test]
    fn test_dict_eq_scalar() {
        ...
        let array = builder.finish();
        // YAY! No type annotations!
        let a_eq = eq_dyn_scalar(&array, 123).unwrap();
        assert_eq!(
            a_eq,
            BooleanArray::from(vec![Some(true), None, Some(false)])
        );
    }

@alamb alamb changed the title Add comparison kernels for DictionaryArray Add scalar comparison kernels for DictionaryArray Dec 13, 2021
@jorgecarleitao
Member

IMO we need a dyn Scalar and a cmp_scalar(array: &dyn Array, scalar: &dyn Scalar) -> BooleanArray that dispatches to specific implementations based on the array's datatype. We then use cmp_scalar(dict.values().as_ref(), scalar) and clone the indices. Inspiration for trait Scalar here: https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar . I hope it helps.
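
A very rough sketch of that shape (purely illustrative -- neither Scalar nor cmp_scalar exists in arrow-rs today, and arrow2's actual trait differs):

use std::any::Any;
use arrow::array::{Array, BooleanArray};
use arrow::datatypes::DataType;
use arrow::error::Result;

// Hypothetical dynamically typed scalar, loosely inspired by arrow2's `trait Scalar`
trait Scalar: Any {
    fn data_type(&self) -> &DataType;
    fn is_valid(&self) -> bool;
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical entry point: dispatch on the array's DataType and downcast `scalar`
// to the matching concrete scalar. For a DictionaryArray it would compare the values
// array to `scalar` and then map the keys through that result.
fn cmp_scalar(array: &dyn Array, scalar: &dyn Scalar) -> Result<BooleanArray> {
    match (array.data_type(), scalar.data_type()) {
        // (DataType::Utf8, DataType::Utf8) => downcast both and call eq_utf8_scalar,
        // (DataType::Dictionary(..), _)    => recurse on the values array, clone the indices,
        _ => unimplemented!("sketch only"),
    }
}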

@alamb
Contributor

alamb commented Dec 13, 2021

IMO we need a dyn Scalar and a cmp_scalar(array: &dyn Array, scalar: &dyn Scalar) -> BooleanArray that dispatches to specific implementations based on the array's datatype. We then use cmp_scalar(dict.values().as_ref(), scalar) and clone the indices. Inspiration for trait Scalar here: https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar . I hope it helps.

@jorgecarleitao -- Indeed, this is an excellent idea. 🤔

Following the existing pattern in the comparison kernels, we will likely end up with several kernels such as

eq_dyn_lit(left: &dyn Array, right: &dyn Array)
eq_utf8_dyn_lit(left: &dyn Array, right: impl AsRef<str>)
eq_bool_dyn_lit(left: &dyn Array, right: bool)

Which might be nice as the user's Rust code could leave the scalar right strongly typed, but it would still have to call the correct kernel. It also has the upside that it wouldn't require an owned String for a Utf8 constant, but we can probably make a dyn Scalar work like that too with some finagling.

@jorgecarleitao when you say "inspiration" do you mean it would be ok to incorporate https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar into arrow-rs? I have somewhat lost track of what was needed IP clearance wise. Maybe since arrow2 is apache licensed, it is fine?

@alamb
Contributor

alamb commented Dec 13, 2021

Perhaps one possibility would be to have a trait that maps in the reverse direction? Something like playground

@shepmaster this is a very cool idea -- in fact I think that is much better than taking an impl Into<i128> as I suggested in #984 (comment)
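
The linked playground isn't reproduced here, but the "reverse direction" idea -- mapping a Rust native scalar type back to the Arrow type it belongs to -- might be sketched roughly like this (the trait name matches the IntoArrowNumericType discussed later; the exact shape is a guess):

use arrow::datatypes::{ArrowNativeType, ArrowNumericType, UInt16Type, UInt8Type};

/// Hypothetical trait: the reverse of `ArrowPrimitiveType::Native` -- given a Rust
/// native scalar type, name the Arrow type whose native representation it is.
trait IntoArrowNumericType: ArrowNativeType {
    type ArrowType: ArrowNumericType<Native = Self>;
}

impl IntoArrowNumericType for u8 {
    type ArrowType = UInt8Type;
}

impl IntoArrowNumericType for u16 {
    type ArrowType = UInt16Type;
}

// A kernel taking `right: T` where `T: IntoArrowNumericType` could then recover the
// Arrow type as `T::ArrowType`, so the caller would not need a turbofish annotation.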

@jorgecarleitao
Member

jorgecarleitao commented Dec 13, 2021

For reference (for anyone not following the gist), a DictionaryArray is essentially

struct DictionaryArray<K> {
    indices: PrimitiveArray<K>,
    values: Arc<dyn Array>,
}

that is itemwise represented as

indices
     .into_iter()
     .map(|optional_index| optional_index.map(|index| "values[index]"))

where "values[index]" represents the index of values.

@alamb 's insight is that we can compare a dictionary with a scalar of the same (dynamic) type as values by comparing values with that scalar (ignoring indices).

@jorgecarleitao
Member

"inspiration" because

  • there is DataFusion's scalar API based on enums, which we could consider instead
  • in arrow2 there is a smaller number of physical types than arrow (one per physical type as opposed to one per logical type), so it can't be copied as is
  • the scalar API is not IP-cleared; it was added to arrow2 post IP clearance :/

@alamb
Contributor

alamb commented Dec 13, 2021

the scalar API it is not IP-cleared it was added to arrow2 post IP clearance :/

Right -- I don't understand the implications of this statement (like if code is apache licensed, that means it can be incorporated into other projects, right?). So I don't really understand the IP clearance need but I remember it was complicated 🤷

On the topic of scalars -- there is a recent related mailing list discussion of adding "Constant Array" support to arrow which perhaps could serve the same / similar purpose: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

@alippai
Contributor

alippai commented Dec 13, 2021

@alamb my understanding is that any meaningful chunk of code with any license needs to be cleared if it's not original creation of the dev written for the Apache Arrow project. I believe @jorgecarleitao referred to this: that trait is not IP cleared yet, thus it needs the legal process :/

@matthewmturner
Contributor Author

@alamb @jorgecarleitao I'm not familiar enough with the rest of the code base, but would adding a dyn Scalar trait mean having to update everywhere Rust native scalars are currently used with that trait? Of course this is assuming legal issues are resolved. On that point, independent from the context of this PR, is IP clearance expected to refresh anytime soon?

The approach from @shepmaster seemed quite nice as well; however, it would still require users to set the type of the scalar - is that acceptable?

@alamb
Contributor

alamb commented Dec 14, 2021

but would adding a dyn Scalar trait mean having to update everywhere rust native scalars are currently used with that trait?

One approach could be to remove all the Rust native scalars and replace them with dyn Scalar, but another approach is to have both (keep the existing kernels with different signatures, and add a new entirely dynamic kernel eq_scalar(&dyn Array, &dyn Scalar)).

I think keeping with the existing style of separate kernels for separate Rust native types is probably the way to go for this PR.

Of course this is assuming legal issues are resolved. On that point, independent from the context of this PR, is IP clearance expected to refresh anytime soon?

I defer to @jorgecarleitao on his plans, as he is the primary author of most of arrow2

The approach from @shepmaster seemed quite nice as well; however, it would still require users to set the type of the scalar - is that acceptable?

I think it is acceptable (and a good idea) personally. It would result in perhaps something like:

/// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive numeric values
pub fn eq_dyn_scalar<T>(
    left: &dyn Array,
    right: T
) -> Result<BooleanArray>
where
    T : IntoArrowNumericType
{
...
}

?

@matthewmturner
Contributor Author

@alamb ok! I can open a new issue for adding the IntoArrowNumericType trait, which I think we would want to do separately from this PR. That would also give others an opportunity to provide an opinion and give a little time to see if there is a path forward on the IP clearance.

@alamb
Contributor

alamb commented Dec 14, 2021

@alamb ok! I can open a new issue for adding the IntoArrowNumericType trait, which I think we would want to do separately from this PR. That would also give others an opportunity to provide an opinion and give a little time to see if there is a path forward on the IP clearance.

Sounds good @matthewmturner .

In terms of Scalar / ScalarValue and IP clearance, here is what I would suggest:

  1. We implement the kernels that dynamically dispatch on the array type (eq_dyn_scalar, eq_utf8_dyn_lit and eq_bool_dyn_lit) -- I can definitely find time to help here
  2. We then file a follow-on ticket / issue for implementing fully dynamic kernels that dispatch both on array type and on scalar type

What can I do to be most helpful for this project?

@alamb
Contributor

alamb commented Dec 14, 2021

BTW for some context, the comparison kernels for dictionaries are very important (though not critically urgent) for our use case in IOx, as we have large amounts of dictionary string data.

Thus, given you are working on this, I can reallocate some non-trivial amount of my time to help. Thank you for all the effort so far @matthewmturner

cc @pauldix @jacobmarble

@jorgecarleitao
Member

Yes, what @alippai wrote. Given the number of differences that we would need to apply, re-writing it shouldn't be very difficult. My comment was really just to point out that dyn Scalar is an option here. I think we discussed this some time ago with @nevi-me (the idea of a Scalar API to help us write generic code).

@matthewmturner
Contributor Author

@alamb thank you for the additional context and all the guidance you've provided so far.

Just to expand on and summarize your plan to make sure I'm not missing any steps, can you confirm these are the steps?

  1. New issue / PR to implement IntoArrowNumericType
  2. Use 'IntoArrowNumericType' to create the different underlying xx_dict_scalar functions / macros.
  3. We create the dynamic array kernels (i.e. eq_dyn_scalar) for scalar values, using the new dict kernels created in step 2 for DictionaryArray (this PR)
  4. We create issue for implementing the Scalar api
  5. We create issue for creating dynamic array and scalar kernels

I'm happy to, and have enjoyed, working on this, but I don't want to slow down anything on the IOx side as a result of me getting up to speed on the different pieces here. If you're okay with it (and aligned on the steps above), we could continue as we have been on this, and I'll start on the first step and ping you as I have questions / for guidance. If this isn't moving at a quick enough pace then we can work on a plan for how you can assist on the implementation to speed it up?

@nevi-me
Contributor

nevi-me commented Dec 14, 2021

It's been a while, but if I recall correctly, the idea was for compute kernels in arrow to follow a similar approach to C++ where we have a Datum (or whatever we call it), which could likely be an enum of the same datatype. Within that Datum, we'd have Scalar, Array, ...

The C++ impl has a ChunkedArray which is effectively Vec<Array>, but we never went with that.

It would have been desirable to move Scalar to arrow at the time, but that's fortunately been done in arrow2.
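
For readers unfamiliar with the C++ Datum, a rough (purely hypothetical) shape in arrow-rs terms would be something like:

use arrow::array::ArrayRef;

// Hypothetical Datum-style kernel input, loosely following the C++ design; none of
// these names exist in arrow-rs. `ScalarValue` stands in for whatever scalar
// abstraction (trait object or enum) is eventually adopted.
enum ScalarValue {
    UInt8(Option<u8>),
    Utf8(Option<String>),
    Boolean(Option<bool>),
    // ... one variant per data type
}

enum Datum {
    Scalar(ScalarValue),
    Array(ArrayRef),
    ChunkedArray(Vec<ArrayRef>), // the C++ ChunkedArray equivalent, not modelled in arrow-rs
}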

@alamb
Contributor

alamb commented Dec 15, 2021

Hi @matthewmturner -- I will respond tomorrow - I ran out of time today

@alamb
Contributor

alamb commented Dec 16, 2021

Just to expand on and summarize your plan to make sure I'm not missing any steps, can you confirm these are the steps?

Yes I think those are the steps I was proposing. Thank you @matthewmturner

If you're okay with it (and aligned on the steps above), we could continue as we have been on this, and I'll start on the first step and ping you as I have questions / for guidance. If this isn't moving at a quick enough pace then we can work on a plan for how you can assist on the implementation to speed it up?

Sounds like a good plan! Thank you!

@alamb
Contributor

alamb commented Dec 16, 2021

It's been a while, but if I recall correctly, the idea was for compute kernels in arrow to follow a similar approach to C++ where we have a Datum (or whatever we call it), which could likely be an enum of the same datatype. Within that Datum, we'd have Scalar, Array, ...

That is a great idea @nevi-me -- I couldn't find an existing ticket so I wrote a new one here: #1047

Maybe we can work towards that vision, starting with the equality kernels (we'll need all the pieces and tests, even when we have a fully dynamic dispatch, I suspect)

@alamb
Contributor

alamb commented Dec 16, 2021

@matthewmturner let me know if it would help to file a ticket for IntoArrowNumericType

@matthewmturner
Contributor Author

@matthewmturner let me know if it would help to file a ticket for IntoArrowNumericType

@alamb Sure that would be great!

@matthewmturner
Contributor Author

@shepmaster one quick question on your proposal.

What would be the difference between that approach and implementing the From trait (https://doc.rust-lang.org/std/convert/trait.From.html) for ArrowNumericType on each of the Rust native types? I guess that would change the signatures / how we call the functions to something like the following:

fn eq_dict_scalar<K, T>(left: &DictionaryArray<K>, right: T)
where
    K: ArrowNumericType,
    T: ArrowNumericType,
{
    todo!()
}

// ----

fn usage<K>(d: &DictionaryArray<K>)
where
    K: ArrowNumericType,
{
    eq_dict_scalar(d, 8.into());
    eq_dict_scalar(d, -64.into());
}

@alamb FYI

@shepmaster
Contributor

If it makes sense to use From, then there should be no material difference. You can still take a generic T: Into<ArrowNumericType> to avoid forcing the caller to do extra typing.

However, I don’t think that path is good as you’d still need to provide explicit types: what is the concrete type you are converting into?

@matthewmturner
Contributor Author

However, I don’t think that path is good as you’d still need to provide explicit types: what is the concrete type you are converting into?

Actually, I think I'm mistaken and mixing up types and traits. I don't think we need to do a conversion. Sorry about that.

@alamb
Contributor

alamb commented Dec 20, 2021

@matthewmturner - I filed #1068 to track the IntoArrowNumericType trait

Before we spend time creating polished PRs for IntoArrowNumericType I think we should try and spike out a PR showing how the API would work with eq_dyn_scalar

Once we had that basic framework in place then we could fill out the implementation in individual PRs.

If it would help, I could try and take a crack at creating such a PR (likely based on this one)

@bkmgit
Contributor

bkmgit commented Dec 28, 2021

One area that still needs to be determined in C++ is interfaces for custom comparisons:

Maybe worth planning for this as well.

@alamb
Contributor

alamb commented Dec 29, 2021

Maybe worth planning for this as well.

Thanks @bkmgit -- perhaps you can file a ticket for the item ?

@alamb
Contributor

alamb commented Dec 29, 2021

I think we have had all the discussion on this ticket and have the follow-up items tracked in other tickets / issues. Thus closing this PR down.

@alamb alamb closed this Dec 29, 2021
Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Add native comparison kernel support for DictionaryArray
9 participants