Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AnyDictionary Abstraction and Take ArrayRef in DictionaryArray::with_values #4707

Merged
merged 3 commits into from
Aug 17, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

apache/datafusion#5471 (comment)

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 16, 2023
}

impl<K: ArrowDictionaryKeyType> AnyDictionaryArray for DictionaryArray<K> {
fn keys(&self) -> &dyn Array {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will tie in nicely with #4705

@@ -460,21 +461,18 @@ impl<K: ArrowDictionaryKeyType> DictionaryArray<K> {
/// }
/// ```
///
pub fn with_values(&self, values: &dyn Array) -> Self {
pub fn with_values(&self, values: ArrayRef) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change, and makes this API consistent with the other constructors and avoids needing to go via ArrayData

@tustvold tustvold added the api-change Changes to the arrow API label Aug 16, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- I think this is getting close. The last missing part in my mind is applying a transform to the values to operate on the array (aka a version of unary that works on dyn Array)

///
/// This can be used to efficiently implement kernels for all possible dictionary
/// keys without needing to create specialized implementations for each key type
pub fn as_any_dictionary_array(array: &dyn Array) -> Option<&dyn AnyDictionaryArray> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can also add as_any_dictionary and as_any_dictionary_opt to AsArray?:

https://docs.rs/arrow/latest/arrow/array/trait.AsArray.html#method.as_dictionary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to AsArray

@@ -82,7 +82,7 @@ where
{
let dict_values = array.values().as_any().downcast_ref().unwrap();
let values = unary::<T, F, T>(dict_values, op);
Ok(Arc::new(array.with_values(&values)))
Ok(Arc::new(array.with_values(Arc::new(values))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about deprecating unary_dict and instead adding something that takes an AnyDictionaryArray ?

I feel like the common pattern is:

/// Apply op to the values of `array` and return a new array, with the same keys and type but transformed values 
fn unary_dict(array: &dyn DictionaryArrayAny, op: F) -> Result<ArrayRef, ArrowError>
where:
     F: Fn(T::Native) -> T::Native,
    F: Fn(T::Native) -> T::Native,

Copy link
Contributor Author

@tustvold tustvold Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would entail generating code that is generic over both the dictionary and all the primitive values, which is precisely what we are trying to avoid doing. This categorically should not be the pattern.

Instead users should write a function with a kernel like

fn my_amazing_kernel(a: &dyn Array) -> Result<ArrayRef> {
    if let Some(a) = as_any_dictionary_array(a) {
        let values = my_amazing_kernel(a.values())?;
        Ok(a.with_values(values))
    }

    downcast_primitive_array! {
        a => ...
    }
}

What do you think about deprecating unary_dict

Yes, I eventually would hope that we can deprecate and remove unary_dict, it is a deeply problematic function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b95d762 deprecates unary_dict and adds an example of how to handle this correctly

@tustvold
Copy link
Contributor Author

Integration test failure seems unrelated, will see if sorts itself out tomorrow, if not I'll raise an upstream report

@tustvold tustvold merged commit f0200db into apache:master Aug 17, 2023
25 checks passed
}

/// Applies an infallible unary function to an array with primitive values.
#[deprecated(note = "Use arrow_array::AnyDictionaryArray")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold @alamb saw this when upgrading to the latest DF 31.0.0, and I think this deprecation note doesn't look very helpful: should users implement the functionalities of unary_dyn by themselves using things like as_any_dictionary_opt? shall we just change the implementation of unary_dyn using the new APIs instead?

Copy link
Contributor Author

@tustvold tustvold Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My major motivation for deprecating it was to move away from these quasi-generic APIs, that are the worst of both worlds. They have to parameterised on the value type, whilst also not being properly generic. Additionally in this particular case it results in codegen for each type of key type, and returns a PrimitiveArray when it could preserve the dictionary.

There is an example on AnyDictionaryArray of how to handle this better - https://docs.rs/arrow-array/latest/arrow_array/array/trait.AnyDictionaryArray.html. Swapping the with_values for the take kernel will hydrate the dictionary as a PrimitiveArray if that is the desired behaviour, this will be both faster and result in less codegen

TLDR this function has very peculiar semantics that I think encourage a problematic pattern

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally in this particular case it results in codegen for each type of key type, and returns a PrimitiveArray when it could preserve the dictionary.

I can understand it has to do case analysis for different key types, but why it would return a PrimitiveArray unnecessarily? it does call unary_dict which preserves the dictionary right?

Yea, when seeing the deprecation message, I was expecting that a equivalent function would be provided for users to migrate to, instead of having to find a way to replicate the logic by themselves. I'm thinking we can come up with a new implementation for unary_dyn without the key type enumeration problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does call unary_dict which preserves the dictionary right?

Aah yeah sorry, I'm confusing this with one of the other recently removed APIs

I'm thinking we can come up with a new implementation for unary_dyn without the key type enumeration problem

We could do, I would prefer to encourage users to not implement type dispatch at this level. Why duplicate the any-dictionary dispatch logic for every type of primitive value, and not just handle the dispatch logic once?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm sorry, not sure I understand this. I think it would still be valuable to have a function similar to unary but works for dictionary arrays? unary_dyn bridges the gap by taking an ArrayRef as input so users don't have to do the special handling of dictionary array. Otherwise, I guess they all have to repeat the steps of:

    if let Some(d) = a.as_any_dictionary_opt() {
        // Recursively handle dictionary input
        let r = my_kernel(d.values().as_ref())?;
        return Ok(d.with_values(r));
    }

by themselves?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry, it should be something like:

pub fn my_function(array: ArrayRef, to_type: &DataType) -> ArrayRef {
    match to_type {
        DataType::Int64 => {
            unary_dyn::<_, Int64Type>(&array, |v| v.div_floor(MICROS_PER_SECOND)).unwrap()
        },
        ...
    }
}

basically we are doing some additional casting on top of the input array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't really follow, but this is sounding sufficiently use-case specific that perhaps it doesn't belong upstream??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea sure, we can keep this logic in our own repo for now. Just wanna raise this in case someone else run into the same issue :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, I guess they all have to repeat the steps of:

I agree the pattern of I have a function I want to apply to an array and I would like it applied either to a non-dictionary or dictionary encoded version of that type is very common.

It would be great to avoid clients having to write stuff like

    if let Some(d) = a.as_any_dictionary_opt() {
        // Recursively handle dictionary input
        let r = my_kernel(d.values().as_ref())?;
        return Ok(d.with_values(r));
    } else {
       // apply my kernel to dictionary input)
       return my_kernel(a.as_ref())
    }

Which forces users to:

  1. Remember to handle dictionaries
  2. won't automatically work for things like REE (or StringViews)

Copy link
Contributor Author

@tustvold tustvold Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBC I'm not objecting to what @alamb proposes, just observing that this function doesn't actually provide that interface.

If we can find a way to avoid people needing to duplicate logic, I'm all for it, I've just not been able to devise such an interface

The challenge is Rust doesn't support higher rank types, so you can't pass a generic function as an argument.

Perhaps macros might work... But they have their own issues...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants