Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UDAF: Add more fields to state fields #10391

Closed
wants to merge 7 commits into from

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented May 6, 2024

Which issue does this PR close?

Closes #.

I check the field and state_fields that are used for most of the aggregate functions. I think it is nice to split them out as a PR

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the logical-expr Logical plan and expressions label May 6, 2024
@@ -152,16 +157,21 @@ pub trait AggregateExpr: Send + Sync + Debug + PartialEq<dyn Any> {
pub struct AggregateFunctionExpr {
fun: AggregateUDF,
args: Vec<Arc<dyn PhysicalExpr>>,
/// input type
input_type: DataType,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found most of the builtin function name is input_data_type so I rename it.

///
/// [`format_state_name`]: crate::utils::format_state_name
fn state_fields(&self, _args: StateFieldsArgs) -> Result<Vec<Field>> {
not_impl_err!("state_fields hasn't been implemented for {self:?} yet")
Copy link
Contributor Author

@jayzhan211 jayzhan211 May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found most of the state_fields have their own version, so return err if not defined now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be an API change -- I think if we want to change it we should also update the documentation to reflect the change

@jayzhan211 jayzhan211 added the api change Changes the API exposed to users of the crate label May 6, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the core Core datafusion crate label May 6, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review May 6, 2024 14:49
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- I almost feel like this PR is a solution in search of a problem. Specifically, is there a practical problem that the current API can't handle that this could ?

If this doesn't add any value to users but costs them (in terms of API churn) I am not sure it makes sense to make this change

@@ -287,18 +286,23 @@ pub trait AggregateUDFImpl: Debug + Send + Sync {
/// aggregate function was called.
fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>>;

/// Return the fields for the function
fn field(&self, _args: FieldArgs) -> Result<Field> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same information as return_type? If so, perhaps we should deprecate the return_type function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return_type is part of the information in field and state_field not replaceable to field

///
/// [`format_state_name`]: crate::utils::format_state_name
fn state_fields(&self, _args: StateFieldsArgs) -> Result<Vec<Field>> {
not_impl_err!("state_fields hasn't been implemented for {self:?} yet")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be an API change -- I think if we want to change it we should also update the documentation to reflect the change

Field::new(
format_state_name(args.name, "mean1"),
DataType::Float64,
args.nullable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't some of these fields nullable even if their input is not nullable (e.g. mean with no inputs is null)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this should be true

pub nullable: bool,
}

impl<'a> FieldArgs<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the fields are pub, I wonder how much benefit this API adds over simply creating FieldArgs directly 🤔

Copy link
Contributor Author

@jayzhan211 jayzhan211 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But like name, it includes arguments, and not every time is the same, how do they get the correct field without args.name?

@jayzhan211
Copy link
Contributor Author

@alamb This change is based on what the built-in function does, and what we need for UDAF to have the same behavior. If the change is not clear if it is necessary, we can also change it when moving the function to UDAF gradually.

This PR exists because 1. breaking API changes at once, so we don't need to keep breaking API after that. 2. Split the change out can also reduce the review size of other functions moving (but less straightforward if the change is worth it)

if we found out there is a better way to rewrite what the current builtin function does, the change here might be useless, so I think both way has pros and cons

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented May 7, 2024

StateFieldsArgs
DistinctArrayAgg needs nullable
Sum needs return_type
Most of the functions needs input_type.

FieldArgs
Avg needs return_type
Most of the functions needs input_type.
OrderSensitiveArrayAgg needs nullable and order_by_data_types

Upd:
I was mislead by the naming, it seems we just need the value_type what we have now, which most of the functions are input_type, while functions like Sum or Count has a different type (return_type).

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as draft May 8, 2024 05:22
@alamb
Copy link
Contributor

alamb commented May 8, 2024

@alamb This change is based on what the built-in function does, and what we need for UDAF to have the same behavior. If the change is not clear if it is necessary, we can also change it when moving the function to UDAF gradually.

Ah, interesting, I didn't realize this. Sorry

I think it would be useful to port UDFs over one by one, and as we find gaps in the APIs we can adjust them

Or if we want to discover all the potential issues we could do a WIP PR (as you have done already) to try to move the aggregates and when we hit a block fix the underlying issues in a different PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core datafusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants