Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce user-defined signature #10439

Merged
merged 16 commits into from
May 11, 2024
Merged

Introduce user-defined signature #10439

merged 16 commits into from
May 11, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented May 10, 2024

Which issue does this PR close?

Closes #10423 .

Rationale for this change

What changes are included in this PR?

Add VariadicCoercion, UniformCoercion
Remove VariadicEqual (coercion rule does not require the result type to be the same for all)

Add the datafusion error aside from the signature error message for better debugging usage that is why many test files are updated.

Are these changes tested?

UDAF code is also included but no test is covered since no UDAF requires coercion there yet (We will cover it after moving builtin to UDAF)

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules sqllogictest labels May 10, 2024
plan_datafusion_err!(
"{}",
"{} and {}",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep error for debugging

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@@ -833,12 +865,8 @@ mod test {
signature: Signature::uniform(1, vec![DataType::Float32], Volatility::Stable),
})
.call(vec![lit("Apple")]);
let plan_err = Projection::try_new(vec![udf], empty)
Projection::try_new(vec![udf], empty)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think matching err message is unnecessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the message is tested elsewhere I think this is fine

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review May 10, 2024 12:35
@alamb alamb added the api change Changes the API exposed to users of the crate label May 10, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- this looks great to me

Prior to merge, I think it is worth considering (but not required):

  1. Using a single new variant for TypeSignature (rather than 2)
  2. Improving the doc strings (I left some suggestions)
  3. Moving the semantic changes to coalsce into their own PR

///
/// `make_array(i32, i64) -> make_array(i64, i64)`
VariadicEqual,
/// One or more arguments of an arbitrary type and coerced with user-defined coercion rules.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both TypeSignature::VariadicCoercion and TypeCoercion::UniformCoercion call into ScalarUDFImpl::coerce_types I wonder if there is a reason to have both.

Maybe instead we could add a single variant for both cases. Perhaps something like:

pub enum TypeSignature {
...
  /// The acceptable signature and coercions rules to coerce arguments to this
  /// signature are special for this function. If this signature is specified,
  /// Datafusion will call [`ScalarUDFImpl::coerce_types`] to prepare argument types.
  FunctionDefined, 
...
}

}
TypeSignature::VariadicCoercion | TypeSignature::UniformCoercion(_) => {
return internal_err!(
"Coercion signature is handled in function-specific get_valid_types."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Coercion signature is handled in function-specific get_valid_types."
"Coercion signature should be handled by in function-specific coerce_types."

@@ -111,6 +111,25 @@ impl ScalarUDFImpl for MakeArray {
fn aliases(&self) -> &[String] {
&self.aliases
}

fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it!

Ok(arg_types[1].clone())
}

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
nvl2_func(args)
}

fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
let new_type = arg_types.iter().skip(1).try_fold(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this pattern to coerce all arguments to the type of the first one is repeated several times -- maybe we could make some sort of utility function to do it

As a follow on PR perhaps

@@ -833,12 +865,8 @@ mod test {
signature: Signature::uniform(1, vec![DataType::Float32], Volatility::Stable),
})
.call(vec![lit("Apple")]);
let plan_err = Projection::try_new(vec![udf], empty)
Projection::try_new(vec![udf], empty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the message is tested elsewhere I think this is fine

@@ -914,7 +942,7 @@ mod test {
.err()
.unwrap();
assert_eq!(
"type_coercion\ncaused by\nError during planning: Coercion from [Utf8] to the signature Uniform(1, [Float64]) failed.",
"type_coercion\ncaused by\nError during planning: [data_types_with_aggregate_udf] Coercion from [Utf8] to the signature Uniform(1, [Float64]) failed.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could make this error say "User defined coercsion rule failed" so it was clearer what happened

@@ -23,7 +23,7 @@ select coalesce(1, 2, 3);
1

# test with first null
query IT
query ?T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still prefer to make the change to the coalsce coercion rules as its own PR, but I am also fine if we include it in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to keep the old signature VariadicEqual for coalesce to split the change for coalesce :(. I prefer to make the change at once.

@@ -420,6 +425,11 @@ pub trait ScalarUDFImpl: Debug + Send + Sync {
fn short_circuits(&self) -> bool {
false
}

/// Coerce the types of the arguments to the types that the function expects
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice -- I think it would be nice to expand this documentation to help people understand the API some more

Perhaps something like

Suggested change
/// Coerce the types of the arguments to the types that the function expects
/// Coerce arguments of a function call to types that the function can evaluate.
///
/// This function is only called if [`Self::signature`] returns [`TypeSignature::UserDefined`]. Most
/// UDFs should return one of the other variants of `TypeSignature` which handle common
/// cases
///
/// See the [type coercion module](datafusion_expr::type_coercion)
/// documentation for more details on type coercion
///
/// For example, if your function requires a floating point arguments, but the user calls
/// it like `my_func(1::int)` (aka with `1` as an integer), coerce_types could return `[DataType::Float64]`
/// to ensure the argument was cast to `1::double`
///
/// # Parameters
/// * `arg_types`: The argument types of the arguments this function with
///
/// # Return value
/// A Vec the same length as `arg_types`. DataFusion will `CAST` the function call
/// arguments to these specific types.

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thanks @jayzhan211

I had one question on a comment in the docstrings but we could fix that as a follow on as well

/// Performs type coercion for function arguments.
///
/// Returns the data types to which each argument must be coerced to
/// match `signature`.
///
/// For more details on coercion in general, please see the
/// [`type_coercion`](crate::type_coercion) module.
///
/// This function will be replaced with [data_types_with_scalar_udf],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we want to replace this function over time -- I think having the basic simple Signatures that handle most common coercions makes sense to have in DataFusion core (even if it could be done purely in a udf) as it will make creating UDFs easier for users

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 changed the title Introduce coercion signature VariadicCoercion and UniformCoercion Introduce user-defined signature May 11, 2024
@jayzhan211 jayzhan211 merged commit 76f9e2e into apache:main May 11, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented May 11, 2024

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support "User defined coercion" rules
2 participants