Add function origin API to replace name-based function checks in the optimizer #20868

Open
alexandreyc wants to merge 6 commits into apache:main from alexandreyc:fix-18643

@alexandreyc
Contributor

alexandreyc commented Mar 11, 2026

Which issue does this PR close?

Rationale for this change

See the issue.

What changes are included in this PR?

I followed the proposal made by @2010YOUY01 in #18643.

  • Added an is_builtin() -> bool method to AggregateUDF/AggregateUDFImpl and to WindowUDF/WindowUDFImpl
  • Updated all implementations of those traits
  • Updated occurrences of matching on function names to additionally check the origin of the function
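
As a rough illustration of the third bullet, the sketch below pairs a name match with an origin check. The trait `AggregateLike`, the two structs, and both helper functions are hypothetical stand-ins rather than DataFusion's actual types; only the `is_builtin() -> bool` method name comes from this PR.

```rust
/// Hypothetical stand-in for `AggregateUDFImpl`; not the real DataFusion trait.
trait AggregateLike {
    fn name(&self) -> &str;
    /// Method proposed in this PR: true only for functions shipped with DataFusion.
    fn is_builtin(&self) -> bool;
}

/// Stand-in for a built-in `count` aggregate.
struct BuiltinCount;

impl AggregateLike for BuiltinCount {
    fn name(&self) -> &str {
        "count"
    }
    fn is_builtin(&self) -> bool {
        true
    }
}

/// A user-defined aggregate that happens to reuse the name `count`.
struct UserCount;

impl AggregateLike for UserCount {
    fn name(&self) -> &str {
        "count"
    }
    fn is_builtin(&self) -> bool {
        false
    }
}

/// Name-only check: also matches the user's function.
fn is_count_by_name(f: &dyn AggregateLike) -> bool {
    f.name() == "count"
}

/// Name plus origin check: only matches the built-in.
fn is_builtin_count(f: &dyn AggregateLike) -> bool {
    f.is_builtin() && f.name() == "count"
}
```

With the name-only check, a rewrite that is valid only for the built-in semantics could be applied to `UserCount` as well; the combined check rules that out.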

Are these changes tested?

Not directly, but I can't see a relevant way to test this. Suggestions are welcome.

Are there any user-facing changes?

Currently yes, but the change could be made non-breaking; see the questions below.

Request for advice

I'm new to the codebase so feel free to challenge this PR. In particular, I'd like to have your opinion on the following items:

  1. Should we give is_builtin a default implementation that returns false? That would make this change non-breaking for users and slightly simplify this PR. In return, however, implementing built-in functions would become more error-prone.
  2. Should we add the is_builtin method to ScalarUDF/ScalarUDFImpl? I didn't do it because there doesn't seem to be any scalar UDF name matching in the codebase. It might be desirable to add it for the sake of consistency across all kinds of UDF.
  3. Should we replace is_builtin() -> bool by origin() -> UDFOrigin? UDFOrigin would be something like enum { BuiltIn, Spark, UserDefined }. Asking because it's not clear to me if functions in the datafusion_spark crates should be considered built-in or not.
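
To make the trade-off in item 1 concrete, here is a minimal sketch in which `is_builtin` defaults to `false`. The trait `WindowLike` and the three structs are hypothetical and only loosely modelled on `WindowUDFImpl`. Existing user code keeps compiling, but a built-in that forgets the override also compiles and is silently reported as not built-in.

```rust
/// Hypothetical stand-in for `WindowUDFImpl`; not the real DataFusion trait.
trait WindowLike {
    fn name(&self) -> &str;
    /// The default keeps existing user implementations compiling unchanged.
    fn is_builtin(&self) -> bool {
        false
    }
}

/// Existing user code: no change needed thanks to the default.
struct MyRank;

impl WindowLike for MyRank {
    fn name(&self) -> &str {
        "my_rank"
    }
}

/// A built-in that correctly overrides the default.
struct RowNumber;

impl WindowLike for RowNumber {
    fn name(&self) -> &str {
        "row_number"
    }
    fn is_builtin(&self) -> bool {
        true
    }
}

/// A built-in whose author forgot the override: it compiles, but reports `false`.
struct ForgottenBuiltin;

impl WindowLike for ForgottenBuiltin {
    fn name(&self) -> &str {
        "lag"
    }
}
```

Without the default, the compiler would reject `ForgottenBuiltin` until the method is written, which is the safety that the non-breaking variant gives up.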

@github-actions bot added the logical-expr, optimizer, core, substrait, proto, functions, ffi, and spark labels Mar 11, 2026
@2010YOUY01 changed the title from "Fix #18643" to "Add function origin API to replace name-based function checks in the optimizer" Mar 13, 2026
@coderfender

This seems like a major change to DF. Perhaps we could break it into smaller PRs (if that is even possible)?

@2010YOUY01
Contributor

Thank you for the help!

  1. Should we replace is_builtin() -> bool by origin() -> UDFOrigin? UDFOrigin would be something like enum { BuiltIn, Spark, UserDefined }. Asking because it's not clear to me if functions in the datafusion_spark crates should be considered built-in or not.

I think the origin() -> UDFOrigin API is better: it allows more flexibility for other potential uses. For example, if two functions with the same name but from different dialects are registered in the same context, we may want to check their origin at runtime.
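
A minimal sketch of that scenario, assuming the enum shape quoted above. The `Registered` struct and the `resolve` helper are hypothetical and do not reflect how DataFusion's function registry is actually keyed.

```rust
/// Enum shape taken from the PR description; the derives are an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UDFOrigin {
    BuiltIn,
    Spark,
    UserDefined,
}

/// Hypothetical registry entry holding just a name and an origin.
struct Registered {
    name: &'static str,
    origin: UDFOrigin,
}

/// Pick the function with the given name from the requested origin, if any.
fn resolve<'a>(
    funcs: &'a [Registered],
    name: &str,
    origin: UDFOrigin,
) -> Option<&'a Registered> {
    funcs.iter().find(|f| f.name == name && f.origin == origin)
}
```

A boolean `is_builtin()` could not distinguish a Spark-dialect function from a user-defined one sharing the same name, whereas the enum can.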

I suggest we first wait several days to see if there are other opinions.

@alexandreyc
Contributor Author

Thanks @2010YOUY01 for your reply.

I updated the PR replacing is_builtin() -> bool by origin() -> UDFOrigin where UDFOrigin is enum { BuiltIn, SparkCompat, UserDefined }.

Also, I added a default implementation for the new method that returns UDFOrigin::UserDefined so that we don't break existing users. This also makes the PR slightly shorter.
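
For readers skimming the thread, the updated design can be sketched roughly as follows. The enum variants and the default return value are as described in this comment; the trait `UdfLike`, the structs, the derives, and the `is_builtin_sum` helper are hypothetical illustrations, not code from the PR.

```rust
/// Variants as described in this comment; the derives are an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UDFOrigin {
    BuiltIn,
    SparkCompat,
    UserDefined,
}

/// Hypothetical stand-in for `AggregateUDFImpl` / `WindowUDFImpl`.
trait UdfLike {
    fn name(&self) -> &str;
    /// Defaulting to `UserDefined` means existing implementations need no change.
    fn origin(&self) -> UDFOrigin {
        UDFOrigin::UserDefined
    }
}

/// Existing user code, unchanged: it picks up the default origin.
struct UserSum;

impl UdfLike for UserSum {
    fn name(&self) -> &str {
        "sum"
    }
}

/// Stand-in for a built-in function, which overrides the default.
struct BuiltinSum;

impl UdfLike for BuiltinSum {
    fn name(&self) -> &str {
        "sum"
    }
    fn origin(&self) -> UDFOrigin {
        UDFOrigin::BuiltIn
    }
}

/// Stand-in for a Spark-compatible function.
struct SparkSum;

impl UdfLike for SparkSum {
    fn name(&self) -> &str {
        "sum"
    }
    fn origin(&self) -> UDFOrigin {
        UDFOrigin::SparkCompat
    }
}

/// Optimizer-style guard: check the origin first, then the name.
fn is_builtin_sum(f: &dyn UdfLike) -> bool {
    matches!(f.origin(), UDFOrigin::BuiltIn) && f.name() == "sum"
}
```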

@github-actions github-actions bot removed the proto Related to proto crate label Mar 16, 2026

Development

Successfully merging this pull request may close these issues.

Avoid check function type by matching names in the optimizer
