ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs #8032

jorgecarleitao · 2020-08-23T15:24:55Z

See associated issue and document for details.

The gist is that currently, users call UDFs through

df.select(scalar_functions("my_sqrt", vec![col("a")], DataType::Float64))

and this PR proposes a change to

let functions = df.registry()?;

df.select(functions.udf("my_sqrt", vec![col("a")])?)

so that they do not have to remember the UDF's return type when using it (and a whole lot of other things for us internally).
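To illustrate the idea for readers following along, here is a minimal sketch of why a registry lookup removes the need to pass the return type at the call site. Everything in it (`DataType`, `Expr`, `Registry`, `register`) is a simplified stand-in invented for this example and is not DataFusion's actual API; only the shape of the `udf(name, args)` call mirrors the proposal above.

```rust
use std::collections::HashMap;

/// Simplified stand-in for Arrow's `DataType` (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float32,
    Float64,
    Utf8,
}

/// Simplified logical expression (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    /// A UDF call that carries the return type recorded at registration time.
    ScalarUdf {
        name: String,
        args: Vec<Expr>,
        return_type: DataType,
    },
}

/// Convenience constructor for a column reference.
pub fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

/// Hypothetical registry: maps a UDF name to the return type declared for it.
#[derive(Default)]
pub struct Registry {
    return_types: HashMap<String, DataType>,
}

impl Registry {
    /// The return type is stated once, when the UDF is registered.
    pub fn register(&mut self, name: &str, return_type: DataType) {
        self.return_types.insert(name.to_string(), return_type);
    }

    /// Builds a UDF call; the caller no longer supplies the return type.
    /// Fails if the name was never registered.
    pub fn udf(&self, name: &str, args: Vec<Expr>) -> Result<Expr, String> {
        let return_type = self
            .return_types
            .get(name)
            .cloned()
            .ok_or_else(|| format!("UDF '{}' is not registered", name))?;
        Ok(Expr::ScalarUdf {
            name: name.to_string(),
            args,
            return_type,
        })
    }
}
```

With this shape, `registry.udf("my_sqrt", vec![col("a")])` either yields an expression whose type is already known or an error for an unknown name, which is why the proposed call ends in `?`.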

github-actions · 2020-08-23T15:31:52Z

https://issues.apache.org/jira/browse/ARROW-9836

andygrove · 2020-08-24T19:41:41Z

For built-in functions like sqrt I would expect DataFusion to provide convenience functions to create an expression, like we do with col and the aggregate functions. I assume we could also do that with the design proposed here?

For example, I would like to be able to write:

df.select(vec![col("foo"), sqrt(col("bar"))])?

jorgecarleitao · 2020-08-25T04:56:03Z

> For built-in functions like sqrt I would expect DataFusion to provide convenience functions to create an expression, like we do with col and the aggregate functions. I assume we could also do that with the design proposed here?
>
> For example, I would like to be able to write:
> df.select(vec![col("foo"), sqrt(col("bar"))])?

This PR does not support this, as it treats every function (built-in or not) equally. To include that case, IMO this PR needs to add a new enum in the logical Expr:

Expr::ScalarFunction { name: String/Enum, args: Vec<Expr> }, whose return type we know logically from its name (e.g. "sqrt"), exactly like Expr::BinaryExpr. This is mapped to a physical expression during planning. These can be built without access to the registry, as we hard-code the return type on the logical plan to be consistent with the physical one, like we do for our aggregates, binary expressions, etc.
Expr::ScalarUDF { fun: ScalarFunction, args: Vec<Expr> }, whose return type is only known after going to the registry to check what the user set its return type to be (as this PR currently does).

sqrt would return Expr::ScalarFunction, which knows its own return type, and the planner converts it to ScalarFunction via a hard-coded map, while Expr::ScalarUDF's physical planning just plans the args and passes them to ScalarFunction, like this PR already does.

I.e. at the physical level, built-in and UDFs are indistinguishable, but at the logical plan, one only knows its name (built-in), the other also knows its physical representation ScalarUDF.
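A rough sketch of the two logical variants described above, assuming simplified types. `BuiltinFunction`, `ScalarUdfDef`, and `function_return_type` are hypothetical names chosen for this example; the real `Expr` in the PR has many more variants and different field types.

```rust
use std::sync::Arc;

/// Simplified stand-in for Arrow's `DataType` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float64,
    Utf8,
}

/// Built-in functions are part of the source code, so their return types
/// can be hard-coded and need no registry lookup.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BuiltinFunction {
    Sqrt,
    Concat,
}

impl BuiltinFunction {
    pub fn return_type(self) -> DataType {
        match self {
            BuiltinFunction::Sqrt => DataType::Float64,
            BuiltinFunction::Concat => DataType::Utf8,
        }
    }
}

/// What a registry might hand out for a UDF (hypothetical shape).
#[derive(Debug)]
pub struct ScalarUdfDef {
    pub name: String,
    pub return_type: DataType,
}

#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    /// Built-in: identified by name only, type derived from the name.
    ScalarFunction { fun: BuiltinFunction, args: Vec<Expr> },
    /// UDF: carries the definition obtained from the registry.
    ScalarUDF { fun: Arc<ScalarUdfDef>, args: Vec<Expr> },
}

pub fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

/// Convenience constructor: no registry access is needed for a built-in.
pub fn sqrt(arg: Expr) -> Expr {
    Expr::ScalarFunction { fun: BuiltinFunction::Sqrt, args: vec![arg] }
}

impl Expr {
    /// Return type of a function call. `None` for a plain column, whose
    /// type would come from the input schema in a real planner.
    pub fn function_return_type(&self) -> Option<DataType> {
        match self {
            Expr::Column(_) => None,
            Expr::ScalarFunction { fun, .. } => Some(fun.return_type()),
            Expr::ScalarUDF { fun, .. } => Some(fun.return_type.clone()),
        }
    }
}
```

The point of the split is visible in `sqrt`: it builds an expression with a known type from nothing but its argument, whereas the `ScalarUDF` variant can only be built once a definition has been fetched from somewhere.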

alamb · 2020-08-25T13:50:59Z

For anyone else reading along, the associated document I think is https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?usp=sharing

rust/datafusion/src/execution/context.rs

rust/datafusion/src/logicalplan.rs

rust/datafusion/src/sql/planner.rs

rust/datafusion/src/execution/dataframe_impl.rs

rust/datafusion/src/execution/physical_plan/functions.rs

jorgecarleitao · 2020-08-25T20:10:29Z

@alamb thank you very much for your comments, I will work on addressing them now. I am still learning the Arc/Box/Ref, so thank you a lot for also teaching me.

@andygrove , I agree with you that built-in functions should not require access to the registry. Unfortunately, doing so required some re-work, which is the reason I retracted #7967 back to draft to focus on this one first.

I pushed a new commit to this PR to address this point. Specifically, that commit adds:

a new enum with all built-in functions
functional glue between the logical plan and the physical plan, so that the functions' return types are invariant
type coercion on built-in functions moved to the physical plane, to preserve schema invariance during planning

I am pretty happy with this PR, as IMO it has the flexibility we need to expand DataFusion's pool of built-in functions to multiple input and return types. The main features of this PR:

users no longer have to pass the return type of the UDF when calling them (the proposal)
planning built-in functions continues to not need access to the registry (@andygrove 's point)
built-in functions now support multiple input types (e.g. sqrt(f32), sqrt(f64))
built-in functions now support multiple return types (e.g. sqrt(f32) -> f32, sqrt(f64) -> f64)
coercion rules are no longer applied in the sql planning or physical planning to built-in functions, to avoid breaking schema invariance during planning
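As a small illustration of "multiple input types, multiple return types", a return type can be computed from the argument type rather than fixed per function. This is only a sketch of the `sqrt(f32) -> f32, sqrt(f64) -> f64` example from the list above; `DataType` and `sqrt_return_type` are hypothetical stand-ins, and the comment below notes that the valid return types of the math functions were not finalized in this PR.

```rust
/// Simplified stand-in for Arrow's `DataType` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Float32,
    Float64,
}

/// Maps the argument type of `sqrt` to its return type.
/// Unsupported input types are rejected instead of silently accepted.
pub fn sqrt_return_type(arg: &DataType) -> Result<DataType, String> {
    match arg {
        DataType::Float32 => Ok(DataType::Float32),
        DataType::Float64 => Ok(DataType::Float64),
        other => Err(format!("sqrt does not support {:?}", other)),
    }
}
```

Because the same function is consulted by both the logical and the physical plan in this design, the two cannot disagree about the output schema.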

I have not completed the valid return types of built-in math functions as this PR was already too long.

Overall, I think that this has not been a pleasant experience for you @andygrove and @alamb, as I constantly open and close PRs around functions/UDFs, and for that I am really sorry. I've been hitting one design challenge after another, which requires me to go back and forth.

I am still in pursuit of my original quests:

built-in aggregate functions whose logical types are known from the physical expressions
type coercion on aggregate functions
built-in aggregate functions whose return types (e.g. min(f32) -> f32, min(f64) -> f64) are directly derived from the physical plan (there is an old fixme/todo in the code around that)
aggregate udfs
udfs with multiple incoming and return types, to bring them to the same level of functionality of built-ins
planning a udf without registering it (a-la spark) in the DF's API.

I have code for some of this, I... just... need... to... finish... the... scalar... stuff... first... 😃

…nctions. Deprecates "Field" as argument to the UDF declaration, since we are only using its type. This is a spin-off of #8032 with a much smaller scope, as the other one is getting too large to handle. Closes #8045 from jorgecarleitao/clean_args Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

alamb · 2020-08-27T10:35:27Z

Good luck -- at this stage of a project (when architecture is changing a bunch) I know it is hard to make small / easy to review PRs. I hope the comments are helpful and I am sorry I don't have more time to devote to reviews.

jorgecarleitao · 2020-08-29T08:17:56Z

FYI, this is what a PR to add support for f32 to mathematical expressions (keeping the return type f64) looks like:
https://github.com/jorgecarleitao/arrow/pull/1

I.e. IMO with this PR we can support almost any built-in function: fixed type, variable return type, multiple input types, etc on Expr::ScalarFunction, which gives a lot of flexibility to add a new function, as we do not need to fiddle with Expr, only with input types, return types, etc. on the physical plane.

I split built-ins from the UDFs because the built-ins' type is known without access to the registry, which is currently required to allow users to use them outside the registry.

jorgecarleitao · 2020-08-30T04:24:11Z

FYI, this is what a PR would look like for the concatenate function: https://github.com/jorgecarleitao/arrow/pull/2/files using this API.

I am not advocating that we follow this design; option 3 in this comment would also be fine.

My point is that regardless of which option we pick, we will need to have the functionality in this PR:

function signatures
methods to check and coerce types based on those signatures
methods to return the return type of the function
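The three pieces listed above can be sketched roughly as follows. The `Signature` variants, `can_coerce`, and `coerce` are hypothetical names and toy rules made up for this illustration; they are not the definitions used in the PR.

```rust
/// Simplified stand-in for Arrow's `DataType` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Float32,
    Float64,
    Utf8,
}

/// Hypothetical signature kinds a function can declare.
pub enum Signature {
    /// Exactly one argument, of any of the listed types.
    OneOf(Vec<DataType>),
    /// One or more arguments, all of the given type.
    Variadic(DataType),
}

/// Toy implicit-cast rules: identity plus a few numeric widenings.
fn can_coerce(from: &DataType, to: &DataType) -> bool {
    from == to
        || matches!(
            (from, to),
            (DataType::Int32, DataType::Float32)
                | (DataType::Int32, DataType::Float64)
                | (DataType::Float32, DataType::Float64)
        )
}

/// Checks `args` against `sig` and returns the types the arguments must be
/// cast to, or an error if no valid coercion exists.
pub fn coerce(sig: &Signature, args: &[DataType]) -> Result<Vec<DataType>, String> {
    match sig {
        Signature::OneOf(valid) => {
            if args.len() != 1 {
                return Err(format!("expected 1 argument, got {}", args.len()));
            }
            // Pick the first declared type the argument can be cast to.
            valid
                .iter()
                .find(|t| can_coerce(&args[0], t))
                .map(|t| vec![t.clone()])
                .ok_or_else(|| format!("{:?} is not a valid argument type", args[0]))
        }
        Signature::Variadic(t) => {
            if args.is_empty() {
                return Err("expected at least 1 argument".to_string());
            }
            if args.iter().all(|a| can_coerce(a, t)) {
                Ok(vec![t.clone(); args.len()])
            } else {
                Err(format!("all arguments must be coercible to {:?}", t))
            }
        }
    }
}
```

A return-type method would then be applied to the coerced types, so that the type reported at planning time matches what the physical expression produces.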

jorgecarleitao · 2020-08-30T05:12:33Z

And finally, this is how we would add the array function, that receives an arbitrary but uniformly-typed number of arguments: https://github.com/jorgecarleitao/arrow/pull/3/files

To summarize:

math functions:
- accept 1 argument of a fixed type (f32 or f64)
- return a fixed type (f64)
concat:
- accepts an arbitrary number of arguments of fixed type (utf8 atm, but large is trivial to add)
- returns a fixed type (utf8)
array:
- accepts an arbitrary number of arguments of variable but uniform types (i.e. all arguments must be of equal type)
- returns FixedSizeList(input_types[0], input.len())

The API that I am proposing here addresses all these cases out of the box. The 3 PRs in my repo,

all math accepts f32: https://github.com/jorgecarleitao/arrow/pull/1/files
concatenate for utf8: https://github.com/jorgecarleitao/arrow/pull/2/files
array for utf8: https://github.com/jorgecarleitao/arrow/pull/3/files

add support to each of them at the physical and logical level.

This PR also includes all the required coercion rules for this to work. E.g. array(f32, i32, u32) would be coerced to array(f32, f32, f32) (first argument dominates).
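To make the "first argument dominates" rule concrete, here is a minimal sketch for the `array` case. `DataType` is a simplified stand-in (the list length is a `usize` here purely for convenience), `coerce_array_args` and `array_return_type` are hypothetical names, and the sketch assumes every argument can be cast to the first argument's type, which a real implementation would have to check.

```rust
/// Simplified stand-in for Arrow's `DataType` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    UInt32,
    Float32,
    /// Element type and number of elements.
    FixedSizeList(Box<DataType>, usize),
}

/// Coerces every argument to the type of the first one.
/// Assumption: castability of the other arguments is not verified here.
pub fn coerce_array_args(args: &[DataType]) -> Result<Vec<DataType>, String> {
    let first = args
        .first()
        .ok_or_else(|| "array requires at least one argument".to_string())?;
    Ok(vec![first.clone(); args.len()])
}

/// Return type of `array`: FixedSizeList(input_types[0], input.len()).
pub fn array_return_type(args: &[DataType]) -> Result<DataType, String> {
    let coerced = coerce_array_args(args)?;
    Ok(DataType::FixedSizeList(
        Box::new(coerced[0].clone()),
        coerced.len(),
    ))
}
```

Under these assumptions, `array(f32, i32, u32)` is planned as `array(f32, f32, f32)` and reports a fixed-size list of three `f32` values as its type.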

@alamb

…s for built-in functions

@alamb and @andygrove , I was able to split #8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in #7967. It offers a solution that covers the three main cases:

* single return type, such as `sqrt -> f64`
* finite set of return types, such as `concat` (utf8 and largeUTF8)
* potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large.

---------------

This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in #8032 and #7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk))
2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](#7967 (comment)))
3. we would like users to plan built-in functions without accessing the registry (see [here](#8032 (comment)) and mailing list)
4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`).
5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

* 1. and 2. requires access to built-in function's return type function during planning
* 4. and 5. requires access the registry to know the built-in's return type
* 3. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

* builtin functions have a well declared return type during planning, since they are part of the source code
* builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

Closes #8080 from jorgecarleitao/functions

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

jorgecarleitao · 2020-09-03T06:46:39Z

After a long digression through the realm of built-ins, this has now been simplified and rebased against master.

@andygrove and @alamb , ready for a re-review.

Again, the core goal here is to allow users to use UDFs without having to worry about their return type.

I've incorporated all points from @alamb and @andygrove so far:

Registry is now a trait
Arc is used consistently
built-ins continue to be available without access to the registry

alamb

I really like this code -- I think the UDF usage / definition is clearer now and the code looks better. Really nice work @jorgecarleitao

rust/datafusion/src/logical_plan/mod.rs

alamb · 2020-09-03T11:16:41Z

rust/datafusion/src/logical_plan/mod.rs

-        } else {
-            expr.clone()
-        };
+        let mut projected_expr = vec![];


I want to verify my understanding -- this is a code cleanup that is not directly required for UDFs, right?

Indirectly, it is related: expr.contains(&Expr::Wildcard) requires PartialEq, which was dropped in this line due to the addition of an Arc<ScalarFunction> in this line.

However, since if expr.contains(&Expr::Wildcard) thankfully was entirely optional, I dropped it ^_^
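For context on why `PartialEq` had to go, a small sketch under simplified assumptions: once an expression holds a reference-counted closure, equality can no longer be derived, and a wildcard check has to be done by pattern matching instead of `contains`. The `Expr` variants, the `ScalarFn` alias, and `has_wildcard` are hypothetical and much smaller than the real types in the PR.

```rust
use std::sync::Arc;

/// A shareable scalar function. Closures have no meaningful notion of
/// equality, so `dyn Fn(f64) -> f64` does not implement `PartialEq`.
pub type ScalarFn = Arc<dyn Fn(f64) -> f64 + Send + Sync>;

// Adding `PartialEq` to this derive list would fail to compile, because
// the `fun` field below cannot be compared.
#[derive(Clone)]
pub enum Expr {
    Wildcard,
    Column(String),
    ScalarUDF { name: String, fun: ScalarFn },
}

/// Pattern matching needs no `PartialEq`, unlike
/// `exprs.contains(&Expr::Wildcard)`.
pub fn has_wildcard(exprs: &[Expr]) -> bool {
    exprs.iter().any(|e| matches!(e, Expr::Wildcard))
}
```

So a check of this kind could have been kept in a different form; the reply above simply notes that it was optional and dropped it.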

rust/datafusion/src/optimizer/type_coercion.rs

andygrove · 2020-09-03T14:01:29Z

rust/datafusion/src/dataframe.rs

+    /// # Ok(())
+    /// # }
+    /// ```
+    fn registry(&self) -> &dyn FunctionRegistry;


Is this registry specific to scalar functions or will it also be used for aggregate functions? Perhaps we should name the method either function_registry or scalar_function_registry?

andygrove

LGTM and this is a nice improvement! My only comment is that maybe some of the naming could be made more consistent, or documentation made clearer .. the things that weren't immediately clear to me (although I didn't spend much time on the review) were:

Is the registry just for user-defined scalar functions, or does it include the built-in scalar functions as well?
The registry has methods like udf but should we consider naming it scalar_udf? I wasn't sure what the plan was for user-defined aggregate functions (UDAFs)

jorgecarleitao · 2020-09-03T14:32:31Z

I agree with you, @andygrove .

Built-in functions are not part of the registry as they are exposed directly from the prelude. So, maybe udf_registry could be more explicit about this.
I thought about FunctionRegistry being used for both udfs and UDAFs (user-defined aggregate function, à la spark), so that a user does not have to call df.registry() or df.aggregate_registry(), they just need to remember about one.
I thought about FunctionRegistry::udf and FunctionRegistry::udaf, just because short names make it easier to read statements. But, as always, it is a trade-off between short vs understanding.
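A sketch of what a single registry serving both kinds of user-defined functions could look like, using the short `udf` / `udaf` names discussed above. The trait's method signatures, `ScalarUdf`, `AggregateUdf`, `InMemoryRegistry`, and `kind_of` are assumptions made for illustration only; they are not the trait as merged.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for a registered scalar UDF and a registered UDAF.
#[derive(Debug, PartialEq)]
pub struct ScalarUdf {
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub struct AggregateUdf {
    pub name: String,
}

/// One registry trait for both kinds, so users only remember one entry point.
pub trait FunctionRegistry {
    fn udf(&self, name: &str) -> Option<&ScalarUdf>;
    fn udaf(&self, name: &str) -> Option<&AggregateUdf>;
}

/// Toy in-memory implementation.
#[derive(Default)]
pub struct InMemoryRegistry {
    scalars: HashMap<String, ScalarUdf>,
    aggregates: HashMap<String, AggregateUdf>,
}

impl InMemoryRegistry {
    pub fn register_udf(&mut self, name: &str) {
        self.scalars
            .insert(name.to_string(), ScalarUdf { name: name.to_string() });
    }

    pub fn register_udaf(&mut self, name: &str) {
        self.aggregates
            .insert(name.to_string(), AggregateUdf { name: name.to_string() });
    }
}

impl FunctionRegistry for InMemoryRegistry {
    fn udf(&self, name: &str) -> Option<&ScalarUdf> {
        self.scalars.get(name)
    }

    fn udaf(&self, name: &str) -> Option<&AggregateUdf> {
        self.aggregates.get(name)
    }
}

/// Callers depend only on the trait object, in the spirit of
/// `fn registry(&self) -> &dyn FunctionRegistry`.
pub fn kind_of(registry: &dyn FunctionRegistry, name: &str) -> &'static str {
    if registry.udf(name).is_some() {
        "scalar udf"
    } else if registry.udaf(name).is_some() {
        "aggregate udf"
    } else {
        "not registered"
    }
}
```

Built-in functions are deliberately absent from this sketch, matching the point above that they are exposed directly rather than through the registry.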

I have no strong opinions about naming nor UX here: I will implement whatever you agree upon :-). My main concern was to fix the data type thing :P

andygrove · 2020-09-07T20:20:32Z

@jorgecarleitao @alamb I'm catching up on the PRs today. It looks like this one is ready to merge?

jorgecarleitao · 2020-09-07T20:22:58Z

I think so, @andygrove . There is probably some renaming once we have UDAFs. For now, I think it is fine.

…pes per argument

This PR aligns UDFs registration and declaration to be consistent with our built-in functions, so that we can leverage coercion rules on their arguments.

For ease of use, this PR introduces a function `create_udf` that simplifies the creation of UDFs with a fixed signature and fixed return type, so that users have a simple interface to declare them. However, underneath, the UDFs have the same capabilities as built-in functions, in that they can be as generic as built-in functions (arbitrary types, etc.).

Specific achievements of this PR:

* Added example (120 LOC) of how to declare and register a UDF
* Deprecated the type coercer optimizer, since it was causing logical schemas to become misaligned and cause our end-to-end tests to fail when implicit casting was required, and replaced it by what we already do for built-ins
* Made UDFs use the same interfaces as built-in functions

Note that this PR is built on top of #8032.

Closes #7967 from jorgecarleitao/clean

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

See associated issue and document for details.

The gist is that currently, users call UDFs through

```
df.select(scalar_functions("my_sqrt", vec![col("a")], DataType::Float64))
```

and this PR proposes a change to

```
let functions = df.registry()?;

df.select(functions.udf("my_sqrt", vec![col("a")])?)
```

so that they do not have to remember the UDF's return type when using it (and a whole lot of other things for us internally).

Closes apache#8032 from jorgecarleitao/registry

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

andygrove added the Component: Rust and Component: Rust - DataFusion labels Aug 23, 2020

This was referenced Aug 25, 2020

ARROW-9751: [Rust] [DataFusion] Allow UDFs to accept multiple data types per argument #7967

Closed

ARROW-9849: [Rust] [DataFusion] Simplified argument types of ScalarFunctions. #8045

Closed

alamb reviewed Aug 25, 2020

rust/datafusion/src/execution/context.rs (outdated)

rust/datafusion/src/logicalplan.rs (outdated)

rust/datafusion/src/logicalplan.rs (outdated)

rust/datafusion/src/sql/planner.rs (outdated)

jorgecarleitao commented Aug 25, 2020

rust/datafusion/src/execution/dataframe_impl.rs

jorgecarleitao commented Aug 25, 2020

rust/datafusion/src/execution/physical_plan/functions.rs (outdated)

jorgecarleitao commented Aug 25, 2020

rust/datafusion/src/execution/physical_plan/functions.rs (outdated)

jorgecarleitao commented Aug 25, 2020

rust/datafusion/src/execution/physical_plan/functions.rs (outdated)

jorgecarleitao marked this pull request as ready for review August 29, 2020 08:12

jorgecarleitao mentioned this pull request Aug 30, 2020

ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080

Closed

alamb approved these changes Sep 3, 2020


andygrove reviewed Sep 3, 2020


andygrove approved these changes Sep 3, 2020


jorgecarleitao mentioned this pull request Sep 5, 2020

ARROW-9902: [Rust] [DataFusion] Add array() built-in function #8102

Closed

jorgecarleitao added 2 commits September 7, 2020 18:11

Made UDFs be used from a registry.

c665886

Renamed Registry -> FunctionRegistry

1e3dd09

andygrove closed this in 4186a66 Sep 7, 2020

jorgecarleitao deleted the registry branch September 30, 2020 15:18

asfimport mentioned this pull request Sep 27, 2020

[Rust] [DataFusion] Improve API for usage of UDFs #25875

Closed
