ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080

jorgecarleitao · 2020-08-30T10:53:29Z

@alamb and @andygrove , I was able to split #8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in #7967. It offers a solution that covers the three main cases:

single return type, such as sqrt -> f64
finite set of return types, such as concat (utf8 and largeUTF8)
potentially infinite set of return types, such as array (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as array have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as array_TYPE as the number of cases is potentially large.

This PR is exclusive to built-in functions of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in #8032 and #7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: return type function: a function mapping the functions' argument types to its return type. E.g. (utf8) -> utf8; (LargeUtf8) -> LargeUtf8; is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

we want to have typing information during logical planning (see here)
we want to have functions that require their return type to depend on their input. Examples include array (any type to any other type) and concatenate (utf8 -> utf8, largeutf8 -> largeutf8), and many others (see here)
we would like users to plan built-in functions without accessing the registry (see here and mailing list)
a UDFs return type function needs to be retrieved from the registry (ExecutionContextState).
Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

1. and 2. requires access to built-in function's return type function during planning
1. and 5. requires access the registry to know the built-in's return type
1. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

builtin functions have a well declared return type during planning, since they are part of the source code
builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node Expr::ScalarFunction to be exclusive for built-in functions, and moves our UDF planning logic to a new node named Expr::ScalarUDF. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
add concat, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
add array function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

This commit removes all built-in functions from our UDF API so that their return type does not need to be specified when planning them. This commit also fixes a bug on which sql planner was changing the logical plan's schema, causing the column name to change.

This allow us to handle built-in functions such as: * concat, with an arbitrary number of arguments of a fixed type * array, with an arbitrary number of arguments of a single type * any variation of uniform types (e.g. simple scalar functions)

github-actions · 2020-08-30T11:06:18Z

https://issues.apache.org/jira/browse/ARROW-9887

alamb

I went through the first commit in detail and left some comments. I didn't make it through the second commit and I am mostly out of time this morning (I need to go and respond to one more of your comments on another PR).

alamb · 2020-08-30T11:07:10Z

rust/datafusion/src/execution/physical_plan/functions.rs

+    coerce(fun, &arg_types[0])?;
+
+    // the return type after coercion.
+    // for now, this is type-independent, but there will be built-in functions whose return type


Suggested change

// for now, this is type-independent, but there will be built-in functions whose return type

// for now, this is type-independent, but there may be built-in functions whose return type

alamb · 2020-08-30T11:11:14Z

rust/datafusion/src/execution/physical_plan/functions.rs

+
+/// Create a physical (function) expression.
+/// This function errors when `args`' can't be coerced to a valid argument type of the function.
+pub fn function(


Suggested change

pub fn function(

pub fn to_physical_expr(

I personally find function to be a little confusing as the term is so generic and could apply to many things

alamb · 2020-08-30T11:12:09Z

rust/datafusion/src/execution/physical_plan/functions.rs

+}
+
+/// the type supported by the function `fun`.
+fn valid_type(fun: &ScalarFunction) -> DataType {


Suggested change

fn valid_type(fun: &ScalarFunction) -> DataType {

fn argument_type(fun: &ScalarFunction) -> DataType {

alamb · 2020-08-30T11:12:32Z

rust/datafusion/src/execution/physical_plan/functions.rs

+    )))
+}
+
+/// the type supported by the function `fun`.


Suggested change

/// the type supported by the function `fun`.

/// the type of argument required by the function `fun`.

alamb · 2020-08-30T11:16:43Z

rust/datafusion/src/execution/physical_plan/functions.rs

+/// coercion rules for all built-in functions.
+/// Returns a DataType coerced from `arg_type` that is accepted by `fun`.
+/// Errors when `arg_type` can't be coerced to a valid return type of `fun`.
+fn coerce(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {


Suggested change

fn coerce(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {

fn is_argument_compatible(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {

I don't think this function is "coercing" the input -- it is effectively encoding "what coercions would be possible" I think...

alamb · 2020-08-30T11:20:52Z

rust/datafusion/src/logicalplan.rs

@@ -190,7 +190,14 @@ fn create_name(e: &Expr, input_schema: &Schema) -> Result<String> {
            let expr = create_name(expr, input_schema)?;
            Ok(format!("CAST({} as {:?})", expr, data_type))
        }
-        Expr::ScalarFunction { name, args, .. } => {
+        Expr::ScalarFunction { fun, args, .. } => {
+            let mut names = Vec::with_capacity(args.len());


I don't know how much you care about using "idomatic" rust, but you can probably do the same kind of thing here with iterators:

let names = args.iter() .map(|e| create_name(e, input_schema) .collect::<Result<Vec_>>()?;

I agree. I kept it because the code around this change is using this idiom. I would prefer to change that whole code (ScalarUDF, AggregateFunction, ScalarFunction) in a separate commit to not want to complicate even further this PR

that makes sense

alamb · 2020-08-30T11:23:34Z

rust/datafusion/src/sql/planner.rs

@@ -592,7 +607,7 @@ mod tests {
    fn select_scalar_func_with_literal_no_relation() {
        quick_test(
            "SELECT sqrt(9)",
-            "Projection: sqrt(CAST(Int64(9) AS Float64))\
+            "Projection: sqrt(Int64(9))\


This change would imply to me that sqrt can take an int which is not actually what happens, right? At some point something adds a Cast to float...

It depends where want to perform the cast. This test was ensuring that the cast was happenning on the logical plane, due to the type coercer.

IMO sqrt(8) is a logical expression that everyone agrees should return 2.82842712475... and the reason we cast it to f64 is entirely due to a limitation of our physical expressions, that only implement sqrt for f64.

For this reason, IMO these type of casting operations should only be done on our physical plane. For example, a different physical planner could prefer to not have it casted because it has a specialized implementation that supports other types.

This PR moves casting operations on built-in functions to the physical plane, like we are performing for binary operators.

Note that the user continues to have full transparency: they can use EXPLAIN verbose to get the physical plan, that contains the cast.

For this reason, IMO these type of casting operations should only be done on our physical plane. For example, a different physical planner could prefer to not have it casted because it has a specialized implementation that supports other types.

I hadn't thought of the idea that different physical planners might want to do type coercion differently -- that is a good point and it makes sense to me.

alamb · 2020-08-30T11:27:40Z

rust/datafusion/src/execution/physical_plan/functions.rs

+    /// arbitrary number of arguments of an arbitrary but uniform type
+    // A function such as `array` is `ManyUniform`
+    // The first argument decides the type used for coercion
+    ManyUniform,


the idea of a ManyUniform number of arguments perhaps is what is making this so complicated. ManyUniform sounds a lot like the notion of "varargs" (variadic functions) from C (aka how you can do printf(char *msg, ...), though that does not restrict to a single type.

Yes, Variadic, that is more appropriate name.

the idea of a ManyUniform number of arguments perhaps is what is making this so complicated.

I am not sure: implementation-wise, it is a small variation of Uniform.

The complication comes from what we want to achieve: we are essentially building a dynamic typed system with type-checking during planning. AFAIK that is a subset of a compiler...

alamb

I think this code is good enough to merge -- while it is perhaps not exactly how I would have done it, it seems well thought out, has a coherent design and I think I could make this code do anything I might want -- namely I think it would work for the usecases in our project). Given the effort that @jorgecarleitao has put in, it might make sense to get this in and continue to iterate.

andygrove

Thanks for taking the lead on reviewing this @alamb. LGTM too.

@alamb

…s for built-in functions @alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment))) 3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files Closes apache#8080 from jorgecarleitao/functions Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

@alamb

…s for built-in functions @alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment))) 3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files Closes apache#8080 from jorgecarleitao/functions Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

jorgecarleitao added 2 commits August 30, 2020 11:39

alamb reviewed Aug 30, 2020

View reviewed changes

andygrove added Component: Rust Component: Rust - DataFusion labels Aug 30, 2020

jorgecarleitao mentioned this pull request Aug 30, 2020

ARROW-9751: [Rust] [DataFusion] Allow UDFs to accept multiple data types per argument #7967

Closed

alamb approved these changes Aug 31, 2020

View reviewed changes

jorgecarleitao added 2 commits August 31, 2020 19:21

Renamed function.

5b1317e

Renamed Signature names.

6f4241f

andygrove approved these changes Sep 1, 2020

View reviewed changes

andygrove closed this in a898ee1 Sep 1, 2020

jorgecarleitao deleted the functions branch September 30, 2020 15:17

asfimport mentioned this pull request Sep 29, 2020

[Rust] [DataFusion] Add support for complex return types of built-in functions #25921

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080

ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080

jorgecarleitao commented Aug 30, 2020

github-actions bot commented Aug 30, 2020

alamb left a comment

alamb Aug 30, 2020

alamb Aug 30, 2020

alamb Aug 30, 2020

alamb Aug 30, 2020

alamb Aug 30, 2020

alamb Aug 30, 2020

jorgecarleitao Aug 31, 2020

alamb Aug 31, 2020

alamb Aug 30, 2020

jorgecarleitao Aug 30, 2020

alamb Aug 31, 2020

alamb Aug 30, 2020

jorgecarleitao Aug 30, 2020

alamb left a comment

andygrove left a comment

	// for now, this is type-independent, but there will be built-in functions whose return type
	// for now, this is type-independent, but there may be built-in functions whose return type

	fn valid_type(fun: &ScalarFunction) -> DataType {
	fn argument_type(fun: &ScalarFunction) -> DataType {

	/// the type supported by the function `fun`.
	/// the type of argument required by the function `fun`.

	fn coerce(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {
	fn is_argument_compatible(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {

ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080

ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080

Conversation

jorgecarleitao commented Aug 30, 2020

github-actions bot commented Aug 30, 2020

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

andygrove left a comment

Choose a reason for hiding this comment