New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080
Conversation
This commit removes all built-in functions from our UDF API so that their return type does not need to be specified when planning them. This commit also fixes a bug on which sql planner was changing the logical plan's schema, causing the column name to change.
This allow us to handle built-in functions such as: * concat, with an arbitrary number of arguments of a fixed type * array, with an arbitrary number of arguments of a single type * any variation of uniform types (e.g. simple scalar functions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the first commit in detail and left some comments. I didn't make it through the second commit and I am mostly out of time this morning (I need to go and respond to one more of your comments on another PR).
coerce(fun, &arg_types[0])?; | ||
|
||
// the return type after coercion. | ||
// for now, this is type-independent, but there will be built-in functions whose return type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// for now, this is type-independent, but there will be built-in functions whose return type | |
// for now, this is type-independent, but there may be built-in functions whose return type |
|
||
/// Create a physical (function) expression. | ||
/// This function errors when `args`' can't be coerced to a valid argument type of the function. | ||
pub fn function( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn function( | |
pub fn to_physical_expr( |
I personally find function
to be a little confusing as the term is so generic and could apply to many things
} | ||
|
||
/// the type supported by the function `fun`. | ||
fn valid_type(fun: &ScalarFunction) -> DataType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn valid_type(fun: &ScalarFunction) -> DataType { | |
fn argument_type(fun: &ScalarFunction) -> DataType { |
))) | ||
} | ||
|
||
/// the type supported by the function `fun`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// the type supported by the function `fun`. | |
/// the type of argument required by the function `fun`. |
/// coercion rules for all built-in functions. | ||
/// Returns a DataType coerced from `arg_type` that is accepted by `fun`. | ||
/// Errors when `arg_type` can't be coerced to a valid return type of `fun`. | ||
fn coerce(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn coerce(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> { | |
fn is_argument_compatible(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> { |
I don't think this function is "coercing" the input -- it is effectively encoding "what coercions would be possible" I think...
@@ -190,7 +190,14 @@ fn create_name(e: &Expr, input_schema: &Schema) -> Result<String> { | |||
let expr = create_name(expr, input_schema)?; | |||
Ok(format!("CAST({} as {:?})", expr, data_type)) | |||
} | |||
Expr::ScalarFunction { name, args, .. } => { | |||
Expr::ScalarFunction { fun, args, .. } => { | |||
let mut names = Vec::with_capacity(args.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how much you care about using "idomatic" rust, but you can probably do the same kind of thing here with iterators:
let names = args.iter()
.map(|e| create_name(e, input_schema)
.collect::<Result<Vec_>>()?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I kept it because the code around this change is using this idiom. I would prefer to change that whole code (ScalarUDF
, AggregateFunction
, ScalarFunction
) in a separate commit to not want to complicate even further this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense
@@ -592,7 +607,7 @@ mod tests { | |||
fn select_scalar_func_with_literal_no_relation() { | |||
quick_test( | |||
"SELECT sqrt(9)", | |||
"Projection: sqrt(CAST(Int64(9) AS Float64))\ | |||
"Projection: sqrt(Int64(9))\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change would imply to me that sqrt
can take an int
which is not actually what happens, right? At some point something adds a Cast
to float...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends where want to perform the cast. This test was ensuring that the cast was happenning on the logical plane, due to the type coercer.
IMO sqrt(8)
is a logical expression that everyone agrees should return 2.82842712475...
and the reason we cast it to f64 is entirely due to a limitation of our physical expressions, that only implement sqrt
for f64.
For this reason, IMO these type of casting operations should only be done on our physical plane. For example, a different physical planner could prefer to not have it cast
ed because it has a specialized implementation that supports other types.
This PR moves casting operations on built-in functions to the physical plane, like we are performing for binary operators.
Note that the user continues to have full transparency: they can use EXPLAIN verbose
to get the physical plan, that contains the cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this reason, IMO these type of casting operations should only be done on our physical plane. For example, a different physical planner could prefer to not have it casted because it has a specialized implementation that supports other types.
I hadn't thought of the idea that different physical planners might want to do type coercion differently -- that is a good point and it makes sense to me.
/// arbitrary number of arguments of an arbitrary but uniform type | ||
// A function such as `array` is `ManyUniform` | ||
// The first argument decides the type used for coercion | ||
ManyUniform, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the idea of a ManyUniform
number of arguments perhaps is what is making this so complicated. ManyUniform
sounds a lot like the notion of "varargs" (variadic functions) from C (aka how you can do printf(char *msg, ...)
, though that does not restrict to a single type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Variadic
, that is more appropriate name.
the idea of a ManyUniform number of arguments perhaps is what is making this so complicated.
I am not sure: implementation-wise, it is a small variation of Uniform
.
The complication comes from what we want to achieve: we are essentially building a dynamic typed system with type-checking during planning. AFAIK that is a subset of a compiler...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code is good enough to merge -- while it is perhaps not exactly how I would have done it, it seems well thought out, has a coherent design and I think I could make this code do anything I might want -- namely I think it would work for the usecases in our project). Given the effort that @jorgecarleitao has put in, it might make sense to get this in and continue to iterate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the lead on reviewing this @alamb. LGTM too.
…s for built-in functions @alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment))) 3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files Closes apache#8080 from jorgecarleitao/functions Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>
…s for built-in functions @alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases: * single return type, such as `sqrt -> f64` * finite set of return types, such as `concat` (utf8 and largeUTF8) * potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type) I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large. --------------- This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below). Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function. The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure). This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao: 1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk)) 2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment))) 3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list) 4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`). 5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized. These points are incompatible because: * 1. and 2. requires access to built-in function's return type function during planning * 4. and 5. requires access the registry to know the built-in's return type * 3. forbids us from accessing the registry during planning This PR solves this incompatibility by leveraging the following: * builtin functions have a well declared return type during planning, since they are part of the source code * builtin functions do not need to be in our function's registry The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry. The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs: 1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files 2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files 3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files Closes apache#8080 from jorgecarleitao/functions Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>
@alamb and @andygrove , I was able to split #8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in #7967. It offers a solution that covers the three main cases:
sqrt -> f64
concat
(utf8 and largeUTF8)array
(Array of any primitive or non-primitive type)I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as
array
have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them asarray_TYPE
as the number of cases is potentially large.This PR is exclusive to built-in functions of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in #8032 and #7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).
Notation:
return type function
: a function mapping the functions' argument types to its return type. E.g.(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;
is an example of the signature of a typical one argument string function.The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).
This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:
array
(any type to any other type) andconcatenate
(utf8 -> utf8
,largeutf8 -> largeutf8
), and many others (see here)ExecutionContextState
).These points are incompatible because:
This PR solves this incompatibility by leveraging the following:
The first commit in this PR makes the existing logical node
Expr::ScalarFunction
to be exclusive for built-in functions, and moves our UDF planning logic to a new node namedExpr::ScalarUDF
. It also makes the planning of built-in functions to no longer require access the registry.The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:
concat
, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/filesarray
function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files