Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9887: [Rust] [DataFusion] Added support for complex return types for built-in functions #8080

Closed
wants to merge 4 commits into from

Conversation

jorgecarleitao
Copy link
Member

@alamb and @andygrove , I was able to split #8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in #7967. It offers a solution that covers the three main cases:

  • single return type, such as sqrt -> f64
  • finite set of return types, such as concat (utf8 and largeUTF8)
  • potentially infinite set of return types, such as array (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as array have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as array_TYPE as the number of cases is potentially large.


This PR is exclusive to built-in functions of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in #8032 and #7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: return type function: a function mapping the functions' argument types to its return type. E.g. (utf8) -> utf8; (LargeUtf8) -> LargeUtf8; is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

  1. we want to have typing information during logical planning (see here)
  2. we want to have functions that require their return type to depend on their input. Examples include array (any type to any other type) and concatenate (utf8 -> utf8, largeutf8 -> largeutf8), and many others (see here)
  3. we would like users to plan built-in functions without accessing the registry (see here and mailing list)
  4. a UDFs return type function needs to be retrieved from the registry (ExecutionContextState).
  5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

    1. and 2. requires access to built-in function's return type function during planning
    1. and 5. requires access the registry to know the built-in's return type
    1. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

  • builtin functions have a well declared return type during planning, since they are part of the source code
  • builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node Expr::ScalarFunction to be exclusive for built-in functions, and moves our UDF planning logic to a new node named Expr::ScalarUDF. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

  1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
  2. add concat, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
  3. add array function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

This commit removes all built-in functions from our UDF API
so that their return type does not need to be specified when
planning them.

This commit also fixes a bug on which sql planner was changing
the logical plan's schema, causing the column name to change.
This allow us to handle built-in functions such as:

* concat, with an arbitrary number of arguments of a fixed type
* array, with an arbitrary number of arguments of a single type
* any variation of uniform types (e.g. simple scalar functions)
@github-actions
Copy link

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the first commit in detail and left some comments. I didn't make it through the second commit and I am mostly out of time this morning (I need to go and respond to one more of your comments on another PR).

coerce(fun, &arg_types[0])?;

// the return type after coercion.
// for now, this is type-independent, but there will be built-in functions whose return type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// for now, this is type-independent, but there will be built-in functions whose return type
// for now, this is type-independent, but there may be built-in functions whose return type


/// Create a physical (function) expression.
/// This function errors when `args`' can't be coerced to a valid argument type of the function.
pub fn function(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn function(
pub fn to_physical_expr(

I personally find function to be a little confusing as the term is so generic and could apply to many things

}

/// the type supported by the function `fun`.
fn valid_type(fun: &ScalarFunction) -> DataType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn valid_type(fun: &ScalarFunction) -> DataType {
fn argument_type(fun: &ScalarFunction) -> DataType {

)))
}

/// the type supported by the function `fun`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// the type supported by the function `fun`.
/// the type of argument required by the function `fun`.

/// coercion rules for all built-in functions.
/// Returns a DataType coerced from `arg_type` that is accepted by `fun`.
/// Errors when `arg_type` can't be coerced to a valid return type of `fun`.
fn coerce(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn coerce(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {
fn is_argument_compatible(fun: &ScalarFunction, arg_type: &DataType) -> Result<DataType> {

I don't think this function is "coercing" the input -- it is effectively encoding "what coercions would be possible" I think...

@@ -190,7 +190,14 @@ fn create_name(e: &Expr, input_schema: &Schema) -> Result<String> {
let expr = create_name(expr, input_schema)?;
Ok(format!("CAST({} as {:?})", expr, data_type))
}
Expr::ScalarFunction { name, args, .. } => {
Expr::ScalarFunction { fun, args, .. } => {
let mut names = Vec::with_capacity(args.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how much you care about using "idomatic" rust, but you can probably do the same kind of thing here with iterators:

let names = args.iter()
  .map(|e| create_name(e, input_schema)
  .collect::<Result<Vec_>>()?;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I kept it because the code around this change is using this idiom. I would prefer to change that whole code (ScalarUDF, AggregateFunction, ScalarFunction) in a separate commit to not want to complicate even further this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense

@@ -592,7 +607,7 @@ mod tests {
fn select_scalar_func_with_literal_no_relation() {
quick_test(
"SELECT sqrt(9)",
"Projection: sqrt(CAST(Int64(9) AS Float64))\
"Projection: sqrt(Int64(9))\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change would imply to me that sqrt can take an int which is not actually what happens, right? At some point something adds a Cast to float...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends where want to perform the cast. This test was ensuring that the cast was happenning on the logical plane, due to the type coercer.

IMO sqrt(8) is a logical expression that everyone agrees should return 2.82842712475... and the reason we cast it to f64 is entirely due to a limitation of our physical expressions, that only implement sqrt for f64.

For this reason, IMO these type of casting operations should only be done on our physical plane. For example, a different physical planner could prefer to not have it casted because it has a specialized implementation that supports other types.

This PR moves casting operations on built-in functions to the physical plane, like we are performing for binary operators.

Note that the user continues to have full transparency: they can use EXPLAIN verbose to get the physical plan, that contains the cast.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this reason, IMO these type of casting operations should only be done on our physical plane. For example, a different physical planner could prefer to not have it casted because it has a specialized implementation that supports other types.

I hadn't thought of the idea that different physical planners might want to do type coercion differently -- that is a good point and it makes sense to me.

/// arbitrary number of arguments of an arbitrary but uniform type
// A function such as `array` is `ManyUniform`
// The first argument decides the type used for coercion
ManyUniform,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea of a ManyUniform number of arguments perhaps is what is making this so complicated. ManyUniform sounds a lot like the notion of "varargs" (variadic functions) from C (aka how you can do printf(char *msg, ...), though that does not restrict to a single type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Variadic, that is more appropriate name.

the idea of a ManyUniform number of arguments perhaps is what is making this so complicated.

I am not sure: implementation-wise, it is a small variation of Uniform.

The complication comes from what we want to achieve: we are essentially building a dynamic typed system with type-checking during planning. AFAIK that is a subset of a compiler...

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is good enough to merge -- while it is perhaps not exactly how I would have done it, it seems well thought out, has a coherent design and I think I could make this code do anything I might want -- namely I think it would work for the usecases in our project). Given the effort that @jorgecarleitao has put in, it might make sense to get this in and continue to iterate.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the lead on reviewing this @alamb. LGTM too.

@andygrove andygrove closed this in a898ee1 Sep 1, 2020
@jorgecarleitao jorgecarleitao deleted the functions branch September 30, 2020 15:17
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
…s for built-in functions

@alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases:

* single return type, such as `sqrt -> f64`
* finite set of return types, such as `concat` (utf8 and largeUTF8)
* potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large.

---------------

This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk))
2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment)))
3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list)
4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`).
5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

* 1. and 2. requires access to built-in function's return type function during planning
* 4. and 5. requires access the registry to know the built-in's return type
* 3. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

* builtin functions have a well declared return type during planning, since they are part of the source code
* builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

Closes apache#8080 from jorgecarleitao/functions

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…s for built-in functions

@alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases:

* single return type, such as `sqrt -> f64`
* finite set of return types, such as `concat` (utf8 and largeUTF8)
* potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large.

---------------

This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk))
2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment)))
3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list)
4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`).
5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

* 1. and 2. requires access to built-in function's return type function during planning
* 4. and 5. requires access the registry to know the built-in's return type
* 3. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

* builtin functions have a well declared return type during planning, since they are part of the source code
* builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

Closes apache#8080 from jorgecarleitao/functions

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants