Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs #8032

Closed
wants to merge 2 commits into from
Closed

ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs #8032

wants to merge 2 commits into from

Conversation

jorgecarleitao
Copy link
Member

@jorgecarleitao jorgecarleitao commented Aug 23, 2020

See associated issue and document for details.

The gist is that currently, users call UDFs through

df.select(scalar_functions(“my_sqrt”, vec![col(“a”)], DataType::Float64))

and this PR proposes a change to

let functions = df.registry()?;

df.select(functions.udf(“my_sqrt”, vec![col(“a”)])?)

so that they do not have to remember the UDFs return type when using it (and a whole lot other things for us internally).

@github-actions
Copy link

@andygrove
Copy link
Member

For built-in functions like sqrt I would expect DataFusion to provide convenience functions to create an expression, like we do with col and the aggregate functions. I assume we could also do that with the design proposed here?

For example, I would like to be able to write:

df.select(vec![col("foo"), sqrt(col("bar"))])?

@jorgecarleitao
Copy link
Member Author

jorgecarleitao commented Aug 25, 2020

For built-in functions like sqrt I would expect DataFusion to provide convenience functions to create an expression, like we do with col and the aggregate functions. I assume we could also do that with the design proposed here?

For example, I would like to be able to write:

df.select(vec![col("foo"), sqrt(col("bar"))])?

This PR does not support this, as it threats every function (built-in or not) equally. To include that case, IMO this PR needs to add a new enum in the logical Expr:

  • Expr::ScalarFunction { name: String/Enum, args: Vec<Expr> } that we logically know its return type based on name (e.g. "sqrt"), exactly like Expr::BinaryExpr. This is mapped to a physical expression during planning. These can be build without access to the registry, as we hard-code the return type on the logical plan to be consistent with the physical one, like we do for our aggregates, binary expressions, etc.
  • Expr:ScalarUDF { fun: ScalarFunction, args: Vec<Expr> }, whose return type is only known after going to the registry to check what the user set its return type to be (as this PR currently does).

sqrt would return Expr::ScalarFunction, that knows its own return type, and the planner converts it to ScalarFunction via a hard-coded map, while Expr:UDF's physical planning is just planning args and pass them to ScalarFunction like this PR already does.

I.e. at the physical level, built-in and UDFs are indistinguishable, but at the logical plan, one only knows its name (built-in), the other also knows its physical representation ScalarUDF.

@alamb
Copy link
Contributor

alamb commented Aug 25, 2020

For anyone else reading along, the associated document I think is https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?usp=sharing

rust/datafusion/src/execution/context.rs Outdated Show resolved Hide resolved
rust/datafusion/src/logicalplan.rs Outdated Show resolved Hide resolved
rust/datafusion/src/logicalplan.rs Outdated Show resolved Hide resolved
rust/datafusion/src/sql/planner.rs Outdated Show resolved Hide resolved
@jorgecarleitao
Copy link
Member Author

@alamb thank you very much for your comments, I will now work on addressing them now. I still learning the Arc/Box/Ref, so thank you a lot for also teaching me.

@andygrove , I agree with you that built-in functions should not require access to the registry. Unfortunately, doing so required some re-work, which is the reason I retracted #7967 back to draft to focus on this one first.

I pushed a new commit to this PR to address this point. Specifically, that commit adds:

  • a new enum with all built-in functions
  • functionally gluing the logical plan with the physical plan so that the function's return types are invariant.
  • made type coercion on built-in functions to be on the physical plane, to preserve schema invariance during planning.

I am pretty happy with this PR, as IMO has the flexibility we need to expand DataFusion's pool of built-in functions to multiple input and return types. The main features of this PR:

  • users no longer have to pass the return type of the UDF when calling them (the proposal)
  • planning built-in functions continue to not need access to the registry (@andygrove 's point)
  • built-in functions now support multiple input types (e.g. sqrt(f32), sqrt(f64))
  • built-in functions now support multiple return types (e.g. sqrt(f32) -> f32, sqrt(f64) -> f64)
  • coercion rules are no longer applied in the sql planning or physical planning to built-in functions, to avoid breaking schema invariance during planning

I have not completed the valid return types of built-in math functions as this PR was already too long.

Overall, I think that this has not been a pleasant experience for you @andygrove and @alamb, as I constantly open and close PRs around functions/UDFs, and for that I am really sorry. I've been hitting some design challenge after another, which requires me to go back and forth.

I am still in pursuit of my original quests:

  • built-in aggregate functions whose logical types are known from the physical expressions
  • type coercion on aggregate functions
  • built-in aggregate functions whose return types (e.g. min(f32) -> f32, min(f64) -> f64) are directly derived from the physical plan (there is an old fixme/todo in the code around that)
  • aggregate udfs
  • udfs with multiple incoming and return types, to bring them to the same level of functionality of built-ins
  • planning a udf without registering it (a-la spark) in the DF's API.

I have code for some of this, I... just... need... to... finish... the... scalar... stuff... first... 😃

andygrove pushed a commit that referenced this pull request Aug 26, 2020
…nctions.

Deprecates "Field" as argument to the UDF declaration, since we are only using its type.

This is a spin-off of #8032 with a much smaller scope, as the other one is getting to large to handle.

Closes #8045 from jorgecarleitao/clean_args

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
@alamb
Copy link
Contributor

alamb commented Aug 27, 2020 via email

@jorgecarleitao jorgecarleitao marked this pull request as ready for review August 29, 2020 08:12
@jorgecarleitao
Copy link
Member Author

FYI, this is what a PR to add support for f32 to mathematical expressions (keeping the return type f64) looks like:
https://github.com/jorgecarleitao/arrow/pull/1

I.e. IMO with this PR we can support almost any built-in function: fixed type, variable return type, multiple input types, etc on Expr::ScalarFunction, which gives a lot of flexibility to add a new function, as we do not need to fiddle with Expr, only with input types, return types, etc. on the physical plane.

I split built-ins from the UDFs because built-ins type is known without access to the registry, which is currently required to allow users to use them outside the registry.

@jorgecarleitao
Copy link
Member Author

FYI, this is what a PR would look like for the concatenate function: https://github.com/jorgecarleitao/arrow/pull/2/files using this API.

I am not advocating that we follow this design (option 3 in this comment) would also be fine.

My point is that regardless of which option we pick, we will need to have the functionality in this PR:

  1. function signatures
  2. methods to check and coerce types based on those signatures
  3. methods to return the return type of the function

@jorgecarleitao
Copy link
Member Author

And finally, this is how we would add the array function, that receives an arbitrary but uniformly-typed number of arguments: https://github.com/jorgecarleitao/arrow/pull/3/files

To summarize:

  • math functions:
    • accept 1 argument of a fixed type (f32 or f64)
    • return a fixed type (f64)
  • concat:
    • accepts an arbitrary number of arguments of fixed type (utf8 atm, but large is trivial to add)
    • returns a fixed type (utf8)
  • array:
    • accepts an arbitrary number of arguments of variable but uniform types (i.e. all arguments must be of equal type)
    • returns FixedSizeList(input_types[0], input.len())

The API that I am proposing here addresses all these cases out of the box. The 3 PRs in my repo,

add support to each of them at the physical and logical level.

This PR also includes all the required coercion rules for this to work. E.g. array(f32, i32, u32) would be coerced to array(f32, f32, f32) (first argument dominates).

andygrove pushed a commit that referenced this pull request Sep 1, 2020
…s for built-in functions

@alamb and @andygrove , I was able to split #8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in #7967. It offers a solution that covers the three main cases:

* single return type, such as `sqrt -> f64`
* finite set of return types, such as `concat` (utf8 and largeUTF8)
* potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large.

---------------

This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in #8032 and #7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk))
2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](#7967 (comment)))
3. we would like users to plan built-in functions without accessing the registry (see [here](#8032 (comment)) and mailing list)
4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`).
5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

* 1. and 2. requires access to built-in function's return type function during planning
* 4. and 5. requires access the registry to know the built-in's return type
* 3. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

* builtin functions have a well declared return type during planning, since they are part of the source code
* builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

Closes #8080 from jorgecarleitao/functions

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
@jorgecarleitao
Copy link
Member Author

After a long digression through the realm of built-ins, this has now been simplified and rebased against master.

@andygrove and @alamb , ready for a re-review.

Again, the core goal here is to allow users to use UDFs without having to worry about their return type.

I've incorporated all points from @alamb and @andygrove so far:

  • Registry is now a trait
  • Arc is used consistently
  • built-ins continue to be available without access to the registry

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this code -- I think the UDF usage / definition is clearer now and the code looks better. Really nice work @jorgecarleitao

rust/datafusion/src/logical_plan/mod.rs Outdated Show resolved Hide resolved
} else {
expr.clone()
};
let mut projected_expr = vec![];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to verify my understanding -- this is a code cleanup that is not directly required for UDFs, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indirectly, it is related: expr.contains(&Expr::Wildcard) requires PartialEq, which was dropped in this line due to the addition of an Arc<ScalarFunction> in this line.

However, since if expr.contains(&Expr::Wildcard) thankfully was entirely optional, I dropped it ^_^

/// # Ok(())
/// # }
/// ```
fn registry(&self) -> &dyn FunctionRegistry;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this registry specific to scalar functions or will it also be used for aggregate functions? Perhaps we should name the method either function_registry or scalar_function_registry?

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and this is a nice improvement! My only comment is that maybe some of the naming could be made more consistent, or documentation made clearer .. the things that weren't immediately clear to me (although I didn't spend much time on the review) were:

  • Is the registry just for user-defined scalar functions, or does it include the built-in scalar functions as well?
  • The registry has methods like udf but should we consider naming scalar_udf? I wasn't sure what the plan was for user-defined aggregate functions (UDAFs)

@jorgecarleitao
Copy link
Member Author

I agree with you, @andygrove .

  • Built-in functions are not part of the registry as they are exposed directly from the prelude. So, maybe udf_registry could be more explicit about this.

  • I thought about FunctionRegistry being used for both udfs and UDAFs (user-defined aggregate function, à la spark), so that a user does not have to call df.registry() or df.aggregate_registry(), they just need to remember about one.

  • I thought about FunctionRegistry::udf and FunctionRegistry::udaf, just because short names make it easier to read statements. But, as always, it is a trade-off between short vs understanding.

I have no strong opinions about naming nor UX here: I will implement whatever you agree upon :-). My main concern was to fix the data type thing :P

@andygrove
Copy link
Member

@jorgecarleitao @alamb I'm catching up on the PRs today. It looks like this one is ready to merge?

@jorgecarleitao
Copy link
Member Author

I think so, @andygrove . There is probably some renaming once we have UDAFs. For now, I think it is fine.

@andygrove andygrove closed this in 4186a66 Sep 7, 2020
andygrove pushed a commit that referenced this pull request Sep 9, 2020
…pes per argument

This PR aligns UDFs registration and declaration to be consistent with our built-in functions, so that we can leverage coercion rules on their arguments.

For ease of use, this PR introduces a function `create_udf` that simplifies the creation of UDFs with a fixed signature and fixed return type, so that users have a simple interface to declare them.

However, underneath, the UDFs have the same capabilities as built-in functions, in that they can be as generic as built-in functions (arbitrary types, etc.).

Specific achievements of this PR:

* Added example (120 LOC) of how to declare and register a UDF
* Deprecated the type coercer optimizer, since it was causing logical schemas to become misaligned and cause our end-to-end tests to faail when implicit casting was required, and replaced it by what we already do for built-ins
* Made UDFs use the same interfaces as built-in functions

Note that this PR is built on top of #8032.

Closes #7967 from jorgecarleitao/clean

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
@jorgecarleitao jorgecarleitao deleted the registry branch September 30, 2020 15:18
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
…nctions.

Deprecates "Field" as argument to the UDF declaration, since we are only using its type.

This is a spin-off of apache#8032 with a much smaller scope, as the other one is getting to large to handle.

Closes apache#8045 from jorgecarleitao/clean_args

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
…s for built-in functions

@alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases:

* single return type, such as `sqrt -> f64`
* finite set of return types, such as `concat` (utf8 and largeUTF8)
* potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large.

---------------

This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk))
2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment)))
3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list)
4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`).
5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

* 1. and 2. requires access to built-in function's return type function during planning
* 4. and 5. requires access the registry to know the built-in's return type
* 3. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

* builtin functions have a well declared return type during planning, since they are part of the source code
* builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

Closes apache#8080 from jorgecarleitao/functions

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
See associated issue and document for details.

The gist is that currently, users call UDFs through

```
df.select(scalar_functions(“my_sqrt”, vec![col(“a”)], DataType::Float64))
```

and this PR proposes a change to

```
let functions = df.registry()?;

df.select(functions.udf(“my_sqrt”, vec![col(“a”)])?)
```

so that they do not have to remember the UDFs return type when using it (and a whole lot other things for us internally).

Closes apache#8032 from jorgecarleitao/registry

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
…pes per argument

This PR aligns UDFs registration and declaration to be consistent with our built-in functions, so that we can leverage coercion rules on their arguments.

For ease of use, this PR introduces a function `create_udf` that simplifies the creation of UDFs with a fixed signature and fixed return type, so that users have a simple interface to declare them.

However, underneath, the UDFs have the same capabilities as built-in functions, in that they can be as generic as built-in functions (arbitrary types, etc.).

Specific achievements of this PR:

* Added example (120 LOC) of how to declare and register a UDF
* Deprecated the type coercer optimizer, since it was causing logical schemas to become misaligned and cause our end-to-end tests to faail when implicit casting was required, and replaced it by what we already do for built-ins
* Made UDFs use the same interfaces as built-in functions

Note that this PR is built on top of apache#8032.

Closes apache#7967 from jorgecarleitao/clean

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…nctions.

Deprecates "Field" as argument to the UDF declaration, since we are only using its type.

This is a spin-off of apache#8032 with a much smaller scope, as the other one is getting to large to handle.

Closes apache#8045 from jorgecarleitao/clean_args

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…s for built-in functions

@alamb and @andygrove , I was able to split apache#8032 in two, so that they address different problems. This PR is specific to the problem that we have been discussing in apache#7967. It offers a solution that covers the three main cases:

* single return type, such as `sqrt -> f64`
* finite set of return types, such as `concat` (utf8 and largeUTF8)
* potentially infinite set of return types, such as `array` (Array of any primitive or non-primitive type)

I believe that this implementation is closer to option 1 that @alamb enumerated here. It is so because so far I was unable to offer an implementation for option 3, because functions such as `array` have an arbitrary return type (it can be any valid type, primitive or non-primitive), and thus we can't write them as `array_TYPE` as the number of cases is potentially large.

---------------

This PR is exclusive to *built-in functions* of variable return type and it does not care about UDFs. It addresses a limitation of our current logical planning, that has been thoroughly discussed in apache#8032 and apache#7967, that logical planning needs to specify a specific return type when planning usage of UDFs and built-in functions (details below).

Notation: `return type function`: a function mapping the functions' argument types to its return type. E.g. `(utf8) -> utf8; (LargeUtf8) -> LargeUtf8;` is an example of the signature of a typical one argument string function.

The primary difference between built-ins and UDFs is that built-in's return type function is always known (hard-coded), while the return type function of a UDF is known by accessing the registry where it is registered on (it is a non-static closure).

This PR is required to address an incompatibility of the following requirements that I gathered from discussions between @alamb, @andygrove and @jorgecarleitao:

1. we want to have typing information during logical planning (see [here](https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?disco=AAAAJ4XOjHk))
2. we want to have functions that require their return type to depend on their input. Examples include `array` (any type to any other type) and `concatenate` (`utf8 -> utf8`, `largeutf8 -> largeutf8`), and many others (see [here](apache#7967 (comment)))
3. we would like users to plan built-in functions without accessing the registry (see [here](apache#8032 (comment)) and mailing list)
4. a UDFs return type function needs to be retrieved from the registry (`ExecutionContextState`).
5. Currently, all our built-in functions are declared as UDFs and registered on the registry when the context is initialized.

These points are incompatible because:

* 1. and 2. requires access to built-in function's return type function during planning
* 4. and 5. requires access the registry to know the built-in's return type
* 3. forbids us from accessing the registry during planning

This PR solves this incompatibility by leveraging the following:

* builtin functions have a well declared return type during planning, since they are part of the source code
* builtin functions do not need to be in our function's registry

The first commit in this PR makes the existing logical node `Expr::ScalarFunction` to be exclusive for built-in functions, and moves our UDF planning logic to a new node named `Expr::ScalarUDF`. It also makes the planning of built-in functions to no longer require access the registry.

The second commit in this PR introduces the necessary functionality for built-in functions to support all types of complex signatures. Examples of usage of this functionality are in the following PRs:

1. add support for math functions that accept f32: https://github.com/jorgecarleitao/arrow/pull/4/files
2. add `concat`, of an arbitrary number of arguments of type utf8: https://github.com/jorgecarleitao/arrow/pull/5/files
3. add `array` function, supporting an arbitrary number of arguments with uniform types: https://github.com/jorgecarleitao/arrow/pull/6/files

Closes apache#8080 from jorgecarleitao/functions

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
See associated issue and document for details.

The gist is that currently, users call UDFs through

```
df.select(scalar_functions(“my_sqrt”, vec![col(“a”)], DataType::Float64))
```

and this PR proposes a change to

```
let functions = df.registry()?;

df.select(functions.udf(“my_sqrt”, vec![col(“a”)])?)
```

so that they do not have to remember the UDFs return type when using it (and a whole lot other things for us internally).

Closes apache#8032 from jorgecarleitao/registry

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…pes per argument

This PR aligns UDFs registration and declaration to be consistent with our built-in functions, so that we can leverage coercion rules on their arguments.

For ease of use, this PR introduces a function `create_udf` that simplifies the creation of UDFs with a fixed signature and fixed return type, so that users have a simple interface to declare them.

However, underneath, the UDFs have the same capabilities as built-in functions, in that they can be as generic as built-in functions (arbitrary types, etc.).

Specific achievements of this PR:

* Added example (120 LOC) of how to declare and register a UDF
* Deprecated the type coercer optimizer, since it was causing logical schemas to become misaligned and cause our end-to-end tests to faail when implicit casting was required, and replaced it by what we already do for built-ins
* Made UDFs use the same interfaces as built-in functions

Note that this PR is built on top of apache#8032.

Closes apache#7967 from jorgecarleitao/clean

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants