Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create datafusion-functions-array crate and move ArrayToString function into it #9113

Merged
merged 4 commits into from
Feb 12, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Feb 2, 2024

Which issue does this PR close?

Part of #8045
See #9100 for discussion

Rationale for this change

We are trying to make DataFusion more modular, and as more array functionality
gets added, it is important to be support:

  1. users choosing which functionality they want to include in their build and
  2. developers to be able to work on the array functions without having to recompile the entire project

What changes are included in this PR?

  1. Make a new datafusion-functions-array crate
  2. Move ArrayToString from BuiltInScalarFunction enum and datafusion-physical-expr to datafusion-functions-array,
  3. Add a new feature flag, array_expressions

If this pattern looks good, I think we can then move the other functions over in a similar way

Are these changes tested?

Yes, by existing and new tests

Are there any user-facing changes?

A new feature flag

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core datafusion crate documentation Improvements or additions to documentation labels Feb 2, 2024
/// * `DOC`: documentation string for the function
/// * `SCALAR_UDF_FUNC`: name of the function to create (just) the `ScalarUDF`
/// * `GNAME`: name for the single static instance of the `ScalarUDF`
macro_rules! make_udf_function {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be copied into every function crate or is this just here for this example migration? Seems perhaps something that could be put into a common/util/etc location

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have mentioned that -- there is some version of this macro in the datafusion-functions crate - https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/macros.rs

However it isn't quite the same (as the feature flag is on the entire datafusion-functions-array crate rather than on subsets of functions)

My thinking was that we would only have 2 function crates -- datatfusion_functions and datafusion_functions_arrays and thus the replication was ok

We could potentially move the macros into datafusion_expr or datafusion_common perhaps 🤔

@github-actions github-actions bot removed the optimizer Optimizer rules label Feb 7, 2024
// specific language governing permissions and limitations
// under the License.

//! implementation kernels for array functions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code in this module was just moved from array_expressions.rs

@alamb alamb marked this pull request as ready for review February 7, 2024 16:35
use std::any::type_name;
use std::sync::Arc;

macro_rules! downcast_arg {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want to move into utils,math_expressions.rs has very similar macros, I'll do it in following PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @comphead -- once this PR is approved and merged, that would be great

I suspect as we go about porting over the old to new code, we will find other examples that can be cleaned up (some of the functions were implemented very early in DataFusion's life)

@@ -641,7 +641,7 @@ enum ScalarFunction {
ArrayPrepend = 94;
ArrayRemove = 95;
ArrayReplace = 96;
ArrayToString = 97;
// 97 was ArrayToString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it was removed?

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed because it is proto for ScalarFunction.
But it seems we need to implement another version for UDF?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the nice things about making these functions UDF -- they can be serialized / deserialized without any special code.

Here is a test showing it working for encode/decode: https://github.com/apache/arrow-datafusion/blob/8413da8803665c07c2250a0d015f6b67fb7ef743/datafusion/proto/tests/cases/roundtrip_logical_plan.rs#L573

I will add a test case for round tripping array functions too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in 97a789a

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb
Very needed initiative
@Weijun-H @jayzhan211 cc

@comphead
Copy link
Contributor

comphead commented Feb 7, 2024

Should we include the package/feature in datafusion-cli/Cargo.toml?
Wondering what will happen if user built the cli without array functions feature and tries to invoke array function in SQL

// specific language governing permissions and limitations
// under the License.

//! [`ScalarUDFImpl`] definitions for array functions.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the API defined for the function definitions

@@ -1870,27 +1830,6 @@ pub fn array_replace_all(args: &[ArrayRef]) -> Result<ArrayRef> {
}
}

macro_rules! to_string {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to the datafusion-functions-array crate

@@ -1462,10 +1460,6 @@ pub fn parse_expr(
parse_expr(&args[2], registry)?,
parse_expr(&args[3], registry)?,
)),
ScalarFunction::ArrayToString => Ok(array_to_string(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this PR is to remove this enum (and set the pattern on how to do so for the rest of the array functionality

@@ -641,7 +641,7 @@ enum ScalarFunction {
ArrayPrepend = 94;
ArrayRemove = 95;
ArrayReplace = 96;
ArrayToString = 97;
// 97 was ArrayToString
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the nice things about making these functions UDF -- they can be serialized / deserialized without any special code.

Here is a test showing it working for encode/decode: https://github.com/apache/arrow-datafusion/blob/8413da8803665c07c2250a0d015f6b67fb7ef743/datafusion/proto/tests/cases/roundtrip_logical_plan.rs#L573

I will add a test case for round tripping array functions too

@alamb
Copy link
Contributor Author

alamb commented Feb 8, 2024

Should we include the package/feature in datafusion-cli/Cargo.toml?

Since array_expressions is a default feature, these functions will (still) be in datafusion-cli. I double checked this:

❯ select array_to_string([1,2,3], '---');
+---------------------------------------------------------------------+
| array_to_string(make_array(Int64(1),Int64(2),Int64(3)),Utf8("---")) |
+---------------------------------------------------------------------+
| 1---2---3                                                           |
+---------------------------------------------------------------------+
1 row in set. Query took 0.006 seconds.

Wondering what will happen if user built the cli without array functions feature and tries to invoke array function in SQL

They will get a "no such function" error. It may be worth considering a way to hint / add stub functions that could be registered that say something like "no such function, but there is a function named array_to_string in feature package array_functions" 🤔

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb
Epic work

@comphead comphead merged commit 292865e into main Feb 12, 2024
44 checks passed
@alamb alamb deleted the alamb/array_crate branch February 12, 2024 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants