Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[built-in function] add greatest(T,...) and least(T,...) SQL functions #6527

Closed
wants to merge 1 commit into from

Conversation

jimexist
Copy link
Member

@jimexist jimexist commented Jun 2, 2023

Which issue does this PR close?

Rationale for this change

add greatest(T,..) and least(T,..) variadic functions as per SQL 2023.

What changes are included in this PR?

  • add greatest(T,..) and least(T,..) variadic functions as per SQL 2023.
    • when no args are supplied the functions err
    • when both scalar and arrays are supplied then scalars are converted to arrays before max/min
    • only boolean, string, and numeric types are supported for now, no dates, etc.
    • arrow zip and ord kernels are used
  • change variadic equal function signature to take a vec of allowed types, only types within that types can be used

Are these changes tested?

  • unit tests
  • integration tests using tokio and based on sql

Are there any user-facing changes?

@github-actions github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions labels Jun 2, 2023
)))
} else {
Err(DataFusionError::NotImplemented(format!(
"{:?} expressions are not implemented for arrays yet as we need to update arrow kernels",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb is there already a max/min kernel in arrow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is min like https://docs.rs/arrow/latest/arrow/compute/fn.min.html but I think it works only for primitive arrays

Perhaps you can use the existing MinAccumulator / Max Accumulators?

https://docs.rs/datafusion/latest/datafusion/physical_plan/expressions/struct.MinAccumulator.html

@jimexist jimexist changed the title [built-in function] add greatest and least [built-in function] add greatest() and least() SQL functions Jun 2, 2023
Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Learned a lot from that 👍🏼

But there seem some problems with the current signature definitions
https://github.com/apache/arrow-datafusion/blob/815413c4a4b473f996ccaa7deb650653430a5aba/datafusion/expr/src/signature.rs#L41-L50
This seems to be the first function that implemented variadic_equal signature, the current implementation said it's
arbitrary number of arguments of an arbitrary but equal type
but the arguments should also be from some common type list for this case (greatest()/least()).

So it's better to change TypeSignature to:

pub enum TypeSignature {
    /// arbitrary number of arguments of an common type out of a list of valid types
    // A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])`
    Variadic(Vec<DataType>),
    /// arbitrary number of arguments of an common type out of a list of valid types, all arguments have the same type
    VariadicEqual(Vec<DataType>),
    /// arbitrary number of arguments of an arbitrary but equal type
    // A function such as `array` is `VariadicEqual`
    // The first argument decides the type used for coercion
    VariadicEqualAny,
    /// arbitrary number of arguments with arbitrary types
    VariadicAny,
    ...

And then the planner can better catch invalid function calls like this one

DataFusion CLI v25.0.0
❯ select least(interval '1 day', interval '2 day', interval '3 day');
thread 'main' panicked at 'Unsupported data type for comparison: Interval(MonthDayNano)', /Users/yongting/Desktop/code/arrow-datafusion/datafusion/physical-expr/src/comparison_expressions.rs:74:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@@ -376,6 +381,9 @@ pub fn signature(fun: &BuiltinScalarFunction) -> Signature {
BuiltinScalarFunction::Chr | BuiltinScalarFunction::ToHex => {
Signature::uniform(1, vec![DataType::Int64], fun.volatility())
}
BuiltinScalarFunction::Greatest | BuiltinScalarFunction::Least => {
Signature::variadic_equal(fun.volatility())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be Signature::variadic_equal(SUPPORTED_COMPARISON_TYPES, fun.volatility())

@jimexist jimexist force-pushed the add-greatest-least branch 2 times, most recently from 1b4e6c8 to 74fb1fa Compare June 4, 2023 02:36
@jimexist jimexist changed the title [built-in function] add greatest() and least() SQL functions [built-in function] add greatest(T,...) and least(T,...) SQL functions Jun 4, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome back 👋 @jimexist ! It is great to see you again and thank you for your contribution.

I am sorry I wrote a comment about using Min and Max accumulators previously but I forgot to click submit.

cc @izveigor as I think you have been working to add additional support for array computations

)))
} else {
Err(DataFusionError::NotImplemented(format!(
"{:?} expressions are not implemented for arrays yet as we need to update arrow kernels",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is min like https://docs.rs/arrow/latest/arrow/compute/fn.min.html but I think it works only for primitive arrays

Perhaps you can use the existing MinAccumulator / Max Accumulators?

https://docs.rs/datafusion/latest/datafusion/physical_plan/expressions/struct.MinAccumulator.html

// A function such as `array` is `VariadicEqual`
// The first argument decides the type used for coercion
VariadicEqual,
VariadicEqual(Vec<DataType>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the types are all equal, what is the purpose of storing a Vec of them? or maybe the Vec is the list of valid types?

It also looks like the code special cases when there are no types specified to mean "any type is allowed" -- can we please explicitly mention that in the docs as well? Or maybe add a new variant (VariadicEqualSpecific?)

Comment on lines +274 to +279
let is_error = get_valid_types(
&signature,
&[DataType::Int32, DataType::Boolean, DataType::Int32],
)
.is_err();
assert!(is_error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to test for error is doing:

Suggested change
let is_error = get_valid_types(
&signature,
&[DataType::Int32, DataType::Boolean, DataType::Int32],
)
.is_err();
assert!(is_error);
let is_error = get_valid_types(
&signature,
&[DataType::Int32, DataType::Boolean, DataType::Int32],
)
.uwrap_err();

Which is both more concise, but I think also prints out what the OK value is when the result is not an Error

}

/// Evaluate a greatest or least function
fn compare(op: ComparisonOperator, args: &[ColumnarValue]) -> Result<ColumnarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this functionality is basically the same as the Min/Max accumulators (e.g. https://docs.rs/datafusion/latest/datafusion/physical_plan/expressions/struct.MinAccumulator.html)

}

#[tokio::test]
async fn test_comparison_func_array_scalar_expression() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please write this test using sqllogictest instead of a new rs test? I think you'll find it quite nice for sql tests and it is easier to maintain and extend.

https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests#cookbook-adding-tests

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked the implementation of Variadic, it's actually the VariadicEqual we are trying to implement:
If a function is given a signature least(Variadic[Int32/Int64], it will try if can coerce to any of least(Int32, Int32, ...) or least(Int64, Int64, ...)
https://github.com/apache/arrow-datafusion/blob/a7970ebf6ef5181fe78de5db63dcf05760db5ace/datafusion/expr/src/type_coercion/functions.rs#L66-L69

But its name and doc are a bit misleading, which makes me ignore that previously and tried to use a new variant, maybe we can also make it more obvious.

@@ -376,6 +381,12 @@ pub fn signature(fun: &BuiltinScalarFunction) -> Signature {
BuiltinScalarFunction::Chr | BuiltinScalarFunction::ToHex => {
Signature::uniform(1, vec![DataType::Int64], fun.volatility())
}
BuiltinScalarFunction::Greatest | BuiltinScalarFunction::Least => {
Signature::variadic_equal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried Signature::variadic( and it just work as expected

❯ select least(1,2,3);
+-----------------------------------+
| least(Int64(1),Int64(2),Int64(3)) |
+-----------------------------------+
| 1                                 |
+-----------------------------------+
1 row in set. Query took 0.077 seconds.
❯ select least(1,2.0,3);
+-------------------------------------+
| least(Int64(1),Float64(2),Int64(3)) |
+-------------------------------------+
| 1.0                                 |
+-------------------------------------+
1 row in set. Query took 0.009 seconds.
❯ select least(interval '1 day', interval '2 day', interval '3 day');
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| least(IntervalMonthDayNano("18446744073709551616"),IntervalMonthDayNano("36893488147419103232"),IntervalMonthDayNano("55340232221128654848")) |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| 0 years 0 mons 1 days 0 hours 0 mins 0.000000000 secs                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 0.008 seconds.

@alamb
Copy link
Contributor

alamb commented Jun 9, 2023

Marking as draft as this PR has feedback and is awaiting a response (I am trying to keep the list of PRs needing review clear) -- please mark as ready for review when ready

@alamb alamb marked this pull request as draft June 9, 2023 13:37
@jimexist
Copy link
Member Author

Welcome back 👋 @jimexist ! It is great to see you again and thank you for your contribution.

I am sorry I wrote a comment about using Min and Max accumulators previously but I forgot to click submit.

cc @izveigor as I think you have been working to add additional support for array computations

thank you @alamb my bandwidth recently is very limited, so I am going to pause this PR for a bit, if anyone wants to take over feel free to. I can also do a merge as is (after rebase fix) and create issues as follow ups?

@alamb
Copy link
Contributor

alamb commented Jun 12, 2023

thank you @alamb my bandwidth recently is very limited, so I am going to pause this PR for a bit, if anyone wants to take over feel free to. I can also do a merge as is (after rebase fix) and create issues as follow ups?

Whatever you prefer -- I defer to your judgement

Copy link

github-actions bot commented May 7, 2024

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 7, 2024
@github-actions github-actions bot closed this May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add greatest(T,...) and least(T,...) SQL functions
4 participants