Proposed changes for more flexible user defined Aggregate and window functions #12

mustafasrepo · 2023-06-13T07:05:05Z

Which issue does this PR close?

Closes #.

Rationale for this change

In #6617. We have discussed how to make user defined aggregate and window functions more flexible.

First of all current state is a bit complex for end user to handle. After examining the PartitionEvaluator trait we have decided that evaluate_stateful and evaluate_inside_range can be combined. Its new name is evaluate with the following API fn evaluate(&mut self,_values: &[ArrayRef],_range: &Range<usize>,) -> Result<ScalarValue> (Existing evaluate is renamed with evaluate_all to reflect better what function does).
It returns a single ScalarValue for the given input which is the result of window function (If function uses_window_frame result calculated according to given range).

Existing fn evaluate(&self, _values: &[ArrayRef], _num_rows: usize) -> Result<ArrayRef> is replaced by fn evaluate_all(&mut self, values: &[ArrayRef], num_rows: usize) -> Result<ArrayRef>.
This function receives whole of the data as single batch and produces all of the window result as bulk, which maybe more optimal for some use cases.

With the new API we have following options for the end user?

uses_window_frame	supports_bounded_execution	function_to_implement
false	false	`evaluate_all` (if we were to implement `PERCENT_RANK` it would end up in this quadrant, we cannot produce any result without seeing whole data)
false	true	`evaluate` (optionally can also implement `evaluate_all` for more optimized implementation. However, there will be default implementation that is suboptimal) . If we were to implement `ROW_NUMBER` it will end up in this quadrant. Example `OddRowNumber` showcases this use case
true	false	`evaluate` (I think as long as `uses_window_frame` is `true`. There is no way for `supports_bounded_execution` to be false). I couldn't come up with any example for this quadrant
true	true	`evaluate`. If we were to implement `FIRST_VALUE`, it would end up in this quadrant

To support end user to set flag uses_window_frame and supports_bounded_execution. I have moved these methods from BuiltInWindowFunctionExpr to PartitionEvaluator. However, in the following commit @alamb could find another way to add this support (I think his version is better. However, since this is showcase PR for new API, I didn't bother with retracting changes.).

In short, I think with the current approach in #6617. We are in very good shape (I will simplify evaluate logic with another PR). Hopefully, after these changes end user, by setting uses_window_frame and supports_bounded_execution properly. Then implementing corresponding evaluator (evaluate or evaluate_all) will be able to accomplish desired behavior for most of the use cases

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Thank you @mustafasrepo, I really like this PR and its direction 👍 I have one open question but otherwise I think this is awesome

Open question

How does include_rank (link) fit into this model?

Suggested next step:

Make a PR to apache/arrow-datafusion with the changes to PartitionEvaluator that are in this PR (I believe you plan to do that I will open the PR that unify evaluate and evaluate_stateful fields on the main repo once it is ready.)
I will work on various tests / examples for WindowUDF on RFC: User Defined Window Functions apache/datafusion#6617 (which I will port to use the new API when it is ready)

Comments / Responses

First of all current state is a bit complex for end user to handle. After examining the PartitionEvaluator trait we have decided that evaluate_stateful and evaluate_inside_range can be combined. Its new name is evaluate with the following API ..

I also had this observation and I think your solution is very elegant 👍

With the new API we have following options for the end user?

I really like the tabular format of this analysis and it makes sense to me. Adding that table to the comments of PartitionEvaluator would really help people understand it.

In short, I think with the current approach in apache#6617. We are in very good shape (I will simplify evaluate logic with another PR).

I agree.

Thank you so much for all your help!

alamb · 2023-06-14T10:55:50Z

datafusion/expr/src/partition_evaluator.rs

+    /// If this function returns true, [`Self::create_evaluator`] must
+    /// implement [`PartitionEvaluator::evaluate`]
+    fn supports_bounded_execution(&self) -> bool {
+        false
    }


Is the idea that the special case forinclude_rank would also be an option here?

https://github.com/apache/arrow-datafusion/blob/6194d588d5c3e9f202a31a0c524f63e6fb08d040/datafusion/physical-expr/src/window/built_in_window_function_expr.rs#L106-L111

When include_rank flag is true, evaluate_with_rank_all will be called. This method is basically same with evaluate_all in the spirit. (It takes all the data and produces all the output in single pass). However, since evaluate_with_rank_all requires additional arguments (such as rank boundaries). We do not unify their API, to not recalculate rank boundaries each time (even if we do not use them).
Certainly, we can move this trait to PartitionEvaluator also. However, I thought this would be confusing. Hence didn't move it. (I will think about how to combine evaluate_with_rank_all and evaluate_all without calculating rank boundaries unnecessarily).

Maybe we can present to the user just a subset of the PartitionEvaluator methods. They wouldn't see evaluate_with_rank_all either (Just like your suggestion in option 2).

For anyone following along, I think we went with the evaluate_with_rank_all approach: https://github.com/apache/arrow-datafusion/blob/c3d5d77e447e51c2cca814a67706e5ab3e050ced/datafusion/physical-expr/src/window/partition_evaluator.rs#L209-L217

alamb · 2023-06-14T10:57:09Z

datafusion/physical-expr/src/window/aggregate.rs

@@ -155,8 +155,9 @@ impl WindowExpr for PlainAggregateWindowExpr {
    }

    fn uses_bounded_memory(&self) -> bool {
-        self.aggregate.supports_bounded_execution()
-            && !self.window_frame.end_bound.is_unbounded()
+        let supports_bounded_execution =


I think we can make this check explicit by adding the API described by @stuartcarnie here: apache#6611 (so that the accumulator can report on its capabilities)

alamb · 2023-06-14T10:59:05Z

datafusion-examples/examples/simple_udwf.rs

    }
 }

 // TODO show how to use other evaluate methods
+/// These different evaluation methods are called depending on the various settings of WindowUDF


mustafasrepo · 2023-06-14T12:31:56Z

Make a PR to apache/arrow-datafusion with the changes to PartitionEvaluator that are in this PR (I believe you plan to do that I will open the PR that unify evaluate and evaluate_stateful fields on the main repo once it is ready.)

I have opened the PR for stage 1. It can be found in the #6655

* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (#12) * Configurable data type instead of flag for Utf8 unparsing * Fix type in comment

minor changes

66e0e78

mustafasrepo marked this pull request as draft June 13, 2023 07:05

github-actions bot added core logical-expr physical-expr labels Jun 13, 2023

mustafasrepo added 6 commits June 13, 2023 10:12

Use case for first_value (uses window frame)

6de0cc1

Bounded implementation example

c613434

remove leftovers

000fc3c

Remove leftovers

dc7edf9

Change get_range implementation

84bdf28

Combine evaluate_stateful and evaluate_inside_ranges

828de45

mustafasrepo mentioned this pull request Jun 14, 2023

RFC: User Defined Window Functions apache/datafusion#6617

Closed

alamb approved these changes Jun 14, 2023

View reviewed changes

mustafasrepo mentioned this pull request Jun 14, 2023

Combine evaluate_stateful and evaluate_inside_range apache/datafusion#6665

Merged

mustafasrepo closed this Jun 22, 2023

mustafasrepo deleted the feature/6617_exp branch July 14, 2023 07:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed changes for more flexible user defined Aggregate and window functions #12

Proposed changes for more flexible user defined Aggregate and window functions #12

mustafasrepo commented Jun 13, 2023 •

edited

Loading

alamb left a comment

alamb Jun 14, 2023

mustafasrepo Jun 14, 2023

alamb Jun 14, 2023

alamb Jun 14, 2023

alamb Jun 14, 2023

mustafasrepo commented Jun 14, 2023

Proposed changes for more flexible user defined Aggregate and window functions #12

Proposed changes for more flexible user defined Aggregate and window functions #12

Conversation

mustafasrepo commented Jun 13, 2023 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

Open question

Suggested next step:

Comments / Responses

alamb Jun 14, 2023

Choose a reason for hiding this comment

mustafasrepo Jun 14, 2023

Choose a reason for hiding this comment

alamb Jun 14, 2023

Choose a reason for hiding this comment

alamb Jun 14, 2023

Choose a reason for hiding this comment

alamb Jun 14, 2023

Choose a reason for hiding this comment

mustafasrepo commented Jun 14, 2023

mustafasrepo commented Jun 13, 2023 •

edited

Loading