Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed changes for more flexible user defined Aggregate and window functions #12

Closed
wants to merge 7 commits into from

Conversation

mustafasrepo
Copy link

@mustafasrepo mustafasrepo commented Jun 13, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

In #6617. We have discussed how to make user defined aggregate and window functions more flexible.

First of all current state is a bit complex for end user to handle. After examining the PartitionEvaluator trait we have decided that evaluate_stateful and evaluate_inside_range can be combined. Its new name is evaluate with the following API fn evaluate(&mut self,_values: &[ArrayRef],_range: &Range<usize>,) -> Result<ScalarValue> (Existing evaluate is renamed with evaluate_all to reflect better what function does).
It returns a single ScalarValue for the given input which is the result of window function (If function uses_window_frame result calculated according to given range).

Existing fn evaluate(&self, _values: &[ArrayRef], _num_rows: usize) -> Result<ArrayRef> is replaced by fn evaluate_all(&mut self, values: &[ArrayRef], num_rows: usize) -> Result<ArrayRef>.
This function receives whole of the data as single batch and produces all of the window result as bulk, which maybe more optimal for some use cases.

With the new API we have following options for the end user?

uses_window_frame supports_bounded_execution function_to_implement
false false evaluate_all (if we were to implement PERCENT_RANK it would end up in this quadrant, we cannot produce any result without seeing whole data)
false true evaluate (optionally can also implement evaluate_all for more optimized implementation. However, there will be default implementation that is suboptimal) . If we were to implement ROW_NUMBER it will end up in this quadrant. Example OddRowNumber showcases this use case
true false evaluate (I think as long as uses_window_frame is true. There is no way for supports_bounded_execution to be false). I couldn't come up with any example for this quadrant
true true evaluate. If we were to implement FIRST_VALUE, it would end up in this quadrant

To support end user to set flag uses_window_frame and supports_bounded_execution. I have moved these methods from BuiltInWindowFunctionExpr to PartitionEvaluator. However, in the following commit @alamb could find another way to add this support (I think his version is better. However, since this is showcase PR for new API, I didn't bother with retracting changes.).

In short, I think with the current approach in #6617. We are in very good shape (I will simplify evaluate logic with another PR). Hopefully, after these changes end user, by setting uses_window_frame and supports_bounded_execution properly. Then implementing corresponding evaluator (evaluate or evaluate_all) will be able to accomplish desired behavior for most of the use cases

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Copy link
Owner

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mustafasrepo, I really like this PR and its direction 👍 I have one open question but otherwise I think this is awesome

Open question

How does include_rank (link) fit into this model?

Suggested next step:

  1. Make a PR to apache/arrow-datafusion with the changes to PartitionEvaluator that are in this PR (I believe you plan to do that I will open the PR that unify evaluate and evaluate_stateful fields on the main repo once it is ready.)
  2. I will work on various tests / examples for WindowUDF on RFC: User Defined Window Functions apache/datafusion#6617 (which I will port to use the new API when it is ready)

Comments / Responses

First of all current state is a bit complex for end user to handle. After examining the PartitionEvaluator trait we have decided that evaluate_stateful and evaluate_inside_range can be combined. Its new name is evaluate with the following API ..

I also had this observation and I think your solution is very elegant 👍

With the new API we have following options for the end user?

I really like the tabular format of this analysis and it makes sense to me. Adding that table to the comments of PartitionEvaluator would really help people understand it.

In short, I think with the current approach in apache#6617. We are in very good shape (I will simplify evaluate logic with another PR).

I agree.

Thank you so much for all your help!

/// If this function returns true, [`Self::create_evaluator`] must
/// implement [`PartitionEvaluator::evaluate`]
fn supports_bounded_execution(&self) -> bool {
false
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When include_rank flag is true, evaluate_with_rank_all will be called. This method is basically same with evaluate_all in the spirit. (It takes all the data and produces all the output in single pass). However, since evaluate_with_rank_all requires additional arguments (such as rank boundaries). We do not unify their API, to not recalculate rank boundaries each time (even if we do not use them).
Certainly, we can move this trait to PartitionEvaluator also. However, I thought this would be confusing. Hence didn't move it. (I will think about how to combine evaluate_with_rank_all and evaluate_all without calculating rank boundaries unnecessarily).

Maybe we can present to the user just a subset of the PartitionEvaluator methods. They wouldn't see evaluate_with_rank_all either (Just like your suggestion in option 2).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -155,8 +155,9 @@ impl WindowExpr for PlainAggregateWindowExpr {
}

fn uses_bounded_memory(&self) -> bool {
self.aggregate.supports_bounded_execution()
&& !self.window_frame.end_bound.is_unbounded()
let supports_bounded_execution =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make this check explicit by adding the API described by @stuartcarnie here: apache#6611 (so that the accumulator can report on its capabilities)

}
}

// TODO show how to use other evaluate methods
/// These different evaluation methods are called depending on the various settings of WindowUDF
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mustafasrepo
Copy link
Author

Make a PR to apache/arrow-datafusion with the changes to PartitionEvaluator that are in this PR (I believe you plan to do that I will open the PR that unify evaluate and evaluate_stateful fields on the main repo once it is ready.)

I have opened the PR for stage 1. It can be found in the #6655

@mustafasrepo mustafasrepo deleted the feature/6617_exp branch July 14, 2023 07:01
alamb pushed a commit that referenced this pull request Jul 17, 2024
* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (#12)

* Configurable data type instead of flag for Utf8 unparsing

* Fix type in comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants