-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Support min max aggregates in window functions with sliding windows #4675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The functions are run with only float64 columns now as in the test case. Support for all types will be implemented.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @berkaycpp -- this looks like a good start
I have some performance concerns, which I described in detail.
I also think the code needs significantly more test coverage - the sql level MIN/MAX tests are a good start but I don't think they necessarily hit all the corner cases.
The sliding_window is a fascinating proposal
cc @Ted-Jiang
|
Thanks @alamb for your feedbacks. We will address them as soon as possible. @Ted-Jiang if you have time, we would like to have your feedback also regarding this design. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the changes in datafusion/core/src/datasource/file_format/parquet.rs and datafusion/core/src/datasource/mod.rs I think this PR looks ready to go.
Thank you for your work on it
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaycpp and @mustafasrepo
|
Benchmark runs are scheduled for baseline = 77991a3 and contender = afb1ae2. afb1ae2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
|
||
| // The implementation is taken from https://github.com/spebern/moving_min_max/blob/master/src/lib.rs. | ||
|
|
||
| //! Keep track of the minimum or maximum value in a sliding window. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an update here, I subsequently discovered a possible better algorithm for this work -- see #4904 for details
Which issue does this PR close?
Closes #4603 and closes #4402.
Rationale for this change
This PR adds support for running MIN-MAX accumulators using custom window frames. Since adding support for sliding window for min-max introduces performance penalty, we have added
ForwardAggregateWindowExpr, which is a special implementation for expressions that don't require retract. By using this struct, if not needed, we do not introduce performance penalty. As an example window expression,MIN(c12) OVER (ORDER BY C12 RANGE BETWEEN 0.3 PRECEDING AND 0.2 FOLLOWING)will useMinAccumulatorto calculate its result (where implementation is sliding and introduces additional overhead). However, window expressionMIN(c12) OVER (ORDER BY C12 RANGE BETWEEN UNBOUNDED PRECEDING AND 0.2 FOLLOWING)will useMinRowAccumulatorwhich is a better implementation when retract is not a concern.What changes are included in this PR?
update_batchandretract_batchmethods of MIN, MAX accumulators are implemented. The algorithm and data structure used are described here. If an accumulator has a support for better implementation when retract is unnecessary, we now use optimized implementation during execution.Are these changes tested?
Yes.
aggregate_min_max_w_custom_window_framestest function is added.Are there any user-facing changes?
No.