-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move min and max to user defined aggregate function #11013
base: main
Are you sure you want to change the base?
Conversation
I do have something that's starting to look reasonable, but some tests on the optimizer now are failing for some reasons I can't understand
|
I guess you skip the aggregate statistic optimization for min/max datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs Lines 177 to 224 in 18042fd
You might need to check if the |
I fixed this but now I have a test that doesn't pass on the optimizer (there are two actually)
That suggests that the optimizer cannot use the existing aliases / doesn't understand the existing aliases that provide DISTINCT test.b . Looking, any tip would be highly appreciated |
I think we should add distinct for MIN/MAX so we can get the But I think there is no difference between MIN and Distinct Min, maybe we could remove distinct for MIN/MAX beforehand? Introduce EliminateDistinct optimize rule for MIN/MAX. |
Is this a part of the optimizer i.e. https://github.com/edmondop/arrow-datafusion/blob/main/datafusion/optimizer/src/replace_distinct_aggregate.rs ? Thank your for your help btw |
I don't think so, Distinct/Distinct On is different from distinct in the function. |
@jayzhan211 I have started experimenting with an optimizer rule, but removing the distinct result in such an error:
Do I need to change also the equivalence rules? |
You can take |
Thanks. I guess I wasn't clear in my comment here #11013 (comment) . How should that test failure be addressed? It seems that min/max udaf uses other aliases and is not reusing the intermediate results already available |
If we eliminate distinct of min/max prior to |
Wouldn't eliminating it require the optimizer rule? Or do you suggest I update the test case? Or the expected value? |
Yes, I suggest we update the test like #[test]
fn one_distinct_and_two_common() -> Result<()> {
let table_scan = test_table_scan()?;
let plan = LogicalPlanBuilder::from(table_scan)
.aggregate(
vec![col("a")],
vec![sum(col("c")), count_distinct(col("b")), max(col("b"))],
)?
.build()?;
// Should work
let expected = "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n TableScan: test [a:UInt32, b:UInt32, c:UInt32]";
assert_optimized_plan_equal(plan, expected)
} |
There seems to be a column added to the Aggregate node in the logical plan, can that affect performance and/or memory footprint? This was the reason why I didn't update the test in the first place This is a subset of the new plan
while this is the subset from the previous plan
there is an alias3:UInt64 that gets added |
Remove the Min/Max matching in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @edmondop -- I took a look at this PR and I think in general it is quite close.
It needs:
- to remove the old min/max implementation in https://github.com/apache/datafusion/blob/5bb6b356277ea1c6f1d7af64e2d66f005d7e1ed4/datafusion/physical-expr/src/aggregate/min_max.rs
- resolve some merge conflicts
There is also a follow on issue / PR I would like to make regarding the optimizer check
Given this PR has hung out for a while and has some merge conflicts now I am going to try and help polish it up
I think as long as you can explain me how to resolve the current test failure I should be fine. Agree using names for min and max unwrapping is not very robust |
49f7a7f
to
4440a8d
Compare
I think this is now blocked by #11595 |
pub(crate) mod prim_op { | ||
pub use datafusion_physical_expr_common::aggregate::groups_accumulator::prim_op::PrimitiveGroupsAccumulator; | ||
} | ||
pub(crate) mod prim_op {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can delete it
As I understand it this PR is still a work in progress, so marking it as a draft (I am trying to make sure it is clear what PRs are waiting for review and what are not) |
356a8e1
to
ab26352
Compare
@jayzhan211 the change is dropping the limit in the physical plan node, I wasn't able to find out the source of it. Do you have any hint ? |
@edmondop datafusion/datafusion/core/src/physical_optimizer/topk_aggregation.rs Lines 48 to 85 in 01dc3f9
|
@jayzhan211 I am a little confused about the test case here
I have noticed that in the signature for min/max as an AggregateFunction, TimeInterval is not added
If I don't add it, the SQL test fails because
|
Which issue does this PR close?
Closes #10943 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?