Move min and max to user defined aggregate function #11013
@@ -232,54 +222,6 @@ mod tests {
        Ok(())
    }

    #[test]
    fn test_min_max_expr() -> Result<()> {
We can move this test to slt
I have something that is starting to look reasonable, but some optimizer tests are now failing for reasons I can't understand.
I guess you skip the aggregate statistics optimization for min/max. See datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs, lines 177 to 224 in 18042fd.
You might need to check if the |
I fixed this but now I have a test that doesn't pass on the optimizer (there are two actually)
That suggests the optimizer cannot use, or does not understand, the existing aliases that provide DISTINCT test.b. I'm still looking; any tip would be highly appreciated. |
I think we should add distinct for MIN/MAX so we can get the But I think there is no difference between MIN and Distinct Min, maybe we could remove distinct for MIN/MAX beforehand? Introduce EliminateDistinct optimize rule for MIN/MAX. |
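The claim that MIN and MIN(DISTINCT ...) are equivalent can be checked with a small standalone sketch. It does not use DataFusion at all; the helper names `min_all`, `min_distinct`, `max_all` and `max_distinct` are hypothetical and exist only for illustration. De-duplicating the input never changes its smallest or largest element, which is why dropping DISTINCT for these two aggregates should be safe.

```rust
use std::collections::HashSet;

/// MIN over every input row (plain `MIN(x)`).
fn min_all(values: &[i64]) -> Option<i64> {
    values.iter().copied().min()
}

/// MIN over the de-duplicated rows, i.e. what `MIN(DISTINCT x)` computes.
fn min_distinct(values: &[i64]) -> Option<i64> {
    let unique: HashSet<i64> = values.iter().copied().collect();
    unique.into_iter().min()
}

/// MAX over every input row (plain `MAX(x)`).
fn max_all(values: &[i64]) -> Option<i64> {
    values.iter().copied().max()
}

/// MAX over the de-duplicated rows, i.e. what `MAX(DISTINCT x)` computes.
fn max_distinct(values: &[i64]) -> Option<i64> {
    let unique: HashSet<i64> = values.iter().copied().collect();
    unique.into_iter().max()
}
```

For any input slice, `min_all` and `min_distinct` return the same value, and likewise for the two max helpers, including the empty case where all four return `None`.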
Is this a part of the optimizer, i.e. https://github.com/edmondop/arrow-datafusion/blob/main/datafusion/optimizer/src/replace_distinct_aggregate.rs ? Thank you for your help, btw. |
I don't think so, Distinct/Distinct On is different from distinct in the function. |
@jayzhan211 I have started experimenting with an optimizer rule, but removing the distinct results in this error:
Do I need to change also the equivalence rules? |
You can take |
Thanks. I guess I wasn't clear in my comment here #11013 (comment) . How should that test failure be addressed? It seems that min/max udaf uses other aliases and is not reusing the intermediate results already available |
If we eliminate distinct of min/max prior to |
Wouldn't eliminating it require the optimizer rule? Or do you suggest I update the test case? Or the expected value? |
Yes, I suggest we update the test like this:
#[test]
fn one_distinct_and_two_common() -> Result<()> {
let table_scan = test_table_scan()?;
let plan = LogicalPlanBuilder::from(table_scan)
.aggregate(
vec![col("a")],
vec![sum(col("c")), count_distinct(col("b")), max(col("b"))],
)?
.build()?;
// Should work
let expected = "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n TableScan: test [a:UInt32, b:UInt32, c:UInt32]";
assert_optimized_plan_equal(plan, expected)
}
There seems to be a column added to the Aggregate node in the logical plan. Can that affect performance and/or memory footprint? This was the reason why I didn't update the test in the first place.
This is a subset of the new plan
while this is the subset from the previous plan
there is an alias3:UInt64 that gets added |
Remove the Min/Max matching in |
match expr {
    Expr::AggregateFunction(ref fun) => {
        let fn_name = fun.func_def.name().to_lowercase();
        if fun.distinct && WORKSPACE_ROOT_LOCK.get_or_init(|| vec!["min".to_string(), "max".to_string()]).contains(&fn_name) {
Do we need oncelock here? I think or
is enough 🤔
Should I move this anyways in a different PR so we can discuss that separately?
Yes, I suggest we move eliminate distinct to another PR; maybe another reviewer has a better idea.
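On the OnceLock question in this thread: the reviewer's exact suggestion is not preserved here, so the following is only one possible simpler form, not necessarily what was proposed. It is a minimal sketch in which `is_distinct_insensitive` is a hypothetical helper name. For a fixed set of two names, a direct comparison avoids both the lazily initialized static and the `Vec<String>` allocation.

```rust
/// Hypothetical helper: returns true when the aggregate's result does not
/// depend on DISTINCT, which holds for `min` and `max`.
/// The comparison is case-insensitive, so no lowercased copy is needed.
fn is_distinct_insensitive(fn_name: &str) -> bool {
    fn_name.eq_ignore_ascii_case("min") || fn_name.eq_ignore_ascii_case("max")
}
```

With a helper like this, the condition in the diff could in principle read `fun.distinct && is_distinct_insensitive(fun.func_def.name())`, assuming `name()` returns a string slice as the diff suggests.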
Thank you so much @edmondop -- I took a look at this PR and I think in general it is quite close.
It needs:
- to remove the old min/max implementation in https://github.com/apache/datafusion/blob/5bb6b356277ea1c6f1d7af64e2d66f005d7e1ed4/datafusion/physical-expr/src/aggregate/min_max.rs
- resolve some merge conflicts
There is also a follow on issue / PR I would like to make regarding the optimizer check
Given this PR has hung out for a while and has some merge conflicts now I am going to try and help polish it up
impl Min {
    pub fn new() -> Self {
        Self {
            aliases: vec!["count".to_string()],
I am not sure min should have a "count" alias 🤔
impl Max {
    pub fn new() -> Self {
        Self {
            aliases: vec!["count".to_string()],
likewise here, "count" doesn't seem right
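To illustrate why a stray "count" alias on Min and Max is risky, here is a toy registry. It is an assumption made for this sketch and not DataFusion's actual function registry; the `Registry` type and its methods are hypothetical. When every name and alias shares one lookup table, registering `min` with the alias "count" silently redirects `count` lookups to `min`.

```rust
use std::collections::HashMap;

/// Toy function registry (hypothetical): maps every name and alias
/// to the canonical name of the function that owns it.
#[derive(Default)]
struct Registry {
    by_name: HashMap<String, String>,
}

impl Registry {
    /// Registers `name` and its aliases. Returns the keys that previously
    /// belonged to a different function and were overwritten.
    fn register(&mut self, name: &str, aliases: &[&str]) -> Vec<String> {
        let mut clobbered = Vec::new();
        for key in std::iter::once(name).chain(aliases.iter().copied()) {
            if let Some(prev) = self.by_name.insert(key.to_string(), name.to_string()) {
                if prev != name {
                    clobbered.push(key.to_string());
                }
            }
        }
        clobbered
    }

    /// Resolves a name or alias to the canonical function name.
    fn resolve(&self, name: &str) -> Option<&str> {
        self.by_name.get(name).map(String::as_str)
    }
}
```

Registering `count` first and then `min` with the alias "count" makes `resolve("count")` return `min` in this toy model, which is the kind of collision the review comments point at.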
@@ -173,6 +173,23 @@ fn take_optimizable_column_and_table_count(
    None
}

fn unwrap_min(agg_expr: &dyn AggregateExpr) -> Option<&AggregateFunctionExpr> {
As part of a follow on PR, I think we should change this optimizer to use some method on AggregateFunctionExpr rather than the name
So something like
impl AggregateFunctionExpr {
/// Return the value of the aggregate function if there are no rows in
///
/// If the optimizer knows the input to an aggregation operation has
/// no rows then it will replace the aggregation with a constant
///
/// If the value for 0 input rows is not known, returns None (the default)
fn zero_row_value(&self) -> Option<ScalarValue> { None }
...
}
We can do this as a follow on PR
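The proposed hook could look roughly like the standalone sketch below. Everything in it is an assumption for illustration: the `ScalarValue` enum is a two-variant stand-in for DataFusion's real type, and the `ZeroRowValue` trait, the `Count`, `Min` and `Custom` structs, and `fold_empty_input` are hypothetical names. The point is that the optimizer would ask the aggregate itself instead of matching on the function name.

```rust
/// Simplified stand-in for DataFusion's `ScalarValue` (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Null,
    Int64(i64),
}

/// Hypothetical trait carrying the proposed hook.
trait ZeroRowValue {
    /// Value of the aggregate over zero input rows, if known.
    /// The default is `None`, meaning "unknown, do not fold".
    fn zero_row_value(&self) -> Option<ScalarValue> {
        None
    }
}

struct Count;
struct Min;
/// A user-defined aggregate that does not override the hook.
struct Custom;

impl ZeroRowValue for Count {
    // COUNT over an empty input is 0.
    fn zero_row_value(&self) -> Option<ScalarValue> {
        Some(ScalarValue::Int64(0))
    }
}

impl ZeroRowValue for Min {
    // In SQL, MIN over an empty input is NULL.
    fn zero_row_value(&self) -> Option<ScalarValue> {
        Some(ScalarValue::Null)
    }
}

impl ZeroRowValue for Custom {}

/// What an optimizer could do once it knows the input has no rows:
/// replace the aggregate with a constant when one is available.
fn fold_empty_input(agg: &dyn ZeroRowValue) -> Option<ScalarValue> {
    agg.zero_row_value()
}
```

Under this design a new aggregate opts in by overriding one method, and helpers such as `unwrap_min` that inspect names would no longer be needed for this particular check.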
I think as long as you can explain to me how to resolve the current test failure I should be fine. I agree that using names for min and max unwrapping is not very robust. |
Now that I have spent some more time working with this PR I see it still needs some additional work -- sorry for the noise
I started with merging up from main and resolving the conflicts: edmondop#1
Once that is merged / ready, I think we could keep hacking at this PR together.
Alternatively, we could make some smaller PRs to remove the barriers / unblock this one. For example, we could remove the direct use of the Min/Max PhysicalExprs, such as in #11013 (comment), as well as here: datafusion/datafusion/physical-plan/src/aggregates/mod.rs, lines 485 to 494 in f58df32
If you are interested, I can file tickets explaining how those smaller tasks |
Yes 🙏 |
Ok, I filed #11153 and then some starting tasks like this Task List
Hopefully that helps |
Which issue does this PR close?
Closes #10943 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?