Move min and max to user defined aggregate function #11013

edmondop · 2024-06-19T18:52:33Z

Which issue does this PR close?

Closes #10943.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

datafusion/physical-expr/src/aggregate/build_in.rs

edmondop · 2024-06-20T20:04:35Z

I have something that's starting to look reasonable, but some optimizer tests are now failing for reasons I can't understand:

running 1 test
test custom_sources_cases::optimizers_catch_all_statistics ... FAILED

successes:

successes:

failures:

---- custom_sources_cases::optimizers_catch_all_statistics stdout ----
thread 'custom_sources_cases::optimizers_catch_all_statistics' panicked at datafusion/core/tests/custom_sources_cases/mod.rs:274:5:
Expected aggregate_statistics optimizations missing: AggregateExec { mode: Single, group_by: PhysicalGroupBy { expr: [], null_expr: [], groups: [[]] }, aggr_expr: [AggregateFunctionExpr { fun: AggregateUDF { inner: Count { name: "COUNT", signature: Signature { type_signature: VariadicAny, volatility: Immutable } } }, args: [Literal { value: Int64(1) }], logical_args: [Literal(Int64(1))], data_type: Int64, name: "COUNT(*)", schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, sort_exprs: [], ordering_req: [], ignore_nulls: false, ordering_fields: [], is_distinct: false, input_type: Int64 }, AggregateFunctionExpr { fun: AggregateUDF { inner: Min { signature: Signature { type_signature: Numeric(1), volatility: Immutable }, aliases: ["min"] } }, args: [Column { name: "c1", index: 0 }], logical_args: [Column(Column { relation: Some(Bare { table: "test" }), name: "c1" })], data_type: Int32, name: "MIN(test.c1)", schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, sort_exprs: [], ordering_req: [], ignore_nulls: false, ordering_fields: [], is_distinct: false, input_type: Int32 }, AggregateFunctionExpr { fun: AggregateUDF { inner: Max { signature: Signature { type_signature: Numeric(1), volatility: Immutable }, aliases: ["max"] } }, args: [Column { name: "c1", index: 0 }], logical_args: [Column(Column { relation: Some(Bare { table: "test" }), name: "c1" })], data_type: Int32, name: "MAX(test.c1)", schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, sort_exprs: [], ordering_req: [], ignore_nulls: false, ordering_fields: [], is_distinct: false, input_type: Int32 }], filter_expr: [None, None, None], limit: None, input: CustomExecutionPlan { projection: Some([0]), cache: PlanProperties { 
eq_properties: EquivalenceProperties { eq_group: EquivalenceGroup { classes: [] }, oeq_class: OrderingEquivalenceClass { orderings: [] }, constants: [], schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} } }, partitioning: UnknownPartitioning(1), execution_mode: Bounded, output_ordering: None } }, schema: Schema { fields: [Field { name: "COUNT(*)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MIN(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MAX(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, input_schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, metrics: ExecutionPlanMetricsSet { inner: Mutex { data: MetricsSet { metrics: [] } } }, required_input_ordering: None, input_order_mode: Linear, cache: PlanProperties { eq_properties: EquivalenceProperties { eq_group: EquivalenceGroup { classes: [] }, oeq_class: OrderingEquivalenceClass { orderings: [] }, constants: [], schema: Schema { fields: [Field { name: "COUNT(*)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MIN(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MAX(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} } }, partitioning: UnknownPartitioning(1), execution_mode: Bounded, output_ordering: None } }

jayzhan211 · 2024-06-21T02:29:48Z

I guess the aggregate statistics optimization is now being skipped for min/max.

datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs

Lines 177 to 224 in 18042fd

    
    fn take_optimizable_min(
        agg_expr: &dyn AggregateExpr,
        stats: &Statistics,
    ) -> Option<(ScalarValue, String)> {
        if let Precision::Exact(num_rows) = &stats.num_rows {
            match *num_rows {
                0 => {
                    // MIN/MAX with 0 rows is always null
                    if let Some(casted_expr) =
                        agg_expr.as_any().downcast_ref::<expressions::Min>()
                    {
                        if let Ok(min_data_type) =
                            ScalarValue::try_from(casted_expr.field().unwrap().data_type())
                        {
                            return Some((min_data_type, casted_expr.name().to_string()));
                        }
                    }
                }
                value if value > 0 => {
                    let col_stats = &stats.column_statistics;
                    if let Some(casted_expr) =
                        agg_expr.as_any().downcast_ref::<expressions::Min>()
                    {
                        if casted_expr.expressions().len() == 1 {
                            // TODO optimize with exprs other than Column
                            if let Some(col_expr) = casted_expr.expressions()[0]
                                .as_any()
                                .downcast_ref::<expressions::Column>()
                            {
                                if let Precision::Exact(val) =
                                    &col_stats[col_expr.index()].min_value
                                {
                                    if !val.is_null() {
                                        return Some((
                                            val.clone(),
                                            casted_expr.name().to_string(),
                                        ));
                                    }
                                }
                            }
                        }
                    }
                }
                _ => {}
            }
        }
        None
    }

You might need to check if the AggregateExpr is min/max in take_optimizable_min and take_optimizable_max
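
The check suggested above could look roughly like the following. This is a self-contained sketch, not DataFusion code: `AggregateExpr`, `BuiltinMin`, `UdafExpr`, `ColumnStats`, and this simplified `take_optimizable_min` are hypothetical stand-ins. They only model the idea that, once MIN lives inside a UDAF wrapper, a downcast to the old built-in type no longer matches, so the rule has to recognise the wrapper as well.

```rust
use std::any::Any;

/// Hypothetical stand-in for a physical aggregate expression.
trait AggregateExpr {
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical stand-in for the old built-in `Min` expression.
struct BuiltinMin {
    column_index: usize,
}

/// Hypothetical stand-in for an aggregate expression that wraps a UDAF.
struct UdafExpr {
    udaf_name: String,
    column_index: usize,
}

impl AggregateExpr for BuiltinMin {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl AggregateExpr for UdafExpr {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Exact per-column minimum, if the source reported one.
struct ColumnStats {
    exact_min: Option<i64>,
}

/// Returns the precomputed minimum when `agg_expr` is MIN over a column
/// with exact statistics, accepting both the built-in and the UDAF form.
fn take_optimizable_min(agg_expr: &dyn AggregateExpr, stats: &[ColumnStats]) -> Option<i64> {
    let any = agg_expr.as_any();
    let column_index = if let Some(min) = any.downcast_ref::<BuiltinMin>() {
        min.column_index
    } else if let Some(udaf) = any.downcast_ref::<UdafExpr>() {
        // Without this branch a UDAF-based MIN is silently skipped.
        if !udaf.udaf_name.eq_ignore_ascii_case("min") {
            return None;
        }
        udaf.column_index
    } else {
        return None;
    };
    stats.get(column_index)?.exact_min
}
```

Matching on the function name keeps the sketch short, but it is fragile; a real rule would likely want a more robust way to identify min/max.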

edmondop · 2024-06-21T16:02:24Z

I fixed this, but now a test in the optimizer doesn't pass (there are two, actually):

---- single_distinct_to_groupby::tests::two_distinct_and_one_common stdout ----
thread 'single_distinct_to_groupby::tests::two_distinct_and_one_common' panicked at datafusion/optimizer/src/test/mod.rs:200:5:
assertion `left == right` failed
  left: "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n  Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n    Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"
 right: "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias1) AS MAX(DISTINCT test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(DISTINCT test.b):UInt32;N]\n  Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias1)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias1):UInt32;N]\n    Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2]] [a:UInt32, alias1:UInt32, alias2:UInt64;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"

That suggests that the optimizer cannot use (or doesn't understand) the existing aliases that provide DISTINCT test.b. I'm still looking into it; any tips would be highly appreciated.

jayzhan211 · 2024-06-22T00:11:02Z

I think we should add distinct support for MIN/MAX so that we still get the distinct after the group by is converted to a distinct function.

However, I think there is no difference between MIN and MIN(DISTINCT), so maybe we could remove the distinct for MIN/MAX beforehand, e.g. by introducing an EliminateDistinct optimizer rule for MIN/MAX.
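
The claim that DISTINCT is redundant for MIN/MAX can be checked on plain values. A minimal sketch follows; the helper names are made up for illustration and are unrelated to DataFusion's API.

```rust
use std::collections::BTreeSet;

/// MIN over every input value, duplicates included.
fn min_all(values: &[u32]) -> Option<u32> {
    values.iter().copied().min()
}

/// MIN(DISTINCT ...): deduplicate first, then take the minimum.
fn min_distinct(values: &[u32]) -> Option<u32> {
    let distinct: BTreeSet<u32> = values.iter().copied().collect();
    distinct.into_iter().min()
}

/// MAX over every input value, duplicates included.
fn max_all(values: &[u32]) -> Option<u32> {
    values.iter().copied().max()
}

/// MAX(DISTINCT ...): deduplicate first, then take the maximum.
fn max_distinct(values: &[u32]) -> Option<u32> {
    let distinct: BTreeSet<u32> = values.iter().copied().collect();
    distinct.into_iter().max()
}
```

Deduplication never removes the smallest or largest value, it only removes repeats of it, so both variants agree on every input, including the empty one.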

edmondop · 2024-06-22T01:04:52Z

Is this part of the optimizer, i.e. https://github.com/edmondop/arrow-datafusion/blob/main/datafusion/optimizer/src/replace_distinct_aggregate.rs? Thank you for your help, btw.

jayzhan211 · 2024-06-22T01:43:00Z

I don't think so; Distinct / Distinct On is different from a distinct inside a function call.

edmondop · 2024-06-23T20:27:08Z

@jayzhan211 I have started experimenting with an optimizer rule, but removing the distinct results in the following error:

running 2 tests
test eliminate_distinct::tests::eliminate_distinct_from_min_expr ... FAILED
test eliminate_nested_union::tests::eliminate_distinct_nothing ... ok

failures:

---- eliminate_distinct::tests::eliminate_distinct_from_min_expr stdout ----
Transformed yes true
Error: Context("Optimizer rule 'eliminate_distinct' failed", Context("eliminate_distinct", Internal("Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: \"a\", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"MIN(DISTINCT test.b)\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: \"test\" }), None], functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1], nullable: false, mode: Single }] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: \"a\", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"MIN(test.b)\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: \"test\" }), None], functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1], nullable: false, mode: Single }] } }")))

Do I also need to change the equivalence rules?

jayzhan211 · 2024-06-23T23:59:37Z

eliminate_distinct_from_min_expr

You can take single_distinct_to_groupby as a reference; it uses an alias to keep the schema equivalent.
Also, I suggest we introduce this rule in a separate PR rather than mixing it with the MIN/MAX UDAF change.
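
The alias trick can be illustrated with a toy expression tree. This is only a sketch under stated assumptions: `Expr`, `schema_name`, and `eliminate_distinct` below are hypothetical and much simpler than DataFusion's real types. The point is that the rewritten aggregate is wrapped in an alias carrying the original name, so the output schema still contains `MIN(DISTINCT test.b)` and the schema check does not fail.

```rust
/// Toy expression tree; a hypothetical stand-in for a logical `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Aggregate { func: String, distinct: bool, arg: Box<Expr> },
    Alias(Box<Expr>, String),
}

impl Expr {
    /// Name this expression contributes to the output schema.
    fn schema_name(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Aggregate { func, distinct, arg } => {
                let prefix = if *distinct { "DISTINCT " } else { "" };
                format!("{}({}{})", func, prefix, arg.schema_name())
            }
            Expr::Alias(_, name) => name.clone(),
        }
    }
}

/// Drops DISTINCT from MIN/MAX and aliases the result to the original name,
/// so the plan's output schema is unchanged by the rewrite.
fn eliminate_distinct(expr: Expr) -> Expr {
    let original_name = expr.schema_name();
    match expr {
        Expr::Aggregate { func, distinct: true, arg } if func == "MIN" || func == "MAX" => {
            let rewritten = Expr::Aggregate { func, distinct: false, arg };
            Expr::Alias(Box::new(rewritten), original_name)
        }
        // Everything else (including COUNT(DISTINCT ...)) is left untouched.
        other => other,
    }
}
```

Without the alias, the rewritten expression would be named `MIN(test.b)`, which is exactly the mismatch reported in the error above.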

edmondop · 2024-06-24T00:12:59Z

Thanks. I guess I wasn't clear in my comment here: #11013 (comment). How should that test failure be addressed? It seems that the min/max UDAF uses other aliases and is not reusing the intermediate results already available.

jayzhan211 · 2024-06-24T06:56:15Z

If we eliminate the distinct of min/max prior to single_distinct_to_groupby, we don't expect to see a distinct min/max at this point, so we should rewrite the test to use another function, such as sum.

edmondop · 2024-06-24T11:39:08Z

---- single_distinct_to_groupby::tests::two_distinct_and_one_common

Wouldn't eliminating it require the optimizer rule? Or do you suggest I update the test case? Or the expected value?

jayzhan211 · 2024-06-24T11:56:53Z

Yes, I suggest we update the test like this:

    #[test]
    fn one_distinct_and_two_common() -> Result<()> {
        let table_scan = test_table_scan()?;

        let plan = LogicalPlanBuilder::from(table_scan)
            .aggregate(
                vec![col("a")],
                vec![sum(col("c")), count_distinct(col("b")), max(col("b"))],
            )?
            .build()?;
        // Should work
        let expected = "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n  Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n    Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]";

        assert_optimized_plan_equal(plan, expected)
    }

edmondop · 2024-06-24T12:56:30Z

There seems to be a column added to the Aggregate node in the logical plan. Can that affect performance and/or memory footprint? This was the reason why I didn't update the test in the first place.

This is a subset of the new plan

aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"

while this is the subset from the previous plan

Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2]] [a:UInt32, alias1:UInt32, alias2:UInt64;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"

There is an alias3:UInt32 that gets added.

jayzhan211 · 2024-06-24T14:14:41Z

If you remove the Min/Max matching in is_single_distinct_agg, the extra alias goes away.

datafusion/optimizer/src/eliminate_distinct.rs

alamb

Thank you so much @edmondop -- I took a look at this PR and I think in general it is quite close.

It needs to:

- remove the old min/max implementation in https://github.com/apache/datafusion/blob/5bb6b356277ea1c6f1d7af64e2d66f005d7e1ed4/datafusion/physical-expr/src/aggregate/min_max.rs
- resolve some merge conflicts

There is also a follow on issue / PR I would like to make regarding the optimizer check

Given this PR has hung out for a while and has some merge conflicts now I am going to try and help polish it up

datafusion/expr/src/test/function_stub.rs

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

edmondop · 2024-06-27T20:52:08Z

I think as long as you can explain to me how to resolve the current test failure, I should be fine. I agree that using names for min and max unwrapping is not very robust.

Working out the diffs.

datafusion/optimizer/src/single_distinct_to_groupby.rs

edmondop · 2024-07-22T12:49:53Z

I think this is now blocked by #11595

alamb · 2024-07-22T17:44:03Z

Thanks @edmondop -- good 🕵️ work

jayzhan211 · 2024-07-23T00:43:14Z

datafusion/physical-expr/src/aggregate/groups_accumulator/mod.rs

-pub(crate) mod prim_op {
-    pub use datafusion_physical_expr_common::aggregate::groups_accumulator::prim_op::PrimitiveGroupsAccumulator;
-}
+pub(crate) mod prim_op {}


we can delete it

datafusion/proto/proto/datafusion.proto

datafusion/proto/tests/cases/roundtrip_physical_plan.rs

alamb · 2024-07-24T18:38:53Z

As I understand it this PR is still a work in progress, so marking it as a draft (I am trying to make sure it is clear what PRs are waiting for review and what are not)

edmondop · 2024-07-26T21:37:53Z

@jayzhan211 the change is dropping the limit in the physical plan node, and I wasn't able to find the source of it. Do you have any hints?

jayzhan211 · 2024-07-26T23:56:24Z

@edmondop
You need to add a get_minmax_desc method to AggregateFunctionExpr

datafusion/datafusion/core/src/physical_optimizer/topk_aggregation.rs

Lines 48 to 85 in 01dc3f9

    
    fn transform_agg(
        aggr: &AggregateExec,
        order: &PhysicalSortExpr,
        limit: usize,
    ) -> Option<Arc<dyn ExecutionPlan>> {
        // ensure the sort direction matches aggregate function
        let (field, desc) = aggr.get_minmax_desc()?;
        if desc != order.options.descending {
            return None;
        }
        let group_key = aggr.group_expr().expr().iter().exactly_one().ok()?;
        let kt = group_key.0.data_type(&aggr.input().schema()).ok()?;
        if !kt.is_primitive() && kt != DataType::Utf8 {
            return None;
        }
        if aggr.filter_expr().iter().any(|e| e.is_some()) {
            return None;
        }
        // ensure the sort is on the same field as the aggregate output
        let col = order.expr.as_any().downcast_ref::<Column>()?;
        if col.name() != field.name() {
            return None;
        }
        // We found what we want: clone, copy the limit down, and return modified node
        let new_aggr = AggregateExec::try_new(
            *aggr.mode(),
            aggr.group_expr().clone(),
            aggr.aggr_expr().to_vec(),
            aggr.filter_expr().to_vec(),
            aggr.input().clone(),
            aggr.input_schema(),
        )
        .expect("Unable to copy Aggregate!")
        .with_limit(Some(limit));
        Some(Arc::new(new_aggr))
    }

edmondop · 2024-07-27T21:02:49Z

@jayzhan211 I am a little confused about the test case here

datafusion/datafusion/sqllogictest/test_files/aggregate.slt

Line 1779 in a721be1

    
           Interval(MonthDayNano) 0 years -2 mons 0 days 0 hours 0 mins 0.000000000 secs 0 years 2 mons 15 days 0 hours 0 mins 0.000000000 secs

I have noticed that the signature for min/max as an AggregateFunction does not include TimeInterval

datafusion/datafusion/expr/src/aggregate_function.rs

Line 119 in a721be1

match self {

If I don't add it, the SQL test fails because arrow_type(min(c1)) returns Utf8 for me, but if I add it, the query fails with this error:

External error: query failed: DataFusion error: Internal error: Min/Max accumulator not implemented for type Interval(MonthDayNano).

github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core datafusion crate labels Jun 19, 2024

edmondop force-pushed the issue-10943 branch from d6cf206 to 552f52b Compare June 19, 2024 18:54

jayzhan211 reviewed Jun 19, 2024

View reviewed changes

datafusion/physical-expr/src/aggregate/build_in.rs Show resolved Hide resolved

github-actions bot removed the sql SQL Planner label Jun 20, 2024

edmondop requested a review from jayzhan211 June 23, 2024 20:29

edmondop marked this pull request as ready for review June 23, 2024 20:29

jayzhan211 reviewed Jun 24, 2024

View reviewed changes

datafusion/optimizer/src/eliminate_distinct.rs Outdated Show resolved Hide resolved

alamb previously approved these changes Jun 27, 2024

View reviewed changes

datafusion/expr/src/test/function_stub.rs Outdated Show resolved Hide resolved

datafusion/expr/src/test/function_stub.rs Outdated Show resolved Hide resolved

datafusion/core/src/physical_optimizer/aggregate_statistics.rs Outdated Show resolved Hide resolved

alamb changed the title ~~Moving min and max to new API and removing from protobuf~~ Moving min and max to user defined aggregate function Jun 27, 2024

alamb mentioned this pull request Jun 27, 2024

Alamb/merge resolve edmondop/arrow-datafusion#1

Closed

jayzhan211 reviewed Jul 22, 2024

datafusion/optimizer/src/single_distinct_to_groupby.rs (resolved)

jayzhan211 reviewed Jul 22, 2024

datafusion/optimizer/src/single_distinct_to_groupby.rs (resolved)

edmondop force-pushed the issue-10943 branch 3 times, most recently from 49f7a7f to 4440a8d Compare July 22, 2024 12:12

jayzhan211 reviewed Jul 23, 2024

datafusion/proto/proto/datafusion.proto (outdated, resolved)

jayzhan211 reviewed Jul 24, 2024

datafusion/proto/tests/cases/roundtrip_physical_plan.rs (outdated, resolved)

alamb marked this pull request as draft July 24, 2024 18:39

edmondop force-pushed the issue-10943 branch from 4440a8d to 079b993 Compare July 24, 2024 22:00

github-actions bot added the sql (SQL Planner) and substrait labels Jul 24, 2024

edmondop force-pushed the issue-10943 branch 3 times, most recently from 356a8e1 to ab26352 Compare July 25, 2024 12:32

edmondop force-pushed the issue-10943 branch from ab26352 to b60fc73 Compare July 27, 2024 15:46

edmondop added 4 commits July 27, 2024 20:59

Moving min and max to new API and removing from protobuf

2cbc0d4

Rebasing on main once more

28c1c34

Using input_type rather than data_type

87e945f

Adding type coercion

07910bb

edmondop force-pushed the issue-10943 branch from 09495d8 to 07910bb Compare July 27, 2024 20:59

github-actions bot added the sqllogictest label Jul 27, 2024

Restored test

0392c3d

edmondop requested a review from jayzhan211 July 27, 2024 21:04

