Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simplify method to aggregate function #10354

Merged
merged 2 commits into from
May 13, 2024

Conversation

milenkovicm
Copy link
Contributor

Which issue does this PR close?

Closes #9526.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

yes, added new test and example

Are there any user-facing changes?

no, changes are backward compatible

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels May 2, 2024
@milenkovicm
Copy link
Contributor Author

the change is quite similar to #9906 there are quite a lot optional/unused parameters in the method call, but i personally find it easier to understand as it is equivalent to AggregateFunction signature.

I wonder if ExprSimplifyResult::<T>::Original(T) where T can be something else than args (tuples included) would be a backward compatible change. This way we could send and return all required parameters without copy ... but this is discussion for different time

@milenkovicm milenkovicm force-pushed the simplify_aggregations_9526 branch 2 times, most recently from 42000ac to f96a6d8 Compare May 2, 2024 19:37
@milenkovicm
Copy link
Contributor Author

milenkovicm commented May 3, 2024

jsut thinking aloud, maybe if we change:

pub enum ExprSimplifyResult {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(Vec<Expr>),
}

to:

pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

simplify method would change from:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: &bool,
        filter: &Option<Box<Expr>>,
        order_by: &Option<Vec<Expr>>,
        null_treatment: &Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult> {

to:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: bool,
        filter: Option<Box<Expr>>,
        order_by: Option<Vec<Expr>>,
        null_treatment: Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult<(Vec<Expr>, bool, Option<Box<Expr>> ...)>> {
}

so in both cases, when we create replacement function or we re-assemble original function we would not need to clone parameters.

maybe hand full of changes, from -> Result<ExprSimplifyResult> to -> Result<ExprSimplifyResult<Vec<Expr>>>

Downside is that we would have to re-create expression if no simplification, plus change would be backward incompatible

/// See [`AggregateUDFImpl::simplify`] for more details.
pub fn simplify(
&self,
args: Vec<Expr>,
Copy link
Contributor

@jayzhan211 jayzhan211 May 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should reuse args in AggregateFunction, maybe create a new struct with AggregateArgs?

/// Aggregate function
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct AggregateFunction {
/// Name of the function
pub func_def: AggregateFunctionDefinition,
/// List of expressions to feed to the functions as arguments
pub args: Vec<Expr>,
/// Whether this is a DISTINCT aggregation or not
pub distinct: bool,
/// Optional filter
pub filter: Option<Box<Expr>>,
/// Optional ordering
pub order_by: Option<Vec<Expr>>,
pub null_treatment: Option<NullTreatment>,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, mixed feelings about it. IMHO I see no benefits, would not simplify anything and another structure would be introduced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One benefit is that we could potentially add new fields without causing an API change 🤔

Another benefit could be that we could add a default method that recreated the expression. For example

impl AggegateArgs { 
  // recreate the original aggregate fucntions
  fn into_expr(self) -> Expr { ... }
...
}

Another benefit is, as @jayzhan211 says, that it would be more consistent with the other APIs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we change signature of pub struct AggregateFunction moving args, distinct, filter ... to AggegateArgs or we keep AggegateArgs as auxiliary structure?

@jayzhan211
Copy link
Contributor

pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

I think it is a good idea.

fn simplify(
&self,
args: Vec<Expr>,
distinct: &bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also pass others args with owned value for rewrite like args: Vec<Expr>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, if we make ExprSimplifyResult generic, as I pointed out in the comment. If we change it without change of ExprSimplifyResult it would lead to cloning of arguments even if simplify does nothing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the work we have been doing on #9637 let's not add an API that forces cloning

Another potential thing is to use Transformed directlt

@alamb
Copy link
Contributor

alamb commented May 6, 2024

jsut thinking aloud, maybe if we change:

I think if we are going to change the signature, I think we should use Transformed directly (and it is already parameterized)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @milenkovicm and @jayzhan211 -- I agree this API is a little akward. It would be great to figure out how to avoid cloning when unecessary

/// See [`AggregateUDFImpl::simplify`] for more details.
pub fn simplify(
&self,
args: Vec<Expr>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One benefit is that we could potentially add new fields without causing an API change 🤔

Another benefit could be that we could add a default method that recreated the expression. For example

impl AggegateArgs { 
  // recreate the original aggregate fucntions
  fn into_expr(self) -> Expr { ... }
...
}

Another benefit is, as @jayzhan211 says, that it would be more consistent with the other APIs

@@ -1307,6 +1309,39 @@ impl<'a, S: SimplifyInfo> TreeNodeRewriter for Simplifier<'a, S> {
ExprSimplifyResult::Simplified(expr) => Transformed::yes(expr),
},

Expr::AggregateFunction(AggregateFunction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like in order to be able to make this API avoid copying, it would need to pass the distinct, filter, etc fields in and then get them back

Another thing we could do is to use Transformed directly

What if we made something like

struct AggregateFunctionsArgs {
  args: Vec<Expr>,
  distinct: bool,
  filter: Option<Expr>,
  order_by: ...
}

And then the call to simplify to be

trait AggreagateUDFImpl { 
...
    fn simplify(
        &self,
        agg_args: AggregateFunctionsArgs,
    ) -> Result<Transformed<AggregateFunctionsArgs> {
        Ok(Transformed::no(agg_args))
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with AggregateFunctionsArgs, consistency part got me on board 😀

I'm not sure about Transformed part, reason being is that ExprSimplifyResult returns two types, Expr and currently Vec<Expr>

in that case simplify signature should be changed to:

trait AggreagateUDFImpl { 
...
    fn simplify(
        &self,
        agg_args: AggregateFunctionsArgs,
    ) -> Result<Transformed<Expr> {
        let original : Expr = agg_args.into();
        Ok(Transformed::no(original))
    }

Thus simplify function should return 're-composed' original function or new function ...

Anyway, I'll have a look what can be done, we can discuss

fn simplify(
&self,
args: Vec<Expr>,
distinct: &bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the work we have been doing on #9637 let's not add an API that forces cloning

Another potential thing is to use Transformed directlt

@milenkovicm
Copy link
Contributor Author

milenkovicm commented May 7, 2024

Another option could be to make a simplify return optional closure

 fn simplify(&self ) -> Option<Box<dyn Fn(AggregateFunction) -> datafusion_common::tree_node::Transformed<Expr>>> {
        None
 }

in this case expr_simplifier part would become:

Expr::AggregateFunction(AggregateFunction {func_def: AggregateFunctionDefinition::UDF(ref udaf), ..}) => {
  match (udaf.simplify(), expr) {
    (Some(simplify), Expr::AggregateFunction(a)) => simplify(a),
    (_, e) => Transformed::no(e),
  }
}

this solution:

  • avoids cloning,
  • avoids decomposition/composition of function if not needed
  • avoids adding new struct
  • but, it has dynamic dispatch simplify function call (in case simplify is defined)
  • and a bit odd interface

other issues:

  • SimplifyInfo is missing from signature, can be added
  • Transformed::no does not make sense in this context as original expression has not been propagated (should we use actual expression instead of AggregateFunction or remove Transformed from return type
  • closure should return result with or without Transformed

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 8, 2024

Another option is to introduce has_simplify so we can simplify the code to

            Expr::AggregateFunction(AggregateFunction {
                func_def: AggregateFunctionDefinition::UDF(udaf),
                args: AggregateFunctionArgs,
            }) if udaf.has_simplify() => {
                Transformed::yes(udaf.simplify()?)
            }

            fn simplify() -> Result<Expr>

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

@milenkovicm
Copy link
Contributor Author

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

I personally try to avoid situations like this when creating API, this should be documented that both functions have to be implemented plus it introduces more corner cases.

Anyway, closure was a brain dump from yesterday, after a good sleep I don't think we need it, will try to elaborate in next comment

@milenkovicm
Copy link
Contributor Author

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised ExprSimplifyResult

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2.
wdyt @jayzhan211, @alamb

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 8, 2024

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised ExprSimplifyResult

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb

Given these 3, I prefer 2, because I think re-create expression is not a high cost that we should avoid. And we can do even further,

If user does not implement simplify, it is equivalent to Transformed::no or ExprSimplifyResult::Original, so we can just have Result<Expr>

fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<Expr> {
        Ok(Expr::AggregateFunction(agg))
    }

@milenkovicm
Copy link
Contributor Author

milenkovicm commented May 8, 2024

I apologise for making noise, but it looks like all of the non closure options have issue with borrowing:

error[E0505]: cannot move out of `expr.0` because it is borrowed
    --> datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:1391:48
     |
1388 |                 func_def: AggregateFunctionDefinition::UDF(ref udaf),
     |                                                            -------- borrow of `expr.0.func_def.0` occurs here
...
1391 |                 if let Expr::AggregateFunction(aggregate_function) = expr {
     |                                                ^^^^^^^^^^^^^^^^^^ move out of `expr.0` occurs here
1392 |                     udaf.simplify(aggregate_function, info)?
     |                     ---- borrow later used here

It looks like AggregateArg might be back on the table, which would involve cloning udf

@milenkovicm milenkovicm marked this pull request as draft May 8, 2024 12:52
@milenkovicm milenkovicm force-pushed the simplify_aggregations_9526 branch 3 times, most recently from d0e6842 to de51434 Compare May 9, 2024 09:52
@milenkovicm milenkovicm marked this pull request as ready for review May 9, 2024 09:54
@milenkovicm
Copy link
Contributor Author

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 9, 2024

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

After playing around, I feel like maybe optional closure is our answer 🤔

fn simplify(
        &self,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Expr>> {}

@milenkovicm
Copy link
Contributor Author

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 9, 2024

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

My final answer for today is

// UDAF
fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {
// UDF, not necessary but for consistency
fn simplify(
        &self,
        args: Vec<Expr>,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(Vec<Expr>) -> Result<Expr>>> {

Go with optional closure!

@milenkovicm
Copy link
Contributor Author

My final answer for today is

fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {

Go with optional closure!

you mean

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

or

// UDF
fn simplify(
        &self,
        info: &dyn SimplifyInfo
    ) -> Option<Box<dyn Fn(AggregateFunction) -> Result<Expr>>> {

@jayzhan211
Copy link
Contributor

It should be this.

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

The reason for optional closure is that I assume we need to return Expr for simplified cases. If we expect AggregateArgs, then we might not need closure.

@milenkovicm
Copy link
Contributor Author

We're on the same page @jayzhan211

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 9, 2024

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

@milenkovicm
Copy link
Contributor Author

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

I think we need Expr, this way we can replace function with something else

@jayzhan211
Copy link
Contributor

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.
@milenkovicm

I think we need Expr, this way we can replace function with something else

I agree.

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jayzhan211
Copy link
Contributor

You can define closure in datafusion/expr/src/function.rs

@milenkovicm milenkovicm force-pushed the simplify_aggregations_9526 branch 2 times, most recently from 47108f1 to a678e6d Compare May 10, 2024 14:28
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @milenkovicm and @jayzhan211 -- I think it looks good and the API is clean. Let's give it a try 🚀

Looks like there is a small conflict to resolve and then we'll be good to go

@alamb alamb changed the title Add simplify method to aggregate function Add `simplify1 method to aggregate function May 13, 2024
@alamb alamb changed the title Add `simplify1 method to aggregate function Add simplify method to aggregate function May 13, 2024
@alamb alamb merged commit 230c68c into apache:main May 13, 2024
24 of 25 checks passed
@alamb
Copy link
Contributor

alamb commented May 13, 2024

Thanks again @jayzhan211 and @milenkovicm 🙏

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* add simplify method for aggregate function

* simplify returns closure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a AggregateUDFImpl::simplfy() API
3 participants