Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunction s), improve the system #8708

Open
alamb opened this issue Jan 1, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jan 1, 2024

Is your feature request related to a problem or challenge?

For many of the same reasons as listed on #8045, having two types of aggregate functions ("built in" -- datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF is problematic for two reasons:

  1. There are some features not available to User Defined Aggregate Functions (such as the faster GroupsAccumulator interface)
  2. Users can not easily choose which aggregate functions to include (for example, do they want to allow (and pay the code size / compile time) for the Statistical and Approximate functions

The second also ends up causing pushback on adding new aggregates like ARRAY_SUM in #8325 and geospatial support #7859.

Describe the solution you'd like

I propose moving DataFusion to only use AggregateUDFs and remove the built in list of AggregateFunctions for the same reasons as #8045

We will keep the existing AggregateUDF interface as much as possible, while also potentially providing an easier way to define them.

New AggregateUDF is in functions-aggregate crate
Old Aggregate functions are in datafusion/physical-expr/src/aggregate

Describe alternatives you've considered

Additional context

Proposed implementation steps:

Move rust test to sqllogictest if possible #10384

Good first issue list

Pending

  • nth_value
  • array_agg_distinct
  • array_agg_ordered
impl FromStr for AggregateFunction {
    type Err = DataFusionError;
    fn from_str(name: &str) -> Result<AggregateFunction> {
        Ok(match name {
            // general
            "avg" => AggregateFunction::Avg,
            "bit_and" => AggregateFunction::BitAnd,
            "bit_or" => AggregateFunction::BitOr,
            "bit_xor" => AggregateFunction::BitXor,
            "bool_and" => AggregateFunction::BoolAnd,
            "bool_or" => AggregateFunction::BoolOr,
            "max" => AggregateFunction::Max,
            "mean" => AggregateFunction::Avg,
            "min" => AggregateFunction::Min,
            "array_agg" => AggregateFunction::ArrayAgg,
            "nth_value" => AggregateFunction::NthValue,
            "string_agg" => AggregateFunction::StringAgg,
            // statistical
            "corr" => AggregateFunction::Correlation,
            // other
            "grouping" => AggregateFunction::Grouping,
            _ => {
                return plan_err!("There is no built-in function named {name}");
            }
        })
    }
}

Feel free to file an issue if you are interested in working on any of the above in the pending list.

@alamb alamb added the enhancement New feature or request label Jan 1, 2024
@alamb alamb changed the title [Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunction s) [Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunction s), improve the system Jan 8, 2024
@jayzhan211
Copy link
Contributor

I will work on FirstValue UDF together with #9249

@jayzhan211
Copy link
Contributor

Can we introduce state_fields and fields for AggregateUDFImpl. We can see that types in AggregateUDFImpl are for building fields, why not just return fields directly, we can not only define the types but also the field name and is_nullalbe.

https://github.com/apache/arrow-datafusion/blob/e337832946fc69f6ceccd14b96a071cc2cd4693d/datafusion/physical-plan/src/udaf.rs#L86-L106

@alamb
Copy link
Contributor Author

alamb commented Mar 27, 2024

Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week

@jayzhan211
Copy link
Contributor

Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week

I leave the comment so I won't forget.
Have a good trip 😊

@jayzhan211

This comment was marked as resolved.

eejbyfeldt added a commit to eejbyfeldt/datafusion that referenced this issue Jun 13, 2024
eejbyfeldt added a commit to eejbyfeldt/datafusion that referenced this issue Jun 13, 2024
jayzhan211 pushed a commit that referenced this issue Jun 13, 2024
* Move Regr_* functions to use UDAF

Closes #10883 and is part of #8708

* Format and regen

* tweak error check

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
findepi pushed a commit to findepi/datafusion that referenced this issue Jul 16, 2024
* Move Regr_* functions to use UDAF

Closes apache#10883 and is part of apache#8708

* Format and regen

* tweak error check

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants