Skip to content

Review need for explicit distinct aggregate function implementations #10159

@Jefffrey

Description

@Jefffrey

Is your feature request related to a problem or challenge?

When raising #10158 to close some old tickets, noticed in code base places where it states distinct aggregations are not supported when they are:

(AggregateFunction::Median, true) => {
return not_impl_err!("MEDIAN(DISTINCT) aggregations are not available");
}

I guess this is just saying there isn't an explicit function implementation for this, since the plan will apply distinct first then the aggregation, like so:

> explain select median(distinct "1") from '/home/jeffrey/Downloads/data.csv';
+---------------+------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                   |
+---------------+------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: MEDIAN(alias1) AS MEDIAN(DISTINCT /home/jeffrey/Downloads/data.csv.1)                                      |
|               |   Aggregate: groupBy=[[]], aggr=[[MEDIAN(alias1)]]                                                                     |
|               |     Aggregate: groupBy=[[/home/jeffrey/Downloads/data.csv.1 AS alias1]], aggr=[[]]                                     |
|               |       TableScan: /home/jeffrey/Downloads/data.csv projection=[1]                                                       |
| physical_plan | ProjectionExec: expr=[MEDIAN(alias1)@0 as MEDIAN(DISTINCT /home/jeffrey/Downloads/data.csv.1)]                         |
|               |   AggregateExec: mode=Final, gby=[], aggr=[MEDIAN(alias1)]                                                             |
|               |     CoalescePartitionsExec                                                                                             |
|               |       AggregateExec: mode=Partial, gby=[], aggr=[MEDIAN(alias1)]                                                       |
|               |         AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]                                        |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                  |
|               |             RepartitionExec: partitioning=Hash([alias1@0], 12), input_partitions=12                                    |
|               |               AggregateExec: mode=Partial, gby=[1@0 as alias1], aggr=[]                                                |
|               |                 RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                  |
|               |                   CsvExec: file_groups={1 group: [[home/jeffrey/Downloads/data.csv]]}, projection=[1], has_header=true |
|               |                                                                                                                        |
+---------------+------------------------------------------------------------------------------------------------------------------------+

Describe the solution you'd like

Review if there's a need to do an explicit implementation of a distinct aggregation function (e.g. distinct_median) instead of relying on separate distinct -> median steps in the plan. Is it possible to implement a more efficient distinct median by doing it this way?

Describe alternatives you've considered

If decide not to implement an explicit function for distinct aggregates, update the above code to indicate this isn't a NotImplemented error but should instead be a plan or internal error, for clarity, and indicate in the error message that planning should have split it up.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions