Implement GroupsAccumulator for first_value aggregate (speed up first_value and DISTINCT ON queries) #17899

@alamb

Is your feature request related to a problem or challenge?

As reported in #16620 by @debajyoti-truefoundry, evaluating DISTINCT ON results in a query plan that uses first_value aggregates.

The current implementation of first_value appears to have only a basic Accumulator implementation, and not the faster GroupsAccumulator: https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs#L93

We can very likely improve the performance of such queries significantly by implementing a GroupsAccumulator (background in Aggregating Millions of Groups Fast in Apache Arrow DataFusion).

Describe the solution you'd like

  1. Add a benchmark (maybe add a query to the clickbench_extended suite)
  2. Implement a GroupsAccumulator for first_value (and maybe nth_value)

Describe alternatives you've considered

I think the accumulator could be pretty straightforward: track which groups are new in each batch and copy the first row seen for each of them into the output (likely by using the take kernel).
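
To make the idea above concrete, here is a minimal sketch of the "remember the first row per group" bookkeeping. It is not the DataFusion `GroupsAccumulator` trait: `FirstValueGroups` and its methods are hypothetical names, and it uses plain slices of `Option<i64>` instead of Arrow arrays so that it compiles with the standard library alone. It also assumes rows arrive in the desired order (no ORDER BY handling), that NULLs are respected, and that no filter is applied.

```rust
/// Hypothetical, simplified stand-in for a grouped `first_value` accumulator.
/// All state lives in flat vectors indexed by group, rather than in one
/// boxed accumulator per group.
#[derive(Debug, Default)]
pub struct FirstValueGroups {
    /// firsts[g] is the first value seen for group g (it may itself be NULL).
    firsts: Vec<Option<i64>>,
    /// seen[g] is true once group g has recorded its first row.
    seen: Vec<bool>,
}

impl FirstValueGroups {
    pub fn new() -> Self {
        Self::default()
    }

    /// Process one batch. `group_indices[i]` is the group of row `i`;
    /// `total_num_groups` is the number of distinct groups seen so far
    /// (assumed to never shrink between calls).
    pub fn update_batch(
        &mut self,
        values: &[Option<i64>],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());

        // Grow the per-group state to cover any groups that are new in this batch.
        if total_num_groups > self.firsts.len() {
            self.firsts.resize(total_num_groups, None);
            self.seen.resize(total_num_groups, false);
        }

        for (row, &group) in group_indices.iter().enumerate() {
            // Only the first row of each group is kept; later rows are ignored.
            if !self.seen[group] {
                self.seen[group] = true;
                self.firsts[group] = values[row];
            }
        }
    }

    /// Return one output value per group, in group-index order.
    pub fn evaluate(self) -> Vec<Option<i64>> {
        self.firsts
    }
}
```

A real implementation would presumably collect the row index of each newly seen group and copy those rows out of the input array in one vectorized step, rather than copying values one at a time as this sketch does.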

Additional context

There is a similar issue for optimizing array_agg here:

Labels

enhancement (New feature or request)
