Skip to content

Fix TopK DISTINCT aggregation preserving NULLs#22571

Open
kumarUjjawal wants to merge 3 commits into
apache:mainfrom
kumarUjjawal:fix/topk-distinct-aggregation
Open

Fix TopK DISTINCT aggregation preserving NULLs#22571
kumarUjjawal wants to merge 3 commits into
apache:mainfrom
kumarUjjawal:fix/topk-distinct-aggregation

Conversation

@kumarUjjawal
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

TopK aggregation dropped NULL group keys for ordered DISTINCT queries.

For example, SELECT DISTINCT v FROM t ORDER BY v ASC NULLS FIRST LIMIT 1 could return an empty string instead of NULL when TopK aggregation was enabled.

What changes are included in this PR?

This PR preserves NULL group keys for DISTINCT TopK aggregation by tracking whether a NULL group key was seen separately from the heap.

The heap still only stores non-NULL values. This avoids making the TopK heap implementations handle NULL values directly.

The stream also now marks itself done after emitting, so NULL-only DISTINCT results are emitted once and do not repeat.

Are these changes tested?

Yes

Are there any user-facing changes?

No API Change

@github-actions github-actions Bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels May 27, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall @kumarUjjawal -- i had a few questions

I see you say

The heap still only stores non-NULL values. This avoids making the TopK heap implementations handle NULL values directly.

I don't have any sense of how complicated this would be / not be -- did you try it?

Also, perhaps @avantgardnerio you can help review this PR as I think you are familiar with the original logic


impl GroupedTopKAggregateStream {
fn is_distinct(&self) -> bool {
self.aggregate_arguments.is_empty()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get no aggregates for just a normal SELECT x,y,z GROUP BY x,y,z type query, Though maybe that has the same semantics

for row_idx in 0..len {
if has_nulls && vals.is_null(row_idx) {
if self.is_distinct() {
self.null_group_seen = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is in the (hot) inner loop and I worry about the performance implications of this change

Is it enough to check outside the loop? something like

        let has_nulls = vals.null_count() > 0;
        if has_nulls && self.is_distinct() {
            self.null_group_seen = true;
        }
        for row_idx in 0..len {
...
}

// For DISTINCT case (no aggregate expressions), only use the group key column
// since the schema only has one field and key/value are the same
if self.aggregate_arguments.is_empty() {
if self.is_distinct() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we try and encapsulate some of this logic in a helper function perhaps to try and keep this code easy to read?

It would also perhaps help to add some comments here about why concat is needed

@kumarUjjawal
Copy link
Copy Markdown
Contributor Author

Looks good to me overall @kumarUjjawal -- i had a few questions

I see you say

The heap still only stores non-NULL values. This avoids making the TopK heap implementations handle NULL values directly.

I don't have any sense of how complicated this would be / not be -- did you try it?

Also, perhaps @avantgardnerio you can help review this PR as I think you are familiar with the original logic

I don't have any sense of how complicated this would be / not be -- did you try it?

I did try that as my first approach but I find it broad and since the only question I needed to answer was did we see a NULL group key and since SQL DISTINCT has only one NULL group, a boolean seemed good choice. Let me know what you think, I can go back to the other approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DISTINCT ORDER BY NULLS FIRST LIMIT returns the wrong top row

2 participants