Fix TopK DISTINCT aggregation preserving NULLs by kumarUjjawal · Pull Request #22571 · apache/datafusion

kumarUjjawal · 2026-05-27T15:25:20Z

Which issue does this PR close?

Closes #22554 (DISTINCT ORDER BY NULLS FIRST LIMIT returns the wrong top row).

Rationale for this change

TopK aggregation dropped NULL group keys for ordered DISTINCT queries.

For example, SELECT DISTINCT v FROM t ORDER BY v ASC NULLS FIRST LIMIT 1 could return an empty string instead of NULL when TopK aggregation was enabled.

What changes are included in this PR?

This PR preserves NULL group keys for DISTINCT TopK aggregation by tracking whether a NULL group key was seen separately from the heap.

The heap still only stores non-NULL values. This avoids making the TopK heap implementations handle NULL values directly.

The stream also now marks itself done after emitting, so NULL-only DISTINCT results are emitted once and do not repeat.
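The approach described above can be sketched in isolation. This is a minimal illustration only, not the code in this PR: `DistinctTopK`, `insert_batch`, and `emit` are hypothetical names, keys are plain `i64`, ordering is fixed to ASC NULLS FIRST, and deduplication uses a linear scan for brevity. Only the `null_group_seen` flag name comes from the diff.

```rust
use std::collections::BinaryHeap;

/// Hypothetical stand-in for the DISTINCT TopK state (not DataFusion's type).
struct DistinctTopK {
    limit: usize,
    /// Max-heap holding the `limit` smallest distinct non-NULL keys seen so far.
    heap: BinaryHeap<i64>,
    /// Set once any NULL group key is observed; NULLs never enter the heap.
    null_group_seen: bool,
    /// Set after the first emit so the result is produced exactly once.
    done: bool,
}

impl DistinctTopK {
    fn new(limit: usize) -> Self {
        Self { limit, heap: BinaryHeap::new(), null_group_seen: false, done: false }
    }

    fn insert_batch(&mut self, batch: &[Option<i64>]) {
        for v in batch {
            match v {
                // DISTINCT has a single NULL group, so a boolean is enough.
                None => self.null_group_seen = true,
                Some(x) => {
                    // Skip keys already held (linear scan, fine for a sketch).
                    if self.heap.iter().any(|h| h == x) {
                        continue;
                    }
                    if self.heap.len() < self.limit {
                        self.heap.push(*x);
                    } else if self.heap.peek().map_or(false, |max| x < max) {
                        // Replace the current largest key with the smaller one.
                        self.heap.pop();
                        self.heap.push(*x);
                    }
                }
            }
        }
    }

    /// Emits rows for `ORDER BY v ASC NULLS FIRST LIMIT limit`, at most once.
    fn emit(&mut self) -> Option<Vec<Option<i64>>> {
        if self.done {
            return None;
        }
        self.done = true;
        let mut out = Vec::new();
        if self.null_group_seen {
            out.push(None);
        }
        // `into_sorted_vec` yields the non-NULL keys in ascending order.
        let vals = std::mem::take(&mut self.heap).into_sorted_vec();
        out.extend(vals.into_iter().map(Some));
        out.truncate(self.limit);
        Some(out)
    }
}
```

With `limit = 1` and input containing both NULL and non-NULL keys, this sketch emits the NULL row first, and a second call to `emit` returns `None`, which mirrors the NULL-only case mentioned above.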

Are these changes tested?

Yes

Are there any user-facing changes?

No API Change

alamb

Looks good to me overall @kumarUjjawal -- I had a few questions

I see you say

The heap still only stores non-NULL values. This avoids making the TopK heap implementations handle NULL values directly.

I don't have any sense of how complicated this would be / not be -- did you try it?

Also, @avantgardnerio, perhaps you can help review this PR, as I think you are familiar with the original logic.

alamb · 2026-05-27T18:12:36Z


 impl GroupedTopKAggregateStream {
+    fn is_distinct(&self) -> bool {
+        self.aggregate_arguments.is_empty()


I think you can also get no aggregates for a normal SELECT x,y,z GROUP BY x,y,z type query, though maybe that has the same semantics.

alamb · 2026-05-27T18:13:09Z

        for row_idx in 0..len {
            if has_nulls && vals.is_null(row_idx) {
+                if self.is_distinct() {
+                    self.null_group_seen = true;


This is in the (hot) inner loop, and I worry about the performance implications of this change.

Is it enough to check outside the loop? Something like:

    let has_nulls = vals.null_count() > 0;
    if has_nulls && self.is_distinct() {
        self.null_group_seen = true;
    }
    for row_idx in 0..len {
        ...
    }
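A self-contained sketch of the hoisted check follows, under simplifying assumptions: `State` and `intern` are hypothetical names, the batch is a slice of `Option<i64>` rather than an Arrow array, and the null test scans the slice, whereas `null_count()` in the suggestion above is called on the array directly.

```rust
/// Hypothetical, simplified state; only `null_group_seen` and
/// `is_distinct` mirror names from the diff.
struct State {
    distinct: bool,
    null_group_seen: bool,
    non_null_rows: Vec<i64>,
}

impl State {
    fn is_distinct(&self) -> bool {
        self.distinct
    }

    fn intern(&mut self, vals: &[Option<i64>]) {
        // Batch-level check: decide once whether the NULL group was seen,
        // instead of branching on `is_distinct()` for every NULL row.
        let has_nulls = vals.iter().any(Option::is_none);
        if has_nulls && self.is_distinct() {
            self.null_group_seen = true;
        }

        for v in vals {
            // NULL rows are still skipped; the flag was recorded above.
            let Some(x) = v else { continue };
            self.non_null_rows.push(*x);
        }
    }
}
```

The flag ends up in the same state as with the per-row assignment, because any NULL in the batch sets it and nothing in the loop clears it.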

alamb · 2026-05-27T18:19:11Z

                        // For DISTINCT case (no aggregate expressions), only use the group key column
                        // since the schema only has one field and key/value are the same
-                        if self.aggregate_arguments.is_empty() {
+                        if self.is_distinct() {


Can we try to encapsulate some of this logic in a helper function, to keep this code easy to read?

It would also help to add some comments here explaining why concat is needed.
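One shape such a helper might take is sketched below. It assumes the concatenation joins the single NULL group row with the rows drained from the heap; the diff excerpt does not show the concat call, so that is an assumption. `with_null_group` and its parameters are hypothetical, and keys are plain `i64` instead of Arrow arrays.

```rust
/// Hypothetical helper: combine the single NULL group (if one was seen)
/// with the already-sorted non-NULL keys, then apply the limit.
fn with_null_group(
    sorted: Vec<i64>,
    null_seen: bool,
    nulls_first: bool,
    limit: usize,
) -> Vec<Option<i64>> {
    let non_null = sorted.into_iter().map(Some);
    // At most one NULL row: DISTINCT collapses all NULL keys into one group.
    let null_row: Option<Option<i64>> = null_seen.then_some(None);

    // The heap never holds NULLs, so the NULL row has to be joined to the
    // heap output here, on the side the sort order asks for.
    let mut out: Vec<Option<i64>> = if nulls_first {
        null_row.into_iter().chain(non_null).collect()
    } else {
        non_null.chain(null_row).collect()
    };
    out.truncate(limit);
    out
}
```

Under NULLS LAST the NULL row survives the limit only when fewer than `limit` non-NULL keys exist, which is why the truncation happens after the join rather than before it.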

kumarUjjawal · 2026-05-28T04:16:18Z

I don't have any sense of how complicated this would be / not be -- did you try it?

I did try that as my first approach, but I found it too broad. The only question I needed to answer was whether we saw a NULL group key, and since SQL DISTINCT has only one NULL group, a boolean seemed like a good choice. Let me know what you think; I can go back to the other approach.

Fix TopK DISTINCT aggregation preserving NULLs

0d228d9

github-actions bot added the core, sqllogictest, and physical-plan labels on May 27, 2026

added explain test

2a700cd

alamb reviewed May 27, 2026

refactor

4862963

kumarUjjawal wants to merge 3 commits into apache:main from kumarUjjawal:fix/topk-distinct-aggregation
