Skip to content

Fix DataFusion drops grouped MIN/MAX rows with NULL sort keys under ORDER BY + LIMIT#22364

Open
xiedeyantu wants to merge 1 commit into
apache:mainfrom
xiedeyantu:topk
Open

Fix DataFusion drops grouped MIN/MAX rows with NULL sort keys under ORDER BY + LIMIT#22364
xiedeyantu wants to merge 1 commit into
apache:mainfrom
xiedeyantu:topk

Conversation

@xiedeyantu
Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

TopKAggregation could rewrite grouped MIN/MAX queries with ORDER BY ... LIMIT
into the grouped TopK execution path even when the aggregate sort key was NULL.

The grouped TopK stream previously assumed aggregate sort values reaching the heap
were non-null and skipped rows whose aggregate value was NULL. This was incorrect
for queries where a group is valid but its final MIN/MAX result is NULL
(for example, when all values in the group are NULL). In such cases, the row
should still be preserved and ordered according to NULLS FIRST / NULLS LAST
semantics rather than being dropped.

This change fixes that correctness issue in the execution path instead of disabling
the optimization for nullable aggregates.

What changes are included in this PR?

  • update primitive TopK heap handling to retain nullable aggregate sort values
  • emit primitive heap results with the correct Arrow null buffer
  • remove the row-level NULL skip in GroupedTopKAggregateStream
  • add a regression test covering grouped MIN/MAX queries with NULL aggregate
    results under ORDER BY ... LIMIT

Are these changes tested?

Yes.

Added a regression test in datafusion/core/tests/physical_optimizer/aggregate_statistics.rs
that verifies grouped MIN/MAX rows with NULL aggregate results are preserved when
the TopK aggregation optimization is enabled.

Focused validation run:

  • cargo test -p datafusion-physical-plan aggregates::topk::heap --lib
  • cargo test -p datafusion --test core_integration null_min_max_topk_preserves_group_rows

Are there any user-facing changes?

Yes.

This fixes an incorrect query result for grouped MIN/MAX queries using
ORDER BY ... LIMIT when the aggregate sort key is NULL.

@github-actions github-actions Bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DataFusion drops grouped MIN/MAX rows with NULL sort keys under ORDER BY + LIMIT

1 participant