Skip to content

DataFusion drops grouped MIN/MAX rows with NULL sort keys under ORDER BY + LIMIT #22190

@Fly-a-Kite

Description

@Fly-a-Kite

Describe the bug

Version

  • Python: 3.12.3
  • Python package: datafusion==53.0.0
  • Arrow input via pyarrow==24.0.0
  • Platform used for reproduction: Ubuntu 24.04 x86_64, Linux 6.17.0-23-generic

Minimal Reproducer

import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()
batch = pa.RecordBatch.from_arrays(
    [
        pa.array(["a"], type=pa.string()),
        pa.array([None], type=pa.int64()),
    ],
    schema=pa.schema(
        [
            pa.field("g", pa.string(), nullable=True),
            pa.field("x", pa.int64(), nullable=True),
        ]
    ),
)
ctx.register_record_batches("t0", [[batch]])

base = "SELECT g, MIN(x) AS min_x FROM t0 GROUP BY g"
print(ctx.sql(f"SELECT min_x FROM ({base}) q LIMIT 20").to_pandas())
print(ctx.sql(f"SELECT min_x FROM ({base}) q ORDER BY min_x ASC NULLS LAST LIMIT 20").to_pandas())

Actual Result

Standalone reproducer output:

datafusion=53.0.0
pyarrow=24.0.0
control:
   min_x
0    NaN
top-k:
Empty DataFrame
Columns: [min_x]
Index: []
top-k record-batch rows=0
AssertionError: DataFusion dropped the group whose aggregate sort key is NULL

The control query returns the grouped row with a NULL aggregate result:

   min_x
0    NaN

The query with ORDER BY min_x ASC NULLS LAST LIMIT 20 returns no rows:

Empty DataFrame
Columns: [min_x]
Index: []

With two groups, one having MIN(x) = NULL and one having MIN(x) = 5, the ORDER BY ... LIMIT 20 query returns only the non-NULL sort-key group. The limit is larger than the number of groups, so no grouped row should be removed.

Expected Result

Both single-group queries should return one row. GROUP BY g forms a group for g = 'a'; MIN(x) is NULL because all values in that group are NULL. Ordering with NULLS LAST and a limit larger than the result cardinality should preserve the row.

Additional Context

I found this with a cross-backend differential test. In the generated case, pandas, Polars eager/lazy, DuckDB, SQLite, and an independent DSL reference all preserved the grouped row whose aggregate sort key is NULL. DataFusion was the only backend that returned an empty result.

I also ran targeted follow-up checks for the same pattern. The issue consistently reproduced for NULL MIN(x) ordered ascending and NULL MAX(x) ordered descending, while nearby controls such as plain NULL column top-k, NULL group key with non-NULL aggregate, and NULL SUM(x) / AVG(x) aggregate top-k preserved the row.

I can share the standalone reproducer, boundary-check script, and full generated artifacts if useful.

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions