Describe the bug
Version
- Python:
3.12.3
- Python package:
datafusion==53.0.0
- Arrow input via
pyarrow==24.0.0
- Platform used for reproduction: Ubuntu 24.04 x86_64, Linux
6.17.0-23-generic
Minimal Reproducer
import pyarrow as pa
from datafusion import SessionContext
ctx = SessionContext()
batch = pa.RecordBatch.from_arrays(
[
pa.array(["a"], type=pa.string()),
pa.array([None], type=pa.int64()),
],
schema=pa.schema(
[
pa.field("g", pa.string(), nullable=True),
pa.field("x", pa.int64(), nullable=True),
]
),
)
ctx.register_record_batches("t0", [[batch]])
base = "SELECT g, MIN(x) AS min_x FROM t0 GROUP BY g"
print(ctx.sql(f"SELECT min_x FROM ({base}) q LIMIT 20").to_pandas())
print(ctx.sql(f"SELECT min_x FROM ({base}) q ORDER BY min_x ASC NULLS LAST LIMIT 20").to_pandas())
Actual Result
Standalone reproducer output:
datafusion=53.0.0
pyarrow=24.0.0
control:
min_x
0 NaN
top-k:
Empty DataFrame
Columns: [min_x]
Index: []
top-k record-batch rows=0
AssertionError: DataFusion dropped the group whose aggregate sort key is NULL
The control query returns the grouped row with a NULL aggregate result:
The query with ORDER BY min_x ASC NULLS LAST LIMIT 20 returns no rows:
Empty DataFrame
Columns: [min_x]
Index: []
With two groups, one having MIN(x) = NULL and one having MIN(x) = 5, the ORDER BY ... LIMIT 20 query returns only the non-NULL sort-key group. The limit is larger than the number of groups, so no grouped row should be removed.
Expected Result
Both single-group queries should return one row. GROUP BY g forms a group for g = 'a'; MIN(x) is NULL because all values in that group are NULL. Ordering with NULLS LAST and a limit larger than the result cardinality should preserve the row.
Additional Context
I found this with a cross-backend differential test. In the generated case, pandas, Polars eager/lazy, DuckDB, SQLite, and an independent DSL reference all preserved the grouped row whose aggregate sort key is NULL. DataFusion was the only backend that returned an empty result.
I also ran targeted follow-up checks for the same pattern. The issue consistently reproduced for NULL MIN(x) ordered ascending and NULL MAX(x) ordered descending, while nearby controls such as plain NULL column top-k, NULL group key with non-NULL aggregate, and NULL SUM(x) / AVG(x) aggregate top-k preserved the row.
I can share the standalone reproducer, boundary-check script, and full generated artifacts if useful.
To Reproduce
No response
Expected behavior
No response
Additional context
No response
Describe the bug
Version
3.12.3datafusion==53.0.0pyarrow==24.0.06.17.0-23-genericMinimal Reproducer
Actual Result
Standalone reproducer output:
The control query returns the grouped row with a NULL aggregate result:
The query with
ORDER BY min_x ASC NULLS LAST LIMIT 20returns no rows:With two groups, one having
MIN(x) = NULLand one havingMIN(x) = 5, theORDER BY ... LIMIT 20query returns only the non-NULL sort-key group. The limit is larger than the number of groups, so no grouped row should be removed.Expected Result
Both single-group queries should return one row.
GROUP BY gforms a group forg = 'a';MIN(x)is NULL because all values in that group are NULL. Ordering withNULLS LASTand a limit larger than the result cardinality should preserve the row.Additional Context
I found this with a cross-backend differential test. In the generated case, pandas, Polars eager/lazy, DuckDB, SQLite, and an independent DSL reference all preserved the grouped row whose aggregate sort key is NULL. DataFusion was the only backend that returned an empty result.
I also ran targeted follow-up checks for the same pattern. The issue consistently reproduced for NULL
MIN(x)ordered ascending and NULLMAX(x)ordered descending, while nearby controls such as plain NULL column top-k, NULL group key with non-NULL aggregate, and NULLSUM(x)/AVG(x)aggregate top-k preserved the row.I can share the standalone reproducer, boundary-check script, and full generated artifacts if useful.
To Reproduce
No response
Expected behavior
No response
Additional context
No response