Describe the bug
Summary
DataFusion 53.0.0 returns an empty result for a two-row grouped aggregate when an inner ordered LIMIT is followed by an outer ORDER BY ... OFFSET 1.
This may be related to #22190, but the reproducer differs: it does not use a NULL aggregate sort key or MIN/MAX. The row loss appears with a non-NULL grouped result when an inner ORDER BY/LIMIT is followed by an outer ORDER BY/OFFSET.
Environment
datafusion==53.0.0
pyarrow==24.0.0
- Python
3.12.3
Reproduction
#!/usr/bin/env python3
from __future__ import annotations
import datafusion
import pyarrow as pa
from datafusion import SessionContext
def register(ctx: SessionContext, name: str, rows: list[dict], schema: pa.Schema) -> None:
batch = pa.RecordBatch.from_pylist(rows, schema=schema)
ctx.register_record_batches(name, [[batch]])
def main() -> None:
ctx = SessionContext()
register(
ctx,
"t0",
[{"id": 0}, {"id": 1}],
pa.schema([pa.field("id", pa.int64())]),
)
register(
ctx,
"t1",
[{"id": 1, "j": 1}],
pa.schema([
pa.field("id", pa.int64()),
pa.field("j", pa.int64(), nullable=True),
]),
)
base = (
"SELECT t0.id, COUNT(t0.id) AS count_id, COUNT(DISTINCT j) AS nunique_j "
"FROM t0 LEFT JOIN t1 ON t0.id = t1.id GROUP BY t0.id"
)
control_query = (
f"SELECT * FROM ({base}) q "
"ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j ASC NULLS LAST "
"OFFSET 1"
)
failing_query = (
f"SELECT * FROM ("
f" SELECT * FROM ({base}) q "
" ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j DESC NULLS LAST "
" LIMIT 8"
") q2 "
"ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j ASC NULLS LAST "
"OFFSET 1"
)
print(f"datafusion={getattr(datafusion, '__version__', 'unknown')}")
print(f"pyarrow={pa.__version__}")
control = ctx.sql(control_query).to_pandas()
failing = ctx.sql(failing_query).to_pandas()
print("control query result:")
print(control)
print("query with inner ORDER BY/LIMIT result:")
print(failing)
assert len(control) == 1 and int(control.iloc[0]["id"]) == 0
assert len(failing) == 1 and int(failing.iloc[0]["id"]) == 0
if __name__ == "__main__":
main()
Expected behavior
Both queries should return the second grouped row after the final ordering and OFFSET 1:
id=0, count_id=1, nunique_j=0
Actual behavior
The control query returns the expected row, but the query with the inner ORDER BY ... LIMIT 8 returns an empty result.
Observed output:
control query result:
id count_id nunique_j
0 0 1 0
query with inner ORDER BY/LIMIT result:
Empty DataFrame
Columns: [id, count_id, nunique_j]
Index: []
To Reproduce
No response
Expected behavior
No response
Additional context
No response
Describe the bug
Summary
DataFusion
53.0.0returns an empty result for a two-row grouped aggregate when an inner orderedLIMITis followed by an outerORDER BY ... OFFSET 1.This may be related to #22190, but the reproducer differs: it does not use a NULL aggregate sort key or
MIN/MAX. The row loss appears with a non-NULL grouped result when an innerORDER BY/LIMITis followed by an outerORDER BY/OFFSET.Environment
datafusion==53.0.0pyarrow==24.0.03.12.3Reproduction
Expected behavior
Both queries should return the second grouped row after the final ordering and
OFFSET 1:Actual behavior
The control query returns the expected row, but the query with the inner
ORDER BY ... LIMIT 8returns an empty result.Observed output:
To Reproduce
No response
Expected behavior
No response
Additional context
No response