Skip to content

DataFusion drops grouped row after inner ORDER BY/LIMIT and outer ORDER BY/OFFSET #22489

@Fly-a-Kite

Description

@Fly-a-Kite

Describe the bug

Summary

DataFusion 53.0.0 returns an empty result for a two-row grouped aggregate when an inner ordered LIMIT is followed by an outer ORDER BY ... OFFSET 1.

This may be related to #22190, but the reproducer differs: it does not use a NULL aggregate sort key or MIN/MAX. The row loss appears with a non-NULL grouped result when an inner ORDER BY/LIMIT is followed by an outer ORDER BY/OFFSET.

Environment

  • datafusion==53.0.0
  • pyarrow==24.0.0
  • Python 3.12.3

Reproduction

#!/usr/bin/env python3

from __future__ import annotations

import datafusion
import pyarrow as pa
from datafusion import SessionContext


def register(ctx: SessionContext, name: str, rows: list[dict], schema: pa.Schema) -> None:
    batch = pa.RecordBatch.from_pylist(rows, schema=schema)
    ctx.register_record_batches(name, [[batch]])


def main() -> None:
    ctx = SessionContext()

    register(
        ctx,
        "t0",
        [{"id": 0}, {"id": 1}],
        pa.schema([pa.field("id", pa.int64())]),
    )
    register(
        ctx,
        "t1",
        [{"id": 1, "j": 1}],
        pa.schema([
            pa.field("id", pa.int64()),
            pa.field("j", pa.int64(), nullable=True),
        ]),
    )

    base = (
        "SELECT t0.id, COUNT(t0.id) AS count_id, COUNT(DISTINCT j) AS nunique_j "
        "FROM t0 LEFT JOIN t1 ON t0.id = t1.id GROUP BY t0.id"
    )

    control_query = (
        f"SELECT * FROM ({base}) q "
        "ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j ASC NULLS LAST "
        "OFFSET 1"
    )

    failing_query = (
        f"SELECT * FROM ("
        f"  SELECT * FROM ({base}) q "
        "   ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j DESC NULLS LAST "
        "   LIMIT 8"
        ") q2 "
        "ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j ASC NULLS LAST "
        "OFFSET 1"
    )

    print(f"datafusion={getattr(datafusion, '__version__', 'unknown')}")
    print(f"pyarrow={pa.__version__}")

    control = ctx.sql(control_query).to_pandas()
    failing = ctx.sql(failing_query).to_pandas()

    print("control query result:")
    print(control)

    print("query with inner ORDER BY/LIMIT result:")
    print(failing)

    assert len(control) == 1 and int(control.iloc[0]["id"]) == 0
    assert len(failing) == 1 and int(failing.iloc[0]["id"]) == 0


if __name__ == "__main__":
    main()

Expected behavior

Both queries should return the second grouped row after the final ordering and OFFSET 1:

id=0, count_id=1, nunique_j=0

Actual behavior

The control query returns the expected row, but the query with the inner ORDER BY ... LIMIT 8 returns an empty result.

Observed output:

control query result:
   id  count_id  nunique_j
0   0         1          0

query with inner ORDER BY/LIMIT result:
Empty DataFrame
Columns: [id, count_id, nunique_j]
Index: []

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions