Skip to content

DISTINCT ORDER BY NULLS FIRST LIMIT returns the wrong top row #22554

@Fly-a-Kite

Description

@Fly-a-Kite

Describe the bug

SELECT DISTINCT v FROM t0 ORDER BY v ASC NULLS FIRST LIMIT 1 should return
the first row of the same ordered DISTINCT query without LIMIT. On
DataFusion 53.0.0, the full ordered query puts NULL first, but adding
LIMIT 1 returns the first non-null value for string, integer, and float
columns.

To Reproduce

Environment

 OS: Ubuntu 24.04.1 x86_64
 Python: 3.12.3
 datafusion: 53.0.0
 pyarrow: 24.0.0

Reproduce

import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()
batch = pa.RecordBatch.from_pylist(
    [{"v": None}, {"v": ""}, {"v": "a"}],
    schema=pa.schema([pa.field("v", pa.string(), nullable=True)]),
)
ctx.register_record_batches("t0", [[batch]])

full_sql = "SELECT DISTINCT v FROM t0 ORDER BY v ASC NULLS FIRST"
top1_sql = full_sql + " LIMIT 1"

full = ctx.sql(full_sql).collect()[0].to_pydict()["v"]
top1 = ctx.sql(top1_sql).collect()[0].to_pydict()["v"]

print("full:", full)
print("top1:", top1)
assert top1 == full[:1]

Output

full: [None, '', 'a']
top1: [None]

Expected behavior

full: [None, '', 'a']
top1: ['']

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingregressionSomething that used to work no longer does

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions