Follow-ups for stats-based RG / file reorder (#21956): multi-column, function-wrapped, compound exprs #22198

Follow-up enhancements to the row-group / file stats-based reorder that landed in #21956, listed here so they can be picked up independently.

Today PreparedAccessPlan::reorder_by_statistics and ParquetSource::reorder_files both:

  • only look at sort_order.first() — secondary sort keys are ignored
  • only fire when the leading expression is a plain Column — function-wrapped sorts (date_trunc('day', ts) DESC, CAST(x AS Date) DESC, ceil(value) DESC, …) skip the stats-based reorder

Four follow-ups, by priority:


P0 — Multi-column lexicographic reorder by per-column min

For ORDER BY a ASC, b DESC LIMIT N where both a and b are plain columns in the file schema, lex-sort row groups (and files) by (min(a) ASC, min(b) DESC) instead of just min(a) ASC. Parquet stores per-column stats, so the data is already there — switch from arrow::compute::sort_to_indices over a single stat_mins to lexsort_to_indices over per-column min arrays, with per-column SortOptions.
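A minimal sketch of that switch, assuming the per-row-group minimums are already materialized as one Arrow array per sort column (the min_arrays / descending inputs and the function name are illustrative, not the actual DataFusion code):

```rust
use arrow::array::ArrayRef;
use arrow::compute::{lexsort_to_indices, SortColumn, SortOptions};
use arrow::error::ArrowError;

/// Lexicographically order row groups (or files) by per-column min statistics.
/// `min_arrays[i]` holds the per-row-group minimum of the i-th sort column;
/// `descending[i]` is that column's sort direction.
fn lex_reorder_indices(
    min_arrays: Vec<ArrayRef>,
    descending: Vec<bool>,
) -> Result<Vec<usize>, ArrowError> {
    let columns: Vec<SortColumn> = min_arrays
        .into_iter()
        .zip(descending)
        .map(|(values, desc)| SortColumn {
            values,
            options: Some(SortOptions { descending: desc, nulls_first: false }),
        })
        .collect();
    // Sorts by the first column, breaking ties with the second, and so on.
    let indices = lexsort_to_indices(&columns, None)?;
    Ok(indices.values().iter().map(|&i| i as usize).collect())
}
```

The resulting indices would then drive the row-group / file permutation the same way the single-column path does today.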

Most useful when the leading sort column has low cardinality (e.g. ORDER BY country, ts DESC LIMIT 100 — RGs that tie on country.min should be broken by ts.min).

Estimated cost: ~30-50 lines + a few unit tests.

P1 — Function-wrapped reorder via leaf-column extraction

For sort expressions like date_trunc('day', ts) DESC or CAST(x AS Date) DESC — a single leaf Column wrapped in a monotonic function chain. Parquet has no stats keyed by the function expression, but for a monotonically increasing f, min(f(col)) = f(min(col)) (for a decreasing f the min/max roles flip), so ordering by min(col) is a valid proxy.

Approach:

  1. Walk the sort expression's PhysicalExpr::children() to find the unique leaf Column.
  2. Verify monotonicity through the wrapper chain — EquivalenceProperties already does this in ordering_satisfy; the same machinery should be reusable here.
  3. Drive reorder_by_statistics off the leaf column's stats.

DataFusion's existing evaluate_bounds / monotonicity inference is the right tool — happy to discuss the cleanest hook point.
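A rough sketch of step 1, extracting the unique leaf Column by walking children() (helper names are illustrative; there may already be a utility in the physical-expr crate that does this walk):

```rust
use std::sync::Arc;
use datafusion::physical_expr::expressions::Column;
use datafusion::physical_expr::PhysicalExpr;

/// Return the single leaf Column under a sort expression, or None when the
/// expression references zero or multiple columns (in which case the
/// stats-based reorder should simply not fire).
fn unique_leaf_column(expr: &Arc<dyn PhysicalExpr>) -> Option<Column> {
    let mut leaves = Vec::new();
    collect_leaf_columns(expr, &mut leaves);
    match leaves.as_slice() {
        [only] => Some(only.clone()),
        _ => None,
    }
}

/// Recursively walk the expression tree, collecting every Column leaf.
fn collect_leaf_columns(expr: &Arc<dyn PhysicalExpr>, out: &mut Vec<Column>) {
    if let Some(col) = expr.as_any().downcast_ref::<Column>() {
        out.push(col.clone());
        return;
    }
    for child in expr.children() {
        collect_leaf_columns(&child, out);
    }
}
```

Step 2 (verifying that every wrapper in the chain is monotonic) is the part where the existing EquivalenceProperties / monotonicity machinery should be reused rather than reimplemented.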

P1 — Detect "already-sorted" sources to skip redundant reorder

When the source's declared output_ordering already matches the request modulo NULL placement (and the column is either non-nullable or known null-free), try_pushdown_sort currently returns Inexact instead of Exact because EquivalenceProperties::ordering_satisfy is strict about nulls_first / nulls_last.

Concrete example surfaced while diagnosing CI snapshot drift on topk.slt's partial_sorted case in #21956:

  • Source declares output_ordering=[number DESC] — DataFusion's default makes that number DESC NULLS FIRST.
  • Query asks for ORDER BY number DESC NULLS LAST LIMIT 3.
  • ordering_satisfy returns false on the NULL position mismatch, so the Phase 3 column-in-schema branch fires, setting sort_order_for_reorder=[number DESC NULLS LAST] and reverse_row_groups=true.
  • At runtime the opener sorts row groups ASC-by-min, then reverses iteration — two operations that very nearly cancel out, returning the row groups to (approximately) their on-disk DESC order. Correct, but wasteful, and the surrounding SortExec is kept even though the data is already in the requested order on disk.

A softer "Exact-equivalent" check would help: when the source ordering matches the request on (expr, descending) and either the column is non-nullable or null-count statistics confirm no NULLs in the sort column, try_pushdown_sort could short-circuit to Exact so the optimizer drops the SortExec outright.
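As a sketch of what that check could look like (the SortKey type, matches_modulo_nulls helper, and statistics inputs are all illustrative stand-ins, not the real try_pushdown_sort types):

```rust
use arrow::compute::SortOptions;

/// Illustrative description of one declared / requested sort key.
struct SortKey {
    column: String,
    options: SortOptions, // descending + nulls_first
}

/// True when the declared ordering matches the request except for NULL
/// placement, and we can prove the sort column contains no NULLs, so the
/// nulls_first / nulls_last difference cannot be observed and the source
/// ordering can be treated as Exact.
fn matches_modulo_nulls(
    declared: &SortKey,
    requested: &SortKey,
    column_is_non_nullable: bool,
    // From Parquet null-count statistics, if available for every row group.
    null_count_is_zero: Option<bool>,
) -> bool {
    declared.column == requested.column
        && declared.options.descending == requested.options.descending
        && (column_is_non_nullable || null_count_is_zero == Some(true))
}
```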

P2 — Compound-expression reorder via interval propagation (probably defer)

For sorts on expressions involving more than one column (a + b DESC, a * 2 - b DESC, …), Parquet stats give per-column intervals; combining them needs operator-specific interval arithmetic ([min(a)+min(b), max(a)+max(b)] for +, sign-aware endpoint selection for *, etc.). DataFusion has the building blocks (evaluate_bounds), but real-world workloads rarely sort by compound expressions and the integration cost is non-trivial. Listed for completeness; probably not worth doing unless a concrete benchmark motivates it.
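For reference, the interval arithmetic in question is just the following (illustrative only; DataFusion's own interval / evaluate_bounds machinery would be the real implementation):

```rust
/// A closed [min, max] interval derived from per-column Parquet statistics.
#[derive(Debug, Clone, Copy)]
struct Interval {
    min: f64,
    max: f64,
}

/// Bound for a + b: endpoints add componentwise.
fn add(a: Interval, b: Interval) -> Interval {
    Interval { min: a.min + b.min, max: a.max + b.max }
}

/// Bound for a * b: sign-aware, since any pairing of endpoints can produce
/// the new minimum or maximum.
fn mul(a: Interval, b: Interval) -> Interval {
    let candidates = [a.min * b.min, a.min * b.max, a.max * b.min, a.max * b.max];
    Interval {
        min: candidates.iter().cloned().fold(f64::INFINITY, f64::min),
        max: candidates.iter().cloned().fold(f64::NEG_INFINITY, f64::max),
    }
}
```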


cc @adriangb @alamb — wanted to keep the parent PR focused on the core mechanism + the review feedback you raised; these were good directions that came out of the design discussion but seemed better as separate issues.
