Add row-number late materialization for TopK #22617

run benchmarks

🤖 Benchmark running (GKE). Comparing codex/late-materialization-topk (68e8de6) to 0da8961 (merge-base) using: clickbench_partitioned

🤖 Benchmark running (GKE). Comparing codex/late-materialization-topk (68e8de6) to 0da8961 (merge-base) using: tpcds

🤖 Benchmark running (GKE). Comparing codex/late-materialization-topk (68e8de6) to 0da8961 (merge-base) using: tpch

…ization-topk
# Conflicts:
# datafusion/core/src/optimizer_rule_reference.md
# datafusion/datasource-parquet/src/opener/mod.rs

🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch

Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported that the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch

Is this selective late materialization 👀?

That would be the idea!

run benchmarks

🤖 Benchmark running (GKE). Comparing codex/late-materialization-topk (c5da797) to 32a1fe5 (merge-base) using: clickbench_partitioned

🤖 Benchmark running (GKE). Comparing codex/late-materialization-topk (c5da797) to 32a1fe5 (merge-base) using: tpcds

🤖 Benchmark running (GKE). Comparing codex/late-materialization-topk (c5da797) to 32a1fe5 (merge-base) using: tpch

🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch

🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch

Which issue does this PR close?
None yet.
Rationale for this change
Large `ORDER BY ... LIMIT` queries can sort only key columns first and materialize non-key columns after the TopK rows are known. This reduces the amount of data read and moved through TopK execution, while preserving result order through row-number based selection.

What changes are included in this PR?
- `LateMaterialization` physical optimizer rule enabled by default via `datafusion.optimizer.enable_row_number_topk_late_materialization`
- `LateTopKMaterializationExec`, including a generic fallback path and a Parquet/file row-selection pushdown path
- `FileRowsSelection` support in file scan extensions and Parquet access planning

Are these changes tested?
Yes.
- `cargo fmt --all`
- `cargo clippy --all-targets --all-features -- -D warnings`
- `cargo check -p datafusion-datasource-parquet -p datafusion-physical-optimizer -p datafusion`
- `cargo clippy -p datafusion-physical-optimizer -p datafusion-datasource-parquet -- -D warnings`
- `cargo test -p datafusion --test core_integration late_materialization`
- `cargo test -p datafusion --test parquet_integration file_rows_selection`
- `git diff --check`

Are there any user-facing changes?
Yes. The new optimizer rule is enabled by default and can be disabled with `datafusion.optimizer.enable_row_number_topk_late_materialization = false`.
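
For readers unfamiliar with the technique described in the rationale, here is a minimal standalone sketch of row-number late materialization. It is not the code in this PR and does not use DataFusion or Arrow APIs: `Table`, `top_k_row_numbers`, `materialize`, and `late_topk` are hypothetical names invented for illustration, and the sketch assumes a single ascending `i64` sort key with ties broken by row number.

```rust
use std::collections::BinaryHeap;

/// Hypothetical in-memory table: one sort key column plus one payload
/// column standing in for the wide non-key columns.
struct Table {
    keys: Vec<i64>,
    payload: Vec<String>,
}

/// Phase 1: run TopK over (key, row_number) pairs only.
/// Returns the row numbers of the k smallest keys, in ascending key order.
fn top_k_row_numbers(keys: &[i64], k: usize) -> Vec<usize> {
    // Max-heap holding the k smallest (key, row_number) pairs seen so far.
    let mut heap: BinaryHeap<(i64, usize)> = BinaryHeap::with_capacity(k + 1);
    for (row, &key) in keys.iter().enumerate() {
        heap.push((key, row));
        if heap.len() > k {
            heap.pop(); // evict the largest pair, keeping the heap at size k
        }
    }
    // into_sorted_vec yields ascending order, which is the final result order.
    heap.into_sorted_vec().into_iter().map(|(_, row)| row).collect()
}

/// Phase 2: fetch the non-key columns only for the selected rows,
/// preserving the order established in phase 1.
fn materialize(table: &Table, rows: &[usize]) -> Vec<(i64, String)> {
    rows.iter()
        .map(|&r| (table.keys[r], table.payload[r].clone()))
        .collect()
}

/// ORDER BY key LIMIT k, with the payload touched only for the k winners.
fn late_topk(table: &Table, k: usize) -> Vec<(i64, String)> {
    let rows = top_k_row_numbers(&table.keys, k);
    materialize(table, &rows)
}

fn main() {
    let table = Table {
        keys: vec![42, 7, 19, 3, 25],
        payload: ["e", "b", "c", "a", "d"].iter().map(|s| s.to_string()).collect(),
    };
    for (key, value) in late_topk(&table, 3) {
        println!("{key} {value}");
    }
}
```

In this sketch the payload column is read only for the `k` selected rows, which mirrors the motivation stated above; how the PR's Parquet row-selection path fetches those rows is not shown here.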