test: benchmarks and SLT tests for push-down TopK through join by adriangb · Pull Request #22760 · apache/datafusion

adriangb · 2026-06-04T12:50:19Z

Which issue does this PR close?

Relates to Push down TopK below Join #11900

Rationale for this change

This splits the test and benchmark scaffolding out of #21621 so the
PushDownTopKThroughJoin optimizer rule itself can be reviewed in
isolation, with a small, focused diff.

The benchmark and SLT files here do not depend on the rule. They are
committed first so that:

- The benchmark can measure the rule's effect against a baseline that
  does not register it.
- The follow-up rule PR's diff shows exactly which plans change, since
  the EXPLAIN plans here capture the current (pre-rule) behavior.

What changes are included in this PR?

- A `push_down_topk` benchmark (`dfbench push-down-topk`) that runs
  `ORDER BY <cols> LIMIT N` queries over outer joins against TPC-H
  customer/orders/nation, plus its query files under
  `benchmarks/queries/push_down_topk/`.
- `push_down_topk_through_join.slt`, covering the scenarios the rule
  handles: preserved-side sort keys, ineligible join types
  (inner/full/semi/anti), ON-clause filters, projection and
  SubqueryAlias resolution, existing child sorts, ties, multi-level
  joins, OFFSET, and volatile expressions.
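To make the "preserved-side sort keys" and "ineligible join types" scenarios concrete, here is a minimal sketch in plain Rust (standard library only, not DataFusion's API). It assumes the rule keeps the original TopK above the join and adds a copy of it on the preserved (left) input; that reading, the `Customer`/`Order` structs, and the `join`, `top_n`, and `query` helpers are all illustrative assumptions rather than code from the PR. With a LEFT JOIN every left row yields at least one output row, so pre-limiting the left side should not change the final top N. With an INNER JOIN the pre-limited rows may have no match, so the same rewrite can lose rows.

```rust
use std::cmp::Reverse;

#[derive(Clone, Debug, PartialEq)]
struct Customer { custkey: u32, acctbal: i64 }

#[derive(Clone, Debug, PartialEq)]
struct Order { orderkey: u32, custkey: u32 }

type Row = (Customer, Option<Order>);

/// Nested-loop join on custkey. `keep_unmatched = true` behaves like a
/// LEFT JOIN (customers are the preserved side); `false` like an INNER JOIN.
fn join(customers: &[Customer], orders: &[Order], keep_unmatched: bool) -> Vec<Row> {
    let mut out = Vec::new();
    for c in customers {
        let mut matched = false;
        for o in orders.iter().filter(|o| o.custkey == c.custkey) {
            out.push((c.clone(), Some(o.clone())));
            matched = true;
        }
        if !matched && keep_unmatched {
            out.push((c.clone(), None));
        }
    }
    out
}

/// ORDER BY acctbal DESC LIMIT n, using a stable sort.
fn top_n<T>(mut rows: Vec<T>, n: usize, acctbal: impl Fn(&T) -> i64) -> Vec<T> {
    rows.sort_by_key(|r| Reverse(acctbal(r)));
    rows.truncate(n);
    rows
}

/// Runs the query with or without a copy of the TopK below the join.
/// The TopK above the join is always kept (an assumption of this sketch).
fn query(
    customers: &[Customer],
    orders: &[Order],
    n: usize,
    keep_unmatched: bool,
    push_down: bool,
) -> Vec<Row> {
    let left = if push_down {
        top_n(customers.to_vec(), n, |c| c.acctbal)
    } else {
        customers.to_vec()
    };
    top_n(join(&left, orders, keep_unmatched), n, |r| r.0.acctbal)
}

/// Made-up data: customer 1 has the highest balance and no orders.
fn sample() -> (Vec<Customer>, Vec<Order>) {
    let customers = vec![
        Customer { custkey: 1, acctbal: 900 },
        Customer { custkey: 2, acctbal: 500 },
        Customer { custkey: 3, acctbal: 300 },
        Customer { custkey: 4, acctbal: 50 },
    ];
    let orders = vec![
        Order { orderkey: 10, custkey: 2 },
        Order { orderkey: 11, custkey: 2 },
        Order { orderkey: 12, custkey: 3 },
    ];
    (customers, orders)
}

fn main() {
    let (customers, orders) = sample();
    // LEFT JOIN: pushing the limit to the preserved side keeps the result.
    for n in 0..=6 {
        assert_eq!(
            query(&customers, &orders, n, true, false),
            query(&customers, &orders, n, true, true)
        );
    }
    // INNER JOIN: the top customer has no orders, so the pushed-down plan
    // returns nothing while the baseline still returns one row.
    assert_eq!(query(&customers, &orders, 1, false, false).len(), 1);
    assert!(query(&customers, &orders, 1, false, true).is_empty());
    println!("left join pushdown preserved results; inner join did not");
}
```

The sample balances are unique on purpose. With ties on the sort key, either plan may return a different but equally valid set of tied rows, which is presumably why ties get their own scenario in the SLT file.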

The EXPLAIN plans assert current behavior (TopK not yet pushed through
the join). The follow-up PR that adds the rule updates those plans in
place; the query-result checks hold regardless of whether the rule is
enabled.

The new optimizer rule, the push_down_limit.rs changes, and the
optimizer_rule_reference.md update from #21621 are intentionally left
for the follow-up PR.

Are these changes tested?

Yes — this PR is the tests. push_down_topk_through_join.slt passes
against main, and the benchmark binary compiles and runs.

Are there any user-facing changes?

No. No API changes; only new benchmark and test files plus benchmark CLI
wiring.

Splits the test and benchmark scaffolding out of apache#21621 so the `PushDownTopKThroughJoin` optimizer rule can be reviewed on its own.

- Adds a `push_down_topk` benchmark (`dfbench push-down-topk`) that runs ORDER BY ... LIMIT queries over outer joins against TPC-H data, so the rule's effect can be measured against a baseline that does not register it.
- Adds `push_down_topk_through_join.slt` covering the scenarios the rule handles (preserved-side sort keys, ineligible join types, semi/anti joins, projection/alias resolution, ties, multi-level joins, volatility).

The EXPLAIN plans capture current (pre-rule) behavior so the follow-up rule PR's diff shows exactly which plans change; the query-result checks hold either way.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alamb

Thank you @adriangb and @SubhamSinghal

alamb · 2026-06-04T14:19:15Z

+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmark for `push_down_topk_through_join`.


Do we really need a whole special executor for this benchmark? Can we use the new benchmark runner stuff that @Omega359 is working on?

I initially copied these over verbatim, but I will rework them into the new framework.

Addresses review feedback: replace the bespoke `push_down_topk` Rust executor with declarative `.benchmark` files under `sql_benchmarks/push_down_topk/`, run by the existing `cargo bench --bench sql` harness.

- Removes `benchmarks/src/push_down_topk.rs` and its `dfbench` subcommand wiring (`lib.rs`, `dfbench.rs` now match main).
- Adds `sql_benchmarks/push_down_topk/{benchmarks/q01..q05.benchmark, init/load.sql,init/cleanup.sql}`, reusing the TPC-H parquet data.
- Wires `push_down_topk` into `bench.sh` (data + run) and documents the suite in the sql_benchmarks README.

Verified all five queries load, assert, and run via `BENCH_NAME=push_down_topk cargo bench --bench sql` against a generated TPC-H dataset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Jun 4, 2026

adriangb mentioned this pull request Jun 4, 2026

Push down topk through join #21621

Open

alamb approved these changes Jun 4, 2026

adriangb enabled auto-merge June 4, 2026 15:19

adriangb added this pull request to the merge queue Jun 4, 2026

Merged via the queue into apache:main with commit 249b599 Jun 4, 2026
57 of 58 checks passed

adriangb deleted the push-down-topk-bench-tests branch June 4, 2026 15:52

adriangb merged 2 commits into apache:main from pydantic:push-down-topk-bench-tests
