optimizer: allow projection pushdown through aliased recursive CTE references #17875

kosiew · 2025-10-02T07:22:06Z

Which issue does this PR close?

Closes #16684 (DataSourceExec is projecting/reading unused columns from Parquet files for recursive queries)
Closes #17853 (Projection pushdown on Recursive CTEs with nested subqueries)

Rationale for this change

The projection-pruning rule in the optimizer previously treated any SubqueryAlias whose alias name did not exactly match the CTE name as an "other subquery", and therefore aborted projection pushdown for that branch. This incorrectly prevented projection pushdown when the recursive CTE referenced itself with an alias (for example FROM nodes AS child).

Because the optimizer could not recognize that the aliased reference still targeted the same CTE, it conservatively kept all columns on the table scan that feeds the recursive branch. In practice this can cause unnecessary I/O (for example with Parquet) because columns not required by the final output are read.

This change allows the optimizer to detect aliased self-references inside a recursive CTE and continue projection pushdown into the recursive term when safe.

What changes are included in this PR?

- Modify plan_contains_other_subqueries in datafusion/optimizer/src/optimize_projections/mod.rs so that a SubqueryAlias whose alias name differs from the CTE name is no longer immediately treated as an unrelated subquery. Instead, a helper checks whether the aliased input ultimately targets the recursive CTE.
- Add the helper function subquery_alias_targets_recursive_cte to optimize_projections/mod.rs. It recursively walks a plan (through SubqueryAlias and single-input operators) to determine whether the leaf TableScan refers to the CTE name.
- Add an integration test, recursive_cte_with_aliased_self_reference, in datafusion/optimizer/tests/optimizer_integration.rs. It asserts that projection pushdown occurs when a recursive CTE references itself with an alias, by checking that only the projected column (id) is kept in the TableScan of the recursive branch.

Files changed (summary):

datafusion/optimizer/src/optimize_projections/mod.rs
- Allow descending into aliased subqueries to see if they target the same recursive CTE.
- Add subquery_alias_targets_recursive_cte.
datafusion/optimizer/tests/optimizer_integration.rs
- Add recursive_cte_with_aliased_self_reference test.

Are these changes tested?

Yes. This PR adds an integration test (recursive_cte_with_aliased_self_reference) that reproduces the problematic scenario and validates the expected plan after optimization. The existing test harness runs the new test as part of the optimizer integration suite.

Are there any user-facing changes?

No changes to public APIs or SQL syntax. This is an internal optimizer improvement which can reduce unnecessary I/O by enabling projection pushdown in more recursive-CTE cases (when the recursive term uses an alias for the CTE). There are no breaking changes.

Additional notes / implementation details

- The heuristic used by subquery_alias_targets_recursive_cte is intentionally conservative: it only walks through SubqueryAlias and operators with a single input. If a plan node has multiple inputs (e.g. a join), the helper returns false so that unrelated plans are not mistakenly detected as targeting the CTE.
- The change preserves safety by only allowing pushdown when we can be confident the aliased subquery resolves back to the same CTE's table scan.
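
To illustrate the conservative walk described above, here is a minimal, self-contained sketch. It does not use DataFusion's real LogicalPlan type: the Plan enum, the function targets_recursive_cte, and the constructor aliased_self_reference are all hypothetical stand-ins invented for this example, and the actual helper in optimize_projections/mod.rs may differ in its details.

```rust
// Toy stand-in for a logical plan. This is NOT DataFusion's `LogicalPlan`;
// every name below is an assumption made for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    TableScan { table_name: String },
    SubqueryAlias { alias: String, input: Box<Plan> },
    Filter { input: Box<Plan> },
    Projection { input: Box<Plan> },
    Join { left: Box<Plan>, right: Box<Plan> },
}

impl Plan {
    /// Returns the direct children of this node.
    fn inputs(&self) -> Vec<&Plan> {
        match self {
            Plan::TableScan { .. } => vec![],
            Plan::SubqueryAlias { input, .. }
            | Plan::Filter { input }
            | Plan::Projection { input } => vec![input.as_ref()],
            Plan::Join { left, right } => vec![left.as_ref(), right.as_ref()],
        }
    }
}

/// Hypothetical analogue of `subquery_alias_targets_recursive_cte`:
/// follows single-input nodes down to a leaf scan and compares its name
/// with the CTE name. Any node with zero or several inputs (other than a
/// scan) yields `false`, which is the conservative answer.
fn targets_recursive_cte(plan: &Plan, cte_name: &str) -> bool {
    match plan {
        Plan::TableScan { table_name } => table_name.as_str() == cte_name,
        other => {
            let inputs = other.inputs();
            inputs.len() == 1 && targets_recursive_cte(inputs[0], cte_name)
        }
    }
}

/// Builds the plan shape for a recursive term like `FROM nodes AS child`:
/// Projection -> Filter -> SubqueryAlias(child) -> TableScan(nodes).
fn aliased_self_reference() -> Plan {
    Plan::Projection {
        input: Box::new(Plan::Filter {
            input: Box::new(Plan::SubqueryAlias {
                alias: "child".to_string(),
                input: Box::new(Plan::TableScan {
                    table_name: "nodes".to_string(),
                }),
            }),
        }),
    }
}
```

In this toy model, targets_recursive_cte(&aliased_self_reference(), "nodes") follows the alias child down to the scan of nodes and reports a match, whereas wrapping the same plan in a Plan::Join makes the walk stop and return false, mirroring the multi-input rule above.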

Jefffrey · 2025-10-03T04:06:53Z

datafusion/core/tests/sql/explain_analyze.rs

+    let formatted = arrow::util::pretty::pretty_format_batches(&actual)
+        .unwrap()
+        .to_string();
+
+    let scan_line = formatted
+        .lines()
+        .find(|line| line.contains("DataSourceExec"))
+        .expect("DataSourceExec not found");
+
+    assert!(
+        scan_line.contains("projection=[id]"),
+        "expected scan to only project id column, found: {scan_line}"
+    );
+    assert!(
+        !scan_line.contains("parent_id") && !scan_line.contains("value"),
+        "unexpected columns projected in scan: {scan_line}"
+    );


Probably better to assert_snapshot!() the explain plan here, to make it more robust

kosiew · 2025-10-03

I'll amend the test.


alamb · 2025-10-03T17:24:15Z

I ran the reproducer from #16684 (DataSourceExec is projecting/reading unused columns from Parquet files for recursive queries) with the code from this PR, and the output is:

|                   |               DataSourceExec: file_groups={1 group: [[var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpgg4Dpy/sample_data.parquet]]}, projection=[id], file_type=parquet, predicate=id@0 = 1, pruning_predicate=id_null_count@2 != row_count@3 AND id_min@0 <= 1 AND 1 <= id_max@1, required_guarantees=[id in (1)], metrics=[output_rows=10, elapsed_compute=1ns, batches_split=0, bytes_scanned=123, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=10, page_index_rows_pruned=0, predicate_cache_inner_records=0, predicate_cache_records=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=146.668µs, metadata_load_time=272.042µs, page_index_eval_time=141.793µs, row_pushdown_eval_time=2ns, statistics_eval_time=138.251µs, time_elapsed_opening=1.353083ms, time_elapsed_processing=1.590625ms, time_elapsed_scanning_total=372.166µs, time_elapsed_scanning_until_data=335.25µs] |

Which has only the necessary columns, as expected:

projection=[id]

🎉

Thank you @kosiew and @Jefffrey

Add recursive CTE handling for aliased self-references in optimizer

91484b6

github-actions bot added the optimizer Optimizer rules label Oct 2, 2025

Add recursive projection pushdown test for Parquet

bab227b

github-actions bot added the core Core DataFusion crate label Oct 2, 2025

Merge branch 'main' into recursive-project-16684

7bcbaca

kosiew marked this pull request as ready for review October 2, 2025 09:51

Add comment for parquet recursive projection pushdown

118c390

Jefffrey approved these changes Oct 3, 2025


kosiew and others added 3 commits October 3, 2025 15:44

Update datafusion/core/tests/sql/explain_analyze.rs

cb08f17

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

Refactor parquet_recursive_projection_pushdown test to return Result and improve SQL formatting

6a97be7

Fix snapshot assertion in parquet_recursive_projection_pushdown test to use correct temporary file path

d42b9f7

kosiew marked this pull request as draft October 3, 2025 11:41

kosiew added 3 commits October 3, 2025 22:12

Enhance parquet_recursive_projection_pushdown test to normalize temporary directory paths in snapshots

1b1b051

Fix fmt errors

ba5234a

Fix clippy error

bf79a56

kosiew marked this pull request as ready for review October 3, 2025 15:00

alamb added the performance Make DataFusion faster label Oct 3, 2025

alamb added this pull request to the merge queue Oct 4, 2025

Merged via the queue into apache:main with commit 76904e8 Oct 4, 2025
29 checks passed

alamb mentioned this pull request Oct 4, 2025

Projection pushdown on Recursive CTEs with nested subqueries #17853

Open
