@kosiew
Contributor

@kosiew kosiew commented Oct 2, 2025

Which issue does this PR close?

Rationale for this change

The projection-pruning rule in the optimizer previously treated any SubqueryAlias whose alias name did not exactly match the CTE name as an "other subquery", and therefore aborted projection pushdown for that branch. This incorrectly prevented projection pushdown when the recursive CTE referenced itself with an alias (for example FROM nodes AS child).

Because the optimizer could not recognize that the aliased reference still targeted the same CTE, it conservatively kept all columns on the table scan that feeds the recursive branch. In practice this can cause unnecessary I/O (for example with Parquet) because columns not required by the final output are read.

This change allows the optimizer to detect aliased self-references inside a recursive CTE and continue projection pushdown into the recursive term when safe.
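The difference between the old and new decision can be sketched with a toy model. This is only an illustration under simplifying assumptions: `AliasedRef`, `blocks_pushdown_old`, and `blocks_pushdown_new` are hypothetical names that do not exist in DataFusion, and the real rule operates on `LogicalPlan` nodes rather than on a pair of strings.

```rust
/// Toy stand-in for a `SubqueryAlias` node: the alias it introduces and the
/// table its input ultimately scans. Hypothetical type, for illustration only.
struct AliasedRef<'a> {
    alias: &'a str,
    scanned_table: &'a str,
}

/// Old behavior as described above: any alias that differs from the CTE
/// name is treated as an unrelated subquery, which blocks pushdown.
fn blocks_pushdown_old(r: &AliasedRef<'_>, cte_name: &str) -> bool {
    r.alias != cte_name
}

/// New behavior: a differing alias only blocks pushdown when the aliased
/// input does not resolve back to the CTE itself.
fn blocks_pushdown_new(r: &AliasedRef<'_>, cte_name: &str) -> bool {
    r.alias != cte_name && r.scanned_table != cte_name
}
```

For `FROM nodes AS child` inside a recursive CTE named `nodes`, the old check blocks pushdown because `child != nodes`, while the new check allows it because the alias still resolves to `nodes`. An alias over an unrelated table is blocked by both.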

What changes are included in this PR?

  • Modify plan_contains_other_subqueries in datafusion/optimizer/src/optimize_projections/mod.rs so that a SubqueryAlias whose alias name differs from the CTE name is no longer treated as an unrelated subquery outright. Instead, a helper checks whether the aliased input ultimately targets the recursive CTE.

  • Add helper function subquery_alias_targets_recursive_cte to optimize_projections/mod.rs which recursively walks a plan (through SubqueryAlias and single-input operators) to determine whether the leaf TableScan refers to the CTE name.

  • Add an integration test recursive_cte_with_aliased_self_reference in datafusion/optimizer/tests/optimizer_integration.rs which asserts that projection pushdown occurs when a recursive CTE references itself with an alias. The test checks that only the projected column (id) is kept in the TableScan of the recursive branch.

Files changed (summary):

  • datafusion/optimizer/src/optimize_projections/mod.rs

    • Allow descending into aliased subqueries to see if they target the same recursive CTE.
    • Add subquery_alias_targets_recursive_cte.
  • datafusion/optimizer/tests/optimizer_integration.rs

    • Add recursive_cte_with_aliased_self_reference test.

Are these changes tested?

Yes. This PR adds an integration test (recursive_cte_with_aliased_self_reference) that reproduces the problematic scenario and validates the expected plan after optimization. The existing test harness runs the new test as part of the optimizer integration suite.

Are there any user-facing changes?

No changes to public APIs or SQL syntax. This is an internal optimizer improvement that can reduce unnecessary I/O by enabling projection pushdown in more recursive-CTE cases (those where the recursive term refers to the CTE through an alias). There are no breaking changes.

Additional notes / implementation details

  • The heuristic used by subquery_alias_targets_recursive_cte is intentionally conservative: it only walks through SubqueryAlias and operators with a single input. If a plan node has multiple inputs (e.g. a join), the helper returns false so that unrelated plans are not mistakenly detected as targeting the CTE.

  • The change preserves safety by only allowing pushdown when we can be confident the aliased subquery resolves back to the same CTE's table scan.
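A minimal sketch of the conservative walk, assuming a simplified plan type. The `Plan` enum and the `targets_recursive_cte` function below are hypothetical stand-ins, not DataFusion's `LogicalPlan` or the actual body of subquery_alias_targets_recursive_cte; they only mirror the rule stated above (descend through aliases and single-input operators, give up on multi-input nodes).

```rust
/// Toy logical plan; a simplified, hypothetical stand-in for DataFusion's
/// `LogicalPlan`, not its real definition.
#[allow(dead_code)]
enum Plan {
    TableScan { table: String },
    SubqueryAlias { alias: String, input: Box<Plan> },
    Filter { input: Box<Plan> },
    Join { left: Box<Plan>, right: Box<Plan> },
}

/// Returns true only when `plan` reaches a scan of `cte_name` by passing
/// exclusively through aliases and single-input operators.
fn targets_recursive_cte(plan: &Plan, cte_name: &str) -> bool {
    match plan {
        // Leaf: the walk succeeds only if the scan is of the CTE itself.
        Plan::TableScan { table } => table.as_str() == cte_name,
        // Aliases and single-input operators are transparent to the walk.
        Plan::SubqueryAlias { input, .. } | Plan::Filter { input } => {
            targets_recursive_cte(input, cte_name)
        }
        // Multi-input nodes: answer false rather than guess which side matters.
        Plan::Join { .. } => false,
    }
}
```

In this sketch, an alias `child` over a filter over a scan of `nodes` is recognized as targeting `nodes`, while a join of two `nodes` scans is rejected. That rejection can miss some safe cases, which is the trade-off the conservative design accepts.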

@github-actions github-actions bot added the optimizer Optimizer rules label Oct 2, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Oct 2, 2025
@kosiew kosiew marked this pull request as ready for review October 2, 2025 09:51
Comment on lines 803 to 819
    let formatted = arrow::util::pretty::pretty_format_batches(&actual)
        .unwrap()
        .to_string();

    let scan_line = formatted
        .lines()
        .find(|line| line.contains("DataSourceExec"))
        .expect("DataSourceExec not found");

    assert!(
        scan_line.contains("projection=[id]"),
        "expected scan to only project id column, found: {scan_line}"
    );
    assert!(
        !scan_line.contains("parent_id") && !scan_line.contains("value"),
        "unexpected columns projected in scan: {scan_line}"
    );
Contributor

Probably better to assert_snapshot!() the explain plan here, to make the test more robust.

Contributor Author

I'll amend the test.

kosiew and others added 3 commits October 3, 2025 15:44
@kosiew kosiew marked this pull request as draft October 3, 2025 11:41
@kosiew kosiew marked this pull request as ready for review October 3, 2025 15:00
@alamb
Contributor

alamb commented Oct 3, 2025

I ran the reproducer from

with the code from this PR, and the output is:

|                   |               DataSourceExec: file_groups={1 group: [[var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpgg4Dpy/sample_data.parquet]]}, projection=[id], file_type=parquet, predicate=id@0 = 1, pruning_predicate=id_null_count@2 != row_count@3 AND id_min@0 <= 1 AND 1 <= id_max@1, required_guarantees=[id in (1)], metrics=[output_rows=10, elapsed_compute=1ns, batches_split=0, bytes_scanned=123, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=10, page_index_rows_pruned=0, predicate_cache_inner_records=0, predicate_cache_records=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=146.668µs, metadata_load_time=272.042µs, page_index_eval_time=141.793µs, row_pushdown_eval_time=2ns, statistics_eval_time=138.251µs, time_elapsed_opening=1.353083ms, time_elapsed_processing=1.590625ms, time_elapsed_scanning_total=372.166µs, time_elapsed_scanning_until_data=335.25µs] |

Which has only the necessary columns, as expected:

projection=[id]

🎉

Thank you @kosiew and @Jefffrey

@alamb alamb added the performance Make DataFusion faster label Oct 3, 2025
@alamb alamb added this pull request to the merge queue Oct 4, 2025
Merged via the queue into apache:main with commit 76904e8 Oct 4, 2025
29 checks passed
Labels

core Core DataFusion crate optimizer Optimizer rules performance Make DataFusion faster

3 participants