-
Notifications
You must be signed in to change notification settings - Fork 1.8k
optimizer: allow projection pushdown through aliased recursive CTE references #17875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| let formatted = arrow::util::pretty::pretty_format_batches(&actual) | ||
| .unwrap() | ||
| .to_string(); | ||
|
|
||
| let scan_line = formatted | ||
| .lines() | ||
| .find(|line| line.contains("DataSourceExec")) | ||
| .expect("DataSourceExec not found"); | ||
|
|
||
| assert!( | ||
| scan_line.contains("projection=[id]"), | ||
| "expected scan to only project id column, found: {scan_line}" | ||
| ); | ||
| assert!( | ||
| !scan_line.contains("parent_id") && !scan_line.contains("value"), | ||
| "unexpected columns projected in scan: {scan_line}" | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to assert_snapshot!() the explain plan here, to make it more robust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll amend the test.
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
…and improve SQL formatting
…to use correct temporary file path
…rary directory paths in snapshots
|
I ran the reproducer from With the code from this PR and the output is: Which has only the necessary columns, as expected: 🎉 |
Which issue does this PR close?
DataSourceExecis projecting/reading unused columns from Parquet files for recursive queries #16684Rationale for this change
The projection-pruning rule in the optimizer previously treated any
SubqueryAliaswhose alias name did not exactly match the CTE name as an "other subquery", and therefore aborted projection pushdown for that branch. This incorrectly prevented projection pushdown when the recursive CTE referenced itself with an alias (for exampleFROM nodes AS child).Because the optimizer could not recognize that the aliased reference still targeted the same CTE, it conservatively kept all columns on the table scan that feeds the recursive branch. In practice this can cause unnecessary I/O (for example with Parquet) because columns not required by the final output are read.
This change allows the optimizer to detect aliased self-references inside a recursive CTE and continue projection pushdown into the recursive term when safe.
What changes are included in this PR?
Modify
plan_contains_other_subqueriesindatafusion/optimizer/src/optimize_projections/mod.rsso that aSubqueryAliaswhose alias name differs from the CTE name is not immediately treated as an unrelated subquery if the aliased input ultimately targets the same CTE. Instead we call a helper to detect whether the aliased subquery actually targets the recursive CTE.Add helper function
subquery_alias_targets_recursive_ctetooptimize_projections/mod.rswhich recursively walks a plan (throughSubqueryAliasand single-input operators) to determine whether the leafTableScanrefers to the CTE name.Add an integration test
recursive_cte_with_aliased_self_referenceindatafusion/optimizer/tests/optimizer_integration.rswhich asserts that projection pushdown occurs when a recursive CTE references itself with an alias. The test checks that only the projected column (id) is kept in theTableScanof the recursive branch.Files changed (summary):
datafusion/optimizer/src/optimize_projections/mod.rssubquery_alias_targets_recursive_cte.datafusion/optimizer/tests/optimizer_integration.rsrecursive_cte_with_aliased_self_referencetest.Are these changes tested?
Yes — this PR adds an integration test (
recursive_cte_with_aliased_self_reference) that reproduces the problematic scenario and validates the expected plan after optimization. Existing test harness/tooling runs the new test as part of the optimizer integration suite.Are there any user-facing changes?
No changes to public APIs or SQL syntax. This is an internal optimizer improvement which can reduce unnecessary I/O by enabling projection pushdown in more recursive-CTE cases (when the recursive term uses an alias for the CTE). There are no breaking changes.
Additional notes / implementation details
The heuristic used by
subquery_alias_targets_recursive_cteis intentionally conservative: it only walks throughSubqueryAliasand operators with a single input. If a plan node has multiple inputs (e.g. join) the helper returnsfalseso we do not accidentally mis-detect unrelated plans as targeting the CTE.The change preserves safety by only allowing pushdown when we can be confident the aliased subquery resolves back to the same CTE's table scan.