OptimizeProjections: safely prune struct-only UNNEST when outputs are unused #20668
Open
kosiew wants to merge 17 commits into apache:main from …
- Added handling for volatile expressions to impact the optimization process within the `optimize_projections` function.
- Introduced checks for volatile expressions in both plan and ancestor nodes to adjust required indices accordingly.
- Updated `RequiredIndices` struct to track whether it encounters volatile expressions and to handle multiplicity sensitivity.
- Implemented new utility functions to streamline the processing of child requirements and eliminate unnecessary unnesting when certain conditions are met.
- Added unit tests to validate the new functionality related to unnesting and aggregation on volatile expression scenarios.

- Allow elimination of unnest operation for empty lists while preserving nulls.
- Modify the `eliminate_unnest_when_only_group_keys_are_required` test case to specify struct unnest conditions.
- Introduce a new test case `keep_list_unnest_when_group_keys_are_only_required_outputs` to verify unnest behavior when only group keys are required.
- Ensure that the optimization logic correctly handles different unnest scenarios based on list and struct types.

- Introduced new SQL Logic Tests to validate unnest pruning behavior in DataFusion.
- Tests include scenarios with empty lists and null values to ensure correct handling of cardinality-sensitive cases.
- Added explanations for expected logical plans for both aggregation and selection queries.

…projections
- Removed repetitive code for handling volatile ancestors across different input plans.
- Introduced a new helper function `with_volatile_if_needed` to encapsulate the logic of conditionally adding a volatile ancestor.
- Improved code readability and maintainability by reducing duplication in `optimize_projections` method.

…city and volatility
- Introduced methods `for_multiplicity_sensitive_child` and `for_multiplicity_insensitive_child` for better handling of child requirements in `RequiredIndices`.
- Replaced usage of `with_volatile_if_needed` with `with_plan_volatile` and `with_volatile_ancestor_if` for clearer logic when managing volatile context.
- Updated `optimize_projections` function to use new methods, improving code readability and maintainability.

…ts for unnest pruning
- Updated the `rewrite_projection_given_requirements` function to enhance handling of projection requirements based on additional conditions such as projected benefit, multiplicity sensitivity, and volatile ancestors.
- Added a new SQL logic test to validate the pruning of struct unnest in cases where it is cardinality-preserving and outputs are irrelevant.
- Improved comments for clarity on unnest semantics regarding null preservation.

…te_projection_given_requirements function
This change simplifies the logic in the `rewrite_projection_given_requirements` function by removing the check for projection benefit, which was deemed unnecessary. This helps streamline the code and improve readability.

… dedicated functions for aggregates, windows, and table scans

Extract repeated child-requirement construction logic into dedicated helper functions to improve code clarity and maintainability. Introduce `build_all_expr_input_requirements`, `build_extension_input_requirements`, and `build_unnest_fallback_requirements` for streamlined requirement handling in various components.

Add helper in `mod.rs` for handling child multiplicity. Replace duplicate code in aggregate and window paths with the new helper method, passing in multiplicity sensitivity based on the presence of expressions. This improves code readability and maintainability.

Extract shared post-processing into `finalize_child_requirements()` to handle multiplicity mode, volatile-ancestor propagation, and plan-volatile propagation. Update `optimize_aggregate_projections` and `optimize_window_projections` to utilize this helper. Improve readability with clearer plural naming for new aggregation and window expressions.

Implement strict proof checks for UNNEST removal. Ensure it is only eliminated under specific conditions, such as when the ancestor context is multiplicity-insensitive, the list rows are provably preserved, and the recursion depth is exactly 1. Add new `optimizer_unnest_prune.slt` coverage for unnest removal in query plans.
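
The strict proof checks described in the last commit can be pictured with a small, self-contained sketch. This is not the PR's code: `UnnestKind`, `Context`, and the field `requires_unnested_output` are hypothetical stand-ins invented for illustration, and only the names `can_eliminate_unnest`, `multiplicity_sensitive`, and `has_volatile_ancestor` are borrowed from the PR text. The real implementation works on DataFusion's `LogicalPlan` and `RequiredIndices` and may differ in detail.

```rust
/// Hypothetical, simplified model of the unnest being considered for removal.
/// These are illustrative types, not DataFusion's actual ones.
#[derive(Debug, Clone, Copy, PartialEq)]
enum UnnestKind {
    /// Struct unnest: one output row per input row.
    Struct,
    /// List unnest: `rows_provably_preserved` is true only when every input
    /// row is known to yield at least one output row; `depth` is the
    /// recursion depth of the unnest.
    List { rows_provably_preserved: bool, depth: usize },
}

/// Hypothetical summary of what the ancestors of the unnest need.
#[derive(Debug, Clone, Copy)]
struct Context {
    /// Some ancestor (for example `COUNT(*)`) observes row multiplicity.
    multiplicity_sensitive: bool,
    /// Some ancestor contains a volatile expression.
    has_volatile_ancestor: bool,
    /// Some ancestor reads a column produced by the unnest itself.
    requires_unnested_output: bool,
}

/// Returns true only when every safety condition holds (assumed gating order).
fn can_eliminate_unnest(kind: UnnestKind, ctx: &Context) -> bool {
    // Any of these makes removal observable, so refuse immediately.
    if ctx.multiplicity_sensitive || ctx.has_volatile_ancestor || ctx.requires_unnested_output {
        return false;
    }
    match kind {
        UnnestKind::Struct => true,
        // List unnest needs proof of row preservation and depth exactly 1.
        UnnestKind::List { rows_provably_preserved, depth } => {
            rows_provably_preserved && depth == 1
        }
    }
}
```

In this toy model a struct unnest under a multiplicity-insensitive, non-volatile context is removable, while a list unnest is kept unless row preservation is proven and the depth is exactly 1.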
Which issue does this PR close?
Rationale for this change
DataFusion’s logical plans can contain `LogicalPlan::Unnest` even when none of the unnested outputs are actually used by ancestor operators. In some cases (notably struct-unnest and provably non-empty deterministic list expressions), `Unnest` is cardinality-preserving and therefore semantically redundant when its outputs are dead.

However, eliminating `Unnest` is not always safe: `COUNT(*)`, window functions, and similar operators can observe row multiplicity changes.

This PR adds strict logical-level safety checks and propagates “multiplicity sensitivity” and “volatile ancestor” context through projection pruning so `Unnest` is removed only when semantics are preserved.

What changes are included in this PR?
Refactor `optimize_projections`
- … (`build_plan_input_requirements`, `build_all_expr_input_requirements`, `build_extension_input_requirements`, `build_join_input_requirements`, `build_unnest_*`).
- … `rewrite_plan_children` to reduce duplication and consistently trigger schema recomputation after child rewrites.
- … `volatile_in_plan` and `RequiredIndices::with_plan_volatile`.

Refactor + centralize projection pruning requirements propagation
- … `multiplicity_sensitive` and `has_volatile_ancestor` to `RequiredIndices`.
- … (`volatile_in_plan`) through requirement propagation.

Safe logical elimination of `LogicalPlan::Unnest`
- Adds `can_eliminate_unnest` gating with strict checks:
  - Parent chain must be multiplicity-insensitive.
  - No volatile ancestor context.
  - All requested outputs are passthrough columns (no unnested outputs required).
  - For list unnest, elimination is only allowed when row preservation is proven:
    - … (`make_array(1,2,3)` or non-empty list literals).

Schema correctness improvements
Tests
Adds unit tests for struct vs list unnest pruning and multiplicity-sensitive negative cases.
Adds a new SQLLogicTest file `optimizer_unnest_prune.slt` validating:
- `EXPLAIN` plans drop `Unnest` only in safe cases.
- … `COUNT(*)` multiplicity sensitivity.

Are these changes tested?
Yes.
Rust unit tests in `optimize_projections`:
- `eliminate_struct_unnest_when_only_group_keys_are_required`
- `keep_list_unnest_when_group_keys_are_only_required_outputs`
- `keep_unnest_when_count_depends_on_row_multiplicity`
- `keep_unnest_when_preserve_nulls_is_disabled`

SQLLogicTests:
- `datafusion/sqllogictest/test_files/optimizer_unnest_prune.slt`
  - … `Unnest` elimination appears only when safe in `EXPLAIN`.

Are there any user-facing changes?
Yes (planner/optimizer behavior):
- When `Unnest` outputs are unused and row multiplicity is guaranteed unchanged, the logical optimizer can now remove `Unnest`. This produces smaller logical plans and can reduce unnecessary computation.

No SQL/API breaking changes are intended.
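
To make the multiplicity argument concrete, here is a toy model written for this description rather than taken from the PR. `Row`, `unnest_keys`, `passthrough_keys`, `count_star`, and `distinct_keys` are all hypothetical names, and the unnest semantics are deliberately simplified (one output row per list element, no rows for an empty list, nulls not modelled). It contrasts a multiplicity-insensitive consumer (distinct group keys) with a multiplicity-sensitive one (`COUNT(*)`).

```rust
use std::collections::BTreeSet;

/// Toy input row: a passthrough key plus a list column (hypothetical type).
#[derive(Debug, Clone)]
struct Row {
    key: &'static str,
    items: Vec<i32>,
}

/// Simplified list unnest that keeps only the passthrough key:
/// one output row per list element, so an empty list yields no rows.
fn unnest_keys(rows: &[Row]) -> Vec<&'static str> {
    rows.iter()
        .flat_map(|r| r.items.iter().map(move |_| r.key))
        .collect()
}

/// The same plan with the unnest removed: one output row per input row.
fn passthrough_keys(rows: &[Row]) -> Vec<&'static str> {
    rows.iter().map(|r| r.key).collect()
}

/// Multiplicity-sensitive consumer, standing in for `COUNT(*)`.
fn count_star(keys: &[&'static str]) -> usize {
    keys.len()
}

/// Multiplicity-insensitive consumer, standing in for grouping on the key only.
fn distinct_keys(keys: &[&'static str]) -> BTreeSet<&'static str> {
    keys.iter().copied().collect()
}
```

With rows `a: [1, 2, 3]` and `b: [4]`, `distinct_keys` returns the same set with or without the unnest, while `count_star` returns 4 with it and 2 without it. If a row carries an empty list, its key disappears from the unnested result, which is why the list case needs proof that every row is preserved.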
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.