
OptimizeProjections: safely prune struct-only UNNEST when outputs are unused #20668

Open
kosiew wants to merge 17 commits into apache:main from kosiew:logical-prune-20118

@kosiew kosiew commented Mar 3, 2026

Which issue does this PR close?

Rationale for this change

DataFusion’s logical plans can contain LogicalPlan::Unnest even when none of the unnested outputs are actually used by ancestor operators. In some cases Unnest cannot drop an input row: struct unnest is cardinality-preserving, and list unnest over a provably non-empty, deterministic list expression emits at least one row per input row. In those cases Unnest is semantically redundant when its outputs are dead and no ancestor observes row multiplicity.

However, eliminating Unnest is not always safe:

  • List unnest can change row multiplicity for empty / null lists.
  • Aggregations like COUNT(*), window functions, and similar operators can observe row multiplicity changes.
  • Volatile expressions must not be reordered/hidden/exposed in a way that changes observable results.
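
To make the first two hazards concrete, here is a toy model in Rust. It is a sketch for illustration only, not DataFusion code: `unnest_list`, `unnest_struct`, and the tuple row layout are hypothetical, and the `preserve_nulls` behavior modeled here (a NULL list yields one NULL row when the flag is set, an empty list yields no rows) is an assumption about the semantics being protected.

```rust
/// Toy list UNNEST over rows of (key, optional list).
/// Hypothetical helper; the row layout is an assumption for illustration.
fn unnest_list(
    rows: &[(i32, Option<Vec<i32>>)],
    preserve_nulls: bool,
) -> Vec<(i32, Option<i32>)> {
    let mut out = Vec::new();
    for (key, list) in rows {
        match list {
            // A NULL list survives as a single NULL row only when nulls are preserved.
            None => {
                if preserve_nulls {
                    out.push((*key, None));
                }
            }
            // An empty list contributes no rows; a list of n items contributes n rows.
            Some(items) => {
                for item in items {
                    out.push((*key, Some(*item)));
                }
            }
        }
    }
    out
}

/// Toy struct UNNEST: each input row yields exactly one output row,
/// with the struct fields flattened into columns.
fn unnest_struct(rows: &[(i32, (i32, i32))]) -> Vec<(i32, i32, i32)> {
    rows.iter().map(|(key, (a, b))| (*key, *a, *b)).collect()
}
```

With input keys 1, 2, 3 where key 2 carries an empty list and key 3 a NULL list, `unnest_list` drops key 2 entirely, and drops key 3 as well when `preserve_nulls` is false. Even a consumer that reads only the passthrough key column would see a different result if the Unnest were removed, whereas `unnest_struct` always returns as many rows as it was given.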

This PR adds strict logical-level safety checks and propagates “multiplicity sensitivity” and “volatile ancestor” context through projection pruning so Unnest is removed only when semantics are preserved.
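
The context propagation can be pictured with a small sketch. Everything below is a simplified assumption rather than the PR's code: `Node`, `Context`, `child_context`, and `context_at_leaf` are hypothetical names, the root is treated as multiplicity-sensitive to stay conservative, and a group-keys-only aggregate is assumed to reset sensitivity because it deduplicates its input.

```rust
/// Hypothetical, simplified ancestor kinds; not DataFusion's `LogicalPlan`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Node {
    /// Aggregate with group keys only (duplicates below it are not observable).
    GroupKeysOnly,
    /// Aggregate containing something like COUNT(*).
    CountStar,
    /// Window function.
    Window,
    /// Projection; the flag marks a volatile expression such as random().
    Projection { volatile: bool },
}

/// Context handed from a parent to its child while walking down the plan.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Context {
    multiplicity_sensitive: bool,
    has_volatile_ancestor: bool,
}

/// Derive the child's context from the parent's context and the parent node.
fn child_context(parent: Context, node: Node) -> Context {
    let multiplicity_sensitive = match node {
        Node::GroupKeysOnly => false,
        Node::CountStar | Node::Window => true,
        // A projection neither hides nor introduces sensitivity.
        Node::Projection { .. } => parent.multiplicity_sensitive,
    };
    let volatile_here = matches!(node, Node::Projection { volatile: true });
    Context {
        multiplicity_sensitive,
        // Once a volatile ancestor is seen, every descendant inherits the flag.
        has_volatile_ancestor: parent.has_volatile_ancestor || volatile_here,
    }
}

/// Fold the ancestor chain (root first) into the context seen by the leaf.
fn context_at_leaf(chain: &[Node]) -> Context {
    let root = Context {
        multiplicity_sensitive: true, // conservative assumption at the root
        has_volatile_ancestor: false,
    };
    chain.iter().fold(root, |ctx, node| child_context(ctx, *node))
}
```

In this model an Unnest sitting directly under a group-keys-only aggregate sees an insensitive, non-volatile context, while one under COUNT(*) or under a volatile projection does not.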


What changes are included in this PR?

  • Refactor optimize_projections

    • Extracts dedicated helpers for Aggregate / Window / TableScan projection optimization.
    • Centralizes requirement propagation logic (build_plan_input_requirements, build_all_expr_input_requirements, build_extension_input_requirements, build_join_input_requirements, build_unnest_*).
    • Introduces rewrite_plan_children to reduce duplication and consistently trigger schema recomputation after child rewrites.
    • Makes volatile-context handling explicit via volatile_in_plan and RequiredIndices::with_plan_volatile.
  • Refactor and centralize projection-pruning requirement propagation

    • Adds multiplicity_sensitive and has_volatile_ancestor to RequiredIndices.
    • Threads volatility context (volatile_in_plan) through requirement propagation.
    • Encapsulates common requirement-building logic for plan inputs, extensions, joins, and fallback cases.
  • Safe logical elimination of LogicalPlan::Unnest

    • Adds can_eliminate_unnest gating with strict checks:

      • Parent chain must be multiplicity-insensitive.

      • No volatile ancestor context.

      • All requested outputs are passthrough columns (no unnested outputs required).

      • Row preservation must be proven (no input row may be dropped):

        • struct-only unnest is always row-preserving.
        • list unnest only if inputs are provably non-empty and deterministic (e.g. make_array(1,2,3) or non-empty list literals).
  • Schema correctness improvements

    • After rewriting children, recomputes schema when transformations occur.
  • Tests

    • Adds unit tests for struct vs list unnest pruning and multiplicity-sensitive negative cases.

    • Adds a new SQLLogicTest file optimizer_unnest_prune.slt validating:

      • EXPLAIN plans drop Unnest only in safe cases.
      • Correctness for empty-list/null behavior.
      • Negative coverage for COUNT(*) multiplicity sensitivity.
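
The gating checks listed above can be summarized as a single predicate. The following is a minimal sketch under stated assumptions, not the actual `can_eliminate_unnest` from this PR: the `UnnestCandidate` struct, the `UnnestKind` enum, and the `_sketch` function name are hypothetical, and column requirements are modeled as plain index lists.

```rust
/// Hypothetical classification of what is being unnested.
#[derive(Debug, Clone, Copy, PartialEq)]
enum UnnestKind {
    /// Struct-only unnest: exactly one output row per input row.
    StructOnly,
    /// List unnest, with the facts the optimizer managed to prove about the list.
    List { provably_non_empty: bool, deterministic: bool },
}

/// Hypothetical bundle of everything the gate needs to know.
#[derive(Debug, Clone)]
struct UnnestCandidate {
    multiplicity_sensitive: bool,
    has_volatile_ancestor: bool,
    /// Output column indices requested by the parent.
    required_outputs: Vec<usize>,
    /// Output column indices produced by the unnest itself.
    unnested_outputs: Vec<usize>,
    kind: UnnestKind,
}

/// Returns true only when every safety condition holds.
fn can_eliminate_unnest_sketch(c: &UnnestCandidate) -> bool {
    // 1. The parent chain must not observe row multiplicity.
    // 2. No volatile ancestor context.
    if c.multiplicity_sensitive || c.has_volatile_ancestor {
        return false;
    }
    // 3. Every requested output must be a passthrough column.
    let only_passthrough = c
        .required_outputs
        .iter()
        .all(|idx| !c.unnested_outputs.contains(idx));
    if !only_passthrough {
        return false;
    }
    // 4. No input row may be dropped by the unnest.
    match c.kind {
        UnnestKind::StructOnly => true,
        UnnestKind::List { provably_non_empty, deterministic } => {
            provably_non_empty && deterministic
        }
    }
}
```

The checks are ordered from cheapest to most specific, and any single failure keeps the Unnest in the plan, which matches the conservative intent described above.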

Are these changes tested?

Yes.

  • Rust unit tests in optimize_projections:

    • eliminate_struct_unnest_when_only_group_keys_are_required
    • keep_list_unnest_when_group_keys_are_only_required_outputs
    • keep_unnest_when_count_depends_on_row_multiplicity
    • keep_unnest_when_preserve_nulls_is_disabled
  • SQLLogicTests:

    • datafusion/sqllogictest/test_files/optimizer_unnest_prune.slt

      • Verifies Unnest elimination appears only when safe in EXPLAIN.
      • Includes correctness assertions for row behavior under empty-list and null inputs.

Are there any user-facing changes?

  • Yes (planner/optimizer behavior):

    • For queries where Unnest outputs are unused and row multiplicity is guaranteed unchanged, the logical optimizer can now remove Unnest. This produces smaller logical plans and can reduce unnecessary computation.
  • No SQL/API breaking changes are intended.

    • The rewrite is gated by strict semantic checks (multiplicity sensitivity, null/empty semantics, determinism/volatility).

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 7 commits March 3, 2026 20:02
- Added handling for volatile expressions to impact the optimization process within the `optimize_projections` function.
- Introduced checks for volatile expressions in both plan and ancestor nodes to adjust required indices accordingly.
- Updated `RequiredIndices` struct to track whether it encounters volatile expressions and to handle multiplicity sensitivity.
- Implemented new utility functions to streamline the processing of child requirements and eliminate unnecessary unnesting when certain conditions are met.
- Added unit tests to validate the new functionality related to unnesting and aggregation on volatile expression scenarios.
- Allow elimination of unnest operation for empty lists while preserving nulls.
- Modify the `eliminate_unnest_when_only_group_keys_are_required` test case to specify struct unnest conditions.
- Introduce a new test case `keep_list_unnest_when_group_keys_are_only_required_outputs` to verify unnest behavior when only group keys are required.
- Ensure that the optimization logic correctly handles different unnest scenarios based on list and struct types.
- Introduced new SQL Logic Tests to validate unnest pruning behavior in DataFusion.
- Tests include scenarios with empty lists and null values to ensure correct handling of cardinality-sensitive cases.
- Added explanations for expected logical plans for both aggregation and selection queries.
…projections

- Removed repetitive code for handling volatile ancestors across different input plans.
- Introduced a new helper function `with_volatile_if_needed` to encapsulate the logic of conditionally adding a volatile ancestor.
- Improved code readability and maintainability by reducing duplication in `optimize_projections` method.
…city and volatility

- Introduced methods `for_multiplicity_sensitive_child` and `for_multiplicity_insensitive_child` for better handling of child requirements in `RequiredIndices`.
- Replaced usage of `with_volatile_if_needed` with `with_plan_volatile` and `with_volatile_ancestor_if` for clearer logic when managing volatile context.
- Updated `optimize_projections` function to use new methods, improving code readability and maintainability.
…ts for unnest pruning

- Updated the `rewrite_projection_given_requirements` function to enhance handling of projection requirements based on additional conditions such as projected benefit, multiplicity sensitivity, and volatile ancestors.
- Added a new SQL logic test to validate the pruning of struct unnest in cases where it is cardinality-preserving and outputs are irrelevant.
- Improved comments for clarity on unnest semantics regarding null preservation.
…te_projection_given_requirements function

This change simplifies the logic in the rewrite_projection_given_requirements function by removing the check for projection benefit, which was deemed unnecessary. This helps streamline the code and improve readability.
@github-actions github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels Mar 3, 2026
kosiew added 10 commits March 5, 2026 14:16
… dedicated functions for aggregates, windows, and table scans
Extract repeated child-requirement construction logic into
dedicated helper functions to improve code clarity and
maintainability. Introduce `build_all_expr_input_requirements`,
`build_extension_input_requirements`, and
`build_unnest_fallback_requirements` for streamlined
requirement handling in various components.
Add helper in mod.rs for handling child multiplicity.
Replace duplicate code in aggregate and window paths with
the new helper method, passing in multiplicity sensitivity
based on the presence of expressions. This improves code
readability and maintainability.
Extract shared post-processing into finalize_child_requirements()
to handle multiplicity mode, volatile-ancestor propagation, and
plan-volatile propagation. Update optimize_aggregate_projections
and optimize_window_projections to utilize this helper. Improve
readability with clearer plural naming for new aggregation and
window expressions.
Implement strict proof checks for UNNEST removal. Ensure it is
only eliminated under specific conditions, such as when the
ancestor context is multiplicity-insensitive, the list rows
are provably preserved, and the recursion depth is exactly 1.
Add new optimizer_unnest_prune.slt coverage for unnest
removal in query plans.
@kosiew kosiew changed the title from "OptimizeProjections: prune struct-only UNNEST when outputs are unused and ancestors are multiplicity-insensitive" to "OptimizeProjections: safely prune struct-only UNNEST when outputs are unused" Mar 5, 2026
@kosiew kosiew marked this pull request as ready for review March 5, 2026 10:22
