[WIP][DO-NOT-REVIEW][SPARK-55886][SQL] Add DataFrame.zip for merging column-projected DataFrames #54976

Draft
zhengruifeng wants to merge 12 commits into apache:master from zhengruifeng:df-zip

Add a new DataFrame.zip(other) API that combines columns from two DataFrames that derive from the same base plan through Project chains. The optimizer rewrites the Zip node into a single Project over the shared base plan, and analysis rejects plans that cannot be merged.

Co-authored-by: Isaac
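
To make the intended semantics concrete, here is a minimal sketch in plain Python rather than Spark code. It assumes a toy plan model in which `Relation`, `Project`, `peel_projects`, and `zip_plans` are all hypothetical names invented for illustration, and in which a projection only selects columns by name. It shows the idea from the description: two projection chains over the same base collapse into one projection, and anything else is rejected.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    """Hypothetical leaf plan: a named table with a fixed column list."""
    name: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    """Hypothetical projection that keeps `columns` from `child`."""
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Relation, Project]


def peel_projects(plan: Plan) -> Relation:
    """Walk down a chain of Project nodes and return the base plan below it."""
    while isinstance(plan, Project):
        plan = plan.child
    return plan


def zip_plans(left: Plan, right: Plan) -> Project:
    """Merge two projection chains into a single Project over the shared base."""
    left_base, right_base = peel_projects(left), peel_projects(right)
    if left_base != right_base:
        # Mirrors "analysis rejects plans that cannot be merged".
        raise ValueError("zip: inputs do not derive from the same base plan")
    # Simplification: columns are plain names, so no alias inlining is needed.
    return Project(left.columns + right.columns, left_base)
```

In this toy model, zipping `Project(("a",), Project(("a", "b"), base))` with `Project(("c",), base)` yields `Project(("a", "c"), base)`. The real rule also has to handle aliases and arbitrary expressions, which this sketch leaves out.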

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

zhengruifeng changed the title from "[DO-NOT-REVIEW][SPARK-55886][SQL] Add DataFrame.zip for merging column-projected DataFrames" to "[WIP][DO-NOT-REVIEW][SPARK-55886][SQL] Add DataFrame.zip for merging column-projected DataFrames" on Mar 24, 2026
Zip is now always unresolved (resolved=false). A new ResolveZip
analyzer rule rewrites it into a Project when both children share the
same base plan. Removes the CollapseZip optimizer rule.

Co-authored-by: Isaac
Zip is always unresolved, so deduplication does not help it resolve.
ResolveZip already handles attribute remapping from right base to left
base via sameResult() and AttributeMap.

Co-authored-by: Isaac
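
The remapping step mentioned above can be illustrated with a small toy model in plain Python. This is only a sketch under stated assumptions: `Attr`, `Scan`, `same_result`, `right_to_left`, and `remap` are hypothetical stand-ins, and `same_result` here just compares table and column names while ignoring expression IDs, which loosely mirrors the idea of comparing plans that differ only in their attribute IDs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Attr:
    """Hypothetical attribute: a column name plus a unique expression ID."""
    name: str
    expr_id: int


@dataclass(frozen=True)
class Scan:
    """Hypothetical base plan whose output attributes carry their own IDs."""
    table: str
    output: Tuple[Attr, ...]


def same_result(a: Scan, b: Scan) -> bool:
    """Treat two scans as equivalent when they differ only in expression IDs."""
    return a.table == b.table and (
        [x.name for x in a.output] == [y.name for y in b.output]
    )


def right_to_left(left: Scan, right: Scan) -> Dict[Attr, Attr]:
    """Build a positional map from the right base's attributes to the left's."""
    if not same_result(left, right):
        raise ValueError("bases are not equivalent; cannot remap attributes")
    return dict(zip(right.output, left.output))


def remap(project_list: Tuple[Attr, ...], mapping: Dict[Attr, Attr]) -> Tuple[Attr, ...]:
    """Rewrite references to right-side attributes so they point at the left base."""
    return tuple(mapping.get(attr, attr) for attr in project_list)
```

With such a map in hand, the right side's project list can be rewritten against the left base before the two lists are concatenated, which is why a separate deduplication pass adds nothing for this node.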
No longer referenced after removing Zip from DeduplicateRelations
and changing resolved to always false.

Co-authored-by: Isaac
…tors

Add Project.isScalar which returns true when the project list contains
only 1:1 row mapping expressions (no Generator, AggregateExpression,
or WindowExpression). ResolveZip now uses this to reject non-scalar
Projects rather than inline Generator checks.

In practice, ExtractGenerator rewrites Projects with Generators before
ResolveZip runs, so this is defense-in-depth.

Co-authored-by: Isaac
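
As a rough illustration of the scalar check described in this commit, the following plain-Python sketch walks a toy expression tree and reports whether any node is of a kind that breaks the 1:1 row mapping. The `Expr` class, the `kind` strings, and both helper functions are assumptions made for this example; they are not the Catalyst classes themselves.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """Hypothetical expression node: a kind label plus child expressions."""
    kind: str
    children: Tuple["Expr", ...] = ()


# Kinds that can change the number of rows, per the commit message above.
NON_SCALAR_KINDS = frozenset(
    {"Generator", "AggregateExpression", "WindowExpression"}
)


def contains_non_scalar(expr: Expr) -> bool:
    """Return True if this node or any descendant is a non-scalar kind."""
    return expr.kind in NON_SCALAR_KINDS or any(
        contains_non_scalar(child) for child in expr.children
    )


def is_scalar(project_list: Tuple[Expr, ...]) -> bool:
    """A project list is scalar when every expression maps one row to one row."""
    return not any(contains_non_scalar(expr) for expr in project_list)
```

The check has to recurse because a non-scalar expression is usually wrapped, for example inside an alias, rather than sitting at the top of the project list.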
The childrenResolved guard already ensures children are resolved
Projects. Since Project.resolved rejects Generator,
AggregateExpression, and WindowExpression, the scalar (1:1 mapping)
property is guaranteed by the time ResolveZip fires.

Co-authored-by: Isaac
Project.resolved catches Generator, AggregateExpression, and
WindowExpression, but non-scalar Python UDFs (e.g. GROUPED_MAP)
can slip through. Add an allScalar guard using
PythonUDF.isScalarPythonUDF to reject them.

Co-authored-by: Isaac
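
The guard described here can be sketched in the same toy style. Everything below is hypothetical: `UdfExpr`, the string eval-type labels, and `all_scalar` are illustrative names, and the set of scalar eval types is an assumption rather than a copy of Spark's actual list. The point is only that a Python UDF node is accepted when its eval type is scalar and rejected otherwise, wherever it appears in the tree.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UdfExpr:
    """Hypothetical expression node; `eval_type` is set only for Python UDFs."""
    kind: str
    children: Tuple["UdfExpr", ...] = ()
    eval_type: Optional[str] = None


# Assumed labels for row-preserving UDF flavours (illustrative, not Spark's constants).
SCALAR_EVAL_TYPES = frozenset({"BATCHED", "SCALAR_PANDAS"})


def is_non_scalar_python_udf(expr: UdfExpr) -> bool:
    """True for a Python UDF whose eval type is not one of the scalar flavours."""
    return expr.kind == "PythonUDF" and expr.eval_type not in SCALAR_EVAL_TYPES


def all_scalar(project_list: Tuple[UdfExpr, ...]) -> bool:
    """Reject a project list if any nested node is a non-scalar Python UDF."""
    def ok(expr: UdfExpr) -> bool:
        return not is_non_scalar_python_udf(expr) and all(ok(c) for c in expr.children)

    return all(ok(expr) for expr in project_list)
```

In this model a `GROUPED_MAP` UDF nested under an alias fails the guard, while a `SCALAR_PANDAS` UDF passes.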
Move the scaladoc to the abstract Dataset in sql/api and use
@inheritdoc in the classic implementation.

Co-authored-by: Isaac
Spark Connect does not yet support the Zip logical plan. Add a
placeholder implementation that throws UnsupportedOperationException.

Co-authored-by: Isaac