[SPARK-45509][SQL][3.5] Fix df column reference behavior for Spark Connect #43699

cloud-fan · 2023-11-07T08:55:45Z

Backport of #43465 to 3.5.

What changes were proposed in this pull request?

This PR fixes several column resolution problems in Spark Connect to bring its behavior closer to classic Spark SQL. Unfortunately, some behavior differences remain in corner cases.

1. Resolve DataFrame column references in both `resolveExpressionByPlanChildren` and `resolveExpressionByPlanOutput`. Previously, this was done only in `resolveExpressionByPlanChildren`.
2. When the plan id has multiple matches, fail with `AMBIGUOUS_COLUMN_REFERENCE`.
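The second rule can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's analyzer code: `PlanNode`, `find_by_plan_id`, `resolve_df_column`, and the `AmbiguousColumnReference` exception are hypothetical names invented for the illustration, and the real resolution logic in Spark may differ in its details. It assumes that a DataFrame column reference carries the plan id of the DataFrame it came from, and that resolution searches the plan tree for nodes with that id.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class AmbiguousColumnReference(Exception):
    """Hypothetical stand-in for the AMBIGUOUS_COLUMN_REFERENCE error."""


@dataclass
class PlanNode:
    # Toy logical plan node: an optional plan id, output column names, children.
    plan_id: Optional[int]
    output: List[str]
    children: List["PlanNode"] = field(default_factory=list)


def find_by_plan_id(node: PlanNode, plan_id: int) -> List[PlanNode]:
    """Collect every node in the tree that carries the given plan id."""
    if node.plan_id == plan_id:
        # A matching node is returned as-is; its subtree is not searched.
        return [node]
    matches: List[PlanNode] = []
    for child in node.children:
        matches.extend(find_by_plan_id(child, plan_id))
    return matches


def resolve_df_column(root: PlanNode, plan_id: int, name: str) -> str:
    """Resolve a column reference tagged with a plan id (toy version)."""
    matches = find_by_plan_id(root, plan_id)
    if len(matches) > 1:
        # Several subtrees carry the same plan id, e.g. after a self-join.
        raise AmbiguousColumnReference(
            f"column {name!r}: plan id {plan_id} matches {len(matches)} nodes"
        )
    if not matches or name not in matches[0].output:
        raise LookupError(f"cannot resolve column {name!r} with plan id {plan_id}")
    return f"{name}#plan{plan_id}"


# Two distinct DataFrames joined together: the reference is unambiguous.
df = PlanNode(plan_id=1, output=["id", "v"])
other = PlanNode(plan_id=2, output=["id", "w"])
join = PlanNode(plan_id=3, output=["id", "v", "id", "w"], children=[df, other])
resolved = resolve_df_column(join, 1, "id")

# A self-join: both sides carry plan id 1, so the reference is ambiguous.
self_join = PlanNode(plan_id=4, output=["id", "v", "id", "v"], children=[df, df])
```

In this model, a reference to `id` from `df` resolves cleanly in `join`, while the same reference in `self_join` raises the ambiguity error because the plan id matches both sides.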

Why are the changes needed?

To fix behavior differences between Spark Connect and classic Spark SQL.

Does this PR introduce any user-facing change?

Yes, for the Spark Connect Scala client.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

This PR fixes a few problems of column resolution for Spark Connect, to make the behavior closer to classic Spark SQL (unfortunately we still have some behavior differences in corner cases).

1. resolve df column references in both `resolveExpressionByPlanChildren` and `resolveExpressionByPlanOutput`. Previously it's only in `resolveExpressionByPlanChildren`.
2. when the plan id has multiple matches, fail with `AMBIGUOUS_COLUMN_REFERENCE`

fix behavior differences between spark connect and classic spark sql

Yes, for spark connect scala client

new tests

no

Closes apache#43465 from cloud-fan/column.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…nnect

backport #43465 to 3.5

### What changes were proposed in this pull request?

This PR fixes a few problems of column resolution for Spark Connect, to make the behavior closer to classic Spark SQL (unfortunately we still have some behavior differences in corner cases).

1. resolve df column references in both `resolveExpressionByPlanChildren` and `resolveExpressionByPlanOutput`. Previously it's only in `resolveExpressionByPlanChildren`.
2. when the plan id has multiple matches, fail with `AMBIGUOUS_COLUMN_REFERENCE`

### Why are the changes needed?

fix behavior differences between spark connect and classic spark sql

### Does this PR introduce _any_ user-facing change?

Yes, for spark connect scala client

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43699 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

zhengruifeng · 2023-11-08T17:19:01Z

merged to 3.5

github-actions bot added the SQL, DOCS, CORE, PYTHON, PANDAS API ON SPARK, and CONNECT labels Nov 7, 2023

cloud-fan mentioned this pull request Nov 7, 2023

[SPARK-45509][SQL] Fix df column reference behavior for Spark Connect #43465

Closed

zhengruifeng approved these changes Nov 8, 2023


zhengruifeng closed this Nov 8, 2023
