[SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join #44532

zhengruifeng · 2023-12-29T06:17:08Z

What changes were proposed in this pull request?

Fix the logic of ambiguous column detection in Spark Connect.

Why are the changes needed?

In [24]: df1 = spark.range(10).withColumn("a", sf.lit(0))

In [25]: df2 = df1.withColumnRenamed("a", "b")

In [26]: df1.join(df2, df1["a"] == df2["b"])
Out[26]: 23/12/22 09:33:28 ERROR ErrorUtils: Spark Connect RPC error during: analyze. UserId: ruifeng.zheng. SessionId: eaa2161f-4b64-4dbf-9809-af6b696d3005.
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_COLUMN_REFERENCE] Column a is ambiguous. It's because you joined several DataFrame together, and some of these DataFrames are the same.
This column points to one of the DataFrame but Spark is unable to figure out which one.
Please alias the DataFrames with different names via DataFrame.alias before joining them,
and specify the column using qualified name, e.g. df.alias("a").join(df.alias("b"), col("a.id") > col("b.id")). SQLSTATE: 42702
	at org.apache.spark.sql.catalyst.analysis.ColumnResolutionHelper.findPlanById(ColumnResolutionHelper.scala:555)
	at

Does this PR introduce any user-facing change?

Yes, it fixes a bug.

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

zhengruifeng · 2023-12-30T10:23:26Z

cc @cloud-fan I think it is ready for the initial review

zhengruifeng · 2023-12-30T10:24:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

+          .flatMap(resolveUnresolvedAttributeByPlan(u, _, isMetadataAccess))
+        if (isMetadataAccess) {
+          // NOTE: A metadata column might appear in `output` instead of `metadataOutput`.
+          val metadataOutputSet = child.outputSet ++ AttributeSet(child.metadataOutput)


This needs to filter with output + metadataOutput; otherwise the metadata column tests will fail.
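To illustrate the filter described in this comment, here is a minimal Python sketch (not the Scala code from the patch). It models attributes as plain strings; `prune_candidates` and its parameters are hypothetical names chosen for this example. The only behavior taken from the diff is that, for metadata access, candidates are checked against the union of the child's output and its metadata output.

```python
from typing import Iterable, List


def prune_candidates(
    candidates: Iterable[str],
    output: Iterable[str],
    metadata_output: Iterable[str],
    is_metadata_access: bool,
) -> List[str]:
    """Keep only the candidates that the child plan actually exposes.

    Hypothetical helper: attributes are plain strings here, whereas the
    real analyzer works with attribute sets keyed by expression id.
    """
    allowed = set(output)
    if is_metadata_access:
        # A metadata column might appear in `output` instead of
        # `metadata_output`, so the filter uses the union of both.
        allowed |= set(metadata_output)
    return [c for c in candidates if c in allowed]


# Metadata access: a column that lives only in metadata_output survives.
kept = prune_candidates(["_metadata"], ["id"], ["_metadata"], True)

# Regular access: the same candidate is dropped, because it is not in output.
dropped = prune_candidates(["_metadata"], ["id"], ["_metadata"], False)
```

Filtering with the output alone would drop a metadata column that has not yet been added to the output, which is consistent with the metadata column test failures mentioned above.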

cloud-fan · 2024-01-02T13:17:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala


-    val isMetadataAccess = u.getTagValue(LogicalPlan.IS_METADATA_COL).isDefined
+  private def resolveUnresolvedAttributeByPlanId(


We should merge this with the other resolveUnresolvedAttributeByPlanId. The overall algorithm should be:

1. Traverse the plan tree bottom-up to find the matching DataFrame plan.
2. Resolve the column and propagate it up.
3. During propagation, prune the resolved columns with the plan's output. This should happen at every plan node, not just at the top plan node's children.
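The three steps above can be sketched on a toy plan tree. This is a Python illustration under stated assumptions, not the Scala implementation merged in this PR: `Attr`, `Plan`, `AmbiguousColumn`, and `resolve_by_plan_id` are hypothetical names, and the `prune` flag exists only to show what happens when step 3 is skipped. The tree mirrors the self join from the PR description, where df2 renames column a to b on top of df1.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_next_expr_id = count(1)


@dataclass(frozen=True)
class Attr:
    """A resolved column: a name plus a unique expression id."""
    name: str
    expr_id: int


def attr(name: str) -> Attr:
    return Attr(name, next(_next_expr_id))


@dataclass
class Plan:
    """Toy logical plan node; plan_id is set only on DataFrame roots."""
    output: List[Attr]
    children: List["Plan"] = field(default_factory=list)
    plan_id: Optional[int] = None


class AmbiguousColumn(Exception):
    pass


def resolve_by_plan_id(name: str, plan_id: int, plan: Plan,
                       prune: bool = True) -> Optional[Attr]:
    # Steps 1 and 2: at the matching DataFrame plan, resolve the column
    # against that plan's own output.
    if plan.plan_id == plan_id:
        return next((a for a in plan.output if a.name == name), None)
    # Step 3: resolve in each child, then keep a candidate only if this
    # node still exposes it in its output.
    candidates = []
    for child in plan.children:
        found = resolve_by_plan_id(name, plan_id, child, prune)
        if found is not None and (not prune or found in plan.output):
            candidates.append(found)
    if len(candidates) > 1:
        raise AmbiguousColumn(name)
    return candidates[0] if candidates else None


# df1 = spark.range(10).withColumn("a", ...); df2 = df1.withColumnRenamed("a", "b")
id_col, a_col, b_col = attr("id"), attr("a"), attr("b")
df1 = Plan(output=[id_col, a_col], plan_id=1)
df2 = Plan(output=[id_col, b_col], children=[df1], plan_id=2)
join = Plan(output=[id_col, a_col, id_col, b_col], children=[df1, df2])

# df1["a"] resolves: the copy of df1 under df2 is pruned at the rename,
# because df2's output no longer contains column a.
resolved = resolve_by_plan_id("a", 1, join)
```

In this toy model, calling `resolve_by_plan_id("a", 1, join, prune=False)` raises `AmbiguousColumn`, since df1 is reachable through both join children. Pruning only at the top node's children would not help in deeper trees, which is why the comment asks for pruning at each plan node.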

zhengruifeng · 2024-01-10T02:39:27Z

Thanks @cloud-fan for the guidance and review.

merged to master

github-actions bot added SQL PYTHON labels Dec 29, 2023

zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from ad7a53e to a154b93 Compare December 29, 2023 08:24

github-actions bot added the CONNECT label Dec 29, 2023

zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from 333c038 to 50939b9 Compare December 30, 2023 02:06

zhengruifeng marked this pull request as ready for review December 30, 2023 10:22

zhengruifeng changed the title ~~[WIP][SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join~~ [SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join Dec 30, 2023

zhengruifeng commented Dec 30, 2023

cloud-fan reviewed Jan 2, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

cloud-fan reviewed Jan 2, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from 6e8cf9c to 5c07a39 Compare January 2, 2024 11:58

cloud-fan reviewed Jan 2, 2024

zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from 966ba48 to c2e235c Compare January 3, 2024 01:37

zhengruifeng commented Jan 3, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

cloud-fan reviewed Jan 3, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

zhengruifeng force-pushed the sql_connect_find_plan_id branch from c2e235c to 43387e6 Compare January 3, 2024 09:42

github-actions bot added the DOCS label Jan 3, 2024

zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from b093a13 to 1daf298 Compare January 4, 2024 05:46

github-actions bot added the PANDAS API ON SPARK label Jan 4, 2024

zhengruifeng commented Jan 4, 2024

python/pyspark/pandas/base.py

cloud-fan reviewed Jan 4, 2024

common/utils/src/main/resources/error/error-classes.json

cloud-fan reviewed Jan 4, 2024

common/utils/src/main/resources/error/error-classes.json

cloud-fan reviewed Jan 4, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

cloud-fan reviewed Jan 4, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala

zhengruifeng and others added 25 commits January 10, 2024 08:19

fix

59b7910

update update

fix test

b51af34

fix test

a74d3a5

try to fix metadata column test

cc16486

nit try to fix metadata column test II

nit

2bddca6

try to address

ef0ab5e

try to address

try to address II

852e918

try to address

address comments

2013fc1

fix

bbd6c14

fix pandas api on connect

4c9271e

simplfy pandas fix

f470c68

Update common/utils/src/main/resources/error/error-classes.json

48e4add

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

Update common/utils/src/main/resources/error/error-classes.json

dee19b8

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/anal…

73f028f

…ysis/ColumnResolutionHelper.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

adjust

725fe6b

small opt

96c6b49

add error methods

a82d966

seq of plans

500f8e8

filter by top level plans

3aafb63

always filter with p.outputset

be6f02f

rewrite frame._sort

92a9575

return none for missing column

20852f4

nit nit

address comments

47d5b62

rename and inline

776f153

add detail comments

c91af56

zhengruifeng force-pushed the sql_connect_find_plan_id branch from 2585c73 to c91af56 Compare January 10, 2024 00:20

zhengruifeng closed this in 686f428 Jan 10, 2024

zhengruifeng deleted the sql_connect_find_plan_id branch January 10, 2024 02:39
