[SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join #44532

Closed

zhengruifeng
Contributor

What changes were proposed in this pull request?

Fix the ambiguous column detection logic in Spark Connect.

Why are the changes needed?

In [24]: df1 = spark.range(10).withColumn("a", sf.lit(0))

In [25]: df2 = df1.withColumnRenamed("a", "b")

In [26]: df1.join(df2, df1["a"] == df2["b"])
Out[26]: 23/12/22 09:33:28 ERROR ErrorUtils: Spark Connect RPC error during: analyze. UserId: ruifeng.zheng. SessionId: eaa2161f-4b64-4dbf-9809-af6b696d3005.
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_COLUMN_REFERENCE] Column a is ambiguous. It's because you joined several DataFrame together, and some of these DataFrames are the same.
This column points to one of the DataFrame but Spark is unable to figure out which one.
Please alias the DataFrames with different names via DataFrame.alias before joining them,
and specify the column using qualified name, e.g. df.alias("a").join(df.alias("b"), col("a.id") > col("b.id")). SQLSTATE: 42702
	at org.apache.spark.sql.catalyst.analysis.ColumnResolutionHelper.findPlanById(ColumnResolutionHelper.scala:555)
	at 

Does this PR introduce any user-facing change?

Yes. It fixes a bug in Spark Connect's ambiguous column detection.

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from ad7a53e to a154b93 Compare December 29, 2023 08:24
@zhengruifeng zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from 333c038 to 50939b9 Compare December 30, 2023 02:06
@zhengruifeng zhengruifeng marked this pull request as ready for review December 30, 2023 10:22
@zhengruifeng zhengruifeng changed the title [WIP][SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join [SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join Dec 30, 2023
@zhengruifeng
Contributor Author

cc @cloud-fan, I think this is ready for an initial review.

.flatMap(resolveUnresolvedAttributeByPlan(u, _, isMetadataAccess))
if (isMetadataAccess) {
// NOTE: A metadata column might appear in `output` instead of `metadataOutput`.
val metadataOutputSet = child.outputSet ++ AttributeSet(child.metadataOutput)
Contributor Author

We need to filter with output + metadataOutput; otherwise the metadata column tests fail.
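A toy illustration of that filter in plain Python, not the Scala code from this PR. The function `prune_candidates` and its parameters are hypothetical names chosen for the sketch, and columns are modelled as plain strings rather than Spark attributes.

```python
from typing import List


def prune_candidates(candidates: List[str],
                     output: List[str],
                     metadata_output: List[str],
                     is_metadata_access: bool) -> List[str]:
    """Keep only the candidates that the child plan still exposes.

    Hypothetical helper: for a metadata access, the allowed set is the
    union of the regular output and the metadata output.
    """
    allowed = set(output)
    if is_metadata_access:
        allowed |= set(metadata_output)
    return [c for c in candidates if c in allowed]


# A metadata column that lives only in the metadata output.
output = ["id", "a"]
metadata_output = ["_metadata"]

kept = prune_candidates(["_metadata"], output, metadata_output, True)
dropped = prune_candidates(["_metadata"], output, metadata_output, False)
```

Here `kept` still contains the metadata column, while `dropped` is empty because the filter only looked at the regular output.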

@zhengruifeng zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from 6e8cf9c to 5c07a39 Compare January 2, 2024 11:58

val isMetadataAccess = u.getTagValue(LogicalPlan.IS_METADATA_COL).isDefined
private def resolveUnresolvedAttributeByPlanId(
Contributor

We should merge this with the other resolveUnresolvedAttributeByPlanId. The overall algorithm should be:

  1. Traverse the plan tree bottom-up to find the matching DataFrame plan.
  2. Resolve the column there and propagate it up.
  3. During propagation, prune the resolved columns against each plan's output. This should happen at every plan node, not just at the top plan node's children.
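The three steps above can be sketched with a toy plan tree in plain Python. This is a simplified model under stated assumptions, not the Scala implementation in this PR: `Plan`, `AmbiguousColumn`, and `resolve` are hypothetical names, and columns are matched by name, whereas Spark compares resolved attributes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Plan:
    """Toy logical plan node (hypothetical; not Spark's LogicalPlan)."""
    name: str
    output: List[str]                       # column names visible above this node
    children: List["Plan"] = field(default_factory=list)
    plan_id: Optional[int] = None           # id of the DataFrame that built this node


class AmbiguousColumn(Exception):
    """Raised when more than one child yields a surviving candidate."""


def resolve(plan: Plan, col: str, plan_id: int, prune: bool = True) -> Optional[str]:
    # Step 1: locate the DataFrame plan that the column reference points at.
    if plan.plan_id == plan_id:
        # Step 2: resolve the column against the matching plan's output.
        return col if col in plan.output else None

    # Recurse first, so results flow from the leaves toward the root.
    found = [r for r in (resolve(c, col, plan_id, prune) for c in plan.children)
             if r is not None]
    if len(found) > 1:
        raise AmbiguousColumn(col)
    if not found:
        return None

    # Step 3: at every node on the way up, drop a candidate that this
    # node no longer exposes (for example, after a rename).
    if prune and found[0] not in plan.output:
        return None
    return found[0]


# Mirrors the shape of the self join from the PR description:
# df2 is df1 with column "a" renamed to "b".
rng = Plan("Range", ["id"], plan_id=0)
df1 = Plan("Project", ["id", "a"], [rng], plan_id=1)
df2 = Plan("Project", ["id", "b"], [df1], plan_id=2)
join = Plan("Join", df1.output + df2.output, [df1, df2])

print(resolve(join, "a", plan_id=1))   # prints: a
print(resolve(join, "b", plan_id=2))   # prints: b
```

With `prune=False`, the same lookup for `"a"` reaches `df1` through both join children and raises `AmbiguousColumn`, which resembles the failure shown in the PR description. Pruning at the rename node removes the right-hand candidate before it reaches the join.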

@zhengruifeng zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from 966ba48 to c2e235c Compare January 3, 2024 01:37
@github-actions github-actions bot added the DOCS label Jan 3, 2024
@zhengruifeng zhengruifeng force-pushed the sql_connect_find_plan_id branch 2 times, most recently from b093a13 to 1daf298 Compare January 4, 2024 05:46
@zhengruifeng
Contributor Author

Thanks @cloud-fan for the guidance and review.

Merged to master.

@zhengruifeng zhengruifeng deleted the sql_connect_find_plan_id branch January 10, 2024 02:39