
[SPARK-45509][SQL] Fix df column reference behavior for Spark Connect #43465

Closed. Wants to merge 5 commits.

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

This PR fixes several problems with column resolution for Spark Connect, to make the behavior closer to classic Spark SQL (unfortunately, some behavior differences remain in corner cases).

  1. Resolve DataFrame column references in both `resolveExpressionByPlanChildren` and `resolveExpressionByPlanOutput`. Previously they were resolved only in `resolveExpressionByPlanChildren`.
  2. When the plan ID has multiple matches, fail with `AMBIGUOUS_COLUMN_REFERENCE`.
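The zero/one/many-matches rule described above can be sketched as a toy model in plain Python (no Spark involved; `PlanNode`, `find_plan_by_id`, and `resolve` are illustrative names, not the actual Spark internals):

```python
# Toy model of plan-ID based column resolution (illustrative only; the real
# logic operates on Catalyst LogicalPlan nodes inside ColumnResolutionHelper).

class PlanNode:
    def __init__(self, plan_id, children=()):
        self.plan_id = plan_id          # the id attached when the DataFrame was created
        self.children = list(children)  # child plan nodes

def find_plan_by_id(plan, target_id):
    """Return every node in the sub-plan whose plan_id matches target_id."""
    matches = [plan] if plan.plan_id == target_id else []
    for child in plan.children:
        matches.extend(find_plan_by_id(child, target_id))
    return matches

def resolve(plan, target_id):
    """Mirror the zero/one/many-matches behavior described above."""
    matches = find_plan_by_id(plan, target_id)
    if not matches:
        raise ValueError("illegal column reference: no matching plan")
    if len(matches) > 1:
        raise ValueError("AMBIGUOUS_COLUMN_REFERENCE")
    return matches[0]

# df1.join(df1, ...): the same plan id appears on both join sides -> ambiguous.
self_join = PlanNode(99, [PlanNode(1), PlanNode(1)])
```

The real implementation traverses Catalyst plans and raises proper `AnalysisException`s; the toy only mirrors the match-counting logic.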

Why are the changes needed?

Fix behavior differences between Spark Connect and classic Spark SQL.

Does this PR introduce any user-facing change?

Yes, for the Spark Connect Scala client.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.


val e2 = intercept[AnalysisException] {
  // df1("i") is ambiguous because df1 appears on both join sides.
  df1.join(df1, df1("i") === 1).collect()
}
cloud-fan (Contributor Author):

Classic Spark SQL thinks this is not ambiguous. That's probably a bug; I'll fix it later.

checkSameResult(
  Seq(Row("a")),
  // df1_filter("i") is not ambiguous because df1_filter does not appear on the join's left side.
  df1.join(df1_filter, df1_filter("i") === 1).select(df1_filter("j")))
cloud-fan (Contributor Author):

Classic Spark SQL thinks this is ambiguous, as it uses `AttributeReference` directly and we cannot re-resolve it. Spark Connect uses `UnresolvedAttribute`, which binds lazily and works fine in this case.

cloud-fan (Contributor Author):

cc @zhengruifeng @hvanhovell @HyukjinKwon

I think the Spark Connect behavior is very reasonable now. We should move Classic Spark SQL to this behavior as well in the future.

@@ -539,4 +533,28 @@ trait ColumnResolutionHelper extends Logging {
None
}
}

private def findPlanById(
Member:

can we do tailrec?

cloud-fan (Contributor Author):

I can't find a way to make it tail-recursive: I need to inspect the return values of `plan.children.flatMap(findPlanById(u, id, _))`.
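For illustration, the same kind of tree traversal can be flattened with an explicit worklist, which avoids deep call stacks without needing tail-call support (a Python sketch under the assumption of a simple node type; `Node` and the function name are invented here, not Spark code):

```python
from collections import namedtuple

# Illustrative node type: a plan id plus child nodes (not Spark's LogicalPlan).
Node = namedtuple("Node", ["plan_id", "children"])

def find_plan_by_id_iterative(root, target_id):
    """Collect every node whose plan_id matches target_id.

    An explicit stack replaces the call stack, so no tail recursion is
    required even though the naive recursive version is not tail-recursive
    (it must combine results from all children after the recursive calls).
    """
    matches, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.plan_id == target_id:
            matches.append(node)
        stack.extend(node.children)
    return matches
```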

// 1. extract the attached plan id from UnresolvedAttribute;
// 2. traverse the query plan top-down to find the plan node that matches the plan id;
// 3. if no matching node can be found, fail the analysis due to an illegal reference;
// 4. if more than one matching node is found, fail due to an ambiguous column reference;
Member:

Just for my own understanding: does it mean that it fails when more than one matching node is found among children at the same level?

cloud-fan (Contributor Author):

It fails when more than one matching node is found within the sub-plan under the plan node that holds the column reference.

@@ -31,6 +31,15 @@
],
"sqlState" : "42702"
},
"AMBIGUOUS_COLUMN_REFERENCE" : {
"message" : [
"Column <name> is ambiguous. It's because you joined several DataFrames together, and some of these DataFrames are the same.",
Contributor:

Can we reuse AMBIGUOUS_REFERENCE (42704)?

And I guess another cause may be DataFrame creation with duplicated column names:

scala> val df = Seq((1,2), (3,4)).toDF("a","a")
val df: org.apache.spark.sql.DataFrame = [a: int, a: int]

cloud-fan (Contributor Author):

It's different. This error is for finding more than one matching DataFrame, while AMBIGUOUS_REFERENCE is for finding more than one matching attribute among the candidates.

Contributor:

sounds good.

@github-actions github-actions bot added the DOCS label Oct 23, 2023
- self._func = function._build_common_inline_user_defined_function(*cols)
+ # The function takes the entire DataFrame as input, no need to do
+ # column binding (no input columns).
+ self._func = function._build_common_inline_user_defined_function()

Contributor:

It seems this field is not used in Connect.

private def transformCoGroupMap(rel: proto.CoGroupMap): LogicalPlan = {
  val commonUdf = rel.getFunc
  commonUdf.getFunctionCase match {
    case proto.CommonInlineUserDefinedFunction.FunctionCase.SCALAR_SCALA_UDF =>
      transformTypedCoGroupMap(rel, commonUdf)
    case proto.CommonInlineUserDefinedFunction.FunctionCase.PYTHON_UDF =>
      val pythonUdf = transformPythonUDF(commonUdf)
      val inputCols =
        rel.getInputGroupingExpressionsList.asScala.toSeq.map(expr =>
          Column(transformExpression(expr)))
      val otherCols =
        rel.getOtherGroupingExpressionsList.asScala.toSeq.map(expr =>
          Column(transformExpression(expr)))
      val input = Dataset
        .ofRows(session, transformRelation(rel.getInput))
        .groupBy(inputCols: _*)
      val other = Dataset
        .ofRows(session, transformRelation(rel.getOther))
        .groupBy(otherCols: _*)
      input.flatMapCoGroupsInPandas(other, pythonUdf).logicalPlan
    case _ =>
      throw InvalidPlanInput(
        s"Function with ID: ${commonUdf.getFunctionCase.getNumber} is not supported")
  }
}
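The cogroup semantics being wired up here can be illustrated with a small plain-Python toy (no Spark or pandas; `cogroup_apply` and its parameters are invented for illustration, not the actual API):

```python
from collections import defaultdict

def cogroup_apply(left, right, key_fn, func):
    """Group both datasets by key, then apply func(key, left_rows, right_rows)
    for every key present on either side; a toy analogue of cogroup-map,
    where keys missing on one side get an empty group.
    """
    groups_l, groups_r = defaultdict(list), defaultdict(list)
    for row in left:
        groups_l[key_fn(row)].append(row)
    for row in right:
        groups_r[key_fn(row)].append(row)
    out = []
    # Iterate keys from both sides so one-sided groups are not dropped.
    for key in sorted(set(groups_l) | set(groups_r)):
        out.extend(func(key, groups_l[key], groups_r[key]))
    return out
```

In the real API, the grouping expressions come from the two `groupBy` calls above, and `func` is the user's Python UDF applied to each pair of cogrouped pandas DataFrames.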

cc @xinrong-meng

cloud-fan (Contributor Author):

Yeah, that's why this bug was hidden. We simply picked the first matching DataFrame in the query plan to resolve columns, which is wrong; now we throw AMBIGUOUS_COLUMN_REFERENCE.

dongjoon-hyun (Member):

Could you re-trigger the failed catalyst test pipeline?

cloud-fan (Contributor Author) commented Nov 7, 2023:

thanks for the reviews, merging to master!

@cloud-fan cloud-fan closed this in 8c41629 Nov 7, 2023
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Nov 7, 2023

Closes apache#43465 from cloud-fan/column.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan (Contributor Author):

The 3.5 backport PR: #43699

zhengruifeng pushed a commit that referenced this pull request Nov 8, 2023
backport #43465 to 3.5


Closes #43699 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>