
[SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions #45446

Closed

Conversation


@ahshahid ahshahid commented Mar 8, 2024

What changes were proposed in this pull request?

The basis of the change is to distinguish and resolve ambiguity based on the Dataset from which the user extracted the column, instead of on ExprIds.
That results in behaviour which is consistent, intuitive, and logically correct.
The current code mixes the resolution basis, sometimes using the ExprId and sometimes, indirectly, the Dataset ID.
This PR uses the Dataset ID present in the AttributeReference's metadata to check whether the ambiguity can be resolved logically and sensibly against the Dataset IDs of the joining Datasets.

The PR attempts to fix the issue in the following way:

If a projection field contains an AttributeReference that is not found in the incoming AttributeSet, and the AttributeReference's metadata contains the Dataset ID info, then the AttributeReference is converted into a new UnresolvedAttributeWithTag, with the original AttributeReference passed as a parameter.

In ColumnResolutionHelper, a new resolution logic is used to resolve the UnresolvedAttributeWithTag:

1. The Dataset ID is extracted from the original AttributeReference's metadata.
2. The first BinaryNode contained in the LogicalPlan containing this unresolved attribute is found.
3. The unary nodes of its right leg and left leg are checked for the presence of the AttributeReference's Dataset ID, using TreeNodeTag("__datasetid").
4. If both legs contain the Dataset ID at the same relative depth, or neither leg contains it, a resolution exception is thrown.
5. Otherwise, the leg that contains the Dataset ID is used to resolve the attribute.

A rough sketch of this logic appears below.
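For illustration only, here is a minimal standalone sketch of the leg-based resolution described above. It models the plan with a toy ADT rather than Spark's internal LogicalPlan classes; only the "__datasetid" tag name comes from this PR, and all other names here are hypothetical. Picking the shallower leg when both match at different depths follows the "shortest depth" idea discussed later in this thread.

```scala
// Toy plan model standing in for Spark's LogicalPlan hierarchy (illustrative only).
sealed trait Plan { def datasetIdTag: Option[Long] }
case class Leaf(datasetIdTag: Option[Long] = None) extends Plan
case class Unary(child: Plan, datasetIdTag: Option[Long] = None) extends Plan
case class Binary(left: Plan, right: Plan, datasetIdTag: Option[Long] = None) extends Plan

object LegResolution {
  // Depth (1-based) at which `datasetId` is tagged on the chain of unary nodes
  // under one leg of the join; None if the id is not found before a leaf or a nested join.
  def tagDepth(plan: Plan, datasetId: Long, depth: Int = 1): Option[Int] =
    if (plan.datasetIdTag.contains(datasetId)) Some(depth)
    else plan match {
      case Unary(child, _) => tagDepth(child, datasetId, depth + 1)
      case _               => None // stop at leaves and nested joins
    }

  // Resolve against the first binary (join) node: fail when both legs carry the
  // dataset id at the same relative depth or neither carries it; otherwise use
  // the matching leg (the shallower one if both match, per the depth rule above).
  def resolveLeg(join: Binary, datasetId: Long): Either[String, Plan] =
    (tagDepth(join.left, datasetId), tagDepth(join.right, datasetId)) match {
      case (Some(l), Some(r)) if l == r => Left("ambiguous: both legs carry the dataset id at the same depth")
      case (None, None)                 => Left("cannot resolve: neither leg carries the dataset id")
      case (Some(l), Some(r))           => Right(if (l < r) join.left else join.right)
      case (Some(_), None)              => Right(join.left)
      case (None, Some(_))              => Right(join.right)
    }
}
```

Depth 1 here corresponds to the direct legs of the topmost join, matching this PR's restriction to columns extracted from the top-level joining Datasets.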

Why are the changes needed?

While fixing a bug where an Ambiguous Column exception was raised (for code that worked fine in earlier versions of Spark), I came across multiple situations in which a nested joined Dataset involving self joins works but fails when the join order is changed, or in which a column extracted from a Dataset involved in the join is treated as unambiguous when used in the join condition but causes an ambiguity exception when used in the projection (select).
There is also an existing test which I believe is falsely passing, in that the attribute is not being resolved to the expected Dataset.
For example:
```scala
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))

df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
```

The above works fine, but the snippet below throws an exception. The only difference between the two is that the latter adds select(df1("a")). Yet df1("a") works fine when used as a join condition:

```scala
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))

df1Joindf2.join(df1, df1Joindf2("a") === df1("a")).select(df1("a"))
```

There is an existing test in DataFrameSelfJoinSuite:

```scala
test("SPARK-28344: fail ambiguous self join - column ref in Project") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    // `df2("id")` actually points to the column of `df1`.
    checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 2).map(Row(_)))

    // Aliasing the dataframes and using qualified column names fixes the ambiguous self-join.
    val aliasedDf1 = df1.alias("left")
    val aliasedDf2 = df2.as("right")
    checkAnswer(
      aliasedDf1.join(aliasedDf2).select($"right.id"),
      Seq(1, 1, 1, 2, 2, 2).map(Row(_)))
  }

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    // Assertion 1: existing
    assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))

    // Assertion 2: added by me
    assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
  }
}
```
Here Assertion 1 passes (that is, the ambiguous-self-join exception is thrown), but Assertion 2 fails (no exception is thrown). The only change is the join order.

Logically both assertions are invalid, in the sense that neither should throw an exception: from the user's perspective there is no ambiguity.
This PR addresses that.

Does this PR introduce any user-facing change?

Yes.
Datasets involving self joins that may previously have thrown ambiguity-related exceptions are now expected to work, assuming the columns used in the APIs are extracted from the Datasets being joined at the topmost level.

How was this patch tested?

Added new tests and made existing assertions stricter. Modified the existing tests in DataFrameSelfJoinSuite that are logically unambiguous based on the Datasets from which the columns are extracted.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Mar 8, 2024
@ahshahid ahshahid changed the title [SPARK-47217][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions Mar 8, 2024
The review comments below refer to this diff hunk in ColumnResolutionHelper:

```scala
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
      assert(q.children.length == 1)
      q.children.head.output
    },

    resolveOnDatasetId = (datasetid: Long, name: String) => {
```
@cloud-fan (Contributor) commented:

can you check out function tryResolveDataFrameColumns in this file and see if it matches your expectations?

@cloud-fan (Contributor) commented:

Can you also check #43115 and see if it fixes your problem?

@ahshahid (Author) commented:

Thanks @cloud-fan, will check. The thing is that when a Column is extracted from a DataFrame, it comes as already resolved (and in cases of bugs, resolved incorrectly). But I will still check #43115.

@ahshahid (Author) commented:

> can you check out function tryResolveDataFrameColumns in this file and see if it matches your expectations?

I will check it out, but isn't the plan ID tag set only when using the Connect client?

@ahshahid (Author) commented:

> Can you also check #43115 and see if it fixes your problem?

@cloud-fan I applied the patch to nearly the latest master, together with the tests for SPARK-47320. The new tests fail along with existing ones, so I suppose PR #43115 needs to be enhanced.

@ahshahid (Author) commented:

@peter-toth @cloud-fan,
Attached is a patch showing my attempt to use the plan-ID resolution logic. It does not work: the plan-ID resolution recurses to the full depth of the plan, and even an attempt to end the recursion early fails, because the comparison of the recursion's return values does not carry sufficient data. A further attempt to somehow reuse that code is proving too cumbersome for me.
patch.txt

I have, though, removed the previously added new class UnresolvedAttributeWithTag.

@peter-toth (Contributor) commented Mar 16, 2024:

I have 2 notes on the above:

- @ahshahid, the following worked in Spark 3.5 but fails in 4.0 after [SPARK-43838][SQL] Fix subquery on single table with having clause can't be optimized #41347, for the same reason as described in the old [SPARK-47217][SQL] Fix ambiguity check in self joins #45343:

  ```scala
  test("SPARK-47217: DeduplicateRelations issue 4") {
    Seq(true, false).foreach(fail =>
      withSQLConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> fail.toString) {
        val df = Seq((1, 2)).toDF("a", "b")
        val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
        val df3 = df.select(df("a"), df("b"))
        // `df("a")` doesn't come from the join's direct children, but from its descendants
        val df4 = df2.join(df3, df2("bb") === df("b")).select(df2("aa"), df("a"))
        checkAnswer(df4, Row(1, 1) :: Nil)
      }
    )
  }
  ```

  In this test df("a")'s expression id gets deduplicated (in the right side of the join), so the original expression id doesn't work in the final select. But I think this test case proves that we need tryResolveDataFrameColumns()-like deep recursion when we try resolving by plan ids.

- @cloud-fan, I think there is a different problem with tryResolveDataFrameColumns(). I did try to use it for "re-resolving" attribute references that became invalid, in a quick test: peter-toth@a873c24, but a few test cases failed because some logical plans can belong to multiple datasets. E.g. if we have:

  ```scala
  val df = Seq((1, 2)).toDF("a", "b")
  val df2 = df.toDF()
  ```

  then df and df2 share the same logical plan instance, but we can't store multiple ids in the current LogicalPlan.PLAN_ID_TAG. A sketch of a multi-id tag follows.
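For the single-id limitation in note 2, here is a minimal sketch of the kind of multi-id tag that would be needed. It assumes Spark's TreeNodeTag / getTagValue / setTagValue API; PLAN_IDS_TAG and addPlanId are hypothetical names, not actual Spark APIs:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.trees.TreeNodeTag

object MultiPlanIdTag {
  // Today a plan node carries at most one Long plan id, so tagging a shared
  // plan for a second Dataset would overwrite the first id.
  val PLAN_ID_TAG = TreeNodeTag[Long]("plan_id")

  // Hypothetical alternative: accumulate every dataset/plan id that refers
  // to the same LogicalPlan instance.
  val PLAN_IDS_TAG = TreeNodeTag[Set[Long]]("plan_ids")

  def addPlanId(plan: LogicalPlan, id: Long): Unit = {
    val ids = plan.getTagValue(PLAN_IDS_TAG).getOrElse(Set.empty[Long])
    plan.setTagValue(PLAN_IDS_TAG, ids + id)
  }
}
```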

@ahshahid (Author) commented:

@peter-toth I agree with your analysis. In the current PR, the approach I had in mind was, for simplicity, to allow columns only from the top-level joining dataframes. The reasoning was:

1. It gives predictable behaviour and makes the outcome easier for the user to comprehend.
2. Resolution does not need deep traversals. With repeated dataframes, if we descend to the leaves the dataset IDs will most likely clash, so to resolve the ambiguity we would have to resort to something like shortest depth etc. (I still use depth, but only 1 level deep.)

@ahshahid (Author) commented:

Yes, PlanId being a single value is a sort of limitation. Though I would hold back my thoughts, because I have not fully grasped the planId usage code, except that the Spark Connect code is very sensitive to it.

@peter-toth (Contributor) commented:

Maybe we could fix this issue with a small change like this: #45552

@ahshahid (Author) commented:

@peter-toth @cloud-fan,
IMHO Spark's current approach of resolving the attribute to a dataframe lower than the top-level dataframe(s), which in the process adds missing attributes to the various projections in between, can be detrimental to performance without the user being aware of the cause. The scenario I have in mind: say the user has cached the lower dataframes. With the plan implicitly adding missing Projects, those cached plans may become unusable, without the user being aware of the situation.

@ahshahid (Author) commented:

@peter-toth does your PR #45552 imply that this PR can be closed?

@ahshahid (Author) commented:

> @peter-toth does your PR #45552 imply that this PR can be closed?

I am wondering if a combination of your PR #45552 and this PR would solve the issue satisfactorily, i.e.:

1. If DATASET_ID_TAG is present in the UnresolvedAttribute, the first preference for resolution would be the code of this PR. This resolves the attribute when there is apparent unambiguity based on the user's choice of the dataframe from which to extract columns (given that the dataframe used is a leg of the top-level join).
2. If that resolution is not possible, or is ambiguous, delegate to plan-ID resolution.

A rough sketch of this two-step chain is below.
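As illustration only, a minimal sketch of the proposed two-step chain, with both resolvers stubbed out; the names resolveOnDatasetId and resolveByPlanId are hypothetical, not actual Spark APIs:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object CombinedResolution {
  // Step 1: this PR's dataset-id based resolution against the top-level join legs.
  def resolveOnDatasetId(u: UnresolvedAttribute, plan: LogicalPlan): Option[NamedExpression] = ???

  // Step 2: Connect-style plan-id resolution (the approach of #45552 / tryResolveDataFrameColumns).
  def resolveByPlanId(u: UnresolvedAttribute, plan: LogicalPlan): Option[NamedExpression] = ???

  // Proposed combination: prefer dataset-id resolution, fall back to plan-id resolution.
  def resolveColumn(u: UnresolvedAttribute, plan: LogicalPlan): Option[NamedExpression] =
    resolveOnDatasetId(u, plan).orElse(resolveByPlanId(u, plan))
}
```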

@peter-toth (Contributor) commented:

Let's keep this PR open till we fully figure out how to deal with these issues: #45552 (comment)

ashahid and others added 5 commits March 29, 2024 15:12
…tes resolved using datasetId for top level join, the behaviour remains unchanged independent of the flag spark.sql.analyzer.failAmbiguousSelfJoin value
… to LogicalPlan irrespective of boolean FAIL_AMBIGUOUS_SELF_JOIN_ENABLED enabled or not.
github-actions (bot) commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 26, 2024
@github-actions github-actions bot closed this Jul 27, 2024