[SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin #30488

Closed
wants to merge 11 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Nov 24, 2020

What changes were proposed in this pull request?

Currently, join() uses withPlan(logicalPlan) as a convenient way to call some Dataset functions. But this makes the dataset_id inconsistent between the logicalPlan and the original Dataset (because withPlan(logicalPlan) creates a new Dataset with a new id and resets the dataset_id of the logicalPlan to that new id). As a result, it breaks the rule DetectAmbiguousSelfJoin.

In this PR, we propose to drop the usage of withPlan and use the logicalPlan directly so that its dataset_id doesn't change.

Besides, this PR also removes the related metadata (DATASET_ID_KEY, COL_POS_KEY) when an Alias constructs its own metadata, because the Alias is no longer a reference column after being converted to an Attribute. To achieve that, we add a new field, deniedMetadataKeys, to indicate the metadata keys that need to be removed.
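
As a rough illustration (a minimal sketch, not the actual Alias implementation; the helper name is made up), the denied keys could be stripped with Spark's public MetadataBuilder API:

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Hypothetical helper: drop the keys listed in deniedMetadataKeys (e.g. the internal
// dataset-id / column-position keys) from the child's metadata, so the aliased column
// is no longer treated as a reference to the original Dataset.
def stripDeniedMetadata(childMetadata: Metadata, deniedMetadataKeys: Seq[String]): Metadata = {
  val builder = new MetadataBuilder().withMetadata(childMetadata)
  deniedMetadataKeys.foreach(builder.remove)
  builder.build()
}
```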

Why are the changes needed?

For the query below, it returns a wrong result while it should throw an ambiguous self-join exception instead:

val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+

This PR fixes the wrong behaviour.

Does this PR introduce any user-facing change?

Yes, users hit the exception instead of the wrong result after this PR.

How was this patch tested?

Added a new unit test.

@Ngone51
Member Author

Ngone51 commented Nov 24, 2020

cc @cloud-fan @xuanyuanking Could you take a look? Thanks!

@github-actions github-actions bot added the SQL label Nov 24, 2020
@SparkQA

SparkQA commented Nov 24, 2020

Test build #131668 has finished for PR 30488 at commit a617f94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 changed the title [SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin [SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin Nov 25, 2020
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131728 has finished for PR 30488 at commit 05bca19.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member

Seems the failed UT is related.

org.apache.spark.sql.DataFrameSelfJoinSuite.SPARK-28344: don't fail if there is no ambiguous self join

@Ngone51
Member Author

Ngone51 commented Nov 25, 2020

Yeah...I fixed it just now.

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131778 has finished for PR 30488 at commit 5cff25f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 30, 2020

Test build #131994 has finished for PR 30488 at commit 85f6f12.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 closed this Dec 1, 2020
@Ngone51 Ngone51 reopened this Dec 1, 2020
@SparkQA

SparkQA commented Dec 2, 2020

Test build #132041 has finished for PR 30488 at commit df04549.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 2, 2020

Test build #132054 has finished for PR 30488 at commit e06b223.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

GA passed, merging to master, thanks!

@cloud-fan cloud-fan closed this in a082f46 Dec 2, 2020
@SparkQA

SparkQA commented Dec 2, 2020

Test build #132059 has finished for PR 30488 at commit d13b88c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon added a commit that referenced this pull request Dec 9, 2020
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of #30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Dec 9, 2020
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of #30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b5399d4)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 9, 2020
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of apache/spark#30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
ahshahid added a commit to ahshahid/spark that referenced this pull request Dec 9, 2020
* [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types

### What changes were proposed in this pull request?

Add migration guide for CHAR VARCHAR types

### Why are the changes needed?

for migration

### Does this PR introduce _any_ user-facing change?

doc change

### How was this patch tested?

passing ci

Closes apache#30654 from yaooqinn/SPARK-33641-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode

### What changes were proposed in this pull request?
This change makes InterruptedIOException be treated as InterruptedException when closing YarnClientSchedulerBackend, so that it doesn't log an error like "YARN application has exited unexpectedly xxx".

### Why are the changes needed?
In YARN client mode, when stopping YarnClientSchedulerBackend, Spark first tries to interrupt the YARN application monitor thread. In MonitorThread.run(), it catches InterruptedException to respond gracefully to the stop request.

But the client.monitorApplication method can also throw InterruptedIOException while a Hadoop RPC call is in progress. In this case, MonitorThread does not know it was interrupted; a YARN app failure is reported and "Failed to contact YARN for application xxxxx; YARN application has exited unexpectedly with state xxxxx" is logged at error level, which confuses users a lot.
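
As a hedged sketch of the handling described above (illustrative names such as runMonitorLoop and blockingRpcCall, not the actual YarnClientSchedulerBackend code):

```scala
import java.io.InterruptedIOException

// Hypothetical sketch: treat an InterruptedIOException thrown from a blocking RPC the
// same as a plain interrupt, so a requested shutdown is not reported as a YARN
// application failure.
def runMonitorLoop(blockingRpcCall: () => Unit, logError: String => Unit): Unit = {
  try {
    blockingRpcCall()
  } catch {
    case _: InterruptedException =>
      () // stop was requested; exit quietly
    case _: InterruptedIOException =>
      () // Hadoop RPC surfaces the interrupt this way; also exit quietly
    case e: Exception =>
      logError(s"YARN application monitor failed unexpectedly: ${e.getMessage}")
  }
}
```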

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
very simple patch, seems no need?

Closes apache#30617 from sqlwindspeaker/yarn-client-interrupt-monitor.

Authored-by: suqilong <suqilong@qiyi.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

* [SPARK-33655][SQL] Improve performance of processing FETCH_PRIOR

### What changes were proposed in this pull request?
Currently, when a client sends FETCH_PRIOR to the Thriftserver, the Thriftserver reiterates from the start position. Because the Thriftserver caches a query result in an array when the THRIFTSERVER_INCREMENTAL_COLLECT feature is off, FETCH_PRIOR can be implemented without reiterating over the result. A trait FetchIterator is added to separate the iterator-based and array-based implementations. FetchIterator also supports moving the cursor to an absolute position, which will be useful for implementing FETCH_RELATIVE and FETCH_ABSOLUTE.
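
A minimal sketch of the idea (illustrative names and logic, not the actual Thriftserver iterator): an array-backed cursor that can be repositioned absolutely, so FETCH_PRIOR rewinds the cursor instead of replaying the whole cached result.

```scala
// Hypothetical array-backed cursor; fetchAbsolute provides the absolute positioning
// mentioned above, and FETCH_PRIOR becomes "rewind, then fetch" rather than
// re-iterating from the beginning of the cached result.
class ArrayFetchCursor[A](rows: IndexedSeq[A]) {
  private var pos = 0

  def fetchAbsolute(p: Int): Unit = {
    pos = math.max(0, math.min(p, rows.length))
  }

  def fetchNext(n: Int): Seq[A] = {
    val batch = rows.slice(pos, pos + n)
    pos = math.min(pos + n, rows.length)
    batch
  }

  def fetchPrior(n: Int): Seq[A] = {
    // step back over the batch just returned plus the one before it
    fetchAbsolute(pos - 2 * n)
    fetchNext(n)
  }
}
```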

### Why are the changes needed?
For better performance of Thriftserver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
FetchIteratorSuite

Closes apache#30600 from Dooyoung-Hwang/refactor_with_fetch_iterator.

Authored-by: Dooyoung Hwang <dooyoung.hwang@sk.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

* [SPARK-33719][DOC] Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance

### What changes were proposed in this pull request?

Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance

### Why are the changes needed?

Users can know that these functions throw runtime exceptions under ANSI mode if the result is not valid.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and check it in browser:
![image](https://user-images.githubusercontent.com/1097932/101608930-34a79e80-39bb-11eb-9294-9d9b8c3f6faa.png)

Closes apache#30683 from gengliangwang/improveDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

* [SPARK-33071][SPARK-33536][SQL][FOLLOW-UP] Rename deniedMetadataKeys to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of apache#30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes apache#30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

* [SPARK-33722][SQL] Handle DELETE in ReplaceNullWithFalseInPredicate

### What changes were proposed in this pull request?

This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

This change allows Spark to optimize delete conditions like we optimize filters.
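
For intuition, a hedged example of the kind of rewrite this enables (the table name and condition are made up, and `target` is assumed to be a v2 table that supports DELETE):

```scala
import org.apache.spark.sql.SparkSession

// A row is deleted only when the condition is TRUE, so a NULL-producing branch is
// equivalent to FALSE. With DeleteFromTable supported, the rule can rewrite
//   DELETE FROM target WHERE IF(id > 10, NULL, TRUE)
// into
//   DELETE FROM target WHERE IF(id > 10, FALSE, TRUE)
// which later simplifications can fold further.
object DeleteConditionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("example").getOrCreate()
    spark.sql("DELETE FROM target WHERE IF(id > 10, NULL, TRUE)")
    spark.stop()
  }
}
```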

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends the existing test cases to also cover `DeleteFromTable`.

Closes apache#30688 from aokolnychyi/spark-33722.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Co-authored-by: Kent Yao <yaooqinn@hotmail.com>
Co-authored-by: suqilong <suqilong@qiyi.com>
Co-authored-by: Dooyoung Hwang <dooyoung.hwang@sk.com>
Co-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
rshkv pushed a commit to palantir/spark that referenced this pull request Jan 28, 2021
…lan in join() to not break DetectAmbiguousSelfJoin

Currently, `join()` uses `withPlan(logicalPlan)` as a convenient way to call some Dataset functions. But this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset` (because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id). As a result, it breaks the rule `DetectAmbiguousSelfJoin`.

In this PR, we propose to drop the usage of `withPlan` and use the `logicalPlan` directly so that its `dataset_id` doesn't change.

Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after being converted to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata keys that need to be removed.

For the query below, it returns a wrong result while it should throw an ambiguous self-join exception instead:

```scala
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+
```
This PR fixes the wrong behaviour.

Yes, users hit the exception instead of the wrong result after this PR.

Added a new unit test.

Closes apache#30488 from Ngone51/fix-self-join.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
rshkv pushed a commit to palantir/spark that referenced this pull request Jan 28, 2021
…to nonInheritableMetadataKeys in Alias

This PR is a followup of apache#30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

To make it easier to maintain and read.

No. This is rather a code cleanup.

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes apache#30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
laflechejonathan pushed a commit to palantir/spark that referenced this pull request Sep 27, 2021
…lan in join() to not break DetectAmbiguousSelfJoin

Currently, `join()` uses `withPlan(logicalPlan)` as a convenient way to call some Dataset functions. But this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset` (because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id). As a result, it breaks the rule `DetectAmbiguousSelfJoin`.

In this PR, we propose to drop the usage of `withPlan` and use the `logicalPlan` directly so that its `dataset_id` doesn't change.

Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after being converted to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata keys that need to be removed.

For the query below, it returns a wrong result while it should throw an ambiguous self-join exception instead:

```scala
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+
```
This PR fixes the wrong behaviour.

Yes, users hit the exception instead of the wrong result after this PR.

Added a new unit test.

Closes apache#30488 from Ngone51/fix-self-join.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
laflechejonathan pushed a commit to palantir/spark that referenced this pull request Sep 27, 2021
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of apache#30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes apache#30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>