
[SPARK-32659][SQL] Fix the data issue when applying DPP on non-atomic type #29475

Closed

wants to merge 7 commits into from

Conversation

@wangyum wangyum (Member) commented Aug 19, 2020

What changes were proposed in this pull request?

Use the `InSet` expression to fix a data correctness issue when applying DPP (dynamic partition pruning) on a non-atomic type. For example:

import org.apache.spark.sql.functions.col

spark.range(1000)
  .select(col("id"), col("id").as("k"))
  .write
  .partitionBy("k")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("df1")

spark.range(100)
  .select(col("id"), col("id").as("k"))
  .write
  .partitionBy("k")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("df2")

spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show

It should return two records, but it returns an empty result.
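For context, a minimal standalone sketch (illustrative only, not this PR's code) of why delegating the membership test to Catalyst's `InSet` helps: non-atomic values can arrive in different internal representations (e.g. `UnsafeRow` vs `GenericInternalRow`), which do not compare equal with a plain `Array.contains`, whereas `InSet` falls back to a type-aware comparison for such types. The `BoundReference`/`InternalRow` setup below is an assumption made just for the demo.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, InSet}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A struct-typed "partition key" expression, bound to the first column of the input row.
val keyType = StructType(Seq(StructField("k", LongType)))
val key = BoundReference(0, keyType, nullable = true)

// Keys collected from the other side of the join, as generic internal rows.
val collectedKeys: Set[Any] = Set(InternalRow(0L), InternalRow(1L))

// InSet compares non-atomic values with a type-aware ordering instead of Java equality.
val filter = InSet(key, collectedKeys)
println(filter.eval(InternalRow(InternalRow(1L))))  // expected: true
```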

Why are the changes needed?

Fix a data correctness issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add new unit test.

@SparkQA commented Aug 19, 2020

Test build #127637 has finished for PR 29475 at commit ffbed43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 20, 2020

Test build #127670 has finished for PR 29475 at commit 4df6496.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 21, 2020

Test build #127715 has finished for PR 29475 at commit 3bb8f41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title from "[SPARK-32659][SQL] Replace Array with Set in InSubqueryExec" to "[SPARK-32659][SQL] Fix the data issue of inserted DPP on non-atomic type" on Aug 25, 2020
@SparkQA commented Aug 25, 2020

Test build #127894 has finished for PR 29475 at commit d1db0bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


-  @transient private var result: Array[Any] = _
+  @transient private var result: Set[Any] = _
+  @transient private lazy val inSet: InSet = InSet(child, result)
nit: val inSet: InSet = -> val inSet =
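For readers following the diff above, a hedged sketch of the pattern it introduces (the class and method names here are illustrative assumptions, not the actual `InSubqueryExec` code): the collected subquery result is stored as a `Set[Any]`, and the membership predicate is built lazily from it, with the reviewer's nit applied.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, InSet}

// Illustrative wrapper only; the real InSubqueryExec carries more state (plan, exprId, ...).
class SubqueryMembership(child: Expression) {
  @transient private var result: Set[Any] = _
  @transient private lazy val inSet = InSet(child, result)  // built on first use, after `result` is set

  def updateResult(collected: Set[Any]): Unit = { result = collected }
  def predicate: InSet = inSet
}
```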

CodegenObjectFactoryMode.CODEGEN_ONLY).foreach { mode =>
  Seq(true, false).foreach { pruning =>
    withSQLConf(
      SQLConf.CODEGEN_FACTORY_MODE.key -> s"${mode.toString}",
nit: s"${mode.toString}" -> mode.toString

Seq(true, false).foreach { pruning =>
  withSQLConf(
    SQLConf.CODEGEN_FACTORY_MODE.key -> s"${mode.toString}",
    SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> s"${pruning}") {
nit: s"$pruning"

@wangyum wangyum changed the title from "[SPARK-32659][SQL] Fix the data issue of inserted DPP on non-atomic type" to "[SPARK-32659][SQL] Fix the data issue when pruning DPP on non-atomic type" on Aug 26, 2020
@SparkQA commented Aug 26, 2020

Test build #127903 has finished for PR 29475 at commit 7c4f2fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:
thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in a8b5688 Aug 26, 2020
cloud-fan pushed a commit that referenced this pull request Aug 26, 2020
…type

### What changes were proposed in this pull request?

Use `InSet` expression to fix data issue when pruning DPP on non-atomic type. for example:
   ```scala
    spark.range(1000)
    .select(col("id"), col("id").as("k"))
    .write
    .partitionBy("k")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("df1");

   spark.range(100)
   .select(col("id"), col("id").as("k"))
   .write
   .partitionBy("k")
   .format("parquet")
   .mode("overwrite")
   .saveAsTable("df2")

   spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
   spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
   spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show
   ```
   It should return two records, but it returns empty.

### Why are the changes needed?

Fix data issue

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add new unit test.

Closes #29475 from wangyum/SPARK-32659.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a8b5688)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan changed the title from "[SPARK-32659][SQL] Fix the data issue when pruning DPP on non-atomic type" to "[SPARK-32659][SQL] Fix the data issue when applying DPP on non-atomic type" on Aug 26, 2020
@wangyum wangyum deleted the SPARK-32659 branch August 26, 2020 07:02
@SparkQA commented Aug 26, 2020

Test build #127909 has finished for PR 29475 at commit f92a594.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Sep 22, 2020
…ueryExec

### What changes were proposed in this pull request?

This is a followup of #29475.

This PR updates the code to broadcast the Array instead of Set, which was the behavior before #29475

### Why are the changes needed?

The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #29838 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
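As a side note, a minimal illustrative sketch (not the actual `InSubqueryExec` code) of the pattern this followup describes: only the compact `Array` is serialized and broadcast, and the `Set` used for membership checks is rebuilt lazily after deserialization, i.e. on the executor side.

```scala
// Hypothetical standalone example of "ship the Array, build the Set lazily where it is used".
class BroadcastableFilter(values: Array[Any]) extends Serializable {
  // @transient keeps the Set out of the serialized payload; the lazy val is
  // re-materialized in whichever JVM (driver or executor) first calls contains().
  @transient private lazy val set: Set[Any] = values.toSet
  def contains(v: Any): Boolean = set.contains(v)
}
```

The trade-off is the one the commit message states: the Array is cheaper to broadcast, while the cost of building the Set is only paid where the lookups actually happen.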
dongjoon-hyun pushed a commit that referenced this pull request Sep 22, 2020
…nSubqueryExec

### What changes were proposed in this pull request?

This is a followup of #29475.

This PR updates the code to broadcast the Array instead of Set, which was the behavior before #29475

### Why are the changes needed?

The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #29840 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…nSubqueryExec

### What changes were proposed in this pull request?

This is a followup of apache#29475.

This PR updates the code to broadcast the Array instead of Set, which was the behavior before apache#29475

### Why are the changes needed?

The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes apache#29840 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>