
[SPARK-32659][SQL] Fix the data issue when applying DPP on non-atomic type #29475

Closed

wants to merge 7 commits into from

Conversation

@wangyum wangyum (Member) commented Aug 19, 2020

What changes were proposed in this pull request?

Use the `InSet` expression to fix a data correctness issue when applying DPP (dynamic partition pruning) on a non-atomic type. For example:

import org.apache.spark.sql.functions.col

spark.range(1000)
  .select(col("id"), col("id").as("k"))
  .write
  .partitionBy("k")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("df1")

spark.range(100)
  .select(col("id"), col("id").as("k"))
  .write
  .partitionBy("k")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("df2")

spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show

It should return two records, but it returns an empty result.
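For context, a minimal standalone sketch (illustrative only, not this PR's code) of why delegating the membership test to Catalyst's `InSet` helps: non-atomic values can arrive in different internal representations (e.g. `UnsafeRow` vs `GenericInternalRow`), which do not compare equal with a plain `Array.contains`, whereas `InSet` falls back to a type-aware comparison for such types. The `BoundReference`/`InternalRow` setup below is an assumption made just for the demo.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, InSet}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A struct-typed "partition key" expression, bound to the first column of the input row.
val keyType = StructType(Seq(StructField("k", LongType)))
val key = BoundReference(0, keyType, nullable = true)

// Keys collected from the other side of the join, as generic internal rows.
val collectedKeys: Set[Any] = Set(InternalRow(0L), InternalRow(1L))

// InSet compares non-atomic values with a type-aware ordering instead of Java equality.
val filter = InSet(key, collectedKeys)
println(filter.eval(InternalRow(InternalRow(1L))))  // expected: true
```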

Why are the changes needed?

Fix a data correctness issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add new unit test.

@SparkQA commented Aug 19, 2020

Test build #127637 has finished for PR 29475 at commit ffbed43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 20, 2020

Test build #127670 has finished for PR 29475 at commit 4df6496.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 21, 2020

Test build #127715 has finished for PR 29475 at commit 3bb8f41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title from "[SPARK-32659][SQL] Replace Array with Set in InSubqueryExec" to "[SPARK-32659][SQL] Fix the data issue of inserted DPP on non-atomic type" on Aug 25, 2020
@SparkQA commented Aug 25, 2020

Test build #127894 has finished for PR 29475 at commit d1db0bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


-  @transient private var result: Array[Any] = _
+  @transient private var result: Set[Any] = _
+  @transient private lazy val inSet: InSet = InSet(child, result)
nit: val inSet: InSet = -> val inSet =
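For readers following the diff above, a hedged sketch of the pattern it introduces (the class and method names here are illustrative assumptions, not the actual `InSubqueryExec` code): the collected subquery result is stored as a `Set[Any]`, and the membership predicate is built lazily from it, with the reviewer's nit applied.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, InSet}

// Illustrative wrapper only; the real InSubqueryExec carries more state (plan, exprId, ...).
class SubqueryMembership(child: Expression) {
  @transient private var result: Set[Any] = _
  @transient private lazy val inSet = InSet(child, result)  // built on first use, after `result` is set

  def updateResult(collected: Set[Any]): Unit = { result = collected }
  def predicate: InSet = inSet
}
```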

CodegenObjectFactoryMode.CODEGEN_ONLY).foreach { mode =>
  Seq(true, false).foreach { pruning =>
    withSQLConf(
      SQLConf.CODEGEN_FACTORY_MODE.key -> s"${mode.toString}",
nit: s"${mode.toString}" -> mode.toString

Seq(true, false).foreach { pruning =>
  withSQLConf(
    SQLConf.CODEGEN_FACTORY_MODE.key -> s"${mode.toString}",
    SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> s"${pruning}") {
nit: s"$pruning"

@wangyum wangyum changed the title from "[SPARK-32659][SQL] Fix the data issue of inserted DPP on non-atomic type" to "[SPARK-32659][SQL] Fix the data issue when pruning DPP on non-atomic type" on Aug 26, 2020
@SparkQA commented Aug 26, 2020

Test build #127903 has finished for PR 29475 at commit 7c4f2fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:
thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in a8b5688 Aug 26, 2020
cloud-fan pushed a commit that referenced this pull request Aug 26, 2020
…type

### What changes were proposed in this pull request?

Use `InSet` expression to fix data issue when pruning DPP on non-atomic type. for example:
   ```scala
    spark.range(1000)
    .select(col("id"), col("id").as("k"))
    .write
    .partitionBy("k")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("df1");

   spark.range(100)
   .select(col("id"), col("id").as("k"))
   .write
   .partitionBy("k")
   .format("parquet")
   .mode("overwrite")
   .saveAsTable("df2")

   spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
   spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
   spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show
   ```
   It should return two records, but it returns empty.

### Why are the changes needed?

Fix data issue

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add new unit test.

Closes #29475 from wangyum/SPARK-32659.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a8b5688)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan changed the title from "[SPARK-32659][SQL] Fix the data issue when pruning DPP on non-atomic type" to "[SPARK-32659][SQL] Fix the data issue when applying DPP on non-atomic type" on Aug 26, 2020
@wangyum wangyum deleted the SPARK-32659 branch August 26, 2020 07:02
@SparkQA commented Aug 26, 2020

Test build #127909 has finished for PR 29475 at commit f92a594.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Sep 22, 2020
…ueryExec

### What changes were proposed in this pull request?

This is a followup of #29475.

This PR updates the code to broadcast the Array instead of Set, which was the behavior before #29475

### Why are the changes needed?

The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #29838 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
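As a side note, a minimal illustrative sketch (not the actual `InSubqueryExec` code) of the pattern this followup describes: only the compact `Array` is serialized and broadcast, and the `Set` used for membership checks is rebuilt lazily after deserialization, i.e. on the executor side.

```scala
// Hypothetical standalone example of "ship the Array, build the Set lazily where it is used".
class BroadcastableFilter(values: Array[Any]) extends Serializable {
  // @transient keeps the Set out of the serialized payload; the lazy val is
  // re-materialized in whichever JVM (driver or executor) first calls contains().
  @transient private lazy val set: Set[Any] = values.toSet
  def contains(v: Any): Boolean = set.contains(v)
}
```

The trade-off is the one the commit message states: the Array is cheaper to broadcast, while the cost of building the Set is only paid where the lookups actually happen.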
dongjoon-hyun pushed a commit that referenced this pull request Sep 22, 2020
…nSubqueryExec

### What changes were proposed in this pull request?

This is a followup of #29475.

This PR updates the code to broadcast the Array instead of Set, which was the behavior before #29475

### Why are the changes needed?

The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #29840 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…nSubqueryExec

### What changes were proposed in this pull request?

This is a followup of apache#29475.

This PR updates the code to broadcast the Array instead of Set, which was the behavior before apache#29475

### Why are the changes needed?

The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes apache#29840 from cloud-fan/backport.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>