-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32659][SQL] Fix the data issue when applying DPP on non-atomic type #29475
Conversation
Test build #127637 has finished for PR 29475 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
Outdated
Show resolved
Hide resolved
Test build #127670 has finished for PR 29475 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
Outdated
Show resolved
Hide resolved
Test build #127715 has finished for PR 29475 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
Outdated
Show resolved
Hide resolved
Test build #127894 has finished for PR 29475 at commit
|
|
||
@transient private var result: Array[Any] = _ | ||
@transient private var result: Set[Any] = _ | ||
@transient private lazy val inSet: InSet = InSet(child, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: val inSet: InSet =
-> val inSet =
CodegenObjectFactoryMode.CODEGEN_ONLY).foreach { mode => | ||
Seq(true, false).foreach { pruning => | ||
withSQLConf( | ||
SQLConf.CODEGEN_FACTORY_MODE.key -> s"${mode.toString}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s"${mode.toString}"
-> mode.toString
Seq(true, false).foreach { pruning => | ||
withSQLConf( | ||
SQLConf.CODEGEN_FACTORY_MODE.key -> s"${mode.toString}", | ||
SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> s"${pruning}") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s"$pruning"
sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
Show resolved
Hide resolved
Test build #127903 has finished for PR 29475 at commit
|
thanks, merging to master/3.0! |
…type ### What changes were proposed in this pull request? Use `InSet` expression to fix data issue when pruning DPP on non-atomic type. for example: ```scala spark.range(1000) .select(col("id"), col("id").as("k")) .write .partitionBy("k") .format("parquet") .mode("overwrite") .saveAsTable("df1"); spark.range(100) .select(col("id"), col("id").as("k")) .write .partitionBy("k") .format("parquet") .mode("overwrite") .saveAsTable("df2") spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2") spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false") spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show ``` It should return two records, but it returns empty. ### Why are the changes needed? Fix data issue ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add new unit test. Closes #29475 from wangyum/SPARK-32659. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a8b5688) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Test build #127909 has finished for PR 29475 at commit
|
…ueryExec ### What changes were proposed in this pull request? This is a followup of #29475. This PR updates the code to broadcast the Array instead of Set, which was the behavior before #29475 ### Why are the changes needed? The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #29838 from cloud-fan/followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…nSubqueryExec ### What changes were proposed in this pull request? This is a followup of #29475. This PR updates the code to broadcast the Array instead of Set, which was the behavior before #29475 ### Why are the changes needed? The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #29840 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…nSubqueryExec ### What changes were proposed in this pull request? This is a followup of apache#29475. This PR updates the code to broadcast the Array instead of Set, which was the behavior before apache#29475 ### Why are the changes needed? The size of Set can be much bigger than Array. It's safer to keep the behavior the same as before and build the set at the executor side. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes apache#29840 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Use
InSet
expression to fix data issue when pruning DPP on non-atomic type. for example:It should return two records, but it returns empty.
Why are the changes needed?
Fix data issue
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add new unit test.