[SPARK-34168] [SQL] Support DPP in AQE when the join is Broadcast hash join at the beginning #31258

Closed
wants to merge 17 commits into from

JkSelf
Contributor

@JkSelf JkSelf commented Jan 20, 2021

What changes were proposed in this pull request?

This PR enables dynamic partition pruning (DPP) under adaptive query execution (AQE) when the join is already planned as a broadcast hash join before the AQE rules are applied, so a query can benefit from DPP and AQE at the same time. It reuses the result of the build side and inserts the DPP filter into the probe side.
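
For readers new to DPP, the idea described above can be sketched in plain Scala. This is only a toy model under stated assumptions, not the code in this PR: `DppSketch`, `Partition`, `buildSideKeys`, and `pruneProbeSide` are hypothetical names, and real Spark works on physical plans rather than in-memory collections.

```scala
// Toy model of dynamic partition pruning; every name here is hypothetical.
object DppSketch {
  // A probe-side (fact table) partition, identified by its partition key value.
  final case class Partition(key: Int, rows: Seq[String])

  // Build side: evaluate the small dimension side first and collect the
  // distinct join keys that survive its filter.
  def buildSideKeys(dim: Seq[(Int, String)], country: String): Set[Int] =
    dim.collect { case (storeId, c) if c == country => storeId }.toSet

  // Probe side: keep only partitions whose key appears in the build-side
  // result, so the remaining partitions are never scanned.
  def pruneProbeSide(fact: Seq[Partition], keys: Set[Int]): Seq[Partition] =
    fact.filter(p => keys.contains(p.key))
}
```

The point of the sketch is the ordering: the build side is evaluated once, and its keys both feed the join and prune the probe-side scan.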

Why are the changes needed?

Improve performance

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new unit tests.

@JkSelf
Contributor Author

JkSelf commented Jan 20, 2021

@cloud-fan Please help me review. Thanks.

@github-actions github-actions bot added the SQL label Jan 20, 2021
@SparkQA

SparkQA commented Jan 20, 2021

Test build #134257 has finished for PR 31258 at commit b2d70f1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @JkSelf. Could you fix the Scala style failure?

@JkSelf
Contributor Author

JkSelf commented Jan 21, 2021

@cloud-fan Updated based on the offline discussions. Please help review again. Thanks.

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38908/

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38908/

@gengliangwang
Member

cc @maryannxue

@SparkQA

SparkQA commented Jan 21, 2021

Test build #134321 has finished for PR 31258 at commit c24efbc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SubqueryBroadcastExec(name, broadcastKeyIndex, buildKeys, exchange)

// Update the inputPlan and the currentPhysicalPlan of the adaptivePlan.
adaptivePlan.inputPlan = broadcastValues
Contributor

@cloud-fan cloud-fan Jan 21, 2021

Can we wrap the adaptivePlan with a subquery broadcast? Then we don't need to mutate adaptivePlan.inputPlan here, and inputPlan can stay immutable.
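
The suggestion above can be illustrated with a small, self-contained sketch. It assumes simplified stand-in types (`Plan`, `Leaf`, `AdaptivePlan`, `SubqueryBroadcast` are hypothetical and are not Spark's classes); it only shows the shape of "wrap instead of mutate".

```scala
// Hypothetical stand-ins for plan nodes; not Spark's actual classes.
object WrapInsteadOfMutate {
  sealed trait Plan
  final case class Leaf(name: String) extends Plan

  // The adaptive plan receives its input once, as an immutable constructor field.
  final case class AdaptivePlan(inputPlan: Plan) extends Plan

  // A broadcast-style node that takes the adaptive plan as its child.
  final case class SubqueryBroadcast(name: String, child: Plan) extends Plan

  // Wrapping builds a new parent node and leaves the adaptive plan untouched.
  def wrap(adaptive: AdaptivePlan): Plan = SubqueryBroadcast("dpp", adaptive)
}
```

Because the wrapper is a new node, nothing that already holds a reference to the adaptive plan can observe a change in its input.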

@@ -133,7 +133,7 @@ case class AdaptiveSparkPlanExec(
inputPlan, queryStagePreparationRules, Some((planChangeLogger, "AQE Preparations")))
}

- @volatile private var currentPhysicalPlan = initialPlan
+ @volatile var currentPhysicalPlan = initialPlan
Member

Exposing a mutable variable does not seem like a good idea.

Contributor Author

Yes. Fixed.

@@ -101,7 +102,6 @@ case class InsertAdaptiveSparkPlan(
// TODO migrate dynamic-partition-pruning onto adaptive execution.
Member

So DPP is now supported, and this comment looks outdated?

Contributor Author

Done.

@dongjoon-hyun
Member

Hi, @JkSelf .
Could you take a look at the failures? It seems that this PR still has 18 relevant failures.

DynamicPartitionPruningSuiteAEOn.simple inner join triggers DPP with mock-up tables
org.scalatest.exceptions.TestFailedException: false did not equal true Should trigger DPP with a subquery duplicate:
== Parsed Logical Plan ==
'Project ['f.date_id, 'f.store_id]
+- 'Join Inner, (('f.store_id = 's.store_id) AND ('s.country = NL))
   :- 'SubqueryAlias f
   :  +- 'UnresolvedRelation [fact_sk], [], false
   +- 'SubqueryAlias s
      +- 'UnresolvedRelation [dim_store], [], false

@SparkQA

SparkQA commented Jan 25, 2021

Test build #134426 has finished for PR 31258 at commit 89c74bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2021

Test build #134634 has finished for PR 31258 at commit 0d78a62.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SubqueryAdaptiveBroadcastExec(
  • case class InsertDynamicPruningFilters(

@SparkQA

SparkQA commented Jan 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39222/

@SparkQA

SparkQA commented Jan 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39222/

@SparkQA

SparkQA commented Jan 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39224/

@SparkQA

SparkQA commented Jan 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39224/

@SparkQA

SparkQA commented Jan 29, 2021

Test build #134636 has finished for PR 31258 at commit cdd1226.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134859 has finished for PR 31258 at commit f1226b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39457/

*/
case class ShuffleQueryStageExec(
override val id: Int,
- override val plan: SparkPlan) extends QueryStageExec {
+ override val plan: SparkPlan,
+ _canonicalized: SparkPlan) extends QueryStageExec {
Contributor

We forgot to add override def doCanonicalize(): SparkPlan = _canonicalized
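
A minimal sketch of why this override matters, using a hypothetical `Node` trait instead of Spark's `SparkPlan` (the names `Node`, `Exchange`, `Stage`, and `sameResult` below are stand-ins chosen for illustration). Without the override, the extra `_canonicalized` parameter is stored but never consulted when two stages are compared.

```scala
// Hypothetical miniature of the canonicalization pattern; not Spark's classes.
object CanonicalizeSketch {
  trait Node {
    // By default a node is its own canonical form.
    protected def doCanonicalize(): Node = this
    lazy val canonicalized: Node = doCanonicalize()
    // Two nodes compute the same result when their canonical forms are equal.
    def sameResult(other: Node): Boolean = canonicalized == other.canonicalized
  }

  final case class Exchange(child: String) extends Node

  // The stage carries a precomputed canonical form and must return it here,
  // otherwise stages with different ids or wrappers never compare as equal.
  final case class Stage(id: Int, plan: Node, _canonicalized: Node) extends Node {
    override protected def doCanonicalize(): Node = _canonicalized
  }
}
```

In this toy version, two stages with different ids and different wrapped plans are treated as equivalent as long as they were given the same canonical form.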

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39457/

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134871 has finished for PR 31258 at commit 8a77832.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134870 has finished for PR 31258 at commit 213d2b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member

wangyum commented Feb 5, 2021

@JkSelf @cloud-fan This implementation cannot reuse the BroadcastExchange when a BHJ comes after an SMJ. For example:

  SELECT count(*)
        FROM   (SELECT c.c_customer_sk,
                       s.*
                FROM   customer c
                       JOIN store_sales s
                         ON c.c_customer_sk = ss_customer_sk) t1
               JOIN date_dim
                 ON ss_sold_date_sk = d_date_sk
                    AND d_year = 2002
[Two screenshots comparing the query plans: Enable AE vs. Disable AE]

@JkSelf
Contributor Author

JkSelf commented Feb 5, 2021

@wangyum Yes. This is only the first PR; it supports the case where the join is already a BHJ before the AQE rules are applied. We will support the case where the join starts as an SMJ and is then converted to a BHJ in follow-up PRs.

@SparkQA

SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39509/

@SparkQA

SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39509/

@@ -1345,7 +1371,9 @@ abstract class DynamicPartitionPruningSuiteBase
}
}

- test("SPARK-32817: DPP throws error when the broadcast side is empty") {
+ test("SPARK-32817: DPP throws error when the broadcast side is empty",
+ DisableAdaptiveExecution("EliminateJoinToEmptyRelation " +
Contributor

We can disable this rule by setting ADAPTIVE_OPTIMIZER_EXCLUDED_RULES.

Contributor Author

Updated.
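
As a rough illustration of the suggestion in this thread, the sketch below builds the key/value pair for the excluded-rules setting. Treat it as an assumption-laden example rather than the change actually made in the PR: the `ExcludedRulesSketch` object and `excludedRulesConf` helper are hypothetical, and the fully qualified rule name is assumed from the rule mentioned in the diff.

```scala
// Hypothetical helper; it only builds a config entry and does not touch Spark.
object ExcludedRulesSketch {
  // Config key behind SQLConf.ADAPTIVE_OPTIMIZER_EXCLUDED_RULES.
  val ExcludedRulesKey = "spark.sql.adaptive.optimizer.excludedRules"

  // Assumed fully qualified name of the rule referenced in the test above.
  val EliminateJoinRule =
    "org.apache.spark.sql.execution.adaptive.EliminateJoinToEmptyRelation"

  // The setting takes rule names separated by commas.
  def excludedRulesConf(rules: Seq[String]): (String, String) =
    ExcludedRulesKey -> rules.mkString(",")
}
```

In a Spark test suite, a pair like `ExcludedRulesSketch.excludedRulesConf(Seq(ExcludedRulesSketch.EliminateJoinRule))` could plausibly be passed to `withSQLConf(...)` around the test body, which keeps AQE enabled while skipping only that one rule.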

Contributor

@cloud-fan cloud-fan left a comment

This is a good start!

@SparkQA

SparkQA commented Feb 5, 2021

Test build #134926 has finished for PR 31258 at commit a02e307.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39554/

@SparkQA

SparkQA commented Feb 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39554/

@SparkQA

SparkQA commented Feb 7, 2021

Test build #134971 has finished for PR 31258 at commit 3511996.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -165,7 +175,8 @@ case class ShuffleQueryStageExec(
override def newReuseInstance(newStageId: Int, newOutput: Seq[Attribute]): QueryStageExec = {
val reuse = ShuffleQueryStageExec(
newStageId,
- ReusedExchangeExec(newOutput, shuffle))
+ ReusedExchangeExec(newOutput, shuffle),
+ shuffle.canonicalized)
Contributor

nit: this should be _canonicalized

@@ -229,7 +245,8 @@ case class BroadcastQueryStageExec(
override def newReuseInstance(newStageId: Int, newOutput: Seq[Attribute]): QueryStageExec = {
val reuse = BroadcastQueryStageExec(
newStageId,
- ReusedExchangeExec(newOutput, broadcast))
+ ReusedExchangeExec(newOutput, broadcast),
+ broadcast.canonicalized)
Contributor

ditto

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3b26bc2 Feb 8, 2021
@SparkQA

SparkQA commented Feb 8, 2021

Test build #135033 has finished for PR 31258 at commit 1e1b097.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
… join at the beginning

Closes apache#31258 from JkSelf/supportDPP1.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
7 participants