[SPARK-35282][SQL] Support AQE side shuffled hash join formula #32450

ulysses-you · 2021-05-06T05:34:54Z

What changes were proposed in this pull request?

Use runtime statistics to decide whether a join can be converted to a shuffled hash join.

Why are the changes needed?

Use AQE runtime statistics to decide whether we can use a shuffled hash join instead of a sort merge join. Currently, the formula for shuffled hash join selection does not work because the number of shuffle partitions is dynamic.

Add a new config, spark.sql.adaptive.shuffledHashJoinLocalMapThreshold, to decide whether a join can be safely converted to a shuffled hash join.
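
As a rough illustration of the check this config is meant to drive, the sketch below assumes the rule stated in the updated doc comment in joins.scala: a local hash map may be built only if every partition is no larger than the threshold. The object name, method name, and sample sizes are hypothetical and are not taken from the patch; the 64MB value is the default shown in the reviewed SQLConf.scala diff.

```scala
object LocalMapThresholdSketch {
  val MB: Long = 1024L * 1024L

  // Hypothetical helper, not the PR's implementation: returns true only when
  // every partition of the candidate build side is at or below the threshold.
  def canBuildLocalHashMap(partitionSizesInBytes: Seq[Long], thresholdInBytes: Long): Boolean =
    partitionSizesInBytes.forall(_ <= thresholdInBytes)

  def main(args: Array[String]): Unit = {
    val threshold = 64 * MB // default value proposed in this PR

    // All partitions fit under the threshold, so a local map is allowed.
    println(canBuildLocalHashMap(Seq(10 * MB, 48 * MB, 64 * MB), threshold)) // true

    // One partition exceeds the threshold, so the conversion is rejected.
    println(canBuildLocalHashMap(Seq(10 * MB, 80 * MB), threshold)) // false
  }
}
```

Checking every partition rather than the total size is what makes the decision independent of the partition count, which is the part that is dynamic under AQE.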

Does this PR introduce any user-facing change?

Yes, this adds a new config.

How was this patch tested?

Add new test.

ulysses-you · 2021-05-06T05:42:48Z

cc @maropu @cloud-fan @maryannxue @c21 do you have any thoughts on this new config?

SparkQA · 2021-05-06T06:43:37Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42713/

c21 · 2021-05-06T06:36:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        s"${PREFER_SORTMERGEJOIN.key} is false.")
+      .version("3.2.0")
+      .bytesConf(ByteUnit.BYTE)
+      .createWithDefaultString("64MB")


Curious why we chose this default value. Is it meant to be the same as spark.sql.adaptive.shuffle.targetPostShuffleInputSize?

The main idea is that the default skew join size is 256MB and the local map should be 3x smaller than the other side (following the existing formula). So assume the local map size is 64MB and the other side is 192MB.
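
The arithmetic behind that default can be checked with a small sketch. It assumes one reading of the reply above: the two sides together add up to the 256MB skew join size, and the other side is 3x the local map. All names here are made up for illustration.

```scala
object DefaultThresholdArithmetic {
  val MB: Long = 1024L * 1024L

  // Default skew join size cited in the reply above.
  val skewThreshold: Long = 256 * MB

  // The other side is assumed to be 3x the local map (the existing formula).
  val ratio: Long = 3

  // localMap + ratio * localMap = skewThreshold
  //   => localMap = skewThreshold / (ratio + 1)
  val localMapSize: Long = skewThreshold / (ratio + 1)
  val otherSideSize: Long = ratio * localMapSize

  def main(args: Array[String]): Unit = {
    // prints "local map: 64MB, other side: 192MB"
    println(s"local map: ${localMapSize / MB}MB, other side: ${otherSideSize / MB}MB")
  }
}
```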

c21 · 2021-05-06T06:45:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+    isRuntime: Boolean = false,
+    mapOutputStatistics: Option[MapOutputStatistics] = None) {


I feel it's a bit weird that Statistics has a MapOutputStatistics field: MapOutputStatistics is specific to the physical shuffle operator, but Statistics is for all logical operators. Maybe we can have:

RunTimeStatsSpec(
  isRuntime: Boolean,
  sizeInBytesPerPartition: Option[Array[Long]]
)

Statistics(
  runTimeStatsSpec: Option[RunTimeStatsSpec] = None,
  ...
)

I had a similar thought, and the change you point out seems like a better approach. What do you think, @maropu @cloud-fan?

FYI, we took another approach to support SHJ in AQE. We added a rule in AdaptiveSparkPlanExec to convert SMJ to SHJ according to shuffle stats, which requires no changes in Statistics.scala because the statistics are already available in ShuffleStageInfo.

The SMJ could also be converted to SHJ where applicable, even if PREFER_SORTMERGEJOIN is set. cc @Liulietong

cc @luuliietong

We added a rule in AdaptiveSparkPlanExec to convert SMJ to SHJ according to shuffle stats

This looks like a better idea. Do you want to open a PR for it?

SparkQA · 2021-05-06T10:36:44Z

Test build #138192 has finished for PR 32450 at commit 0766487.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-05-07T00:41:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala

-   * Note: this assume that the number of partition is fixed, requires additional work if it's
-   * dynamic.
+   * In AQE framework, we use runtime statistics to check if we can build local map. Only if
+   * all the partition size not large than `ADAPTIVE_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD`,


size not large -> size is not larger

dongjoon-hyun · 2021-05-07T00:41:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala

-   * dynamic.
+   * In AQE framework, we use runtime statistics to check if we can build local map. Only if
+   * all the partition size not large than `ADAPTIVE_SHUFFLE_HASH_JOIN_LOCAL_MAP_THRESHOLD`,
+   * we allow to build local hash map.


build local -> build a local

SparkQA · 2021-05-07T11:50:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42772/

SparkQA · 2021-05-07T11:54:42Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42772/

SparkQA · 2021-05-07T15:47:54Z

Test build #138250 has finished for PR 32450 at commit a706472.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-17T09:34:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43140/

SparkQA · 2021-05-17T09:34:03Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43140/

SparkQA · 2021-05-17T13:30:16Z

Test build #138620 has finished for PR 32450 at commit e8283a2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Support AQE side shuffled hash join formula

0766487

github-actions bot added the SQL label May 6, 2021

c21 reviewed May 6, 2021

dongjoon-hyun reviewed May 7, 2021

comment

a706472

disable conversion by default

e8283a2

ulysses-you closed this May 27, 2021

ulysses-you deleted the SPARK-35282 branch November 22, 2021 12:28
