
[SPARK-35703][SQL][FOLLOWUP] Only eliminate shuffles if partition keys contain all the join keys #35138

Closed
wants to merge 5 commits into master from cloud-fan/join

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

This is a followup of #32875 . Basically #32875 did two improvements:

  1. allow bucket join even if the bucket hash function is different from Spark's shuffle hash function
  2. allow bucket join even if the hash partition keys are a subset of the join keys.

The first improvement is the major target of the SPIP "storage partition join". The second improvement is a consequence of the framework refactor and was not planned.

This PR disables the second improvement by default, as it may introduce a perf regression when there is data skew and no shuffle to spread it. We need more design work, such as checking the NDV (number of distinct values), before enabling this improvement.
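The skew risk can be illustrated without Spark. The following is a hypothetical sketch (plain Scala; `part` stands in for a hash partitioner, and the data and names are made up, not from the patch) showing that clustering on only a subset of the join keys concentrates rows on one partition when that key is skewed:

```scala
object SkewSketch {
  // Toy stand-in for a hash partitioner: maps any key to [0, n).
  private def part(key: Any, n: Int): Int = ((key.hashCode % n) + n) % n

  // Returns (max partition size when clustered by `a` only,
  //          max partition size when clustered by all join keys (a, b)).
  def maxPartitionSizes(rows: Seq[(Int, Int)], n: Int): (Int, Int) = {
    val byA  = rows.groupBy { case (a, _) => part(a, n) }
    val byAB = rows.groupBy { case (a, b) => part((a, b), n) }
    (byA.values.map(_.size).max, byAB.values.map(_.size).max)
  }

  def main(args: Array[String]): Unit = {
    // Column `a` is heavily skewed toward 1; column `b` varies freely.
    val rows = Seq.tabulate(1000)(i => (if (i < 990) 1 else i, i))
    val (skewed, spread) = maxPartitionSizes(rows, 4)
    println(s"clustered by a alone: max=$skewed; by (a, b): max=$spread")
  }
}
```

With a shuffle on all the join keys, the skewed `a` values are spread across partitions by `b`; without it, one task processes almost all the rows.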

Why are the changes needed?

Avoid a perf regression.

Does this PR introduce any user-facing change?

no

How was this patch tested?

@github-actions github-actions bot added the SQL label Jan 7, 2022
@@ -1,90 +1,97 @@
TakeOrderedAndProject [i_item_id,i_item_desc,s_state,store_sales_quantitycount,store_sales_quantityave,store_sales_quantitystdev,store_sales_quantitycov,as_store_returns_quantitycount,as_store_returns_quantityave,as_store_returns_quantitystdev,store_returns_quantitycov,catalog_sales_quantitycount,catalog_sales_quantityave,catalog_sales_quantitystdev,catalog_sales_quantitycov]
@cloud-fan (Contributor Author) commented Jan 7, 2022

I checked it locally. Now the plan golden files are exactly the same as the ones before #32875.

@cloud-fan (Contributor Author)

cc @sunchao @c21

// will add shuffles with the default partitioning of `ClusteredDistribution`, which uses all
// the join keys.
if (SQLConf.get.getConf(SQLConf.REQUIRE_ALL_JOIN_KEYS_AS_PARTITION_KEYS)) {
distribution.clustering.forall(x => partitioning.expressions.exists(_.semanticEquals(x)))
c21 (Contributor)

Do we need to require partitioning.expressions to be exactly the same as distribution.clustering as well? E.g. for the following cases:

  1. partitioning.expressions: [a, b]; distribution.clustering: [b, a]
  2. partitioning.expressions: [a, b, a]; distribution.clustering: [a, b]
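For illustration, a tiny standalone sketch (plain Scala, with `==` on strings standing in for `semanticEquals` on expressions; not the actual Catalyst code) confirms that the subset-style `forall`/`exists` check accepts both cases above:

```scala
object SubsetCheckSketch {
  // Mirrors the shape of the check under discussion: every clustering key
  // must appear somewhere among the partition expressions.
  def subsetCheck(clustering: Seq[String], partExprs: Seq[String]): Boolean =
    clustering.forall(c => partExprs.exists(_ == c))

  def main(args: Array[String]): Unit = {
    // Case 1: reordered keys still pass.
    println(subsetCheck(Seq("b", "a"), Seq("a", "b")))
    // Case 2: duplicated partition expressions still pass.
    println(subsetCheck(Seq("a", "b"), Seq("a", "b", "a")))
  }
}
```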

cloud-fan (Contributor Author)

Good point. To fully restore the previous behavior, we should require an exact match, though I think the current change already covers the data skew issues.

I'll make the change to be conservative.
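A conservative exact-match check might look like the following sketch (again plain Scala with `==` in place of `semanticEquals`; an illustration, not the merged code). It rejects both the reordered and the duplicated cases:

```scala
object ExactMatchSketch {
  // Requires the partition expressions to match the clustering keys
  // pairwise and in order, with no extra or repeated expressions.
  def exactMatch(clustering: Seq[String], partExprs: Seq[String]): Boolean =
    partExprs.length == clustering.length &&
      partExprs.zip(clustering).forall { case (p, c) => p == c }

  def main(args: Array[String]): Unit = {
    println(exactMatch(Seq("a", "b"), Seq("a", "b")))      // true
    println(exactMatch(Seq("b", "a"), Seq("a", "b")))      // false: reordered
    println(exactMatch(Seq("a", "b"), Seq("a", "b", "a"))) // false: duplicated
  }
}
```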

sunchao (Member)

Yes, the case where the partitioning expressions are a superset of the distribution clustering should already be rejected by HashPartitioning#satisfies, but it may be better to make it more explicit here.

@@ -396,6 +396,16 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val REQUIRE_ALL_JOIN_KEYS_AS_PARTITION_KEYS =
buildConf("spark.sql.join.requireAllJoinKeysAsPartitionKeys")
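For context, a conf declared at this spot in `SQLConf` typically carries a doc string, a version, and a default. A hedged sketch of how the full declaration might read, extrapolated from the visible `.booleanConf` / `.createWithDefault(true)` lines (the doc text and version here are illustrative, not quoted from the patch):

```scala
val REQUIRE_ALL_JOIN_KEYS_AS_PARTITION_KEYS =
  buildConf("spark.sql.join.requireAllJoinKeysAsPartitionKeys")
    .doc("When true, only eliminate shuffles if the partition keys of the " +
      "children contain all the join keys, to avoid skew when joining on " +
      "more keys than the data is partitioned by.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(true)
```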
c21 (Contributor)

Would this config take effect for all physical operators that have 2 children and require ClusteredDistribution, e.g. CoGroupExec?

cloud-fan (Contributor Author)

Yes, and they used HashClusteredDistribution before.

c21 (Contributor)

Then should we make the config name more general?

cloud-fan (Contributor Author)

How about spark.sql.requireAllClusterKeysAsPartitionKeysToCoPartition?

c21 (Contributor)

SGTM!

sunchao (Member)

I'm slightly inclined toward a shorter name like spark.sql.join.requireAllJoinKeysAsPartitionKeys, but that's just personal flavor. I'd also suggest something like spark.sql.enableStrictShuffleKeysCheck, but it's up to you.

@sunchao (Member) left a comment

Looks good to me. @cloud-fan did you actually observe the perf regression in real Spark jobs? Just curious whether it'll be very common when this config is disabled.


@cloud-fan (Contributor Author)

@sunchao I haven't tried it on real workloads yet, but it's pretty obvious that we can construct a query with certain input data to expose this regression.

@c21 (Contributor) left a comment

LGTM

@sunchao (Member) left a comment

@cloud-fan could you fix the test failure? It looks like org.apache.spark.sql.catalyst.ShuffleSpecSuite is failing. LGTM after the test failures are addressed. I also compared the golden files with this PR and #32875 combined, and there are no changes, which is expected.


c21 commented Jan 11, 2022

I support disabling this feature by default to be safe, but two cents from our production experience: we have enabled the same feature (avoid shuffle if bucket keys are a subset of the join keys) by default in production for several years and didn't see many data skew issues. Our workload is not representative of the industry, but it offers an observation from one large-scale environment.


sunchao commented Jan 11, 2022

Thanks @c21! This is a good data point. We're also planning to evaluate this feature in production jobs.

cloud-fan (Contributor Author)

I think we can still roll out this optimization later, with some heuristics to avoid bad cases. We just need more time to evaluate and do experiments.


sunchao commented Jan 11, 2022

Hmm org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite also failed.


sunchao commented Jan 12, 2022

Oops, we missed another one: "SPARK-27485: EnsureRequirements.reorder should handle duplicate expressions" in PlannerSuite.

cloud-fan (Contributor Author)

Thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 4b4ff4b Jan 13, 2022
dchvn pushed a commit to dchvn/spark that referenced this pull request Jan 19, 2022
[SPARK-35703][SQL][FOLLOWUP] Only eliminate shuffles if partition keys contain all the join keys

Closes apache#35138 from cloud-fan/join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>