
[SPARK-43021][SQL] CoalesceBucketsInJoin not work when using AQE #40688

Closed
wants to merge 4 commits

Conversation

Contributor

@zzzzming95 commented Apr 6, 2023

What changes were proposed in this pull request?

Add CoalesceBucketsInJoin to AQE preprocessingRules.

Why are the changes needed?

Bucket joins were previously optimized by `CoalesceBucketsInJoin`: #28123

But when using AQE, `CoalesceBucketsInJoin` cannot match because the top of the Spark plan is `AdaptiveSparkPlan`.

The code to reproduce:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BucketJoin")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", true)
  .config("spark.driver.memory", "4g")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", true)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", "k")
df1.write.format("parquet").bucketBy(4, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(2, "i").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, t1("i") === t2("i"))
joined.explain()

Before the PR:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [i#50], [i#56], Inner
   :- Sort [i#50 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(i#50)
   :     +- FileScan parquet spark_catalog.default.t1[i#50,j#51,k#52] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4
   +- Sort [i#56 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(i#56, 4), ENSURE_REQUIREMENTS, [plan_id=78]
         +- Filter isnotnull(i#56)
            +- FileScan parquet spark_catalog.default.t2[i#56,j#57,k#58] Batched: true, Bucketed: false (disabled by query planner), DataFilters: [isnotnull(i#56)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>

After the PR:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [i#50], [i#56], Inner
   :- Sort [i#50 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(i#50)
   :     +- FileScan parquet spark_catalog.default.t1[i#50,j#51,k#52] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4 (Coalesced to 2)
   +- Sort [i#56 ASC NULLS FIRST], false, 0
      +- Filter isnotnull(i#56)
         +- FileScan parquet spark_catalog.default.t2[i#56,j#57,k#58] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#56)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 2 out of 2

Additional Notes:

We don't add CoalesceBucketsInJoin to AdaptiveSparkPlanExec#queryStageOptimizerRules because queryStageOptimizerRules is not applied to the initial plan; those rules are applied in the createQueryStages() method. Since createQueryStages() works bottom-up, the exchange that should be eliminated is wrapped in a ShuffleQueryStage first, which makes the join shape unrecognizable to CoalesceBucketsInJoin.
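To make this concrete, here is a self-contained toy model in plain Scala (illustrative only; these case classes are not Spark's, and the real rule matches much richer plans) showing how bottom-up stage wrapping hides the shape the rule looks for:

sealed trait Plan
case class Scan(numBuckets: Int) extends Plan
case class Exchange(child: Plan) extends Plan
case class QueryStage(child: Plan) extends Plan
case class Join(left: Plan, right: Plan) extends Plan

// A rule in the spirit of CoalesceBucketsInJoin: it only fires on the
// pre-stage shape, i.e. a join over two bucketed scans where one side
// still carries a bare exchange.
def coalesceBuckets(plan: Plan): Plan = plan match {
  case Join(Scan(l), Exchange(Scan(r))) if l != r =>
    val n = math.min(l, r)
    Join(Scan(n), Scan(n)) // coalesce buckets; the shuffle becomes unnecessary
  case other => other
}

val initialPlan = Join(Scan(4), Exchange(Scan(2)))
val afterStageCreation = Join(Scan(4), QueryStage(Exchange(Scan(2))))

assert(coalesceBuckets(initialPlan) == Join(Scan(2), Scan(2))) // rule fires
assert(coalesceBuckets(afterStageCreation) == afterStageCreation) // shape hidden, no match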

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a UT.
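As a rough illustration (a sketch only, assuming the repro above has run; optionalNumCoalescedBuckets is the FileSourceScanExec field the suite's assertion quoted later in this thread also uses), the property the UT checks looks like this:

import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// With AQE enabled the root node is AdaptiveSparkPlanExec, so unwrap it
// before collecting the file scans.
val scans = joined.queryExecution.executedPlan match {
  case a: AdaptiveSparkPlanExec =>
    a.executedPlan.collect { case s: FileSourceScanExec => s }
  case p =>
    p.collect { case s: FileSourceScanExec => s }
}

// The 4-bucket side should now report its buckets coalesced down to 2.
assert(scans.exists(_.optionalNumCoalescedBuckets == Some(2)))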

@github-actions bot added the SQL label Apr 6, 2023
@dongjoon-hyun changed the title from [SPARK-43021] CoalesceBucketsInJoin not work when using AQE to [SPARK-43021][SQL] CoalesceBucketsInJoin not work when using AQE Apr 7, 2023
@dongjoon-hyun
Member

cc @imback82, @cloud-fan, @viirya, @sunchao

Member

@viirya left a comment

We may need to add a test case.

@zzzzming95
Contributor Author

We may need to add a test case.

Yeah, I will add a UT later.

@zzzzming95
Contributor Author

One more question: is it time to make the default value of SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED true?

@dongjoon-hyun
Member

Maybe, no? If this was not working properly before, we cannot enable this configuration in Apache Spark 3.5.0. Since we need to wait for one release cycle, we may be able to do that in Apache Spark 3.6.0 if we want.

One more question: is it time to make the default value of SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED true?

@zzzzming95
Contributor Author

Maybe, no? If this was not working properly before, we cannot enable this configuration in Apache Spark 3.5.0. Since we need to wait for one release cycle, we may be able to do that in Apache Spark 3.6.0 if we want.

One more question: is it time to make the default value of SQLConf.COALESCE_BUCKETS_IN_JOIN_ENABLED true?

Yes, this is the more logical way.

@zzzzming95
Contributor Author

The CI build failure doesn't seem to be caused by this patch; can you take a look?

@dongjoon-hyun @viirya

@dongjoon-hyun
Member

Please rebase to the master branch once more, @zzzzming95 .

The CI build failure doesn't seem to be caused by this patch; can you take a look?

@@ -118,6 +118,7 @@ case class AdaptiveSparkPlanExec(
     val ensureRequirements =
       EnsureRequirements(requiredDistribution.isDefined, requiredDistribution)
     Seq(
+      CoalesceBucketsInJoin,
Contributor

Shall we put it in queryStageOptimizerRules?

Contributor

Rules in queryStageOptimizerRules are invoked less often, which is more efficient. CoalesceBucketsInJoin does not change the plan's partitioning, so it seems it could be put in queryStageOptimizerRules.

Contributor Author

In my test, the UT failed if CoalesceBucketsInJoin was added to queryStageOptimizerRules.

Contributor

@cloud-fan commented Apr 10, 2023

Can we spend a bit of time understanding why? Then we can write a code comment to explain it, so future developers won't ever try to move this rule to queryStageOptimizerRules.

Contributor Author

Yeah, I will provide detailed information and add it to the PR description.

Contributor Author

Because queryStageOptimizerRules is not applied to the initial plan; those rules are applied in the createQueryStages() method. Since createQueryStages() works bottom-up, the exchange that should be eliminated is wrapped in a ShuffleQueryStage first, which makes the join shape unrecognizable to CoalesceBucketsInJoin. I have added this to the notes at the top. Thanks @cloud-fan

Member

CoalesceBucketsInJoin should run before EnsureRequirements.

@zzzzming95
Contributor Author

@cloud-fan @dongjoon-hyun @viirya Please merge to master. Thanks ~

@dongjoon-hyun
Member

To be clear, this PR didn't get any approval yet, @zzzzming95.

Please merge to master. Thanks ~

      assert(scans.head.optionalNumCoalescedBuckets == expectedCoalescedNumBuckets)
    } else {
      assert(scans.isEmpty)

      query: String,
Contributor

nit: the indentation is wrong now, can we restore to 4 spaces as before?

Contributor Author

https://github.com/apache/spark/pull/40731/files#diff-1dd0d5a38f73f2993e5852f759a3934396c083d4fc4cc334e73ccc8eb929a717R1013

The original DisableAdaptiveExecution logic of this UT is removed here; the current implementation covers both the AQE and non-AQE cases.
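For illustration, a sketch of that "covers both" pattern (runAndVerify is a hypothetical stand-in for the suite's real assertions; withSQLConf comes from Spark's test helpers): instead of tagging the test with DisableAdaptiveExecution, the same verification runs under both AQE settings.

import org.apache.spark.sql.internal.SQLConf

// Run the same assertions with AQE on and off rather than maintaining a
// separate DisableAdaptiveExecution-tagged test.
Seq("true", "false").foreach { aqe =>
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe) {
    runAndVerify(query, expectedNumShuffles, expectedCoalescedNumBuckets) // hypothetical helper
  }
}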

Comment on lines 1022 to 1024

      query: String,
      expectedNumShuffles: Int,
      expectedCoalescedNumBuckets: Option[Int]): Unit = {

Member

Suggested change:

-      query: String,
-      expectedNumShuffles: Int,
-      expectedCoalescedNumBuckets: Option[Int]): Unit = {
+    query: String,
+    expectedNumShuffles: Int,
+    expectedCoalescedNumBuckets: Option[Int]): Unit = {

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a43a6b3 Apr 14, 2023
@neshkeev
Contributor

I was the OP of the issue in JIRA.

Thank you for the fix, but I discovered weird behavior when hints are applied, and I don't know how to interpret it. Please check SPARK-43326, which I filed.

@zzzzming95
Contributor Author

I was the OP of the issue in JIRA.

Thank you for the fix, but I discovered weird behavior when hints are applied, and I don't know how to interpret it. Please check SPARK-43326, which I filed.

Okay, I will follow up on this issue.
