[SPARK-28560][SQL] Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution #25295
Conversation
...ore/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizedLocalShuffleReader.scala
...core/src/main/scala/org/apache/spark/sql/execution/adaptive/ReduceNumShufflePartitions.scala
We have run functionality and performance tests on 3TB TPC-DS, and the results are shown here. Q82 shows a 1.76x performance improvement with this PR, and no queries show significant performance degradation.
Fixed the conflicts.
@cloud-fan Could you help review when you have time? Thanks very much for your help.
Should this be a general optimization? When a reduce task needs to read some shuffle blocks that happen to exist locally, we can read the shuffle files directly instead of going through the shuffle service.
@cloud-fan Thanks for your review! When the shuffle blocks exist locally, ShuffleBlockFetcherIterator already reads them locally even when going through the shuffle service, I think. Correct me if my understanding is wrong! If so, do we still need to optimize for local reads?
You are right, local shuffle blocks are already optimized. I meant the host-local shuffle blocks. Anyway, it seems not very related to what you are trying to do here.
@@ -180,25 +180,45 @@ case class ReduceNumShufflePartitions(conf: SQLConf) extends Rule[SparkPlan] {
 case class CoalescedShuffleReaderExec(
     child: QueryStageExec,
-    partitionStartIndices: Array[Int]) extends UnaryExecNode {
+    partitionStartIndices: Array[Int],
+    var isLocal: Boolean = false) extends UnaryExecNode {
`ReduceNumShufflePartitions` and local shuffle reader are two different optimizations, and they conflict: `ReduceNumShufflePartitions` adjusts the numPartitions by assuming the partitions are post-shuffle partitions, whose data size depends on the shuffle blocks they need to read. If we change the shuffle to a local shuffle reader, the partitions become pre-shuffle partitions, and their data size is different.
> If we change the shuffle to local shuffle reader, then the partitions become pre-shuffle partitions, and their data size is different.

@cloud-fan Here the local shuffle reader still optimizes the post-shuffle partitions. I don't understand why the partitions become pre-shuffle partitions?
Without the local shuffle reader, a task of `ShuffledRDD` reads the shuffle blocks map1-reduce1, map2-reduce1, etc. With the local shuffle reader, the task reads map1-reduce1, map1-reduce2, etc. The task output data size is different, so we can't use the algorithm in `ReduceNumShufflePartitions` anymore.

Furthermore, the RDD numPartitions also becomes different after switching to the local shuffle reader, so how can we apply `ReduceNumShufflePartitions`?
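To make the difference concrete, here is a small standalone sketch (plain Scala; the counts and names are illustrative, not Spark APIs): with M map tasks and R reduce partitions, the regular shuffle reader launches R tasks, each reading one reduce partition from all M mappers, while the local shuffle reader launches M tasks, each reading all R blocks written by one mapper.

```scala
// Illustrative only -- these values and names are not Spark APIs.
val numMappers = 4   // pre-shuffle (map-side) partitions
val numReducers = 10 // post-shuffle (reduce-side) partitions

// Regular shuffle reader: one task per reduce partition;
// task k reads blocks map1-reduceK, map2-reduceK, ..., mapM-reduceK.
val regularReaderTasks = numReducers

// Local shuffle reader: one task per mapper;
// task k reads blocks mapK-reduce1, mapK-reduce2, ..., mapK-reduceR.
val localReaderTasks = numMappers

println(s"regular: $regularReaderTasks tasks, local: $localReaderTasks tasks")
```

This is why a partition-coalescing rule sized for R reduce partitions no longer applies once the task count and per-task data sizes follow the M mappers instead.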
OK, got it. To make the code clearer, I will create `LocalShuffleReaderExec` later. Thanks.
I think this is a good idea, but the implementation needs more polishing. What I expect to see is:
I'm a little worried about the invasive changes in the underlying shuffle component. Can you briefly explain how your special
BTW, the local shuffle reader doesn't need to cooperate with other shuffle nodes in the same shuffle stage, so we can adjust the numPartitions of the local shuffle reader freely. This can be a followup.
    var dependency: ShuffleDependency[Int, InternalRow, InternalRow],
    metrics: Map[String, SQLMetric],
    specifiedPartitionStartIndices: Option[Array[Int]] = None,
    specifiedPartitionEndIndices: Option[Array[Int]] = None)
I don't see a usage of specifiedPartitionEndIndices in the current change. Do we need it?
@viirya Currently not. We may need the `specifiedPartitionEndIndices` variable to skip partitions with 0 size in a follow-up optimization, and I will retain and use it when creating `LocalShuffledRowRDD` later.
Then let's add it when you propose that optimization.

From my side, I think it may be beneficial to keep empty tasks, so that the local shuffle reader node can retain the output partitioning of the original plan and help us eliminate shuffles.
@@ -180,25 +180,45 @@ case class ReduceNumShufflePartitions(conf: SQLConf) extends Rule[SparkPlan] {
 case class CoalescedShuffleReaderExec(
     child: QueryStageExec,
-    partitionStartIndices: Array[Int]) extends UnaryExecNode {
+    partitionStartIndices: Array[Int],
+    var isLocal: Boolean = false) extends UnaryExecNode {

   override def output: Seq[Attribute] = child.output

   override def doCanonicalize(): SparkPlan = child.canonicalized

   override def outputPartitioning: Partitioning = {
Don't we need to override requiredChildDistribution if isLocal is true?

I saw you check whether additional shuffle exchanges are added by EnsureRequirements to decide whether the local shuffle reader works. If we don't change requiredChildDistribution, will EnsureRequirements add additional shuffle exchanges?

Maybe I'm missing something here?
@viirya Maybe we don't need to override `requiredChildDistribution`. Because the `requiredChildDistribution` of `CoalescedShuffleReaderExec` is `UnspecifiedDistribution` whether `isLocal` is `true` or `false`, `EnsureRequirements` will not introduce an additional shuffle exchange.
Er, don't you rely on checking whether EnsureRequirements introduces additional shuffle exchanges to decide whether to do the local shuffle reader?
Yes, I do.
import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec}
import org.apache.spark.sql.internal.SQLConf

case class OptimizedLocalShuffleReader(conf: SQLConf) extends Rule[SparkPlan] {
nit: this should be a verb, OptimizeLocalShuffleReader
private def setIsLocalToFalse(shuffleStage: QueryStageExec): QueryStageExec = {
  shuffleStage match {
    case stage: ShuffleQueryStageExec =>
      stage.isLocalShuffle = false
If possible, let's not add mutable state to the plan.
}

// Add the new `LocalShuffleReaderExec` node if the value of `isLocalShuffle` is true
val newPlan = plan.transformUp {
Why don't we traverse the tree once?

def isShuffleStage(plan: SparkPlan): Boolean = plan match {
  case _: ShuffleQueryStageExec => true
  case ReusedQueryStageExec(_: ShuffleQueryStageExec) => true
  case _ => false
}

def canUseLocalShuffleReaderLeft(j: BroadcastHashJoinExec): Boolean = {
  j.buildSide == BuildLeft && isShuffleStage(j.left)
}

def canUseLocalShuffleReaderRight ...
...

plan transformDown {
  case join: BroadcastHashJoinExec if canUseLocalShuffleReaderLeft(join) =>
    val localShuffleReader = ...
    join.copy(left = localShuffleReader)
  ...
}
val newPlan = plan.transformUp {
  case stage: ShuffleQueryStageExec if (stage.isLocalShuffle) =>
    LocalShuffleReaderExec(stage)
  case ReusedQueryStageExec(_, stage: ShuffleQueryStageExec, _) if (stage.isLocalShuffle) =>
let's not strip the ReusedQueryStageExec
@@ -91,6 +91,7 @@ case class AdaptiveSparkPlanExec(
   // optimizations should be stage-independent.
   @transient private val queryStageOptimizerRules: Seq[Rule[SparkPlan]] = Seq(
     ReuseAdaptiveSubquery(conf, subqueryCache),
+    OptimizedLocalShuffleReader(conf),
Since this may change the number of exchanges, we should put it in queryStagePreparationRules. Then the AQE framework can check the cost and give up the optimization if extra exchanges are introduced.

Note that the current approach (checking the number of exchanges at the end of the rule) is suboptimal. It's possible that the local shuffle reader can avoid exchanges downstream, which changes the stage boundaries.
Already moved it into queryStagePreparationRules.
override def doCanonicalize(): SparkPlan = child.canonicalized

override def outputPartitioning: Partitioning = {
Shouldn't this be child.outputPartitioning?
Here, for the local shuffle reader, the partition number of the Partitioning is the number of mappers, while the partition number of child.outputPartitioning is the number of reducers. So how can it be child.outputPartitioning?
Sorry, I meant child.child.outputPartitioning.
Here the child.child.outputPartitioning is UnknownPartitioning(0), and the partition number is not equal.
}

case class LocalShuffleReaderExec(
    child: QueryStageExec) extends UnaryExecNode {
We can make it a leaf node to hide its `QueryStageExec`. We don't expect any other rules to change the underlying shuffle stage.
Good suggestion. Updated.
extends RDD[InternalRow](dependency.rdd.context, Nil) {

  private[this] val numPreShufflePartitions = dependency.partitioner.numPartitions
  private[this] val numPostShufflePartitions = dependency.rdd.partitions.length
The name is wrong. This is the number of mappers and thus should be called numPreShufflePartitions.
Here, for `LocalShuffledRowRDD`, the number of mappers is the number of post-shuffle partitions.
Hmm, which "shuffle" are we talking about here? If the number of mappers is the number of post-shuffle partitions, then these mappers are also reducers and there is another shuffle upstream.
 * @param dep shuffle dependency object
 * @param startMapId the start map id
 * @param endMapId the end map id
 * @return a sequence of locations that each includes both a host and an executor id on that
"includes both a host and an executor id" is confusing. We can just say task location string (please refer to TaskLocation).
val sqlMetricsReporter = new SQLShuffleReadMetricsReporter(tempMetrics, metrics)
// Connect the InternalRows read by each ShuffleReader
new Iterator[InternalRow] {
  val readers = partitionStartIndices.zip(partitionEndIndices).map { case (start, end) =>
I get your point that some shuffle blocks are empty and we should skip them, but I think this optimization should be done by the shuffle implementation. What we need to do here is simply ask the shuffle implementation to read the data for one mapper, with a simple API (e.g. getMapReader(handle, mapId, ...)).
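For illustration, a minimal sketch of what such a one-mapper read API could look like — the trait name, the signatures, and the in-memory implementation below are all hypothetical, not the real ShuffleManager interface:

```scala
// Hypothetical sketch of the suggested API -- names and signatures are
// illustrative, not the real org.apache.spark.shuffle.ShuffleManager.
trait MapReaderSketch[K, C] {
  // Returns a reader over all blocks produced by a single map task,
  // so skipping empty blocks becomes the shuffle implementation's job.
  def getMapReader(shuffleId: Int, mapId: Int): Iterator[(K, C)]
}

// A toy in-memory implementation, keyed by (shuffleId, mapId), to show usage.
class InMemoryMapReader(blocks: Map[(Int, Int), Seq[(Int, String)]])
    extends MapReaderSketch[Int, String] {
  def getMapReader(shuffleId: Int, mapId: Int): Iterator[(Int, String)] =
    blocks.getOrElse((shuffleId, mapId), Seq.empty).iterator // empty blocks yield an empty iterator
}

val reader = new InMemoryMapReader(Map((0, 1) -> Seq(1 -> "a", 2 -> "b")))
println(reader.getMapReader(0, 1).size)  // 2
println(reader.getMapReader(0, 99).size) // 0
```

With an API of this shape, the reader node only asks for one mapper's data, and the decision to skip empty blocks stays inside the shuffle implementation.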
@cloud-fan Sorry for the delayed response. I have resolved the comments. Please help review again. Thanks.
retest this please
    handle.shuffleId,
    startPartition,
    endPartition)
case (_) => throw new IllegalArgumentException(
nit:

case Some(..) =>
case None =>
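For illustration, the suggested style matches both Option cases explicitly instead of using a catch-all `case (_)`; the function below is a made-up example, not code from this PR:

```scala
// Illustrative only: exhaustive Option matching, as the review nit suggests.
def describeStart(startMapId: Option[Int]): String = startMapId match {
  case Some(id) => s"read from map $id"
  case None     => "no start map id given"
}

println(describeStart(Some(3))) // read from map 3
println(describeStart(None))    // no start map id given
```

Matching Some/None exhaustively lets the compiler warn about unhandled cases, which a wildcard pattern silently swallows.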
There are still 2 code style comments not addressed. I'll merge this PR if tests pass, and we can address the code style comments in a followup.
Test build #112096 has finished for PR 25295 at commit
thanks, merging to master!
### What changes were proposed in this pull request?

A followup of #25295. This PR proposes a few code cleanups:
1. Rename the special `getMapSizesByExecutorId` to `getMapSizesByMapIndex`.
2. Rename the parameter `mapId` to `mapIndex`, as it is really a mapper index.
3. `BlockStoreShuffleReader` should take `blocksByAddress` directly instead of a map id.
4. Rename `getMapReader` to `getReaderForOneMapper` to be clearer.

### Why are the changes needed?

Make the code easier to understand.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26128 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
val numExchangeAfter = numExchanges(EnsureRequirements(conf).apply(optimizedPlan))

if (numExchangeAfter > numExchangeBefore) {
  logWarning("OptimizeLocalShuffleReader rule is not applied due" +
logDebug should be enough.

Can we do a quick follow-up to address the minor comments here?
### What changes were proposed in this pull request?

A followup of #25295.
1. Change the logWarning to logDebug in `OptimizeLocalShuffleReader`.
2. Update the test to check whether query stage reuse can work well with the local shuffle reader.

### Why are the changes needed?

Make the code robust.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26157 from JkSelf/followup-25295.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  }
}

case class LocalShuffleReaderExec(child: QueryStageExec) extends LeafExecNode {
Any reason to make `LocalShuffleReaderExec` a LeafNode?

There's a potential issue here: we made it a leaf node yet do not visit this node in `createQueryStages`. So a stage can be "not complete yet" but be considered complete, and thus trigger the creation of parent stages. This might be the root cause of the flaky tests.
case q: QueryStageExec =>
CreateStageResult(newPlan = q,
allChildStagesMaterialized = q.resultOption.isDefined, newStages = Seq.empty)
case _ =>
if (plan.children.isEmpty) {
CreateStageResult(newPlan = plan, allChildStagesMaterialized = true, newStages = Seq.empty)
} else {
val results = plan.children.map(createQueryStages)
CreateStageResult(
newPlan = plan.withNewChildren(results.map(_.newPlan)),
allChildStagesMaterialized = results.forall(_.allChildStagesMaterialized),
newStages = results.flatMap(_.newStages))
}
@maryannxue Thanks for your good findings. I have created PR #26250 to fix this.

The flaky tests may not be caused by this, though; they occur because of the random build side chosen by the planner for inner joins. Even with PR #26250, the flaky tests still exist. Thanks.
…reader as far as possible in BroadcastHashJoin

### What changes were proposed in this pull request?

#25295 already implements the rule of converting the shuffle reader to a local reader for the `BroadcastHashJoin` on the probe side. This PR supports converting the shuffle reader to a local reader on the build side as well.

### Why are the changes needed?

Improve performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #26289 from JkSelf/supportTwoSideLocalReader.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Implement a rule in the new adaptive execution framework introduced in SPARK-23128. This rule optimizes the shuffle reader to a local shuffle reader when a sort-merge join (SMJ) is converted to a broadcast hash join (BHJ) in adaptive execution.
How was this patch tested?
Existing tests