
[SPARK-28560][SQL] Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution #25295

Closed
wants to merge 11 commits

Conversation

JkSelf (Contributor) commented Jul 30, 2019

What changes were proposed in this pull request?

Implement a rule in the new adaptive execution framework introduced in SPARK-23128. This rule optimizes a shuffle reader into a local shuffle reader when a sort-merge join (SMJ) is converted to a broadcast hash join (BHJ) during adaptive execution.

How was this patch tested?

Existing tests

JkSelf (Author) commented Aug 5, 2019

We have run functionality and performance tests on 3 TB TPC-DS, and the results are shown here. Q82 shows a 1.76x performance improvement with this PR, and no query shows a significant performance regression.
@carsonwang @cloud-fan can you help review when you have time? Thanks for your help.

JkSelf (Author) commented Sep 11, 2019

fixed the conflicts.

JkSelf (Author) commented Sep 11, 2019

@cloud-fan Can you help review when you have time? Thank you very much for your help.

cloud-fan (Contributor)

Should this be a general optimization? When a reduce task needs to read some shuffle blocks that happen to exist locally, we can read the shuffle files directly instead of going through the shuffle service.

JkSelf (Author) commented Sep 11, 2019

@cloud-fan Thanks for your review! When the shuffle blocks exist locally, I think ShuffleBlockFetcherIterator already reads them locally even when going through the shuffle service. Correct me if my understanding is wrong! If so, do we still need to optimize this into a local read?

cloud-fan (Contributor)

You are right, local shuffle blocks are already optimized. I meant the host-local shuffle blocks. Anyway, it seems not very related to what you are trying to do here.

@@ -180,25 +180,45 @@ case class ReduceNumShufflePartitions(conf: SQLConf) extends Rule[SparkPlan] {

 case class CoalescedShuffleReaderExec(
     child: QueryStageExec,
-    partitionStartIndices: Array[Int]) extends UnaryExecNode {
+    partitionStartIndices: Array[Int],
+    var isLocal: Boolean = false) extends UnaryExecNode {
cloud-fan (Contributor), Sep 11, 2019

ReduceNumShufflePartitions and local shuffle reader are two different optimizations, and they are conflicting:
ReduceNumShufflePartitions adjusts the numPartitions by assuming the partitions are post-shuffle partitions. Their data size depends on the shuffle blocks they need to read. If we change the shuffle to local shuffle reader, then the partitions become pre-shuffle partitions, and their data size is different.

Contributor Author

If we change the shuffle to local shuffle reader, then the partitions become pre-shuffle partitions, and their data size is different.
@cloud-fan The local shuffle reader here still optimizes the post-shuffle partitions. I don't understand why the partitions become pre-shuffle partitions?

Contributor

Without the local shuffle reader, a task of ShuffledRDD reads the shuffle blocks map1-reduce1, map2-reduce1, etc. With the local shuffle reader, the task reads map1-reduce1, map1-reduce2, etc. The task output data size is different, so we can't use the algorithm in ReduceNumShufflePartitions anymore.

Furthermore, the RDD's numPartitions also becomes different after switching to the local shuffle reader, so how can we apply ReduceNumShufflePartitions?
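
To make the contrast above concrete, here is a small illustrative Scala snippet (hypothetical values, not code from the PR) showing which shuffle blocks each task reads in the two modes, assuming 2 mappers and 3 reduce partitions:

// Illustration only: block-read pattern per task, for 2 mappers and 3 reduce partitions.
val numMappers = 2
val numReducers = 3

// Regular shuffle reader: reduce task r fetches block (mapper m, reduce r) from every mapper,
// so its input size is the post-shuffle partition size.
val regularReads = (0 until numReducers).map(r => (0 until numMappers).map(m => (m, r)))
// task 0 -> (0,0), (1,0); task 1 -> (0,1), (1,1); task 2 -> (0,2), (1,2)

// Local shuffle reader: task m reads every reduce block written by mapper m,
// so its input size is that mapper's whole pre-shuffle output.
val localReads = (0 until numMappers).map(m => (0 until numReducers).map(r => (m, r)))
// task 0 -> (0,0), (0,1), (0,2); task 1 -> (1,0), (1,1), (1,2)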

Contributor Author

OK, got it. To make the code clearer, I will create LocalShuffleReaderExec later. Thanks.

cloud-fan (Contributor)

I think this is a good idea, but the implementation needs more polishing. What I expect to see is:

  1. a rule to convert ShuffleQueryStageExec to LocalShuffleReaderExec. This rule must be run before ReduceNumShufflePartitions.
  2. A special ShuffledRDD that reads shuffle blocks grouped by mapId instead of reduceId.

I'm a little worried about the invasive changes in the underlying shuffle component. Can you briefly explain how your special ShuffledRDD is implemented?

cloud-fan (Contributor)

BTW, local shuffle reader doesn't need to cooperate with other shuffle nodes in the same shuffle stage. We can adjust the numPartitions of local shuffle reader freely. This can be a followup.

var dependency: ShuffleDependency[Int, InternalRow, InternalRow],
metrics: Map[String, SQLMetric],
specifiedPartitionStartIndices: Option[Array[Int]] = None,
specifiedPartitionEndIndices: Option[Array[Int]] = None)
Member

I don't see a usage of specifiedPartitionEndIndices in the current change. Do we need it?

Contributor Author

@viirya
Currently not. We may need the specifiedPartitionEndIndices variable to skip partitions with 0 size in a follow-up optimization. I will retain it and use it when creating LocalShuffledRowRDD later.

Contributor

then let's add it when you propose this optimization.

From my side, I think it may be beneficial to keep empty tasks, so that the local shuffle reader node can retain the output partitioning from the original plan and help us eliminate shuffles.

@@ -180,25 +180,45 @@ case class ReduceNumShufflePartitions(conf: SQLConf) extends Rule[SparkPlan] {

 case class CoalescedShuffleReaderExec(
     child: QueryStageExec,
-    partitionStartIndices: Array[Int]) extends UnaryExecNode {
+    partitionStartIndices: Array[Int],
+    var isLocal: Boolean = false) extends UnaryExecNode {

   override def output: Seq[Attribute] = child.output

   override def doCanonicalize(): SparkPlan = child.canonicalized

   override def outputPartitioning: Partitioning = {
Member

Don't we need to override requiredChildDistribution, if isLocal is true?

I saw you check whether EnsureRequirements adds additional shuffle exchanges, to decide whether the local shuffle reader works or not. If we don't change requiredChildDistribution, will EnsureRequirements bring in an additional shuffle exchange?

Maybe I'm missing something here?

Contributor Author

@viirya
Maybe we don't need to override requiredChildDistribution. Because the requiredChildDistribution of CoalescedShuffleReaderExec is UnspecifiedDistribution whether isLocal is true or false, EnsureRequirements will not introduce an additional shuffle exchange.
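
For reference, the behavior relied on here comes from SparkPlan's default requirement, which asks nothing of its children, so EnsureRequirements inserts no exchange below a node that does not override it. Roughly (paraphrased, not a verbatim copy of the Spark source):

// SparkPlan's default: every child may have any distribution.
def requiredChildDistribution: Seq[Distribution] =
  Seq.fill(children.size)(UnspecifiedDistribution)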

Member

Er, don't you rely on seeing whether EnsureRequirements introduces an additional shuffle exchange to decide whether to use the local shuffle reader or not?

Contributor Author

Yes, I do.

JkSelf (Author) commented Sep 12, 2019

@cloud-fan
The special ShuffledRDD is implemented by reading the whole output of one mapper locally, to ensure no data is transferred over the network.
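
As a rough sketch of that idea (simplified and hypothetical, not the PR's LocalShuffledRowRDD), an RDD whose partitions correspond to mappers rather than reducers could look like the following; the readOneMapperOutput callback stands in for whatever shuffle-manager call fetches one mapper's entire output:

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: one RDD partition per mapper output.
class MapperLocalRDD[T: ClassTag](
    sc: SparkContext,
    numMappers: Int,
    readOneMapperOutput: (Int, TaskContext) => Iterator[T]) // assumed reader callback
  extends RDD[T](sc, Nil) {

  override def getPartitions: Array[Partition] =
    Array.tabulate(numMappers)(i => new Partition { override val index: Int = i })

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    // Read every reduce block written by mapper `split.index`. All of those blocks
    // live on the node that ran the mapper, so nothing is fetched over the network.
    readOneMapperOutput(split.index, context)
  }
}

In practice such an RDD would also report preferred locations (the node that produced each mapper's output) so that tasks are scheduled where the data already sits.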

JkSelf (Author) commented Sep 18, 2019

@cloud-fan
Moved the rule that converts the shuffle reader to a local shuffle reader before ReduceNumShufflePartitions.

import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec}
import org.apache.spark.sql.internal.SQLConf

case class OptimizedLocalShuffleReader(conf: SQLConf) extends Rule[SparkPlan] {
Contributor

nit: this should be a verb, OptimizeLocalShuffleReader

private def setIsLocalToFalse(shuffleStage: QueryStageExec): QueryStageExec = {
  shuffleStage match {
    case stage: ShuffleQueryStageExec =>
      stage.isLocalShuffle = false
Contributor

If possible, let's not add mutable state to the plan.

}

// Add the new `LocalShuffleReaderExec` node if the value of `isLocalShuffle` is true
val newPlan = plan.transformUp {
Contributor

Why don't we traverse the tree once?

def isShuffleStage(plan: SparkPlan): Boolean = plan match {
  case _: ShuffleQueryStageExec => true
  case ReusedQueryStageExec(_: ShuffleQueryStageExec) => true
  case _ => false
}

def canUseLocalShuffleReaderLeft(j: BroadcastHashJoinExec): Boolean = {
  j.buildSide == BuildLeft && isShuffleStage(j.left)
}

def canUseLocalShuffleReaderRight ...
...
plan transformDown {
  case join: BroadcastHashJoinExec if canUseLocalShuffleReaderLeft(join) =>
    val localShuffleReader = ...
    join.copy(left = localShuffleReader)

  ...
}

val newPlan = plan.transformUp {
  case stage: ShuffleQueryStageExec if (stage.isLocalShuffle) =>
    LocalShuffleReaderExec(stage)
  case ReusedQueryStageExec(_, stage: ShuffleQueryStageExec, _) if (stage.isLocalShuffle) =>
Contributor

let's not strip the ReusedQueryStageExec

@@ -91,6 +91,7 @@ case class AdaptiveSparkPlanExec(
   // optimizations should be stage-independent.
   @transient private val queryStageOptimizerRules: Seq[Rule[SparkPlan]] = Seq(
     ReuseAdaptiveSubquery(conf, subqueryCache),
+    OptimizedLocalShuffleReader(conf),
cloud-fan (Contributor), Sep 18, 2019

Since this may change the number of exchanges, we should put it in queryStagePreparationRules. Then the AQE framework can check the cost and give up the optimization if extra exchanges are introduced.

Note that the current approach (checking the number of exchanges at the end of the rule) is suboptimal. It's possible that the local shuffle reader can avoid exchanges downstream, which changes the stage boundaries.

Contributor Author

Already moved it into queryStagePreparationRules.


override def doCanonicalize(): SparkPlan = child.canonicalized

override def outputPartitioning: Partitioning = {
Contributor

shouldn't this be child.outputPartitioning?

Contributor Author

Here, for the local shuffle reader, the partition number of the Partitioning is the number of mappers, while the partition number of child.outputPartitioning is the number of reducers. So how can it be child.outputPartitioning?

Contributor

Sorry, I meant child.child.outputPartitioning.

Contributor Author

Here child.child.outputPartitioning is UnknownPartitioning(0), and the partition number is not equal.

}

case class LocalShuffleReaderExec(
child: QueryStageExec) extends UnaryExecNode {
Contributor

We can make it a leaf node to hide its QueryStageExec. We don't expect any other rules to change the underlying shuffle stage.

Contributor Author

Good suggestions. Updated.

extends RDD[InternalRow](dependency.rdd.context, Nil) {

private[this] val numPreShufflePartitions = dependency.partitioner.numPartitions
private[this] val numPostShufflePartitions = dependency.rdd.partitions.length
Contributor

The name is wrong. This is the # of mappers and thus should be called numPreShufflePartitions

Contributor Author

Here, for the LocalShuffledRowRDD, the number of mappers is the number of post-shuffle partitions.

Contributor

Hmm, which "shuffle" are we talking about here? If the number of mappers is the number of post-shuffle partitions, then these mappers are also reducers and there is another shuffle upstream.

* @param dep shuffle dependency object
* @param startMapId the start map id
* @param endMapId the end map id
* @return a sequence of locations that each includes both a host and an executor id on that
Contributor

"includes both a host and an executor id" is confusing. We can just say "task location string" (please refer to TaskLocation).

val sqlMetricsReporter = new SQLShuffleReadMetricsReporter(tempMetrics, metrics)
// Connect the InternalRows read by each ShuffleReader
new Iterator[InternalRow] {
  val readers = partitionStartIndices.zip(partitionEndIndices).map { case (start, end) =>
Contributor

I get your point that some shuffle blocks are empty and we should skip them, but I think this optimization should be done by the shuffle implementation.

What we need to do here is simply ask the shuffle implementation to read the data for one mapper, with a simple API (e.g. getMapReader(handle, mapId, ...)).
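
For context, a rough shape of such an API could look like the following (illustrative only; the parameter list and return type are assumptions, and a followup later renamed the method to getReaderForOneMapper):

import org.apache.spark.TaskContext
import org.apache.spark.shuffle.ShuffleHandle

// Illustrative shape only: ask the shuffle implementation for all records
// written by a single mapper, identified by mapId.
trait MapReaderSupport {
  def getMapReader[K, C](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): Iterator[Product2[K, C]]
}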

JkSelf (Author) left a comment

@cloud-fan Sorry for the delayed response. I have resolved the comments. Please help review again. Thanks.



cloud-fan (Contributor)

retest this please

handle.shuffleId,
startPartition,
endPartition)
case (_) => throw new IllegalArgumentException(
Contributor

nit:

case Some(..) =>
case None =>

cloud-fan (Contributor)

There are still 2 code style comments not addressed. I'll merge this PR if tests pass and we can address code style comments in a followup.

SparkQA commented Oct 15, 2019

Test build #112096 has finished for PR 25295 at commit 9c1dc55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan closed this in 9ac4b2d on Oct 15, 2019
cloud-fan (Contributor)

thanks, merging to master!

cloud-fan added a commit that referenced this pull request Oct 16, 2019
### What changes were proposed in this pull request?

A followup of #25295

This PR proposes a few code cleanups:
1. rename the special `getMapSizesByExecutorId` to `getMapSizesByMapIndex`
2. rename the parameter `mapId` to `mapIndex` as that's really a mapper index.
3. `BlockStoreShuffleReader` should take `blocksByAddress` directly instead of a map id.
4. rename `getMapReader` to `getReaderForOneMapper` to be clearer.

### Why are the changes needed?

make code easier to understand

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26128 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
val numExchangeAfter = numExchanges(EnsureRequirements(conf).apply(optimizedPlan))

if (numExchangeAfter > numExchangeBefore) {
logWarning("OptimizeLocalShuffleReader rule is not applied due" +
Contributor

logDebug should be enough.

maryannxue (Contributor) commented Oct 17, 2019

JkSelf (Author) commented Oct 18, 2019

@maryannxue

  1. #25295 (comment): this comment has been resolved in a commit, which lets BlockStoreShuffleReader take blocksByAddress directly instead of a map id.
  2. I will resolve #25295 (comment), and test("Exchange reuse") can prove that query stage reuse still works in the presence of the local shuffle reader. I will add some small updates to test("Exchange reuse").

cloud-fan pushed a commit that referenced this pull request Oct 18, 2019
### What changes were proposed in this pull request?
A followup of [#25295](#25295).
1) change the logWarning to logDebug in `OptimizeLocalShuffleReader`.
2) update the test to check whether query stage reuse can work well with local shuffle reader.

### Why are the changes needed?
make code robust

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing tests

Closes #26157 from JkSelf/followup-25295.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
}
}

case class LocalShuffleReaderExec(child: QueryStageExec) extends LeafExecNode {
Contributor

Any reason to make LocalShuffleReaderExec a LeafNode?
There's a potential issue here: we made it a leaf node yet did not visit this node in createQueryStages. So a stage can be "not complete yet" but considered complete, and thus trigger the creation of parent stages. This might be the root cause of the flaky tests.

    case q: QueryStageExec =>
      CreateStageResult(newPlan = q,
        allChildStagesMaterialized = q.resultOption.isDefined, newStages = Seq.empty)

    case _ =>
      if (plan.children.isEmpty) {
        CreateStageResult(newPlan = plan, allChildStagesMaterialized = true, newStages = Seq.empty)
      } else {
        val results = plan.children.map(createQueryStages)
        CreateStageResult(
          newPlan = plan.withNewChildren(results.map(_.newPlan)),
          allChildStagesMaterialized = results.forall(_.allChildStagesMaterialized),
          newStages = results.flatMap(_.newStages))
      }
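
One hedged sketch of the kind of fix being discussed (hypothetical; the actual change went into PR#26250): handle the reader node explicitly so the shuffle stage it hides is still checked for materialization before parent stages are created.

    // Hypothetical extra case for createQueryStages, not the actual PR#26250 change.
    case reader @ LocalShuffleReaderExec(stage: QueryStageExec) =>
      CreateStageResult(newPlan = reader,
        allChildStagesMaterialized = stage.resultOption.isDefined, newStages = Seq.empty)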

Contributor Author

@maryannxue Thanks for the good findings. I have created PR#26250 to fix this.
However, the flaky tests may not be caused by this. They are caused by the planner's random build side for inner joins. Even with PR#26250, the flaky tests still exist. Thanks.

cloud-fan pushed a commit that referenced this pull request Oct 31, 2019
…reader as far as possible in BroadcastHashJoin

### What changes were proposed in this pull request?
[PR#25295](#25295) already implements the rule of converting the shuffle reader to a local reader on the probe side of `BroadcastHashJoin`. This PR supports converting the shuffle reader to a local reader on the build side as well.

### Why are the changes needed?
Improve performance

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing unit tests

Closes #26289 from JkSelf/supportTwoSideLocalReader.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>