[SPARK-22537][core] Aggregation of map output statistics on driver faces single point bottleneck #19763

Closed
wants to merge 13 commits

Conversation


@gcz2022 gcz2022 commented Nov 16, 2017

What changes were proposed in this pull request?

In adaptive execution, the map output statistics of all mappers will be aggregated after previous stage is successfully executed. Driver takes the aggregation job while it will get slow when the number of mapper * shuffle partitions is large, since it only uses single thread to compute. This PR uses multi-thread to deal with this single point bottleneck.
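
For context, here is a minimal, self-contained sketch of the idea (illustrative names and helper only; the real change lives in MapOutputTracker and aggregates MapStatus.getSizeForBlock values): split the reduce partitions into disjoint slices and let a small thread pool sum the per-partition sizes concurrently.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// blockSizes(m)(r) stands in for statuses(m).getSizeForBlock(r) in the real code.
def aggregateSizes(blockSizes: Array[Array[Long]], numPartitions: Int, threads: Int): Array[Long] = {
  val totalSizes = new Array[Long](numPartitions)
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    // Each future owns a disjoint slice of reduce partitions, so no locking is needed.
    val tasks = (0 until numPartitions).grouped(math.max(1, numPartitions / threads)).toList.map { reduceIds =>
      Future {
        for (sizes <- blockSizes; r <- reduceIds) {
          totalSizes(r) += sizes(r)
        }
      }
    }
    Await.result(Future.sequence(tasks), Duration.Inf)
    totalSizes
  } finally {
    pool.shutdown()
  }
}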

How was this patch tested?

Test cases are in MapOutputTrackerSuite.scala

@gcz2022 gcz2022 changed the title [SPARK-22537] Aggregation of map output statistics on driver faces single point bottleneck [SPARK-22537][core] Aggregation of map output statistics on driver faces single point bottleneck Nov 16, 2017
}
mapStatusSubmitTasks.foreach(_.get())
Contributor

this part can be simplified by using Scala's Future:

val futureArray = equallyDivide(totalSizes.length, taskSlices).map { reduceIds =>
  Future {
    // whatever you want to do here
  }
}
Await.result(Future.sequence(futureArray), Duration.Inf) // or some timeout value you prefer

Author

Good idea, thx!

Author

Should I use the scala.concurrent.ExecutionContext.Implicits.global ExecutionContext?

Member

Don't use scala.concurrent.ExecutionContext.Implicits.global. You need to create a thread pool.
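
A minimal sketch of what that looks like with Spark's own helper (the pool size and name here are illustrative assumptions):

import scala.concurrent.ExecutionContext
import org.apache.spark.util.ThreadUtils

val threadPool = ThreadUtils.newDaemonFixedThreadPool(8, "map-output-statistics")
// Futures built against this context run on the dedicated pool, not the global fork-join pool.
implicit val executionContext: ExecutionContext = ExecutionContext.fromExecutor(threadPool)
// ... build and await the Futures, then call threadPool.shutdown()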

@CodingCat
Contributor

my question is "how many times have we seen this operation of collecting statistics be the bottleneck?"

Member
@viirya viirya left a comment

Can you also show some benchmark numbers to demonstrate it is a bottleneck?

totalSizes(i) += s.getSizeForBlock(i)
}
val mapStatusSubmitTasks = ArrayBuffer[Future[_]]()
var taskSlices = parallelism
Member

Why var?

@gcz2022
Author

gcz2022 commented Nov 16, 2017

This happens a lot in our TPC-DS 100TB tests. The master node has an Intel Xeon E5-2699 v4 @ 2.2GHz CPU, which bounds the driver's performance, and we set spark.sql.shuffle.partitions to 10976. The driver's workload grows with shuffle partitions * number of mappers.

Let's take TPC-DS q67 as an example:
Without this PR, there's a 47:39 - (41:16 + 6.3min) ~ 5s gap between the map and reduce stages, most of which is spent aggregating map statistics in a single thread.
(screenshot: single_thread_q67)
With this PR, the gap is 25:32 - (18:58 + 6.6min) ~ 0s:
(screenshot: multi-thread_q67)

@cloud-fan
Contributor

Seems like it's not a big deal for end-to-end performance?

@viirya
Member

viirya commented Nov 16, 2017

Doesn't look like a significant difference.

@gcz2022
Author

gcz2022 commented Nov 17, 2017

Actually, the time gap is O(number of mappers * shuffle partitions). In this case the number of mappers is not very large, but users are more likely to be slowed down when they run on a bigger data set.

@cloud-fan
Contributor

cc @zsxwing

totalSizes(i) += s.getSizeForBlock(i)
val parallelism = conf.getInt("spark.adaptive.map.statistics.cores", 8)

val mapStatusSubmitTasks = equallyDivide(totalSizes.length, parallelism).map {
Member

Doing this is not cheap. I would add a config and only run this in multiple threads when #mapper * #shuffle_partitions is large.

@@ -485,4 +485,13 @@ package object config {
"array in the sorter.")
.intConf
.createWithDefault(Integer.MAX_VALUE)

private[spark] val SHUFFLE_MAP_OUTPUT_STATISTICS_MULTITHREAD_THRESHOLD =
ConfigBuilder("spark.shuffle.mapOutputStatisticsMultithreadThreshold")
Member

spark.shuffle.mapOutputStatistics.parallelAggregationThreshold?

Author

Yes, it's better!

@@ -485,4 +485,13 @@ package object config {
"array in the sorter.")
.intConf
.createWithDefault(Integer.MAX_VALUE)

private[spark] val SHUFFLE_MAP_OUTPUT_STATISTICS_PARALLEL_AGGREGATION_THRESHOLD =
Contributor

spark.adaptive.map.statistics.cores should also be a config entry like this

Author

spark.sql.adaptive.xxx already exists, will this be a problem?

Contributor

Really? I grepped the code base but can't find it.

Author

I think that's not a big problem; adaptive execution needs both core and sql code, so both confs are needed.

Contributor

I don't get it. You showed me that spark.sql.adaptive.xxx has config entries, so why doesn't spark.adaptive.map.statistics.cores need a config entry?

Author

spark.adaptive.map.statistics.cores needs a config entry, but I thought the adaptive.xxx items had already been put under spark.sql., so it might be inconsistent. Now I think it's no big deal.

Author

There is also a spark.shuffle.mapOutput.dispatcher.numThreads in this file without a config entry; do I need to add one?

Contributor

yea let's add it. BTW shall we also use mapOutput instead of mapOutputStatistics?

Author

Actually there are 3 confs like that... do we need to add all of them?

.doc("Multi-thread is used when the number of mappers * shuffle partitions exceeds this " +
"threshold")
.intConf
.createWithDefault(100000000)
Contributor

wow, 100 million is a really large threshold, how did you pick this number?

Author
@gcz2022 gcz2022 Nov 20, 2017

Now I also think it's a little bit large... In the case I mentioned, a value of about 10^8 produced the 5s gap. Maybe 10^7 or 2*10^7 would be good?

}
}
} else {
val parallelism = conf.getInt("spark.adaptive.map.statistics.cores", 8)
Contributor

how is this related to adaptive?

Author
@gcz2022 gcz2022 Nov 20, 2017

I thought only adaptive execution code would call this. But actually it seems this is called after all ShuffleMapTasks of a stage complete (which is the common path), right?

.createWithDefault(10000000)

private[spark] val SHUFFLE_MAP_OUTPUT_STATISTICS_CORES =
ConfigBuilder("spark.shuffle.mapOutputStatistics.cores")
Contributor

nit: cores -> parallelism

for (i <- 0 until totalSizes.length) {
totalSizes(i) += s.getSizeForBlock(i)
if (statuses.length * totalSizes.length <=
conf.get(SHUFFLE_MAP_OUTPUT_STATISTICS_PARALLEL_AGGREGATION_THRESHOLD)) {
Contributor

nit:

val parallelAggThreshold = ...
if (statuses.length * totalSizes.length < parallelAggThreshold)

def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
assert(divisor > 0, "Divisor should be positive")
val (each, remain) = (num / divisor, num % divisor)
val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each)
Contributor
@cloud-fan cloud-fan Nov 20, 2017

can you add a comment to describe the algorithm? I'd expect something like:

to equally divide n elements into m buckets
each bucket should have n/m elements
for the remaining n%m elements
pick the first n%m buckets and add one more element to each

Author

Sure : )

Contributor

my proposal

def equallyDivide(numElements: Int, numBuckets: Int): Iterator[Seq[Int]] = {
  val elementsPerBucket = numElements / numBuckets
  val remaining = numElements % numBuckets
  if (remaining == 0) {
    0.until(numElements).grouped(elementsPerBucket)
  } else {
    val splitPoint = (elementsPerBucket + 1) * remaining
    0.to(splitPoint).grouped(elementsPerBucket + 1) ++
      (splitPoint + 1).until(numElements).grouped(elementsPerBucket)
  }
}

}
} else {
val parallelism = conf.get(SHUFFLE_MAP_OUTPUT_STATISTICS_PARALLELISM)
val threadPool = ThreadUtils.newDaemonFixedThreadPool(parallelism, "map-output-statistics")
Member

please put threadPool.shutdown in a finally block to shut down the thread pool
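
A sketch of the requested shape (pool size and task body are placeholders; ThreadUtils and awaitResult are Spark's existing helpers):

import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.util.ThreadUtils

val threadPool = ThreadUtils.newDaemonFixedThreadPool(4, "map-output-statistics")
try {
  implicit val executionContext: ExecutionContext = ExecutionContext.fromExecutor(threadPool)
  val tasks = Seq(Future {
    // aggregate one slice of reduce partitions here
  })
  ThreadUtils.awaitResult(Future.sequence(tasks), Duration.Inf)
} finally {
  threadPool.shutdown()  // released even if aggregation throws
}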

}
}
} else {
val parallelism = conf.get(SHUFFLE_MAP_OUTPUT_STATISTICS_PARALLELISM)
Member

How about setting parallelism = math.min(Runtime.getRuntime.availableProcessors(), statuses.length.toLong * totalSizes.length / parallelAggThreshold) rather than introducing a new config, such as:

val parallelism = math.min(
  Runtime.getRuntime.availableProcessors(),
  statuses.length.toLong * totalSizes.length / parallelAggThreshold + 1)
if (parallelism <= 1) {
  ...
} else {
  ...
}

0.until(numElements).grouped(elementsPerBucket)
} else {
val splitPoint = (elementsPerBucket + 1) * remaining
0.to(splitPoint).grouped(elementsPerBucket + 1) ++
Member
@zsxwing zsxwing Nov 20, 2017

grouped is expensive here. I saw it generates Vector rather than Range:

scala> (1 to 100).grouped(10).foreach(g => println(g.getClass))
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector

It means we need to generate all of the numbers between 0 and numElements. Could you implement a special grouped for Range instead?
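
One possible shape for such a helper (hypothetical name, assuming an ascending step-1 range): each chunk stays a Range, so the index values are never materialized into a Vector.

def groupedRange(start: Int, end: Int, groupSize: Int): Iterator[Range] = {
  require(groupSize > 0, "groupSize must be positive")
  Iterator.range(start, end, groupSize).map { s =>
    s until math.min(s + groupSize, end)
  }
}

// groupedRange(0, 100, 10).foreach(g => println(g.getClass))
// prints scala.collection.immutable.Range for every group instead of Vector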

}
ThreadUtils.awaitResult(Future.sequence(mapStatusSubmitTasks), Duration.Inf)
} finally {
threadpool.shutdown()
Contributor

cc @zsxwing do we really need to shut down the thread pool every time? This method may be called many times; would it be better to cache the thread pool, like the dispatcher thread pool?

Author

I agree with you. With the thread pool kept in the class, the only loss is that the pool still exists even when the single-threaded path is used. The gain is that we avoid creating the pool after every shuffle.

Contributor

We can shut down the pool after a certain idle time, but I'm not sure if it's worth the complexity.

Member

I'm fine with creating a thread pool every time, since this code path doesn't seem to run very frequently, because:

  • Using a shared cached thread pool is just like creating a new thread pool, since the idle time of a thread is pretty long and it is likely killed before the next call.
  • Using a shared fixed thread pool is a total waste for most use cases.
  • The cost of creating threads is trivial compared to the total time of a job.

Member

@gczsjdy could you fix the compile error?

Author

@zsxwing Actually I built using sbt/mvn, no errors...

Member

@gczsjdy Oh, sorry. I didn't realize there is already a threadpool field in MapOutputTrackerMaster. That's why there is no error. Here you are shutting down the wrong thread pool.

Contributor

ah good catch! I misread it...

Author

My fault!

Author

@cloud-fan "We can shut down the pool after a certain idle time, but not sure if it's worth the complexity" - I know we don't need to do this now, but if we did, how would we do it?

@cloud-fan
Contributor

OK to test

@zsxwing
Member

zsxwing commented Nov 22, 2017

"We can shut down the pool after a certain idle time, but not sure if it's worth the complexity"

Yeah, that's just what the cached thread pool does :)

SHUFFLE_MAP_OUTPUT_PARALLEL_AGGREGATION_THRESHOLD)
val parallelism = math.min(
Runtime.getRuntime.availableProcessors(),
statuses.length * totalSizes.length / parallelAggThreshold + 1)
Member

statuses.length.toLong. It's easy to overflow here.
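
To illustrate why the overflow is easy to hit (the counts below are made up but plausible for a large job):

val numMappers = 100000
val numPartitions = 30000
val overflowed = numMappers * numPartitions         // 3e9 wraps around to a negative Int
val correct = numMappers.toLong * numPartitions     // 3000000000L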

@gcz2022
Author

gcz2022 commented Nov 23, 2017

@cloud-fan Seems Jenkins hasn't started?

@cloud-fan
Contributor

retest this please

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Nov 23, 2017

Test build #84134 has finished for PR 19763 at commit 72c3d97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ConfigBuilder("spark.shuffle.mapOutput.parallelAggregationThreshold")
.internal()
.doc("Multi-thread is used when the number of mappers * shuffle partitions is greater than " +
"or equal to this threshold.")
Member

Is this condition to enable parallel aggregation still true?

Author

Sorry, but I didn't get your point.

Member

Looks like parallel aggregation is enabled only when parallelism >= 2. Is that equivalent to the number of mappers * shuffle partitions >= this threshold?

Member

From statuses.length.toLong * totalSizes.length / parallelAggThreshold + 1 above, it looks like we need at least two times this threshold to enable parallel aggregation?

Author

statuses.length.toLong * totalSizes.length / parallelAggThreshold + 1 >= 2 implies statuses.length.toLong * totalSizes.length >= parallelAggThreshold, so it doesn't need to be 2 times the threshold; anything not smaller than 1x is enough.

Author

Do you think it's necessary to describe how the actual parallelism is calculated here?

Member

It's ok. I misread the equation. Nvm.

Member

I think we don't need to describe the calculation in the config description. The current one is enough.

Member

After rethinking this, I think it is better to indicate that this threshold also determines the number of threads used for parallel aggregation, so it should not be set to zero or a negative number.

Author

Yeah, I will add some.
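
For example, a guard along these lines could be added to the config entry (the wording is only a suggestion; checkValue is the existing ConfigBuilder facility for this):

private[spark] val SHUFFLE_MAP_OUTPUT_PARALLEL_AGGREGATION_THRESHOLD =
  ConfigBuilder("spark.shuffle.mapOutput.parallelAggregationThreshold")
    .internal()
    .doc("Multi-thread is used when the number of mappers * shuffle partitions is greater " +
      "than or equal to this threshold. Note that the actual parallelism is computed by " +
      "dividing mappers * shuffle partitions by this threshold, so it must be positive.")
    .intConf
    .checkValue(v => v > 0, "The threshold should be positive.")
    .createWithDefault(10000000)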

}
}
} else {
val threadPool = ThreadUtils.newDaemonFixedThreadPool(parallelism, "map-output-aggregate")
Member

The value of parallelism seems to keep us from fully utilizing all processors at all times? E.g., if availableProcessors returns 8 but parallelism is 2, we pick 2 as the number of threads.

Author

I think we don't need to fully utilize all available processors. parallelAggThreshold defaults to 10^7, which means each thread handles a relatively small task, so the tasks don't need to be cut smaller in most cases.
For cases where each split is still a big task, parallelAggThreshold should be tuned. This is not very direct because there is no xx.parallelism config to set, but the benefit is that we introduce fewer configs.
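
A worked example with assumed numbers (10000 mappers, the 10976 shuffle partitions from the benchmark above, default threshold 10^7, 8 available processors):

val parallelAggThreshold = 10000000L
val parallelism = math.min(
  Runtime.getRuntime.availableProcessors(),
  10000L * 10976 / parallelAggThreshold + 1).toInt
// 10000L * 10976 / 10000000 + 1 = 11, capped at 8 processors -> 8 aggregation threads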

@viirya
Member

viirya commented Nov 24, 2017

LGTM

for (i <- 0 until totalSizes.length) {
totalSizes(i) += s.getSizeForBlock(i)
val parallelAggThreshold = conf.get(
SHUFFLE_MAP_OUTPUT_PARALLEL_AGGREGATION_THRESHOLD)
Member

Maybe a little picky, but should we do:

val parallelAggThreshold = conf.get(
  SHUFFLE_MAP_OUTPUT_PARALLEL_AGGREGATION_THRESHOLD) + 1
...
val parallelism = math.min(
  Runtime.getRuntime.availableProcessors(),
  (statuses.length.toLong * totalSizes.length + 1) / parallelAggThreshold + 1).toInt

In case the threshold is set to zero?

Member

For zero or negative threshold, see my above comment: #19763 (comment).

Author
@gcz2022 gcz2022 Nov 24, 2017

I think that code will confuse people and we would need more comments to explain it, which seems not worth it.
In most cases the default value is enough, so just adding a value check and some docs explanation should be good?

Member

Yeah, I left that comment before #19763 (comment). I think it is good enough to add more comments to the config entry.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 24, 2017

Test build #84161 has finished for PR 19763 at commit 0f87dd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in efd0036 Nov 24, 2017