
SPARK-2045 Sort-based shuffle #1499

Closed

mateiz wants to merge 32 commits into master from mateiz:sort-based-shuffle

Conversation

Contributor

mateiz commented Jul 20, 2014

This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.)

The main TODOs still left are:

  • enabling ExternalSorter to merge across spilled files
    • with an Ordering
    • without an Ordering, using the keys' hash codes
  • adding more tests (e.g. a version of our shuffle suite that runs on this)
  • rebasing on top of the size-tracking refactoring in [SPARK-1777] Prevent OOMs from single partitions #1165 when that is merged
  • disabling spilling if spark.shuffle.spill is set to false

Despite these remaining TODOs, this already seems to work pretty well (it runs successfully in cases where the hash-based shuffle would OOM, such as 1000 reduce tasks on executors with only 1 GB of memory), and it appears comparable in speed to, or faster than, hash-based shuffle (it creates far fewer files for the OS to keep track of). So I'm posting it to get some early feedback.

After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in sortByKey.
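As a rough illustration of the approach described above (a hypothetical sketch with made-up names, not the actual ExternalSorter API): map output records are buffered as ((partition ID, key), value) pairs, so a single sort by partition ID is enough to lay out one output file with contiguous partitions.

```scala
// Illustrative sketch only -- not the real ExternalSorter. Records are kept as
// ((partitionId, key), value) so one sort by partition ID yields a single
// output file whose partitions are contiguous runs.
object SortShuffleSketch {
  def sortedByPartition[K, V](records: Iterator[(K, V)], numPartitions: Int): Seq[((Int, K), V)] = {
    val buffered = records.map { case (k, v) =>
      // Hash partitioning, like the default HashPartitioner would do.
      val pid = ((k.hashCode % numPartitions) + numPartitions) % numPartitions
      ((pid, k), v)
    }.toSeq
    // Sorting by partition ID alone is enough to write the file partition by
    // partition; sorting within each partition by key is one of the later TODOs.
    buffered.sortBy { case ((pid, _), _) => pid }
  }
}
```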

SparkQA commented Jul 20, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16864/consoleFull

* The tracking code is copied from SizeTrackingAppendOnlyMap -- we'll factor that out soon.
*/
private[spark] class SizeTrackingBuffer[T <: AnyRef](initialCapacity: Int = 64)
extends SizeTrackingCollection[T]

Contributor Author

This could be replaced with the buffer in #1165 except we also need destructiveSortedIterator. And to sort stuff in-place, that in turn requires T to be a subclass of AnyRef (otherwise we'd have to pass the Array[T] to a different Arrays.sort() method based on its type).
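For context on the AnyRef constraint (a standalone illustration, not code from this patch): java.util.Arrays.sort only accepts a Comparator for object arrays, while the primitive overloads (Array[Int], Array[Double], ...) take no Comparator at all, so a generic in-place sort has to require T <: AnyRef or dispatch per element type.

```scala
import java.util.{Arrays, Comparator}

// Compiles only because T <: AnyRef guarantees an Object[] at runtime;
// a primitive Array[Int] has no Arrays.sort overload that takes a Comparator.
def sortInPlace[T <: AnyRef](data: Array[T], cmp: Comparator[T]): Unit =
  Arrays.sort(data, cmp)

// e.g. sortInPlace(Array[Integer](3, 1, 2), Comparator.naturalOrder[Integer]())
```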

SparkQA commented Jul 20, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16865/consoleFull

* the user.
*/
def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
: Iterator[(Int, Iterator[Product2[K, C]])] = {

Contributor

I'd add @return to the javadoc to explain what the (int, iterator) represents.
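For example, the suggested scaladoc addition could read something like this (hypothetical wording, based on the signature above):

```scala
/**
 * Merge the spilled files together with the given in-memory data.
 *
 * @return an iterator of (partition ID, iterator over that partition's
 *         (key, combined value) pairs), with partition IDs in increasing order
 */
```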

SparkQA commented Jul 20, 2014

QA results for PR 1499:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait SizeTrackingCollection[T] extends Iterable[T] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16864/consoleFull

SparkQA commented Jul 20, 2014

QA results for PR 1499:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait SizeTrackingCollection[T] extends Iterable[T] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16865/consoleFull

SparkQA commented Jul 20, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16878/consoleFull

SparkQA commented Jul 20, 2014

QA results for PR 1499:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait SizeTrackingCollection[T] extends Iterable[T] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16878/consoleFull


private val blockManager = SparkEnv.get.blockManager
private val diskBlockManager = blockManager.diskBlockManager
private val ser = Serializer.getSerializer(serializer.getOrElse(null))

Contributor

it appears that you could just do Serializer.getSerializer(serializer), thanks to code you added last month :)
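In other words, assuming the overload the reviewer is referring to, the line above could presumably shrink to:

```scala
private val ser = Serializer.getSerializer(serializer)
```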

SparkQA commented Jul 20, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16882/consoleFull

SparkQA commented Jul 21, 2014

QA results for PR 1499:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait SizeTrackingCollection[T] extends Iterable[T] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16882/consoleFull

SparkQA commented Jul 21, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16885/consoleFull

Contributor Author

mateiz commented Jul 21, 2014

I've now updated this to support partial aggregation across spilled files, even when we don't have an Ordering, by using hash-code comparison similar to ExternalAppendOnlyMap. It also now fully implements the behavior described in the docs, namely sorting the data if you pass an Ordering, etc.

It looks like Aaron found a problem with the size-tracking code -- I'll try to fix that in ExternalAppendOnlyMap as well. Once this class is in, though, it could replace ExternalAppendOnlyMap in most use cases; its one downside is that it creates another object for each key in the in-memory collection (since we hold ((Int, K), C) pairs to allow sorting by partition). It might still be worth it long-term. On the flip side, I think the hash-based merging code here is more efficient, avoiding a bunch of ArrayBuffer.remove calls.
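A simplified sketch of what the hash-code comparison means here (illustrative only, not the exact comparator in the patch): keys with no Ordering are merge-sorted by their hash codes, and because distinct keys can collide, the merge step must still check real key equality before combining values.

```scala
// Illustrative hash-code comparator for keys without an Ordering. A result of 0
// only means "same hash code", not "equal keys", so the caller still groups
// colliding keys by true equality when aggregating. Null keys hash to 0.
def hashComparator[K]: Ordering[K] = new Ordering[K] {
  override def compare(a: K, b: K): Int = {
    val h1 = if (a == null) 0 else a.hashCode()
    val h2 = if (b == null) 0 else b.hashCode()
    java.lang.Integer.compare(h1, h2)
  }
}
```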

SparkQA commented Jul 21, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16886/consoleFull

SparkQA commented Jul 21, 2014

QA results for PR 1499:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait SizeTrackingCollection[T] extends Iterable[T] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16885/consoleFull

SparkQA commented Jul 21, 2014

QA results for PR 1499:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait SizeTrackingCollection[T] extends Iterable[T] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16886/consoleFull

writer.close()
} catch {
case e: Exception =>
writer.close()

Contributor

nit: For the purposes of slightly more defensive programming, consider putting this inside a Utils.tryLog { }. We probably don't want to leak files.

By the way, is this logic correct for an exception? It looks like we'll just keep merrily chugging along -- should we propagate the exception upwards?
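A rough sketch of the more defensive shape being suggested (a hypothetical helper, not the code in the patch): close on the error path without letting a secondary failure mask the original exception, then rethrow rather than keep chugging along.

```scala
// Hypothetical helper illustrating the suggestion: best-effort close on failure,
// then propagate the original exception instead of swallowing it.
def writeThenClose(writer: java.io.Closeable)(body: => Unit): Unit = {
  try {
    body
    writer.close()
  } catch {
    case e: Exception =>
      try writer.close() catch { case _: Exception => () } // don't mask e
      throw e
  }
}
```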

Allow nextBatchStream to be called after we're done looking at all streams

Before this, it could give an ArrayIndexOutOfBoundsException after we read the
very last element.

Remove customizable element type in ShuffledRDD

This feature allowed for more efficient serialization in some cases, but
made it less efficient in the more common case of (K, V) because it
required the data structures in shuffle to hold Tuple2 objects instead
of unpacking them into parallel arrays. Moreover, it was broken for
ShuffledRDDs with aggregation, because there wasn't any way to construct
the user's Product2 type, so we would return Tuple2. It seemed better to
disable it.

One unfortunate effect is that OrderedRDDFunctions used to have a public
API that let it work on RDD[P <: Product2] and return RDD[P] from
sortByKey. This can't work now, so I changed it to return RDD[(K, V)].
In practice anyone using OrderedRDDFunctions through the implicit
conversions would get RDD[(K, V)] anyway, so it only mattered for people
who somehow tried to create it directly. I've made OrderedRDDFunctions's
constructor a @DeveloperAPI to discourage this.

It will be possible to add back custom data types later, but we'll need
a better interface that gives ShuffledRDD and such a mechanism to
construct them and to manage buffers of them. It might also be possible
to come up with an interface for "key-less" shuffles, where you shuffle
an RDD[T] where each element is its own key. This would be a much better
fit for sorting and some of the GraphX use cases.

This also required creating a BlockId subclass for shuffle index blocks
so that the BlockManagers can report back their lists of blocks.

This fix isn't great because we sometimes allocate another Tuple2, but
the only other easy option is to disable sending mutable pairs, which
would be expensive.

Contributor Author

mateiz commented Jul 30, 2014

Thanks everyone, I think I addressed all the comments. Anything else before we merge this? I'd like to merge it fairly soon because there are a few other issues I'd like fixed for 1.1 that will depend on this code: https://issues.apache.org/jira/browse/SPARK-2711, https://issues.apache.org/jira/browse/SPARK-983

@mridulm I haven't updated the spilling code to match your fixes in #1609 yet, but I will do that once that PR is merged (in case we refactor the fixes in any way).

Contributor Author

mateiz commented Jul 30, 2014

Jenkins, test this please

SparkQA commented Jul 30, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17471/consoleFull

SparkQA commented Jul 30, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17472/consoleFull

SparkQA commented Jul 30, 2014

QA results for PR 1499:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ShuffledRDD[K, V, C](
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17472/consoleFull

SparkQA commented Jul 30, 2014

QA results for PR 1499:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ShuffledRDD[K, V, C](
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17471/consoleFull

Contributor

rxin commented Jul 30, 2014

Jenkins, retest this please.

while (partitionId < numPartitions &&
indexInPartition == spill.elementsPerPartition(partitionId)) {
partitionId += 1
indexInPartition = 0L

Contributor

Why don't we need to set indexInBatch = 0 here as well? Is it because batches straddle partition boundaries (i.e. one batch can have elements from multiple partitions)?

Contributor

I see, the answer is yes because when we spill we iterate through ((partition ID, k), v), not just (k, v) per partition.

Contributor Author

Yeah exactly
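Summarizing the exchange as a simplified sketch (hypothetical names, not the actual SpillReader): indexInPartition resets at partition boundaries, while indexInBatch only resets when a serialized batch is exhausted, because one batch can span several partitions.

```scala
// Simplified cursor over a spill file whose serialized batches straddle partitions.
class SpillCursorSketch(elementsPerPartition: Array[Long], batchSize: Int) {
  var partitionId = 0
  var indexInPartition = 0L  // resets when we cross into the next partition
  var indexInBatch = 0       // resets only when the current batch is exhausted

  def advance(): Unit = {
    indexInPartition += 1
    indexInBatch += 1
    if (indexInBatch == batchSize) {
      indexInBatch = 0  // the next element comes from a new batch stream
    }
    // A partition may end mid-batch, so indexInBatch is NOT reset here.
    while (partitionId < elementsPerPartition.length &&
        indexInPartition == elementsPerPartition(partitionId)) {
      partitionId += 1
      indexInPartition = 0L
    }
  }
}
```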

andrewor14 (Contributor) commented:

test this please!

SparkQA commented Jul 31, 2014

QA tests have started for PR 1499. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17526/consoleFull

andrewor14 (Contributor) commented:

I took another pass over the patch and the changes look ready to me. I also tested this locally and verified that the shuffle files were actually cleaned up. There is still a lot of very similar code in ExternalSorter and ExternalAppendOnlyMap, but resolving that is not a priority for 1.1. I have filed a related JIRA as a reminder to clean this up some time after the release.

Contributor

rxin commented Jul 31, 2014

Ok I merged this in master.

SparkQA commented Jul 31, 2014

QA results for PR 1499:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ShuffledRDD[K, V, C](
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17526/consoleFull

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request on Sep 4, 2014

This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.)

The main TODOs still left are:
- [x] enabling ExternalSorter to merge across spilled files
  - [x] with an Ordering
  - [x] without an Ordering, using the keys' hash codes
- [x] adding more tests (e.g. a version of our shuffle suite that runs on this)
- [x] rebasing on top of the size-tracking refactoring in apache#1165 when that is merged
- [x] disabling spilling if spark.shuffle.spill is set to false

Despite this though, this seems to work pretty well (running successfully in cases where the hash shuffle would OOM, such as 1000 reduce tasks on executors with only 1G memory), and it seems to be comparable in speed or faster than hash-based shuffle (it will create far fewer files for the OS to keep track of). So I'm posting it to get some early feedback.

After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in `sortByKey`.

Author: Matei Zaharia <matei@databricks.com>

Closes apache#1499 from mateiz/sort-based-shuffle and squashes the following commits:

bd841f9 [Matei Zaharia] Various review comments
d1c137f [Matei Zaharia] Various review comments
a611159 [Matei Zaharia] Compile fixes due to rebase
62c56c8 [Matei Zaharia] Fix ShuffledRDD sometimes not returning Tuple2s.
f617432 [Matei Zaharia] Fix a failing test (seems to be due to change in SizeTracker logic)
9464d5f [Matei Zaharia] Simplify code and fix conflicts after latest rebase
0174149 [Matei Zaharia] Add cleanup behavior and cleanup tests for sort-based shuffle
eb4ee0d [Matei Zaharia] Remove customizable element type in ShuffledRDD
fa2e8db [Matei Zaharia] Allow nextBatchStream to be called after we're done looking at all streams
a34b352 [Matei Zaharia] Fix tracking of indices within a partition in SpillReader, and add test
03e1006 [Matei Zaharia] Add a SortShuffleSuite that runs ShuffleSuite with sort-based shuffle
3c7ff1f [Matei Zaharia] Obey the spark.shuffle.spill setting in ExternalSorter
ad65fbd [Matei Zaharia] Rebase on top of Aaron's Sorter change, and use Sorter in our buffer
44d2a93 [Matei Zaharia] Use estimateSize instead of atGrowThreshold to test collection sizes
5686f71 [Matei Zaharia] Optimize merging phase for in-memory only data:
5461cbb [Matei Zaharia] Review comments and more tests (e.g. tests with 1 element per partition)
e9ad356 [Matei Zaharia] Update ContextCleanerSuite to make sure shuffle cleanup tests use hash shuffle (since they were written for it)
c72362a [Matei Zaharia] Added bug fix and test for when iterators are empty
de1fb40 [Matei Zaharia] Make trait SizeTrackingCollection private[spark]
4988d16 [Matei Zaharia] tweak
c1b7572 [Matei Zaharia] Small optimization
ba7db7f [Matei Zaharia] Handle null keys in hash-based comparator, and add tests for collisions
ef4e397 [Matei Zaharia] Support for partial aggregation even without an Ordering
4b7a5ce [Matei Zaharia] More tests, and ability to sort data if a total ordering is given
e1f84be [Matei Zaharia] Fix disk block manager test
5a40a1c [Matei Zaharia] More tests
614f1b4 [Matei Zaharia] Add spill metrics to map tasks
cc52caf [Matei Zaharia] Add more error handling and tests for error cases
bbf359d [Matei Zaharia] More work
3a56341 [Matei Zaharia] More partial work towards sort-based shuffle
7a0895d [Matei Zaharia] Some more partial work towards sort-based shuffle
b615476 [Matei Zaharia] Scaffolding for sort-based shuffle