[SPARK-24296][CORE] Replicate large blocks as a stream. #21451
Conversation
EDIT: no longer a WIP, as all the dependencies are in.
Test build #91266 has finished for PR 21451 at commit
Test build #91268 has finished for PR 21451 at commit
Test build #91271 has finished for PR 21451 at commit
 * @param callback Callback which should be invoked exactly once upon success or failure of the
 *                 RPC.
 */
public abstract void receive(
    TransportClient client,
    ByteBuffer message,
    StreamData streamData,
It's not necessary to add a parameter. Change the message parameter to InputStream.
yes, there are other ways to do this, but I wanted to leave the old code paths relatively untouched to minimize the behavior change / risk of bugs. I also think it's helpful to clearly separate the portion that is read entirely into memory from the streaming portion; it makes the code easier to work with. Also, InputStream suggests the data is getting pulled instead of pushed.
your earlier approach definitely gave a lot of inspiration for this change. I'm hoping that making it a more isolated change helps us make progress here.
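The push-vs-pull distinction can be sketched with a toy callback interface. The names below are hypothetical, not Spark's actual StreamData API: the point is just that the transport layer pushes chunks into the handler as they arrive, so the handler never asks for bytes the way it would with a pull-based InputStream.

```scala
import java.nio.ByteBuffer

// Toy push-style stream interface (hypothetical names, not Spark's actual
// StreamData API). The transport layer pushes chunks into the callback as
// they arrive off the wire.
trait ChunkCallback {
  def onData(chunk: ByteBuffer): Unit
  def onComplete(): Unit
}

// Example receiver: tracks how many bytes were pushed, without ever
// buffering the whole payload.
class CountingCallback extends ChunkCallback {
  var total: Long = 0L
  var done: Boolean = false
  override def onData(chunk: ByteBuffer): Unit = total += chunk.remaining()
  override def onComplete(): Unit = done = true
}
```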
What about incorporating the message parameter into the streamData parameter?
I'm gonna move the discussion to #21346 since that is the PR that will introduce this API
Force-pushed from 68c5d5f to 1cc0f3f
Didn't see any red flags but definitely would like another look after the other change goes in - but not sure I'll have time for that.
/**
 * A request to Upload a block, which the destintation should receive as a stream.
 *
 * The actual block data is not contained here. It is in the streamData in the RpcHandler.receive()
Need to update to match API.
 * Put the given block that will be received as a stream.
 *
 * When this method is called, the data itself is not available -- it needs to be handled within
 * the callbacks of <code>streamData</code>.
Need to update comment.
val message = BlockTransferMessage.Decoder.fromByteBuffer(messageHeader)
message match {
  case uploadBlockStream: UploadBlockStream =>
    val (level: StorageLevel, classTag: ClassTag[_]) = {
Indentation is off here.
Using .asInstanceOf[UploadBlockStream] would achieve the same goal here with less indentation, just with a different exception...
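The two dispatch styles under discussion can be contrasted with a self-contained sketch (toy message types below, not the real BlockTransferMessage hierarchy): the pattern match costs an indentation level but lets you throw a deliberate exception on unexpected input, while the cast is flatter and surfaces a ClassCastException instead.

```scala
// Toy stand-ins (not the real BlockTransferMessage / UploadBlockStream
// hierarchy) to contrast the two dispatch styles.
sealed trait Message
case class UploadStream(blockId: String) extends Message
case class OpenBlocks() extends Message

// Style kept in the PR: a pattern match, which costs one indentation level
// but lets you throw a deliberate exception for unexpected message types.
def handleWithMatch(m: Message): String = m match {
  case u: UploadStream => s"upload ${u.blockId}"
  case other => throw new IllegalArgumentException(s"unexpected message: $other")
}

// The flatter alternative suggested above: a straight cast, which instead
// surfaces a ClassCastException if the wrong message type arrives.
def handleWithCast(m: Message): String =
  s"upload ${m.asInstanceOf[UploadStream].blockId}"
```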
// TODO if we change this method to return the ManagedBuffer, then getRemoteValues
// could just use the inputStream on the temp file, rather than memory-mapping the file.
// Until then, replication can cause the process to use too much memory and get killed
// by the OS / cluster manager (not a java OOM, since its a memory-mapped file) even though
it's
@@ -723,7 +770,9 @@ private[spark] class BlockManager(
       }
       if (data != null) {
-        return Some(new ChunkedByteBuffer(data))
+        val chunkSize =
+          conf.getSizeAsBytes("spark.storage.memoryMapLimitForTests", Int.MaxValue.toString).toInt
Want to turn this into a config constant? I'm seeing it in a bunch of places.
@@ -1341,12 +1390,16 @@ private[spark] class BlockManager(
       try {
         val onePeerStartTime = System.nanoTime
         logTrace(s"Trying to replicate $blockId of ${data.size} bytes to $peer")
+        // This thread keeps a lock on the block, so we do not want the netty thread to unlock
+        // block when it finishes sending the message.
+        val mb = new BlockManagerManagedBuffer(blockInfoManager, blockId, data, false,
s/mb/buffer
Confusing in a place that deals with sizes all over.
Test build #92398 has finished for PR 21451 at commit
Test build #92407 has finished for PR 21451 at commit
Test build #92427 has finished for PR 21451 at commit
Force-pushed from bdfa6ff to 335e26d
@mridulm @jerryshao @felixcheung last one in the 2GB block limit series. just rebased to include the updates to #21440. I will also run my tests on a cluster here with this: https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala thanks for all the reviews!
Test build #93250 has finished for PR 21451 at commit
retest this please
Test build #93255 has finished for PR 21451 at commit
When replicating large cached RDD blocks, it can be helpful to replicate them as a stream, to avoid using large amounts of memory during the transfer. This also allows blocks larger than 2GB to be replicated. Added unit tests in DistributedSuite. Also ran tests on a cluster for blocks > 2gb.
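The bounded-memory idea behind streaming replication can be illustrated with a plain NIO copy loop (a standalone sketch, not Spark's replication code): the block moves through a small fixed-size buffer, so peak memory stays at the buffer size no matter how large the block is.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import java.nio.ByteBuffer
import java.nio.channels.Channels

// Copy `in` to `out` through a fixed-size buffer: peak memory stays at
// `bufSize` regardless of block size, which is the point of replicating
// as a stream instead of materializing the whole block in memory.
def streamCopy(in: InputStream, out: OutputStream, bufSize: Int = 64 * 1024): Long = {
  val src = Channels.newChannel(in)
  val dst = Channels.newChannel(out)
  val buf = ByteBuffer.allocate(bufSize)
  var copied = 0L
  var n = src.read(buf)
  while (n != -1) {
    buf.flip()
    while (buf.hasRemaining) copied += dst.write(buf)
    buf.clear()
    n = src.read(buf)
  }
  copied
}
```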
Force-pushed from 335e26d to fe31a7d
Test build #93549 has finished for PR 21451 at commit
fyi, I did finally run my scale tests again on a cluster, and shuffles, remote reads, and replication worked for blocks over 2gb (sorry got sidetracked with a few other things in the meantime)
made one pass through, need to look at some things in more depth to make sure I understand
@@ -73,10 +73,32 @@ class NettyBlockRpcServer(
         }
         val data = new NioManagedBuffer(ByteBuffer.wrap(uploadBlock.blockData))
         val blockId = BlockId(uploadBlock.blockId)
+        logInfo(s"Receiving replicated block $blockId with level ${level} " +
this seems like it could be pretty verbose, put at debug or trace?
private[spark] val MEMORY_MAP_LIMIT_FOR_TESTS =
  ConfigBuilder("spark.storage.memoryMapLimitForTests")
    .internal()
add a .doc that says it is for testing only
override def receiveStream(
    client: TransportClient,
    messageHeader: ByteBuffer,
    responseContext: RpcResponseCallback): StreamCallbackWithID = {
fix spacing
      .asInstanceOf[(StorageLevel, ClassTag[_])]
  }
  val blockId = BlockId(message.blockId)
  logInfo(s"Receiving replicated block $blockId with level ${level} as stream " +
debug?
// stream.
channel.close()
// TODO Even if we're only going to write the data to disk after this, we end up using a lot
// of memory here. We wont' get a jvm OOM, but might get killed by the OS / cluster
spelling won't
yeah agree this could be an issue with yarn since overhead memory might not be big enough, can we file a jira to specifically track this?
filed SPARK-25035
LGTM (after applying the @tgravescs comments). Great job on the whole issue.
Test build #94319 has finished for PR 21451 at commit
import static org.apache.spark.network.shuffle.protocol.BlockTransferMessage.Type;

/**
 * A request to Upload a block, which the destintation should receive as a stream.
nit: spelling destination
// we just write to a temp file, and call putBytes on the data in that file.
val tmpFile = diskBlockManager.createTempLocalBlock()._2
new StreamCallbackWithID {
  val channel: WritableByteChannel = Channels.newChannel(new FileOutputStream(tmpFile))
we need to honor spark.io.encryption.enabled here to encrypt the file on local disk?
yeah sure looks like it.
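For reference, wrapping the raw FileOutputStream so bytes are encrypted before they reach disk might look like the following generic javax.crypto sketch. This is not Spark's CryptoStreamUtils; the helper name, cipher choice, and parameters are illustrative. With AES/CTR/NoPadding the ciphertext is the same length as the plaintext, so block sizes are preserved.

```scala
import java.io.{File, FileOutputStream}
import java.nio.channels.{Channels, WritableByteChannel}
import javax.crypto.{Cipher, CipherOutputStream, KeyGenerator, SecretKey}
import javax.crypto.spec.IvParameterSpec

// Hypothetical helper: returns a channel that encrypts everything written
// to it before the bytes hit the file on local disk.
def encryptedChannel(file: File, key: SecretKey, iv: Array[Byte]): WritableByteChannel = {
  val cipher = Cipher.getInstance("AES/CTR/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
  Channels.newChannel(new CipherOutputStream(new FileOutputStream(file), cipher))
}
```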
Test build #94695 has finished for PR 21451 at commit
Test build #94725 has finished for PR 21451 at commit
test this please
Looks sane to me.
@@ -28,11 +28,15 @@ trait EncryptionFunSuite {
    * for the test to modify the provided SparkConf.
    */
   final protected def encryptionTest(name: String)(fn: SparkConf => Unit) {
+    encryptionTestHelper(name) { case (name, conf) =>
+      test(name)(fn(conf))
nit: indentation
  case MemoryMode.ON_HEAP => ByteBuffer.allocate _
  case MemoryMode.OFF_HEAP => Platform.allocateDirectBuffer _
}
new EncryptedBlockData(tmpFile, blockSize, conf, key).toChunkedByteBuffer(allocator)
toChunkedByteBuffer is also pretty memory-hungry, right? You'll end up needing enough memory to hold the entire file in memory, if I read the code right.
This is probably ok for now, but should probably mention it in your TODO above.
yeah, you store the entire file in memory (after decrypting). it's not memory mapped either, so it'll probably be a regular OOM (depending on memory mode). updated the comment
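The shape of the concern can be seen in a tiny sketch of a ChunkedByteBuffer-style split (a hypothetical helper, not the real class): chunking keeps any single allocation under the chunk size, but the total heap footprint is still the full payload.

```scala
import java.nio.ByteBuffer

// Hypothetical helper illustrating the trade-off discussed above: no single
// buffer exceeds `chunkSize`, but all chunks together still hold the entire
// payload in memory.
def chunked(data: Array[Byte], chunkSize: Int): Seq[ByteBuffer] =
  data.grouped(chunkSize).map(a => ByteBuffer.wrap(a)).toSeq
```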
Test build #94762 has finished for PR 21451 at commit
Test build #94778 has finished for PR 21451 at commit
@tgravescs @vanzin any more comments? I think I've addressed everything
LGTM. Merging to master.
Is this possibly a bug introduced by this PR? After merging this PR, I saw this error multiple times. https://issues.apache.org/jira/browse/SPARK-25422
looking. so far seems unrelated to me, but as you've said it's failed in a few builds so I'm gonna keep digging. The error is occurring before any rdds are getting replicated via the new code path, and this change mostly doesn't touch the path involved in sending a broadcast. I've been unable to repro so far despite running the test hundreds of times, but I might need to run more tests or put in some pauses or something. gonna compare with other test runs with the failure as well.
@squito Thanks for digging into it! This PR introduced the failed test case. We have to know whether it exposes any serious bug (if it is not introduced by this PR) and whether it impacts our 2.4 release.
still looking -- will put comments on the jira so it's more visible