[SPARK-24307][CORE] Support reading remote cached partitions > 2gb #21440
Conversation
(1) Netty's ByteBuf cannot support data > 2gb. So to transfer data from a ChunkedByteBuffer over the network, we use a custom version of FileRegion which is backed by the ChunkedByteBuffer. (2) On the receiving end, we need to expose all the data in a FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by memory mapping the entire file in chunks. Added unit tests. Also tested on a cluster with remote cache reads > 2gb (in memory and on disk).
@@ -659,6 +659,11 @@ private[spark] class BlockManager(
   * Get block from remote block managers as serialized bytes.
   */
  def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
    // TODO if we change this method to return the ManagedBuffer, then getRemoteValues
    // could just use the inputStream on the temp file, rather than memory-mapping the file.
    // Until then, replication can go cause the process to use too much memory and get killed
grammar
Test build #91194 has finished for PR 21440 at commit
private val chunks = chunkedByteBuffer.getChunks()
private val cumLength = chunks.scanLeft(0L) { _ + _.remaining()}
private val size = cumLength.last
// Chunk size in bytes
Should this comment be moved above the last line?
protected def deallocate: Unit = {}

override def count(): Long = chunkedByteBuffer.size
What's the difference between size and count? Should count indicate the size of the remaining data that can be transferred?
no difference, count() is just to satisfy an interface. My mistake for having them look different, I'll make them the same
var keepGoing = true
var written = 0L
var currentChunk = chunks(currentChunkIdx)
var originalLimit = currentChunk.limit()
It seems unused.
acceptNBytes -= length
// verify we got the right data
(0 until length).foreach { idx =>
  assert(bytes(idx) === (pos + idx).toByte, s"; wrong data at ${pos + idx}")
;
?
${pos + idx} or ${idx}?
';' because it separates the automatic portion of the error msg, making it easier to read IMO:
0 did not equal 1; wrong data at 0
pos + idx I think is more appropriate; it's more helpful to know the position in the overall stream of data.
I see. It's just that overwriting the bytes array while checking against the virtual ${pos + idx} stream position made me a little confused. Anyway, it is a really well-designed test, especially the data-verification part.
SparkEnv.set(null)
}

private def generateChunkByteBuffer(nChunks: Int, perChunk: Int): ChunkedByteBuffer = {
nit: generateChunkedByteBuffer
var pos = 0

override def write(src: ByteBuffer): Int = {
  val origSrcPos = src.position()
This also seems unused.
override def write(src: ByteBuffer): Int = {
  val origSrcPos = src.position()
  val length = math.min(acceptNBytes, src.remaining())
  src.get(bytes, 0, length)
Are we overwriting the bytes array's previously written data?
yes, this is just test code; we're just checking that the data that gets written is what we expect (which we know based on the absolute position). Really, I could read just one byte at a time and check that it is the right data, but it seemed a little easier this way.
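For reference, the mock channel described above can be sketched roughly like this. This is an illustrative, hypothetical version (the class name `LimitedWriteChannel` and its shape are assumptions, not the actual test code): it accepts at most a fixed number of bytes per write() call and verifies each byte against its absolute position in the overall stream.

```scala
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// Hypothetical sketch of the mock channel from the test: accepts at most
// `maxWriteSize` bytes per write() and verifies each byte against its
// absolute position in the overall stream of data.
class LimitedWriteChannel(maxWriteSize: Int) extends WritableByteChannel {
  private var pos = 0L   // absolute position in the overall stream

  override def isOpen: Boolean = true
  override def close(): Unit = {}

  override def write(src: ByteBuffer): Int = {
    // only accept a limited number of bytes at a time
    val length = math.min(maxWriteSize, src.remaining())
    val bytes = new Array[Byte](length)
    src.get(bytes, 0, length)
    // verify we got the right data, based on the absolute stream position
    (0 until length).foreach { idx =>
      assert(bytes(idx) == (pos + idx).toByte, s"; wrong data at ${pos + idx}")
    }
    pos += length
    length
  }
}
```

Because the expected value of every byte is a function of its absolute position, the check works even though each write() call reuses (overwrites) the same scratch array.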
// could just use the inputStream on the temp file, rather than memory-mapping the file.
// Until then, replication can cause the process to use too much memory and get killed
// by the OS / cluster manager (not a java OOM, since its a memory-mapped file) even though
// we've read the data to disk.
btw this fix is such low-hanging fruit that I would try to do it immediately afterwards. (I haven't filed a jira yet just because there are already so many defunct jiras related to this; I was going to wait till my changes got some traction.)
I think it's OK to get it in like this first, as this makes the behavior for 2.01 gb basically the same as it was for 1.99 gb.
not a java OOM, since its a memory-mapped file
I'm not sure why a memory-mapped file would cause too much memory use. AFAIK memory mapping is a lazy, page-wise loading mechanism: the system only loads the to-be-accessed file segments into memory pages, not the whole file. So from my understanding, even very small physical memory could map a super large file. Memory mapping should not occupy too much memory.
to be honest I don't have a perfect understanding of this, but my impression is that it is not exactly lazy loading: the OS has a lot of leeway in deciding how much to keep in memory, but it should always release the memory under pressure. This is problematic under yarn, where the container's memory use is monitored independently of the OS. So the OS thinks it's fine to put large amounts of data in physical memory, but then the yarn NM looks at the memory use of the specific process tree, decides it's over the limits it has configured, and so kills it.
At least, I've seen cases of yarn killing things for exceeding memory limits where I thought that was the cause, though I did not directly confirm it.
I see. I agree with you that YARN could have some issues in calculating the exact memory usage.
thanks for the reviews @markhamstra @Ngone51, I've updated the pr
Test build #91259 has finished for PR 21440 at commit
@vanzin @JoshRosen this is also ready in the sequence of 2GB limit related changes. (I'll update #21451 now that the first change has gone in)
It feels like transferTo could be simpler, but after thinking for a while I couldn't really come up with something...
}
}

def map(file: File, maxChunkSize: Int): ChunkedByteBuffer = {
Is this used anywhere? Couldn't find a reference.
this version isn't used till the other PR. I can pull it out there
the other version of map is used in this pr from BlockManager.getRemoteBytes() -> ChunkedByteBuffer.fromManagedBuffer() -> ChunkedByteBuffer.map
*/
private[io] class ChunkedByteBufferFileRegion(
    val chunkedByteBuffer: ChunkedByteBuffer,
    val ioChunkSize: Int) extends AbstractReferenceCounted with FileRegion with Logging {
Extend AbstractFileRegion?
Do the fields need to be public?
You don't seem to need Logging.
private var _transferred: Long = 0
// this duplicates the original chunks, so we're free to modify the position, limit, etc.
private val chunks = chunkedByteBuffer.getChunks()
private val cumLength = chunks.scanLeft(0L) { _ + _.remaining()}
Use foldLeft(0) { blah } + avoid the intermediate val?
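For what it's worth, the trade-off between the two can be sketched with made-up chunk sizes (the numbers below are illustrative, not from the actual code): scanLeft preserves every running total, which is what a cumulative-offset array like cumLength needs, while foldLeft only yields the final total, so it suffices only where just the size is wanted.

```scala
// Illustrative only: chunk sizes are made up.
val chunkSizes = Seq(4L, 8L, 2L)

// scanLeft keeps every running total, useful for locating which chunk a
// given absolute position falls into.
val cumLength = chunkSizes.scanLeft(0L)(_ + _)   // 0, 4, 12, 14

// foldLeft keeps only the final total, enough when only the size is needed.
val size = chunkSizes.foldLeft(0L)(_ + _)        // 14
```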
while (keepGoing) {
  while (currentChunk.hasRemaining && keepGoing) {
    val ioSize = Math.min(currentChunk.remaining(), ioChunkSize)
    val originalPos = currentChunk.position()
Unused.
sorry, a bunch of leftover bits from earlier debugging. All cleaned up now.
LGTM.
private var _transferred: Long = 0
// this duplicates the original chunks, so we're free to modify the position, limit, etc.
private val chunks = chunkedByteBuffer.getChunks()
private val size = chunks.foldLeft(0) { _ + _.remaining()}
space before }
/**
 * This exposes a ChunkedByteBuffer as a netty FileRegion, just to allow sending > 2gb in one netty
 * message. This is because netty cannot send a ByteBuf > 2g, but it can send a large FileRegion,
3 spaces
/**
 * This mocks a channel which only accepts a limited number of bytes at a time. It also verifies
 * the written data matches our expectations as the data is received.
 * @param maxWriteSize
remove
Test build #92399 has finished for PR 21440 at commit
Test build #92406 has finished for PR 21440 at commit
private var _transferred: Long = 0
// this duplicates the original chunks, so we're free to modify the position, limit, etc.
private val chunks = chunkedByteBuffer.getChunks()
private val size = chunks.foldLeft(0) { _ + _.remaining() }
0L? Otherwise this will overflow for > 2G right?
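The overflow the comment warns about is easy to demonstrate with made-up chunk sizes (illustrative only, not the actual code): an Int accumulator silently wraps past 2 GB, while a Long accumulator gives the correct total.

```scala
// Illustrative: chunk sizes chosen so they sum past Int.MaxValue (~2 GB).
val chunkSizes = Seq(Int.MaxValue, 1024)

// Long accumulator: correct total, slightly above 2 GB.
val sizeAsLong = chunkSizes.foldLeft(0L)(_ + _.toLong)

// Int accumulator: silently overflows and wraps around to a negative value.
val sizeAsInt = chunkSizes.foldLeft(0)(_ + _)
```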
Test build #92533 has finished for PR 21440 at commit
@tgravescs @felixcheung @zsxwing maybe one of you could take a look? I got a lgtm from marcelo but he's out for a few weeks, would prefer to get another approval, plus I'll need review on #21451
@mridulm @jerryshao maybe you would be interested in reviewing this as well?
LG
@@ -723,7 +728,9 @@ private[spark] class BlockManager(
}

if (data != null) {
  return Some(new ChunkedByteBuffer(data))
  val chunkSize =
    conf.getSizeAsBytes("spark.storage.memoryMapLimitForTests", Int.MaxValue.toString).toInt
nit: Make chunkSize a private field in BlockManager instead of recomputing it each time?
val chunks = new ListBuffer[ByteBuffer]()
while (remaining > 0) {
  val chunkSize = math.min(remaining, maxChunkSize)
  val chunk = channel.map(FileChannel.MapMode.READ_ONLY, pos, chunkSize)
Wondering if we could make these FileRegions instead, and use transferTo instead of write in ChunkedByteBufferFileRegion?
I'm not sure I understand. What FileRegion are you referring to -- the only one I know of is netty's interface. Do you mean implement another FileRegion for each chunk, and then have ChunkedByteBufferFileRegion delegate to that?
We could do that, but I don't think it would be any better. ChunkedByteBufferFileRegion.transferTo would be about as complex as it is now. Also, it may be worth noting that this particular method really should disappear -- we shouldn't be mapping this at all, we should be using an input stream (see the TODO above), but I want to do that separately.
I was thinking of DefaultFileRegion... but any other zero-copy impl should be fine.
I think your concern is that when we are going to send data that is backed by a file, eg. a remote read of an RDD cached on disk, we should be able to send it using something more efficient than memory mapping the entire file. Is that correct?
That actually isn't a problem. This map() method isn't called for sending disk-cached RDDs. That is already handled correctly with FileSegmentManagedBuffer.convertToNetty(), which uses the DefaultFileRegion you had in mind. The map method is only used on the receiving end, after the data has already been transferred, and just to pass the data on to other spark code locally in the executor. (And that will avoid the map() entirely after the TODO above.)
I needed to add ChunkedByteBufferFileRegion for data that is already in memory as a ChunkedByteBuffer, eg. for memory-cached RDDs.
Perfect, thanks for clarifying !
import org.apache.spark.network.util.ByteArrayWritableChannel
import org.apache.spark.storage.StorageUtils
import org.apache.spark.util.Utils
nit: this blank line seems unnecessary.
}

def map(file: File, maxChunkSize: Int, offset: Long, length: Long): ChunkedByteBuffer = {
  Utils.tryWithResource(new FileInputStream(file).getChannel()) { channel =>
Can we please use FileChannel#open instead? FileInputStream/FileOutputStream have some issues (https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful)
I wasn't aware of that issue, thanks for sharing that, I'll update this. Should we also update other uses? Seems there are a lot of other cases, eg. UnsafeShuffleWriter, DiskBlockObjectWriter, etc.
I've already updated some of them in SPARK-21475 in the shuffle-related code path, but not the others, which are not so critical.
great, thanks for the explanation
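To make the suggestion above concrete, here is a hedged sketch of the chunked memory-mapping loop using FileChannel.open instead of new FileInputStream(file).getChannel(). The helper name mapInChunks is hypothetical, and it returns the raw chunks rather than a ChunkedByteBuffer, just to keep the sketch self-contained:

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption
import scala.collection.mutable.ListBuffer

// Hypothetical sketch: memory-map a file in chunks of at most maxChunkSize
// bytes, opening the channel via FileChannel.open as suggested above.
def mapInChunks(file: File, maxChunkSize: Int): Seq[ByteBuffer] = {
  val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
  try {
    val chunks = new ListBuffer[ByteBuffer]()
    var pos = 0L
    var remaining = channel.size()
    while (remaining > 0) {
      // each mapped region is capped at maxChunkSize, so no single
      // ByteBuffer ever exceeds 2 GB
      val chunkSize = math.min(remaining, maxChunkSize.toLong)
      chunks += channel.map(FileChannel.MapMode.READ_ONLY, pos, chunkSize)
      pos += chunkSize
      remaining -= chunkSize
    }
    chunks.toSeq
  } finally {
    channel.close()
  }
}
```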
val thisWriteSize = target.write(currentChunk)
currentChunk.limit(originalLimit)
written += thisWriteSize
if (thisWriteSize < ioSize) {
What happens if thisWriteSize is smaller than ioSize? Will Spark throw an exception or do something else?
actually this is a totally normal condition, it just means the channel is not currently ready to accept any more data. This is something netty expects, and it will make sure the rest of the data is put on the channel eventually (transferTo will get called the next time with the correct position argument indicating how far along it is).
The added unit tests cover this.
I see, thanks for explaining.
LGTM, thanks for working on this @squito!
Test build #93234 has finished for PR 21440 at commit
LGTM. Merging to master.
Although the code quality is pretty good, I am still afraid it could introduce some unexpected issues. Would it be possible to introduce a conf to disable the new changes and use the previous implementation? We can remove the conf in the next release.
@gatorsmile sure, that's pretty easy. I'll submit a follow-up pr.
@squito Thank you!
(1) Netty's ByteBuf cannot support data > 2gb. So to transfer data from a
ChunkedByteBuffer over the network, we use a custom version of
FileRegion which is backed by the ChunkedByteBuffer.
(2) On the receiving end, we need to expose all the data in a
FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by memory
mapping the entire file in chunks.
Added unit tests. Ran the randomized test a couple of hundred times on my laptop. Tests cover the equivalent of SPARK-24107 for the ChunkedByteBufferFileRegion. Also tested on a cluster with remote cache reads >2gb (in memory and on disk).