[SPARK-25905][CORE] When getting a remote block, avoid forcing a conversion to a ChunkedByteBuffer #23058
Conversation
Test build #98923 has finished for PR 23058 at commit.
@attilapiros can you review this please?
can we also make the same change to …
…ersion to a ChunkedByteBuffer In BlockManager, getRemoteValues gets a ChunkedByteBuffer (by calling getRemoteBytes) and creates an InputStream from it. getRemoteBytes, in turn, gets a ManagedBuffer and converts it to a ChunkedByteBuffer. Instead, expose a getRemoteManagedBuffer method so getRemoteValues can just get this ManagedBuffer and use its InputStream. When reading a remote cache block from disk, this reduces heap memory usage significantly. Retain getRemoteBytes for other callers.
…gedBuffer is not a BlockManagerManagedBuffer. Also, update a comment in a test method.
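The commit message above describes the core idea: instead of copying a remote block's bytes into a `ChunkedByteBuffer` before deserializing, wrap the buffer's `InputStream` directly. Below is a minimal stand-alone sketch of that pattern, using plain `java.io` serialization in place of Spark's `ManagedBuffer` and serializer; the names `serializedBlock` and `readValues` are illustrative, not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, ObjectInputStream, ObjectOutputStream}

object StreamingBlockRead {
  // Stand-in for a remote block's bytes: serialize some values into a byte array.
  def serializedBlock(values: Seq[String]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    values.foreach(v => out.writeObject(v))
    out.close()
    bos.toByteArray
  }

  // The pattern the PR moves to: deserialize straight off the stream,
  // never materializing a second full copy of the block on the heap.
  def readValues(in: InputStream, count: Int): Seq[String] = {
    val ois = new ObjectInputStream(in)
    try (0 until count).map(_ => ois.readObject().asInstanceOf[String])
    finally ois.close()
  }

  def main(args: Array[String]): Unit = {
    val bytes = serializedBlock(Seq("a", "b", "c"))
    val values = readValues(new ByteArrayInputStream(bytes), 3)
    assert(values == Seq("a", "b", "c"))
    println(values.mkString(","))
  }
}
```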
I had a conversation off-line with Imran. As we end up deserializing the value of the task result into a ByteBuffer anyway, this change does not seem worthwhile.
Test build #99091 has finished for PR 23058 at commit.
lgtm. I looked more into the lifecycle of the buffers, and I also checked whether we should buffer the input stream. @wypoon, one thing: can you update the testing section of the PR description to mention the coverage you found in the existing unit tests?
Thanks @squito. I updated the testing section of the PR.
lgtm
I have checked TaskResultGetter and the deserializers, and I have an idea/question: what is your opinion about extending SerializerInstance with a new method that accepts a ManagedBuffer?

```scala
def deserialize[T: ClassTag](bytes: ManagedBuffer): T
```

As every ManagedBuffer has a createInputStream, both serializer implementations (Kryo and Java) can be extended.

Kryo (not tested):
```scala
def deserialize[T: ClassTag](bytes: ManagedBuffer): T = {
  val kryo = borrowKryo()
  try {
    // `input` is the serializer instance's reusable Kryo Input
    input.setInputStream(bytes.createInputStream())
    kryo.readClassAndObject(input).asInstanceOf[T]
  } finally {
    releaseKryo(kryo)
  }
}
```
Java (not tested):
```scala
def deserialize[T: ClassTag](bytes: ManagedBuffer): T = {
  val bis = bytes.createInputStream()
  val in = deserializeStream(bis)
  try {
    in.readObject()
  } finally {
    in.close() // close the deserialization stream when done
  }
}
```
This way we can get rid of the wasteful getRemoteBytes.
It is fine for me if it is done separately.
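To make the proposal above concrete, here is a self-contained sketch of the shape of such a `deserialize(ManagedBuffer)` method, with a minimal `SimpleManagedBuffer` trait and plain `java.io` serialization standing in for Spark's `ManagedBuffer` and serializer internals; all names other than `createInputStream` are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, ObjectInputStream, ObjectOutputStream}
import scala.reflect.ClassTag

object ManagedBufferDeserialize {
  // Stand-in for Spark's ManagedBuffer; only createInputStream matters here.
  trait SimpleManagedBuffer { def createInputStream(): InputStream }

  // The shape of the proposed SerializerInstance addition: deserialize straight
  // off the buffer's stream, so no intermediate ChunkedByteBuffer is built.
  def deserialize[T: ClassTag](buffer: SimpleManagedBuffer): T = {
    val in = new ObjectInputStream(buffer.createInputStream())
    try in.readObject().asInstanceOf[T]
    finally in.close()
  }

  // Helper for the demo: wrap a serialized value in a SimpleManagedBuffer.
  def bufferOf(value: AnyRef): SimpleManagedBuffer = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(value)
    out.close()
    new SimpleManagedBuffer {
      def createInputStream(): InputStream = new ByteArrayInputStream(bos.toByteArray)
    }
  }

  def main(args: Array[String]): Unit = {
    val round = deserialize[String](bufferOf("hello"))
    assert(round == "hello")
    println(round)
  }
}
```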
@attilapiros yes, something like that would be possible. I was thinking you'd just use the existing serializer methods to do it, something like:

```scala
val buffer = getRemoteManagedBuffer()
val valueItr = deserializeStream(buffer.createInputStream())
val result = valueItr.next()
assert(!valueItr.hasNext()) // makes sure it's closed too
```

My reluctance to bother with it is that you'd still be getting a …
@mridulm @jerryshao @Ngone51 @vanzin just checking if you want to look at this before I merge, will leave open a bit.
The change looks good to me. I understand that this change uses memory efficiently, but I am wondering whether it causes any performance degradation compared to memory mapping. If yes, can we measure the performance impact and document it with this change?
@ankuriitg good question, though if you look at what the old code was doing, it wasn't memory mapping the file; it was reading it into memory from a regular input stream. It was basically doing the same thing this is doing now, just with the extra memory overhead.
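The point in the comment above is that the old path and the new path both read from a regular input stream; the difference is whether a full heap copy of the block is materialized first. A toy sketch of the two shapes (not Spark code; the names are illustrative):

```scala
import java.io.{ByteArrayInputStream, InputStream}

object StreamVsCopy {
  // Pre-patch shape: pull everything into one extra heap buffer first
  // (standing in for the ChunkedByteBuffer materialization), then use it.
  def copyThenCount(bytes: Array[Byte]): Long = {
    val copy = bytes.clone() // the extra full-block allocation
    copy.length.toLong
  }

  // Post-patch shape: consume the stream with a small fixed-size buffer,
  // so the extra heap cost is constant regardless of block size.
  def streamCount(in: InputStream): Long = {
    val buf = new Array[Byte](8192)
    var total = 0L
    var n = in.read(buf)
    while (n != -1) {
      total += n
      n = in.read(buf)
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val data = Array.fill[Byte](100000)(1)
    // Both shapes see the same bytes; only the peak heap usage differs.
    assert(copyThenCount(data) == streamCount(new ByteArrayInputStream(data)))
    println(streamCount(new ByteArrayInputStream(data)))
  }
}
```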
core/src/main/scala/org/apache/spark/storage/BlockManager.scala
Test build #99420 has finished for PR 23058 at commit.
Merged to master, thanks @wypoon.
…ersion to a ChunkedByteBuffer

## What changes were proposed in this pull request?

In `BlockManager`, `getRemoteValues` gets a `ChunkedByteBuffer` (by calling `getRemoteBytes`) and creates an `InputStream` from it. `getRemoteBytes`, in turn, gets a `ManagedBuffer` and converts it to a `ChunkedByteBuffer`. Instead, expose a `getRemoteManagedBuffer` method so `getRemoteValues` can just get this `ManagedBuffer` and use its `InputStream`. When reading a remote cache block from disk, this reduces heap memory usage significantly. Retain `getRemoteBytes` for other callers.

## How was this patch tested?

Imran Rashid wrote an application (https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala) that, among other things, tests reading remote cache blocks. I ran this application, using 2500MB blocks, to test reading a cache block on disk. Without this change, with `--executor-memory 5g`, the test fails with `java.lang.OutOfMemoryError: Java heap space`. With the change, the test passes with `--executor-memory 2g`.

I also ran the unit tests in core. In particular, `DistributedSuite` has a set of tests that exercise the `getRemoteValues` code path. `BlockManagerSuite` has several tests that call `getRemoteBytes`; I left these unchanged, so `getRemoteBytes` still gets exercised.

Closes apache#23058 from wypoon/SPARK-25905.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>