[SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues() #14952
Conversation
/cc @ericl
Test build #64908 has finished for PR 14952 at commit
getRemoteBytes(blockId).map { data =>
  val values =
-   serializerManager.dataDeserializeStream(blockId, data.toInputStream(dispose = true))
+   serializerManager.dataDeserializeStream(blockId, data.toInputStream(dispose = true))(ct)
Is it possible for `dataDeserializeStream` to require a ClassTag to be explicitly passed?
I'm not saying this should definitely be done one way or the other, but I'm curious why you have a preference for the extra code and more verbose API that come with making the classTag an explicit parameter.
It seems like it is easy to accidentally forget to pass a correct classtag, since this has happened twice already.
How do you forget to pass a correct ClassTag when the compiler is enforcing its presence via the context bound?
In this case, the problem is that the type parameter was inferred as `Any`.
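For readers following along, here is a minimal, self-contained sketch of the inference pitfall being discussed. The helper names are made up for illustration and are not Spark's actual API; the point is only that a context-bound ClassTag can silently be materialized for `Any`, whereas an explicit parameter forces the caller to name a tag.

```scala
import scala.reflect.ClassTag

object ClassTagInferencePitfall {
  // Hypothetical helpers, not Spark's real API: one relies on a context bound,
  // the other takes the ClassTag as an explicit argument.
  def deserializeImplicit[T: ClassTag](bytes: Array[Byte]): Iterator[T] = {
    println(s"implicit tag = ${implicitly[ClassTag[T]]}")
    Iterator.empty
  }

  def deserializeExplicit[T](bytes: Array[Byte], ct: ClassTag[T]): Iterator[T] = {
    println(s"explicit tag = $ct")
    Iterator.empty
  }

  def main(args: Array[String]): Unit = {
    val bytes = Array.emptyByteArray

    // Nothing pins T here, so the expected type drives inference: T becomes Any and the
    // compiler silently materializes ClassTag[Any]. This compiles fine, but a serializer
    // chosen from this tag would not match the one used when the block was written.
    val wrong: Iterator[Any] = deserializeImplicit(bytes)

    // Pinning the type parameter yields the intended tag, but nothing forces callers to do so.
    val right: Iterator[Array[Byte]] = deserializeImplicit[Array[Byte]](bytes)

    // An explicit parameter makes the omission impossible (though a wrong tag can still
    // be passed deliberately).
    val explicit = deserializeExplicit(bytes, implicitly[ClassTag[Array[Byte]]])

    println((wrong.isEmpty, right.isEmpty, explicit.isEmpty))
  }
}
```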
Test build #64925 has finished for PR 14952 at commit
looks good
I'm going to merge this into master and branch-2.0 as an immediate fix for the PySpark caching issue.
What changes were proposed in this pull request?
This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java serialization to be inappropriately used instead of Kryo, leading to the StreamCorruptedException.
We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded. Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long.
This patch addresses the bug by modifying `BlockManager`'s `get()` and `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`).
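To make the shape of that change concrete, here is a heavily simplified sketch. `SketchBlockManager` and `SketchSerializerManager` are made-up names, not the actual Spark classes; the sketch only shows the caller's element ClassTag being accepted by the `get`/`getRemoteValues`-style methods and threaded down to deserialization instead of being re-inferred (and collapsing to `Any`).

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for SerializerManager: it only demonstrates that the
// element ClassTag arrives intact at the deserialization step.
class SketchSerializerManager {
  def dataDeserialize[T](blockId: String, bytes: Array[Byte])(implicit ct: ClassTag[T]): Iterator[T] = {
    // In real Spark the ClassTag drives automatic serializer selection.
    println(s"deserializing $blockId with element tag $ct")
    Iterator.empty
  }
}

// Hypothetical stand-in for BlockManager, showing the ClassTag threading.
class SketchBlockManager(serializerManager: SketchSerializerManager) {
  // Entry point used by rdd.iterator-style callers; the caller's ClassTag enters here...
  def getOrElseUpdate[T: ClassTag](blockId: String)(makeIterator: () => Iterator[T]): Iterator[T] =
    get[T](blockId).getOrElse(makeIterator())

  // ...is passed through get()...
  def get[T: ClassTag](blockId: String): Option[Iterator[T]] =
    getRemoteValues[T](blockId)

  // ...and reaches the remote-read path, so deserialization sees the same element type
  // that serializer selection saw at write time (e.g. Array[Byte] => Kryo).
  private def getRemoteValues[T: ClassTag](blockId: String): Option[Iterator[T]] =
    getRemoteBytes(blockId).map(bytes => serializerManager.dataDeserialize[T](blockId, bytes))

  private def getRemoteBytes(blockId: String): Option[Array[Byte]] =
    Some(Array.emptyByteArray) // placeholder for the actual remote fetch
}

object SketchUsage {
  def main(args: Array[String]): Unit = {
    val bm = new SketchBlockManager(new SketchSerializerManager)
    // Prints a ClassTag for Array[Byte] rather than Any, because the tag is threaded explicitly.
    bm.getOrElseUpdate[Array[Byte]]("rdd_0_0")(() => Iterator.empty)
  }
}
```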
How was this patch tested?
Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed.