[SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues() #14952
Conversation
/cc @ericl
Test build #64908 has finished for PR 14952 at commit
getRemoteBytes(blockId).map { data =>
  val values =
-   serializerManager.dataDeserializeStream(blockId, data.toInputStream(dispose = true))
+   serializerManager.dataDeserializeStream(blockId, data.toInputStream(dispose = true))(ct)
Is it possible for `dataDeserializeStream` to require a ClassTag to be explicitly passed?
I'm not saying this should definitely be done one way or the other, but I'm curious why you have a preference for the extra code and more verbose API that come with making the classTag an explicit parameter.
It seems like it is easy to accidentally forget to pass a correct classtag, since this has happened twice already.
How do you forget to pass a correct ClassTag when the compiler is enforcing its presence via the context bound?
In this case, the problem is that the type parameter was inferred as `Any`.
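For readers following along, here is a minimal, self-contained sketch of the inference pitfall being discussed. The helper names are made up for illustration and are not Spark's actual API; the point is only that a context-bound ClassTag can silently be materialized for `Any`, whereas an explicit parameter forces the caller to name a tag.

```scala
import scala.reflect.ClassTag

object ClassTagInferencePitfall {
  // Hypothetical helpers, not Spark's real API: one relies on a context bound,
  // the other takes the ClassTag as an explicit argument.
  def deserializeImplicit[T: ClassTag](bytes: Array[Byte]): Iterator[T] = {
    println(s"implicit tag = ${implicitly[ClassTag[T]]}")
    Iterator.empty
  }

  def deserializeExplicit[T](bytes: Array[Byte], ct: ClassTag[T]): Iterator[T] = {
    println(s"explicit tag = $ct")
    Iterator.empty
  }

  def main(args: Array[String]): Unit = {
    val bytes = Array.emptyByteArray

    // Nothing pins T here, so the expected type drives inference: T becomes Any and the
    // compiler silently materializes ClassTag[Any]. This compiles fine, but a serializer
    // chosen from this tag would not match the one used when the block was written.
    val wrong: Iterator[Any] = deserializeImplicit(bytes)

    // Pinning the type parameter yields the intended tag, but nothing forces callers to do so.
    val right: Iterator[Array[Byte]] = deserializeImplicit[Array[Byte]](bytes)

    // An explicit parameter makes the omission impossible (though a wrong tag can still
    // be passed deliberately).
    val explicit = deserializeExplicit(bytes, implicitly[ClassTag[Array[Byte]]])

    println((wrong.isEmpty, right.isEmpty, explicit.isEmpty))
  }
}
```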
Test build #64925 has finished for PR 14952 at commit
looks good
I'm going to merge this into master and branch-2.0 as an immediate fix for the PySpark caching issue.
What changes were proposed in this pull request?
This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java serialization to be inappropriately used instead of Kryo, leading to the StreamCorruptedException.
We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded. Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long.
This patch addresses the bug by modifying `BlockManager`'s `get()` and `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`).
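To make the shape of that change concrete, here is a heavily simplified sketch. `SketchBlockManager` and `SketchSerializerManager` are made-up names, not the actual Spark classes; the sketch only shows the caller's element ClassTag being accepted by the `get`/`getRemoteValues`-style methods and threaded down to deserialization instead of being re-inferred (and collapsing to `Any`).

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for SerializerManager: it only demonstrates that the
// element ClassTag arrives intact at the deserialization step.
class SketchSerializerManager {
  def dataDeserialize[T](blockId: String, bytes: Array[Byte])(implicit ct: ClassTag[T]): Iterator[T] = {
    // In real Spark the ClassTag drives automatic serializer selection.
    println(s"deserializing $blockId with element tag $ct")
    Iterator.empty
  }
}

// Hypothetical stand-in for BlockManager, showing the ClassTag threading.
class SketchBlockManager(serializerManager: SketchSerializerManager) {
  // Entry point used by rdd.iterator-style callers; the caller's ClassTag enters here...
  def getOrElseUpdate[T: ClassTag](blockId: String)(makeIterator: () => Iterator[T]): Iterator[T] =
    get[T](blockId).getOrElse(makeIterator())

  // ...is passed through get()...
  def get[T: ClassTag](blockId: String): Option[Iterator[T]] =
    getRemoteValues[T](blockId)

  // ...and reaches the remote-read path, so deserialization sees the same element type
  // that serializer selection saw at write time (e.g. Array[Byte] => Kryo).
  private def getRemoteValues[T: ClassTag](blockId: String): Option[Iterator[T]] =
    getRemoteBytes(blockId).map(bytes => serializerManager.dataDeserialize[T](blockId, bytes))

  private def getRemoteBytes(blockId: String): Option[Array[Byte]] =
    Some(Array.emptyByteArray) // placeholder for the actual remote fetch
}

object SketchUsage {
  def main(args: Array[String]): Unit = {
    val bm = new SketchBlockManager(new SketchSerializerManager)
    // Prints a ClassTag for Array[Byte] rather than Any, because the tag is threaded explicitly.
    bm.getOrElseUpdate[Array[Byte]]("rdd_0_0")(() => Iterator.empty)
  }
}
```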
How was this patch tested?
Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed.