
[SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types #11755

Closed

Conversation

JoshRosen (Contributor)

Because ClassTags are available when constructing ShuffledRDD, we can use them to automatically select Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.

This patch introduces SerializerManager, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
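The selection logic described above can be sketched roughly as follows. This is an illustrative, simplified version, not the actual Spark code; the object and method names here are hypothetical:

```scala
import scala.reflect.ClassTag

// Hypothetical sketch: choose Kryo when every shuffled type is a primitive,
// an array of primitives, or a String; otherwise fall back to the default
// (Java) serializer.
object SerializerChoice {
  // classOf[Int] etc. resolve to the primitive runtime classes, which is
  // also what ClassTag[Int].runtimeClass returns.
  private val primitiveAndStringClasses: Set[Class[_]] = Set(
    classOf[Boolean], classOf[Byte], classOf[Char], classOf[Short],
    classOf[Int], classOf[Long], classOf[Float], classOf[Double],
    classOf[String])

  private def canUseKryo(ct: ClassTag[_]): Boolean = {
    val clazz = ct.runtimeClass
    primitiveAndStringClasses.contains(clazz) ||
      (clazz.isArray && clazz.getComponentType.isPrimitive)
  }

  /** True when both the key and value types are known to be Kryo-safe. */
  def useKryoForShuffle(keyTag: ClassTag[_], valueTag: ClassTag[_]): Boolean =
    canUseKryo(keyTag) && canUseKryo(valueTag)
}
```

For example, a shuffle of `(Int, Array[Double])` pairs would qualify for Kryo under this rule, while a shuffle of arbitrary case classes would not.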

In a planned follow-up patch, I will extend the BlockManager APIs so that we can use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53283 has finished for PR 11755 at commit 876f038.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53287 has finished for PR 11755 at commit ca923b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53302 has finished for PR 11755 at commit 51205ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -83,7 +83,7 @@ private[spark] class DirectTaskResult[T](
     } else {
       // This should not run when holding a lock because it may cost dozens of seconds for a large
       // value.
-      val resultSer = SparkEnv.get.serializer.newInstance()
+      val resultSer = SparkEnv.get. serializer.newInstance()
Contributor
Seems an accidental change.

@nongli
Contributor

nongli commented Mar 16, 2016

LGTM

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53353 has finished for PR 11755 at commit 45b0c0b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModelWriter(instance: DecisionTreeClassificationModel)
    • class DecisionTreeRegressionModelWriter(instance: DecisionTreeRegressionModel)
    • case class SplitData(
    • case class NodeData(
    • class Estimator(Params):
    • class Transformer(Params):
    • class Model(Transformer):
    • class LogisticRegressionModel(JavaModel, MLWritable, MLReadable):
    • class NaiveBayesModel(JavaModel, MLWritable, MLReadable):
    • class PipelineMLWriter(JavaMLWriter, JavaWrapper):
    • class PipelineMLReader(JavaMLReader):
    • class PipelineModelMLWriter(JavaMLWriter, JavaWrapper):
    • class PipelineModelMLReader(JavaMLReader):
    • case class SQLTable(

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53362 has finished for PR 11755 at commit 45b0c0b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModelWriter(instance: DecisionTreeClassificationModel)
    • class DecisionTreeRegressionModelWriter(instance: DecisionTreeRegressionModel)
    • case class SplitData(
    • case class NodeData(
    • class Estimator(Params):
    • class Transformer(Params):
    • class Model(Transformer):
    • class LogisticRegressionModel(JavaModel, MLWritable, MLReadable):
    • class NaiveBayesModel(JavaModel, MLWritable, MLReadable):
    • class PipelineMLWriter(JavaMLWriter, JavaWrapper):
    • class PipelineMLReader(JavaMLReader):
    • class PipelineModelMLWriter(JavaMLWriter, JavaWrapper):
    • class PipelineModelMLReader(JavaMLReader):
    • case class SQLTable(

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Mar 17, 2016

Test build #53365 has finished for PR 11755 at commit 45b0c0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModelWriter(instance: DecisionTreeClassificationModel)
    • class DecisionTreeRegressionModelWriter(instance: DecisionTreeRegressionModel)
    • case class SplitData(
    • case class NodeData(
    • class Estimator(Params):
    • class Transformer(Params):
    • class Model(Transformer):
    • class LogisticRegressionModel(JavaModel, MLWritable, MLReadable):
    • class NaiveBayesModel(JavaModel, MLWritable, MLReadable):
    • class PipelineMLWriter(JavaMLWriter, JavaWrapper):
    • class PipelineMLReader(JavaMLReader):
    • class PipelineModelMLWriter(JavaMLWriter, JavaWrapper):
    • class PipelineModelMLReader(JavaMLReader):
    • case class SQLTable(

@rxin
Contributor

rxin commented Mar 17, 2016

Merging in master!

@asfgit asfgit closed this in de1a84e Mar 17, 2016
@JoshRosen JoshRosen deleted the automatically-pick-best-serializer branch March 17, 2016 06:07
asfgit pushed a commit that referenced this pull request Mar 22, 2016
Building on the `SerializerManager` introduced in SPARK-13926 / #11755, this patch modifies Spark's BlockManager to use RDDs' ClassTags in order to select the best serializer to use when caching RDD blocks.

When storing a local block, the BlockManager `put()` methods use implicits to record ClassTags and stores those tags in the blocks' BlockInfo records. When reading a local block, the stored ClassTag is used to pick the appropriate serializer. When a block is stored with replication, the class tag is written into the block transfer metadata and will also be stored in the remote BlockManager.

There are two or three places where we don't properly pass ClassTags, including TorrentBroadcast and BlockRDD. I think this happens to work because the missing ClassTag always happens to be `ClassTag.Any`, but it might be worth looking more carefully at those places to see whether we should be more explicit.
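The ClassTag threading described above can be sketched like this. This is a simplified illustration with hypothetical stand-in types (`BlockInfo` and the store here are not the real Spark classes):

```scala
import scala.reflect.ClassTag

// Hypothetical sketch: a context-bound ClassTag is captured implicitly at
// the put() call site and stored in the block's metadata, so that the
// matching serializer can be chosen at read time.
final case class BlockInfo(tag: ClassTag[_], bytes: Array[Byte])

class TinyBlockStore {
  private val blocks = scala.collection.mutable.Map.empty[String, BlockInfo]

  // The [T: ClassTag] context bound records the caller's element type
  // without the caller having to pass it explicitly.
  def put[T: ClassTag](id: String, bytes: Array[Byte]): Unit =
    blocks(id) = BlockInfo(implicitly[ClassTag[T]], bytes)

  // At read time, the stored tag drives serializer selection.
  def tagFor(id: String): Option[ClassTag[_]] = blocks.get(id).map(_.tag)
}
```

Note that a caller that doesn't supply a concrete type (as in the TorrentBroadcast and BlockRDD cases mentioned below) ends up recording `ClassTag.Any`, which is why those gaps happen to be harmless.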

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11801 from JoshRosen/pick-best-serializer-for-caching.
roygao94 pushed two commits to roygao94/spark that referenced this pull request Mar 22, 2016