
[SPARK-746][CORE] Added Avro Serialization to Kryo #7004

Closed
JDrit wants to merge 13 commits

Conversation

@JDrit (Contributor) commented Jun 24, 2015

Added a custom Kryo serializer for generic Avro records to reduce the network IO
involved during a shuffle. This compresses the schema and allows users to
register their schemas ahead of time to further reduce traffic.

Currently Kryo tries to use its default serializer for generic Records, which will include
a lot of unneeded data in each record.
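
As a minimal usage sketch (not part of the patch text; it assumes the registration method added by this patch ends up on SparkConf as registerAvroSchemas, and uses a made-up example schema):

  import org.apache.avro.Schema
  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical example schema; any generic Avro record schema would do.
  val userSchema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")

  val conf = new SparkConf()
    .setAppName("avro-kryo-shuffle")
    // Use Kryo so the new Avro serializer handles generic records.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Register schemas up front so only a small fingerprint travels with each record.
    .registerAvroSchemas(userSchema)

  val sc = new SparkContext(conf)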

Commits:

  • … network IO involved during a shuffle. This compresses the schema and allows for users to register their schemas ahead of time to further reduce traffic.
  • Changed how the records were serialized to reduce both the time and memory used
  • removed unused variable
@JDrit changed the title from [SPARK-746][CORE] Added Avro Serialization to Kryo to [WIP][SPARK-746][CORE] Added Avro Serialization to Kryo on Jun 25, 2015
@JDrit changed the title from [WIP][SPARK-746][CORE] Added Avro Serialization to Kryo to [SPARK-746][CORE][WIP] Added Avro Serialization to Kryo on Jun 25, 2015
@squito (Contributor) commented Jun 25, 2015

Jenkins, this is ok to test

<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>${avro.version}</version>
<scope>${hadoop.deps.scope}</scope>
Contributor:

Isn't this included already?

Contributor:

avro and avro-mapred are already in the dependencyManagement of the parent POM, so nothing more than groupId, artifactId and perhaps scope should be included here. In particular, version should not generally be specified when it is already in dependencyManagement. This is a frequent error in the Spark POMs.

Contributor:

+1 to Mark's comments.

@squito (Contributor) commented Jun 25, 2015

Hi @JDrit, I left a few comments on the code. I know this is still a WIP, so I didn't go through it in much detail, but there are some style and scoping issues (more things should be private or at least private[spark]) that will need to be addressed eventually.

Since the point of this is to be more efficient, can you give some example numbers of the savings? It seems like a great idea, though. (Incidentally, @massie is also working on something to get shuffle to use Parquet.)

@squito (Contributor) commented Jun 25, 2015

One high-level question -- if a user does not call SparkConf.registerAvroSchema, this will still write the schema with every record, correct? The only potential advantage is that the schema gets compressed?

My instinct is that it's not worth it for you to do any extra compression for the schema. There is already compression controlled by spark.shuffle.compress; I don't think you'll gain anything beyond that. I think removing the compression logic would simplify things a bit, and you'd still get the benefits of not sending the schema with every record.

@JDrit (Contributor, Author) commented Jun 25, 2015

I wrote some benchmarking code for this and the results are recorded here.

tl;dr:
The compression on the schema is still useful in both wall time and network IO. Generic records currently use a lot of network bandwidth, and the new serializer greatly cuts down on that usage. Registering schemas ahead of time also helps a lot.

@JDrit (Contributor, Author) commented Jun 25, 2015

Which of the new classes should be marked private or private[spark]?

@JoshRosen (Contributor) commented:

Hey @JDrit,

Serialization performance is a big interest of mine and I'd be happy to help with review of this patch. One question first, though: I've noticed that this adds a new Avro dependency, but it seems to be scoped to compile-only. I assume that this is because we don't want to introduce a hard dependency on Avro to Spark itself, since doing so might create dependency conflicts with user code. However, I'm worried about what happens if we run Spark without any version of Avro on the classpath: will we get ClassNotFoundExceptions when KryoSerializer tries to create a GenericAvroSerializer?

If we can't come up with a clean way to handle this dependency issue in Spark, the next best solution might be to release this Avro serialization code as a third-party package (e.g. via http://spark-packages.org or your own preferred distribution channel). I think that this might be possible by packaging the AvroSerializer code into its own JAR, then writing a custom Kryo registrator and instructing users on how to configure spark.kryo.registrator to use it. To provide a nice experience for end-users, you could even create a custom "builder" class that users configure and then apply to a SparkConf object in order to set the appropriate settings for Avro / Kryo.
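
For illustration, a rough sketch of that spark-packages route (placeholder names, assuming Spark's KryoRegistrator interface; not the code this PR adds):

  import com.esotericsoftware.kryo.{Kryo, Serializer}
  import com.esotericsoftware.kryo.io.{Input, Output}
  import org.apache.avro.generic.GenericData
  import org.apache.spark.serializer.KryoRegistrator

  // Placeholder standing in for the generic Avro serializer from this patch,
  // packaged in its own JAR; a real implementation would write the record plus
  // a schema fingerprint instead of throwing.
  class PlaceholderAvroSerializer extends Serializer[GenericData.Record] {
    override def write(kryo: Kryo, output: Output, record: GenericData.Record): Unit =
      throw new UnsupportedOperationException("sketch only")
    override def read(kryo: Kryo, input: Input, cls: Class[GenericData.Record]): GenericData.Record =
      throw new UnsupportedOperationException("sketch only")
  }

  // Custom registrator that users would point spark.kryo.registrator at, e.g.:
  //   spark.kryo.registrator = com.example.AvroKryoRegistrator
  class AvroKryoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      kryo.register(classOf[GenericData.Record], new PlaceholderAvroSerializer())
    }
  }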

@JoshRosen (Contributor) commented:

Also, to clarify: is this primarily intended to improve the performance of programs written against the Spark Core API? For Spark SQL + DataFrames, I think the spark-avro library will convert the Avro records into Spark SQL's internal Row representation, which should be more efficient to serialize and shuffle. I'd be curious to know whether you could get most of these benefits for simpler workflows by using DataFrames and leaving the serialization up to Spark SQL.

@JDrit (Contributor, Author) commented Jun 27, 2015

This was intended as a performance improvement for spark-core with RDDs, and it works separately from spark-avro. I realize that DataFrames could also be used, but the goal was to make things easier for users who just want to use RDDs.

@JoshRosen (Contributor) commented:

Jenkins, this is ok to test.

@JoshRosen (Contributor) commented:

Any comment on the classpath issues or on whether this introduces a new Avro dependency?

@JDrit (Contributor, Author) commented Jun 29, 2015

This would introduce Avro as a new dependency, but that would be similar to how Spark depends on Parquet and other projects. Spark already has compile-time dependencies on things like Parquet, Flume, etc.

@JoshRosen (Contributor) commented:

I guess the real concern is over whether adding this dependency would create JAR hell for our users. Maybe @vanzin or @srowen could comment on this?

@JoshRosen (Contributor) commented:

Actually, maybe @pwendell or @zsxwing can comment on the Avro dependency concerns (@zsxwing, because you mentioned Avro dependency issues in #6830 (comment)).

<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>${avro.version}</version>
Contributor:

Similarly to the previous one, this one should only contain groupId / artifactId (everything else is inherited).

Contributor:

The one thing you have to add here, though, is:

  <classifier>${avro.mapred.classifier}</classifier>

Contributor (author):

Thanks for pointing that out, just fixed that.

@vanzin (Contributor) commented Jul 1, 2015

I guess the real concern is over whether adding this dependency would create JAR hell for our users.

I don't think it really changes anything. avro is already a hadoop dependency; and avro-mapred is a dependency of spark-hive, so it's already included in the Spark assembly anyway.

@zsxwing (Member) commented Jul 2, 2015

The dependency looks good to me. @JoshRosen, avro-mapred is already in the assembly jar. I built the master branch with -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive and saw the following inclusion lines in the build log:

[info] Including: avro-1.7.7.jar
[info] Including: avro-mapred-1.7.7-hadoop2.jar
[info] Including: avro-ipc-1.7.7.jar
[info] Including: avro-ipc-1.7.7-tests.jar

@JDrit changed the title from [SPARK-746][CORE][WIP] Added Avro Serialization to Kryo to [SPARK-746][CORE] Added Avro Serialization to Kryo on Jul 7, 2015
@JDrit (Contributor, Author) commented Jul 7, 2015

@JoshRosen does that satisfy your concerns and are there other changes you would want me to make?

bos.toByteArray
})


Contributor:

nit: delete extra newline

@squito (Contributor) commented Jul 15, 2015

Jenkins, retest this please

private val schemaCache = new mutable.HashMap[Long, Schema]()

/** This needs to be lazy since SparkEnv is not initialized yet sometimes when this is called */
private lazy val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
Contributor:

Just curious: where does this get constructed when there isn't a SparkEnv?

Contributor:

Why not just accept a SparkConf in the GenericAvroSerializer constructor instead of getting it from SparkEnv?

Contributor (author):

Several of the tests involving block replication fail when this value is not lazily defined.

Contributor (author):

I originally had it accept SparkConf but I was getting serialization errors since SparkConf is not serializable. This way prevents having to send the configuration with the serializer.

Contributor:

Joe and I talked about this a bit offline -- the reason for this is that ShuffleRDD lets you set a Serializer directly, which is used in some tests, and that is why the serializer itself needs to be serializable. I'll add a comment here explaining why it's necessary when I merge.

Contributor:

For anybody who is curious -- I had gotten myself pretty confused about why this change would cause the SparkConf to be serialized. KryoSerializer already had a conf argument to the constructor, and that wasn't changed. But the conf there is only accessed in field initialization, never in methods, so it wasn't stored. Through the wonders of Scala, when you access that conf in a method, conf suddenly also becomes a member variable, and now you can no longer serialize the KryoSerializer.

In practice this means the lazy val codec here is fine in actual use, but it could be very confusing in a unit test where the SparkEnv hasn't been set. So I'll add a comment explaining this a bit.
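
A small illustration of that Scala behaviour (standalone example, not code from this patch):

  // Both classes take a plain Map standing in for SparkConf.
  class ConfOnlyAtInit(conf: Map[String, String]) extends Serializable {
    // conf is consumed during construction and not retained as a field.
    private val bufferKb: Int = conf.getOrElse("buffer.kb", "64").toInt
  }

  class ConfUsedInMethod(conf: Map[String, String]) extends Serializable {
    // Accessing conf lazily (or in any method) makes the compiler keep it as a
    // field, so serializing this instance now also tries to serialize conf.
    lazy val bufferKb: Int = conf.getOrElse("buffer.kb", "64").toInt
  }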

@squito (Contributor) commented Jul 22, 2015

2 super minor comments, but otherwise lgtm!

Also, as a bit of general housekeeping, can you put some of the benchmarks you have on the JIRA (since that serves as a better archive than GitHub)?

@JoshRosen want to take another look at this?

}

/** Gets all the avro schemas in the configuration used in the generic Avro record serializer */
def getAvroSchema: Map[Long, String] = {
Contributor:

What do the keys and values of this map denote?

Contributor (author):

The keys are longs that represent a unique ID of the schema, and the values are the string representation of the schema.
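
As an illustrative sketch (assuming Avro's 64-bit parsing fingerprint is what provides that unique ID; not necessarily the patch's exact code):

  import org.apache.avro.{Schema, SchemaNormalization}

  // Map each schema to (64-bit fingerprint -> JSON string form of the schema).
  def buildSchemaMap(schemas: Seq[Schema]): Map[Long, String] =
    schemas.map(s => SchemaNormalization.parsingFingerprint64(s) -> s.toString).toMap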

@squito (Contributor) commented Jul 23, 2015

Jenkins, retest this please

(1 similar comment from @squito followed on Jul 23, 2015.)

@SparkQA commented Jul 24, 2015

Test build #1196 has finished for PR 7004 at commit 8158d51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JDrit (Contributor, Author) commented Jul 27, 2015

@squito @JoshRosen Have I addressed all your issues, or is there anything else you would like me to do?

@JoshRosen (Contributor) commented:

I'm going to defer to other reviewers who have been following this patch more closely.

"""Error reading attempting to read avro data --
|encountered an unknown fingerprint: $fingerprint, not sure what schema to use.
|This could happen if you registered additional schemas after starting your
|spark context.""".stripMargin)
Contributor:

I don't think we want to put line breaks in here (which is what you get with the stripMargin), and you need to use s"..." to get $fingerprint interpolated correctly.
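
A sketch of what the suggested fix could look like (the fingerprint value below is just a stand-in):

  // Single interpolated string: $fingerprint is actually substituted and no
  // literal newlines from stripMargin end up in the message.
  val fingerprint: Long = 0L  // stand-in for the unknown fingerprint at hand
  val msg = s"Error reading avro data -- encountered an unknown fingerprint: $fingerprint, " +
    "not sure what schema to use. This could happen if you registered additional " +
    "schemas after starting your spark context."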

@squito (Contributor) commented Jul 29, 2015

Sorry, I am trying to understand this serialization thing a bit better ... something doesn't make sense to me, but it's mostly outside of these changes. Once I get a handle on that, I will merge this; I just want to make sure I'm not missing something ...

@asfgit closed this in 069a4c4 on Jul 29, 2015
@squito (Contributor) commented Jul 29, 2015

merged to master, thanks @JDrit

@pwendell (Contributor) commented:

Did @srowen look at the build change? Sean or I should be signing off on any dependency changes in the build.

@JoshRosen (Contributor) commented:

There was a lot of up-thread discussion regarding the dependency change; my first comment on this PR was asking whether this introduced a new dependency, etc.

One concern which might have been overlooked is whether the required dependencies will be packaged if Spark is built with the YARN profile disabled. It looks like the rationale upthread is that this dependency is already transitively included by our other dependencies, but I'm not sure that will always be the case if we're relying on the YARN profile to include it.

@squito (Contributor) commented Jul 29, 2015

@pwendell @JoshRosen ah, sorry -- I thought @vanzin & @zsxwing gave it the thumbs up earlier; just following the comments, I forgot to get the approval of Sean or you as well. Let me know if you think this needs an immediate revert. I will also check with Sean.

@srowen (Member) commented Jul 29, 2015

Re: the Avro dependency, this is a net new dependency for core. Previously this came in via the hive module (the metastore dependency, to be specific). So I suppose it relies on the Hive profile, not the YARN profile.

In any event the right thing to do is include the dependency if it's being used, of course. I suppose this is evidence that the Spark assembly -- the Hive flavors -- have had this dep and have been fine.

Avro doesn't bring anything in that we didn't already have, except Avro:

[INFO] +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:compile
[INFO] |  +- org.apache.avro:avro-ipc:jar:1.7.7:compile
[INFO] |  |  +- (org.apache.avro:avro:jar:1.7.7:compile - version managed from 1.7.5; omitted for duplicate)
[INFO] |  |  +- (org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile - version managed from 1.9.2; omitted for duplicate)
[INFO] |  |  +- (org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile - version managed from 1.9.2; omitted for duplicate)
[INFO] |  |  \- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.4; scope managed from runtime; omitted for duplicate)
[INFO] |  +- org.apache.avro:avro-ipc:jar:tests:1.7.7:compile
[INFO] |  |  +- (org.apache.avro:avro:jar:1.7.7:compile - version managed from 1.7.5; omitted for duplicate)
[INFO] |  |  +- (org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile - version managed from 1.9.2; omitted for duplicate)
[INFO] |  |  +- (org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile - version managed from 1.9.2; omitted for duplicate)
[INFO] |  |  \- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.4; scope managed from runtime; omitted for duplicate)
[INFO] |  +- (org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile - version managed from 1.9.2; omitted for duplicate)
[INFO] |  +- (org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile - version managed from 1.9.2; omitted for duplicate)
[INFO] |  \- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.4; scope managed from runtime; omitted for duplicate)

So, I think the net change here is only that Avro has been added to core. Unless there's an objection to adding Avro at all, I think this is OK from a build standpoint.

@pwendell (Contributor) commented:

Okay sounds good. Thanks for looking at it Sean.

- Patrick

