Skip to content

[BEAM-6021] Registered more internal classes for kryo serialization#6998

Closed
mareksimunek wants to merge 2 commits intoapache:masterfrom
seznam:simunek/internalKryoSerializer
Closed

[BEAM-6021] Registered more internal classes for kryo serialization#6998
mareksimunek wants to merge 2 commits intoapache:masterfrom
seznam:simunek/internalKryoSerializer

Conversation

@mareksimunek
Copy link
Contributor

@mareksimunek mareksimunek commented Nov 9, 2018

Registred more internal classes for kryo serialization

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status Build Status Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
Build Status --- --- ---

@mareksimunek
Copy link
Contributor Author

Run Spark ValidatesRunner

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a potentially great performance improvement. 👍 Did you notice any measurable performance improvement after switching from java serialization?

@lukecwik
Copy link
Member

lukecwik commented Nov 9, 2018

There have been issues in the past about using Kryo as the serializer, this a discussion about it just over a year ago: https://lists.apache.org/thread.html/a36d68b568377f8064463b7fc374e8304d59a26412360050333bb2aa@%3Cdev.beam.apache.org%3E

It may not be an issue anymore though but wanted to provide context as to why this may have not worked in the past.

@mareksimunek mareksimunek force-pushed the simunek/internalKryoSerializer branch 3 times, most recently from 5c6222f to 59e4a18 Compare November 13, 2018 08:50
@mareksimunek
Copy link
Contributor Author

As I investigate problems which Lukasz mentioned still last so I only added more classes to register so if you set KryoSerializer you have registered some of internals used objects by default.

@mareksimunek
Copy link
Contributor Author

@iemejia pls code review

@mareksimunek mareksimunek changed the title [BEAM-6021] Use KryoSerializer instead of JavaSerializer [BEAM-6021] Registred more internal classes for kryo serialization Nov 16, 2018
@aviemzur aviemzur self-requested a review November 18, 2018 08:37
Copy link
Member

@aviemzur aviemzur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

M2C: We made sure that everywhere that we need to serialize data we didn’t let Spark do it with its serializer, which defaults to Java unless configured otherwise by Spark's spark.serializer configuration. Spark users often configure this to be Kryo instead of Java, so we made sure to encode the data using Beam’s encoders, then passed Byte[] so when Spark serializes data before it is transmitted to other machines, it uses its serializer (Kryo/Java) to serialize a Byte[] which is always serializable by the serializer (Whether Java or Kryo).

If other classes, which are not the user's data are serialized by Kryo, and we're sure that they are always serializable by Kryo, that's fine.

I'm not sure this change actually makes the classes registered to be serialized by Kryo. Do we have tests to show this? IIRC this is controlled by the spark.serializer configuration.

@mareksimunek mareksimunek force-pushed the simunek/internalKryoSerializer branch from 59e4a18 to 875d716 Compare November 21, 2018 16:25
@mareksimunek
Copy link
Contributor Author

@aviemzur Hi, you are influenced by this change only if you use conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
Added more internal classes which are used and test to show that is used KryoSerializer.

@mareksimunek
Copy link
Contributor Author

Run Java PreCommit

@mareksimunek mareksimunek force-pushed the simunek/internalKryoSerializer branch 2 times, most recently from 25d0332 to 51fec74 Compare November 26, 2018 10:55
@mareksimunek
Copy link
Contributor Author

Run Java PreCommit

@iemejia iemejia changed the title [BEAM-6021] Registred more internal classes for kryo serialization [BEAM-6021] Registered more internal classes for kryo serialization Dec 10, 2018
Copy link
Member

@iemejia iemejia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let some comments. Please ping me to re review and sorry for the delay.

conf.setAppName("test");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// register immutable collections serializers because the SDK uses them.
conf.set("spark.kryo.registrator", BeamSparkRunnerRegistrator.class.getName());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be some initialization error because the test seems not to be testing the BeamSparkRunnerRegistrator. Probably a good idea to assert that if calling MicrobatchSource it really uses StatelessJavaSerializer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no easy way to hook if register of BeamSparkRunnerRegistrator is called so I created wrapper around regitrator, simple pipeline and created desired test if MicrobatchSource is registred

import scala.reflect.ClassTag$;

/** Testing of beam registrar. */
public class BeamSparkRunnerRegistratorTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth to add an extra test that asserts that the Registrator is not called by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@mareksimunek mareksimunek force-pushed the simunek/internalKryoSerializer branch from 51fec74 to ff1c700 Compare December 18, 2018 14:02
@mareksimunek
Copy link
Contributor Author

Rebased and fixed test and added new one testing default behavior.
@iemejia pls review :)

iemejia added a commit that referenced this pull request Jan 8, 2019
@iemejia
Copy link
Member

iemejia commented Jan 8, 2019

Thanks a lot @mareksimunek and sorry for the extra long time to review. Already merged manually because I did some final touches, so closing it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants