
[SPARK-5494][SQL] SparkSqlSerializer Ignores KryoRegistrators #4693

Closed
wants to merge 1 commit

Conversation

hkothari

SparkSqlSerializer completely ignores custom KryoRegistrators (unlike the regular Spark serializer). This change calls super.newKryo before registering the SQL-specific classes, and removes the now-duplicate registrations and settings.
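For context, a minimal sketch of the approach described above (not the exact patch, which is in commit 8e756cc; the registered class is illustrative):

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Sketch: build the SQL serializer on top of the base KryoSerializer,
// whose newKryo() already applies user-supplied KryoRegistrators
// (spark.kryo.registrator), instead of constructing a fresh Kryo and
// silently dropping those registrations.
private[sql] class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    // super.newKryo() runs the user's custom registrators first...
    val kryo = super.newKryo()
    // ...then SQL-specific classes are registered on top (illustrative example).
    kryo.register(classOf[org.apache.spark.sql.Row])
    kryo
  }
}
```

Because super.newKryo runs first, SQL registrations land after the user's, so the registration order stays consistent with the base KryoSerializer.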

@ash211
Contributor

ash211 commented Feb 19, 2015

Jenkins this is ok to test

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27723 has finished for PR 4693 at commit 8e756cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

There is no reason to allow custom Kryo registration. The SQL serializer is only ever used to serialize SQL types. Have you seen a bug where we miss a type?

@pwoody

pwoody commented Mar 2, 2015

I believe the bug is due to inconsistency between KryoSerializer and SqlSerializer in the order of registration. I don't know if there is a way to manage multiple Serializers, but mixed applications (sql/raw spark) can run into issues.

@marmbrus
Contributor

marmbrus commented Mar 2, 2015

@pwoody I don't think that is possible. As long as the two sides are using copies of Kryo with the same order of registration, things should work. As far as I know we never mix them up. Please let me know if you have a test case that shows otherwise.

If there are no examples of failures, then I suggest we close this issue.

@mccheah
Contributor

mccheah commented Mar 2, 2015

Talked with @pwoody offline and he's working on a test case, but I'll summarize what I think will break. If you use an SQLContext that wraps a SparkContext where spark.serializer is set, the operations that compute the SQL query will be fine, but when you collect the RDD, the results are serialized with the SQL serializer while the driver kryo-deserializes them using spark.serializer.

This was the issue that originally prompted this change.
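The scenario described above can be sketched as driver code (hypothetical app and registrator names; the exact failure depends on the registered classes):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical repro sketch for the described mismatch: the user configures
// Kryo with a custom registrator, runs a SQL query, and collects. If the SQL
// serializer ignores the registrator, executor-side serialization and
// driver-side deserialization can disagree on Kryo registration IDs.
object KryoMismatchRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-mismatch-repro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator") // hypothetical
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // ... register a table "t", then collect results on the driver:
    // sqlContext.sql("SELECT * FROM t").collect() // may fail to deserialize
    sc.stop()
  }
}
```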

@marmbrus
Contributor

marmbrus commented Mar 2, 2015

@mccheah we set the serializer in SQL on a per-shuffle basis, so that would surprise me. However, if you can show it happening we should certainly fix it.

@yhuai
Contributor

yhuai commented Mar 29, 2015

@pwoody @mccheah any update on it? If you saw an error, what was the exception and error message?

@asfgit asfgit closed this in 0cc8fcb Apr 12, 2015