
[SPARK-19796][CORE] Fix serialization of long property values in TaskDescription #17140

Closed
wants to merge 2 commits

Conversation

squito
Contributor

@squito squito commented Mar 2, 2017

What changes were proposed in this pull request?

The properties serialized with a TaskDescription can have very long values (e.g. "spark.job.description", which the thrift-server sets to the full SQL statement). DataOutputStream.writeUTF() fails for strings whose UTF-8 encoding exceeds 65535 bytes, so this changes the way those values are serialized to handle longer strings.
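For readers unfamiliar with the limit: writeUTF stores the encoded length in an unsigned 16-bit field, so any string whose UTF-8 encoding exceeds 65535 bytes throws at write time. A minimal standalone sketch (not from the patch) that reproduces the failure:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.UTFDataFormatException;

public class WriteUtfLimit {
    public static void main(String[] args) throws Exception {
        // Build a string whose UTF-8 encoding exceeds writeUTF's 65535-byte limit,
        // like a long SQL statement stored in spark.job.description.
        String longValue = "x".repeat(70000);
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(longValue);
            System.out.println("writeUTF succeeded");
        } catch (UTFDataFormatException e) {
            // writeUTF's 2-byte length prefix cannot represent 70000 bytes.
            System.out.println("writeUTF failed: encoded string too long");
        }
    }
}
```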

How was this patch tested?

Updated existing unit test to reproduce the issue. All unit tests via jenkins.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73777 has finished for PR 17140 at commit 60220a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jsoltren

jsoltren commented Mar 2, 2017

This seems like a reasonable approach to me, so long as writing the length of the string out doesn't break anything else (it doesn't appear to).

// SPARK-19796 -- writeUTF doesn't work for long strings, which can happen for property values
val bytes = value.getBytes("utf-8")
dataOut.writeInt(bytes.length)
dataOut.write(bytes)
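The snippet above only shows the write side; the deserialization side of the patch is not quoted in this thread, but a length-prefixed scheme implies reading the 4-byte length and then exactly that many bytes back. A minimal Java sketch of the full round trip, assuming UTF-8 throughout (illustrative, not the actual TaskDescription code):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedString {
    // Write a string as a 4-byte length followed by its raw UTF-8 bytes,
    // sidestepping DataOutputStream.writeUTF's 65535-byte limit.
    static void writeLongString(DataOutputStream out, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read the length prefix, then exactly that many bytes.
    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String original = "x".repeat(70000);  // well past writeUTF's limit
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeLongString(new DataOutputStream(buffer), original);
        String roundTripped = readLongString(
            new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(roundTripped.equals(original));
    }
}
```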
Contributor

Can you just use .writeObject? I just read the code for this and it looks like that calls writeUTF() under the hood when the string is short enough, and otherwise calls writeUTF a few times. It seems better to use that built-in than to try to do it ourselves.

Contributor

@kayousterhout kayousterhout Mar 2, 2017


(here's the code: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/ObjectOutputStream.java#1160. I guess one downside of that is that it's more time-consuming because it does some extra steps, but I'm not sure that is a huge issue?)

Contributor

Also if you do think it makes sense to use writeObject, would you mind quickly checking the size of the serialized task in this test with your original version and with writeObject just to make sure the size is comparable?

Contributor

Oops realized I was looking at ObjectOutputStream and we use DataOutputStream here, so my proposal is no good. This is fine. Also I checked how this affects the serialized size and this only increases it by 6 bytes relative to writeUTF.
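The small size difference makes sense: writeUTF already prefixes each string with a 2-byte length, while the new scheme uses a 4-byte int, so ASCII-only values grow by exactly 2 bytes apiece. A quick standalone check of that overhead (hypothetical, not part of the patch or its test):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    public static void main(String[] args) throws IOException {
        String value = "spark.job.description value";

        // writeUTF: 2-byte length prefix + modified-UTF-8 bytes
        // (identical to UTF-8 for ASCII input).
        ByteArrayOutputStream utfBuf = new ByteArrayOutputStream();
        new DataOutputStream(utfBuf).writeUTF(value);

        // Length-prefixed scheme from this PR: 4-byte int + raw UTF-8 bytes.
        ByteArrayOutputStream intBuf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(intBuf);
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);

        // For ASCII values the difference is exactly 2 bytes per string.
        System.out.println(intBuf.size() - utfBuf.size());  // 2
    }
}
```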

@@ -86,7 +86,10 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      // SPARK-19796 -- writeUTF doesn't work for long strings, which can happen for property values
+      val bytes = value.getBytes("utf-8")
Contributor

I believe in newer JDKs using the UTF_8 constant from StandardCharsets is actually faster.
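The suggestion refers to java.nio.charset.StandardCharsets: passing the Charset constant avoids the per-call charset-name lookup that the String-based getBytes overload performs, and also avoids its checked UnsupportedEncodingException. A minimal sketch of the two overloads (illustrative only; both produce identical bytes):

```java
import java.nio.charset.StandardCharsets;

public class CharsetLookup {
    public static void main(String[] args) throws Exception {
        String value = "some property value";

        // String-name overload: resolves "utf-8" via a charset lookup on each
        // call and declares the checked UnsupportedEncodingException.
        byte[] byName = value.getBytes("utf-8");

        // Charset-constant overload: no name lookup, no checked exception.
        byte[] byConstant = value.getBytes(StandardCharsets.UTF_8);

        System.out.println(java.util.Arrays.equals(byName, byConstant));
    }
}
```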

Contributor

@kayousterhout kayousterhout left a comment

LGTM sorry for the numerous comments. Do you think we should do the same thing for properties keys? I think probably not?

@mridulm
Contributor

mridulm commented Mar 3, 2017

LGTM

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73802 has finished for PR 17140 at commit 99692bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Contributor

LGTM

@squito
Contributor Author

squito commented Mar 6, 2017

merged to master

@squito
Contributor Author

squito commented Mar 6, 2017

@kayousterhout

Do you think we should do the same thing for properties keys? I think probably not?

yeah I went back and forth on it, thinking maybe I should do the keys as well for consistency, but I think I'd rather we just have a hard failure for super long keys. I guess that could technically break something, but seems like a misuse ...

@asfgit asfgit closed this in 12bf832 Mar 6, 2017
@kayousterhout
Contributor

@squito sounds good. Thanks for fixing this!

6 participants