[SPARK-19796][CORE] Fix serialization of long property values in TaskDescription #17140
Conversation
Test build #73777 has finished for PR 17140 at commit
This seems like a reasonable approach to me, so long as writing the length of the string out doesn't break anything else (it doesn't appear to).
// SPARK-19796 -- writeUTF doesn't work for long strings, which can happen for property values
val bytes = value.getBytes("utf-8")
dataOut.writeInt(bytes.length)
dataOut.write(bytes)
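The fix above can be sketched as a standalone round-trip in plain Java (the discussion is about `java.io` APIs; the class and helper names here are illustrative, not from the PR):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LongStringSerde {
    // Write a 4-byte length prefix followed by raw UTF-8 bytes.
    // Unlike DataOutputStream.writeUTF, this has no 64KB size limit.
    static void writeLongString(DataOutputStream out, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read the length prefix, then exactly that many bytes back.
    static String readLongString(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A value longer than writeUTF's 65535-byte encoded limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70000; i++) sb.append('x');
        String value = sb.toString();

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLongString(new DataOutputStream(buf), value);
        String roundTripped = readLongString(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(roundTripped.equals(value)); // prints "true"
    }
}
```

The deserialization side must mirror this exactly: `readInt` for the length, then `readFully` for the body, which is why the PR also touches the decode path in `TaskDescription`.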
Can you just use .writeObject? I just read the code for this and it looks like that calls writeUTF() under the hood when the string is short enough, and otherwise calls writeUTF a few times. It seems better to use that built-in than to try to do it ourselves.
(here's the code: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/ObjectOutputStream.java#1160. I guess one downside of that is that it's more time-consuming because it does some extra steps, but I'm not sure that is a huge issue?)
Also if you do think it makes sense to use writeObject, would you mind quickly checking the size of the serialized task in this test with your original version and with writeObject just to make sure the size is comparable?
Oops, I realized I was looking at ObjectOutputStream while we use DataOutputStream here, so my proposal is no good. This is fine. I also checked how this affects the serialized size: it only increases it by 6 bytes relative to writeUTF.
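The size difference the reviewer measured is consistent with the framing overhead: `writeUTF` prefixes the value with a 2-byte length, while the new scheme uses a 4-byte `int`, so each ASCII property value grows by 2 bytes. A minimal comparison (standalone sketch, not code from the PR):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    public static void main(String[] args) throws IOException {
        String value = "spark.job.description";

        // writeUTF: 2-byte length prefix + modified-UTF-8 body.
        ByteArrayOutputStream utf = new ByteArrayOutputStream();
        new DataOutputStream(utf).writeUTF(value);

        // The PR's scheme: 4-byte length prefix + raw UTF-8 body.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(raw);
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);

        // For ASCII values the bodies are identical; only the prefix widens.
        System.out.println(raw.size() - utf.size()); // prints "2"
    }
}
```

A 6-byte total increase would correspond to three property values in the test, though the comment does not spell that out.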
@@ -86,7 +86,10 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      // SPARK-19796 -- writeUTF doesn't work for long strings, which can happen for property values
+      val bytes = value.getBytes("utf-8")
I believe in newer JDKs using the UTF_8 constant from StandardCharsets is actually faster.
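The difference between the two spellings is small but real: `getBytes("utf-8")` resolves the charset by name on each call and declares a checked `UnsupportedEncodingException`, while `getBytes(StandardCharsets.UTF_8)` skips the lookup and throws nothing checked. A quick sketch of the two (illustrative, not PR code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetLookup {
    public static void main(String[] args) throws Exception {
        String value = "a long property value";

        // Name-based: charset resolved per call; forces a checked
        // UnsupportedEncodingException even though "utf-8" always exists.
        byte[] byName = value.getBytes("utf-8");

        // Constant-based: no name lookup, no checked exception; the
        // reviewer's suggestion, and commonly cited as faster on newer JDKs.
        byte[] byConstant = value.getBytes(StandardCharsets.UTF_8);

        // Both produce identical bytes.
        System.out.println(Arrays.equals(byName, byConstant)); // prints "true"
    }
}
```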
LGTM, sorry for the numerous comments. Do you think we should do the same thing for property keys? I think probably not?
LGTM
Test build #73802 has finished for PR 17140 at commit
LGTM
merged to master
yeah I went back and forth on it, thinking maybe I should do the keys as well for consistency, but I think I'd rather we just have a hard failure for super long keys. I guess that could technically break something, but seems like a misuse ...
@squito sounds good. Thanks for fixing this!
What changes were proposed in this pull request?
The properties that are serialized with a TaskDescription can have very long values (e.g. "spark.job.description", which is set to the full SQL statement with the thrift-server). DataOutputStream.writeUTF() cannot handle strings whose encoded form exceeds 65535 bytes, so this changes the way those values are serialized to handle longer strings.
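The underlying limitation is easy to reproduce: `writeUTF` stores the encoded length in an unsigned 2-byte field, so any string whose modified-UTF-8 form exceeds 65535 bytes makes it throw. A minimal demonstration (standalone sketch, not code from the PR):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class WriteUtfLimit {
    public static void main(String[] args) throws IOException {
        // writeUTF encodes the length in 2 bytes, capping the encoded
        // body at 65535 bytes; one byte over the limit is enough to fail.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 65536; i++) sb.append('x');

        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(sb.toString());
            System.out.println("no exception");
        } catch (UTFDataFormatException e) {
            System.out.println("UTFDataFormatException"); // this branch runs
        }
    }
}
```

This is exactly the failure mode a long "spark.job.description" value would hit, which is why the PR replaces `writeUTF` with an explicit `writeInt` length prefix for property values.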
How was this patch tested?
Updated an existing unit test to reproduce the issue. Ran all unit tests via Jenkins.