[SPARK-41685][UI] Support Protobuf serializer for the KVStore in History server #39202
Conversation
LGTM!
Couple of queries, looks good
@@ -111,7 +122,7 @@ private[spark] object KVUtils extends Logging {
       // The default serializer is slow since it is using JSON+GZip encoding.
       Some(new KVStoreProtobufSerializer())
     } else {
-      None
+      Some(serializerForHistoryServer(conf))
     }
QQ: if it is always non-empty, should we change it from Option?
Thanks, I just did some refactoring. Please take another look.
docs/monitoring.md
<td>
  Serializer for writing/reading in-memory UI objects to/from disk-based KV Store; JSON or PROTOBUF.
  JSON serializer is the only choice before Spark 3.4.0, thus it is the default value.
  PROTOBUF serializer is fast and compact, and it is the default serializer for disk-based KV store of live UI.
Users cannot leverage/access the live db, right? (Unlike the history db, which is used across restarts.)
Do we want to expose the implementation details? If we change it tomorrow to something else, it should be completely transparent to users.
ok, updated
}

val LOCAL_STORE_SERIALIZER = ConfigBuilder("spark.history.store.serializer")
  .doc("Serializer for writing/reading in-memory UI objects to/from disk-based KV Store; " +
For "disk-based KV Store;" here, should the separator be ",", ";", or ":"?
It follows the existing config:

val HYBRID_STORE_DISK_BACKEND = ConfigBuilder("spark.history.store.hybridStore.diskBackend")
  .doc("Specifies a disk-based store used in hybrid store; LEVELDB or ROCKSDB.")
Will this be supported in the future, or is it supported now?
I think SHS can read the written RocksDB after this PR, if the file path/DB backend/serializer configurations are set properly.
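For context, a sketch of the configuration combination being discussed (property names come from this PR and the existing HYBRID_STORE_DISK_BACKEND config shown above; the path value is a placeholder): for SHS to read back a previously written RocksDB store, the backend and serializer settings would have to match the ones used when the store was written, e.g. in spark-defaults.conf:

```
# Assumed spark-defaults.conf fragment; values must match those used when the store was written.
# The path value below is a placeholder.
spark.history.store.path                      /var/lib/spark/history-store
spark.history.store.hybridStore.diskBackend   ROCKSDB
spark.history.store.serializer                PROTOBUF
```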
.doc("Serializer for writing/reading in-memory UI objects to/from disk-based KV Store; " +
  "JSON or PROTOBUF. JSON serializer is the only choice before Spark 3.4.0, thus it is the " +
  "default value. PROTOBUF serializer is fast and compact, and it is the default " +
  "serializer for disk-based KV store of live UI.")
I'm a bit confused about this config. SHS only reads the RocksDB files, but does not write them, so it should support whatever format the RocksDB values are in. Is there a way to encode the format in RocksDB as well?
SHS writes RocksDB files while replaying event logs. The default serializer is JSON+GZIP, which is slower than the Protobuf serializer. Imagine that there are 300GB of event logs to replay...
oh I see
+CC @thejdeep, @shardulm94
@techaddict @mridulm @LuciferYang @cloud-fan thanks for the review.
What changes were proposed in this pull request?
Support a configurable serializer (spark.history.store.serializer) for the KVStore in the History Server. The value can be JSON or PROTOBUF.

Why are the changes needed?
Support the fast and compact Protobuf serializer in the Spark History Server (SHS), so that replaying event logs into the disk-based KV store is faster than with the default JSON+GZip serializer.

Does this PR introduce any user-facing change?
Yes, it introduces a new configuration, spark.history.store.serializer, for setting the serializer for the KVStore in the History Server; the value can be JSON or PROTOBUF.

How was this patch tested?
UT
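As a usage sketch for the new configuration (the property name comes from this PR; the path value is a placeholder, and JSON remains the default per the docs), a user opting into the Protobuf serializer on the History Server side might add to spark-defaults.conf:

```
# spark-defaults.conf (History Server side); path value is a placeholder
spark.history.store.path         /var/lib/spark/history-store
spark.history.store.serializer   PROTOBUF
```

Per the discussion above, switching the serializer on an existing store would require the on-disk data to have been written with the same serializer.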