[SPARK-33753][CORE] Reduce the memory footprint and gc of the cache (hadoopJobMetadata) #30725
Conversation
Can one of the admins verify this patch?
protected val jobConfCacheKey: String = "rdd_%d_job_conf".format(id)
protected val inputFormatCacheKey: String = "rdd_%d_input_format".format(id)
SPARK-9585 removed the InputFormat cache.
Do we really need a separate copy of the JobConf cached for each partition ID? Is there any opportunity for us to reduce the number of JobConfs to begin with? It seems like all of the partitions should be able to safely share the same conf object...? Regardless, weak references seem more appropriate here than soft.
Yes, it is needed. HadoopRDD#getJobConf has a comment explaining this: spark/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala, lines 144 to 153 in f8277d3.
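For context, here is a hedged sketch of the lookup pattern that comment describes; the helper name getOrCreateJobConf and its parameters are illustrative, not the actual HadoopRDD source, though the key format matches jobConfCacheKey above.

import java.util.concurrent.ConcurrentMap
import org.apache.hadoop.mapred.JobConf

// Illustrative sketch only: each RDD caches its JobConf in the shared
// metadata map under a per-RDD key, so later tasks in the same JVM can
// reuse it instead of rebuilding it.
def getOrCreateJobConf(rddId: Int,
                       newJobConf: () => JobConf,
                       cache: ConcurrentMap[String, AnyRef]): JobConf = {
  val key = "rdd_%d_job_conf".format(rddId)
  cache.get(key) match {
    case conf: JobConf => conf          // cache hit: reuse the existing JobConf
    case _ =>
      val conf = newJobConf()           // cache miss: build a fresh JobConf
      cache.putIfAbsent(key, conf)
      conf
  }
}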
It seems that the GitHub tests pass, and there is a performance improvement in the production environment test.
I see, thanks for the reference. So IIUC this patch is primarily targeting the …
No. If your Hadoop client version is above 2.7, or you apply the patch from HADOOP-11209, you can enable …
Thanks for the further explanation, that is very helpful. Seems like potentially the comment in … There is still one point I don't understand. It seems that the key for the … So I would expect there to be one cached entry in … Thanks for bearing with me as I try to understand this issue!
To clarify, the partition here refers to a partition of the Hive table, not an RDD partition.
// (e.g., HadoopRDD uses this to cache JobConfs).
private[spark] val hadoopJobMetadata =
-   CacheBuilder.newBuilder().softValues().build[String, AnyRef]().asMap()
+   CacheBuilder.newBuilder().weakValues().build[String, AnyRef]().asMap()
Is it better to put a size limit on this cache? Then soft references should also be fine.
With a limited size, a soft-reference cache can cause fewer young GCs than a weak-reference one.
But what size should it be limited to?
In fact, the driver rarely has the opportunity to reuse the cached JobConfs; sharing the JobConf really only makes sense in the executor.
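For reference, a size-bounded soft-value cache along the lines suggested here might look like the sketch below; the maximumSize of 1000 is an arbitrary placeholder, not a recommended value.

import com.google.common.cache.CacheBuilder

// Illustrative only: same builder as hadoopJobMetadata, but with a size
// back-stop so softly referenced entries cannot accumulate without bound.
val boundedCache =
  CacheBuilder.newBuilder()
    .maximumSize(1000)          // placeholder limit, not a tuned value
    .softValues()               // entries reclaimable under memory pressure
    .build[String, AnyRef]()
    .asMap()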
Whether or not a limit is in place (which could also be a good back-stop to prevent a huge cache), this could be fine - I think the only risk is that weak references are quite readily reclaimed, so this risks losing most of the caching.
Now it all makes sense. Thanks for the clarification. Seems I needed to read your original message more carefully.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Modify the cache (hadoopJobMetadata) from softValues to weakValues.
Why are the changes needed?
Reduce driver memory pressure, GC time and frequency, and job execution time.
HadoopRDD uses a soft-reference map to cache JobConfs (rdd_id -> jobconf). When the number of Hive partitions read by the driver is large, HadoopRDD.getPartitions will create many JobConfs and add them to the cache.
The executor will also create a JobConf, add it to the cache, and share it among tasks within the executor.
The number of JobConfs in the driver cache increases memory pressure. When the driver memory configuration is not high, full GC becomes very frequent, and these JobConfs are hardly ever reused.
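To illustrate the reference semantics behind the change, here is a minimal, hypothetical JVM sketch (timings depend on the garbage collector): a weakly referenced value may be reclaimed by any GC once no strong reference remains, while a softly referenced value is usually kept until the heap is under memory pressure.

import java.lang.ref.{SoftReference, WeakReference}

// Minimal illustration only; actual reclamation timing is up to the JVM.
var payload: Array[Byte] = new Array[Byte](4 * 1024 * 1024)
val soft = new SoftReference(payload)
val weak = new WeakReference(payload)
payload = null                   // drop the last strong reference
System.gc()                      // a weak value may already be collected here
println(s"weak cleared: ${weak.get() == null}")   // often true after the GC
println(s"soft cleared: ${soft.get() == null}")   // usually false until memory is low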
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UTs
Manual test