[SPARK-19146][Core]Drop more elements when stageData.taskData.size > retainedTasks #16527
Conversation
Test build #71118 has finished for PR 16527 at commit
@@ -409,7 +409,8 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging {

      // If Tasks is too large, remove and garbage collect old tasks
      if (stageData.taskData.size > retainedTasks) {
-       stageData.taskData = stageData.taskData.drop(stageData.taskData.size - retainedTasks)
+       stageData.taskData = stageData.taskData.drop(
Is it maybe a little clearer to write something like
val targetSize = 0.9 * retainedTasks
stageData.taskData = stageData.taskData.drop(stageData.taskData.size - targetSize)
(And yeah does it make sense to drop more than just 1% extra, to avoid the overhead you're concerned about?)
@srowen 1% was only a conservative value; 1/10 is better.
Test build #71127 has finished for PR 16527 at commit
@wangyum one more thought. I note that there are actually three instances in this file where a collection is trimmed to match a max retained number of things. See this quite related change: https://github.com/apache/spark/pull/10846/files

Originally, these trims always reduced the collection by 10% of the max retained size (which is quite consistent with your change). This could be too few in some cases, and that issue fixed it. Your change goes farther, to always remove at least this amount. I was going to suggest that all three of these should be consistent. But perhaps the instance you are changing is the only one where repeatedly removing just 1 element is expensive?

CC @rajeshbalamohan and @ajbozarth: is it fairly safe to delete a bit more than is necessary here? It seems like it's something that is there to support the UI for informational purposes, and it hasn't mattered so much exactly how much is retained.
My only concern would be users misunderstanding how
I used the following code to log how long trimming stages/jobs takes (a rough sketch of this kind of measurement is shown after this comment):
and the result is:
It may be fine to just change
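Neither the timing snippet nor its output is preserved in this excerpt. As a rough illustration of the kind of measurement being described (a sketch, not the author's code; the sizes and names are made up, and the LinkedHashMap mirrors the type the listener uses for taskData):

```scala
import scala.collection.mutable.LinkedHashMap

object TrimTimingSketch {
  private val retained = 10000

  // Build a map a bit larger than the retention limit; LinkedHashMap mirrors
  // what JobProgressListener uses for stageData.taskData.
  private def freshData(extra: Int): LinkedHashMap[Long, String] = {
    val m = LinkedHashMap.empty[Long, String]
    (0L until (retained + extra).toLong).foreach(i => m(i) = s"task-$i")
    m
  }

  def main(args: Array[String]): Unit = {
    // Old behavior in spirit: the excess is usually a single element, so the
    // linear-time drop runs over and over again.
    var data = freshData(extra = 1000)
    val t0 = System.nanoTime()
    while (data.size > retained) {
      data = data.drop(1)
    }
    println(s"1000 single-element drops: ${(System.nanoTime() - t0) / 1e6} ms")

    // Proposed behavior in spirit: remove the whole excess in one call.
    data = freshData(extra = 1000)
    val t1 = System.nanoTime()
    data = data.drop(data.size - retained)
    println(s"one bulk drop: ${(System.nanoTime() - t1) / 1e6} ms")
  }
}
```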
How about this: change all three instances to trim the retained data to the configured maximum, but remove at least (0.1 * maxRetained) items. If max is 100 and there are 150 items, it would trim to leave 100 rather than going all the way down to 90. That addresses the concern, maintains current behavior in more cases, and actually restores some behavior from before the previous change. Add to that a small note in the docs that this is a target maximum and that fewer elements may be retained in some circumstances.
I don't quite understand what @srowen mentioned, so I simply changed it to drop. If max is 100 and there are 150 items, it would drop
Test build #71373 has finished for PR 16527 at commit
Yes, I'm suggesting the behavior you describe isn't desirable. Right now, if there are 150 items and max is 100, it will delete 50. That's good. There's no good reason to delete 10 more than needed for performance. The problem case is where there is just 101 and the max is 100. Then it makes sense to enforce a minimum number removed. I'm suggesting dropping
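To put concrete numbers on that (the exact expression is cut off above; from the rest of the thread it presumably amounts to dropping max(retainedSize / 10, size - retainedSize) elements): with a max of 100 and 150 items, the excess of 50 exceeds the 10-element floor, so 50 are dropped and exactly 100 remain; with a max of 100 and 101 items, the excess is only 1, so the floor applies, 10 are dropped, and 91 remain rather than running a one-element drop on nearly every event.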
Test build #71395 has finished for PR 16527 at commit
The latest code looks good to me, but you still need to update the documentation to match the new behavior.
Ping @wangyum to update docs.
Test build #71753 has finished for PR 16527 at commit
Except for some minor comments, this looks reasonable.
-    How many jobs the Spark UI and status APIs remember before garbage
-    collecting.
+    How many jobs the Spark UI and status APIs remember before garbage collecting.
+    This is a target maximum and that fewer elements may be retained in some circumstances.
maximum and that fewer ... -> maximum, and fewer ...
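For context, the properties whose descriptions are being edited here are ordinary Spark conf settings; a minimal example of setting them (the values shown are just the documented defaults, not recommendations):

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune them to trade UI history for driver memory.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "1000")     // jobs kept by the UI / status APIs
  .set("spark.ui.retainedStages", "1000")   // stages kept
  .set("spark.ui.retainedTasks", "100000")  // tasks kept per stage
```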
@@ -431,6 +432,13 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging {
   }

+  /**
+   * Remove at least (maxRetained / 10) items to reduce friction.
+   */
+  def removedCount(dataSize: Int, retainedSize: Int): Int = {
private? And since it isn't reporting a number that was removed, but a number to remove, maybe calculateNumberToRemove? Or something?
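For reference, a sketch of what the helper might look like with both suggestions applied (made private and renamed). The body shown here is inferred from the discussion above, since the actual implementation is cut off in this excerpt:

```scala
/**
 * Number of elements to drop so the collection shrinks back to the retained
 * maximum, but never fewer than 10% of the maximum, so the linear-time drop
 * does not run on nearly every event.
 */
private def calculateNumberToRemove(dataSize: Int, retainedSize: Int): Int = {
  math.max(retainedSize / 10, dataSize - retainedSize)
}
```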
Test build #71769 has finished for PR 16527 at commit
Merged to master.
What changes were proposed in this pull request?
Drop more elements when stageData.taskData.size > retainedTasks to reduce the number of times the drop function is called.

How was this patch tested?
Jenkins