[SPARK-24992][Core] spark should randomize yarn local dir selection #21953
@hthuynh2, thanks! I am a bot who has found some folks who might be able to help with the review: @li-zhihui, @mateiz and @pwendell
@tgravescs Can you test this please? Thank you.
Jenkins, test this please
Test build #93964 has finished for PR 21953 at commit
|
What kind of behavior did you see? This local dir is only used to store some temporary files, which is not IO intensive, so I don't think the problem here is severe.
We have seen jobs overloading the first disk returned by YARN. Unfortunately, the details of those jobs have long since expired. It's generally good practice to distribute the load anyway. I remember one of the jobs was a Python job. I think this is the case if you look at, for example, EvalPythonExec.scala:
That is always going to hit the disk YARN returns first for every container on that node.
Jenkins, test this please
val localDir = new File(getLocalDir(conf))
var localDir: File = null
// Set the cachedLocalDir for the first time and re-use it later
this.synchronized {
If we want to be more efficient and avoid hitting the synchronized block on every call, we could add one extra check before it that tests cachedLocalDir.isEmpty. Only if it's empty do we enter the synchronized block, and then we recheck that it is still empty.
This would be very similar to getOrCreateLocalRootDirs.
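The check-then-lock-then-recheck pattern suggested above can be sketched roughly as follows. This is a standalone illustration, not the code from the PR: `LocalDirCache`, `resolveLocalDir`, the lookup counter, and the path `/tmp/spark-local-0` are hypothetical stand-ins; only the name `cachedLocalDir` and the `isEmpty` check come from the discussion.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LocalDirCache {
  // Marked @volatile so the unsynchronized first check observes writes
  // made by other threads. An empty string means "not cached yet".
  @volatile private var cachedLocalDir: String = ""

  // Counts how often the (pretend) expensive lookup runs; illustration only.
  private val lookups = new AtomicInteger(0)

  // Hypothetical stand-in for a call such as Utils.getLocalDir(conf).
  private def resolveLocalDir(): String = {
    lookups.incrementAndGet()
    "/tmp/spark-local-0"
  }

  def getCachedLocalDir(): String = {
    // First check without the lock: the common case once the cache is set.
    if (cachedLocalDir.isEmpty) {
      this.synchronized {
        // Recheck under the lock: another thread may have set it meanwhile.
        if (cachedLocalDir.isEmpty) {
          cachedLocalDir = resolveLocalDir()
        }
      }
    }
    cachedLocalDir
  }

  def lookupCount: Int = lookups.get()
}
```

After the first successful call, later calls return the cached value without taking the lock, which is the efficiency gain the review comment is asking for.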
@tgravescs I updated it. Thanks.
+1 pending Jenkins.
Test build #94016 has finished for PR 21953 at commit
|
Looks like a random test timeout error.
Jenkins, test this please
Test build #94050 has finished for PR 21953 at commit
|
I see, thanks for explaining. |
Jenkins, retest this please. |
Test build #94085 has finished for PR 21953 at commit
|
Wow, each of these test failures is different. Trying again.
test this please
Test build #94136 has finished for PR 21953 at commit
|
test this please |
Jenkins, test this please |
Test build #94287 has finished for PR 21953 at commit
|
merged to master, thanks @hthuynh2 |
Description: SPARK-24992
Utils.getLocalDir is used to get the path of a temporary directory. However, it always returns the same directory, which is the first element of the array localRootDirs. When running on YARN, this can cause every write to go to one disk, keeping it busy while the other disks sit idle. We should randomize the selection to spread out the load.
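As a rough illustration of the idea, picking a random element instead of always taking the first one might look like the sketch below. The object `RandomLocalDir`, the method `pickLocalDir`, and its `rng` parameter are hypothetical names used for this example; only `localRootDirs` is taken from the description.

```scala
import scala.util.Random

object RandomLocalDir {
  // Hypothetical helper: choose one of the configured root directories
  // uniformly at random rather than always returning localRootDirs(0).
  def pickLocalDir(localRootDirs: Array[String], rng: Random = new Random()): String = {
    require(localRootDirs.nonEmpty, "no local directories configured")
    localRootDirs(rng.nextInt(localRootDirs.length))
  }
}
```

Over many containers on the same node, each disk is then expected to receive a similar share of temporary files.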
What changes were proposed in this pull request?
This PR randomizes the selection of the local directory inside the method Utils.getLocalDir. This change affects the Utils.fetchFile method, since it relies on Utils.getLocalDir always returning the same directory to cache files. Therefore, a new variable cachedLocalDir is used to cache the first local directory that it gets from Utils.getLocalDir. Also, when getting the configured local directories (inside Utils.getConfiguredLocalDirs), if we are in YARN mode, the array of directories is also randomized before it is returned.
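The YARN-mode shuffling of the configured directories could be sketched along these lines. This is a simplified assumption-laden example, not the PR's implementation: `ConfiguredDirs`, `configuredLocalDirs`, the comma-separated `raw` input, and the `isYarn` flag are all hypothetical.

```scala
import scala.util.Random

object ConfiguredDirs {
  // Hypothetical helper: parse a comma-separated directory list and,
  // when running in YARN mode, return it in random order.
  def configuredLocalDirs(raw: String, isYarn: Boolean, rng: Random = new Random()): Array[String] = {
    val dirs = raw.split(",").map(_.trim).filter(_.nonEmpty)
    // Shuffling only changes the order; the set of directories is unchanged.
    if (isYarn) rng.shuffle(dirs.toList).toArray else dirs
  }
}
```

Because the order differs between calls, a caller that needs a stable directory (as described for Utils.fetchFile) has to remember its first choice rather than asking again.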