
[SPARK-24992][Core] spark should randomize yarn local dir selection #21953

Closed
wants to merge 3 commits

Conversation

hthuynh2

@hthuynh2 hthuynh2 commented Aug 2, 2018

Description: SPARK-24992
Utils.getLocalDir is used to get the path of a temporary directory. However, it always returns the same directory: the first element of the array localRootDirs. When running on YARN, this means we may always write to one disk, keeping it busy while the other disks sit idle. We should randomize the selection to spread out the load.

What changes were proposed in this pull request?
This PR randomizes the selection of the local directory inside Utils.getLocalDir. The change affects Utils.fetchFile, since that method relies on Utils.getLocalDir always returning the same directory when caching files. Therefore, a new variable, cachedLocalDir, caches the first local directory obtained from Utils.getLocalDir. Also, when getting the configured local directories (inside Utils.getConfiguredLocalDirs) in YARN mode, the array of directories is shuffled before being returned.
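A minimal sketch of the proposed behavior (not the actual Spark code; `RandomDirSketch` and its parameter are illustrative names, and in Spark the directory list would come from Utils.getConfiguredLocalDirs):

```scala
import scala.util.Random

object RandomDirSketch {
  // localRootDirs stands in for the configured YARN local dirs.
  def getLocalDir(localRootDirs: Array[String]): String = {
    require(localRootDirs.nonEmpty, "no local dirs configured")
    // Pick a random root dir instead of always taking localRootDirs(0),
    // spreading temp-file writes across the available disks.
    localRootDirs(Random.nextInt(localRootDirs.length))
  }
}
```

Callers that need a stable location (such as the file cache in Utils.fetchFile) cannot rely on repeated calls returning the same value any more, which is why the PR introduces cachedLocalDir.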

@holdensmagicalunicorn

@hthuynh2, thanks! I am a bot who has found some folks who might be able to help with the review: @li-zhihui, @mateiz and @pwendell

@hthuynh2
Author

hthuynh2 commented Aug 2, 2018

@tgravescs Can you test this please? Thank you.

@felixcheung
Member

Jenkins, test this please

@SparkQA

SparkQA commented Aug 2, 2018

Test build #93964 has finished for PR 21953 at commit 3986e75.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

What kind of behavior did you see? This local dir is only used to store some temporary files, which is not IO intensive, so I don't think the problem here is severe.

@tgravescs
Contributor

We have seen jobs overload the first disk returned by YARN. Unfortunately, the details of those jobs have long expired. It's in general good practice to distribute the load anyway.

I remember one of the jobs was Python. You can see the problem if you look in, for example, EvalPythonExec.scala:

  // The queue used to buffer input rows so we can drain it to
  // combine input with output from Python.
  val queue = HybridRowQueue(context.taskMemoryManager(),
    new File(Utils.getLocalDir(SparkEnv.get.conf)), child.output.length)

That is always going to hit the disk yarn returns first for every container on that node.

@tgravescs
Contributor

Jenkins, test this please

  - val localDir = new File(getLocalDir(conf))
  + var localDir: File = null
  + // Set the cachedLocalDir for the first time and re-use it later
  + this.synchronized {
Contributor

if we want to be more efficient and not hit the synchronized block each time, we could do one extra check before it: check cachedLocalDir.isEmpty, and only if it's empty enter synchronized and then re-check whether it is still empty.

this would be very similar to getOrCreateLocalRootDirs
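The check-lock-recheck pattern being suggested is classic double-checked locking. A minimal sketch, assuming a hypothetical `pickDir` function standing in for the randomized directory choice (names are illustrative, not the actual Spark code):

```scala
object CachedDirSketch {
  // Empty string marks "not yet initialized", mirroring the PR's cachedLocalDir.
  // @volatile is needed so the unsynchronized read sees a fully written value.
  @volatile private var cachedLocalDir: String = ""

  def getCachedLocalDir(pickDir: () => String): String = {
    // Cheap unsynchronized check first; most callers never touch the lock.
    if (cachedLocalDir.isEmpty) {
      this.synchronized {
        // Re-check inside the lock: another thread may have set it meanwhile.
        if (cachedLocalDir.isEmpty) {
          cachedLocalDir = pickDir()
        }
      }
    }
    cachedLocalDir
  }
}
```

After the first call, every subsequent call returns the cached directory without entering the synchronized block.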

@hthuynh2
Author

hthuynh2 commented Aug 2, 2018

@tgravescs I updated it. Thanks.

@tgravescs
Contributor

+1 pending jenkins.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94016 has finished for PR 21953 at commit 3986e75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

looks like a random test timeout error.

@tgravescs
Contributor

Jenkins, test this please

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94050 has finished for PR 21953 at commit a8c1654.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

I see, thanks for explaining.

@jerryshao
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94085 has finished for PR 21953 at commit a8c1654.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

wow each of these test failures is different. trying again

@tgravescs
Contributor

test this please

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94136 has finished for PR 21953 at commit a8c1654.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

test this please

@tgravescs
Contributor

Jenkins, test this please

@SparkQA

SparkQA commented Aug 6, 2018

Test build #94287 has finished for PR 21953 at commit a8c1654.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

merged to master, thanks @hthuynh2

@asfgit asfgit closed this in 51e2b38 Aug 6, 2018