
[SPARK-24992][Core] spark should randomize yarn local dir selection #21953

Closed
wants to merge 3 commits

Conversation

hthuynh2

@hthuynh2 hthuynh2 commented Aug 2, 2018

Description: SPARK-24992
Utils.getLocalDir is used to get the path of a temporary directory. However, it always returns the same directory: the first element of the array localRootDirs. When running on YARN, this means we may always write to one disk, keeping it busy while the other disks sit idle. We should randomize the selection to spread out the load.

What changes were proposed in this pull request?
This PR randomizes the selection of the local directory inside Utils.getLocalDir. The change affects Utils.fetchFile, since that method relies on Utils.getLocalDir always returning the same directory when caching files. Therefore, a new variable, cachedLocalDir, caches the first local directory obtained from Utils.getLocalDir. Also, when getting the configured local directories (inside Utils.getConfiguredLocalDirs) in YARN mode, the array of directories is shuffled before being returned.
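A minimal sketch of the proposed behavior (not the actual Spark code; `RandomDirSketch` and its parameter are illustrative names, and in Spark the directory list would come from Utils.getConfiguredLocalDirs):

```scala
import scala.util.Random

object RandomDirSketch {
  // localRootDirs stands in for the configured YARN local dirs.
  def getLocalDir(localRootDirs: Array[String]): String = {
    require(localRootDirs.nonEmpty, "no local dirs configured")
    // Pick a random root dir instead of always taking localRootDirs(0),
    // spreading temp-file writes across the available disks.
    localRootDirs(Random.nextInt(localRootDirs.length))
  }
}
```

Callers that need a stable location (such as the file cache in Utils.fetchFile) cannot rely on repeated calls returning the same value any more, which is why the PR introduces cachedLocalDir.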

@holdensmagicalunicorn

@hthuynh2, thanks! I am a bot who has found some folks who might be able to help with the review: @li-zhihui, @mateiz and @pwendell

@hthuynh2
Author

hthuynh2 commented Aug 2, 2018

@tgravescs Can you test this please? Thank you.

@felixcheung
Member

Jenkins, test this please

@SparkQA

SparkQA commented Aug 2, 2018

Test build #93964 has finished for PR 21953 at commit 3986e75.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

What kind of behavior did you see? This local dir is only used to store some temporary files, which is not IO intensive, so I don't think the problem here is severe.

@tgravescs
Contributor

We have seen jobs overload the first disk returned by YARN. Unfortunately, the details of those jobs have long expired. It's in general good practice to distribute the load anyway.

I remember one of the jobs was Python. You can see the problem if you look in, for example, EvalPythonExec.scala:

  // The queue used to buffer input rows so we can drain it to
  // combine input with output from Python.
  val queue = HybridRowQueue(context.taskMemoryManager(),
    new File(Utils.getLocalDir(SparkEnv.get.conf)), child.output.length)

That is always going to hit the disk yarn returns first for every container on that node.

@tgravescs
Contributor

Jenkins, test this please

  - val localDir = new File(getLocalDir(conf))
  + var localDir: File = null
  + // Set the cachedLocalDir for the first time and re-use it later
  + this.synchronized {
Contributor

if we want to be more efficient and not hit the synchronized block each time, we could do one extra check before it: check cachedLocalDir.isEmpty, and only if it's empty enter synchronized and then re-check whether it is still empty.

this would be very similar to getOrCreateLocalRootDirs
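The check-lock-recheck pattern being suggested is classic double-checked locking. A minimal sketch, assuming a hypothetical `pickDir` function standing in for the randomized directory choice (names are illustrative, not the actual Spark code):

```scala
object CachedDirSketch {
  // Empty string marks "not yet initialized", mirroring the PR's cachedLocalDir.
  // @volatile is needed so the unsynchronized read sees a fully written value.
  @volatile private var cachedLocalDir: String = ""

  def getCachedLocalDir(pickDir: () => String): String = {
    // Cheap unsynchronized check first; most callers never touch the lock.
    if (cachedLocalDir.isEmpty) {
      this.synchronized {
        // Re-check inside the lock: another thread may have set it meanwhile.
        if (cachedLocalDir.isEmpty) {
          cachedLocalDir = pickDir()
        }
      }
    }
    cachedLocalDir
  }
}
```

After the first call, every subsequent call returns the cached directory without entering the synchronized block.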

@hthuynh2
Author

hthuynh2 commented Aug 2, 2018

@tgravescs I updated it. Thanks.

@tgravescs
Contributor

+1 pending jenkins.

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94016 has finished for PR 21953 at commit 3986e75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

looks like a random test timeout error.

@tgravescs
Contributor

Jenkins, test this please

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94050 has finished for PR 21953 at commit a8c1654.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

I see, thanks for explaining.

@jerryshao
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94085 has finished for PR 21953 at commit a8c1654.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

wow each of these test failures is different. trying again

@tgravescs
Contributor

test this please

@SparkQA

SparkQA commented Aug 3, 2018

Test build #94136 has finished for PR 21953 at commit a8c1654.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

test this please

@tgravescs
Contributor

Jenkins, test this please

@SparkQA

SparkQA commented Aug 6, 2018

Test build #94287 has finished for PR 21953 at commit a8c1654.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

merged to master, thanks @hthuynh2

@asfgit asfgit closed this in 51e2b38 Aug 6, 2018