
[SPARK-14423][YARN] Avoid same name files added to distributed cache again #12203

Closed
wants to merge 2 commits into from

Conversation

jerryshao
Contributor

What changes were proposed in this pull request?

In the current assembly-free Spark deployment, the jars under assembly/target/scala-xxx/jars are uploaded to the distributed cache by default. The names of these jars can conflict with the names of jars specified via --jars, which causes an exception when the application starts:

client token: N/A
     diagnostics: Application application_1459907402325_0004 failed 2 times due to AM Container for appattempt_1459907402325_0004_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click on links to logs of each attempt.
Diagnostics: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
java.io.IOException: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This patch checks file names so that files with the same name are not uploaded again.
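The dedup idea can be sketched roughly as follows (a hypothetical illustration with made-up names, not the actual Client.scala code):

```scala
import scala.collection.mutable

// Hypothetical sketch of name-based dedup for the staging directory: a file
// is only staged if no previously staged file carries the same base name.
object DistributeSketch {
  private val stagedNames = mutable.HashSet[String]()

  /** Returns the staged name, or None when a file with the same base name
    * was already added to the distributed cache. */
  def distribute(uri: String): Option[String] = {
    val name = uri.substring(uri.lastIndexOf('/') + 1)
    if (stagedNames.add(name)) Some(name) else None
  }
}
```

With a check like this, a --jars entry whose base name matches a jar already contributed by the assembly directory is skipped instead of overwriting the staged copy and tripping YARN's "changed on src filesystem" timestamp check.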

How was this patch tested?

Unit tests and a manual integration test were run locally.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55105 has finished for PR 12203 at commit 7ff58be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -519,8 +528,7 @@ private[spark] class Client(
).foreach { case (flist, resType, addToClasspath) =>
flist.foreach { file =>
val (_, localizedPath) = distribute(file, resType = resType)
require(localizedPath != null)
if (addToClasspath) {
if (addToClasspath && localizedPath != null) {
Contributor

what is this change about?

Contributor Author

@andrewor14, the previous code assumed all files would be uploaded to the distributed cache, so localizedPath could never be null. With my change, duplicated files are skipped and localizedPath is returned as null instead, so I changed the check accordingly.

@andrewor14
Contributor

Looks good.

@vanzin
Contributor

vanzin commented Apr 7, 2016

Change looks ok (should mostly affect local builds where example jars have duplicates), but could you add a small unit test in ClientSuite.scala?

@jerryshao
Contributor Author

Sure, will do.

@SparkQA

SparkQA commented Apr 8, 2016

Test build #55302 has finished for PR 12203 at commit e1b09c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

@vanzin , please help to review again, thanks a lot.

val jar1 = TestUtils.createJarWithFiles(Map(), jarsDir)
val jar2 = TestUtils.createJarWithFiles(Map(), userLibs)
// Copy jar2 to jar3 with same name
val jar3 = {
Contributor

You could have used java.nio.file.Files.copy, but no need to change that now.
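The java.nio.file.Files.copy variant mentioned here might look like this (a sketch with placeholder names, not the actual ClientSuite.scala code):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object CopyJarSketch {
  /** Copies `srcJar` into `destDir`, keeping the same file name, so the
    * copy ends up with an identical base name in a different directory. */
  def copySameName(srcJar: Path, destDir: Path): Path = {
    val dest = destDir.resolve(srcJar.getFileName)
    Files.copy(srcJar, dest, StandardCopyOption.REPLACE_EXISTING)
  }
}
```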

@vanzin
Contributor

vanzin commented Apr 18, 2016

LGTM, merging to master.

@asfgit asfgit closed this in d6fb485 Apr 18, 2016
zzcclp added a commit to zzcclp/spark that referenced this pull request Apr 19, 2016
lw-lin pushed a commit to lw-lin/spark that referenced this pull request Apr 20, 2016
@RicoGit

RicoGit commented Jul 3, 2016

Hi guys, is it possible to apply this patch to version 1.6? What can I do for this?

@jerryshao
Contributor Author

@RicoGit This is a behavior change for uploading jars to the distributed cache, so I'm not sure it is suitable to back-port to branch 1.6. Also, this problem is less severe in 1.6, since we build an assembly for packaging.

@RicoGit

RicoGit commented Jul 4, 2016

Thanks for the reply. I had a problem running a Spark job with Oozie, and this patch solves it. I applied the patch to Spark 1.6, built spark-yarn_2.10-1.6.0-cdh5.7.0.jar, and put it into Oozie's sharedLibs.

@jerryshao
Contributor Author

Can you make sure the problem you hit is exactly the one this PR solved? The exception stack you pasted on StackOverflow is different from what I pasted here. From your stack trace, my guess is that the same jar (same path and same file name) was added twice, which is slightly different from the problem this PR addresses.

@RicoGit

RicoGit commented Jul 4, 2016

Thanks, I understand these are different problems. What would you advise? I don't think require(localizedPath != null) is a good solution: it just fails with the message "requirement failed". It would be better to skip adding the file to the distributed cache and log a warning. Do you think it is worth opening an issue?

@jerryshao
Contributor Author

Maybe, as you mentioned, skipping the add to the distributed cache and logging a warning is enough; throwing an exception fails the application, and this is not actually a fatal problem. I'm OK with changing the current behavior for this. What do you think, @vanzin?

@vanzin
Contributor

vanzin commented Jul 5, 2016

I think there was a version of Oozie that triggered that assert, so maybe upgrading Oozie fixes the problem. It's probably also fine to remove the assert, since we haven't seen many people hit it, which suggests this situation is rare.

And, btw, please avoid long discussions on closed PRs. That's why we have mailing lists and JIRA.

asfgit pushed a commit that referenced this pull request Nov 3, 2016
… --files and --archives

## What changes were proposed in this pull request?

During spark-submit, if the YARN distributed cache is instructed to add the same file under both --files and --archives, this code change retains the existing Spark YARN distributed cache behaviour: warn and fail when the same file is mentioned in both --files and --archives.
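The described check can be sketched as follows (a hypothetical helper, not the actual Client.scala code):

```scala
object FilesVsArchivesSketch {
  /** Fails submission when the same path appears under both --files and
    * --archives; disjoint lists pass through unchanged. */
  def checkNoOverlap(files: Seq[String], archives: Seq[String]): Unit = {
    val dup = files.toSet.intersect(archives.toSet)
    require(dup.isEmpty,
      s"Same path(s) found in both --files and --archives: ${dup.mkString(", ")}")
  }
}
```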
## How was this patch tested?

Manually tested:
1. If the same jar is mentioned in --jars and --files, submission continues (the [SPARK-14423] #12203 functionality is unchanged).
2. If the same file is mentioned in --files and --archives, submission fails.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

… under archives and files

Author: Kishor Patil <kpatil@yahoo-inc.com>

Closes #15627 from kishorvpatil/spark18099.

(cherry picked from commit 098e4ca)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
asfgit pushed a commit that referenced this pull request Nov 3, 2016
@kishorvpatil
Contributor

@vanzin, @jerryshao
Sorry for breaking this functionality. I have the patch available with more unit tests added to ensure positive test case ensuring submission continues if unique files/archives are mentioned.

#15810

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017