[SPARK-14423][YARN] Avoid same name files added to distributed cache again #12203
Conversation
Test build #55105 has finished for PR 12203 at commit
```diff
@@ -519,8 +528,7 @@ private[spark] class Client(
     ).foreach { case (flist, resType, addToClasspath) =>
       flist.foreach { file =>
         val (_, localizedPath) = distribute(file, resType = resType)
-        require(localizedPath != null)
-        if (addToClasspath) {
+        if (addToClasspath && localizedPath != null) {
```
what is this change about?
@andrewor14, in the previous code we assumed all files would be uploaded to the distributed cache, so `localizedPath` could never be null. With my change, duplicated files are skipped and `distribute` returns `localizedPath` as null instead, so I changed the check accordingly.
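The behavior described above can be modeled in isolation. Below is a minimal, hypothetical sketch (not Spark's actual `Client` code; the object and path format are made up): a set tracks names already placed in the distributed cache, a repeated name yields null, and callers must null-check before touching the classpath.

```scala
import scala.collection.mutable

// Hypothetical sketch of the PR's dedup behavior, not Spark's actual Client:
// track the names already placed in the distributed cache; a repeated name
// yields null, so callers must null-check before adding to the classpath.
object DistributeSketch {
  private val distributedNames = mutable.HashSet.empty[String]

  // Returns a (made-up) localized path, or null when a file with the same
  // name was already distributed.
  def distribute(fileName: String): String =
    if (distributedNames.add(fileName)) s"__spark__/$fileName" else null
}
```

This mirrors why the `require(localizedPath != null)` assertion had to go: null is now a legitimate return value meaning "already distributed, skipped."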
Looks good.
Change looks ok (should mostly affect local builds, where example jars have duplicates), but could you add a small unit test in ClientSuite?
Sure, will do.
Test build #55302 has finished for PR 12203 at commit
@vanzin, please help review again, thanks a lot.
```scala
val jar1 = TestUtils.createJarWithFiles(Map(), jarsDir)
val jar2 = TestUtils.createJarWithFiles(Map(), userLibs)
// Copy jar2 to jar3 with same name
val jar3 = {
```
You could have used `java.nio.file.Files.copy`, but no need to change that now.
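As a sketch of the reviewer's suggestion, `java.nio.file.Files.copy` can produce the same-name copy directly. The helper name is made up, and the source file and target directory are assumed to come from the test's setup:

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Sketch of the suggested approach: copy an existing jar into another
// directory under the same file name using java.nio.file.Files.copy.
// The source file and target directory are assumed to already exist.
def copyWithSameName(src: File, targetDir: File): File = {
  val dst = new File(targetDir, src.getName)
  Files.copy(src.toPath, dst.toPath, StandardCopyOption.REPLACE_EXISTING)
  dst
}
```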
LGTM, merging to master.
…ache again apache#12203 Without modifying ClientSuite.
…again

## What changes were proposed in this pull request?

In the current implementation of assembly-free Spark deployment, jars under `assembly/target/scala-xxx/jars` will be uploaded to the distributed cache by default. There is a chance these jars' names will conflict with the names of jars specified in `--jars`, which introduces an exception when starting the application:

```
client token: N/A
diagnostics: Application application_1459907402325_0004 failed 2 times due to AM Container for appattempt_1459907402325_0004_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click on links to logs of each attempt.
Diagnostics: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
java.io.IOException: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

So this change checks file names to avoid uploading same-name files again.

## How was this patch tested?

Unit test and manual integration test were done locally.

Author: jerryshao <sshao@hortonworks.com>

Closes apache#12203 from jerryshao/SPARK-14423.
Hi guys, is it possible to apply this patch to version 1.6? What can I do for this?
@RicoGit This is a behavior change for how jars are uploaded to the distributed cache, so I'm not sure it is suitable to back-port to branch 1.6. Also, this problem is not so severe in 1.6 since we build an assembly for packaging.
Thanks for the reply. I had a problem running a Spark job with Oozie, and this patch solves it. I applied the patch to Spark 1.6, built it (spark-yarn_2.10-1.6.0-cdh5.7.0.jar), and put it into Oozie's shared libs.
Can you make sure the problem you met is exactly the same as the one this PR solved? The exception stack you pasted on StackOverflow is different from what I pasted here before. From your exception stack, my guess is that the same jar (same path with the same file name) was added twice, which is a little different from the problem this PR addresses.
Thanks, I understand these are different problems. What would you advise me? I think that this is not a good solution:
Maybe, as you mentioned, skipping the add to the distributed cache and logging a warning is enough; throwing an exception will fail the application, and this is actually not a fatal problem. I'm OK with changing the current behavior for this, what do you think @vanzin?
I think there was a version of Oozie that triggered that assert, so maybe upgrading Oozie fixes the problem. It's also probably fine to remove that assert since we haven't seen many people hit it, meaning this situation should be rare. And, by the way, please avoid long discussions on closed PRs; that's why we have mailing lists and JIRA.
… --files and --archives

## What changes were proposed in this pull request?

During spark-submit, if the YARN distributed cache is instructed to add the same file under both --files and --archives, this code change ensures the existing Spark YARN distributed cache behavior is retained, i.e. it warns and fails if the same file is mentioned in both --files and --archives.

## How was this patch tested?

Manually tested:
1. If the same jar is mentioned in --jars and --files, it will continue to submit the job (the [SPARK-14423] #12203 functionality is unchanged).
2. If the same file is mentioned in --files and --archives, it will fail to submit the job.

… under archives and files

Author: Kishor Patil <kpatil@yahoo-inc.com>

Closes #15627 from kishorvpatil/spark18099.

(cherry picked from commit 098e4ca)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
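The rule described in that commit can be illustrated with a small standalone check. This is a hedged sketch, not the actual SparkSubmit/Client code; the function name is made up:

```scala
// Hypothetical sketch of the SPARK-18099 rule: the same entry may appear in
// both --jars and --files, but listing a path in both --files and --archives
// should fail fast at submit time.
def checkFilesAndArchives(files: Seq[String], archives: Seq[String]): Unit = {
  val dup = files.toSet.intersect(archives.toSet)
  require(dup.isEmpty,
    s"Same path(s) found in both --files and --archives: ${dup.mkString(", ")}")
}
```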
@vanzin, @jerryshao
What changes were proposed in this pull request?

In the current implementation of assembly-free Spark deployment, jars under `assembly/target/scala-xxx/jars` will be uploaded to the distributed cache by default. There is a chance these jars' names will conflict with the names of jars specified in `--jars`, which introduces an exception when starting the application. So this change checks file names to avoid uploading same-name files again.

How was this patch tested?

Unit test and manual integration test were done locally.
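The name-based check described above can be sketched as follows. This is a simplified illustration, not Spark's actual Client logic, and the helper name is made up; it shows why two different URIs that share a file name collide in the YARN staging directory:

```scala
import java.net.URI

// Illustrative name-based duplicate detection: files are localized into the
// staging directory under their bare file name, so an upload whose name was
// already seen should be skipped. Not Spark's actual implementation.
def isNameConflict(seenNames: Set[String], uri: URI): Boolean = {
  val name = new java.io.File(uri.getPath).getName
  seenNames.contains(name)
}
```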