[SPARK-21384] [YARN] Spark + YARN fails with LocalFileSystem as default FS #19141
Conversation
spark.yarn.archive fails
Can you please describe your usage scenario and the steps to reproduce your issue? From my understanding, did you configure your default FS to a local FS?
Also, this looks like it is not a Spark 2.2 issue; would you please fix the PR title so it describes the problem more accurately?
Thanks @jerryshao for looking into this PR.
Yes, this can be reproduced with a one-node YARN cluster and LocalFileSystem as the default FS. All Spark applications fail on YARN with LocalFileSystem. The issue can be avoided by setting either of the spark.yarn.jars / spark.yarn.archive configurations.
I have updated the PR.
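For reference, a setup along these lines reproduces the failure, and either workaround below avoids it (the property names are the standard Hadoop/Spark ones; the paths are only illustrative):

```
# core-site.xml on the one-node cluster: LocalFileSystem as the default FS
#   fs.defaultFS = file:///

# spark-defaults.conf workarounds (either one avoids the failure):
spark.yarn.archive   file:///opt/spark/spark-libs.zip
# or
spark.yarn.jars      file:///opt/spark/jars/*.jar
```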
I see, thanks for the explanation.
OK to test. (I may not have permission to trigger the Jenkins test 😞, let me try it)
@@ -565,7 +565,6 @@ private[spark] class Client(
        distribute(jarsArchive.toURI.getPath,
          resType = LocalResourceType.ARCHIVE,
          destName = Some(LOCALIZED_LIB_DIR))
-       jarsArchive.delete()
You're undoing the fix for SPARK-20741. If this is causing a problem and you want to fix it, you need to make it so that the delete is skipped only in the specific scenario that's causing the problem.
Agree with Marcelo, this is a valid concern; we should not introduce such a regression here.
Thanks @vanzin for the pointer. It was my mistake, I missed the reason for that change while looking through the file's history.
I still see that SPARK-20741 fixed the issue only partially; it leaves the __spark_conf__*.zip file to be deleted as part of the shutdown hook.
I see these approaches to fix it further:
- Delete the __spark_conf__*.zip and __spark_libs__*.zip files after the application completes, similar to cleanupStagingDir.
- (Or) Add a configuration for whether to delete the __spark_conf__*.zip and __spark_libs__*.zip files right after copying them to the destination dir, so that users can decide whether to delete them immediately or as part of process exit. In the SPARK-20741 case, this new configuration could be enabled to delete the files immediately.
@vanzin & @jerryshao Please let me know your thoughts on this or if you have any other way to do this. Thanks
What if your scenario and SPARK-20741's scenario are both encountered? It looks like your approach above would not work then.
I'm wondering if we could copy or move this __spark_libs__.zip temp file to a non-temp file and add that file to the dist cache. That non-temp file would not be deleted and could be overwritten by another launch, so we would always have only one copy.
Besides, I think we have several workarounds to handle this issue, like spark.yarn.jars or spark.yarn.archive, so it looks like this corner case is not really necessary to fix (just my thinking; normally people will not use a local FS in a real cluster).
Thanks @jerryshao for the comment.
> What if your scenario and SPARK-20741's scenario are both encountered? It looks like your approach above would not work then.
Can you provide some more detail on why you think it wouldn't work? If we delete the __spark_libs__.zip after the application completes (similar to the staging dir deletion), the archives would not stack up until process exit, which solves SPARK-20741, and the archive would still be available during execution, which addresses the current issue.
> I'm wondering if we could copy or move this __spark_libs__.zip temp file to a non-temp file and add that file to the dist cache. That non-temp file would not be deleted and could be overwritten by another launch, so we would always have only one copy.
If multiple jobs are submitted/running concurrently, we would overwrite the existing file with the latest __spark_libs__.zip, which may cause application failures while the copy is in progress, and it would also be ambiguous which application should delete the file.
> Besides, I think we have several workarounds to handle this issue, like spark.yarn.jars or spark.yarn.archive, so it looks like this corner case is not really necessary to fix (just my thinking; normally people will not use a local FS in a real cluster).
I agree, this is a corner case and can be handled with a workaround.
Thinking about this again, I think you're right. But I'm not sure whether the program would crash if we delete the dependencies at runtime.
Adding a configuration is rarely the right fix for anything. You shouldn't have to choose whether or not to get the correct behavior.
You can probably fix this by always uploading resources that have a "file:" scheme to the distributed cache. This will penalize those running with a local default FS, but it will work. And the 99.9% of the world that does not do that will not be affected.
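A minimal, self-contained sketch of that idea (the object and method names here are made up; the real Client.scala logic around copyFileToRemote/compareFs is more involved):

```scala
import java.net.URI

object ForceUploadSketch {

  // Crude stand-in for Client.compareFs: same scheme and authority.
  private def sameFileSystem(a: URI, b: URI): Boolean =
    Option(a.getScheme) == Option(b.getScheme) &&
      Option(a.getAuthority) == Option(b.getAuthority)

  // Returns true when the source should be copied to the staging directory.
  def shouldCopyToRemote(source: URI, stagingDir: URI): Boolean = {
    val srcScheme = Option(source.getScheme).getOrElse("file")
    // Local ("file:") sources are always uploaded; everything else is copied
    // only when the source and destination file systems actually differ.
    srcScheme == "file" || !sameFileSystem(source, stagingDir)
  }

  def main(args: Array[String]): Unit = {
    val localStaging = new URI("file:///tmp/staging")
    // true: the local temp archive is uploaded even though both URIs are on the local FS
    println(shouldCopyToRemote(new URI("file:///tmp/__spark_libs__123.zip"), localStaging))
    // false: already on the same HDFS, no copy needed
    println(shouldCopyToRemote(new URI("hdfs://nn:8020/apps/app.jar"), new URI("hdfs://nn:8020/user/stage")))
  }
}
```

The point is just the first condition: a "file:" source is always uploaded, so deleting the local temp archive right afterwards stays safe.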
Thanks @vanzin for the comment, I will update the PR as per your suggestion.
ok to test
Test build #81953 has finished for PR 19141 at commit
LGTM, merging to master.
(Also merging to 2.2.)
[SPARK-21384] [YARN] Spark + YARN fails with LocalFileSystem as default FS

## What changes were proposed in this pull request?

When the libraries temp directory (i.e. the __spark_libs__*.zip dir) file system and the staging dir (destination) file system are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization. With this change, the client always copies the files to the remote FS when the source scheme is "file".

## How was this patch tested?

Verified manually in yarn/cluster and yarn/client modes with HDFS and local file systems.

Author: Devaraj K <devaraj@apache.org>

Closes #19141 from devaraj-kavali/SPARK-21384.

(cherry picked from commit 55d5fa7)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
What changes were proposed in this pull request?
When the libraries temp directory (i.e. the __spark_libs__*.zip dir) file system and the staging dir (destination) file system are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization.
With this change, the client always copies the files to the remote FS when the source scheme is "file".
How was this patch tested?
I have verified it manually in yarn/cluster and yarn/client modes with HDFS and local file systems.
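For reference, the manual check amounts to runs along these lines in both deploy modes (the example class and jar path are placeholders):

```
# yarn/client mode
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples.jar 100

# yarn/cluster mode
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples.jar 100
```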