[SPARK-21384] [YARN] Spark + YARN fails with LocalFileSystem as default FS #19141
Conversation
spark.yarn.archive fails
Can you please describe your usage scenario and the steps to reproduce your issue? From my understanding, did you configure your default FS to a local FS?
Also, this looks like it is not a Spark 2.2 issue; would you please fix the PR title so it describes the problem more accurately?
Thanks @jerryshao for looking into this PR.
Yes, this can be reproduced with a one-node YARN cluster and LocalFileSystem as the default FS. All Spark applications fail on YARN with LocalFileSystem. The issue can be avoided by setting either of the spark.yarn.jars / spark.yarn.archive configurations.
I have updated the PR.
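For reference, a setup along these lines reproduces the failure, and either workaround below avoids it (the property names are the standard Hadoop/Spark ones; the paths are only illustrative):

```
# core-site.xml on the one-node cluster: LocalFileSystem as the default FS
#   fs.defaultFS = file:///

# spark-defaults.conf workarounds (either one avoids the failure):
spark.yarn.archive   file:///opt/spark/spark-libs.zip
# or
spark.yarn.jars      file:///opt/spark/jars/*.jar
```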
I see, thanks for the explanation.
OK to test. (I may not have permission to trigger the Jenkins test 😞, let me try it)
@@ -565,7 +565,6 @@ private[spark] class Client(
        distribute(jarsArchive.toURI.getPath,
          resType = LocalResourceType.ARCHIVE,
          destName = Some(LOCALIZED_LIB_DIR))
-       jarsArchive.delete()
You're undoing the fix for SPARK-20741. If this is causing a problem and you want to fix it, you need to make it so that the delete is skipped only in the specific scenario that's causing the problem.
Agree with Marcelo, this is a valid concern; we should not introduce such a regression here.
Thanks @vanzin for the pointer. It was my mistake, I missed the reason for that change while looking through the file's history.
I still see that SPARK-20741 fixed the issue only partially; it leaves the __spark_conf__*.zip file to be deleted as part of the shutdown hook.
I see these approaches to fix it further:
- Delete the __spark_conf__*.zip and __spark_libs__*.zip files after the application completes, similar to cleanupStagingDir.
- (Or) Add a configuration for whether to delete the __spark_conf__*.zip and __spark_libs__*.zip files right after copying them to the destination dir, so that users can decide whether to delete them immediately or as part of process exit. In the SPARK-20741 case, this new configuration could be enabled to delete the files immediately.
@vanzin & @jerryshao Please let me know your thoughts on this or if you have any other way to do this. Thanks
What if your scenario and SPARK-20741's scenario are both encountered? It looks like your approach above would not work then.
I'm wondering if we could copy or move this __spark_libs__.zip temp file to a non-temp file and add that file to the dist cache. That non-temp file would not be deleted and could be overwritten by another launch, so we would always have only one copy.
Besides, I think we have several workarounds to handle this issue, like spark.yarn.jars or spark.yarn.archive, so it looks like this corner case is not really necessary to fix (just my thinking; normally people will not use a local FS in a real cluster).
Thanks @jerryshao for the comment.
> What if your scenario and SPARK-20741's scenario are both encountered? It looks like your approach above would not work then.
Can you provide some more detail on why you think it wouldn't work? If we delete the __spark_libs__.zip after the application completes (similar to the staging dir deletion), the archives would not stack up until process exit, which solves SPARK-20741, and the archive would still be available during execution, which addresses the current issue.
> I'm wondering if we could copy or move this __spark_libs__.zip temp file to a non-temp file and add that file to the dist cache. That non-temp file would not be deleted and could be overwritten by another launch, so we would always have only one copy.
If multiple jobs are submitted/running concurrently, we would overwrite the existing file with the latest __spark_libs__.zip, which may cause application failures while the copy is in progress, and it would also be ambiguous which application should delete the file.
> Besides, I think we have several workarounds to handle this issue, like spark.yarn.jars or spark.yarn.archive, so it looks like this corner case is not really necessary to fix (just my thinking; normally people will not use a local FS in a real cluster).
I agree, this is a corner case and can be handled with a workaround.
Thinking about this again, I think you're right. But I'm not sure whether the program would crash if we delete the dependencies at runtime.
Adding a configuration is rarely the right fix for anything. You shouldn't have to choose whether or not to get the correct behavior.
You can probably fix this by always uploading resources that have a "file:" scheme to the distributed cache. This will penalize those running with a local default FS, but it will work. And the 99.9% of the world that does not do that will not be affected.
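A minimal, self-contained sketch of that idea (the object and method names here are made up; the real Client.scala logic around copyFileToRemote/compareFs is more involved):

```scala
import java.net.URI

object ForceUploadSketch {

  // Crude stand-in for Client.compareFs: same scheme and authority.
  private def sameFileSystem(a: URI, b: URI): Boolean =
    Option(a.getScheme) == Option(b.getScheme) &&
      Option(a.getAuthority) == Option(b.getAuthority)

  // Returns true when the source should be copied to the staging directory.
  def shouldCopyToRemote(source: URI, stagingDir: URI): Boolean = {
    val srcScheme = Option(source.getScheme).getOrElse("file")
    // Local ("file:") sources are always uploaded; everything else is copied
    // only when the source and destination file systems actually differ.
    srcScheme == "file" || !sameFileSystem(source, stagingDir)
  }

  def main(args: Array[String]): Unit = {
    val localStaging = new URI("file:///tmp/staging")
    // true: the local temp archive is uploaded even though both URIs are on the local FS
    println(shouldCopyToRemote(new URI("file:///tmp/__spark_libs__123.zip"), localStaging))
    // false: already on the same HDFS, no copy needed
    println(shouldCopyToRemote(new URI("hdfs://nn:8020/apps/app.jar"), new URI("hdfs://nn:8020/user/stage")))
  }
}
```

The point is just the first condition: a "file:" source is always uploaded, so deleting the local temp archive right afterwards stays safe.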
Thanks @vanzin for the comment, I will update the PR as per your suggestion.
ok to test
Test build #81953 has finished for PR 19141 at commit
LGTM, merging to master.
(Also merging to 2.2.)
[SPARK-21384] [YARN] Spark + YARN fails with LocalFileSystem as default FS

## What changes were proposed in this pull request?

When the libraries temp directory (i.e. the __spark_libs__*.zip dir) file system and the staging dir (destination) file system are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization. With this change, the client always copies the files to the remote FS when the source scheme is "file".

## How was this patch tested?

Verified manually in yarn/cluster and yarn/client modes with HDFS and local file systems.

Author: Devaraj K <devaraj@apache.org>

Closes #19141 from devaraj-kavali/SPARK-21384.

(cherry picked from commit 55d5fa7)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
What changes were proposed in this pull request?
When the libraries temp directory (i.e. the __spark_libs__*.zip dir) file system and the staging dir (destination) file system are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization.
With this change, the client always copies the files to the remote FS when the source scheme is "file".
How was this patch tested?
I have verified it manually in yarn/cluster and yarn/client modes with HDFS and local file systems.
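For reference, the manual check amounts to runs along these lines in both deploy modes (the example class and jar path are placeholders):

```
# yarn/client mode
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples.jar 100

# yarn/cluster mode
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples.jar 100
```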