New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7696] Prepare files to stage also in local master of spark runner. #9019
Conversation
Run Spark ValidatesRunner |
Run Java Spark PortableValidatesRunner Batch |
R: @ibzib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Yanlin, thanks for the PR.
Since PipelineResources.java
is in the core-construction
package, it is shared by all Beam runners. While adding directories just causes Spark to throw an exception, directories might be acceptable or even desirable on different runners (I don't actually know--just being cautious here). So it would be better to limit this change to the runners.spark
package (maybe by calling PipelineResources.detectClassPathResourcesToStage
as-is, then removing directories in SparkPipelineOptions.prepareFilesToStage
).
...truction-java/src/main/java/org/apache/beam/runners/core/construction/PipelineResources.java
Outdated
Show resolved
Hide resolved
Yeah. I aggre with you. And in this patch, I do exclude directories. |
I suggest changing only the Spark runner code. That way, we can exclude nonexistent files and existing directories, without having to worry about the other runners. (While it may seem strange that other runners would ever use paths to resources that do not exist at the time of calling |
OK. This is a safer way. |
d1f55e5
to
3015444
Compare
Run Spark ValidatesRunner |
List<String> filesToStage = | ||
options.getFilesToStage().stream() | ||
.map(File::new) | ||
.filter(File::exists) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about directories?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PipelineResources.prepareFilesForStaging will package directories into jar file. That's why call this method before create a SparkContext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I missed that.
It looks like prepareFilesForStaging
also throws IllegalStateException
if a file doesn't exist. It might be better to throw the exception to alert the user, rather than silently filtering out a missing dependency and have an error show up later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I guess there are always nonexistent files. Weird. Well, this is fine then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @yanlin-Lynn ! |
I suggest that we re-use the code in Dataflow and to automatically convert directory classpath entries into jars: Line 256 in 3e97543
It will likely require some refactoring to make it available to more runners. |
Currently, there exist |
I thought the issue was that the SparkRunner couldn't handle directories but I misread the conversation above and its related to non-existent files so my comment doesn't make any sense and no change is needed/wanted. |
The result of PipelineResources.detectClassPathResourcesToStage may contain directory, which will cause exception, when running unit test of SparkRunner. See the attached picture file in BEAM-7696 for detail of exception in such case. Even though the exception does not make SparkContext stop when running in local master, it's better not to include non-jars files in classpath.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.