[BEAM-6049] Add option to load job to GCS in Dataflow Runner #7047
Conversation
Force-pushed 1b00c1a to 2812422
swegner left a comment
Thanks for this contribution, and for adding unit tests.
It would also be useful to have an integration test, to ensure that a pipeline submitted to the Dataflow service with this option actually runs successfully.
@robertwb do we have Dataflow-specific tests which validate integration between runner harness and service? Searching around a bit I don't see any tests within the beam-runners-google-cloud-dataflow-java project which actually run pipelines.
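For context, a minimal integration-style check against the Dataflow service could look like the sketch below. This is not an existing Beam test; the class name, project id, and GCS path are placeholders for illustration only.

// Hypothetical sketch of a Dataflow service integration check; not an existing Beam test.
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class DataflowServiceSmokeCheck {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-gcp-project");            // placeholder project id
    options.setTempLocation("gs://my-bucket/temp");  // placeholder GCS temp location

    Pipeline p = Pipeline.create(options);
    p.apply(Create.of("a", "b", "c"));

    // A real integration test would assert that the job reaches DONE on the service.
    PipelineResult.State state = p.run().waitUntilFinish();
    System.out.println("Terminal state: " + state);
  }
}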
import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.ListJobsResponse;
import com.google.api.services.dataflow.model.WorkerPool;
import com.google.api.services.dataflow.model.*;
Beam style rules dictate listing individual imports and not wildcards. See checkstyle failure.
FYI, you can re-run checkstyle directly via ./gradlew checkstyleMain checkstyleTest
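For illustration, the checkstyle-compliant form lists every class explicitly instead of the wildcard; the extra DataflowPackage import below is only an assumed example of a class this change might need, not taken from the diff.

// Individual imports instead of the wildcard; DataflowPackage is an assumed example.
import com.google.api.services.dataflow.model.DataflowPackage;
import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.ListJobsResponse;
import com.google.api.services.dataflow.model.WorkerPool;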
Hmmm...I don't think I did that manually. Is this something IntelliJ does automatically and do you happen to know how to disable it?
Thanks!
void setDataflowWorkerJar(String dataflowWorkerJar);

@Description(
Will this eventually become the default behavior? Is there any reason to support both?
If this will become the default, I suggest using an experiment pipeline option rather than a named option. PipelineOptions become part of the public API and are harder to drop later.
Done.
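As a rough sketch of the experiment-based approach (not necessarily the exact code in this PR), the runner can gate the new behavior on the experiments list rather than a dedicated option; the "upload_graph" name matches what later comments in this thread enable, and the surrounding variables are assumed to be in scope inside DataflowRunner.run().

// Sketch: gate the behavior on an experiment (org.apache.beam.sdk.options.ExperimentalOptions)
// instead of a named PipelineOption, so it can be removed later without breaking the options API.
if (ExperimentalOptions.hasExperiment(options, "upload_graph")) {
  // stage the job graph to GCS and strip it from the submission request
}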
Force-pushed 2812422 to 43fa4ba
Force-pushed 43fa4ba to 1062e4c
It's not supported in the service yet. I have that change pending because I'd like to submit it together with an integration test. I can leave this JIRA open until I add an integration test here too, once it's available in the service.
LGTM. It seems the Java post-commits are currently red. I'll merge once the tests look healthy.
Run JavaPortabilityApi PreCommit
All green!
Is this functional? We just upgraded to 2.9 and are hitting the size limitation on one of the bigger jobs. I have specified the following options on the command line: I have verified the graph is uploaded to the directory, but Dataflow immediately fails due to
Whoops, the logs do add Which I don't believe is helpful.
@bbhoss I'm getting the same error. We enabled it via:

DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setExperiments(Arrays.asList("upload_graph"));

@swegner / @foegler
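For completeness, a sketch of the equivalent flag-based configuration using the standard --experiments pipeline option; the staging path below is a placeholder.

// Same effect via pipeline arguments; the staging path is a placeholder.
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(
            "--runner=DataflowRunner",
            "--experiments=upload_graph",
            "--stagingLocation=gs://my-bucket/staging")
        .as(DataflowPipelineOptions.class);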
According to GCP support:

"After my initial research, I was able to confirm this feature is not yet supported on the Dataflow service side. The Dataflow Engineering team is working on it, but there is no ETA yet. However, it is not likely to be available before the end of Q2."

Good thing we pay for GCP support or we'd never know... Very confusing that the API library suggests a solution that isn't supported yet...
Thanks for the info. It definitely seems like it was merged erroneously, unless there is an EAP we are unaware of.
FYI the feature appears to be live now: https://issuetracker.google.com/issues/139540290
Support an option "uploadGraph" which uploads the Job object to the user's staging directory, removes the embedded graph before submission, and has the service retrieve the graph from the staged file. This allows the DataflowRunner to support significantly larger job definitions.
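A rough sketch of that flow, assuming helper and field names (stageToFile, the staged file name, and setStepsLocation are approximations rather than a quote of the merged code; options and newJob are assumed to be in scope in DataflowRunner.run()):

// Sketch only: stage the translated Job to GCS and strip the inline graph.
// Helper names and the "dataflow_graph.json" file name are assumptions.
if (ExperimentalOptions.hasExperiment(options, "upload_graph")) {
  byte[] jobGraphBytes =
      DataflowPipelineTranslator.jobToString(newJob).getBytes(StandardCharsets.UTF_8);
  DataflowPackage stagedGraph =
      options.getStager().stageToFile(jobGraphBytes, "dataflow_graph.json");
  newJob.getSteps().clear();                          // drop the embedded graph from the request
  newJob.setStepsLocation(stagedGraph.getLocation()); // point the service at the staged graph
}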
R: @swegner