foegler commented Nov 15, 2018

Support an option "uploadGraph" which uploads the Job object to the user's staging directory, removes the embedded graph from the job before submission, and has the service retrieve the graph from the staging directory. This allows the DataflowRunner to support significantly larger job definitions.
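
The flow described above can be sketched in plain Java. This is only an illustration under assumptions: `JobSpec`, `STAGING`, `uploadGraph`, `retrieveGraph`, and the `graph.json` file name are hypothetical stand-ins, not the Beam or Dataflow API, and an in-memory map plays the role of the staging directory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; every name here is hypothetical, not the Beam or Dataflow API. */
final class UploadGraphSketch {

  /** Stand-in for a job definition: single-line serialized steps plus an optional pointer to them. */
  static final class JobSpec {
    final List<String> steps = new ArrayList<>();
    String stepsLocation; // null while the graph is still embedded in the job
  }

  /** Stand-in for the staging directory: maps a file path to the bytes written there. */
  static final Map<String, byte[]> STAGING = new HashMap<>();

  /**
   * Writes the serialized steps to the staging directory, clears them from the job,
   * and records where they were written so the receiver can fetch them later.
   */
  static void uploadGraph(JobSpec job, String stagingDir) {
    String path = stagingDir + "/graph.json"; // hypothetical file name
    String serialized = String.join("\n", job.steps);
    STAGING.put(path, serialized.getBytes(StandardCharsets.UTF_8));
    job.steps.clear();
    job.stepsLocation = path;
  }

  /** What the receiving side would do: load the steps back from the recorded location. */
  static List<String> retrieveGraph(JobSpec job) {
    byte[] bytes = STAGING.get(job.stepsLocation);
    if (bytes == null) {
      throw new IllegalStateException("No graph found at " + job.stepsLocation);
    }
    String text = new String(bytes, StandardCharsets.UTF_8);
    return text.isEmpty() ? new ArrayList<>() : new ArrayList<>(Arrays.asList(text.split("\n")));
  }
}
```

The point of the design is that the submitted job carries only a location string instead of the full graph, so its size no longer grows with the pipeline.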

R: @swegner

foegler force-pushed the runnerChanges branch 3 times, most recently from 1b00c1a to 2812422, on November 15, 2018 02:20
Contributor

swegner left a comment

Thanks for this contribution, and for adding unit tests.

It would also be useful to have an integration test, to ensure that a pipeline submitted to Dataflow service with this option actually runs successfully.

@robertwb do we have Dataflow-specific tests which validate integration between runner harness and service? Searching around a bit I don't see any tests within the beam-runners-google-cloud-dataflow-java project which actually run pipelines.

import com.google.api.services.dataflow.model.Job;
import com.google.api.services.dataflow.model.ListJobsResponse;
import com.google.api.services.dataflow.model.WorkerPool;
import com.google.api.services.dataflow.model.*;
Contributor

Beam style rules dictate listing individual imports and not wildcards. See checkstyle failure.

FYI, you can re-run checkstyle directly via ./gradlew checkstyleMain checkstyleTest

Author

Hmmm...I don't think I did that manually. Is this something IntelliJ does automatically and do you happen to know how to disable it?

Contributor

Author

Thanks!


void setDataflowWorkerJar(String dataflowWorkerJar);

@Description(
Contributor

Will this eventually become the default behavior? Is there any reason to support both?

If this will become default, I suggest using an experiment pipeline option rather than a named option. PipelineOptions become part of the public API and are harder to drop later.
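
As a rough illustration of the experiment approach (a sketch with hypothetical helper names, not the runner's actual code): the feature is gated on a string in a free-form experiments list, so no new getter or setter has to be added to the public options interface, and the check can simply be dropped once the behavior becomes default.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the class and method names are hypothetical. */
final class ExperimentFlagSketch {

  /** Splits a comma-separated flag value such as "upload_graph,other" into experiment names. */
  static List<String> parseExperiments(String flagValue) {
    List<String> result = new ArrayList<>();
    if (flagValue == null) {
      return result;
    }
    for (String part : flagValue.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }

  /** Returns true if the named experiment is present; a null list means no experiments. */
  static boolean hasExperiment(List<String> experiments, String name) {
    return experiments != null && experiments.contains(name);
  }
}
```

A caller would then branch on something like `hasExperiment(experiments, "upload_graph")` instead of reading a dedicated option.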

Author

Done.

Author

foegler commented Nov 16, 2018

It's not supported in the service yet. The service-side change is still pending because I'd like to submit it together with an integration test. I can leave this JIRA open until the feature is available in the service and I've added an integration test here too.

Contributor

swegner commented Nov 16, 2018

LGTM. It seems the Java post-commits are currently red. I'll merge once the tests look healthy.

Contributor

swegner commented Nov 16, 2018

Run JavaPortabilityApi PreCommit

Contributor

swegner commented Nov 16, 2018

All green!

swegner merged commit 841f9eb into apache:master Nov 16, 2018

bbhoss commented Feb 15, 2019

Is this functional? We just upgraded to 2.9 and are hitting the size limitation on one of the bigger jobs. I have specified the following options on the command line:

 --experiments=upload_graph --tempLocation=gs://mybucket/foo --stagingLocation=gs://mybucket/foo

I have verified the graph is uploaded to the directory, but Dataflow immediately fails with:
Failed to retrieve the job graph. Please see the job logs for more details.
The logs that I can see don't have any details (the logs in the Dataflow UI and Stackdriver are blank other than this message).

bbhoss commented Feb 15, 2019

Whoops, the logs do add

Runnable workflow has no steps specified.

Which I don't believe is helpful.

Jdban commented Mar 8, 2019

@bbhoss I'm getting the same error with 2.10.0 and 2.11.0
Runnable workflow has no steps specified.

We enabled it via:

DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setExperiments(Arrays.asList("upload_graph"));

@swegner/@foegler
How do we get this working with GCP dataflow? Has the feature been enabled in the service yet?

Jdban commented Mar 11, 2019

According to GCP support:

After my initial research, I was able to confirm this feature is not yet supported on the Dataflow service side. The Dataflow Engineering team is working on it, but there is no ETA yet. However, it is not likely to be available before the end of Q2.

Good thing we pay for GCP support or we'd never know...

Very confusing that the API library suggests a solution that isn't supported yet...

bbhoss commented Mar 11, 2019 via email

Jdban commented Sep 17, 2019

FYI the feature appears to be live now: https://issuetracker.google.com/issues/139540290
