Add helper task to print pipeline options for Dataflow portability #6979

Merged: 2 commits merged on Nov 8, 2018

@swegner swegner commented Nov 7, 2018

Currently it is difficult to manually run a portability pipeline as a
Beam user. This change adds a new Gradle task to conveniently generate
the necessary artifacts and print the necessary command-line options.

Usage:

./gradlew -p runners/google-cloud-dataflow-java printFnApiPipelineOptions

Sample output:

> Task :beam-runners-google-cloud-dataflow-java:printFnApiPipelineOptions
To run a Dataflow job with portability, add the following pipeline options to your command-line:
--dataflowWorkerJar=/usr/local/google/home/swegner/beam/runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.9.0-SNAPSHOT.jar --workerHarnessContainerImage=gcr.io/dataflow-build/swegner/beam/java:20181107141351 --experiments=beam_fn_api
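
As a rough sketch of how the printed options could be consumed by a script (not part of this change): the hypothetical `extract_portability_options` function below assumes the options are printed on a single line beginning with `--dataflowWorkerJar=`, as in the sample output above, and filters that line out of the Gradle output.

```shell
# Hypothetical helper, not part of Beam: reads Gradle output on stdin and
# prints the first line that starts with --dataflowWorkerJar=.
# Assumption: the task prints all portability options on that one line.
extract_portability_options() {
  sed -n '/^--dataflowWorkerJar=/{p;q;}'
}

# Intended usage (run from the Beam repository root):
#   ./gradlew -p runners/google-cloud-dataflow-java printFnApiPipelineOptions \
#     | extract_portability_options
```

If no matching line is found, the function prints nothing, so a caller can check for an empty result before going further.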


@swegner
Contributor Author

swegner commented Nov 7, 2018

Run Dataflow PortabilityApi ValidatesRunner

@boyuanzz
Contributor

boyuanzz commented Nov 7, 2018

We also launch portable pipelines in javaPrecommit: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/examples/build.gradle#L79. Do you think it's necessary to add this helper to the precommit? @swegner

@swegner
Contributor Author

swegner commented Nov 7, 2018

The reason I found a need for this: I'm trying to run a simple prototype pipeline from outside of the Gradle build (as a user, following the Java Quickstart). So the command line I need to run is something like:

mvn compile exec:java -Dexec.mainClass=example.MyPrototype -Dexec.args=" \
  --runner=DataflowRunner \
  --project=<..> --gcpTempLocation=<..> \
  [extra portability args]" \
  -Pdataflow-runner

I don't believe we publish maven artifacts to fulfill [extra portability args], so we need to run the Gradle tasks and use their results.
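
To make the intent concrete, here is a minimal sketch of how the two pieces could be stitched together. `build_mvn_command` is a hypothetical helper, not something in Beam; it only prints the Maven command so it can be inspected before running, and the main class, project, and temp location are whatever the caller supplies.

```shell
# Hypothetical helper, not part of Beam: prints (does not run) the Maven
# command line for a portable Dataflow job.
#   $1 = main class            $2 = GCP project
#   $3 = GCP temp location     $4 = portability options printed by the
#                                   printFnApiPipelineOptions Gradle task
build_mvn_command() {
  printf 'mvn compile exec:java -Dexec.mainClass=%s -Dexec.args="--runner=DataflowRunner --project=%s --gcpTempLocation=%s %s" -Pdataflow-runner\n' \
    "$1" "$2" "$3" "$4"
}
```

For example, `build_mvn_command example.MyPrototype "$PROJECT" "$GCP_TEMP_LOCATION" "$OPTS"`, where `$PROJECT` and `$GCP_TEMP_LOCATION` are placeholder variables and `$OPTS` holds the options line printed by the Gradle task.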

It seems there's a lot of redundant logic in the examples project you link above (i.e. buildAndPushDockerContainer). @boyuanzz do you know if the redundancy is intentional, or should I unify them?

@boyuanzz
Contributor

boyuanzz commented Nov 7, 2018

For the Maven build, the worker jar is not published on purpose: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/build.gradle#L57. For end users, they are supposed to use the worker jar cached by Dataflow service.

@boyuanzz
Contributor

boyuanzz commented Nov 7, 2018

As for the redundant logic, I'm not sure it's worth unifying since there are just 2 usages across the whole project, but feel free to refactor them if needed.

@swegner
Contributor Author

swegner commented Nov 7, 2018

Run Dataflow PortabilityApi ValidatesRunner

@swegner
Contributor Author

swegner commented Nov 7, 2018

For end users, they are supposed to use the worker jar cached by Dataflow service.

@boyuanzz do you know if the service caches worker jars for portable pipelines (in addition to legacy)? Also, are we still publishing a 'latest' image for running from a -SNAPSHOT build? /cc @lukecwik

@boyuanzz
Contributor

boyuanzz commented Nov 7, 2018

For end users, they are supposed to use the worker jar cached by Dataflow service.

@boyuanzz do you know if the service caches worker jars for portable pipelines (in addition to legacy)?

I believe the service also caches the portable worker jar. If the dataflowWorkerJar option is missing, then it will pick the cached worker jar. Here is the design doc, which may be more illustrative: https://docs.google.com/document/d/1-m-GzkYWIODKOEl1ZSUNXYbcGRvRr3QkasfHsJxbuoA/edit#heading=h.gh88g5y0rekp

@swegner
Contributor Author

swegner commented Nov 7, 2018

Thanks, good to know. I think this helper is still useful in order to use the worker image from HEAD since the cached version in the service may be updated less frequently.

I believe this is ready for review. @boyuanzz PTAL

@boyuanzz
Contributor

boyuanzz commented Nov 7, 2018

Yes, it would be very helpful if a dev is trying to test their Dataflow worker changes. Btw, the ValidatesRunner test takes ~3 hr to complete.

@boyuanzz
Contributor

boyuanzz commented Nov 7, 2018

Run Java PortabilityApi PostCommit

@swegner swegner merged commit 42984a8 into apache:master Nov 8, 2018
swegner added a commit to swegner/beam that referenced this pull request Nov 8, 2018
…o print pipeline options for Dataflow portability"

This appears to be breaking post-commit tests:
https://builds.apache.org/job/beam_PreCommit_JavaPortabilityApi_Cron/26

This reverts commit 42984a8, reversing
changes made to b558c4d.
@swegner
Contributor Author

swegner commented Nov 8, 2018

FYI, this appears to be breaking post-commits: https://issues.apache.org/jira/browse/BEAM-6028

swegner added a commit to swegner/beam that referenced this pull request Nov 9, 2018
…nt pipeline options for Dataflow portability"

This is a revert of commit 3846092,
which previously reverted commit 42984a8.
swegner added a commit that referenced this pull request Nov 9, 2018
… Add helper task to print pipeline options for Dataflow portability
ajamato pushed a commit to ajamato/beam that referenced this pull request Nov 16, 2018
…nt pipeline options for Dataflow portability"

This is a revert of commit 3846092,
which previously reverted commit 42984a8.
@swegner swegner deleted the dataflow_portability_gradle branch February 14, 2019 18:49