
[BEAM-3074] Stage the pipeline in Python DataflowRunner #4010

Merged: 1 commit into apache:master, Oct 27, 2017

Conversation

@kennknowles (Member) commented Oct 18, 2017

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

R: @aaltay (or redirect to another appropriate Python reviewer?)

Just hacked this out naively; it probably isn't respecting abstractions quite right. I confirmed enough that the file is staged - much simpler than Java 👍. Also no tests 🥇.

In doing a manual smoke test, I just tried to follow some combination of the quickstart plus the contribution guide, but broke during staging because pip install --download doesn't like that I did pip install -e .[gcp]. Is there a doc that has the steps for a new contributor to run wordcount with local modifications? I'm a bit rusty on the approved way of setting up the virtualenv. The crash occurs after the pipeline is staged, so I was able to check the basics anyhow.

@aaltay (Member) left a comment

kennknowles wrote:

> Is there a doc that has the steps for a new contributor to run wordcount with local modifications?

Unfortunately we do not have a page that explains how to get started with Python development in Beam. Filed https://issues.apache.org/jira/browse/BEAM-3075 for this.

@@ -304,6 +311,11 @@ def run(self, pipeline):
self.dataflow_client = apiclient.DataflowApplicationClient(
pipeline._options)

# Upload the original proto for the pipeline
self.dataflow_client.stage_file(self.job.google_cloud_options.staging_location,
@aaltay (Member):

Could you do this within stage_job_resources (the existing resources = dependency.stage_job_resources(...) call)?

@kennknowles (Member, Author):

You want me to plumb the original pipeline through to there? It isn't part of the job.

@aaltay (Member):

I am conflicted about what to do here. I do not think we should plumb the pipeline through, but at the same time I want all staging to happen in a single location.

You can leave a TODO here to figure out what to do about it. Perhaps we can find a middle ground: a list of additional files/buffers that need to be staged could be passed to the dataflow_client.
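One hypothetical shape for that middle ground; nothing here exists in the codebase (extra_staged_files, pipeline, and options are all illustrative), it only sketches handing extra in-memory files to the existing staging path instead of plumbing the pipeline object through:

    # Hypothetical only: stage_job_resources has no such parameter; this
    # just illustrates staging extra buffers alongside the job resources.
    pipeline_bytes = pipeline.to_runner_api().SerializeToString()
    resources = dependency.stage_job_resources(
        options,
        extra_staged_files=[('pipeline.pb', pipeline_bytes)])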

@@ -79,6 +81,8 @@ class DataflowRunner(PipelineRunner):
CreatePTransformOverride(),
]

_STAGED_PIPELINE_FILENAME = "pipeline.pb"
@aaltay (Member):

Could you move this to runners/dataflow/internal/names.py?

@kennknowles (Member, Author):

Done
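For reference, after the move the constant would live in names.py along these lines; the post-move identifier is an assumption (the diff above shows the pre-move spelling):

    # apache_beam/runners/dataflow/internal/names.py (assumed final name)
    STAGED_PIPELINE_FILENAME = 'pipeline.pb'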

@kennknowles force-pushed the py-stage-pipeline branch 2 times, most recently from e458703 to 2d82eb3 on October 19, 2017 17:50
@kennknowles (Member, Author):

I actually forgot the most important part: telling the runner harness where to find the pipeline. This is now mostly a total rewrite, so I rebased anyhow; please take a fresh look.

  • The pipeline being "staged" is actually the DataflowRunner talking to the runner harness, so it isn't really part of the other staging logic, which is the SDK/user pipeline talking to the SDK harness. The staging location just happens to be a convenient place where we know the client can write and the runner harness can read. So I went back to using the raw stage_file method (see the sketch after this list), but I am happy to use more appropriate helper logic if it exists.
  • Setting the metadata "pipeline_url" to this proto's location is the common piece across SDKs. Everyone needs to put the pipeline somewhere accessible to the runner harness, but each SDK can choose where and set this field accordingly. It is a mandatory part of every job from now on.
  • It looks like the stage_file method returns nothing when staging to a local file, or the full GCS object (the upload response) when staging to GCS. Normally I would set the metadata field to a URL returned by stage_file, but I didn't want to mess with that, so the URL is hardcoded.
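To make the above concrete, here is a minimal sketch of the staging step. It assumes stage_file(path, name, stream) as in the diff quoted earlier; the helper name, the constant name, and the returned-URL convention are illustrative, not the PR's exact code:

    from io import BytesIO

    # Assumed constant, as moved into runners/dataflow/internal/names.py;
    # the diff above shows the pre-move name _STAGED_PIPELINE_FILENAME.
    STAGED_PIPELINE_FILENAME = 'pipeline.pb'

    def stage_pipeline_proto(dataflow_client, staging_location, pipeline):
        # Serialize the user pipeline to its portable proto representation.
        proto = pipeline.to_runner_api()
        # The staging location is just a place the client can write and the
        # runner harness can read; stage_file mirrors the call in the diff.
        dataflow_client.stage_file(
            staging_location,
            STAGED_PIPELINE_FILENAME,
            BytesIO(proto.SerializeToString()))
        # The caller records this URL as the job's "pipeline_url" metadata,
        # so the runner harness knows where to fetch the staged pipeline.
        return '%s/%s' % (staging_location, STAGED_PIPELINE_FILENAME)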

@kennknowles (Member, Author):

Incidentally, my formatting may or may not matter to you. I did some default IntelliJ autoformatting, because even though there is a yapf formatter, I think it only works on whole files. If you have a recommended tool or manual process, I am happy to oblige.

@aaltay (Member) left a comment

LGTM.

If you prefer, you can modify stage_file to return something in the local-file case; I do not think we use its return value anywhere.

For formatting I do not use a special tool. As long as it complies with the linter it is good.
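A sketch of the suggested stage_file tweak. This is a simplified, local-only stand-in, not the real apiclient implementation (which also handles GCS uploads and returns the upload response in that case); the signature here is an assumption:

    import os
    import shutil

    def stage_file(local_path, file_name, stream):
        # Write the stream next to the other staged artifacts.
        target = os.path.join(local_path, file_name)
        with open(target, 'wb') as f:
            shutil.copyfileobj(stream, f)
        # The suggested tweak: return something useful in the local-file
        # case (here, the written path) instead of None.
        return target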

@@ -40,7 +40,7 @@
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import WorkerOptions
from apache_beam.runners.dataflow.internal import dependency
from apache_beam.runners.dataflow.internal import dependency, names
@aaltay (Member):

Could you break this into two lines?

@kennknowles (Member, Author):

Done
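The resulting imports, split as requested (replacing the combined line in the diff above):

    from apache_beam.runners.dataflow.internal import dependency
    from apache_beam.runners.dataflow.internal import names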

@kennknowles (Member, Author):

Thanks! I'll wait for the Python precommit to pass then self-merge.

@kennknowles (Member, Author):

The failure looks like something in the fn-api runner that I've seen before.

@kennknowles (Member, Author):

run python precommit

@kennknowles (Member, Author):

Hmm, I either failed to push or wiped out my changes to apiclient_test.py.

@kennknowles (Member, Author):

retest this please

@kennknowles force-pushed the py-stage-pipeline branch 2 times, most recently from 8893d11 to 97a3a58 on October 25, 2017 01:37
@kennknowles (Member, Author):

I believe the remaining failure in the Python precommit is that HEAD is broken. This can wait.

@aaltay (Member) commented Oct 25, 2017

@kennknowles tests are failing due to https://issues.apache.org/jira/browse/BEAM-3040, you can try re-running or running tox locally. It is unfortunate that we are blocking PRs on flaky tests.

@kennknowles (Member, Author):

Yes. I think that, based on a personal judgment of a change's risk, we can sometimes move forward after careful manual validation.

@kennknowles (Member, Author):

retest this please

@kennknowles (Member, Author):

run python precommit

@kennknowles (Member, Author):

OK, anyhow the unit tests for Python are g2g. Gonna merge.

@aaltay (Member) commented Oct 27, 2017

Sounds good. Thank you!

@asfgit merged commit 7d59c96 into apache:master on Oct 27, 2017
asfgit pushed a commit that referenced this pull request on Oct 27, 2017:

  Stage the pipeline in Python DataflowRunner
@kennknowles deleted the py-stage-pipeline branch on April 25, 2018