New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-3074] Stage the pipeline in Python DataflowRunner #4010
Conversation
2f52933
to
d9ce45c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kennknowles wrote:
Follow this checklist to help us incorporate your contribution quickly and easily:
- Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
- Each commit in the pull request should have a meaningful subject line and body.
- Format the pull request title like
[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue.- Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
- Run
mvn clean verify
to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.- If this contribution is large, please file an Apache Individual Contributor License Agreement.
R: @aaltay (or redirect to other appropriate Python reviewer?)
Just hacked this out naively; it probably isn't respecting abstractions quite right. I confirmed enough that the file is staged - much simpler than Java 👍. Also no tests 🥇.
In doing a manual smoke test, I just tried to follow some combination of the quickstart plus the contribution guide, but broke during staging because
pip install --download
doesn't like that I didpip install -e .[gcp]
. Is there a doc that has the steps for a new contributor to run wordcount with local modifications? I'm a bit rusty on the approved way of setting up the virtualenv. The crash occurs after the pipeline is staged, so I was able to check the basics anyhow.
Unfortunately we do not have a page that explains how to get started with developing python in Beam. Filed https://issues.apache.org/jira/browse/BEAM-3075 for this.
@@ -304,6 +311,11 @@ def run(self, pipeline): | |||
self.dataflow_client = apiclient.DataflowApplicationClient( | |||
pipeline._options) | |||
|
|||
# Upload the original proto for the pipeline | |||
self.dataflow_client.stage_file(self.job.google_cloud_options.staging_location, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you do this within stage_job_resources (
resources = dependency.stage_job_resources( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want me to plumb the original pipeline
through to there? It isn't part of the job
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am conflicted about what to do here. I do not think we should plumb the pipeline. At the same time I want to keep all staging to happen in a single location.
You can leave a TODO here to figure out what to do about it. Perhaps we can find a middle ground as a list of additional files/buffers that needs to be staged could be passed to the dataflow_client.
@@ -79,6 +81,8 @@ class DataflowRunner(PipelineRunner): | |||
CreatePTransformOverride(), | |||
] | |||
|
|||
_STAGED_PIPELINE_FILENAME = "pipeline.pb" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to runners/dataflow/internal/names.py ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
e458703
to
2d82eb3
Compare
I actually forgot the most important part, which was the part where we tell the runner harness where to find the pipeline. So this is mostly a total rewrite so I rebased anyhow; please take a fresh look.
|
Incidentally, my formatting may or may not matter to you. I did some default IntelliJ autoformatting, because even though there is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
If you prefer you can modify stage_file
to return something in the local file case. I do not think we use its return value anywhere.
For formatting I do not use a special tool. As long as it complies with the linter it is good.
@@ -40,7 +40,7 @@ | |||
from apache_beam.options.pipeline_options import GoogleCloudOptions | |||
from apache_beam.options.pipeline_options import StandardOptions | |||
from apache_beam.options.pipeline_options import WorkerOptions | |||
from apache_beam.runners.dataflow.internal import dependency | |||
from apache_beam.runners.dataflow.internal import dependency, names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you break this into two lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
2d82eb3
to
0a36072
Compare
Thanks! I'll wait for the Python precommit to pass then self-merge. |
The failure looks like something in the fn-api runner that I've seen before. |
run python precommit |
Hmm, I either failed to push or wiped out my changes to |
0a36072
to
8e7ad7e
Compare
retest this please |
8893d11
to
97a3a58
Compare
I believe the remaining failure in the Python precommit is that HEAD is broken. This can wait. |
@kennknowles tests are failing due to https://issues.apache.org/jira/browse/BEAM-3040, you can try re-running or running |
97a3a58
to
8003f1d
Compare
Yes, I think based on personal judgment of the risk of a change we can move forward sometimes after careful manual validation. |
8003f1d
to
7d59c96
Compare
retest this please |
run python precommit |
OK, anyhow the unit tests for Python are g2g. Gonna merge. |
Sounds good. Thank you! |
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue.mvn clean verify
to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.R: @aaltay (or redirect to other appropriate Python reviewer?)
Just hacked this out naively; it probably isn't respecting abstractions quite right. I confirmed enough that the file is staged - much simpler than Java 👍. Also no tests 🥇.
In doing a manual smoke test, I just tried to follow some combination of the quickstart plus the contribution guide, but broke during staging because
pip install --download
doesn't like that I didpip install -e .[gcp]
. Is there a doc that has the steps for a new contributor to run wordcount with local modifications? I'm a bit rusty on the approved way of setting up the virtualenv. The crash occurs after the pipeline is staged, so I was able to check the basics anyhow.