
[BEAM-8457] Label Dataflow jobs from Notebook #9854

Merged
pabloem merged 2 commits into apache:master from nika-qubit:BEAM-8457
Oct 23, 2019

Conversation

@nika-qubit
Contributor

  1. Changed the pipeline.run() API to accept a runner and an options
    parameter so that a pipeline initially bundled with an interactive runner
    can be run directly by other runners from a notebook.
  2. Implicitly added the necessary source information through user labels
    when the user does p.run(runner=DataflowRunner(), options=options) or
    DataflowRunner().run_pipeline(p, options) (see the sketch below).
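
A minimal usage sketch of what this change enables, assuming a notebook environment; the pipeline contents and the option values (project, region, temp_location) are illustrative placeholders, not taken from this PR:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.runners.dataflow.dataflow_runner import DataflowRunner
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

    # Build and explore the pipeline with the interactive runner in a notebook.
    p = beam.Pipeline(InteractiveRunner())
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)

    # With this change, the same pipeline object can be handed off to another
    # runner directly; the option values here are placeholders.
    options = PipelineOptions(
        project='my-project', region='us-central1',
        temp_location='gs://my-bucket/tmp')
    result = p.run(runner=DataflowRunner(), options=options)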



Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[Post-commit build status badge matrix: Go, Java, Python, and XLang SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners]

Pre-Commit Tests Status (on master branch)

[Pre-commit build status badge matrix: non-portable and portable Java, Python, Go, and Website builds]
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@nika-qubit
Contributor Author

R: @pabloem
PTAL, thanks!

Member


What if either runner or options are not provided? Should that throw an error? Currently, if only one is provided, it'll be ignored - and that would be quite surprising for users.

Contributor Author


You're right! This would surprise the user. I've changed it to throw an error if only one of the two is provided, instead of silently ignoring the input.
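
A hypothetical sketch of the validation being described; the parameter handling and the error message are illustrative, not the PR's exact code:

    # Hypothetical Pipeline.run() validation: raise when only one of the two
    # new arguments is supplied instead of silently ignoring it.
    def run(self, test_runner_api=True, runner=None, options=None):
        if (runner is None) != (options is None):
            raise ValueError(
                'Provide both a runner and options to run with a different '
                'runner, or provide neither to use the pipeline\'s own.')
        runner_in_use = runner or self.runner
        options_in_use = options or self._options
        # ... continue the normal run path with runner_in_use/options_in_use.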

Member


This seems fine - but what if we go with the runner_in_use codepath? Would users do p.run(runner=InteractiveRunner(DataflowRunner()), options=...)? Or would users create a pipeline with InteractiveRunner and then do p.run(runner=DataflowRunner(), ...)? Is it possible for users to do p = beam.Pipeline() and then InteractiveRunner().run_pipeline(p) or InteractiveRunner(DataflowRunner()).run_pipeline(p)?

IIUC users would have to pass the interactive runner in p = beam.Pipeline() to activate the interactive mode, right? InteractiveRunner is not automatically selected?

Contributor Author


I'd missed the path where a new Pipeline is created and run() is invoked again.
Yes, all of these would be possible.
I've added an interactive parameter to the Pipeline constructor with a default value of None. run() and from_runner_api() pass the None or bool value down no matter how the user chains the runners. I'm not very confident about the naming, but the change should be backward compatible for Beam.

Currently, I'm running into a problem when testing. Once I set labels, the Dataflow job fails immediately with an "Error processing pipeline." error. There is no job graph, no worker is started, and there are no logs. It looks like when there is a user label in the job request, Dataflow cannot convert the work item into its internal representation.

I'll do some investigation and figure out why.

@nika-qubit nika-qubit force-pushed the BEAM-8457 branch 2 times, most recently from 7caba45 to a9de48a on October 23, 2019 00:12
1. Changed the pipeline.run() API to accept a runner and an options
parameter so that a pipeline initially bundled with an interactive runner
can be run directly by other runners from a notebook.
2. Implicitly added the necessary source information through user labels
when the user does p.run(runner=DataflowRunner(), options=options) or
DataflowRunner().run_pipeline(p, options).
3. The user '--labels' option doesn't support the characters '.' or '"'. When
applying the version-related label, replace '.' with '_' and avoid enclosing
any label in '"' (see the sketch below).
Member

@pabloem pabloem left a comment


This looks good, but I don't see why we need to modify the from_runner_api signature. I defer to @robertwb

    -             self._options).run(False)
    +             runner_in_use,
    +             options_in_use,
    +             interactive=self.interactive).run(False)
Member


Did you find that this was necessary? I don't think we should change the signature of the from_runner_api call. The pipeline protobuf should contain all the necessary information... Though I'd defer to @robertwb on this.

Contributor Author


No, it's not necessary. On second thought, I can just move the logic that determines interactivity ad hoc into run() and make the interactive field a parameter of the run() method. Then I don't even need to change the Pipeline constructor.

Also, I've added this to the interactive_runner for the use case below:
InteractiveRunner(underlying_runner=DataflowRunner()).run_pipeline(Pipeline(DataflowRunner()))
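
A hypothetical sketch of the reworked approach, determining interactivity ad hoc inside run() and threading it through as a run() parameter rather than a Pipeline constructor argument; the names mirror the discussion above, but this is not the PR's actual code:

    # Hypothetical: infer interactivity from the runner actually in use and
    # pass it along explicitly rather than storing it on the Pipeline.
    def run(self, test_runner_api=True, runner=None, options=None):
        from apache_beam.runners.interactive.interactive_runner import (
            InteractiveRunner)
        runner_in_use = runner or self.runner
        options_in_use = options or self._options
        interactive = isinstance(runner_in_use, InteractiveRunner)
        # ... hand off to runner_in_use, attaching the notebook job labels
        # when interactive is True and the target runner is Dataflow.

Note that any variant of this pattern still requires pipeline.py to know about InteractiveRunner, which is the coupling objected to in the comments below.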

@pabloem pabloem merged commit 1a8391d into apache:master Oct 23, 2019
@pabloem
Member

pabloem commented Oct 23, 2019

Thanks Ning

@charlesccychen
Contributor

@kevingg @pabloem I believe this PR may be causing issues in environments without IPython because it unconditionally imports the interactive runner (and thereby IPython).

@aaltay
Member

aaltay commented Oct 25, 2019

Let's roll this back. I do not think we should be importing the interactive runner in pipeline.py. Also, my understanding is that the runner, not the pipeline, should keep track of interactivity.

aaltay added a commit that referenced this pull request Oct 25, 2019
pabloem pushed a commit to pabloem/beam that referenced this pull request Oct 25, 2019
@robertwb
Contributor

+1. We should not be importing the interactive runner (it's causing problems with tests as well), and interactivity should not be a property of the pipeline but of the runner (and I'd prefer a design that avoids passing an interactive bit around everywhere).

Ardagan added a commit that referenced this pull request Nov 15, 2019
[release-2.17.0] Revert "Merge pull request #9854 from [BEAM-8457] Label Dataflow jobs…
