
[BEAM-8457] Label Dataflow jobs from Notebook #9854

Merged
pabloem merged 2 commits into apache:master from nika-qubit:BEAM-8457
Oct 23, 2019

Conversation

@nika-qubit
Contributor

  1. Changed the pipeline.run() API to accept a runner and an options
    parameter so that a pipeline initially bundled with an interactive runner
    can be run directly by other runners from a notebook.
  2. Implicitly added the necessary source information through user labels
    when the user does p.run(runner=DataflowRunner(), options=options) or
    DataflowRunner().run_pipeline(p, options) (see the sketch below).
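
A minimal usage sketch of what this change enables, assuming a notebook environment; the pipeline contents and the option values (project, region, temp_location) are illustrative placeholders, not taken from this PR:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.runners.dataflow.dataflow_runner import DataflowRunner
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

    # Build and explore the pipeline with the interactive runner in a notebook.
    p = beam.Pipeline(InteractiveRunner())
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)

    # With this change, the same pipeline object can be handed off to another
    # runner directly; the option values here are placeholders.
    options = PipelineOptions(
        project='my-project', region='us-central1',
        temp_location='gs://my-bucket/tmp')
    result = p.run(runner=DataflowRunner(), options=options)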



Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[Post-commit build status badge matrix: Go, Java, Python, and XLang SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners]

Pre-Commit Tests Status (on master branch)

[Pre-commit build status badge matrix: non-portable and portable Java, Python, Go, and Website builds]
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@nika-qubit
Contributor Author

R: @pabloem
PTAL, thanks!

Member


What if either runner or options are not provided? Should that throw an error? Currently, if only one is provided, it'll be ignored - and that would be quite surprising for users.

Contributor Author


You're right! This would surprise the user. I've changed it to throw an error if only one of the two is provided, instead of silently ignoring the input.
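
A hypothetical sketch of the validation being described; the parameter handling and the error message are illustrative, not the PR's exact code:

    # Hypothetical Pipeline.run() validation: raise when only one of the two
    # new arguments is supplied instead of silently ignoring it.
    def run(self, test_runner_api=True, runner=None, options=None):
        if (runner is None) != (options is None):
            raise ValueError(
                'Provide both a runner and options to run with a different '
                'runner, or provide neither to use the pipeline\'s own.')
        runner_in_use = runner or self.runner
        options_in_use = options or self._options
        # ... continue the normal run path with runner_in_use/options_in_use.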

Member


This seems fine - but what if we go with the runner_in_use codepath? Would users do p.run(runner=InteractiveRunner(DataflowRunner()), options=...)? Or would users create a pipeline with InteractiveRunner and then do p.run(runner=DataflowRunner(), ...)? Is it possible for users to do p = beam.Pipeline() and then InteractiveRunner().run_pipeline(p) or InteractiveRunner(DataflowRunner()).run_pipeline(p)?

IIUC users would have to pass the interactive runner in p = beam.Pipeline() to activate the interactive mode, right? InteractiveRunner is not automatically selected?

Contributor Author


I'd missed the path where a new Pipeline is created and run() is invoked again.
Yes, all of these would be possible.
I've added an interactive parameter to the Pipeline constructor with a default value of None. run() and from_runner_api() pass the None or bool value down no matter how the user chains the runners. I'm not very confident about the naming, but the change should be backward compatible for Beam.

Currently, I'm running into a problem when testing. Once I set labels, the Dataflow job fails immediately with an "Error processing pipeline." error. There is no job graph, no worker is started, and there are no logs. It looks like when there is a user label in the job request, Dataflow cannot convert the work item into its internal representation.

I'll do some investigation and figure out why.

@nika-qubit nika-qubit force-pushed the BEAM-8457 branch 2 times, most recently from 7caba45 to a9de48a on October 23, 2019 00:12
1. Changed the pipeline.run() API to accept a runner and an options
parameter so that a pipeline initially bundled with an interactive runner
can be run directly by other runners from a notebook.
2. Implicitly added the necessary source information through user labels
when the user does p.run(runner=DataflowRunner(), options=options) or
DataflowRunner().run_pipeline(p, options).
3. The user '--labels' option doesn't support the characters '.' or '"'. When
applying the version-related label, replace '.' with '_' and avoid enclosing
any label in '"' (see the sketch below).
Member

@pabloem pabloem left a comment


This looks good, but I don't see why we need to modify the from_runner_api signature. I defer to @robertwb

    -             self._options).run(False)
    +             runner_in_use,
    +             options_in_use,
    +             interactive=self.interactive).run(False)
Member


Did you find that this was necessary? I don't think we should change the signature of the from_runner_api call. The pipeline protobuf should contain all the necessary information... Though I'd defer to @robertwb on this.

Contributor Author


No, it's not necessary. On second thought, I can just move the logic that determines interactivity ad hoc into run() and make the interactive field a parameter of the run() method. Then I don't even need to change the Pipeline constructor.

Also, I've added this to the interactive_runner for the use case below:
InteractiveRunner(underlying_runner=DataflowRunner()).run_pipeline(Pipeline(DataflowRunner()))
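
A hypothetical sketch of the reworked approach, determining interactivity ad hoc inside run() and threading it through as a run() parameter rather than a Pipeline constructor argument; the names mirror the discussion above, but this is not the PR's actual code:

    # Hypothetical: infer interactivity from the runner actually in use and
    # pass it along explicitly rather than storing it on the Pipeline.
    def run(self, test_runner_api=True, runner=None, options=None):
        from apache_beam.runners.interactive.interactive_runner import (
            InteractiveRunner)
        runner_in_use = runner or self.runner
        options_in_use = options or self._options
        interactive = isinstance(runner_in_use, InteractiveRunner)
        # ... hand off to runner_in_use, attaching the notebook job labels
        # when interactive is True and the target runner is Dataflow.

Note that any variant of this pattern still requires pipeline.py to know about InteractiveRunner, which is the coupling objected to in the comments below.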

@pabloem pabloem merged commit 1a8391d into apache:master Oct 23, 2019
@pabloem
Member

pabloem commented Oct 23, 2019

Thanks Ning

@charlesccychen
Contributor

@kevingg @pabloem I believe this PR may be causing issues in environments without IPython because it unconditionally imports the interactive runner (and thereby IPython).

@aaltay
Member

aaltay commented Oct 25, 2019

Let's roll this back. I do not think we should be importing the interactive runner in pipeline.py. Also, my understanding is that the runner, not the pipeline, should keep track of interactivity.

aaltay added a commit that referenced this pull request Oct 25, 2019
pabloem pushed a commit to pabloem/beam that referenced this pull request Oct 25, 2019
@robertwb
Contributor

+1. We should not be importing the interactive runner (it's causing problems with tests as well), and interactivity should not be a property of the pipeline but of the runner (and I'd prefer a design that avoids passing an interactive bit around everywhere).

Ardagan added a commit that referenced this pull request Nov 15, 2019
[release-2.17.0] Revert "Merge pull request #9854 from [BEAM-8457] Label Dataflow jobs…
