[BEAM-8457] Label Dataflow jobs from Notebook #9885

KevinGG · 2019-10-25T18:14:40Z

Added a standalone interactive utils module that provides functionality to detect if current execution is in an interactive notebook environment.
Implicitly append a user label (goog-dataflow-notebook=$beam_version) to Dataflow jobs started from interactive notebook environment when the user does DataflowRunner().run_pipeline(p, options).
User '--labels' doesn't support character '.' or '"'. When applying version related label, replace '.' with '_' and avoid enclosing any label with '"'.
Fixed try-except condition in static imports of display_manager module. Future PRs will remove the usage of display_manager from interactive runner.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang	SDK	Apex	Dataflow	Gearpump	Samza	Spark
Go		---	---	---	---
Java
Python		---		---	---
XLang	---	---	---	---	---	---

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website
Non-portable
Portable	---		---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

KevinGG · 2019-10-25T20:55:19Z

sdks/python/apache_beam/pipeline.py

+      self.interactive = interactive
+    elif (type(self.runner).__module__
+          == 'apache_beam.runners.interactive.interactive_runner' and
+          type(self.runner).__name__ == 'InteractiveRunner'):


This is the difference from previous PR.
All runners are using "new-style" classes in Python, the type(obj).__module__/__name__ should always work. Please let me know if there would be backward incompatible cases.
Thanks!

KevinGG · 2019-10-28T17:32:29Z

Run Python PreCommit

KevinGG · 2019-10-28T20:36:28Z

Run Python PreCommit

KevinGG · 2019-10-29T18:41:15Z

Run Python PreCommit

KevinGG · 2019-10-29T22:40:27Z

Run Python PreCommit

KevinGG · 2019-10-30T18:32:09Z

R: @aaltay
R: @pabloem
Sorry for the previous rollback. I've removed the dependency of interactive_runner from pipeline, so there shouldn't be import issues now. Thanks!

aaltay · 2019-11-02T01:04:21Z

R: @pabloem could you make a first review pass?

pabloem · 2019-11-04T03:40:01Z

This is pretty much the same PR as before, except it checks the runner via string matching. It LGTM.

aaltay · 2019-11-05T22:26:34Z

sdks/python/apache_beam/pipeline.py

+    # A boolean value indicating whether the pipeline is created in an
+    # interactive environment such as interactive notebooks. Initialized as
+    # None. The value is set ad hoc when `pipeline.run()` is invoked.
+    self.interactive = None


I believe one more suggestion from the previous PR was to not keepting track of interactivity as part of the Pipeline but making it a property of the runner.

Thanks!
If we track interactive as a property of runner, we cannot implicitly pass along the property from runner to runner.

And if we deduce interactive from the environment, we'll introduce new dependencies into DataflowRunner. See below comment.

I'll go with the check environment route and make it a standalone utility module in the interactive package.

aaltay · 2019-11-05T22:30:26Z

sdks/python/apache_beam/pipeline.py

-  def run(self, test_runner_api=True):
-    """Runs the pipeline. Returns whatever our runner returns after running."""
-
+  def run(self, test_runner_api=True, runner=None, options=None,


Why are we adding runner and options parameters here?

IIRC, we want to allow the user to switch to DataflowRunner using the p.run() pattern instead of limiting the user to Runner().run_pipeline(p, options).

Do you think we should put it into a separate PR, or simply not supporting it at all?

Do you think we should put it into a separate PR,

Yes. At least let's have a seperate discussion for API changes like this.

or simply not supporting it at all?

Maybe not. This could be just a override for the interactive runners run() (e.g. run_with(NewRunner, NewOptions). At least, let's discuss with all stakeholders.

Putting this discussion on next Monday's agenda and will remove changes to the API.

aaltay · 2019-11-05T22:32:27Z

sdks/python/apache_beam/runners/dataflow/dataflow_runner.py

@@ -360,6 +360,16 @@ def visit_transform(self, transform_node):

  def run_pipeline(self, pipeline, options):
    """Remotely executes entire pipeline or parts reachable from node."""
+    # Label goog-dataflow-notebook if pipeline is initiated from interactive
+    # runner.
+    if pipeline.interactive:


The change could be limited to:

here detect, whether we are in interactive environment or not. (For example, check whether certain imports are loaded?) Could also be a utility method somewhere, to check whether this in an interactive method.

If yes, add the labels.

I see your point! Yes, I have the capability to check if current interpreted code is in a notebook or not. This branch will need a rebase against master to take those changes.

To clarify the process:
When a DataflowRunner tries to run a job from a given pipeline,

Check if the module interactive_environment is imported by checking the sys.modules dictionary;

Check if current_env().is_in_notebook;

If yes, label the job.

I think we have a little bit trade off here:

What we have here: Determining if the job is started from a pipeline that was originally bundled with an Interactive Runner.
Doing it with string comparison, we don't introduce new dependency into DataflowRunner.

Deduce if the job is started from a notebook environment.
We'll introduce [interactive] dependencies including at least ipython into DataflowRunner.
This will label Dataflow jobs from any pipeline originally bundled with arbitrary runner in any kind of ipython-notebook as long as interactive_environment module in interactive package has been (transitively) imported but not necessarily used.

Could we move https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/interactive_environment.py#L100:L102 to a common utility function, and each runner if they want could call this without worrying about require additional imports?

Discussed with David and Sam. Since we also want to track jobs started from notebook even if the user never uses InteractiveRunner, checking the environment might just be the only way to do it.
By putting the logic into a try-except block as it is, we could avoid introducing ipython dependency into DataflowRunner. If the [interactive] dependency is never installed and current execution_path has never imported ipython, the code would just never be executed.

I'll move the logic into a standalone utility module and import it in DataflowRunner to do the check.

1. Changed the pipeline.run() API to allow a runner and an option parameter so that a pipeline initially bundled w/ an interactive runner can be directly run by other runners from notebook. 2. Implicitly added the necessary source information through user labels when the user does p.run(runner=DataflowRunner(), options=options) or DataflowRunner().run_pipeline(p, options). 3. User '--labels' doesn't support character '.' or '"'. When applying version related label, replace '.' w/ '_'. Avoid enclosing any label with '"'.

KevinGG · 2019-11-07T23:43:00Z

Run Portable_Python PreCommit

… to a standalone utility module.

KevinGG · 2019-11-08T01:59:02Z

Run Portable_Python PreCommit

KevinGG · 2019-11-08T22:18:53Z

Run Portable_Python PreCommit

robertwb · 2019-11-11T18:13:38Z

sdks/python/apache_beam/runners/interactive/interactive_environment.py

-          self._is_in_notebook = True
-    except ImportError:
-      pass
+    self._is_in_ipython, self._is_in_notebook = is_interactive()


Conventionally, is_xxx functions return a boolean. Returning a pair will be especially surprising if one writes statements like if is_interactive() and the return value is (False, False) (which as a non-zero-length tuple evaluates to True.

Roger, will make it into 2 separate APIs.

KevinGG commented Oct 25, 2019

View reviewed changes

KevinGG force-pushed the BEAM-8457 branch 2 times, most recently from 74074cc to 5c6f2d4 Compare October 25, 2019 22:10

aaltay requested review from pabloem and aaltay November 2, 2019 01:04

aaltay requested changes Nov 5, 2019

View reviewed changes

KevinGG force-pushed the BEAM-8457 branch 4 times, most recently from b1b87e8 to e0aa5ba Compare November 7, 2019 23:09

Determine if in notebook by checking environment and moved such logic…

354fef9

… to a standalone utility module.

KevinGG force-pushed the BEAM-8457 branch from e0aa5ba to 354fef9 Compare November 8, 2019 01:48

moved utils.py to be a runner-common utility module.

3a8fddf

aaltay approved these changes Nov 8, 2019

View reviewed changes

pabloem approved these changes Nov 11, 2019

View reviewed changes

pabloem merged commit 7937317 into apache:master Nov 11, 2019

robertwb reviewed Nov 11, 2019

View reviewed changes

KevinGG deleted the BEAM-8457 branch December 16, 2019 18:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BEAM-8457] Label Dataflow jobs from Notebook #9885

[BEAM-8457] Label Dataflow jobs from Notebook #9885

KevinGG commented Oct 25, 2019 •

edited

KevinGG Oct 25, 2019

KevinGG commented Oct 28, 2019

KevinGG commented Oct 28, 2019

KevinGG commented Oct 29, 2019

KevinGG commented Oct 29, 2019

KevinGG commented Oct 30, 2019

aaltay commented Nov 2, 2019

pabloem commented Nov 4, 2019

aaltay Nov 5, 2019

KevinGG Nov 5, 2019 •

edited

KevinGG Nov 7, 2019

aaltay Nov 5, 2019

KevinGG Nov 5, 2019

aaltay Nov 7, 2019

KevinGG Nov 7, 2019

aaltay Nov 5, 2019

KevinGG Nov 5, 2019 •

edited

aaltay Nov 7, 2019

KevinGG Nov 7, 2019

KevinGG commented Nov 7, 2019

KevinGG commented Nov 8, 2019

KevinGG commented Nov 8, 2019

robertwb Nov 11, 2019

KevinGG Nov 11, 2019

[BEAM-8457] Label Dataflow jobs from Notebook #9885

[BEAM-8457] Label Dataflow jobs from Notebook #9885

Conversation

KevinGG commented Oct 25, 2019 • edited

Post-Commit Tests Status (on master branch)

Pre-Commit Tests Status (on master branch)

Choose a reason for hiding this comment

KevinGG commented Oct 28, 2019

KevinGG commented Oct 28, 2019

KevinGG commented Oct 29, 2019

KevinGG commented Oct 29, 2019

KevinGG commented Oct 30, 2019

aaltay commented Nov 2, 2019

pabloem commented Nov 4, 2019

Choose a reason for hiding this comment

KevinGG Nov 5, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KevinGG Nov 5, 2019 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KevinGG commented Nov 7, 2019

KevinGG commented Nov 8, 2019

KevinGG commented Nov 8, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KevinGG commented Oct 25, 2019 •

edited

KevinGG Nov 5, 2019 •

edited

KevinGG Nov 5, 2019 •

edited