
[BEAM-9639][BEAM-9608] Improvements for FnApiRunner #11270

Merged
merged 4 commits into apache:master from fn-ref-more on Apr 21, 2020

Conversation

pabloem (Member) commented Mar 30, 2020:

r: @robertwb

These changes are mostly reshuffling of code.

Each commit is a logical unit, so each commit can be reviewed separately. The commit message explains what each commit does:

  • [BEAM-9608] BundleManagers use BundleContextManager for configuration - modifies the BundleManagers to receive only a BundleContextManager carrying most of their configuration, and adds a dry_run option for processing bundles without writing to pcoll_buffers.
  • [BEAM-9639] Storing side inputs after producer execution, not before consumption - ensures that side inputs are stored in state right after they are computed, not just before they are consumed. This will be useful in streaming, so that each bundle's inputs are eagerly available.
    • Ensuring downstream side inputs are calculated on fully expanded graph - an accessory to the previous commit: it ensures that during graph translation, downstream_side_inputs are annotated after SDFs are expanded.
  • [BEAM-9639] Separate Stage and Bundle execution. Improve typing. - separates the code for executing a stage (context creation, committing of side inputs, scheduling bundles until there are no deferred inputs) from the code for executing a single bundle of that stage (pushing data to the worker, collecting the bundle's deferred inputs).
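The dry_run idea from the first commit above can be sketched as follows. This is a hypothetical, heavily simplified illustration (the function and variable names are stand-ins, not the actual FnApiRunner API): the bundle is still processed and its outputs returned, but nothing is written to the shared pcoll_buffers.

```python
def process_bundle(bundle_context, pcoll_buffers, dry_run=False):
    # Stand-in for running the bundle on a worker and collecting its outputs.
    outputs = {"out_pcoll": [b"elem1", b"elem2"]}
    if not dry_run:
        # Only a real run commits the outputs to the pcollection buffers.
        for pcoll_id, elements in outputs.items():
            pcoll_buffers.setdefault(pcoll_id, []).extend(elements)
    return outputs
```

With dry_run=True the caller still sees the bundle's outputs, but the buffers stay untouched, which is what makes it safe to "rehearse" a bundle without side effects.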

Notes/todos from #11229

  • It's odd that _run_stage no longer takes the stage to run as a parameter. Perhaps bundle_context_manager (and its class?) should be named stage_context or similar?
  • On this note, perhaps it makes sense to break FnApiRunner into a (mostly stateless) runner that can execute multiple pipelines, and an executor (with methods like run_stage) that might be stateful and is initialized with, and tasked with running, a single pipeline. Much of what is in context(s) would become state of self in this new object.
  • I was bitten by the fact that it is an error to access process_bundle_descriptor before _extract_outputs is called. This should be clearly documented (and similarly for the other lazy attributes in this class). Given that it is called externally to this class and can't easily be checked, this makes me wonder whether the boundary of encapsulation needs to be adjusted here.
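One way to address the last note above is to guard the lazy attribute so that early access fails loudly with a clear message. This is only a sketch with hypothetical, simplified names, not the actual class from the PR:

```python
class BundleContextManager:
    """Sketch of guarding a lazily-populated attribute."""

    def __init__(self):
        self._process_bundle_descriptor = None

    def _extract_outputs(self):
        # In the real class this would also compute stage outputs; here we
        # only mark the descriptor as available.
        self._process_bundle_descriptor = object()

    @property
    def process_bundle_descriptor(self):
        # Fail with an explicit message instead of surfacing a confusing
        # downstream error when the attribute is read too early.
        if self._process_bundle_descriptor is None:
            raise RuntimeError(
                "process_bundle_descriptor is only valid after "
                "_extract_outputs() has been called.")
        return self._process_bundle_descriptor
```

The property turns a silent ordering bug into an immediate, self-describing error, which also serves as documentation of the attribute's lifecycle.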



@pabloem pabloem force-pushed the fn-ref-more branch 2 times, most recently from df9cc9b to f445a8d Compare March 31, 2020 22:11
pabloem (Member, Author) commented Mar 31, 2020:

Run Portable_Python PreCommit

pabloem (Member, Author) commented Apr 1, 2020:

Run Python PreCommit

@pabloem pabloem changed the title [WIP] Improvements for FnApiRunner [BEAM-9639][BEAM-9608] Improvements for FnApiRunner Apr 1, 2020
@pabloem pabloem marked this pull request as ready for review April 1, 2020 15:29
@pabloem pabloem requested a review from robertwb April 1, 2020 18:47
pabloem (Member, Author) commented Apr 1, 2020:

this latest commit is purely aesthetic

pabloem (Member, Author) commented Apr 1, 2020:

Run Python PreCommit

pabloem (Member, Author) commented Apr 7, 2020:

@robertwb ptal

pabloem (Member, Author) commented Apr 10, 2020:

attempting to rebase. let's see how far that takes me...

@pabloem pabloem force-pushed the fn-ref-more branch 2 times, most recently from ecf34ae to 82828b3 Compare April 11, 2020 18:04
pabloem (Member, Author) commented Apr 11, 2020:

Run Python PreCommit

2 similar comments
pabloem (Member, Author) commented Apr 11, 2020:

Run Python PreCommit

pabloem (Member, Author) commented Apr 11, 2020:

Run Python PreCommit

robertwb (Contributor) left a comment:

I've read through the entire PR; here are some initial comments.

Thanks for splitting things into logical commits. (FWIW, you can put the descriptions into the commits as well.)

@@ -326,8 +327,8 @@ def commit_side_inputs_to_state(
data_side_input, # type: DataSideInput
):
# type: (...) -> None
for (consuming_transform_id, tag), (buffer_id, func_spec) \
in data_side_input.items():
for (consuming_transform_id, tag), (buffer_id,
robertwb (Contributor):

I wonder if

((consuming_transform_id, tag), (buffer_id, func_spec))

would make both yapf and humans happy.
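The fully parenthesized target list suggested above is valid Python and unambiguous to read. A small self-contained illustration (the dictionary contents here are made up for the example):

```python
# Hypothetical data in the shape of a DataSideInput mapping:
# (consumer transform id, tag) -> (buffer id, access-pattern spec).
data_side_input = {
    ("ParDo(Consume)", "side0"): (b"buffer-1", "iterable-access"),
}

pairs = []
# Parenthesizing the whole target list makes the nested unpacking
# explicit without a backslash continuation.
for ((consuming_transform_id, tag),
     (buffer_id, func_spec)) in data_side_input.items():
    pairs.append((consuming_transform_id, tag, buffer_id, func_spec))
```

(As the reply below notes, in practice the formatter still chose a different wrapping for the real code.)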

pabloem (Member, Author):

it did not : ( hehe

translations.create_buffer_id(pcoll), access_pattern)
self.execution_context.commit_side_inputs_to_state(data_side_input)

def extract_bundle_inputs(self):
robertwb (Contributor):

extract_bundle_inputs_and_outputs?

pabloem (Member, Author):

done

for timer_family_id in payload.timer_family_specs.keys():
expected_timer_output[(transform.unique_name, timer_family_id)] = (
create_buffer_id(timer_family_id, 'timers'))
return data_input, data_output, expected_timer_output
robertwb (Contributor):

Update docs to match.

pabloem (Member, Author):

done

@@ -367,6 +413,73 @@ def _build_process_bundle_descriptor(self):
state_api_service_descriptor=self.state_api_service_descriptor(),
timer_api_service_descriptor=self.data_api_service_descriptor())

def commit_output_views_to_state(self):
robertwb (Contributor):

Not sure what "output views" means. Maybe call this commit_side_inputs_to_state as well?

pabloem (Member, Author):

done


def fuse(self, other):
# type: (Stage) -> Stage
return Stage(
"(%s)+(%s)" % (self.name, other.name),
self.transforms + other.transforms,
union(self.downstream_side_inputs, other.downstream_side_inputs),
self._fuse_downstream_side_inputs(other),
robertwb (Contributor):

Nit: this sounds like it mutates self.

pabloem (Member, Author):

Renamed to _get_fused_downstream_side_inputs. Thoughts?

# SideInputId is identified by a consumer ParDo + tag.
SideInputId = Tuple[str, str]

DataSideInput = Dict[SideInputId,
robertwb (Contributor):

What does the value represent?

pabloem (Member, Author):

The value is a tuple with the encoded data. I've moved these to translations.py, and updated the comment

res = dict(self.downstream_side_inputs)
for si, other_si_ids in other.downstream_side_inputs.items():
if si in res:
res[si] = union(res[si], other_si_ids)
robertwb (Contributor):

So this is actually a dict mapping to sets?

pabloem (Member, Author):

Woah, this is a bug. downstream_side_inputs is a dictionary mapping to dictionaries:

Dict[output PCollection, Dict[side input ID, access pattern]]

where a side input ID is Tuple[consumer ptransform, input index]. Added the appropriate annotations, and fixed the bug.
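Given that shape, the fix is to merge the inner dictionaries rather than treat the values as opaque sets. A sketch under those assumptions (hypothetical helper name; the real code lives in the fuse path of translations.py):

```python
def merge_downstream_side_inputs(a, b):
    """Merge two Dict[pcoll_id, Dict[SideInputId, access_pattern]] maps.

    SideInputId is a (consumer transform id, tag) tuple.
    """
    # Copy the inner dicts so neither input mapping is mutated.
    res = {pcoll: dict(side_inputs) for pcoll, side_inputs in a.items()}
    for pcoll, side_inputs in b.items():
        # setdefault + update merges the inner dicts key by key,
        # instead of overwriting one whole inner dict with the other.
        res.setdefault(pcoll, {}).update(side_inputs)
    return res
```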


class Stage(object):
"""A set of Transforms that can be sent to the worker for processing."""
def __init__(self,
name, # type: str
transforms, # type: List[beam_runner_api_pb2.PTransform]
downstream_side_inputs=None, # type: Optional[FrozenSet[str]]
downstream_side_inputs=None, # type: Optional[Dict[str, SideInputId]]
robertwb (Contributor):

The goal of this (which, yes, should have been better documented) is to quickly be able to prohibit fusion. But the reason we defined our own union was so that memory didn't grow as O(n^2) in the common case because many stages were able to share this set (rather than have their own copy). These changes seem to break that.

Also, could you clarify why this was made into a dict?

pabloem (Member, Author):

Hm, so this change does break that, and the memory requirements would be larger. I would think they would not be too bad, since most graphs don't have many side inputs going many places. What do you think? I'm willing to find a better solution for this, but I wonder if it's worth the extra time.

The reason this was made into a dict is to carry more information about downstream side inputs: specifically, which transforms will consume them. This is used to commit the side inputs to state after they are calculated (rather than before they are consumed), which will be necessary for streaming, because side inputs will need to be added to state as they are computed.

robertwb (Contributor):

Discussed offline, but capturing here for the record. These sets contain the transitive collection of everything downstream of any side-input consuming transform, and as such can be large even if the total number of side inputs is small. (The number of distinct such sets is about the same as the number of side inputs, so we keep the total memory use down by re-using them--to give each transform its own copy would easily be O(n^2).)

Your change of computing the side input mapping after the graph has been fused is good (and arguably better, as you only need the immediate consumers, and don't have to re-compute each time a stage is fused).
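The memory-sharing union described above can be sketched as follows. This is only an illustration of the idea, not the actual Beam helper: when one side is empty or both sides are equal, the existing frozenset is returned unchanged, so many stages can share a single set object instead of each holding an O(n) copy.

```python
def union(a, b):
    # Reuse an existing frozenset whenever possible so that fused stages
    # share one object rather than each allocating a copy (avoiding the
    # O(n^2) blow-up discussed above).
    if not a:
        return b
    if not b or a == b:
        return a
    # Only allocate a new set when the contents genuinely differ.
    return a | b
```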

@@ -914,14 +898,16 @@ def process_bundle(self,
expected_outputs, # type: DataOutput
fired_timers, # type: Mapping[Tuple[str, str], PartitionableBuffer]
expected_output_timers # type: Dict[str, Dict[str, str]]
dry_run=False
robertwb (Contributor):

This should be the default, we shouldn't have to pass it.

pabloem (Member, Author):

Done

pabloem (Member, Author) commented Apr 15, 2020:

Run Python PreCommit

1 similar comment
pabloem (Member, Author) commented Apr 15, 2020:

Run Python PreCommit

pabloem (Member, Author) commented Apr 15, 2020:

The failed test is the streaming wordcount test.

pabloem (Member, Author) commented Apr 15, 2020:

Run Python PreCommit

1 similar comment
pabloem (Member, Author) commented Apr 15, 2020:

Run Python PreCommit

pabloem (Member, Author) commented Apr 18, 2020:

Building the side input index elsewhere. LMK what you think.

pabloem (Member, Author) commented Apr 18, 2020:

Run Python PreCommit

1 similar comment
pabloem (Member, Author) commented Apr 18, 2020:

Run Python PreCommit

robertwb (Contributor) left a comment:

Thanks, LGTM.

pabloem (Member, Author) commented Apr 21, 2020:

Run Python2_PVR_Flink PreCommit

@pabloem pabloem merged commit 1fe543e into apache:master Apr 21, 2020
@pabloem pabloem deleted the fn-ref-more branch April 21, 2020 19:37
@@ -240,6 +240,30 @@ def test_multimap_side_input(self):
lambda k, d: (k, sorted(d[k])), beam.pvalue.AsMultiMap(side)),
equal_to([('a', [1, 3]), ('b', [2])]))

def test_multimap_multiside_input(self):
Boyuan (Contributor):

This test breaks Spark VR test: https://issues.apache.org/jira/browse/BEAM-9862. Please either support the same function for Spark or sickbay it.

Contributor:

Thanks for reporting Boyuan, this was a flaw with the Spark runner. Fix: #11644
