[BEAM-1630] Adds ability to dynamically replace PTransforms during runtime. #3333

chamikaramj · 2017-06-08T22:07:17Z

Adds two new interfaces, PTransformMatcher and PTransformOverride.

Currently only supports replacements where input and output types are an exact match (we have to address complexities due to type hints before supporting replacements with different types).

This can be used to dynamically update a populated pipeline at runtime. Each runner can configure its own overrides.

This will be used by SplittableDoFn, where matching ParDo transforms will be dynamically replaced by SplittableParDo.
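
As a rough illustration of the idea in this description (not the code in this PR), the sketch below models an override as a matcher plus a replacement and applies it to a flat list of toy transforms. `ToyTransform`, `ToyParDo`, `ToySplittableParDo`, `ToyOverride`, and this `replace_all` are hypothetical stand-ins, not Beam classes; only the method names `get_matcher` and `get_replacement_transform` are taken from the diff under review.

```python
class ToyTransform(object):
    """Hypothetical stand-in for a PTransform; not a Beam class."""

    def __init__(self, label):
        self.label = label


class ToyParDo(ToyTransform):
    pass


class ToySplittableParDo(ToyTransform):
    pass


class ToyOverride(object):
    """Hypothetical override: pairs a matcher with a replacement."""

    def get_matcher(self):
        # The matcher decides which transforms should be replaced.
        return lambda transform: type(transform) is ToyParDo

    def get_replacement_transform(self, transform):
        # Keep the label so the replacement occupies the same slot.
        return ToySplittableParDo(transform.label)


def replace_all(transforms, overrides):
    """Applies each override, in order, to a flat list of transforms."""
    for override in overrides:
        matcher = override.get_matcher()
        transforms = [
            override.get_replacement_transform(t) if matcher(t) else t
            for t in transforms]
    return transforms


pipeline = [ToyTransform('Read'), ToyParDo('Process'), ToyTransform('Write')]
replaced = replace_all(pipeline, [ToyOverride()])
print([type(t).__name__ for t in replaced])
```

A real pipeline is a graph rather than a list, so the actual implementation also has to rewire inputs and outputs, which is what most of the review below is about.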

chamikaramj · 2017-06-08T22:07:55Z

R: @robertwb

chamikaramj · 2017-06-08T22:15:56Z

cc: @jkff (since SDF related)

coveralls · 2017-06-08T23:32:56Z

Coverage decreased (-0.007%) to 70.575% when pulling 578ff83 on chamikaramj:sdf_direct_runner_ptransform_override into 911bfbd on apache:master.

chamikaramj · 2017-06-09T00:54:00Z

Jenkins failure seems to be unrelated.

robertwb

Thanks, we're going to need this soon.

robertwb · 2017-06-09T18:51:19Z

sdks/python/apache_beam/pipeline.py

+
+    output_map = {}
+
+    class OutputVisitor(PipelineVisitor): # pylint: disable=used-before-assignment


Why is this called OutputVisitor?

Slightly updated. Now the first visitor (TransformUpdater) updates the transform, while the second visitor (InputOutputUpdater) determines which transforms need their inputs and outputs updated. After the two visits, I update the inputs and outputs. Trying to update inputs/outputs while visiting results in validation errors during the visit.
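
The "record while visiting, mutate afterwards" pattern described in this reply can be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's code: `Node` and `replace_outputs` are hypothetical, PCollections are represented by plain strings, and the two passes are ordinary loops rather than `PipelineVisitor` subclasses.

```python
class Node(object):
    """Hypothetical stand-in for an applied transform in the graph."""

    def __init__(self, name, inputs, output):
        self.name = name
        self.inputs = list(inputs)
        self.output = output


def replace_outputs(nodes, should_replace, make_output):
    # Pass 1: visit every node and only record old output -> new output.
    output_map = {}
    for node in nodes:
        if should_replace(node):
            output_map[node.output] = make_output(node)

    # Pass 2: visit every node and only record which nodes need rewiring.
    to_update = [
        node for node in nodes
        if node.output in output_map
        or any(i in output_map for i in node.inputs)]

    # After both visits: mutate inputs and outputs in one place, so no
    # node changes while the graph is still being traversed.
    for node in to_update:
        node.inputs = [output_map.get(i, i) for i in node.inputs]
        node.output = output_map.get(node.output, node.output)
    return output_map


nodes = [
    Node('Read', [], 'pc1'),
    Node('ParDo', ['pc1'], 'pc2'),
    Node('Write', ['pc2'], 'pc3'),
]
replace_outputs(
    nodes, lambda n: n.name == 'ParDo', lambda n: n.output + '_new')
print([(n.name, n.inputs, n.output) for n in nodes])
```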

robertwb · 2017-06-09T18:52:29Z

sdks/python/apache_beam/pipeline.py

+
+      def __init__(self, pipeline, applied_labels):
+        self.pipeline = pipeline
+        self._applied_labels = applied_labels


applied_labels is unused

Seems like we use it to perform transform validation at the following location. Not updating this properly results in errors due to label conflicts.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L235

Yes, but that's done via self.pipeline._remove_labels_recursively, not via this attribute.

robertwb · 2017-06-09T18:53:45Z

sdks/python/apache_beam/pipeline.py

+          replacement_transform = override.get_replacement_transform(
+              transform_node.transform)
+          inputs = transform_node.inputs
+          # We only support replacing single-input PTransforms.


Should there be a TODO/JIRA for more general replacements?

Added a TODO.

robertwb · 2017-06-09T18:54:19Z

sdks/python/apache_beam/pipeline.py

+              transform_node.transform)
+          inputs = transform_node.inputs
+          # We only support replacing single-input PTransforms.
+          assert len(inputs) == 1


Raise an informative NotImplementedError rather than an assert.
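
One reason to prefer an explicit exception here is that `assert` statements are skipped when Python runs with `-O`, while a raised error always fires and can carry a useful message. A minimal sketch, assuming a hypothetical helper `check_single_input` (the message text is modeled on the one that appears later in this review):

```python
def check_single_input(transform_label, inputs):
    """Hypothetical helper: rejects anything but exactly one input."""
    if len(inputs) != 1:
        raise NotImplementedError(
            'PTransform overriding is only supported for PTransforms that '
            'have a single input. Tried to replace %r that has %d inputs' %
            (transform_label, len(inputs)))
    return inputs[0]


print(check_single_input('MyParDo', ['pcoll']))
try:
    check_single_input('MyFlatten', ['a', 'b'])
except NotImplementedError as e:
    print(e)
```

Using `!= 1` rather than `> 1` also covers the zero-input case, so no separate assert is needed afterwards.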

robertwb · 2017-06-09T18:56:25Z

sdks/python/apache_beam/pipeline.py

+
+          new_output = replacement_transform.expand(inputs[0])
+
+          # Recording updated outputs. This cannot be done in the same visor


visor -> visitor?

robertwb · 2017-06-09T19:00:20Z

sdks/python/apache_beam/pipeline.py

+  def _check_replacement(self, override):
+    matcher = override.get_matcher()
+
+    class Visitor(PipelineVisitor):


More descriptive name?

robertwb · 2017-06-09T19:01:20Z

sdks/python/apache_beam/pipeline.py

+
+     Currently this only works for replacements where input and output types
+     are exactly the same.
+     TODO: Update this to also work for transform overrides where input and


Done: https://issues.apache.org/jira/browse/BEAM-2432

robertwb · 2017-06-09T19:01:43Z

sdks/python/apache_beam/pipeline.py

+      self._replace(override)
+
+    # Checking if the transforms have been successfully replaced.
+    for override in replacements:


Note about when this could happen (e.g. the ordering of replacements is important, right?)
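
One scenario in which such a post-replacement check could fail, offered only as an assumption about what the reviewer has in mind: an override applied later produces a transform that an earlier override would have matched. The toy `replace_all` below is hypothetical and works on plain strings instead of transforms.

```python
def replace_all(transforms, overrides):
    """Toy version: each override is a (matches, replace) pair."""
    for matches, replace in overrides:
        transforms = [replace(t) if matches(t) else t for t in transforms]
    # Check that nothing matching any override is left behind.
    for matches, _ in overrides:
        leftover = [t for t in transforms if matches(t)]
        if leftover:
            raise RuntimeError(
                'Transforms still match an override: %r' % leftover)
    return transforms


a_to_b = (lambda t: t == 'A', lambda t: 'B')
b_to_c = (lambda t: t == 'B', lambda t: 'C')

# A -> B runs first, so B -> C can pick up the result.
print(replace_all(['A', 'X'], [a_to_b, b_to_c]))

# B -> C runs first and finds nothing; A -> B then leaves a 'B' behind.
try:
    replace_all(['A', 'X'], [b_to_c, a_to_b])
except RuntimeError as e:
    print(e)
```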

robertwb · 2017-06-09T19:03:23Z

sdks/python/apache_beam/pipeline.py

@@ -564,3 +685,43 @@ def from_runner_api(proto, context):
          pc.tag = tag
    result.update_input_refcounts()
    return result
+
+
+class PTransformMatcher(object):


Is this worth its own class? Could it just be a method on PTransformOverride?

I think it's good to have a new class here so that we can reuse code and maintain a hierarchy of PTransformMatchers.

FWIW, the Pythonic way to do this would be to return a callable, not have a hierarchy of PTransformMatchers. (These callables could be put in a library for re-use if need be.)

Ok, makes sense. Removed the PTransformMatcher class.

robertwb · 2017-06-09T19:07:26Z

sdks/python/apache_beam/runners/direct/direct_runner.py

@@ -59,6 +65,9 @@ def apply_CombinePerKey(self, transform, pcoll):
  def run(self, pipeline):
    """Execute the entire pipeline and returns an DirectPipelineResult."""

+    # Performing configured PTransform overrides.
+    pipeline.replace_all(DirectRunner.PTRANSFORM_OVERRIDES)


Nit: I think this'd be less error prone (especially as we develop out more complicated inputs/outputs) if this was functional (i.e. returned a new pipeline object rather than mutating the existing one).

By cloning? I think this is OK since this is not a user feature (we just have to get this right for various variations of PTransforms). Also, I think replacing is better since the idea is to update the already built pipeline before running it.

chamikaramj

Thanks. PTAL.

chamikaramj · 2017-06-10T02:11:18Z

sdks/python/apache_beam/pipeline.py

+
+      def __init__(self, pipeline, applied_labels):
+        self.pipeline = pipeline
+        self._applied_labels = applied_labels


Seems like we use it to perform transform validation at the following location. Not updating this properly results in errors due to label conflicts.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L235

chamikaramj · 2017-06-10T02:12:36Z

sdks/python/apache_beam/pipeline.py

+
+    output_map = {}
+
+    class OutputVisitor(PipelineVisitor): # pylint: disable=used-before-assignment


Slightly updated. Now the first visitor (TransformUpdater) updates the transform, while the second visitor (InputOutputUpdater) determines which transforms need their inputs and outputs updated. After the two visits, I update the inputs and outputs. Trying to update inputs/outputs while visiting results in validation errors during the visit.

chamikaramj · 2017-06-10T02:14:06Z

sdks/python/apache_beam/pipeline.py

+          replacement_transform = override.get_replacement_transform(
+              transform_node.transform)
+          inputs = transform_node.inputs
+          # We only support replacing single-input PTransforms.


Added a TODO.

chamikaramj · 2017-06-10T02:18:00Z

sdks/python/apache_beam/pipeline.py

+              transform_node.transform)
+          inputs = transform_node.inputs
+          # We only support replacing single-input PTransforms.
+          assert len(inputs) == 1


chamikaramj · 2017-06-10T02:18:18Z

sdks/python/apache_beam/pipeline.py

+
+          new_output = replacement_transform.expand(inputs[0])
+
+          # Recording updated outputs. This cannot be done in the same visor


chamikaramj · 2017-06-10T02:42:10Z

sdks/python/apache_beam/pipeline.py

+  def _check_replacement(self, override):
+    matcher = override.get_matcher()
+
+    class Visitor(PipelineVisitor):


chamikaramj · 2017-06-10T02:43:01Z

sdks/python/apache_beam/pipeline.py

@@ -564,3 +685,43 @@ def from_runner_api(proto, context):
          pc.tag = tag
    result.update_input_refcounts()
    return result
+
+
+class PTransformMatcher(object):


I think it's good to have a new class here so that we can reuse code and maintain a hierarchy of PTransformMatchers.

chamikaramj · 2017-06-10T02:47:59Z

sdks/python/apache_beam/runners/direct/direct_runner.py

@@ -59,6 +65,9 @@ def apply_CombinePerKey(self, transform, pcoll):
  def run(self, pipeline):
    """Execute the entire pipeline and returns an DirectPipelineResult."""

+    # Performing configured PTransform overrides.
+    pipeline.replace_all(DirectRunner.PTRANSFORM_OVERRIDES)


By cloning? I think this is OK since this is not a user feature (we just have to get this right for various variations of PTransforms). Also, I think replacing is better since the idea is to update the already built pipeline before running it.

chamikaramj · 2017-06-10T02:53:20Z

sdks/python/apache_beam/pipeline.py

+      self._replace(override)
+
+    # Checking if the transforms have been successfully replaced.
+    for override in replacements:


chamikaramj · 2017-06-10T02:56:17Z

sdks/python/apache_beam/pipeline.py

+
+     Currently this only works for replacements where input and output types
+     are exactly the same.
+     TODO: Update this to also work for transform overrides where input and


Done: https://issues.apache.org/jira/browse/BEAM-2432

coveralls · 2017-06-10T04:32:25Z

Coverage increased (+0.1%) to 70.703% when pulling 13fd823 on chamikaramj:sdf_direct_runner_ptransform_override into 911bfbd on apache:master.

robertwb · 2017-06-13T00:01:27Z

sdks/python/apache_beam/pipeline.py

+
+      def __init__(self, pipeline, applied_labels):
+        self.pipeline = pipeline
+        self._applied_labels = applied_labels


Yes, but that's done via self.pipeline._remove_labels_recursively, not via this attribute.

robertwb · 2017-06-13T00:02:22Z

sdks/python/apache_beam/pipeline.py

+                'PTransform overriding is only supported for PTransforms that '
+                'have a single input. Tried to replace %r that has %d inputs',
+                transform_node, len(inputs))
+          assert len(inputs) == 1


Remove this assert, it's redundant with the if above (or could be, if it was equality rather than greater than).

robertwb · 2017-06-13T00:07:58Z

sdks/python/apache_beam/pipeline.py

+          # Recording updated outputs. This cannot be done in the same visor
+          # since if we dynamically update output type here, we'll run into
+          # errors when visiting child nodes.
+          output_map[transform_node.outputs[None]] = new_output


Expand is not limited to returning a single PCollection (for example, it could return a tuple of PCollections). It could also be a DoFn with multiple outputs, in which case only the main one has "label" None.

If we don't intend to handle the more general case yet, we should at least check.
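
A check of the kind asked for here might look like the sketch below. It assumes, as the comment above describes, that a node's outputs are keyed by tag with the main output under `None`; `main_output_or_fail` is a hypothetical helper, the outputs are plain strings, and the first half of the error message is a guess.

```python
def main_output_or_fail(label, outputs):
    """Hypothetical check: exactly one output, and it is tagged None."""
    if len(outputs) != 1 or None not in outputs:
        raise NotImplementedError(
            'PTransform overriding is only supported for PTransforms that '
            'have a single output. Tried to replace %r that has %d '
            'outputs.' % (label, len(outputs)))
    return outputs[None]


print(main_output_or_fail('ParDo', {None: 'pc'}))
try:
    # A multi-output transform: only the main output is tagged None.
    main_output_or_fail('MultiParDo', {None: 'main', 'side': 'extra'})
except NotImplementedError as e:
    print(e)
```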

chamikaramj

Thanks. PTAL.

chamikaramj · 2017-06-13T08:26:41Z

sdks/python/apache_beam/pipeline.py

+                'PTransform overriding is only supported for PTransforms that '
+                'have a single input. Tried to replace %r that has %d inputs',
+                transform_node, len(inputs))
+          assert len(inputs) == 1


chamikaramj · 2017-06-13T08:27:18Z

sdks/python/apache_beam/pipeline.py

@@ -564,3 +685,43 @@ def from_runner_api(proto, context):
          pc.tag = tag
    result.update_input_refcounts()
    return result
+
+
+class PTransformMatcher(object):


Ok, makes sense. Removed the PTransformMatcher class.

chamikaramj · 2017-06-13T08:37:08Z

sdks/python/apache_beam/pipeline.py

+          # Recording updated outputs. This cannot be done in the same visor
+          # since if we dynamically update output type here, we'll run into
+          # errors when visiting child nodes.
+          output_map[transform_node.outputs[None]] = new_output


chamikaramj · 2017-06-13T08:41:47Z

sdks/python/apache_beam/pipeline.py

+
+      def __init__(self, pipeline, applied_labels):
+        self.pipeline = pipeline
+        self._applied_labels = applied_labels


coveralls · 2017-06-13T09:51:02Z

Coverage increased (+0.003%) to 70.658% when pulling 644f642 on chamikaramj:sdf_direct_runner_ptransform_override into fe3d554 on apache:master.

robertwb

LGTM, thanks.

Couldn't comment directly, but regarding "By cloning? I think this is OK since this is not a user feature (we just have to get this right for various variations of PTransforms). Also, I think replacing is better since the idea is to update the already built pipeline before running it." yes, I was thinking by cloning, so one doesn't accidentally carry any state, etc. that belonged to the old structure (especially if things change in the future). The danger of "we just have to get this right" applies to ourselves as well as end users. This is just a general comment, no need to change it now if you're confident with what you have, but just wanted to point out it's a danger.
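
To make the trade-off concrete, here is a minimal sketch of the two styles on a toy object. `ToyPipeline` and both of its methods are hypothetical; whether a real pipeline can be cheaply or safely deep-copied is an open assumption, not something this thread establishes.

```python
import copy


class ToyPipeline(object):
    """Hypothetical stand-in for a pipeline: an ordered list of labels."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def replace_all_in_place(self, mapping):
        # Mutating style: the caller's object changes underneath them.
        self.transforms = [mapping.get(t, t) for t in self.transforms]

    def with_replacements(self, mapping):
        # Functional style: work on a deep copy and return it, so no
        # state is shared with the original structure.
        result = copy.deepcopy(self)
        result.replace_all_in_place(mapping)
        return result


original = ToyPipeline(['Read', 'ParDo', 'Write'])
updated = original.with_replacements({'ParDo': 'SplittableParDo'})
print(original.transforms)
print(updated.transforms)
```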

robertwb · 2017-06-13T16:09:26Z

sdks/python/apache_beam/pipeline.py

+                'has a single output. Tried to replace %r that has %d outputs.'
+                , transform_node, len(transform_node.outputs))
+
+          if type(new_output) is tuple:


This is not sufficient: lists or dicts (or other types) could be returned. Only allow a single PCollection (and likewise check that the previous output was None -> PCollection)?
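
The suggestion amounts to a positive check (accept exactly one PCollection) instead of a list of container types to reject. A minimal sketch, where `PCollection` is a local stand-in class rather than Beam's and `require_single_pcollection` is a hypothetical helper:

```python
class PCollection(object):
    """Hypothetical stand-in for Beam's PCollection class."""


def require_single_pcollection(new_output):
    # Accept exactly one PCollection and reject everything else, rather
    # than enumerating tuple, list, dict and so on.
    if not isinstance(new_output, PCollection):
        raise NotImplementedError(
            'Expected the replacement to produce a single PCollection, '
            'got %s' % type(new_output).__name__)
    return new_output


pc = PCollection()
print(require_single_pcollection(pc) is pc)
for bad in ((PCollection(),), [PCollection()], {None: PCollection()}):
    try:
        require_single_pcollection(bad)
    except NotImplementedError as e:
        print(e)
```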

robertwb · 2017-06-13T16:14:07Z

sdks/python/apache_beam/pipeline_test.py

+
+    class MyParDoMatcher(object):
+
+      def __call__(self, applied_ptransform):


Alternatively, you could define this as

    def my_par_do_matcher(applied_ptransform):
      return isinstance(applied_ptransform.transform, DoubleParDo)

and then below you would have

return my_par_do_matcher

(The lack of first-class functions is what necessitates a class in Java.)
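
A runnable version of this suggestion might look like the sketch below. `DoubleParDo` is the test transform named in the comment; `OtherTransform`, `AppliedPTransform`, and `MyParDoOverride` are hypothetical minimal stand-ins defined here so the snippet runs without Beam.

```python
class DoubleParDo(object):
    """Stand-in for the test transform named in the comment above."""


class OtherTransform(object):
    pass


class AppliedPTransform(object):
    """Hypothetical wrapper exposing only a .transform attribute."""

    def __init__(self, transform):
        self.transform = transform


def my_par_do_matcher(applied_ptransform):
    return isinstance(applied_ptransform.transform, DoubleParDo)


class MyParDoOverride(object):

    def get_matcher(self):
        # Return the function itself; no matcher class is needed.
        return my_par_do_matcher


matcher = MyParDoOverride().get_matcher()
print(matcher(AppliedPTransform(DoubleParDo())))
print(matcher(AppliedPTransform(OtherTransform())))
```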

chamikaramj

Thanks for the comments.

chamikaramj · 2017-06-13T17:23:49Z

sdks/python/apache_beam/pipeline.py

+                'has a single output. Tried to replace %r that has %d outputs.'
+                , transform_node, len(transform_node.outputs))
+
+          if type(new_output) is tuple:


chamikaramj · 2017-06-13T17:23:57Z

sdks/python/apache_beam/pipeline_test.py

+
+    class MyParDoMatcher(object):
+
+      def __call__(self, applied_ptransform):


coveralls · 2017-06-13T18:30:01Z

Coverage increased (+0.003%) to 70.658% when pulling 4c382de on chamikaramj:sdf_direct_runner_ptransform_override into fe3d554 on apache:master.

robertwb reviewed Jun 9, 2017

chamikaramj commented Jun 10, 2017

chamikaramj added 2 commits June 12, 2017 12:53

Addressing reviewer comments.

2ffd509

robertwb reviewed Jun 13, 2017

Addressing reviewer comments.

644f642

chamikaramj force-pushed the sdf_direct_runner_ptransform_override branch from 13fd823 to 644f642 Compare June 13, 2017 08:35

chamikaramj commented Jun 13, 2017

robertwb approved these changes Jun 13, 2017

Addressing reviewer comments.

4c382de

chamikaramj commented Jun 13, 2017

asfgit closed this in 7d0f24a Jun 13, 2017

robertwb mentioned this pull request Jun 14, 2017

[BEAM-2421] Python streaming Create override as a composite of Impulse and a DoFn #3349

Closed

4 tasks

kennknowles mentioned this pull request Jun 3, 2022

Support PTransform overriding when input and output types are different #18444

Open

