[BEAM-1630] Adds support for processing Splittable DoFns using DirectRunner. #4064
Conversation
R: @charlesccychen (for an overall review, including DirectRunner-specific logic) cc: @robertwb
Run Python PreCommit
Fixed Cython errors. All tests should pass now. PTAL.
Friendly ping :)
Thanks! Please see my comments. It would also be helpful if @jkff can further review the SDF-specific logic in this change.
@@ -191,8 +194,6 @@ def __init__(self, evaluation_context, applied_ptransform,
     self._execution_context = evaluation_context.get_execution_context(
         applied_ptransform)
     self.scoped_metrics_container = scoped_metrics_container
-    with scoped_metrics_container:
-      self.start_bundle()
Thanks for this change.
👍
class OutputProcessor(object):

  def process_outputs(self, windowed_input_element, results):
    raise NotImplementedError
What is the motivation for this change?
We need to pass in a custom OutputProcessor when invoking SDF.process() instead of using the default output processor, since output has to be handled by ProcessFn.
Args:
  do_fn: A DoFn object that contains the method.
  obj: the object that contains the method.
Can you rename this and document what type of object this should be, and also validate this in the constructor if possible? "obj" seems too generic
Done.
"""A transform that assigns a unique key to each element.""" | ||
|
||
def process(self, element, window=beam.DoFn.WindowParam, *args, **kwargs): | ||
yield (uuid.uuid4().bytes, element) |
Here and in other places, we seem to rely on uuid.uuid4() to give unique values for correctness (for example, when writing shards for file I/O). Can you add comments detailing this assumption here and elsewhere?
Added TODOs to here and iobase._WriteBundleDoFn. I think collisions are extremely rare for uuid.uuid4() though. Also, added an assertion to force a failure here if a collision is detected.
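The pattern under discussion can be sketched as follows. This is a minimal illustration, not the actual Beam code; `assign_unique_key` and the per-bundle `seen_keys` set are hypothetical names:

```python
import uuid

def assign_unique_key(element, seen_keys):
    """Pair an element with a uuid4-derived key, failing loudly in the
    (extremely unlikely) event of a collision."""
    key = uuid.uuid4().bytes
    # Collisions are improbably rare, but fail fast if one occurs.
    assert key not in seen_keys, 'uuid4 collision detected'
    seen_keys.add(key)
    return (key, element)

seen = set()
keyed = [assign_unique_key(e, seen) for e in ['a', 'b', 'c']]
```

The assertion costs one set lookup per element and turns a silent data-loss bug into an immediate failure.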
"""An evaluator for sdf_direct_runner.ProcessElements transform.""" | ||
|
||
DEFAULT_MAX_NUM_OUTPUTS = 100 | ||
DEFAULT_MAX_DURATION = 1 |
Can you please document these flags?
Done.
@@ -529,16 +530,19 @@ def start_bundle(self):
     self._counter_factory = counters.CounterFactory()

-    # TODO(aaltay): Consider storing the serialized form as an optimization.
-    dofn = pickler.loads(pickler.dumps(transform.dofn))
+    dofn = transform.dofn
We had this line here to ensure that the DoFn was serializable, so that a user would not hit any issues when running later on remote runners. Can you describe the reason for this change?
I ran into issues since we are piggybacking StepContext and SDF DoFnInvoker objects in the DoFn. I added an option to disable this pickling only for the SDF case.
@@ -826,3 +830,74 @@ def finish_bundle(self):
        None, '', TimeDomain.WATERMARK, WatermarkManager.WATERMARK_POS_INF)

    return TransformResult(self, [], [], None, {None: hold})


class _ProcessElemenetsEvaluator(_TransformEvaluator):
"Elements"
Done.
class SDFProcessElementInvoker(object):
  """A utility that requsts checkpoints.
"requests"
Done.
residual_range = (
    (self._range.start, self._range.stop)
    if self._current_position is None
    else (self._current_position + 1, self._range.stop))
Can you please refactor this into the more readable:
if self._current_position is None:
  residual_range = (self._range.start, self._range.stop)
else:
  residual_range = (self._current_position + 1, self._range.stop)
Done.
return [MyParDoOverride()]

from apache_beam.runners.direct import direct_runner
direct_runner._get_transform_overrides = get_overrides
This sort of monkey-patching may not be safe; subsequent tests in the same process will get your modified version of _get_transform_overrides. Consider using a @mock.patch('apache_beam.runners.direct.direct_runner._get_transform_overrides') annotation on this method (that machinery will restore the original value after this test case finishes).
Alternatively, you can create the DirectRunner object and directly patch runner._ptransform_overrides instead, which might be cleaner.
Done.
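The suggested `mock.patch` approach can be sketched without any Beam dependency. Here `direct_runner` is a stand-in namespace, not the real `apache_beam.runners.direct.direct_runner` module:

```python
from unittest import mock
import types

# A stand-in for the module being patched (hypothetical; the real target
# would be apache_beam.runners.direct.direct_runner).
direct_runner = types.SimpleNamespace(
    _get_transform_overrides=lambda: ['original'])

def get_overrides():
    return ['patched']

# mock.patch.object restores the original attribute when the context
# exits, so other tests in the same process are unaffected.
with mock.patch.object(direct_runner, '_get_transform_overrides',
                       get_overrides):
    assert direct_runner._get_transform_overrides() == ['patched']
```

Outside the `with` block, the original function is automatically back in place, which is exactly the leak the reviewer is warning about with bare assignment.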
OffsetRange(10, 100)

with self.assertRaises(ValueError):
  OffsetRange(10, 9)
Can you also test OffsetRange(10, 10) above to make sure we get the corner case?
Done (note that this is not a failure case).
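The corner case can be illustrated with a simplified sketch of the class (not the actual Beam implementation): an empty range where start equals stop is valid, and only an inverted range is rejected.

```python
class OffsetRange(object):
    """Simplified sketch: start must not exceed stop."""

    def __init__(self, start, stop):
        if start > stop:
            raise ValueError(
                'start (%d) must not be greater than stop (%d)' %
                (start, stop))
        self.start = start
        self.stop = stop

# The empty range is valid; only start > stop fails.
OffsetRange(10, 10)
try:
    OffsetRange(10, 9)
    inverted_rejected = False
except ValueError:
    inverted_rejected = True
```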
    if self._current_position is None
    else (self._current_position + 1, self._range.stop))
# If self._current_position is 'None' no records have been claimed so
# residual should start from self._range.start.
Move this comment to before the if above.
Done.
# residual should start from self._range.start.
end_position = (
    self._range.start if self._current_position is None
    else self._current_position + 1)
Same as above.
Done.
Thanks! Overall looks very good.
@@ -423,8 +423,8 @@ def __init__(self, file_to_write):

def start_bundle(self):
  assert self.file_to_write
  # Appending a UUID to create a unique file object per invocation.
  self.file_to_write += str(uuid.uuid4())
  self.file_obj = open(self.file_to_write, 'w')
Is this related to the current PR?
A version of a test that wrote a large number of files failed due to not having this fix.
return self.start == other.start and self.stop == other.stop

def __ne__(self, other):
Is this necessary? Doesn't the default implementation of "ne" just negate "eq"?
Removed.
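For context on why the explicit `__ne__` can be dropped: in Python 3 the default `__ne__` returns the negation of `__eq__`, so defining `__eq__` alone is enough (this was not true in Python 2, where both had to be defined). A minimal sketch with a hypothetical class name:

```python
class OffsetPair(object):
    """Sketch: in Python 3, defining __eq__ alone is sufficient; the
    default __ne__ negates __eq__ automatically."""

    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def __eq__(self, other):
        return self.start == other.start and self.stop == other.stop
```

With only `__eq__` defined, `OffsetPair(1, 2) != OffsetPair(1, 3)` evaluates to `True` as expected.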
class OffsetRestrictionTracker(RestrictionTracker):
  """An `iobase.RestrictionTracker` implementations for byte offsets."""
Does it have to be bytes? It can be any kind of integer indices, no?
Done.
return False

def checkpoint(self):
  with self._lock:
Other methods should also take the lock, e.g. current_restriction, start/stop_position.
Done.
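The locking discipline the reviewer asks for can be sketched as follows. This is a simplified stand-in (hypothetical `TrackerSketch`, not the actual `OffsetRestrictionTracker`), showing the same lock guarding every method that touches tracker state:

```python
import threading

class TrackerSketch(object):
    """Sketch: every state-touching method takes the same lock, not just
    checkpoint(), so concurrent calls observe a consistent range."""

    def __init__(self, start, stop):
        self._lock = threading.Lock()
        self._start = start
        self._stop = stop
        self._current_position = None

    def current_restriction(self):
        with self._lock:
            return (self._start, self._stop)

    def try_claim(self, position):
        with self._lock:
            if self._start <= position < self._stop:
                self._current_position = position
                return True
            return False

    def checkpoint(self):
        with self._lock:
            # The unclaimed suffix becomes the residual; the primary
            # restriction is truncated at the split point.
            split = (self._start if self._current_position is None
                     else self._current_position + 1)
            residual = (split, self._stop)
            self._stop = split
            return residual
```

Since no method calls another lock-holding method, a plain (non-reentrant) `Lock` suffices.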
def try_claim(self, position):
  with self._lock:
    self._last_claim_attempt = position
    if position >= self._range.start and position < self._range.stop:
"position < self._range.start" should be an error rather than just a failed claim.
Done.
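The distinction the reviewer draws can be sketched like this (hypothetical `OffsetTrackerSketch`, not the Beam class): a claim below the range start is a programming error and raises, while a position at or past the stop is merely a failed claim.

```python
import threading

class OffsetTrackerSketch(object):
    """Sketch: positions before the range start raise ValueError;
    positions at or beyond stop simply fail to be claimed."""

    def __init__(self, start, stop):
        self._lock = threading.Lock()
        self._start, self._stop = start, stop
        self._current_position = None

    def try_claim(self, position):
        with self._lock:
            if position < self._start:
                raise ValueError(
                    'Position %d is before the range start %d' %
                    (position, self._start))
            if position < self._stop:
                self._current_position = position
                return True
            return False
```

Raising instead of returning False makes a caller bug (claiming backwards, out of the restriction) visible immediately rather than silently ending processing.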
from apache_beam.transforms.window import TimestampedValue


class ReadFilesProvider(RestrictionProvider):
Hm, why not use the simpler example of generating a range of numbers? (this one's fine too)
I think it's good to have a slightly more complex one as an illustration until we have actual SDFs.
with TestPipeline() as p:
  pc1 = (p
         | 'Create1' >> beam.Create(file_names)
         | 'SDF' >> beam.ParDo(ReadFiles(resume_count)))
Is it possible to verify that we really resume the requested number of times? In Java I did that by emitting an element to a side output once per ProcessElement call; not sure if your implementation supports side outputs yet.
Currently, transform overriding only supports transforms with a single output, so this cannot be done. Added a TODO for this verification (I verified it manually, BTW).
if isinstance(element.value, KeyedWorkItem):
  encoded_k = element.value.encoded_key
else:
  assert isinstance(element.value, tuple)
Document what kind of tuple we expect it to be and why?
Done.
# limitations under the License.
#

"""This module contains Splittable DoFn logic that's common to all runners."""
In Python, I think I'd recommend starting with only the direct runner, because I'm not sure the contents of this file will be reusable for the Fn API implementation, and there won't be other runners implementing SDF directly, right?
This simply contains the core transforms (PairWithRestrictionFn, SplitRestrictionFn, ProcessElements()). This flow will be common to the Fn API based implementation as well, no? All the implementation details other than that are in sdf_direct_runner.py.
keyed_elements = (pcoll
                  | 'pair' >> ParDo(PairWithRestrictionFn(sdf))
                  | 'split' >> ParDo(SplitRestrictionFn(sdf))
                  | 'explode' >> ParDo(ExplodeWindowsFn())
Ah, you are exploding windows! I think somewhere above there was code that handled multiple windows...
Not sure I get this. Are you saying that I won't be getting a WindowedValue in ProcessFn since I explode windows here?
Thanks. PTAL.
self._par_do_evaluator = _ParDoEvaluator(
    evaluation_context, applied_ptransform, input_committed_bundle,
    side_inputs, scoped_metrics_container)
Technically, we only have one evaluator here, which is _ProcessElementsEvaluator; _ParDoEvaluator is used as a library. We are simply using _ParDoEvaluator to evaluate a ParDo where the DoFn object is the ProcessFn.
If we decide to duplicate that code, this will involve copying a significant amount of it (_ParDoEvaluator's start_bundle(), process(), and finish_bundle()), which I'd prefer to avoid.
Also, note that we use a similar implementation for the Java SDK: https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/SplittableProcessElementsEvaluatorFactory.java#L112
WDYT?
if self._max_num_outputs and output_count >= self._max_num_outputs:
  initiate_checkpoint()

tracker.check_done()
Yeah, this is similar to the Java implementation.
"""An evaluator for sdf_direct_runner.ProcessElements transform.""" | ||
|
||
DEFAULT_MAX_NUM_OUTPUTS = 100 | ||
DEFAULT_MAX_DURATION = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
  encoded_k = element.value.encoded_key
else:
  assert isinstance(element.value, tuple)
  encoded_k = element.value[0]
Renamed to 'key'.
"""A transform that assigns a unique key to each element.""" | ||
|
||
def process(self, element, window=beam.DoFn.WindowParam, *args, **kwargs): | ||
yield (uuid.uuid4().bytes, element) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added TODOs to here and iobase._WriteBundleDoFn. I think collisions are extremely rare for uuid.uuid4() though. Also, added an assertion to force a failure here if a collision is detected.
Run Python PreCommit
Thanks, looks pretty good to me SDF-wise!
sdks/python/apache_beam/io/iobase.py
Outdated
@@ -962,6 +962,7 @@ def display_data(self):

def process(self, element, init_result):
  if self.writer is None:
    # TODO: handle uid collisions here.
Doesn't seem necessary; uuid collisions are improbably rare.
Updated to a comment.
"""A `DoFn` that executes machineary for invoking a Splittable `DoFn`. | ||
|
||
Input to the `ParDo` step that includes a `ProcessFn` will be a `PCollection` | ||
of `ElementAndRestriction` objects. |
Is it that, or KeyedWorkItem's / WindowedValue's?
The SDK automatically converts WindowedValues to values during iteration, so I think it's fine to call it a PCollection of ElementAndRestrictions here. KeyedWorkItem is a special case where we receive it instead of the original iterable of values when a timer fires (this is true for any DoFn), so it's not worth mentioning here.
value = values[0]
if len(values) != 1:
  raise ValueError('')
assert isinstance(value, (WindowedValue, ElementAndRestriction))
Is this checking for being a subclass of both at the same time? Is that how windowed values work in Python - multiple inheritance (rather than wrapping like in Java)?
This checks if instance is of one of the types mentioned. Actually I don't think we'll be getting WindowedValues here after iteration. So updated.
  windowed_element = WindowedValue(element, timestamp, [window])
else:
  element_and_restriction = (
      value.value if isinstance(value, WindowedValue) else value)
Hm, seems concerning; maybe @charlesccychen can comment on what's going on here?
@@ -162,6 +188,8 @@ def process(self, element, timestamp=beam.DoFn.TimestampParam,
    break
  yield output

assert sdf_result
Add a message?
Done.
# Setting a timer to be reinvoked to continue processing the element.
# Currently Python SDK only supports setting timers based on watermark. So
# forcing a reinvocation by setting a timer for watermark negative
# infinity.
Does this have any practical consequences, except that the processing-time delay is not respected? E.g. does this unnecessarily hold the watermark more than requested, or something? Can the timer end up being dropped?
I think this works for now. The timer doesn't get dropped, and the watermark gets reset properly. Maria is working on adding support for proper processing-time timers; I will update after we have that.
for output in output_processor.output_iter:
  # A ProcessContinuation, if returned, should be the last element.
  assert not process_continuation
Where is it assigned to a non-none value?
It was not being set. Updated.
# Continuing here instead of breaking to enforce that this is the last
# element.
continue
Do we enforce that a ProcessContinuation was eventually returned at all?
We don't enforce that. The user may or may not return a ProcessContinuation object; if one is returned, it has to be the last element of the values iterator.
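The "continue instead of break" loop can be sketched in isolation (hypothetical names; `ProcessContinuation` here is a stand-in marker class, not the SDK type): by continuing, any element that follows a ProcessContinuation trips the assertion.

```python
class ProcessContinuation(object):
    """Stand-in for the SDK's ProcessContinuation marker (sketch)."""

def collect_outputs(output_iter):
    """Collect outputs, enforcing that a ProcessContinuation, if present,
    is the final element of the iterator."""
    outputs = []
    process_continuation = None
    for output in output_iter:
        # Any element arriving after a ProcessContinuation is an error.
        assert process_continuation is None, (
            'ProcessContinuation must be the last element')
        if isinstance(output, ProcessContinuation):
            process_continuation = output
            # Continue instead of breaking so a later element would trip
            # the assertion above.
            continue
        outputs.append(output)
    return outputs, process_continuation
```

A well-formed iterator like `[1, 2, ProcessContinuation()]` yields `([1, 2], <continuation>)`, while one with trailing elements after the continuation fails fast.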
@@ -157,6 +152,9 @@ def run_sdf_read_pipeline(

assert_that(pc1, equal_to(expected_data))

# TODO(chamikara: verify the number of times process method was invoked
# using a side output once SDFs supports producing side outputs.
Btw what will currently happen if an SDF attempts to produce side outputs? I realized that Python doesn't have a ProcessContext like Java does, so there's no explicit code that fails if it tries to.
We raise an error during transform overriding: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L201
Thanks. PTAL.
"""A `DoFn` that executes machineary for invoking a Splittable `DoFn`. | ||
|
||
Input to the `ParDo` step that includes a `ProcessFn` will be a `PCollection` | ||
of `ElementAndRestriction` objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SDK automatically converts WindowedValues to values during iteration so I think it's fine to call it a PCollection of ElementAndRestrictions here. KeyedWorkItem is a special case where we receive that instead of the original iterable of values when a timer is fired (i.e. this will be true for any DoFn) so not worth mentioning here.
  windowed_element = WindowedValue(element, timestamp, [window])
else:
  element_and_restriction = (
      value.value if isinstance(value, WindowedValue) else value)
My previous statement was incorrect. The process() method does get an iterator (_UnwindowedValues) of WindowedValue objects, but after the iterator is expanded we always get ElementAndRestriction objects here [1]. Updated.
[1] unwindowed_value = wv.value
Thanks! LGTM as far as SDF goes; up to Charles to review the rest.
Force-pushed from a8f47de to 43622fc.
Thanks Eugene. Charles, PTAL.
Thanks.
sdks/python/apache_beam/io/iobase.py
Outdated
Raises ValueError: if there is still any unclaimed work remaining in the
  restriction invoking this method. Exception raised must have an
  informative error message.
This method must raise an error if there is still any unclaimed work
"an error" -> "ValueError"
"""A primitive transform for processing keyed elements or KeyedWorkItems.

Will be evaluated by
`runners.direct.transform_evaluator._ProcessElemenetsEvaluator`.
"_ProcessElementsEvaluator"
sdks/python/apache_beam/io/iobase.py
Outdated
@@ -961,6 +962,7 @@ def display_data(self):

def process(self, element, init_result):
  if self.writer is None:
    # We ignore uuid collisions here since it's extremely rare.
capitalize "UUID".
  residual_range = (self._range.start, self._range.stop)
  end_position = self._range.start
else:
  residual_range = (self._current_position + 1, self._range.stop)
Can you factor `residual_range` out of this if? Just as:
`residual_range = (end_position, self._range.stop)`
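The checkpoint logic under discussion can be sketched with a simplified, self-contained tracker (class and method names are illustrative, not Beam's actual OffsetRangeTracker); note how factoring `end_position` out of the conditional leaves a single `residual_range` expression, as the reviewer suggests:

```python
class SimpleOffsetTracker(object):
    """Illustrative offset tracker that claims positions and can checkpoint."""

    def __init__(self, start, stop):
        self._start = start
        self._stop = stop
        self._current_position = None  # Last claimed position, if any.

    def try_claim(self, position):
        """Claims a position if it falls inside the remaining range."""
        if self._start <= position < self._stop:
            self._current_position = position
            return True
        return False

    def checkpoint(self):
        """Splits off the unclaimed remainder of the range.

        If no position was claimed yet, the residual is the whole range;
        otherwise it starts just past the last claimed position.
        """
        end_position = (self._start if self._current_position is None
                        else self._current_position + 1)
        residual_range = (end_position, self._stop)
        # The primary restriction keeps only [start, end_position).
        self._stop = end_position
        return residual_range
```

For example, claiming positions 10 and 11 of range [10, 100) and then checkpointing yields a residual range of (12, 100).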
super(_ParDoEvaluator, self).__init__(
    evaluation_context, applied_ptransform, input_committed_bundle,
    side_inputs, scoped_metrics_container)
self._perform_dofn_pickle_test = perform_dofn_pickle_test
Please add comment that this workaround is for SDF.
@@ -530,16 +541,20 @@ def start_bundle(self):
self._counter_factory = counters.CounterFactory()

# TODO(aaltay): Consider storing the serialized form as an optimization.
dofn = pickler.loads(pickler.dumps(transform.dofn))
dofn = (pickler.loads(pickler.dumps(transform.dofn))
        if self._perform_dofn_pickle_test else transform.dofn)
Please add comment that this workaround is for SDF.
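The pickle round-trip in the diff above can be illustrated with the standard `pickle` module (the `MyDoFn` stand-in and the flag name mirror the diff but are assumptions, not Beam's actual API): the round-trip both verifies that the DoFn is picklable and makes the runner work on a copy, the way a distributed runner would, while the SDF path skips it because its ProcessFn carries unpicklable runtime state.

```python
import pickle


class MyDoFn(object):
    """Hypothetical picklable DoFn stand-in."""
    def process(self, element):
        yield element * 2


def prepare_dofn(dofn, perform_pickle_test=True):
    """Round-trips the DoFn through pickle unless the test is disabled."""
    # The round-trip raises if the DoFn is not picklable, and returns a
    # fresh copy, so mutations on the copy don't leak back to the original.
    return (pickle.loads(pickle.dumps(dofn))
            if perform_pickle_test else dofn)
```

With the test enabled the returned object is a distinct copy; with it disabled the original instance is used directly.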
# limitations under the License.
#

"""Unit tests for the range_trackers module."""
Please add extra newline.
self.sdf = sdf
self._element_tag = _ValueStateTag('element')
self._restriction_tag = _ValueStateTag('restriction')
self.watermark_hold_tag = _ValueStateTag('watermark_hold')
chamikaramj wrote:
This is also used in transform_evaluator module.
Acknowledged.
@@ -491,14 +491,14 @@ class NullReceiver(object):
def output(self, element):
  pass

class _InMemoryReceiver(common.Receiver):
chamikaramj wrote:
I ran into issues since we are piggybacking StepContext and SDF DoFnInvoker objects in the DoFn. I added an option to disable this pickling only for the SDF case.
Acknowledged.
"""A transform that assigns a unique key to each element."""

def process(self, element, window=beam.DoFn.WindowParam, *args, **kwargs):
  yield (uuid.uuid4().bytes, element)
chamikaramj wrote:
Added TODOs to here and iobase._WriteBundleDoFn. I think collisions are extremely rare for uuid.uuid4() though. Also, added an assertion to force a failure here if a collision is detected.
I don't see this assertion? I agree that they should be rare, but a comment would help anyone reading this in the future understand this assumption.
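Outside of Beam, the key-assignment idea being reviewed reduces to pairing each element with a fresh uuid4 (a sketch under that assumption; the real transform is a DoFn that also forwards the window):

```python
import uuid


def assign_unique_keys(elements):
    """Pairs each element with a key that is unique with high probability.

    uuid4 collisions are treated as astronomically rare and ignored; a
    collision would merely group two elements under one key downstream.
    """
    for element in elements:
        yield (uuid.uuid4().bytes, element)
```

Each yielded key is the 16-byte binary form of a random UUID, so element values are preserved while keys are effectively unique.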
OffsetRange(10, 100)

with self.assertRaises(ValueError):
  OffsetRange(10, 9)
chamikaramj wrote:
Done (note that this is not a failure case).
Done.
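The test being discussed checks constructor validation: constructing a range whose start exceeds its stop must raise ValueError, while `OffsetRange(10, 100)` is the non-failure case. A minimal class with that behavior might look like this (illustrative, not Beam's actual OffsetRange):

```python
class OffsetRange(object):
    """Half-open offset range [start, stop); an empty range is allowed."""

    def __init__(self, start, stop):
        if start > stop:
            raise ValueError(
                'start offset %d must not exceed stop offset %d'
                % (start, stop))
        self.start = start
        self.stop = stop
```

So `OffsetRange(10, 100)` and the empty `OffsetRange(5, 5)` construct fine, while `OffsetRange(10, 9)` raises.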
SDFProcessElementInvoker(
    max_num_outputs=self.DEFAULT_MAX_NUM_OUTPUTS,
    max_duration=self.DEFAULT_MAX_DURATION))
self._process_fn.set_process_element_invoker(process_element_invoker)
chamikaramj wrote:
Renamed to 'key'.
Done.
@@ -530,16 +530,19 @@ def start_bundle(self):
self._counter_factory = counters.CounterFactory()

# TODO(aaltay): Consider storing the serialized form as an optimization.
dofn = pickler.loads(pickler.dumps(transform.dofn))
dofn = transform.dofn

pipeline_options = self._evaluation_context.pipeline_options
if (pipeline_options is not None
    and pipeline_options.view_as(TypeOptions).runtime_type_check):
  dofn = TypeCheckWrapperDoFn(dofn, transform.get_type_hints())
chamikaramj wrote:
Done.
Done.
pass

def process(self, element, timestamp=beam.DoFn.TimestampParam,
            window=beam.DoFn.WindowParam, *args, **kwargs):
chamikaramj wrote:
Done.
Done.
if self._current_position is None
else (self._current_position + 1, self._range.stop))
# If self._current_position is 'None' no records have been claimed so
# residual should start from self._range.start.
chamikaramj wrote:
Done.
Done.
self.sdf_invoker, windowed_element, tracker)

sdf_result = None
for output in output_values:
chamikaramj wrote:
Added more documentation to SDFProcessElementInvoker and Result classes.
Acknowledged.
# residual should start from self._range.start.
end_position = (
    self._range.start if self._current_position is None
    else self._current_position + 1)
chamikaramj wrote:
Done.
Done.
Please also rebase to HEAD since there seems to be a merge conflict right now.
43622fc to fc58867
Thanks. PTAL.
sdks/python/apache_beam/io/iobase.py
Outdated
Raises ValueError: if there is still any unclaimed work remaining in the
  restriction invoking this method. Exception raised must have an
  informative error message.
This method must raise an error if there is still any unclaimed work
charlesccychen wrote:
"an error" -> "ValueError"
Done.
"""A primitive transform for processing keyed elements or KeyedWorkItems.

Will be evaluated by
`runners.direct.transform_evaluator._ProcessElemenetsEvaluator`.
charlesccychen wrote:
"_ProcessElementsEvaluator"
Done.
sdks/python/apache_beam/io/iobase.py
Outdated
@@ -961,6 +962,7 @@ def display_data(self):

def process(self, element, init_result):
  if self.writer is None:
    # We ignore uuid collisions here since it's extremely rare.
charlesccychen wrote:
capitalize "UUID".
Done.
"""Ignores undeclared outputs, default execution mode."""

def receive(self, element):
def output(self, element):
  pass
charlesccychen wrote:
Please add comment that this workaround is for SDF.
Done.
# limitations under the License.
#

"""Unit tests for the range_trackers module."""
charlesccychen wrote:
Please add extra newline.
Done.
  residual_range = (self._range.start, self._range.stop)
  end_position = self._range.start
else:
  residual_range = (self._current_position + 1, self._range.stop)
charlesccychen wrote:
Can you factor `residual_range` out of this if? Just as:
`residual_range = (end_position, self._range.stop)`
Done.
input_committed_bundle, side_inputs, scoped_metrics_container,
perform_dofn_pickle_test=True):
super(_ParDoEvaluator, self).__init__(
    evaluation_context, applied_ptransform, input_committed_bundle,
charlesccychen wrote:
Please add comment that this workaround is for SDF.
Mentioned above.
@@ -47,7 +47,8 @@
from apache_beam.utils import urns
from apache_beam.utils.windowed_value import WindowedValue

__all__ = ['BoundedSource', 'RangeTracker', 'Read', 'Sink', 'Write', 'Writer']
chamikaramj wrote:
Updated to a comment.
Acknowledged. Updated to a comment.
"""A transform that assigns a unique key to each element."""

def process(self, element, window=beam.DoFn.WindowParam, *args, **kwargs):
  yield (uuid.uuid4().bytes, element)
charlesccychen wrote:
I don't see this assertion? I agree that they should be rare, but a comment would help anyone reading this in the future understand this assumption.
I removed the assertion after Eugene's comment. Added a comment.
Failure is unrelated (due to https://issues.apache.org/jira/browse/BEAM-3369).
Thanks!
sdks/python/apache_beam/io/iobase.py
Outdated
Raises ValueError: if there is still any unclaimed work remaining in the
  restriction invoking this method. Exception raised must have an
  informative error message.
This method must raise an `ValueError` if there is still any unclaimed work
a ValueError
Done.
@@ -332,31 +437,45 @@ def __init__(self,
kwargs: keyword side input arguments (static and placeholder), if any
chamikaramj wrote:
We need to pass in a custom OutputProcessor when invoking SDF.process() instead of using the default output processor since output has to be handled at ProcessFn.
Done.
Args:
do_fn: A DoFn object that contains the method.
obj: the object that contains the method.
chamikaramj wrote:
Done.
Done.
Updates DoFnInvocation logic to allow invoking SDF methods. Adds SDF machinery that will be common to DirectRunner and other runners. Adds DirectRunner specific transform overrides, evaluators, and other logic for processing Splittable DoFns.
fc58867 to 5443406
Thanks for the review.
Updates DoFn invocation logic to allow invoking SDF methods.
Adds SDF machinery that will be common to DirectRunner and other runners.
Adds DirectRunner specific transform overrides, evaluators, and other logic for processing Splittable DoFns.
Follow this checklist to help us incorporate your contribution quickly and easily:

- Make sure the PR title is formatted like: `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
- Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.