
[BEAM-9562] Send Timers over Data Channel as Elements #11314

Merged
merged 1 commit into apache:master from boyuanzz:data on Apr 10, 2020

Conversation

@boyuanzz (Contributor) commented Apr 4, 2020

For commit: 3ff8c3e
r: @robertwb for data_plane.py and bundle_processor.py
r: @pabloem for fn_runner related part.

For commit: a2a7164
r: @TheNeuralBit

cc: @y1chi for reference


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[Build-status badge tables for the Go, Java, Python, and XLang SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners omitted.]

Pre-Commit Tests Status (on master branch)

[Build-status badge tables for Java, Python, Go, and Website (portable and non-portable) omitted.]

See .test-infra/jenkins/README for the trigger phrase, status, and link of all Jenkins jobs.

@@ -1071,6 +1071,14 @@ def named_inputs(self):
}
return dict(main_inputs, **side_inputs)

def main_inputs(self):
Contributor:
Generic transforms don't have the notion of main inputs, let's filter things out in the implementation in ParDo.
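
A minimal sketch of this suggestion (the 'side' tag prefix and the shape of named_inputs() are assumptions for illustration, not necessarily the real Beam code):

class PTransform(object):
  def main_inputs(self):
    # Generic transforms have no notion of side inputs, so by default
    # every named input counts as a main input.
    return self.named_inputs()

class ParDo(PTransform):
  def main_inputs(self):
    # ParDo knows which tags are side inputs, so the filtering lives
    # here rather than in the generic base class.
    return {
        tag: pcoll
        for tag, pcoll in self.named_inputs().items()
        if not tag.startswith('side')  # assumed side-input tag prefix
    }
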

@@ -1294,16 +1295,43 @@ def _pardo_fn_data(self):
windowing = None
return self.fn, self.args, self.kwargs, si_tags_and_types, windowing

def to_runner_api_parameter(self, context):
def to_runner_api(self, context, main_inputs, has_parts=False):
Contributor:
This is starting to look like a lot of code duplication. How about we pass (all) inputs as a keyword argument, and let PTransform.to_runner_api take an **extra_kwargs that it passes on to to_runner_api_parameter.
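
Roughly, the plumbing being suggested (a simplified sketch of the shape only; details of the real method are omitted):

from apache_beam.portability.api import beam_runner_api_pb2

class PTransform(object):
  def to_runner_api(self, context, **extra_kwargs):
    # Forward unrecognized keyword arguments so subclasses can accept
    # extras like main_inputs without re-implementing this method.
    urn, typed_param = self.to_runner_api_parameter(context, **extra_kwargs)
    # (assuming typed_param is a proto message)
    return beam_runner_api_pb2.FunctionSpec(
        urn=urn, payload=typed_param.SerializeToString())
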

@boyuanzz force-pushed the data branch 3 times, most recently from e046783 to 3ff8c3e on April 8, 2020 20:18
@boyuanzz changed the title from [WIP] Send Timers over Data Channel as Elements to [BEAM-9562] Send Timers over Data Channel as Elements on Apr 8, 2020
paneinfo,
timer_family_id,
timer_coder_impl,
output_stream
Contributor:
A type annotation on this parameter would be useful.

dynamic_timer_tag='',
windows=(self._window, ),
clear_bit=False,
fire_timestamp=clear_ts,
Contributor:
Don't bother setting these timestamps, or paneinfo.

Contributor:
(Should the coder be ignoring them as well?)

Contributor Author (boyuanzz), Apr 9, 2020:
(Should the coder be ignoring them as well?)

No, the timer coder encodes all of this info now.

Don't bother setting these timestamps, or paneinfo.

Could you please explain more about this?

Contributor:
They're meaningless when we're clearing a timer (e.g. it won't fire, hold back the watermark, or have a pane info).

Contributor Author:
Correct, when clear_bit is True, the coder ignores these fields. I think we should have a better Timer API with set and clear, like in Java, as a follow-up.

@@ -611,7 +611,7 @@ def __init__(self,
transform_id, # type: str
key_coder, # type: coders.Coder
window_coder, # type: coders.Coder
timer_family_specs # type: Mapping[str, beam_runner_api_pb2.TimerFamilySpec]
timer_coders
Contributor:
type?

@@ -1088,6 +1142,30 @@ def create_operation(self,
transform_proto.spec.payload, parameter_type)
return creator(self, transform_id, transform_proto, payload, consumers)

def get_timer_coders(self):
timer_coder = {}
for transform_id, transform_proto in self.descriptor.transforms.items():
Contributor:
I see us doing this loop three times now. Perhaps it would be more useful to do the loop once to set everything up, creating a single dictionary (transform_id, timer_family_id) -> (all info about that timer we need to dispatch them).
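
A sketch of the one-pass setup (TimerInfo and the get_coder helper are assumed names for illustration):

import collections

from apache_beam.portability import common_urns
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.utils import proto_utils

# Assumed container for everything needed to dispatch one timer family.
TimerInfo = collections.namedtuple(
    'TimerInfo', ['timer_coder_impl', 'output_stream'])

def build_timer_infos(self):
  # One loop over the transforms instead of three; the result maps
  # (transform_id, timer_family_id) -> TimerInfo.
  timer_infos = {}
  for transform_id, proto in self.descriptor.transforms.items():
    if proto.spec.urn != common_urns.primitives.PAR_DO.urn:
      continue
    payload = proto_utils.parse_Bytes(
        proto.spec.payload, beam_runner_api_pb2.ParDoPayload)
    for family_id, spec in payload.timer_family_specs.items():
      timer_infos[(transform_id, family_id)] = TimerInfo(
          # get_coder is an assumed helper resolving a coder id.
          timer_coder_impl=self.get_coder(
              spec.timer_family_coder_id).get_impl(),
          output_stream=None)  # attached once the bundle starts
  return timer_infos
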

instruction_id, transform_id, timer_id)
timer_output_streams[transform_id] = output_streams
self.process_timer_ops[
transform_id].user_state_context.update_timer_output_streams(
Contributor:
Nit: rather than this double nesting, it might simplify things to have an update_timer_output_streams(timer_id, output_stream) method that could be called repeatedly.
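
For instance (method and attribute names assumed):

class FnApiUserStateContext(object):
  def __init__(self):
    self._timer_output_streams = {}

  def update_timer_output_stream(self, timer_family_id, output_stream):
    # One (family, stream) pair per call, instead of handing over a
    # nested {transform_id: {timer_family_id: stream}} mapping at once.
    self._timer_output_streams[timer_family_id] = output_stream

# Call site: repeat per timer family rather than double-nesting.
for timer_family_id, output_stream in output_streams.items():
  user_state_context.update_timer_output_stream(
      timer_family_id, output_stream)
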

output_streams)

# Process timers
if self.timer_data_channel:
Contributor:
We can't safely assume the runner will finish sending all timers before sending any of the data (and the buffer may get full, resulting in a deadlock). I think we need to have a data_channel.inputs() that returns both data and timers and then branch in the loop.
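
Sketched out, with a hypothetical data_channel.inputs() and assumed dispatch helpers:

from apache_beam.portability.api import beam_fn_api_pb2

for element in data_channel.inputs(instruction_id, expected_inputs):
  if isinstance(element, beam_fn_api_pb2.Elements.Timer):
    # Route timers by (transform, timer family) as they arrive...
    handle_timer(element.transform_id, element.timer_family_id,
                 element.timers)
  else:  # beam_fn_api_pb2.Elements.Data
    # ...interleaved with ordinary data, so neither side can stall
    # the other waiting for its stream to finish first.
    handle_data(element.transform_id, element.data)
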

@pabloem (Member) commented Apr 8, 2020

I'm sorry, I am having a bad headache. I'll bow out. @robertwb can you review fn_runner.py and siblings?

@TheNeuralBit (Member) left a comment:
A couple of comments on the Java side. Still working on reviewing, but I need to step away for a little bit.

checkArgument(
mainInput.getCoder() instanceof KvCoder,
"DoFn's that use state or timers must have an input PCollection with a KvCoder but received %s",
mainInput.getCoder());
Member:
Just curious: did we not have this check before, and just fail when attempting to cast to KvCoder (in the removed block from translate above)?

Member:
It was covered by validation in DoFnSignatures, but it is repeated here for defense-in-depth reasons.

idGenerator, sdkHarnessRegistry.beamFnStateApiServiceDescriptor())
idGenerator,
sdkHarnessRegistry.beamFnStateApiServiceDescriptor(),
sdkHarnessRegistry.beamFnDataApiServiceDescriptor())
Member:
Isn't the timer API service descriptor different from the data API service descriptor? Does that need to be plumbed through SdkHarnessRegistry and used here instead of the data API descriptor? (same question below and in streaming worker)

Member:
They both use the Data API, so no. All we're saying here is that we will re-use the same gRPC channel for both timers and data.

Member:
I see. So we only have a separate timer_api_service_descriptor in the protos so that a runner has the option to make it separate, but it doesn't need to be separate?

Member:
That is correct.

}

private RegisterNodeFunction(
@Nullable RunnerApi.Pipeline pipeline,
IdGenerator idGenerator,
Endpoints.ApiServiceDescriptor stateApiServiceDescriptor) {
Endpoints.ApiServiceDescriptor stateApiServiceDescriptor,
Endpoints.ApiServiceDescriptor timerApiServiceDescriptor) {
Member:
timerApiServiceDescriptor isn't used? Should it be stored and written to the ProcessBundleDescriptor?

@TheNeuralBit (Member) left a comment:
Java changes LGTM overall aside from the above comment. Another set of eyes (or at least another look from my own eyes when they're fresh) would be good though.

@boyuanzz (Contributor Author) commented Apr 9, 2020

The test_pardo_timers_clear test fails with streaming Flink. The Python SDK sends all timers (hold_timestamp=-INF, the Python default behavior) but only gets the timer with timestamp=20 back. Given that the test only fails when streaming, it seems like something is not right with the watermark(?). @lukecwik


def to_runner_api(self, context, **extra_kwargs):
# type: (PipelineContext, bool) -> beam_runner_api_pb2.FunctionSpec
has_parts = extra_kwargs.get('has_part', False)
Contributor:
You can leave this in the parameter list.

# type: (PipelineContext) -> typing.Tuple[str, message.Message]
assert isinstance(self, ParDo), \
"expected instance of ParDo, but got %s" % self.__class__
key_coder, window_coder = self._get_key_and_window_coder(
Contributor:
Maybe put this in the if block below closer to where they're used?

def to_runner_api(self, context, **extra_kwargs):
# type: (PipelineContext, bool) -> beam_runner_api_pb2.FunctionSpec
has_parts = extra_kwargs.get('has_part', False)
urn, typed_param = self.to_runner_api_parameter(context, **extra_kwargs)
Contributor:
Nevermind, I see what's going on here.


# type: (...) -> OutputTimer
assert self._timer_receivers is not None
return OutputTimer(key, window, self._timer_receivers[timer_spec.name])
output_stream = self._timer_output_streams[timer_spec.name]
Contributor:
If this were a single map rather than two parallel maps, you could write something like

output_stream, timer_coder_impl = self._timer_info[timer_spec.name]

done_inputs.add((element.transform_id, element.timer_family_id))
else:
yield element
if isinstance(element, beam_fn_api_pb2.Elements.Data):
Contributor:
elif

stream_done = False
while not stream_done:
streams = None
if not stream_done:
Contributor:
This will always be true (given the loop condition).

timer_stream.append(stream)
if isinstance(stream, beam_fn_api_pb2.Elements.Data):
data_stream.append(stream)
if data_stream:
Contributor:
No need to have these conditionals, you can just write

yield beam_fn_api_pb2.Elements(data=data_stream, timer=timer_stream)

for stream in streams:
if isinstance(stream, beam_fn_api_pb2.Elements.Timer):
timer_stream.append(stream)
if isinstance(stream, beam_fn_api_pb2.Elements.Data):
Contributor:
else

@@ -92,7 +92,7 @@ cdef class DoOperation(Operation):
cdef DoFnRunner dofn_runner
cdef object tagged_receivers
cdef object side_input_maps
cdef object user_state_context
cpdef public object user_state_context
Contributor:
Rather than making this public, I would add an add_timer_info method to this operation.
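
For example (only the method name comes from this suggestion; the body is an assumption):

class DoOperation(Operation):
  def add_timer_info(self, timer_family_id, timer_info):
    # Keep user_state_context private to the operation and expose
    # only the narrow entry point the bundle processor needs.
    self.user_state_context.add_timer_info(timer_family_id, timer_info)
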

@robertwb (Contributor) left a comment:
All my comments were pretty minor; the logic looks good (for the Python worker/data channel changes).

@lukecwik (Member) commented Apr 9, 2020

Run Python PreCommit

@lukecwik (Member) commented Apr 9, 2020

Run Python2_PVR_Flink PreCommit

boyuanzz added a commit to boyuanzz/beam that referenced this pull request Apr 9, 2020
@robertwb (Contributor) left a comment:
OK, I've finished reviewing all the Python files.

@@ -1272,6 +1272,8 @@ def expand(self, pcoll):
key_coder = coder.key_coder()
else:
key_coder = coders.registry.get_coder(typehints.Any)
self.window_coder = pcoll.windowing.windowfn.get_window_coder()
Contributor:
Are these still used?

Contributor Author:
No. Will remove.

window_coder = input_pcoll.windowing.windowfn.get_window_coder()
return key_coder, window_coder

def to_runner_api(self, context, **extra_kwargs):
Contributor:
This code looks like it's copied from the superclass; instead just do

def to_runner_api(self, context, named_inputs, **extra_kwargs):
  return super(ParDo, self).to_runner_api(
      context, named_inputs=named_inputs, **extra_kwargs)

Contributor Author:
We can delete this override since we pass extra_kwargs from PTransform now.

data_channels = collections.defaultdict(
list
) # type: DefaultDict[data_plane.GrpcClientDataChannel, List[str]]

# Inject data inputs from data plane.
Contributor:
This comment is a bit misleading, as the injection doesn't happen in this for loop. (Similarly with timers.)

Contributor Author:
Updated the comment.

stage.timer_pcollections.append(
(timer_read_pcoll + '/Read', timer_write_pcoll))
for timer_family_id in payload.timer_family_specs.keys():
stage.timers.add((transform.unique_name, timer_family_id))
Contributor:
Nice simplification here :).

expected_outputs # type: DataOutput
):
expected_outputs, # type: DataOutput
fired_timers, # type: Mapping[str, Mapping[str, PartitionableBuffer]]
Contributor:
For consistency, should this be a Mapping[Tuple[str, str], PartitionableBuffer]?

Contributor Author:
I updated the fired_timers implementation but forgot to update the typing here. Thanks!

@@ -536,7 +525,8 @@ def _run_stage(self,
runner_execution_context,
bundle_context_manager,
data_input,
data_output,
data_output, {},
Contributor:
Put {} on its own line. (Surprised yapf didn't complain, or maybe you haven't run it yet.)

Contributor Author:
yapf is what put the {} here for me.

@@ -896,7 +906,9 @@ def _generate_splits_for_testing(self,

def process_bundle(self,
inputs, # type: Mapping[str, PartitionableBuffer]
expected_outputs # type: DataOutput
expected_outputs, # type: DataOutput
fired_timers, # type: Mapping[str, Mapping[str, PartitionableBuffer]]
Contributor:
Mapping[Tuple[str, str], PartitionableBuffer]?


for transform_id, timer_family_id in (
set(expected_output_timers.keys()) - set(fired_timers.keys())):
# Close the stream if there are no timers to be sent.
Contributor:
This is a subtle point. I might write something like "The worker waits for a logical timer stream to be closed for every possible timer, regardless of whether there are any timers to be sent."

Maybe it'd be clearer to iterate over expected_output_timers, and send fired_timers.get((transform_id, timer_family_id), []).
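
In sketch form (the stream-getter helper is assumed):

# Iterate over every expected timer stream and send whatever fired,
# possibly nothing, so each logical timer stream is always closed.
for transform_id, timer_family_id in expected_output_timers:
  stream = get_output_timer_stream(transform_id, timer_family_id)
  for encoded_timer in fired_timers.get((transform_id, timer_family_id), []):
    stream.write(encoded_timer)
  stream.close()  # closing signals "no more timers", even when none fired
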

if coder_id in self.execution_context.safe_coders:
return self.execution_context.pipeline_context.coders[
self.execution_context.safe_coders[coder_id]].get_impl()
else:
return self.execution_context.pipeline_context.coders[coder_id].get_impl()

def get_timer_coder_impl(self, transform_id, timer_family_id):
assert (transform_id, timer_family_id) in self._timer_coder_ids
Contributor:
The KeyError from the lookup below will be sufficient if it's not present.

@lukecwik merged commit 1de50c3 into apache:master on Apr 10, 2020
@mxm (Contributor) commented Apr 13, 2020

This is a big change which also affects the runners. Would it have made sense to notify runner authors, especially since post-commit tests are broken? It took me a bit to figure out what caused the regression.

@ibzib (Contributor) commented Apr 13, 2020

@mxm Which post-commits are you referring to? And can you please mark the JIRA(s) with fix version 2.21.0 so we can fix the regression in the release?

@boyuanzz (Contributor Author):
This is a big change which also affects the runners. Would it have made sense to notify runner authors, especially since post-commit tests are broken? It took me a bit to figure out what caused the regression.

Thanks, Max! Sorry for the inconvenience. It seems like currently both Spark and Flink fail on the same test: org.apache.beam.sdk.transforms.ParDoTest$TimerTests.testEventTimeTimerAlignBounded. The failure pattern is also the same: the pipeline only produces the output from the timer, not from the ProcessElement fn. I think there may be something wrong in the Java runner shared library code. Have you looked into it, or do you want me to follow up and fix this issue?

@lukecwik (Member):
The problem is with the Timer implementation inside the FnApiDoFnRunner. The spec for Timer wasn't clear as to what the defaults were when withOutputTimestamp was added and hence some critical logic was deleted during the migration.

See #11402 for the fix.

@mxm (Contributor) commented Apr 13, 2020

I was actually working on something related to timers in #11362 and was surprised to see that the test failed when I opened the PR, since I had run the tests locally. Then I figured something must have changed on master in the meantime. Thanks for following up on this!
