[BEAM-9640] Sketching watermark tracking on FnApiRunner #11296

pabloem · 2020-04-02T18:48:37Z

This change adds an initial 'sketch' of watermark tracking to the batch runner. Watermark tracking is done like so:

If a PCollection has a delayed application, its watermark will be held at MIN_WATERMARK.
If a PTransform has a channel split, its input PCollection's watermark will be held at MIN_WATERMARK.
For timers, an 'input pcollection' node is created for each timer family to be consumed by a transform. If a bundle execution returns a timer set at time X, the input PCollection for that timer family will be held at X. This will hold back the downstream watermarks from the stage.
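
The three hold rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the runner's code: `MIN_TIMESTAMP` and `MAX_TIMESTAMP` are plain-float stand-ins for Beam's timestamp constants, and `held_watermark` and its parameters are hypothetical names.

```python
import math

# Stand-ins for Beam's MIN/MAX timestamps (assumption: plain floats).
MIN_TIMESTAMP = -math.inf
MAX_TIMESTAMP = math.inf


def held_watermark(pcoll, delayed_pcolls, split_input_pcolls, timer_holds):
  """Returns the watermark a PCollection is held at after a bundle runs.

  All arguments are hypothetical: two sets of PCollection names and a dict
  mapping a timer family's input PCollection to the earliest timer set on it.
  """
  # Rules 1 and 2: delayed applications and channel splits hold at the minimum.
  if pcoll in delayed_pcolls or pcoll in split_input_pcolls:
    return MIN_TIMESTAMP
  # Rule 3: a timer set at time X holds its timer PCollection at X.
  # Anything without a hold is free to advance to the maximum.
  return timer_holds.get(pcoll, MAX_TIMESTAMP)
```

For example, `held_watermark('timers_a', set(), set(), {'timers_a': 42})` returns `42`, which in turn holds back every watermark downstream of that timer family.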

pabloem · 2020-04-03T18:44:11Z

This is the output of watermark_manager.show() for one pipeline:

pabloem · 2020-04-30T23:30:33Z

r: @robertwb

robertwb

Sorry it took so long to get to this. Most of my questions are around watermark advancement.

robertwb · 2020-05-05T21:04:42Z

sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py

+      w = min(i.watermark() for i in self.inputs)
+      return w
+
+    def input_watermark(self):


This doesn't seem right, the input watermarks should always be an upper bound on the output watermark.

That makes sense. I've changed the output_watermark to be calculated based on downstream PCollections.
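
A minimal sketch of the invariant discussed here, assuming a simplified node model: the stage's output watermark is derived from its output PCollections and capped by its input watermark. The class shapes and the explicit cap are illustrative assumptions, not the code in this PR.

```python
import math

MAX_TIMESTAMP = math.inf  # stand-in for Beam's MAX_TIMESTAMP (assumption)


class PCollectionNode:
  def __init__(self, name, watermark=-math.inf):
    self.name = name
    self._watermark = watermark

  def watermark(self):
    return self._watermark


class StageNode:
  def __init__(self, name, inputs, outputs):
    self.name = name
    self.inputs = inputs
    self.outputs = outputs

  def input_watermark(self):
    # A stage with no inputs is not held back by anything upstream.
    if not self.inputs:
      return MAX_TIMESTAMP
    return min(pc.watermark() for pc in self.inputs)

  def output_watermark(self):
    upper_bound = self.input_watermark()
    if not self.outputs:
      return upper_bound
    # Derived from the output PCollections, never above the input watermark.
    return min(upper_bound, min(pc.watermark() for pc in self.outputs))
```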

robertwb · 2020-05-05T21:12:27Z

sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py

+    def set_watermark(self, wm):
+      raise NotImplementedError('Stages do not have a watermark')
+
+    def output_watermark(self):


This doesn't seem to take into account data that's "in flight." E.g. all the input watermarks could be at max-timestamp, but that doesn't mean that all the inputs' data has been consumed.

I've changed this implementation to rely on the watermark from downstream PCollections.
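
To illustrate the "in flight" concern only (this is not the fix adopted in the PR), here is a small sketch in which elements that have been read but not yet processed hold the output watermark back. `InFlightTracker` and its methods are hypothetical names.

```python
import math


class InFlightTracker:
  """Tracks timestamps of elements that were read but not yet processed."""
  def __init__(self):
    self._pending = []

  def add_pending(self, element_timestamp):
    self._pending.append(element_timestamp)

  def mark_done(self, element_timestamp):
    self._pending.remove(element_timestamp)

  def output_watermark(self, input_watermark):
    # Even if the input watermark is at the maximum, unprocessed elements
    # keep the output watermark at or below their earliest timestamp.
    if not self._pending:
      return input_watermark
    return min(input_watermark, min(self._pending))
```

With one pending element at timestamp 7, `output_watermark(math.inf)` stays at 7 until `mark_done(7)` is called.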

pabloem · 2020-05-05T23:05:42Z

> Sorry it took so long to get to this. Most of my questions are around watermark advancement.

no worries. This is a critical component, and I have other work to do, so I'm glad to get a thoughtful review. I'll address your comments soon.

pabloem · 2021-02-03T02:38:55Z

Run Python 3.8 PostCommit

pabloem · 2021-02-04T20:51:00Z

@robertwb sorry about the long delay on this, but I've rebased this, and addressed some of your older comments.

pabloem · 2021-02-04T21:23:40Z

cc: @tvalentyn FYI I am now focusing on this change

robertwb · 2021-03-02T00:30:45Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+  @staticmethod
+  def _collect_written_timers(
+      bundle_context_manager: execution.BundleContextManager,
+      newly_set_timers: Dict[Tuple[str, str], ListBuffer],


Should this be a return value as well?

robertwb · 2021-03-02T00:30:53Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+
+    This function reviews a stage that has just been run. The stage will have
+    written timers to its output buffers. The function then takes the timers,
+    and adds them to the `newly_set_timers` dictionary.


What does it return?

robertwb · 2021-03-02T00:33:21Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+          producer_name]
+      # We take the output with tag 'out' from the producer transform. The
+      # producer transform is a GRPC read, and it has a single output.
+      pcolls_with_delayed_apps.add(transform.outputs['out'])


More flexibly, you could do only_element(transform.outputs.values())
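
A helper of the shape suggested here could look like the sketch below. It is a self-contained stand-in written for illustration; the `outputs` dict is a hypothetical example of a transform's tag-to-PCollection map.

```python
def only_element(iterable):
  """Returns the single element of `iterable`, or raises ValueError."""
  # Tuple unpacking fails unless there is exactly one element.
  element, = iterable
  return element


# Hypothetical outputs map of a transform with a single output.
outputs = {'out': 'pcoll_1'}
single_output = only_element(outputs.values())
```

Unlike indexing with the hard-coded tag `'out'`, this keeps working if the tag name changes, and it fails loudly if the transform unexpectedly has zero or several outputs.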

robertwb · 2021-03-02T00:38:44Z

sdks/python/apache_beam/runners/portability/fn_api_runner/visualization_tools.py

+    import graphviz
+  except ImportError:
+    import warnings
+    warnings.warn('Unable to draw pipeline. graphviz library missing.')


Any reason not to make this a dependency? (E.g. is it fairly large?)

That's right - since the utilities are only used for internal debugging of the runner, I hesitate to add it as a dependency - it is also large as you point out. I don't think it fits under the test tag dependencies - we may need an extra tag for internal dependencies or something like that. I'm happy to add a new tag, or leave it out. Thoughts?
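
One way to keep a debugging-only library optional is to check for it at call time and warn instead of failing. The sketch below uses only the standard library; `optional_module_available` is a hypothetical helper, not something in this PR.

```python
import importlib.util
import warnings


def optional_module_available(module_name):
  """Returns True if a top-level module can be imported, else warns."""
  # find_spec returns None when a top-level module is not installed.
  if importlib.util.find_spec(module_name) is None:
    warnings.warn(
        'Unable to draw pipeline. %s library missing.' % module_name)
    return False
  return True
```

A drawing utility could call `optional_module_available('graphviz')` first and return early when it is `False`, so the runner itself never depends on the package.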

robertwb · 2021-03-02T00:47:44Z

sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py

+        assert isinstance(pcnode, WatermarkManager.PCollectionNode)
+        snode.inputs.add(pcnode)
+        node = self._watermarks_by_name[pcname]
+        assert isinstance(node, WatermarkManager.PCollectionNode)


Didn't we just assert that above?

true. removed.

robertwb · 2021-03-02T00:55:06Z

sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py

+    # type: (str) -> Union[PCollectionNode, StageNode]
+    return self._watermarks_by_name[name]
+
+  def get_watermark(self, name) -> timestamp.Timestamp:


get/set watermark is never used on stage nodes, right? Does it make sense to keep them in the same dictionary?

yeah, you're right. I was also having that sense when I was jumping through hoops to unify the typing. I've separated them. Thanks!
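
A rough sketch of what separating the two kinds of node can look like, so that each accessor has a single return type. The dictionary names mirror the ones visible in the diff hunks; the class bodies and the `add_*` methods are simplified assumptions.

```python
from typing import Dict


class PCollectionNode:
  def __init__(self, name: str) -> None:
    self.name = name
    self.watermark = float('-inf')


class StageNode:
  def __init__(self, name: str) -> None:
    self.name = name


class WatermarkManagerSketch:
  def __init__(self) -> None:
    # One dictionary per node kind, so no Union type is needed.
    self._pcollections_by_name: Dict[str, PCollectionNode] = {}
    self._stages_by_name: Dict[str, StageNode] = {}

  def add_stage(self, name: str) -> None:
    self._stages_by_name[name] = StageNode(name)

  def add_pcollection(self, name: str) -> None:
    self._pcollections_by_name[name] = PCollectionNode(name)

  def get_stage_node(self, name: str) -> StageNode:
    return self._stages_by_name[name]

  def get_watermark(self, name: str) -> float:
    # Only PCollections carry a settable watermark.
    return self._pcollections_by_name[name].watermark

  def set_watermark(self, name: str, watermark: float) -> None:
    self._pcollections_by_name[name].watermark = watermark
```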

robertwb · 2021-03-02T01:07:30Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+
+    for input in stage_inputs:
+      pcoll_id = get_pcoll_id(input)
+      if pcoll_id not in updates:


I'll admit I have a hard time keeping the exact ordering here in my head. E.g. is expected_timers in this set? In which of the loops above could updates[pcoll_id] have been set?

I've added comments for these sections. LMK if that helps.

robertwb · 2021-03-02T01:12:47Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+      pcoll_id = get_pcoll_id(tr)
+      updates[pcoll_id] = timestamp.MIN_TIMESTAMP
+
+    for timer_pcoll_id, ts in watermarks_by_transform_and_timer_family.items():


It might be simpler to do something like

for timer_pcoll_id in expected_timers:
  updates[timer_pcoll_id] = watermarks_by_transform_and_timer_family.get(
      timer_pcoll_id, timestamp.MAX_TIMESTAMP)

than these two loops here. Or could timer_pcoll_id be in pcolls_with_da and/or transforms_w_splits?

Done. Much better : )
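
For reference, the single-loop form can be sketched with plain dictionaries. `MAX_TIMESTAMP` is a float stand-in for Beam's constant, and `timer_updates` is a hypothetical wrapper added for illustration.

```python
import math

MAX_TIMESTAMP = math.inf  # stand-in for timestamp.MAX_TIMESTAMP (assumption)


def timer_updates(expected_timers, watermarks_by_transform_and_timer_family):
  """Builds watermark updates for every expected timer PCollection."""
  updates = {}
  for timer_pcoll_id in expected_timers:
    # Timer families with no timers set are released to the maximum.
    updates[timer_pcoll_id] = watermarks_by_transform_and_timer_family.get(
        timer_pcoll_id, MAX_TIMESTAMP)
  return updates
```

The default argument to `dict.get` replaces the first loop that pre-filled every entry, so each timer PCollection is written exactly once.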

codecov · 2021-03-02T20:55:55Z

Codecov Report

Merging #11296 (7246df4) into master (2d9c666) will increase coverage by 0.64%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #11296      +/-   ##
==========================================
+ Coverage   83.04%   83.68%   +0.64%     
==========================================
  Files         469      438      -31     
  Lines       58472    58731     +259     
==========================================
+ Hits        48556    49151     +595     
+ Misses       9916     9580     -336

Impacted Files	Coverage Δ
...python/apache_beam/examples/wordcount_dataframe.py	`0.00% <0.00%> (-92.60%)`	⬇️
...sdks/python/apache_beam/utils/interactive_utils.py	`87.80% <0.00%> (-7.44%)`	⬇️
...thon/apache_beam/runners/worker/sdk_worker_main.py	`72.26% <0.00%> (-5.92%)`	⬇️
...n/apache_beam/runners/direct/test_direct_runner.py	`37.50% <0.00%> (-4.81%)`	⬇️
...tes/tox/py38/build/srcs/sdks/python/test_config.py	`66.66% <0.00%> (-4.77%)`	⬇️
...s/sdks/python/apache_beam/runners/test/__init__.py	`66.66% <0.00%> (-4.77%)`	⬇️
...ld/srcs/sdks/python/apache_beam/io/gcp/__init__.py	`80.00% <0.00%> (-4.62%)`	⬇️
...thon/apache_beam/runners/worker/channel_factory.py	`75.00% <0.00%> (-3.95%)`	⬇️
...python/apache_beam/examples/streaming_wordcount.py	`30.55% <0.00%> (-3.66%)`	⬇️
...s/python/apache_beam/io/gcp/bigquery_file_loads.py	`87.50% <0.00%> (-3.61%)`	⬇️
... and 450 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

pabloem · 2021-03-02T21:09:47Z

also rebased against master

pabloem · 2021-03-12T00:30:00Z

@robertwb PTAL

pabloem · 2021-03-16T17:37:06Z

@robertwb PTAL

pabloem · 2021-03-22T21:17:49Z

@robertwb PTAL

pabloem · 2021-03-29T19:38:14Z

@robertwb PTAL

pabloem · 2021-03-31T20:50:34Z

@y1chi do you think you have time to take a look at this PR?

pabloem · 2021-04-14T20:16:56Z

@y1chi PTAL?

y1chi · 2021-04-14T21:26:07Z

> @y1chi PTAL?

Will do.

y1chi

Is there any test we can use for validating the watermark logic?

y1chi · 2021-04-14T21:40:45Z

sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py

+        if t.spec.urn == bundle_processor.DATA_INPUT_URN
+    }
+    self.watermark_manager = WatermarkManager(stages)
+    # self.watermark_manager.show()


y1chi · 2021-04-14T21:58:05Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+          assert (
+              runner_execution_context.watermark_manager.get_stage_node(
+                  bundle_context_manager.stage.name
+              ).input_watermark() == timestamp.MAX_TIMESTAMP), (


output_watermark()?

initially I'm checking the input_watermark. I'll add a more advanced check in a follow-up change that updates the runner to support per-bundle execution (instead of per-stage)

OK, there is a mismatch between the assertion and the error message so that's why I'm confused.

ah thanks for the catch. fixed that.

y1chi · 2021-04-14T21:59:29Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+          assert (
+              runner_execution_context.watermark_manager.get_stage_node(
+                  bundle_context_manager.stage.name
+              ).input_watermark() == timestamp.MAX_TIMESTAMP), (


are the assertions only for batch?

currently only batch is supported. this will be fixed later on.

y1chi · 2021-04-19T18:54:41Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+                                    timer_family_id)] = timestamp.MAX_TIMESTAMP
+            timer_watermark_data[(transform_id, timer_family_id)] = min(
+                timer_watermark_data[(transform_id, timer_family_id)],
+                decoded_timer.fire_timestamp)


I think this should be decoded_timer.hold_timestamp. Currently timers set hold_timestamp to fire_timestamp, but for the watermark we should still use decoded_timer.hold_timestamp, which prevents potential breakage from BEAM-11507.

thanks for the tip! done.
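
A small sketch of the per-timer-family hold computed from `hold_timestamp` rather than `fire_timestamp`. `DecodedTimer` here is a hypothetical namedtuple standing in for the decoded timer object, and `collect_timer_holds` is an illustrative name.

```python
import math
from collections import namedtuple

MAX_TIMESTAMP = math.inf  # stand-in for timestamp.MAX_TIMESTAMP (assumption)

# Hypothetical stand-in for a decoded timer.
DecodedTimer = namedtuple('DecodedTimer', ['fire_timestamp', 'hold_timestamp'])


def collect_timer_holds(keyed_timers):
  """Maps (transform_id, timer_family_id) to the earliest hold timestamp."""
  timer_watermark_data = {}
  for key, decoded_timer in keyed_timers:
    # Use the hold timestamp: it may differ from the fire timestamp.
    timer_watermark_data[key] = min(
        timer_watermark_data.get(key, MAX_TIMESTAMP),
        decoded_timer.hold_timestamp)
  return timer_watermark_data
```

If a timer fires at 100 but holds the watermark at 60, the timer family's PCollection is held at 60, which is the value downstream watermarks must respect.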

y1chi · 2021-04-19T19:29:07Z

sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py

+      stage_node = WatermarkManager.StageNode(stage_name)
+      self._stages_by_name[stage_name] = stage_node
+
+      def add_pcollection(


Should this function be declared outside of the loop?

y1chi · 2021-04-19T19:33:58Z

sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py

+            self._pcollections_by_name[
+                timer_pcoll_name] = WatermarkManager.PCollectionNode(
+                    timer_pcoll_name)
+            timer_pcoll_node = self._pcollections_by_name[timer_pcoll_name]


why do we need these assertions immediately after updating the map?

these are added to fix type checking issues (ensuring the types are what's expected)
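
As a generic illustration of that pattern (not the code in this PR): when a lookup is typed as returning a `Union`, an `isinstance` assertion narrows it for the type checker. The classes and `set_pcollection_watermark` below are simplified, hypothetical stand-ins.

```python
from typing import Dict, Union


class PCollectionNode:
  def __init__(self, name: str) -> None:
    self.name = name
    self.watermark = float('-inf')


class StageNode:
  def __init__(self, name: str) -> None:
    self.name = name


def set_pcollection_watermark(
    nodes: Dict[str, Union[PCollectionNode, StageNode]],
    name: str,
    watermark: float) -> None:
  node = nodes[name]
  # Narrows Union[PCollectionNode, StageNode] to PCollectionNode for the
  # type checker; at runtime it fails if `name` refers to a stage.
  assert isinstance(node, PCollectionNode)
  node.watermark = watermark
```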

y1chi · 2021-04-19T19:34:48Z

sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py

+          if pcoll_name not in self._pcollections_by_name:
+            self._pcollections_by_name[
+                pcoll_name] = WatermarkManager.PCollectionNode(pcoll_name)
+          pcoll_node = self._pcollections_by_name[pcoll_name]


these are added to fix type checking issues

pabloem · 2021-05-19T17:21:54Z

@y1chi PTAL

aaltay · 2021-05-27T22:42:23Z

What is the next step on this PR?

y1chi · 2021-05-27T23:18:55Z

it LGTM, I only have one more question on https://github.com/apache/beam/pull/11296/files#r635450347

pabloem · 2021-06-08T21:33:51Z

Run Python 3.8 PostCommit

probot-autolabeler bot added the python label Apr 2, 2020

pabloem force-pushed the fn-api-watermarks branch 2 times, most recently from 2797d98 to 694c8e1 Compare April 3, 2020 19:45

pabloem force-pushed the fn-api-watermarks branch 3 times, most recently from 94ca0b0 to 8a0a4c2 Compare April 22, 2020 21:17

robertwb reviewed May 5, 2020

stale bot added the stale label Aug 16, 2020

stale bot closed this Aug 23, 2020

pabloem reopened this Jan 26, 2021

stale bot removed the stale label Jan 26, 2021

pabloem force-pushed the fn-api-watermarks branch 5 times, most recently from 441a2a8 to 75b713e Compare February 2, 2021 20:05

pabloem force-pushed the fn-api-watermarks branch from 75b713e to 5c59afd Compare February 3, 2021 02:06

pabloem force-pushed the fn-api-watermarks branch from cbd391c to 4781255 Compare February 4, 2021 20:35

apache deleted a comment from codecov bot Feb 4, 2021

apache deleted a comment from stale bot Feb 4, 2021

pabloem force-pushed the fn-api-watermarks branch from 68e0101 to aadbef0 Compare February 4, 2021 23:50

apache deleted a comment from codecov bot Feb 5, 2021

robertwb reviewed Mar 2, 2021

pabloem added 7 commits March 2, 2021 13:09

[BEAM-9640] Sketching watermark tracking on FnApiRunner.

d73b4f2

Addressing some comments

9c8d225

Fixups

9166a9b

fixing bug with truncation of restrictions

1835293

Fixing output watermark for stages

d6e4dbc

Moving visualization tools to different file

b379f9a

Addressing comments

fba6e70

pabloem force-pushed the fn-api-watermarks branch from 804c769 to 7c0e075 Compare March 2, 2021 21:09

pabloem force-pushed the fn-api-watermarks branch from 7c0e075 to 109f6ef Compare March 2, 2021 21:20

Fix lint

e6dd9ed

pabloem force-pushed the fn-api-watermarks branch from 109f6ef to e6dd9ed Compare March 2, 2021 21:30

y1chi reviewed Apr 19, 2021

Addressing comments

7a3b924

Fix log message on assertions

7246df4

pabloem merged commit de29bc5 into apache:master Jun 9, 2021

pabloem deleted the fn-api-watermarks branch June 9, 2021 17:58
