[BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages #15441

pabloem · 2021-09-01T18:52:59Z

This PR modifies the FnApiRunner to execute pipelines per-bundle instead of per-stage.

The previous implementation worked as follows:

1. Order the pipeline DAG topologically.
2. Pick the next stage in the DAG's topological sort.
3. Pass the input bundles to this stage.
4. If there are any deferred inputs from the stage, go back to step 3.
5. If there are more stages to execute, go to step 2.
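As a rough illustration of the per-stage loop described above, here is a minimal sketch using only the standard library (`graphlib` requires Python 3.9+). The function `run_per_stage`, the `StageFn` signature, and the convention that a stage returns `(outputs, deferred_inputs)` are assumptions made for this example, not FnApiRunner's actual API.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable, Dict, List, Tuple

Bundle = List[int]
# Hypothetical stage function: returns (outputs, deferred inputs).
StageFn = Callable[[Bundle], Tuple[Bundle, Bundle]]


def run_per_stage(stages: Dict[str, StageFn],
                  upstream: Dict[str, List[str]],
                  initial: Dict[str, Bundle]) -> Dict[str, Bundle]:
  """Runs every stage to completion before moving to the next one."""
  outputs: Dict[str, Bundle] = {}
  # Order the DAG topologically and walk it one stage at a time.
  for name in TopologicalSorter(upstream).static_order():
    # Gather this stage's inputs from its initial data and upstream stages.
    pending = list(initial.get(name, []))
    for parent in upstream.get(name, []):
      pending.extend(outputs[parent])
    produced: Bundle = []
    # Re-run the stage for as long as it defers part of its input.
    while pending:
      out, pending = stages[name](pending)
      produced.extend(out)
    outputs[name] = produced
  return outputs
```

In this sketch a downstream stage is invoked once with everything its upstream stages produced, which is the behaviour the PR moves away from.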

The new implementation works as follows:

1. Create three work queues: (1) ready bundles, (2) watermark-pending bundles, and (3) real-time-pending bundles.
2. Add all the initial bundles (i.e. IMPULSE bundles) to the ready queue.
3. Fetch the next ready bundle and the stage that consumes it.
4. Execute this bundle.
5. Enqueue all outputs from the execution of the bundle (deferred inputs, downstream outputs).
6. If any of the outputs from the bundle are side inputs downstream, store them in state.
7. Update all of the watermarks.
8. If there are no ready bundles remaining, check the watermark-pending and time-pending queues to find new ready bundles.
9. If there are more ready bundles, go to step 3.
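The loop above can be sketched roughly as follows. This is a simplified, batch-only illustration, not the runner's code: `run_per_bundle`, `StageFn`, and the `unprocessed` counter used as a stand-in for watermark tracking are all assumptions. Side inputs and the real-time-pending queue are left out. Step numbers in the comments count the list items above in order.

```python
import collections
from typing import Callable, Dict, List, Tuple

Bundle = List[int]
# Hypothetical stage function: returns (downstream outputs, deferred inputs).
StageFn = Callable[[Bundle], Tuple[Bundle, Bundle]]


def run_per_bundle(stages: Dict[str, StageFn],
                   consumers: Dict[str, List[str]],
                   impulses: Dict[str, Bundle]) -> Dict[str, Bundle]:
  # Invert the consumer map so each stage knows which stages feed it.
  upstream = collections.defaultdict(list)
  for producer, downstream in consumers.items():
    for consumer in downstream:
      upstream[consumer].append(producer)

  ready = collections.deque()  # (stage, bundle) pairs that can run now
  watermark_pending = []       # (stage, bundle) pairs held for upstream
  unprocessed = collections.Counter()  # queued bundles per stage
  outputs = collections.defaultdict(list)

  def is_done(stage: str) -> bool:
    # Batch stand-in for "the stage's watermark reached its maximum".
    return unprocessed[stage] == 0 and all(
        is_done(parent) for parent in upstream[stage])

  # Step 2: the initial (impulse) bundles are ready immediately.
  for stage, bundle in impulses.items():
    ready.append((stage, bundle))
    unprocessed[stage] += 1

  while ready:
    stage, bundle = ready.popleft()        # step 3
    out, deferred = stages[stage](bundle)  # step 4
    unprocessed[stage] -= 1
    outputs[stage].extend(out)
    if deferred:                           # step 5: deferred inputs
      ready.append((stage, deferred))
      unprocessed[stage] += 1
    if out:                                # step 5: downstream outputs
      for consumer in consumers.get(stage, []):
        watermark_pending.append((consumer, list(out)))
        unprocessed[consumer] += 1
    if not ready:                          # step 8: promote pending work
      still_pending = []
      for consumer, pending in watermark_pending:
        if all(is_done(parent) for parent in upstream[consumer]):
          ready.append((consumer, pending))
        else:
          still_pending.append((consumer, pending))
      watermark_pending = still_pending
  return dict(outputs)
```

Unlike a per-stage loop, a downstream stage here is invoked once per bundle it receives, and it only starts once all of its upstream stages have no queued work left.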

I'm happy to dive into detail for this PR.

codecov · 2021-09-22T18:39:27Z

Codecov Report

Merging #15441 (70fd62b) into master (476efbb) will decrease coverage by 0.29%.
The diff coverage is 88.19%.

@@            Coverage Diff             @@
##           master   #15441      +/-   ##
==========================================
- Coverage   83.79%   83.50%   -0.30%     
==========================================
  Files         444      445       +1     
  Lines       60474    61413     +939     
==========================================
+ Hits        50674    51281     +607     
- Misses       9800    10132     +332

Impacted Files	Coverage Δ
...nners/portability/fn_api_runner/worker_handlers.py	`79.34% <50.00%> (-0.11%)`	⬇️
...eam/runners/portability/fn_api_runner/execution.py	`91.94% <87.19%> (-1.40%)`	⬇️
.../runners/portability/fn_api_runner/translations.py	`93.10% <88.88%> (-0.10%)`	⬇️
...eam/runners/portability/fn_api_runner/fn_runner.py	`89.75% <89.38%> (-1.06%)`	⬇️
sdks/python/apache_beam/io/iobase.py	`86.21% <100.00%> (ø)`
...ers/portability/fn_api_runner/watermark_manager.py	`96.00% <100.00%> (ø)`
sdks/python/apache_beam/transforms/util.py	`95.84% <100.00%> (+0.02%)`	⬆️
sdks/python/apache_beam/io/gcp/bigquery.py	`62.72% <0.00%> (-12.84%)`	⬇️
...ython/apache_beam/runners/interactive/sql/utils.py	`76.09% <0.00%> (-7.91%)`	⬇️
...he_beam/runners/interactive/sql/beam_sql_magics.py	`49.75% <0.00%> (-4.79%)`	⬇️
... and 29 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

pabloem · 2021-09-27T18:22:06Z

r: @y1chi (see first comment for a quick description of the change)

y1chi · 2021-09-27T21:14:25Z

Thanks Pablo, I'll take a look and reach out if I have questions.

y1chi

Overall LGTM.

y1chi · 2021-10-01T18:41:33Z

sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py

    self.watermark_manager = WatermarkManager(stages)
+    # from apache_beam.runners.portability.fn_api_runner import \


Should this be removed?

pabloem · 2021-10-04

I'd like to keep this here as a hint to show that the pipeline can be visualized here.

y1chi · 2021-10-01T21:54:05Z

sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py

+      _LOGGER.debug(
+          'Enqueuing stage pending watermark. Stage name: %s', stage.name)
+      self.queues.watermark_pending_inputs.enque(
+          ((stage.name, MAX_TIMESTAMP), DataInput(data_input, {})))


MAX_TIMESTAMP here seems weird to me. Should it be MIN_TIMESTAMP if we expect the input to be ready (maybe for streaming)?

pabloem · 2021-10-05

That's right. For batch, we need the upstream stage to be fully processed before moving forward. As I work on streaming, this will change to be attributed to a more appropriate timestamp.
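To make the batch behaviour concrete, here is a small sketch of a watermark-pending queue. The class `WatermarkPendingQueue`, its `pop_ready` method, and the use of infinities for `MIN_TIMESTAMP` and `MAX_TIMESTAMP` are assumptions for illustration only. Only the `enque((stage_name, timestamp), item)` call shape is taken from the diff above.

```python
import math
from typing import Any, Dict, List, Tuple

# Stand-ins for Beam's timestamp bounds (assumed values for this sketch).
MIN_TIMESTAMP = -math.inf
MAX_TIMESTAMP = math.inf


class WatermarkPendingQueue:
  """Holds inputs until their stage's input watermark reaches a timestamp."""
  def __init__(self) -> None:
    self._pending: List[Tuple[str, float, Any]] = []

  def enque(self, key: Tuple[str, float], item: Any) -> None:
    stage_name, ready_at = key
    self._pending.append((stage_name, ready_at, item))

  def pop_ready(
      self, input_watermarks: Dict[str, float]) -> List[Tuple[str, Any]]:
    """Removes and returns the inputs whose stage watermark has caught up."""
    ready, kept = [], []
    for stage_name, ready_at, item in self._pending:
      watermark = input_watermarks.get(stage_name, MIN_TIMESTAMP)
      if watermark >= ready_at:
        ready.append((stage_name, item))
      else:
        kept.append((stage_name, ready_at, item))
    self._pending = kept
    return ready
```

With `MAX_TIMESTAMP` as the release point, an input stays pending until the stage's input watermark has advanced all the way, which in batch means the upstream stage has been fully processed. Enqueuing with `MIN_TIMESTAMP` instead would release the input immediately.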

y1chi · 2021-10-01T22:04:31Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+            # Timer was cleared, so we must skip setting it below.
+            timer_cleared = True
+            continue
+        if timer_cleared or (transform_id,


What happens if the same timer was set multiple times with clear being called in between?

pabloem · 2021-10-04

The SDK would only send back the latest of these events, independently of what it is. Is that reasonable?

y1chi · 2021-10-05

I'm not sure that's the case when it is a gRPC data channel: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/bundle_processor.py#L669. IIUC, every time a timer calls set, it will write an output timer to the output queue.

pabloem · 2021-10-05

Ah, my bad. Only the last one will be saved: see lines 547-548 of the file, where we only store the latest event for a given timer+window+key.

y1chi · 2021-10-06

And if a timer is cleared for a certain key and window, we ignore all the other set timers for the timer family in the bundle. Am I misunderstanding the condition here?

pabloem · 2021-10-06

See lines 546-548: we decode all the timers that have been written, and we key them by (key, window) in a dictionary. Note that if there are multiple timers in the same (key, window), only the latest one will be saved in the timers_by_key_and_window dictionary.

Then, in the loop starting at line 551, we read the latest timer action for each (key, window).

So we will only apply the latest action, whether it is a clear or not.

y1chi · 2021-10-07

After the loop starting at line 551, timer_cleared is set to True as long as one (key, window) has a cleared timer, and all other (key, window) timers are skipped and not appended to newly_set_timers, because timer_cleared is True and the continue jumps to the next iteration of the loop starting at line 537. Isn't it?

pabloem · 2021-10-08

Ah, great observation Yichi. I've changed this code: we keep timers per dynamic timer tag, and we clean up only previous sets but allow new sets to work, so we'll only use the LATEST action on the timer.

Pushing code in a bit.
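The behaviour this thread converges on, applying only the latest action per timer, can be sketched as follows. `TimerEvent`, `apply_timer_events`, and the use of `None` to mark a cleared timer are hypothetical names and conventions for this example. The runner's actual data structures differ.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class TimerEvent(NamedTuple):
  dynamic_tag: str
  key: str
  window: str
  fire_at: Optional[int]  # None means the timer was cleared


TimerId = Tuple[str, str, str]  # (dynamic tag, key, window)


def apply_timer_events(active: Dict[TimerId, int],
                       events: List[TimerEvent]) -> Dict[TimerId, int]:
  """Returns the active timers after applying one bundle's timer events."""
  # Events arrive in order, so later events overwrite earlier ones.
  latest: Dict[TimerId, TimerEvent] = {}
  for event in events:
    latest[(event.dynamic_tag, event.key, event.window)] = event

  updated = dict(active)
  for timer_id, event in latest.items():
    if event.fire_at is None:
      # A clear removes only this timer, not its neighbours.
      updated.pop(timer_id, None)
    else:
      updated[timer_id] = event.fire_at
  return updated
```

Because the clear is handled per timer identity, a cleared timer for one (key, window) does not cause sets for other (key, window) pairs in the same bundle to be skipped, which is the problem raised in the review.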

y1chi · 2021-10-01T22:23:45Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+        buffer = runner_execution_context.pcoll_buffers.get(
+            buffer_id, ListBuffer(None))
+
+        if buffer and buffer_id in buffers_to_clean:
+          runner_execution_context.pcoll_buffers[buffer_id] = buffer.copy()
+          buffer = runner_execution_context.pcoll_buffers[buffer_id]
+        if buffer_id in runner_execution_context.pcoll_buffers:
+          buffers_to_clean.add(buffer_id)


I found it a bit hard to follow the logic here. Are we just trying to pop the PCollection buffer with the buffer id?

pabloem · 2021-10-04

I've added comments for the special cases. Let me know if that makes sense.

pabloem · 2021-10-05T00:10:51Z

oops looking at failures...

pabloem · 2021-10-05T21:16:22Z

thanks @y1chi ! this is ready for another round of review : )

y1chi

The commit addressing the comments seems to be missing.

pabloem · 2021-10-06T22:17:18Z

the commit that tries to address comments is this one: 4a4067b

it got stumped a bit by the merge commit on top

y1chi · 2021-10-07T00:04:48Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+            # Timer was cleared, so we must skip setting it below.
+            timer_cleared = True
+            continue
+        if timer_cleared or (transform_id,


After the loop starting at line 551, timer_cleared is set to True as long as one (key, window) has a cleared timer, and all other (key, window) timers are skipped and not appended to newly_set_timers, because timer_cleared is True and the continue jumps to the next iteration of the loop starting at line 537. Isn't it?

y1chi · 2021-10-07T00:25:01Z

sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py

+          runner_execution_context.pcoll_buffers[buffer_id] = buffer.copy()
+          buffer = runner_execution_context.pcoll_buffers[buffer_id]


How is runner_execution_context.pcoll_buffers[buffer_id] = buffer.copy() creating a copy for every stage? Isn't it just overriding the original buffer with its own copy?

pabloem · 2021-10-08

Right, so the flow is like this:

1. Run the stage, and write its outputs to pcoll_buffers.
2. For each stage output, do:
3. Get the data buffer from pcoll_buffers.
4. Find its next downstream consumer and enqueue this buffer to be consumed.
5. If there are more downstream consumers, make a copy of the buffer, add it to pcoll_buffers, and go back to point 3.
6. If there are no more downstream consumers, continue to the next step.

In the general case, a PCollection will have only one consumer, so the buffer will not need to be copied. If there are multiple downstream consumers, we create copies starting with the second one, so that each buffer copy is pushed to exactly one consumer.
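A minimal sketch of that fan-out, assuming buffers are plain Python lists and using a hypothetical helper `enqueue_for_consumers` (the runner's real buffers and queues are richer than this):

```python
from typing import Dict, List


def enqueue_for_consumers(pcoll_buffers: Dict[str, List[bytes]],
                          buffer_id: str,
                          consumers: List[str]) -> Dict[str, List[bytes]]:
  """Hands each consumer its own buffer object, copying only when needed."""
  enqueued: Dict[str, List[bytes]] = {}
  for index, consumer in enumerate(consumers):
    if index > 0:
      # An earlier consumer already holds the stored buffer, so store a
      # fresh copy under the same id before fetching it again.
      pcoll_buffers[buffer_id] = list(pcoll_buffers[buffer_id])
    enqueued[consumer] = pcoll_buffers[buffer_id]
  return enqueued
```

Writing the copy back under the same id means the next lookup in `pcoll_buffers` returns a buffer that no consumer holds yet. With a single consumer, no copy is made at all.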

y1chi · 2021-10-08

Ah OK, I guess what confused me is why we need to write the copy back to pcoll_buffers.

pabloem · 2021-10-08T21:50:08Z

Run Python 3.8 PostCommit

pabloem · 2021-10-08T21:53:08Z

Addressed your timer comments here: #15441 (comment), and with the latest commit.

y1chi

LGTM

…xecuting ready elements instead of stages * Revert "Revert "Merge pull request #15441 from [BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages"" This reverts commit a2f08e5. * improving/fixing side input handling * fixup * fixup * fixup * fixup and new test * fixup * fixup * Addressing comments * Adding comments for clarity

pabloem added 18 commits February 1, 2021 21:37

[BEAM-9640] Sketching watermark tracking on FnApiRunner.

40ed4d8

Addressing some comments

c1d8f54

Fixups

5c59afd

fixing bug with truncation of restrictions

f695058

Fixing output watermark for stages

4781255

Moving visualization tools to different file

aadbef0

Individual stages are run bundle-based

6d15f2c

[wip] Working on per-bundle execution

863d1a5

fixups

69a44d7

Merge branch 'master' of https://github.com/apache/beam into per-bundle-execution

197823b

Fixing tests

3edb6bc

cleanup

e03dbd9

fixing lint, formatting, some typing info

9ca0022

More tests passing

708098b

fix import

802eb45

fix test

8632ae6

Fix most other tests

455308c

all passin

d1984dd

pabloem force-pushed the per-bundle-execution branch from c5f534f to d1984dd September 21, 2021 19:31

pabloem added 3 commits September 21, 2021 16:17

testout

4150ac6

testing weird fix, hehe

95a9813

fixup

dddf9f3

pabloem changed the title ~~Per bundle execution~~ [BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages Sep 22, 2021

pabloem marked this pull request as ready for review September 22, 2021 18:59

pabloem added 5 commits September 22, 2021 13:41

Fix formatting

01f9fa4

fix typing issues

bb6cc4b

fixup

d2be2fa

fixup

71221b2

cleanup

a669a68

pabloem requested a review from y1chi September 27, 2021 18:21

y1chi reviewed Oct 1, 2021

addressing comments

4a4067b

pabloem added 2 commits October 4, 2021 17:20

fix typoschmypo

64e354e

Merge remote-tracking branch 'origin/master' into per-bundle-execution

2e8d046

pabloem requested a review from y1chi October 5, 2021 21:33

y1chi reviewed Oct 6, 2021

y1chi reviewed Oct 7, 2021

proper timer handling

70fd62b

y1chi approved these changes Oct 11, 2021

pabloem merged commit ef43645 into apache:master Oct 12, 2021

robertwb added a commit to robertwb/incubator-beam that referenced this pull request Oct 14, 2021

Revert "Merge pull request apache#15441 from [BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages"

a2f08e5

This reverts commit ef43645.

dmitriikuzinepam pushed a commit to dmitriikuzinepam/beam that referenced this pull request Nov 2, 2021

Revert "Merge pull request apache#15441 from [BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages"

f830076

This reverts commit ef43645.

pabloem added a commit to pabloem/beam that referenced this pull request Nov 2, 2021

Revert "Revert "Merge pull request apache#15441 from [BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages""

1203e4a

This reverts commit a2f08e5.

pabloem added a commit to pabloem/beam that referenced this pull request Feb 13, 2022

Revert "Revert "Merge pull request apache#15441 from [BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages""

d2149f7

This reverts commit a2f08e5.

pabloem added a commit to pabloem/beam that referenced this pull request Mar 17, 2022

Revert "Revert "Merge pull request apache#15441 from [BEAM-8823] Make FnApiRunner work by executing ready elements instead of stages""

1d6b08b

This reverts commit a2f08e5.
