-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-9640] Sketching watermark tracking on FnApiRunner #11296
Conversation
2797d98
to
694c8e1
Compare
94ca0b0
to
8a0a4c2
Compare
r: @robertwb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took so long to get to this. Most of my questions are around watermark advancement.
sdks/python/apache_beam/runners/portability/fn_api_runner/execution.py
Outdated
Show resolved
Hide resolved
sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py
Outdated
Show resolved
Hide resolved
sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py
Outdated
Show resolved
Hide resolved
sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py
Outdated
Show resolved
Hide resolved
sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py
Outdated
Show resolved
Hide resolved
sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py
Outdated
Show resolved
Hide resolved
w = min(i.watermark() for i in self.inputs) | ||
return w | ||
|
||
def input_watermark(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right, the input watermarks should always be an upper bound on the output watermark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I've changed the output_watermark to be calculated based on downstream PCollections.
sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py
Show resolved
Hide resolved
def set_watermark(self, wm): | ||
raise NotImplementedError('Stages do not have a watermark') | ||
|
||
def output_watermark(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to take into account data that's "in flight." E.g. all the input watermarks could be at max-timestamp, but that doesn't mean that all the inputs' data has been consumed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this implementation to rely on the watermark from downstream PCollections.
sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py
Outdated
Show resolved
Hide resolved
no worries. This is a critical component, and I have other work to do, so I'm glad to get a thoughtful review. I'll address your comments soon. |
441a2a8
to
75b713e
Compare
75b713e
to
5c59afd
Compare
Run Python 3.8 PostCommit |
cbd391c
to
4781255
Compare
@robertwb sorry about the long delay on this, but I've rebased this, and addressed some of your older comments. |
cc: @tvalentyn FYI I am now focusing on this change |
68e0101
to
aadbef0
Compare
@staticmethod | ||
def _collect_written_timers( | ||
bundle_context_manager: execution.BundleContextManager, | ||
newly_set_timers: Dict[Tuple[str, str], ListBuffer], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a return value as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
This function reviews a stage that has just been run. The stage will have | ||
written timers to its output buffers. The function then takes the timers, | ||
and adds them to the `newly_set_timers` dictionary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
producer_name] | ||
# We take the output with tag 'out' from the producer transform. The | ||
# producer transform is a GRPC read, and it has a single output. | ||
pcolls_with_delayed_apps.add(transform.outputs['out']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More flexibly, you could do only_element(transform.values())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import graphviz | ||
except ImportError: | ||
import warnings | ||
warnings.warn('Unable to draw pipeline. graphviz library missing.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to make this a dependency? (E.g. is it fairly large?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right - since the utilities are only used for internal debugging of the runner, I hesitate to add it as a dependency - it is also large as you point out. I don't think it fits under the test
tag dependencies - we may need an extra tag for internal
dependencies or something like that. I'm happy to add a new tag, or leave it out. Thoughts?
sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py
Show resolved
Hide resolved
assert isinstance(pcnode, WatermarkManager.PCollectionNode) | ||
snode.inputs.add(pcnode) | ||
node = self._watermarks_by_name[pcname] | ||
assert isinstance(node, WatermarkManager.PCollectionNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we just assert that above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true. removed.
sdks/python/apache_beam/runners/portability/fn_api_runner/watermark_manager.py
Show resolved
Hide resolved
# type: (str) -> Union[PCollectionNode, StageNode] | ||
return self._watermarks_by_name[name] | ||
|
||
def get_watermark(self, name) -> timestamp.Timestamp: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get/set watermark is never used on stage nodes, right? Does it make sense to keep them in the same dictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, you're right. I was also having that sense when I was jumping hoops to unify the typing. I've separated them. Thanks!
|
||
for input in stage_inputs: | ||
pcoll_id = get_pcoll_id(input) | ||
if pcoll_id not in updates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll admit I have a hard time keeping the exact ordering here in my head. E.g. is expected_timers in this set? In which of the loops above could updates[pcoll_id] have been set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added comments for these sections. LMK if that helps.
pcoll_id = get_pcoll_id(tr) | ||
updates[pcoll_id] = timestamp.MIN_TIMESTAMP | ||
|
||
for timer_pcoll_id, ts in watermarks_by_transform_and_timer_family.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be simpler to do something like
for timer_pcoll_id in expected_timers:
updates[timer_pcoll_id] = watermarks_by_transform_and_timer_family.get(
timestamp.MAX_TIMESTAMP)
than these two loops here. Or could timer_pcoll_id be in pcolls_with_da and/or transforms_w_splits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Much better : )
Codecov Report
@@ Coverage Diff @@
## master #11296 +/- ##
==========================================
+ Coverage 83.04% 83.68% +0.64%
==========================================
Files 469 438 -31
Lines 58472 58731 +259
==========================================
+ Hits 48556 49151 +595
+ Misses 9916 9580 -336
Continue to review full report at Codecov.
|
804c769
to
7c0e075
Compare
also rebased against master |
7c0e075
to
109f6ef
Compare
109f6ef
to
e6dd9ed
Compare
@robertwb PTAL |
@y1chi do you think you have time to take a look at this PR? |
@y1chi PTAL? |
Will do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any test we can use for validating the watermark logic?
if t.spec.urn == bundle_processor.DATA_INPUT_URN | ||
} | ||
self.watermark_manager = WatermarkManager(stages) | ||
# self.watermark_manager.show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
assert ( | ||
runner_execution_context.watermark_manager.get_stage_node( | ||
bundle_context_manager.stage.name | ||
).input_watermark() == timestamp.MAX_TIMESTAMP), ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output_watermark()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initially I'm checking the input_watermark. I'll add a more advanced check in a follow-up change that updates the runner to support per-bundle execution (instead of per-stage)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, there is a mismatch between the assertion and the error message so that's why I'm confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah thanks for the catch. fixed that.
assert ( | ||
runner_execution_context.watermark_manager.get_stage_node( | ||
bundle_context_manager.stage.name | ||
).input_watermark() == timestamp.MAX_TIMESTAMP), ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the assertions only for batch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently only batch is supported. this will be fixed later on.
timer_family_id)] = timestamp.MAX_TIMESTAMP | ||
timer_watermark_data[(transform_id, timer_family_id)] = min( | ||
timer_watermark_data[(transform_id, timer_family_id)], | ||
decoded_timer.fire_timestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be decoded_timer.hold_timestamp, currently timers will set hold_timestamp to fire_timestamp but I think for watermark we should still use decoded_timer.hold_timestamp which prevents potential breakage from BEAM-11507.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the tip! done.
stage_node = WatermarkManager.StageNode(stage_name) | ||
self._stages_by_name[stage_name] = stage_node | ||
|
||
def add_pcollection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this function declared outside of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
self._pcollections_by_name[ | ||
timer_pcoll_name] = WatermarkManager.PCollectionNode( | ||
timer_pcoll_name) | ||
timer_pcoll_node = self._pcollections_by_name[timer_pcoll_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need these assertions immediately after updating the map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are added to fix type checking issues (ensuring the types are what's expected)
if pcoll_name not in self._pcollections_by_name: | ||
self._pcollections_by_name[ | ||
pcoll_name] = WatermarkManager.PCollectionNode(pcoll_name) | ||
pcoll_node = self._pcollections_by_name[pcoll_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are added to fix type checking issues
@y1chi PTAL |
What is the next step on this PR? |
it LGTM, I only have one more question on https://github.com/apache/beam/pull/11296/files#r635450347 |
Run Python 3.8 PostCommit |
r: @robertwb
This change adds an initial 'sketch' of watermark tracking to the batch runner. Watermark tracking is done like so:
MIN_WATERMARK
.MIN_WATERMARK
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.