[BEAM-14244] Use the supplied output timestamp for processing time timers rather than the input watermark #17262
Conversation
Can one of the admins verify this patch?
3 similar comments
Can one of the admins verify this patch?
Can one of the admins verify this patch?
Can one of the admins verify this patch?
Force-pushed …han the input watermark from 24094bc to 7b950b5 (Compare)
effectiveTimestamp =
    outputTimestamp != null
        ? outputTimestamp
        : outputWatermark != null ? outputWatermark : inputWatermark;
Should the default be BoundedWindow.TIMESTAMP_MIN_VALUE? I also think this should be min(window.maxTimestamp, outputWatermark).
I really don't even think there should be a default; afaik outputTimestamp is always non-null. It's set here in TimerInternalsTimer, although it's possible a runner implementation doesn't correctly round-trip this information at firing time. I'm still curious why it's like this; my guess is that it was from before timers had output timestamps, and was just never updated when they were introduced.
Ideally this entire switch could be removed, and effectiveTimestamp is always outputTimestamp.
IMO there should be little validation here; by the time the timer has fired, the output timestamp should already have been validated, although a check such as assert(!outputTimestamp.isAfter(window.maxTimestamp())) would have been useful to catch this much earlier.
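To make that concrete, here is a minimal sketch of the simplification being discussed (use outputTimestamp directly and validate it against the window). The class and method names are mine, not code from this PR:

```java
import org.joda.time.Instant;

class EffectiveTimestampSketch {
  /**
   * Sketch only: assuming outputTimestamp is always non-null, the watermark fallbacks go away
   * and the remaining work is validating it against the window's maximum timestamp.
   */
  static Instant effectiveTimestamp(Instant outputTimestamp, Instant windowMaxTimestamp) {
    if (outputTimestamp.isAfter(windowMaxTimestamp)) {
      throw new IllegalArgumentException(
          "Timer output timestamp "
              + outputTimestamp
              + " is after the window's maximum timestamp "
              + windowMaxTimestamp);
    }
    return outputTimestamp;
  }
}
```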
IMO, this should always be outputTimestamp. If outputTimestamp is set, then I think overriding it with something else (input watermark, output watermark) is incorrect.
Agreed, this chooses outputTimestamp, output watermark, input watermark, in that order. In practice I've never seen the output timestamp not set; should I change this to use only the output timestamp and assert it's non-null?
Yes, I was referring to the case when outputTimestamp is null (for whatever reason). Is a processing-time outputTimestamp guaranteed to be inside the window? Even when it is relative?
I updated the comments and cleaned up the logic in setAndVerifyOutputTimestamp; I think it should be much clearer how it's supposed to work now.
I'm still not sure why outputTimestamp would ever be null! Does Flink set it to null?
I've never seen it actually be null, I'm happy to simplify all of this if we're assuming it can't be.
done
@reuvenlax can you check this out again? It's much simpler now without the explicit need to handle a null case.
Commenting just to subscribe to the conversation and learn.
Run Java PreCommit
@@ -305,7 +305,10 @@ public void processElement(
eventTimerWithOutputTimestamp
    .withOutputTimestamp(timerOutputTimestamp)
    .set(timerTimestamp);
processingTimer.offset(Duration.millis(timerTimestamp.getMillis())).setRelative();
processingTimer
@je-ik can you check this change out? Now that the output timestamp is actually used, I had to change these tests to reflect it. I think I've captured the intent of what they were testing.
It seems to me that the test with regard to the watermark contract makes little sense for processing time timers. If we want to keep the test, can we change the validation to verify that the output timestamp is equal to output watermark instead of explicitly setting the output timestamp?
I must say I'm not sure how to interpret a user-defined withOutputTimestamp for processing-time timers (and even less so for relative timers like this one). There is no relation between the two time domains, and processing time timers do not hold output watermark; that seems very error prone. cc @reuvenlax
processing time timers do not hold output watermark
Possibly not all runners implement it, but certainly on Dataflow processing time timers hold the watermark at their output timestamp. And again, if the user doesn't explicitly set the output timestamp, it is defaulted to the timestamp of the element (or timer) that set it.
a005fd7 is the commit that added support for this.
OK, I somehow managed to overlook this. :)
Two notes:
a) Yes, for instance the FlinkRunner definitely holds watermarks only for event-time timers (this is bad; it looks like we are missing more @ValidatesRunner tests), and
b) this seems inconsistent with how event-time timers work, which by default set the output timestamp based on the firing timestamp. Do we have a reason for such a difference? And what about looping processing-time timers, where there is actually no element? Defaulting the output timestamp to the output watermark seems logically consistent with event-time timers, because the output watermark gives the event-time position.
this seems inconsistent with how event-time timers work, which set output timestamp based on the fire timestamp (by default). Do we have a reason for such a difference?
My guess would be that the "firing timestamp" is in a different time domain here, so the element's input timestamp is the only thing in the event-time domain that can be used to set the hold.
And what about looping processing-time timers, where there is actually no element?
The output timestamp of the timer (the timer executing the OnTimer callback) is used.
I actually do agree that the default being the output watermark if the output timestamp on the timer is unset makes more sense. This would also remove the need for a watermark hold at all, and remove a source of confusion around processing time timers. It seems like it'd be a larger change though.
It seems that this change was actually introduced later, which is why the comment on setAndVerifyOutputTimestamp says that processing-time timers should not have an output timestamp.
My guess would be that "firing timestamp" is in a different time domain here, so element input timestamp is the only thing that can be used that's in the event time domain to set the hold on.
The different time domain relates to when the timer should fire. The default output timestamp can be (and I think should be) consistent in both cases: it is the position in event time. In the case of an event-time timer that is the firing timestamp; in the case of a processing-time timer it is the output watermark.
Agreed that making this consistent would be larger work, but we should know what the correct behavior is and whether we actually have runners that violate the model.
I think the other problem is that this decision has been reified into multiple SDKs and the portability API; all of them currently use the input watermark when setting the output timestamp for processing time timers, so that seems to imply the current behavior is the intended behavior.
To close the loop here: I updated the Flink DoFnOperator to preserve the existing behavior. If it weren't changed, it could potentially produce late data, since there's no watermark hold on the output timestamp. In a follow-up review we can fix it to set a watermark hold on the output timestamp of processing time timers (essentially unifying them with event time timers).
My bad. This fixes it for both code paths. OK, makes sense. 👍
+1 on there being two different timestamps: the firing timestamp and the output timestamp.
The firing timestamp can be in the processing time domain or the event time domain.
The output timestamp is always in the event time domain and is used to hold the watermark.
For event time timers, the output timestamp being equivalent to the firing time was expected to be a natural default, since users would want to produce data with the same output timestamp as the firing timestamp, but users could still override it by setting the output timestamp separately from the firing timestamp.
For processing time timers, the output timestamp can't default to the firing time, since the firing time is in a different time domain. This is why the output timestamp defaults to the input element/timer timestamp and can be overridden to a different output timestamp.
Also, any scheduled timers should have an input timestamp equal to the output timestamp that was set, allowing new timers scheduled from the timer callback to get reasonable defaults as well.
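For readers following along, a small sketch of what this looks like from the user's side, based on the standard Beam Timer API (the DoFn name, timer id, and durations are invented for illustration; this is not code from the PR):

```java
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;
import org.joda.time.Instant;

class FlushAfterProcessingTimeFn extends DoFn<KV<String, String>, String> {

  // Firing time is in the processing-time domain; the output timestamp is in event time.
  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void processElement(
      @Timestamp Instant elementTimestamp, @TimerId("flush") Timer flush) {
    flush
        // Hold the output watermark at the element's timestamp. Leaving withOutputTimestamp
        // out gives the same default for processing-time timers, per the discussion above.
        .withOutputTimestamp(elementTimestamp)
        // Fire roughly one minute of processing time from now.
        .offset(Duration.standardMinutes(1))
        .setRelative();
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext context) {
    // With this PR, the element produced here carries the timer's output timestamp rather
    // than the input watermark at the moment the timer fires.
    context.output("flushed");
  }
}
```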
…old at the output timestamp for processing time timers
...src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java
LGTM. Do you plan to do the follow-up for watermark hold in Flink? Would you do a @ValidatesRunner test for that?
Run Flink ValidatesRunner
Run Dataflow ValidatesRunner
If I have some time I can take a swing at it, but I don't think it'd be for a little while. I also don't use Flink, so verifying that it actually works correctly will be trickier. :) On the surface the fix looks pretty easy: just add a watermark hold in DoFnOperator for processing time timers as it does for event time ones (and clear it as well).
Yes, it should actually be about removing several checks for the EVENT_TIME time domain and adding a test to verify it works. Could you file a tracking JIRA for that?
Looks like this broke the ValidatesRunner for Flink; I'll check it out.
Run Flink ValidatesRunner
1 similar comment
Run Flink ValidatesRunner
Seems like some of these tests might just be flaky/racy; it didn't fail the next run, but another unrelated one did... Also, I can't reproduce the failure locally, so I think it's just a racy test.
Run Java PreCommit
Run Flink ValidatesRunner
Last Flink VR passed 🎊
The question is whether the test was flaky before the change. :) I have the feeling that the Flink ValidatesRunner tests were pretty stable (but it has been some time since I worked on the runner).
Yeah, agreed. I'm suspicious that there's a …
Run Flink ValidatesRunner
This one is unrelated to timers. The one that I saw the first time looked like it could be related.
Yeah, I only saw it fail that first time though. I've run it ~50 times locally too and never had it fail again.
OK, let's merge it after we have all tests green. 👍
// https://github.com/apache/beam/pull/17262 processing time timers did not correctly emit
// elements at their output timestamp. In this case we need to continue doing the wrong thing
// and using the output watermark rather than the firing timestamp. Once flink correctly sets
// a watermark hold for the output timestamp, this should be changed back.
How difficult would it be to get Flink to set a watermark hold for processing-time timers as well? The fact that it doesn't seems like a significant bug.
It shouldn't be too hard (famous last words?) #17262 (comment)
It looks like this is handled in onNewEventTimeTimer, and it doesn't look difficult to do the same for processing-time timers.
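To sketch the bookkeeping involved (this is illustrative only, not DoFnOperator's actual internals; the class and method names are invented): register a hold at each pending timer's output timestamp, for processing-time timers exactly as for event-time ones, and cap the output watermark at the earliest hold.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only, not DoFnOperator internals: holds keyed by output timestamp in millis. */
class WatermarkHoldsSketch {
  private final TreeMap<Long, Integer> holds = new TreeMap<>();

  /** Called when any timer is set, regardless of its time domain. */
  void onTimerSet(long outputTimestampMillis) {
    holds.merge(outputTimestampMillis, 1, Integer::sum);
  }

  /** Called when a timer fires or is deleted, releasing its hold. */
  void onTimerRemoved(long outputTimestampMillis) {
    holds.computeIfPresent(outputTimestampMillis, (ts, count) -> count == 1 ? null : count - 1);
  }

  /** The output watermark must not advance past the earliest held output timestamp. */
  long cappedOutputWatermark(long proposedOutputWatermarkMillis) {
    Map.Entry<Long, Integer> earliest = holds.firstEntry();
    return earliest == null
        ? proposedOutputWatermarkMillis
        : Math.min(proposedOutputWatermarkMillis, earliest.getKey());
  }
}
```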
Run Java PreCommit
Run Dataflow ValidatesRunner
Run Flink ValidatesRunner
Run Java PreCommit
1 similar comment
Run Java PreCommit
I filed BEAM-14269 for the Pulsar flake.
Run Java PreCommit
1 similar comment
Run Java PreCommit
… timestamp for processing time timers rather than the input watermark
Using the input watermark as the output timestamp for processing time timers causes many issues, rooted in the fact that the input watermark can advance past the end of the window that the timer is set in.
This causes:
- When using OnTimerContext.output, elements may be emitted with a timestamp after the end of their window. One of the most noticeable cases here is that elements are emitted with BoundedWindow.MAX_TIMESTAMP in some cases when Dataflow jobs are drained.
For more context: https://lists.apache.org/thread/gmqr5d4jm3sb308rmm8t6wc45pm97h26
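As a rough illustration of the failure mode (the timestamps below are invented, and this is not a test from the PR): once the input watermark has passed the end of the window, for example during a drain, stamping timer output with it produces elements outside their window, while the timer's own output timestamp stays inside it.

```java
import org.joda.time.Instant;

class WhyInputWatermarkIsWrong {
  public static void main(String[] args) {
    // End of the window the timer was set in.
    Instant windowMaxTimestamp = Instant.parse("2022-04-01T10:59:59.999Z");
    // Output timestamp the timer was set with (defaulted to the element's timestamp).
    Instant timerOutputTimestamp = Instant.parse("2022-04-01T10:15:00Z");
    // By the time the processing-time timer fires, the input watermark has passed the window.
    Instant inputWatermarkAtFiring = Instant.parse("2022-04-01T12:00:00Z");

    // Old behavior: output stamped with the input watermark, i.e. after the window's end.
    System.out.println(inputWatermarkAtFiring.isAfter(windowMaxTimestamp)); // true
    // New behavior: output stamped with the timer's output timestamp, inside the window.
    System.out.println(timerOutputTimestamp.isAfter(windowMaxTimestamp)); // false
  }
}
```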
R: @reuvenlax @lukecwik
PS: I'm not sure if there's an existing JIRA capturing this; if not, I'll make a new one.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Choose reviewer(s) and mention them in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- Update CHANGES.md with noteworthy changes.
See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.