
[FLINK-6719] [docs] Add details about fault-tolerance of timers to ProcessFunction docs #5887

Closed

bowenli86 wants to merge 2 commits into apache:master from bowenli86:FLINK-6719

Conversation

@bowenli86
Member

What is the purpose of the change

The fault-tolerance of timers is a frequently asked question on the mailing lists. We should add details about the topic to the ProcessFunction docs.

Brief change log

Added details about the topic in the ProcessFunction docs.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

none

Documentation

none

@rice668

rice668 commented Apr 22, 2018

Looks good @bowenli86

@bowenli86 bowenli86 changed the title [FLINK-6719] Add details about fault-tolerance of timers to ProcessFunction docs [FLINK-6719] [docs] Add details about fault-tolerance of timers to ProcessFunction docs Apr 23, 2018
Contributor

@fhueske fhueske left a comment


Thanks for extending and improving the documentation about timers @bowenli86!

I've made a few comments and suggestions.
Best, Fabian

### Timer Coalescing
### Optimizations - Timer Coalescing

Every timer registered at the `TimerService` via `registerEventTimeTimer()` or
Contributor

@fhueske fhueske Apr 24, 2018


Move the first paragraph under the ## Timer section

Contributor


Also it would be great if you could find a good spot to add a note that calls to processElement() and onTimer() are always synchronized, i.e., users do not have to worry about concurrent modification of state.
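To make that note concrete, here is a minimal sketch (the class and state names are made up for illustration) of a keyed `ProcessFunction` that touches the same state from both callbacks without any explicit locking, assuming the stream is keyed and event-time timestamps/watermarks are assigned upstream:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical example: counts elements per key and emits the count one
// minute (event time) after an element arrived.
public class CountThenEmit extends ProcessFunction<String, String> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);
        // Register a timer 60 seconds (event time) after this element's timestamp.
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // No locking needed: Flink never calls onTimer() concurrently with
        // processElement() for the same key, so reading and clearing `count`
        // here is safe.
        out.collect("count at " + timestamp + ": " + count.value());
        count.clear();
    }
}
```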


### Fault Tolerance
Contributor


Move the ### Fault Tolerance section above the ### Optimizations section

Timers registered within `ProcessFunction` are fault tolerant.

Timers registered within `ProcessFunction` will be checkpointed by Flink. Upon restoring, timers that are checkpointed
from the previous job will be restored on whatever new instance is responsible for that key.
Contributor


Add a note that timers are synchronously checkpointed (regardless of the configuration of the state backend). Hence, a large number of timers can significantly increase checkpointing time. See optimizations section for advice to reduce the number of timers.
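The optimization the comment refers to is timer coalescing: rounding the target timestamp down so that at most one timer per key and resolution unit is registered. A rough sketch of the idea, assuming a hypothetical 60-second timeout inside `processElement()`:

```java
// Inside processElement() of a keyed ProcessFunction.
// `timeoutMs` is a hypothetical delay chosen for illustration.
long timeoutMs = 60_000L;
long target = ctx.timerService().currentProcessingTime() + timeoutMs;

// Round down to full seconds: timers registered for the same key and the
// same (rounded) timestamp are deduplicated, so fewer timers are kept in
// state and checkpointed.
long coalescedTime = (target / 1000) * 1000;
ctx.timerService().registerProcessingTimeTimer(coalescedTime);
```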


For processing time timers, note that the firing time of a timer is an absolute value of when to fire.

What this means is that if a checkpointed timer’s firing processing timestamp is t (which is basically the registering
Contributor


(which is basically the registering time + configured trigger time)

This is often the case, but not necessarily true. Esp. for processing time, the timer can also be set to something completely different. I'd remove this to avoid confusion.

For processing time timers, note that the firing time of a timer is an absolute value of when to fire.

What this means is that if a checkpointed timer’s firing processing timestamp is t (which is basically the registering
time + configured trigger time), then it will also fire at processing timestamp t on the new instance. Therefore, you
Contributor


What do you mean by new instance? Are you discussing the scenario when a task is recovered on a different machine? I don't think we need to mention this. It should be quite clear that clock synchronization is an issue in processing time.

The info that a pt-timer fires on restore if its time passed while the job was down is important. Also mention that this holds for savepoints, which is even more critical because more time may pass between taking a savepoint and restoring from it.
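As a hedged illustration of what "absolute firing time" implies for processing-time timers (the 60-second delay is arbitrary):

```java
// The argument to registerProcessingTimeTimer() is an absolute wall-clock
// timestamp, not a relative delay.
long fireAt = ctx.timerService().currentProcessingTime() + 60_000L;
ctx.timerService().registerProcessingTimeTimer(fireAt);

// If the job is checkpointed (or savepointed) and only restored after
// `fireAt` has already passed, the restored timer fires immediately on
// the recovered task.
```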


#### Event Time Timers

For event time timers, given that Flink does not checkpoint watermarks, a restored event time timer will fire when the
Contributor


The fact that Flink doesn't checkpoint watermarks is not really related to and does not affect the behavior of timers. It is useful information but I don't think we need to mention it here.

It's sufficient to mention that et-timers fire when the wm passes them.
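For comparison, a small event-time sketch (assuming timestamps and watermarks are assigned upstream; the 10-second offset is arbitrary):

```java
// Inside processElement(): register an event-time timer relative to the
// element's timestamp.
ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 10_000L);

// onTimer() is invoked once the watermark passes the registered timestamp.
// After a restore, the timer simply fires when the incoming watermark
// catches up; no watermark needs to be checkpointed for this to work.
```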

@bowenli86
Member Author

@fhueske updated! let me know how it looks now


Timers registered within `ProcessFunction` are fault tolerant. They are synchronously checkpointed by Flink, regardless of
configurations of state backends. (Therefore, a large number of timers can significantly increase checkpointing time. See optimizations
section for advice to reduce the number of timers.)
Contributor


See the optimizations section for advice on how to reduce the number of timers.

@fhueske
Contributor

fhueske commented May 2, 2018

Thanks for the update @bowenli86.

I'll merge the PR later.

