[FLINK-5465] [streaming] Wait for pending timer threads to finish or to exceed a time limit in exceptional stream task shutdown #5058

Conversation
CC @aljoscha
```java
/**
 * Shuts down and clean up the timer service provider hard and immediately. This does wait
 * for all timer to complete or until the time limit is exceeded. Any call to
```
nit: "timers"
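For readers following along: the Javadoc above describes a quiesce-and-await shutdown. A minimal, self-contained sketch of what such a method might look like follows; the class and method names here are stand-ins, not Flink's actual implementation.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in, not Flink's actual code.
public class TimerServiceShutdownSketch {

    private final ScheduledThreadPoolExecutor timerService = new ScheduledThreadPoolExecutor(1);

    /**
     * Shuts down the timer service hard and waits until all in-flight timers
     * have completed, or until the time limit is exceeded.
     *
     * @return true if the service terminated within the time limit
     */
    public boolean shutdownAndAwaitPending(long time, TimeUnit unit) throws InterruptedException {
        // Stop accepting new timers and attempt to cancel pending ones.
        timerService.shutdownNow();
        // Wait for timers that are already running to finish, bounded by the limit.
        return timerService.awaitTermination(time, unit);
    }
}
```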
```java
@@ -122,6 +123,8 @@
/** The logger used by the StreamTask and its subclasses. */
private static final Logger LOG = LoggerFactory.getLogger(StreamTask.class);

private static final long TIMER_SERVICE_TERMINATION_AWAIT_MS = 7500L;
```
This could be a configuration setting.
Looks good all in all. I am torn between adding a configuration setting for the timer grace period and not having one. It seems like a setting that is unlikely to ever be used and just blows up the config space. On the other hand, magic numbers should be configurable...
So which option can we agree on? I feel like this is a setting that is very specific and hard for the user to understand. On the other hand, it is nice if we can adjust it via configuration in case it causes any problems.
Implemented the configurable timeout, so that we can choose between the two options.
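For illustration, declaring such a timeout option with Flink's `ConfigOptions` builder might look like the sketch below. The key string and the holding class are assumptions for this example, not necessarily what the PR ended up with.

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class TimerShutdownOptionsSketch {

    // Grace period (in milliseconds) that task shutdown waits for pending timers.
    // Key name chosen for illustration only.
    public static final ConfigOption<Long> TASK_CANCELLATION_TIMEOUT_TIMERS =
            ConfigOptions.key("task.cancellation.timers.timeout")
                    .defaultValue(7500L);
}
```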
Force-pushed from c8e5413 to 8f5dde0.
+1 This LGTM now.
Thanks for the reviews @aljoscha and @StephanEwen. Will merge this now.
I would like to make a few comments for followup:

I was thinking about where to put those options, and in the end it was a choice between fragmentation and inconsistency. I agree that we can change the key string.
[FLINK-5465] [streaming] Wait for pending timer threads to finish or to exceed a time limit in exceptional stream task shutdown. This closes apache#5058. (cherry picked from commit d86c6b6)
Can I ask a question here? Why the code below in
@Aitozi the problem with the code is that (at least in that version of Flink) the thread could still become interrupted as part of an ongoing cancellation while waiting for the timer service to finish. It will then immediately unblock with

Edit: one more hint, FYI: there is also still
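To make the interruption issue concrete, here is a sketch of the usual pattern for waiting out a grace period without being cut short by an interrupt; the class and method names are illustrative, not Flink's actual utility code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public final class UninterruptibleShutdownSketch {

    /**
     * Waits for the executor to terminate, absorbing interrupts that arrive
     * during the wait (e.g. from an ongoing task cancellation) and restoring
     * the thread's interrupted status afterwards.
     *
     * @return true if the executor terminated within the timeout
     */
    public static boolean awaitTerminationUninterruptible(ExecutorService executor, long timeoutMs) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        boolean interrupted = false;
        boolean terminated = executor.isTerminated();

        while (!terminated) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                break;
            }
            try {
                terminated = executor.awaitTermination(remainingNanos, TimeUnit.NANOSECONDS);
            } catch (InterruptedException e) {
                // A cancellation interrupt must not cut the grace period short:
                // remember it and keep waiting until the deadline.
                interrupted = true;
            }
        }

        if (interrupted) {
            // Re-set the flag so callers still observe the interruption.
            Thread.currentThread().interrupt();
        }
        return terminated;
    }
}
```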
@StefanRRichter It is mentioned in the Jira ticket that this affects versions 1.5.0, 1.4.2, and 1.6.0.
It affects 1.4 as well. |
What is the purpose of the change
When stream tasks are going through their cleanup in the failover case, pending timer threads can still access native resources of a state backend after the backend's disposal. In some cases, this can crash the JVM.
The obvious fix is to wait until all timers are finished before disposing those resources. The main reason why we did not change this earlier is that (in theory) the user code of a triggering timer can block forever and suppress or delay restarts in failover cases. However, the situation has changed a bit since this topic was last discussed: there is now also a watchdog that ensures that tasks are terminated after some time of inactivity.

I would propose a middle ground that probably catches close to all of those problems, while still ensuring a reasonably fast shutdown: wait for the termination of the thread pool that runs all timer events, up to a time limit. Typically, there is a very limited number of in-flight timers and they are usually short-lived. A wait interval of a few seconds should basically fix this problem, even though there can still be very rare cases of very long-running timers that also happen to access the disposed resources. But the likelihood of this scenario should be reduced by orders of magnitude, and in particular the cascading effect should be mitigated.
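As a sketch of the proposed ordering, with made-up interfaces standing in for Flink's timer service and state backend (none of these names are Flink classes):

```java
import java.util.concurrent.TimeUnit;

// Made-up interfaces standing in for Flink's timer service and state backend;
// this only illustrates the proposed shutdown ordering.
class FailoverCleanupSketch {

    interface TimerService {
        boolean shutdownAndAwaitPending(long time, TimeUnit unit) throws InterruptedException;
    }

    interface NativeStateResource {
        void dispose(); // frees native memory; use-after-dispose can crash the JVM
    }

    void cleanupOnFailure(TimerService timers, NativeStateResource backendResource)
            throws InterruptedException {
        // 1) Give in-flight timers a bounded grace period, so they cannot
        //    touch the backend's native resources after disposal.
        if (!timers.shutdownAndAwaitPending(7500L, TimeUnit.MILLISECONDS)) {
            // A (rare) long-running timer is still in flight. We proceed anyway;
            // the task watchdog will eventually terminate a stuck task.
            System.err.println("Timer service did not terminate within the grace period.");
        }
        // 2) Only now release the native resources.
        backendResource.dispose();
    }
}
```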
Brief change log
Verifying this change
This change added tests to SystemProcessingTimeServiceTest.
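The gist of such a test can be sketched in a self-contained way, with a plain `ScheduledThreadPoolExecutor` standing in for the timer service; this illustrates the tested property, it is not the actual test code.

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.junit.Test;

// Self-contained sketch of the property under test, using a plain executor
// instead of Flink's SystemProcessingTimeService.
public class ShutdownAwaitPendingSketchTest {

    @Test
    public void shutdownReturnsOnceTheTimeLimitIsExceeded() throws Exception {
        ScheduledThreadPoolExecutor timers = new ScheduledThreadPoolExecutor(1);
        CountDownLatch release = new CountDownLatch(1);

        // A "timer" that blocks, simulating long-running user code in a trigger.
        timers.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        timers.shutdown();
        // The wait must give up after the grace period instead of blocking forever.
        assertFalse(timers.awaitTermination(100, TimeUnit.MILLISECONDS));

        release.countDown();
        assertTrue(timers.awaitTermination(10, TimeUnit.SECONDS));
    }
}
```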
Does this pull request potentially affect one of the following parts:
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)

Documentation