[FLINK-11630] Wait for the termination of all running Tasks when shutting down TaskExecutor #9072
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.

Automated checks: last check on commit 181a7e4 (Fri Aug 23 10:20:12 UTC 2019). Mention the bot in a comment to re-run the automated checks.

Review progress: please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Force-pushed from 29515b7 to cd5ad8d.
@flinkbot attention @tillrohrmann @kisimple
Hi @azagrebin, I'm following the unstable test case https://issues.apache.org/jira/browse/FLINK-11631, which is relevant to this PR. I guess you have taken over the old PR #7757. I share the same concern as @tillrohrmann: there are two different ways of tracking the task lifecycle. I think this PR could work well at the moment; I'm just afraid we might encounter some subtle corner cases in the future. Maybe we should unify these two lifecycle tracking ways, or we could merge this PR first and then think about unification. What do you think?
Hi @ifndef-SleePy |
Hi @azagrebin |
tillrohrmann left a comment:
Thanks for opening this PR @azagrebin. Overall the idea looks good. I had some comments concerning concurrency and how to test this feature. It would be great if you could take a look at my comments.
```java
}
final Throwable throwableBeforeTasksCompletion = jobManagerDisconnectThrowable;
return taskCompletionTracker
    .failIncompleteTasksAndWaitForAllCompleted()
```
Let's rename this method to failIncompleteTasksAndGetTerminationFuture().
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java (review thread outdated and resolved)
```java
private final Map<ExecutionAttemptID, Task> incompleteTasks;

private TaskCompletionTracker() {
    incompleteTasks = new ConcurrentHashMap<>(8);
```
Why do we need a ConcurrentHashMap? Where does the concurrency come from?
```java
void trackTaskCompletion(Task task) {
    incompleteTasks.put(task.getExecutionId(), task);
    task.getTerminationFuture().thenRun(() -> incompleteTasks.remove(task.getExecutionId()));
```
Could we also rely on TaskExecutor#unregisterTaskAndNotifyFinalState to remove the task from the TaskCompletionTracker? That way we would not apply concurrent changes to incompleteTasks.
This is also my concern. We already have TaskExecutor#unregisterTaskAndNotifyFinalState and Task#getTerminationFuture. I think we should unify them somehow.
Andrey explained to me that we call unregisterTaskAndNotifyFinalState before the cleanup of the Task has completed. I think this is a problem in the Task's lifecycle. Hence, one could fix this as a follow-up and then simply use the TerminationFuture to signal the final state to the TaskExecutor. But this is out of scope for this PR.
Yes, Andrey has explained this to me before. Never mind, I just mentioned it when I saw this comment. I agree that we could follow up on it later.
BTW, we could also avoid concurrent modifications through thenRunAsync with the mainThreadExecutor if we prefer relying on task.getTerminationFuture().
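A minimal sketch of that alternative, assuming TaskCompletionTracker is an inner class of TaskExecutor so it can reach RpcEndpoint#getMainThreadExecutor:

```java
// Sketch only: run the removal on the RPC main thread so that incompleteTasks
// is never modified concurrently (a plain HashMap would then suffice).
// Assumes access to getMainThreadExecutor() from the enclosing RpcEndpoint.
void trackTaskCompletion(Task task) {
    incompleteTasks.put(task.getExecutionId(), task);
    task.getTerminationFuture().thenRunAsync(
        () -> incompleteTasks.remove(task.getExecutionId()),
        getMainThreadExecutor());
}
```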
```java
try {
    taskExecutor.start();
    taskSlotTableStarted.get();
    taskSlotTable.allocateSlot(0, jobId, allocationId, timeout);
```
This is dangerous, as we are mutating state from the testing thread which is actually owned by the TaskExecutor's RPC main thread. It would be better to call TaskExecutorGateway#requestSlot after registering a ResourceManagerGateway.
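A rough sketch of that setup; the fixtures (resourceId, jobId, allocationId, jobMasterAddress, timeout, and the registered TestingResourceManagerGateway) are assumed from the surrounding test, and the parameter list follows TaskExecutorGateway#requestSlot as of this PR:

```java
// Sketch only: drive the slot allocation through the RPC gateway so the
// state change runs on the TaskExecutor's own main thread. All variables
// are assumed test fixtures.
final TaskExecutorGateway taskExecutorGateway =
    taskExecutor.getSelfGateway(TaskExecutorGateway.class);
taskExecutorGateway
    .requestSlot(
        new SlotID(resourceId, 0),
        jobId,
        allocationId,
        jobMasterAddress,
        resourceManagerGateway.getFencingToken(),
        timeout)
    .get();
```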
```java
Task task = taskSlotTable.getTask(taskDeploymentDescriptor.getExecutionAttemptId());
assertThat(task.getTerminationFuture().isDone(), is(true));
assertThat(TestInterruptableInvokable.INTERRUPTED_FUTURE.isDone(), is(true));
```
Can't we change the test so that we have control over the submitted task's termination future t? Given that, we could stop the TaskExecutor and check that it has not terminated if the task's termination future t is not completed. Then we complete t and check that the TaskExecutor terminates. Then there would also be no need to interact directly with the TaskSlotTable, which is an implementation detail of the TaskExecutor.
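A sketch of that test structure; submitBlockingTask is a hypothetical helper that deploys a task whose termination future t the test controls:

```java
// Sketch only: helper and fixture names are assumptions, not existing
// utilities in TaskExecutorTest.
@Test
public void testTaskExecutorWaitsForTaskTermination() throws Exception {
    // the test owns the task's termination future t
    final CompletableFuture<Void> t = new CompletableFuture<>();
    submitBlockingTask(taskExecutorGateway, t); // hypothetical helper

    final CompletableFuture<Void> executorTermination = taskExecutor.closeAsync();

    // while t is incomplete, the TaskExecutor must not have terminated
    assertThat(executorTermination.isDone(), is(false));

    // completing t unblocks the shutdown
    t.complete(null);
    executorTermination.get(timeout.toMilliseconds(), TimeUnit.MILLISECONDS);
}
```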
```java
    return jobManagerTable;
}

private TaskDeploymentDescriptor createTaskDeploymentDescriptor(
```
What about introducing a TaskDeploymentDescriptorBuilder and replacing the different test instantiations of TDDs with that?
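Usage could then look roughly like this (the builder and its setters are hypothetical; defaults would have to cover the remaining TaskDeploymentDescriptor constructor arguments):

```java
// Sketch only: TaskDeploymentDescriptorBuilder does not exist yet; this
// merely illustrates the intended readability win in the tests.
final TaskDeploymentDescriptor tdd =
    new TaskDeploymentDescriptorBuilder(jobId)
        .setInvokableClassName(TestInterruptableInvokable.class.getName())
        .setExecutionId(executionAttemptId)
        .build();
```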
Force-pushed from 56241d0 to 581be90.
Thanks for the review @tillrohrmann, I have addressed the comments.
tillrohrmann left a comment:
Thanks for addressing my comments @azagrebin. LGTM.
The PR contains some checkstyle violations. I will push an update and merge once Travis gives the green light.
[FLINK-11630] Wait for the termination of all running Tasks when shutting down TaskExecutor. This closes apache#9072. This closes apache#7757.
Force-pushed from 581be90 to 181a7e4.
What is the purpose of the change
Add a unit test to #7757.
Currently, the TaskExecutor does not properly wait for the termination of Tasks when it shuts down in TaskExecutor#onStop. This patch changes TaskExecutor#onStop to fail all running tasks and wait for their termination before stopping all services.
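Conceptually, the reworked shutdown path looks roughly like this (a sketch assembled from the review snippets above, not the exact patch; stopTaskExecutorServices is an assumed cleanup hook):

```java
// Sketch only: fail whatever is still running, then stop the services once
// every task's termination future has completed.
public CompletableFuture<Void> onStop() {
    return taskCompletionTracker
        .failIncompleteTasksAndWaitForAllCompleted()
        .thenRun(this::stopTaskExecutorServices);
}
```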
Brief change log

- Add TaskCompletionTracker to track task termination futures in TaskExecutor
- Add TaskExecutorTest.testTaskInterruptionAndTerminationOnShutdown

Verifying this change
Run TaskExecutorTest.testTaskInterruptionAndTerminationOnShutdown.

Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)

Documentation