[FLINK-11630] Wait for the termination of all running Tasks when shutting down TaskExecutor #9072
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.

Automated checks: last check on commit 181a7e4 (Fri Aug 23 10:20:12 UTC 2019). Mention the bot in a comment to re-run the automated checks.

Review progress: please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Force-pushed from 29515b7 to cd5ad8d.
@flinkbot attention @tillrohrmann @kisimple
Hi @azagrebin, I'm following the unstable test case https://issues.apache.org/jira/browse/FLINK-11631, which is relevant to this PR. I guess you have taken over the old PR #7757. I share the same concern as @tillrohrmann: there are two different ways of tracking the task lifecycle. I think this PR could work well at the moment; I'm just afraid we might encounter some subtle corner cases in the future. Maybe we should unify these two lifecycle tracking ways, or we could merge this PR first and then think about unification. What do you think?
Hi @ifndef-SleePy |
Hi @azagrebin |
tillrohrmann left a comment:
Thanks for opening this PR @azagrebin. Overall the idea looks good. I had some comments concerning concurrency and how to test this feature. It would be great if you could take a look at my comments.
```java
}
final Throwable throwableBeforeTasksCompletion = jobManagerDisconnectThrowable;
return taskCompletionTracker
    .failIncompleteTasksAndWaitForAllCompleted()
```
Let's rename this method to failIncompleteTasksAndGetTerminationFuture().
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java (review thread outdated and resolved)
```java
private final Map<ExecutionAttemptID, Task> incompleteTasks;

private TaskCompletionTracker() {
    incompleteTasks = new ConcurrentHashMap<>(8);
```
Why do we need a ConcurrentHashMap? Where does the concurrency come from?
```java
void trackTaskCompletion(Task task) {
    incompleteTasks.put(task.getExecutionId(), task);
    task.getTerminationFuture().thenRun(() -> incompleteTasks.remove(task.getExecutionId()));
```
Could we also rely on TaskExecutor#unregisterTaskAndNotifyFinalState to remove the task from the TaskCompletionTracker? That way we would not apply concurrent changes to incompleteTasks.
This is also my concern. We already have TaskExecutor#unregisterTaskAndNotifyFinalState and Task#getTerminationFuture. I think we should unify them somehow.
Andrey explained to me that we call unregisterTaskAndNotifyFinalState before the cleanup of the Task has completed. I think this is a problem in the Task's lifecycle. Hence, one could fix this as a follow-up and then simply use the TerminationFuture to signal the final state to the TaskExecutor. But this is out of scope for this PR.
Yes, Andrey has explained this to me before. Never mind, I just mentioned it when I saw this comment. I agree that we could follow up on it later.
BTW, we could also avoid concurrent modifications through thenRunAsync with the mainThreadExecutor if we prefer relying on task.getTerminationFuture().
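A minimal sketch of that alternative, assuming TaskCompletionTracker is an inner class of TaskExecutor so it can reach RpcEndpoint#getMainThreadExecutor:

```java
// Sketch only: run the removal on the RPC main thread so that incompleteTasks
// is never modified concurrently (a plain HashMap would then suffice).
// Assumes access to getMainThreadExecutor() from the enclosing RpcEndpoint.
void trackTaskCompletion(Task task) {
    incompleteTasks.put(task.getExecutionId(), task);
    task.getTerminationFuture().thenRunAsync(
        () -> incompleteTasks.remove(task.getExecutionId()),
        getMainThreadExecutor());
}
```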
```java
try {
    taskExecutor.start();
    taskSlotTableStarted.get();
    taskSlotTable.allocateSlot(0, jobId, allocationId, timeout);
```
This is dangerous, as we are mutating state from the testing thread which is actually owned by the TaskExecutor's RPC main thread. It would be better to call TaskExecutorGateway#requestSlot after registering a ResourceManagerGateway.
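A rough sketch of that setup; the fixtures (resourceId, jobId, allocationId, jobMasterAddress, timeout, and the registered TestingResourceManagerGateway) are assumed from the surrounding test, and the parameter list follows TaskExecutorGateway#requestSlot as of this PR:

```java
// Sketch only: drive the slot allocation through the RPC gateway so the
// state change runs on the TaskExecutor's own main thread. All variables
// are assumed test fixtures.
final TaskExecutorGateway taskExecutorGateway =
    taskExecutor.getSelfGateway(TaskExecutorGateway.class);
taskExecutorGateway
    .requestSlot(
        new SlotID(resourceId, 0),
        jobId,
        allocationId,
        jobMasterAddress,
        resourceManagerGateway.getFencingToken(),
        timeout)
    .get();
```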
```java
Task task = taskSlotTable.getTask(taskDeploymentDescriptor.getExecutionAttemptId());
assertThat(task.getTerminationFuture().isDone(), is(true));
assertThat(TestInterruptableInvokable.INTERRUPTED_FUTURE.isDone(), is(true));
```
Can't we change the test so that we have control over the submitted task's termination future t? Given that, we could stop the TaskExecutor and check that it has not terminated if the task's termination future t is not completed. Then we complete t and check that the TaskExecutor terminates. Then there would also be no need to interact directly with the TaskSlotTable, which is an implementation detail of the TaskExecutor.
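A sketch of that test structure; submitBlockingTask is a hypothetical helper that deploys a task whose termination future t the test controls:

```java
// Sketch only: helper and fixture names are assumptions, not existing
// utilities in TaskExecutorTest.
@Test
public void testTaskExecutorWaitsForTaskTermination() throws Exception {
    // the test owns the task's termination future t
    final CompletableFuture<Void> t = new CompletableFuture<>();
    submitBlockingTask(taskExecutorGateway, t); // hypothetical helper

    final CompletableFuture<Void> executorTermination = taskExecutor.closeAsync();

    // while t is incomplete, the TaskExecutor must not have terminated
    assertThat(executorTermination.isDone(), is(false));

    // completing t unblocks the shutdown
    t.complete(null);
    executorTermination.get(timeout.toMilliseconds(), TimeUnit.MILLISECONDS);
}
```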
```java
    return jobManagerTable;
}

private TaskDeploymentDescriptor createTaskDeploymentDescriptor(
```
What about introducing a TaskDeploymentDescriptorBuilder and replacing the different test instantiations of TDDs with that?
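Usage could then look roughly like this (the builder and its setters are hypothetical; defaults would have to cover the remaining TaskDeploymentDescriptor constructor arguments):

```java
// Sketch only: TaskDeploymentDescriptorBuilder does not exist yet; this
// merely illustrates the intended readability win in the tests.
final TaskDeploymentDescriptor tdd =
    new TaskDeploymentDescriptorBuilder(jobId)
        .setInvokableClassName(TestInterruptableInvokable.class.getName())
        .setExecutionId(executionAttemptId)
        .build();
```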
Force-pushed from 56241d0 to 581be90.
Thanks for the review @tillrohrmann, I have addressed the comments.
tillrohrmann left a comment:
Thanks for addressing my comments @azagrebin. LGTM.
The PR contains some checkstyle violations. I will push an update and merge once Travis gives the green light.
[FLINK-11630] Wait for the termination of all running Tasks when shutting down TaskExecutor. This closes apache#9072. This closes apache#7757.
Force-pushed from 581be90 to 181a7e4.
What is the purpose of the change
Add a unit test to #7757.
Currently, the TaskExecutor does not properly wait for the termination of Tasks when it shuts down in TaskExecutor#onStop. This patch changes TaskExecutor#onStop to fail all running tasks and wait for their termination before stopping all services.
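Conceptually, the reworked shutdown path looks roughly like this (a sketch assembled from the review snippets above, not the exact patch; stopTaskExecutorServices is an assumed cleanup hook):

```java
// Sketch only: fail whatever is still running, then stop the services once
// every task's termination future has completed.
public CompletableFuture<Void> onStop() {
    return taskCompletionTracker
        .failIncompleteTasksAndWaitForAllCompleted()
        .thenRun(this::stopTaskExecutorServices);
}
```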
Brief change log

- Add TaskCompletionTracker to track task termination futures in TaskExecutor
- Add TaskExecutorTest.testTaskInterruptionAndTerminationOnShutdown

Verifying this change
Run TaskExecutorTest.testTaskInterruptionAndTerminationOnShutdown.

Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)

Documentation