[FLINK-15467][task] Wait for invokable cancellation when stopping Task #12714

rkhachatryan · 2020-06-19T07:33:08Z

What is the purpose of the change

Wait for the LegacySourceFunctionThread to finish when cancelling the task.
This prevents premature freeing of resources, particularly LibraryCache (see FLINK-15467 discussion).

Implemented by adding Task. invokableCancelFuture instead of just joining the source thread in cancelTask. This is because cancelTask can be called from TaskCanceller, while resources are freed by Task.executingThread or upon its death.

Brief change log

There are three commits that should be squashed after reviewing:

Return future from StreamTask.cancelTask and finishTask - only change of signatures except for SourceStreamTask
Return completion future from AbstractInvokable cancel - only change of signatures except for StreamTask
Wait in Task.cancelInvokable for completion

Verifying this change

Added unit test: TaskTest.testTaskTerminatesOnlyAfterInvokable
Manually tested on cluster

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

flinkbot · 2020-06-19T07:35:36Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 5d78f3b (Fri Jun 19 07:35:35 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

_{Mention the bot in a comment to re-run the automated checks.}

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into to Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-06-19T07:48:23Z

CI report:

0d85f9e Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

pnowojski

Thanks for the bug fix, I've left couple of comments & questions.

flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/tasks/StreamTaskTest.java

flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

rkhachatryan · 2020-06-19T11:49:49Z

Thanks for the review @pnowojski, I've addressed your comments (1 unresolved).

pnowojski

Thanks for the quick update. Left one comment, besides I think LGTM.

flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

flink-runtime/src/test/java/org/apache/flink/runtime/taskmanager/TaskTest.java

pnowojski · 2020-06-19T17:03:35Z

flink-runtime/src/test/java/org/apache/flink/runtime/taskmanager/TaskTest.java

@@ -139,6 +142,33 @@ public void testRegularExecution() throws Exception {
 		taskManagerActions.validateListenerMessage(ExecutionState.FINISHED, task, null);
 	}

+	@Test
+	public void testTaskTerminatesOnlyAfterInvokable() throws Exception {


Is it covering for the user reported bug? It's not using SourceStreamTask anywhere, so it looks like it's not? If that's the case, couldn't we add a test in SourceStreamTaskTest?

Added SourceStreamTaskTest.testWatForCancelCompletion.
For that, I extracted SourceStreamTask.LegacySourceFunctionRunner interface in a preceding hotfix commit.

Hmm.

extracted SourceStreamTask.LegacySourceFunctionRunner interface in a preceding hotfix commit.

this commit is not super pretty, maybe we could make it a bit better, but I'm not sure if that's the right direction.

SourceStreamTaskTest.testWatForCancelCompletion

this doesn't seem to be testing much, it's almost a no-op test for passing the returned value of one getter. It's definitely not testing for the reported bug.

What about creating a mock source that waits on a latch in a loop ignoring interrupts and the test thread fires cancellation & ensures task is not completed? Setup should be quite easy like in SourceStreamTaskTest#finishingIgnoresExceptions? What do you think?

It's definitely not testing for the reported bug.

A proper test for the reported bug would be an end-to-end test involving running a remote cluster, syncing source thread with class unloading and then checking the logs.
I think we agreed offline that such automated test would be an overkill and could rather be done manually. And the rest should be covered with unit tests.

Now, this unit test checks that SourceStreamTask returns completion future of its thread.
This is what changed - and this is what I think should be tested.

If we also want to test LegacySourceFunctionThread future completion, it can be done independently now after extraction (however, I think this is out of the scope of this PR).

Furthermore, it's not possible to reliably test that a program does not perform some action if there is no time boundary (in this case test that task never completes if source thread never completes).

Yes, e2e test would be an overkill, but semi integration test in SourceStreamTask seems to be quite easy.

it's not possible to reliably test that a program does not perform some action if there is no time boundary (in this case test that task never completes if source thread never completes).

What's the problem with waiting on the latch as I proposed? It should provide automated test coverage for that bug.

What's the problem with waiting on the latch as I proposed?

For how long should the test thread wait on the latch (assuming SourceStreamTask.sourceThread never exits)?
Or did I misunderstood your proposal?

It should provide automated test coverage for that bug.

It provides only partial coverage because it doesn't cover integration with Task, StreamTask and the code that actually unloads the classes. So in terms of coverage, it's equivalent to a set of unit tests.

I meant that source thread, running mock source, would be waiting on the latch (with some timeout), not the test thread. Test thread would:

trigger cancellation

maybe wait 10ms

assert source task's thread hasn't completed

release the latch

assert that source task has completed (with a timeout)

It provides only partial coverage because it doesn't cover integration with Task, StreamTask and the code that actually unloads the classes. So in terms of coverage, it's equivalent to a set of unit tests.

It would provide a coverage for task thread shutting down prematurely, before source thread completes.

…terface to facilitate testing

…ancel

Task.invokableCancelFuture is added instead of just joining the source thread in cancelTask. This is because cancelTask can be called from TaskCanceller thread, while resources are freed by Task.executingThread or upon it's death.

rkhachatryan · 2020-07-29T07:39:05Z

Superseded by #13000.

rmetzger added the review=description? label Jun 19, 2020

rmetzger added the component=Runtime/Task label Jun 19, 2020

pnowojski reviewed Jun 19, 2020

View reviewed changes

rkhachatryan force-pushed the wait-source-task-thread-future branch from 5d78f3b to 35c875a Compare June 19, 2020 11:46

pnowojski reviewed Jun 19, 2020

View reviewed changes

flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java Show resolved Hide resolved

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java Show resolved Hide resolved

rkhachatryan force-pushed the wait-source-task-thread-future branch from 35c875a to 2a05ea5 Compare June 19, 2020 12:41

pnowojski reviewed Jun 19, 2020

View reviewed changes

rkhachatryan force-pushed the wait-source-task-thread-future branch 2 times, most recently from eaf54b8 to 1d9d6ab Compare June 23, 2020 13:52

rkhachatryan added 4 commits June 23, 2020 15:53

[hotfix][task] Extract SourceStreamTask.LegacySourceFunctionRunner in…

7d6f3bf

…terface to facilitate testing

[FLINK-15467][task] Return future from StreamTask cancelTask/finishTask

86452a6

[FLINK-15467][task] Return completion future from AbstractInvokable c…

6d3fe44

…ancel

rkhachatryan force-pushed the wait-source-task-thread-future branch from 1d9d6ab to 0d85f9e Compare June 23, 2020 13:54

rkhachatryan closed this Jul 29, 2020

rkhachatryan deleted the wait-source-task-thread-future branch July 29, 2020 14:29

rkhachatryan mentioned this pull request Jul 29, 2020

[FLINK-15467][task] Join source thread in SourceStreamTask.cancelTask #12700

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-15467][task] Wait for invokable cancellation when stopping Task #12714

[FLINK-15467][task] Wait for invokable cancellation when stopping Task #12714

rkhachatryan commented Jun 19, 2020

flinkbot commented Jun 19, 2020

flinkbot commented Jun 19, 2020 •

edited

pnowojski left a comment

rkhachatryan commented Jun 19, 2020

pnowojski left a comment

pnowojski Jun 19, 2020

rkhachatryan Jun 22, 2020

pnowojski Jun 22, 2020

rkhachatryan Jun 22, 2020

pnowojski Jun 23, 2020 •

edited

rkhachatryan Jun 23, 2020

pnowojski Jul 21, 2020 •

edited

rkhachatryan commented Jul 29, 2020

[FLINK-15467][task] Wait for invokable cancellation when stopping Task #12714

[FLINK-15467][task] Wait for invokable cancellation when stopping Task #12714

Conversation

rkhachatryan commented Jun 19, 2020

What is the purpose of the change

Brief change log

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

flinkbot commented Jun 19, 2020

Automated Checks

Review Progress

flinkbot commented Jun 19, 2020 • edited

CI report:

pnowojski left a comment

Choose a reason for hiding this comment

rkhachatryan commented Jun 19, 2020

pnowojski left a comment

Choose a reason for hiding this comment

pnowojski Jun 19, 2020

Choose a reason for hiding this comment

rkhachatryan Jun 22, 2020

Choose a reason for hiding this comment

pnowojski Jun 22, 2020

Choose a reason for hiding this comment

rkhachatryan Jun 22, 2020

Choose a reason for hiding this comment

pnowojski Jun 23, 2020 • edited

Choose a reason for hiding this comment

rkhachatryan Jun 23, 2020

Choose a reason for hiding this comment

pnowojski Jul 21, 2020 • edited

Choose a reason for hiding this comment

rkhachatryan commented Jul 29, 2020

flinkbot commented Jun 19, 2020 •

edited

pnowojski Jun 23, 2020 •

edited

pnowojski Jul 21, 2020 •

edited