[FLINK-23741][checkpointing] Properly decline triggerCheckpoint RPC if StreamTask is not running #16800

pnowojski · 2021-08-12T17:08:02Z

With ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH enabled, the final checkpoint can deadlock (or time out after a very long time) if there is a race condition between selecting the tasks to trigger the checkpoint on and tasks finishing. FLINK-21246 was supposed to handle this, but it doesn't work as expected, because the futures from:
org.apache.flink.runtime.taskexecutor.TaskExecutor#triggerCheckpoint
and
org.apache.flink.streaming.runtime.tasks.StreamTask#triggerCheckpointAsync
are not linked together. TaskExecutor#triggerCheckpoint reports that the checkpoint has been successfully triggered, while the StreamTask might actually have finished.

Verifying this change

TODO: implement test

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2021-08-12T17:11:18Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit b8906e0 (Thu Aug 12 17:11:17 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2021-08-12T17:30:18Z

CI report:

bd40fdb Azure: PENDING Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

pnowojski · 2021-08-13T06:47:11Z

Frankly, I don't have a good idea how to test this in a meaningful way, especially the failure case. Maybe we should just rely on the WIP FLINK-21090 that actually discovered this problem?

gaoyunhaii · 2021-08-13T10:40:37Z

Hi Piotr @pnowojski, many thanks for opening the PR!

It seems to me that the current implementation only learns whether the trigger was successful after it has actually executed the synchronous part? We might not need to be that strict: currently, as long as invokable.triggerCheckpointAsync is called without an exception, we can be sure the checkpoint will be performed. This is because, if the method succeeds, the mailbox has not yet gone through prepareClose() and isRunning = true. Since isRunning is set to false only after the mailbox is drained, isRunning must still be true when this mail is processed and the checkpoint is triggered.

Although logically we only care about the "false" result, Akka has a timeout, so if the mail stays queued for a long time or the synchronous part takes a long time, we might hit an AkkaAskTimeout and wrongly cancel the checkpoint?
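
The ordering argument in the comment above can be illustrated with a toy mailbox. This is a simplified, single-threaded model with hypothetical names (ToyMailbox, tryTrigger, drainAndFinish), not Flink's actual mailbox. It assumes only what the comment states: a mail is rejected once the mailbox has stopped accepting new mail, and isRunning flips to false only after all accepted mail has been drained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the ordering argument; hypothetical names, not Flink code. */
final class ToyMailbox {
    private final Queue<Runnable> mails = new ArrayDeque<>();
    private boolean acceptingMail = true;
    private boolean isRunning = true;

    /** Records what each checkpoint mail observed when it was processed. */
    final List<String> log = new ArrayList<>();

    /** Returns false (a decline) if the mailbox no longer accepts mail. */
    boolean tryTrigger(long checkpointId) {
        if (!acceptingMail) {
            return false;
        }
        // isRunning is read when the mail is processed, not when it is enqueued.
        mails.add(() -> log.add("checkpoint " + checkpointId + " running=" + isRunning));
        return true;
    }

    /** Stops accepting mail, drains what was accepted, and only then stops running. */
    void drainAndFinish() {
        acceptingMail = false;
        Runnable mail;
        while ((mail = mails.poll()) != null) {
            mail.run();
        }
        isRunning = false;
    }

    public static void main(String[] args) {
        ToyMailbox mailbox = new ToyMailbox();
        System.out.println("accepted before finish: " + mailbox.tryTrigger(1));
        mailbox.drainAndFinish();
        System.out.println("accepted after finish: " + mailbox.tryTrigger(2));
        System.out.println(mailbox.log);
    }
}
```

In this model, any trigger that is accepted is processed while isRunning is still true, so the accept/reject decision at enqueue time is already enough to know whether the checkpoint will run.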

pnowojski · 2021-08-13T11:24:59Z

I would prefer not to complicate this RPC chain by having two different definitions of what a completed future means. However, I think:

the synchronous part takes a long time

is a valid concern. I will double-check whether this would indeed cause Akka timeouts.
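
The timeout concern can be sketched with plain CompletableFutures. This is only an analogy: the hypothetical ask helper below models an RPC ask timeout with CompletableFuture#orTimeout (Java 9+), and nothing here is Akka's or Flink's API. It contrasts a reply that is completed as soon as the trigger is accepted with a reply that stays open while a slow synchronous part is still running.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Analogy for an RPC ask timeout; hypothetical names, not Akka or Flink code. */
final class AskTimeoutSketch {

    /** Fails the reply with a TimeoutException if it is still open after the timeout. */
    static <T> CompletableFuture<T> ask(CompletableFuture<T> reply, long timeoutMillis) {
        return reply.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Blocks until the future is done and reports whether it failed with a timeout. */
    static boolean timedOut(CompletableFuture<?> future) {
        try {
            future.join();
            return false;
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException;
        }
    }

    public static void main(String[] args) {
        // Reply completed at accept time: the ask timeout cannot fire.
        CompletableFuture<String> ackedOnAccept = CompletableFuture.completedFuture("ACK");
        System.out.println("acked on accept timed out: " + timedOut(ask(ackedOnAccept, 50)));

        // Reply tied to work that has not finished yet: the caller sees a timeout.
        CompletableFuture<String> tiedToSlowWork = new CompletableFuture<>();
        System.out.println("tied to slow work timed out: " + timedOut(ask(tiedToSlowWork, 50)));
    }
}
```

The second case is the one the comment worries about: the caller cannot tell a slow but healthy checkpoint from a failed trigger.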

pnowojski · 2021-08-13T11:34:33Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java

+            CompletableFuture<Acknowledge> resultFuture = new CompletableFuture<>();
+            task.triggerCheckpointBarrier(checkpointId, checkpointTimestamp, checkpointOptions)
+                    .thenApply(
+                            triggerResult ->
+                                    triggerResult
+                                            ? resultFuture.complete(Acknowledge.get())
+                                            : resultFuture.completeExceptionally(
+                                                    new CheckpointException(
+                                                            "Task is not running?",
+                                                            CheckpointFailureReason
+                                                                    .TRIGGER_CHECKPOINT_FAILURE)));

            return CompletableFuture.completedFuture(Acknowledge.get());


🤦 Note that in this version CompletableFuture.completedFuture(Acknowledge.get()) is returned regardless of whether resultFuture succeeded or failed...

Sorry for also missing this issue...
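
The slip in the snippet above can be reproduced in isolation. The sketch below uses hypothetical stand-ins (triggerOnTask for task.triggerCheckpointBarrier, the string "ACK" for Acknowledge, IllegalStateException for CheckpointException) and is not the code that was eventually merged; later comments indicate the final change was limited to Task. It only shows why completing a future that is never returned hides the decline, and one way to return a future derived from the task-level result instead.

```java
import java.util.concurrent.CompletableFuture;

/** Reduced reproduction with hypothetical stand-ins; not Flink code. */
final class TriggerRpcSketch {

    /** Stand-in for the task-level trigger: false means the task declined. */
    static CompletableFuture<Boolean> triggerOnTask(boolean taskRunning) {
        return CompletableFuture.completedFuture(taskRunning);
    }

    /** Same shape as the snippet above: resultFuture is completed but never returned. */
    static CompletableFuture<String> triggerDroppingResult(boolean taskRunning) {
        CompletableFuture<String> resultFuture = new CompletableFuture<>();
        triggerOnTask(taskRunning)
                .thenApply(
                        triggered ->
                                triggered
                                        ? resultFuture.complete("ACK")
                                        : resultFuture.completeExceptionally(
                                                new IllegalStateException("Task is not running")));
        // The caller always sees success, whatever happened to resultFuture.
        return CompletableFuture.completedFuture("ACK");
    }

    /** Returns a future derived from the task-level result, so a decline reaches the caller. */
    static CompletableFuture<String> triggerReturningResult(boolean taskRunning) {
        return triggerOnTask(taskRunning)
                .thenApply(
                        triggered -> {
                            if (!triggered) {
                                throw new IllegalStateException("Task is not running");
                            }
                            return "ACK";
                        });
    }

    public static void main(String[] args) {
        System.out.println(triggerDroppingResult(false).join());
        System.out.println(triggerReturningResult(false).isCompletedExceptionally());
    }
}
```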

gaoyunhaii · 2021-08-13T12:56:08Z

Many thanks Piotr @pnowojski for updating the PR! The new method looks good to me. The only concern is that, logically, a checkpoint might sometimes be declined more than once. In practice, though, the StreamTask only declines checkpoints if isRunning = false, which should not happen for the same reason as in the comment above, and even repeated declines should cause no problems.

And since the change is now limited to Task, perhaps we could add a unit test in TaskTest to check that decline is indeed called in the three places? It might look something like:

public void testDeclineCheckpointIfTaskIsNotRunning() throws Exception {
    TestCheckpointResponder testCheckpointResponder = new TestCheckpointResponder();
    final Task task =
            createTaskBuilder().setCheckpointResponder(testCheckpointResponder).build();
    task.triggerCheckpointBarrier(
            1,
            1,
            CheckpointOptions.alignedNoTimeout(
                    CheckpointType.CHECKPOINT,
                    CheckpointStorageLocationReference.getDefault()));
    assertEquals(1, testCheckpointResponder.getDeclineReports().size());
    assertEquals(
            CheckpointFailureReason.CHECKPOINT_DECLINED_TASK_NOT_READY,
            testCheckpointResponder
                    .getDeclineReports()
                    .get(0)
                    .getCause()
                    .getCheckpointFailureReason());
}


pnowojski · 2021-08-13T14:30:35Z

Thanks @gaoyunhaii for your suggestion. Your test wouldn't actually test the cases I added, as the task wouldn't be in the running state, but I was able to easily add coverage for all 4 declining cases.

gaoyunhaii

Many thanks Piotr @pnowojski for updating the PR! LGTM and +1 to merge~

gaoyunhaii · 2021-08-14T07:25:36Z

@flinkbot run azure

pnowojski force-pushed the f23741 branch from b8906e0 to 7bc97a1 Compare August 12, 2021 17:09

rmetzger added the review=description? label Aug 12, 2021

rmetzger added component=Runtime/Checkpointing component=Runtime/Task labels Aug 12, 2021

pnowojski commented Aug 13, 2021

pnowojski force-pushed the f23741 branch from 3d90597 to 33fa745 Compare August 13, 2021 11:49

pnowojski added 2 commits August 13, 2021 16:29

[hotfix][task] Improve and unify logging in StreamTask

ddf5404

pnowojski force-pushed the f23741 branch from 33fa745 to bd40fdb Compare August 13, 2021 14:29

gaoyunhaii approved these changes Aug 13, 2021

pnowojski merged commit edaf75e into apache:master Aug 14, 2021

pnowojski deleted the f23741 branch August 14, 2021 09:25

gaoyunhaii mentioned this pull request Aug 16, 2021

[FLINK-21090][tests] Add IT case for stop-with-savepoint and final checkpoint #16773

Closed
