
[FLINK-23233][runtime] Failing checkpoints before failover for failed events in OperatorCoordinator #16432

Merged
merged 1 commit into from Jul 13, 2021

Conversation

gaoyunhaii
Contributor

What is the purpose of the change

This PR changes how checkpointing in the OperatorCoordinator tracks the results of previously sent events: a failed event is now kept in the tracker until it has been fully processed (namely, until failover has been triggered for the corresponding subtasks). Without this, events might be lost: a checkpoint taken after an event fails to send could complete even though the subtask failover caused by the lost event is not yet reflected in it.

Brief change log

  • 3f92e0c changes the event tracking logic.

Verifying this change


This change can be verified by the added unit tests and by manually testing the previously failing cases.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Jul 8, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit c874338 (Thu Sep 23 18:02:32 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jul 8, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Contributor

@tillrohrmann tillrohrmann left a comment


Thanks for creating this PR so fast @gaoyunhaii. I had a couple of comments and a question about which future to wait on. Maybe we can also fix the problem by waiting on the result-handling future instead of waiting on the result future directly. PTAL.

@@ -108,4 +94,22 @@ void removeFromSet(CompletableFuture<?> future) {
lock.unlock();
}
}

@VisibleForTesting
Collection<CompletableFuture<?>> getCurrentIncomplete() {
Contributor


Suggested change
Collection<CompletableFuture<?>> getCurrentIncomplete() {
Collection<CompletableFuture<?>> getCurrentIncompleteAndReset() {

Contributor Author


Sorry, I made a mistake here: I initially wanted this method to be a pure getter that does not modify state, used only for verification in the tests. Would it also be OK if I remove incompleteFutures.clear() and keep the name getCurrentIncomplete()?

Contributor


If incompleteFutures.clear() does not need to be called in this method, then it should be fine.

executor.triggerAll();

// Finish the event sending. This will insert one runnable that handle
// failed events to the executor. And we pending this runnable to
Contributor


What do you mean with "we pending this runnable"?

Contributor Author


It should be "delay the runnable". The initial idea was to test the case where a new checkpoint is triggered before the failover-triggering stage executes, so we need a way to delay that stage until the checkpoint has been triggered.

Contributor


Technically we are not delaying it but ignoring it, right?

@Override
public void execute(@Nonnull Runnable command) {
if (pendingNewRunnables) {
pendingRunnables.add(command);
Contributor


Will the Runnables in pendingRunnables ever be executed or will they simply be ignored?

Contributor Author


In the real case these Runnables would eventually be executed, and executing them here would not change the result, since we are testing the logic before they run. I could also execute these Runnables at the end of the test to be more consistent with the real case.


private final Queue<Runnable> pendingRunnables = new ArrayDeque<>();

public void setPendingNewRunnables(boolean pendingNewRunnables) {
Contributor


Maybe a better name would be ignoreExecuteCalls(boolean) or so.

Contributor Author


Perhaps we could rename it to something like delayNewRunnables, and have the test execute these Runnables at the end?

Contributor


Sounds good.

Comment on lines 80 to 93

Before:

result.handleAsync(
        (success, failure) -> {
            if (failure != null && subtaskAccess.isStillRunning()) {
                String msg =
                        String.format(
                                EVENT_LOSS_ERROR_MESSAGE,
                                evt,
                                subtaskAccess.subtaskName());
                subtaskAccess.triggerTaskFailover(new FlinkException(msg, failure));

After:

            if (failure != null) {
                if (subtaskAccess.isStillRunning()) {
                    String msg =
                            String.format(
                                    EVENT_LOSS_ERROR_MESSAGE,
                                    evt,
                                    subtaskAccess.subtaskName());
                    subtaskAccess.triggerTaskFailover(
                            new FlinkException(msg, failure));
                }

                nonSuccessFuturesTrack.removeFailedFuture(result);
Contributor


I am wondering whether it wouldn't be simpler to change result.handleAsync to result.whenAsync and then to add the result of this operation to the incompleteFuturesTracker? That way we are sure that we will have handled the result before doing any other operations (e.g. failing/completing checkpoints).
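This suggestion can be sketched outside of Flink with plain CompletableFuture (the names below, such as trackAfterHandling and failoverTriggered, are illustrative stand-ins, not Flink's actual classes): if the tracker holds the future returned by handle/handleAsync rather than the raw RPC result, the tracked future cannot complete before the failure callback has run.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class ChainedHandlingDemo {

    /**
     * Returns the future to put into the tracker: it completes only after
     * the failure handling (the stand-in for triggerTaskFailover) has run.
     */
    static <T> CompletableFuture<T> trackAfterHandling(
            CompletableFuture<T> sendResult, AtomicBoolean failoverTriggered) {
        return sendResult.handle(
                (success, failure) -> {
                    if (failure != null) {
                        failoverTriggered.set(true); // stand-in for triggerTaskFailover
                    }
                    return success;
                });
    }

    public static void main(String[] args) {
        AtomicBoolean failoverTriggered = new AtomicBoolean(false);

        // Stand-in for the RPC result future of a sent operator event.
        CompletableFuture<String> sendResult = new CompletableFuture<>();
        CompletableFuture<String> tracked =
                trackAfterHandling(sendResult, failoverTriggered);

        sendResult.completeExceptionally(new RuntimeException("RPC lost"));

        // Once the tracked future is done, the failover has been triggered,
        // so a checkpoint waiting on it cannot complete too early.
        System.out.println(tracked.isDone() && failoverTriggered.get()); // true
    }
}
```

This is only a sketch of the ordering guarantee; it deliberately ignores which executor the callback runs on, which is the part discussed below.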

Contributor Author


Hi Till, many thanks for the review! My initial concern with this approach is that it might cause a deadlock:

  1. When a checkpoint reaches completeCheckpointOnceEventsAreDone, it would block the main thread, waiting for all the pending event futures (with this approach, the whenAsync ones) to be done.
  2. When the event-sending result future finishes, the completing thread would also try to submit the whenAsync stage to the main thread, which would never get executed since the main thread is blocked.

Contributor


Hmm, let me check whether this is the case. If the checkpoint blocks the main thread, then this is another serious problem that should not happen.

Contributor


It doesn't block the main thread. It just creates a conjunct future of all pending event futures and chains the checkpoint completion to that future. That is all fully async.
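The non-blocking behavior described here can be illustrated with a small sketch (the method name completeOnceEventsAreDone is a hypothetical stand-in, not Flink's actual code): the conjunction of the pending event futures is built with CompletableFuture.allOf, and the checkpoint completion is chained onto it, with no blocking wait anywhere.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ConjunctFutureDemo {

    // Hypothetical sketch: chain "checkpoint may complete" onto the
    // conjunction of all pending event futures, fully asynchronously.
    static CompletableFuture<Void> completeOnceEventsAreDone(
            List<CompletableFuture<?>> pendingEvents) {
        return CompletableFuture.allOf(
                pendingEvents.toArray(new CompletableFuture<?>[0]));
    }

    public static void main(String[] args) {
        CompletableFuture<Object> pendingEvent = new CompletableFuture<>();
        CompletableFuture<Void> checkpoint =
                completeOnceEventsAreDone(List.of(pendingEvent));

        // The calling thread is not blocked; the checkpoint is simply pending.
        System.out.println(checkpoint.isDone()); // false

        pendingEvent.complete(null); // event acknowledged
        System.out.println(checkpoint.isDone()); // true
    }
}
```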

Contributor Author


Yes, indeed. Sorry, I misunderstood the original implementation.

@gaoyunhaii
Contributor Author

Hi Till, many thanks for the review! I addressed the inline comments; as for the other possible option, I think it might cause a deadlock.

@tillrohrmann
Contributor

tillrohrmann commented Jul 9, 2021

@gaoyunhaii could you show me where exactly the deadlock would happen? I could not find it based on your description. I don't think the checkpoint blocks the main thread.

@gaoyunhaii
Contributor Author

Hi @tillrohrmann, sorry, I had misunderstood the implementation of completeCheckpointOnceEventsAreDone: it only registers a callback instead of doing a blocking wait. Sorry for the confusion.

Then it indeed also seems fine to make the checkpoint future wait on the failure-handling future: since there are no concurrent checkpoint triggers, no new checkpoint would arrive before the pending one finishes, and it should also be acceptable to delay the checkpoint a little while waiting for the subtask to be failed. This approach seems simpler given the current implementation; do you prefer it?

@StephanEwen
Contributor

I am taking a look here; I would like to double-check two things before going ahead with this.

@tillrohrmann
Contributor

Technically, both solutions should work (modulo Stephan's investigation of the two things).

If we consider error handling being part of the event sending, then the whenAsync approach might be a bit simpler because we don't have the failed futures from previous events that can affect future operations. But I don't have a strong preference apart from that.

@StephanEwen
Contributor

To re-summarize the failure cause:

A failed RPC (failed result future) leads to the failure of the checkpoint and triggers a task failure. Then the tracking structures are reset, assuming that the failure is taken care of.
However, because the task failure is processed asynchronously (enqueued in the scheduler mailbox), it is possible that some successive checkpoints complete before the task failure is handled. And that violates the assumption that once the RPC loss is detected, no further checkpoints may complete.

I don't fully understand how this PR fixes that. It looks like it changes a bit where failure notifications are set, and that in completeCheckpointOnceEventsAreDone() we don't clear the set of futures (so that the failed future remains in there, blocking further checkpoints) and instead rely on cleaning it in a different place. Especially this part looks suspect to me: registering the future in the tracker asynchronously might mean it isn't even tracked yet when we make the decision whether the checkpoint can be confirmed:

sendingExecutor.execute(
                () -> {
                    nonSuccessFuturesTrack.trackFuture(result);
                    sender.sendEvent(sendAction, result);
                });

Alternative Solution

I think what @tillrohrmann suggested is the right direction: what we really need to guarantee is that by the time the tracked event-sending future fails (due to RPC loss), we have already marked the job as failed, so no other checkpoint can be triggered. So my proposal is to rewrite the following section of the code:

Existing Code

Note how the handleAsync on the result future means that triggering the subtask failure happens only after the future is already complete and has unlocked the checkpointing.

final Callable<CompletableFuture<Acknowledge>> sendAction =
        subtaskAccess.createEventSendAction(serializedEvent);

final CompletableFuture<Acknowledge> result = new CompletableFuture<>();
FutureUtils.assertNoException(
        result.handleAsync(
                (success, failure) -> {
                    if (failure != null && subtaskAccess.isStillRunning()) {
                        String msg =
                                String.format(
                                        EVENT_LOSS_ERROR_MESSAGE, evt, subtaskAccess.subtaskName());
                        subtaskAccess.triggerTaskFailover(new FlinkException(msg, failure));
                    }
                    return null;
                },
                sendingExecutor));

sendingExecutor.execute(() -> sender.sendEvent(sendAction, result));
return result;

Changed Code

Here, the future that is in the tracker is only complete once the subtask is marked as failed.

final CompletableFuture<Acknowledge> sendResult = new CompletableFuture<>();
sendingExecutor.execute(() -> sender.sendEvent(sendAction, sendResult));

final CompletableFuture<Acknowledge> result =
        sendResult.handleAsync(
                (success, failure) -> {
                    if (failure != null && subtaskAccess.isStillRunning()) {
                        String msg =
                                String.format(
                                       EVENT_LOSS_ERROR_MESSAGE, evt, subtaskAccess.subtaskName());
                        subtaskAccess.triggerTaskFailover(new FlinkException(msg, failure));
                    }
                    return success;
                },
               sendingExecutor);

incompleteFuturesTracker.trackFutureWhileIncomplete(result);
return result;

We do need to pass the incomplete futures tracker into the SubtaskGatewayImpl, but I think that is the only other change we need.
I would suggest in particular undoing the language changes, because they are not correct.

@gaoyunhaii
Contributor Author

gaoyunhaii commented Jul 9, 2021

Many thanks @tillrohrmann for the review, and thanks @StephanEwen for the detailed investigation and analysis!

I also agree that chaining the futures as @tillrohrmann suggested is correct and simpler. I'll update the PR accordingly very soon.

And for clarification: for the original approach that used

sendingExecutor.execute(
                () -> {
                    nonSuccessFuturesTrack.trackFuture(result);
                    sender.sendEvent(sendAction, result);
                });

my understanding is that the ordering is ensured, since both the above Runnable and completeCheckpointOnceEventsAreDone run in the main thread. In detail:

For all the events during checkpointing

  1. OperatorCoordinator#checkpointCoordinator happens in the main thread; the implementation may proxy it to a user thread.
  2. completeCheckpointOnceEventsAreDone is submitted to the main thread, from the user thread.
  3. completeCheckpointOnceEventsAreDone actually executes, in the main thread.

With my current understanding, we should require OperatorCoordinator implementations to ensure there is no event-sending request between 1 and 2; otherwise both approaches would have a problem. Then an event-sending request either happens before 1 and submits its Runnable to the main thread before 3, or happens after 2 and submits its Runnable after 3:

  1. In the first case, trackFuture happens before completeCheckpointOnceEventsAreDone and is considered.
  2. In the second case, trackFuture happens after completeCheckpointOnceEventsAreDone and is not considered.
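The ordering argument above can be sketched with a single-threaded executor standing in for the scheduler's main thread (the names runScenario, trackFuture, etc. are illustrative, not Flink's API): runnables submitted to the same single-threaded executor execute in submission order, so a trackFuture submitted in the same runnable as the send is always either fully before or fully after the checkpoint decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MainThreadOrderingDemo {

    /** Runs the two submissions on a single-threaded "main thread" and returns the observed order. */
    static List<String> runScenario() {
        // Stand-in for the scheduler's single-threaded main-thread executor.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        List<String> order = new ArrayList<>();

        // Tracking and sending are submitted as ONE runnable, so they are
        // atomic relative to the checkpoint decision submitted afterwards.
        mainThread.execute(() -> {
            order.add("trackFuture");
            order.add("sendEvent");
        });
        mainThread.execute(() -> order.add("completeCheckpointOnceEventsAreDone"));

        mainThread.shutdown();
        try {
            mainThread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
        // [trackFuture, sendEvent, completeCheckpointOnceEventsAreDone]
    }
}
```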

@gaoyunhaii
Contributor Author

Hi, I updated the PR according to the above comments. During implementation and testing I found that we still need to keep future tracking atomic with event sending: suppose the user thread keeps sending events, and a checkpoint is triggered so we eventually call tryShutValve in the main thread; then the following events are considered to be after the checkpoint and will be cached, and we should not track their futures. Thus the tracking still needs to happen together with the sending, executed in the main thread.

Contributor

@tillrohrmann tillrohrmann left a comment


Thanks for updating this PR @gaoyunhaii. I agree with you that tracking of the event-sending result future needs to start from the main thread. Otherwise we could track events that won't be sent because the EventValve is already shut.

I had a few more comments. Please take a look.

@@ -80,11 +86,14 @@
subtaskAccess.subtaskName());
subtaskAccess.triggerTaskFailover(new FlinkException(msg, failure));
Contributor


I think we should guard that this method does not throw an exception. If it does, then we should fail hard. This will ensure that we don't swallow this failure as a send event failure.

Contributor


Maybe we can add a util Runnables.assertNoException(Runnable) that calls the FatalExitExceptionHandler.INSTANCE.
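A minimal sketch of the proposed utility could look like the following (hypothetical: the real Flink util routes failures to FatalExitExceptionHandler.INSTANCE, which terminates the JVM; here the failure is escalated to the thread's uncaught-exception handler so the behavior stays observable in a test):

```java
public final class Runnables {

    private Runnables() {}

    /**
     * Runs the given runnable and escalates any throwable to the current
     * thread's uncaught-exception handler instead of letting it propagate,
     * so the failure is never swallowed as an ordinary send-event failure.
     * (In Flink this escalation would be FatalExitExceptionHandler.INSTANCE.)
     */
    public static void assertNoException(Runnable runnable) {
        try {
            runnable.run();
        } catch (Throwable t) {
            Thread thread = Thread.currentThread();
            thread.getUncaughtExceptionHandler().uncaughtException(thread, t);
        }
    }
}
```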

Contributor Author


Got it, we indeed should not leak the exception. I added the util method and the check.

@gaoyunhaii
Contributor Author

Many thanks @tillrohrmann for the review! I updated the PR according to the comments.

Contributor

@tillrohrmann tillrohrmann left a comment


Thanks for updating this PR @gaoyunhaii. The changes look good to me. +1 for merging after CI gives the green light. Feel free to clean up the commit history and squash the fixup commits.

@StephanEwen
Contributor

+1 also from my side.
Fix looks good!

@gaoyunhaii
Contributor Author

Many thanks @tillrohrmann @StephanEwen for the review! I have squashed the commits.

@gaoyunhaii
Contributor Author

@flinkbot run azure

tillrohrmann pushed a commit to gaoyunhaii/flink that referenced this pull request Jul 13, 2021
…led events processed for OepratorCoordinator

This closes apache#16432.
tillrohrmann pushed a commit that referenced this pull request Jul 13, 2021
…led events processed for OepratorCoordinator

This closes #16432.
srinipunuru pushed a commit to srinipunuru/flink that referenced this pull request Jul 21, 2021
…led events processed for OepratorCoordinator

This closes apache#16432.