[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost state when zookeeper leader changes #18296

Myracle · 2022-01-07T10:45:27Z

What is the purpose of the change

When the config high-availability.zookeeper.client.tolerate-suspended-connections is left at its default of false, the appMaster fails over whenever the ZooKeeper leader changes. In that case, the old appMaster cleans up all the HA data, so the new appMaster cannot recover from the latest checkpoint. This PR fixes that.

Brief change log

When the cluster shuts down with ApplicationStatus.UNKNOWN, cleanupHaData is set to false so that the HA data is retained.

Verifying this change

This change added tests and can be verified as follows:

Added a test that validates that HA data is not cleaned up after the cluster finishes with ApplicationStatus.UNKNOWN.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)

flinkbot · 2022-01-07T10:47:45Z

CI report:

226244d Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure to re-run the last Azure build

flinkbot · 2022-01-07T10:50:11Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 16eed82 (Fri Jan 07 10:50:10 UTC 2022)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

dmvk · 2022-01-07T13:00:20Z

CI fails with https://issues.apache.org/jira/browse/FLINK-25307

dmvk · 2022-01-07T13:00:29Z

@flinkbot run azure

dmvk

Thanks for the PR @Myracle! I think we need to take a slightly different approach here. Loss of the JobMaster leadership (see JobMasterServiceLeadershipRunner) shouldn't lead to process termination. Instead, the JobMasterServiceLeadershipRunner should be able to re-participate in the new election and potentially get re-elected.

I guess the fix could be as simple as not completing the shutDownFuture in MiniDispatcher#jobReachedTerminalState for non-globally terminal states (already suggested by @tillrohrmann in the JIRA).

WDYT?
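
The suggestion above can be sketched in isolation. This is a minimal, self-contained model and not Flink's actual MiniDispatcher: the class ShutdownGuardSketch and its Status enum are hypothetical stand-ins for MiniDispatcher and JobStatus, and only the guard around completing the shutdown future reflects the idea discussed here.

```java
import java.util.concurrent.CompletableFuture;

/** Simplified stand-in for the dispatcher; NOT Flink's actual MiniDispatcher. */
public class ShutdownGuardSketch {

    /** Hypothetical mirror of a job status that knows whether it is globally terminal. */
    public enum Status {
        RUNNING(false),
        SUSPENDED(false), // only locally terminal: another leader may still recover the job
        FINISHED(true),
        FAILED(true),
        CANCELED(true);

        private final boolean globallyTerminal;

        Status(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        public boolean isGloballyTerminal() {
            return globallyTerminal;
        }
    }

    /** Completing this future is what would trigger process shutdown in this model. */
    public final CompletableFuture<Status> shutDownFuture = new CompletableFuture<>();

    /** Completes the shutdown future only for globally terminal states. */
    public void jobReachedTerminalState(Status status) {
        if (status.isGloballyTerminal()) {
            shutDownFuture.complete(status);
        }
        // Otherwise the process keeps running, so the runner can take part
        // in the next leader election and recover the job.
    }
}
```

In this model, a SUSPENDED job (for example after leadership was revoked) leaves shutDownFuture incomplete, while FINISHED, FAILED, or CANCELED completes it.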

Myracle · 2022-01-11T02:54:53Z

@dmvk Thanks for the review and suggestions. I agree with you that the process should not exit in this case. I have modified the code. Could you review again?

Myracle · 2022-01-13T12:29:22Z

@dmvk Could you review again? Thank you.

dmvk

Thanks for updating the PR, this is headed in a good direction. I've added a few more comments.

Even though I think this fixes the underlying issue, I would feel more comfortable if we could create a simple integration test for the actual scenario that has been reported. This would ensure that we don't introduce the same problem in the future by accident and that we've really fixed the problem.

Basically we'd test the following:

Set up a MiniCluster with the JobDispatcherFactory and the corresponding job
Wait for the checkpoint
Revoke & Grant leadership
Assert that we have recovered from checkpoint
The test should fail without a fix (this currently doesn't hold for the newly introduced unit test, so this made me think that we need to put more effort here)
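
The shape of such a test can be illustrated with a toy model. The sketch below does not use any Flink API: HaStore and Runner are hypothetical classes invented for illustration, and the cleanUpOnRevoke flag merely simulates the behaviour before and after a fix. It only shows the checkpoint, revoke, grant, and assert sequence listed above.

```java
import java.util.Optional;

/** Toy model of the revoke/grant scenario; none of these classes exist in Flink. */
public class LeadershipRecoverySketch {

    /** Hypothetical HA storage that outlives a single leadership session. */
    public static final class HaStore {
        private Long latestCheckpoint; // null until the first checkpoint completes

        public void storeCheckpoint(long value) {
            latestCheckpoint = value;
        }

        public Optional<Long> latestCheckpoint() {
            return Optional.ofNullable(latestCheckpoint);
        }

        public void cleanUp() {
            latestCheckpoint = null;
        }
    }

    /** Hypothetical job runner that restores its counter when leadership is granted. */
    public static final class Runner {
        private final HaStore store;
        private final boolean cleanUpOnRevoke; // true simulates the unfixed behaviour
        private long counter;
        private boolean restoredFromCheckpoint;

        public Runner(HaStore store, boolean cleanUpOnRevoke) {
            this.store = store;
            this.cleanUpOnRevoke = cleanUpOnRevoke;
        }

        public void grantLeadership() {
            Optional<Long> checkpoint = store.latestCheckpoint();
            restoredFromCheckpoint = checkpoint.isPresent();
            counter = checkpoint.orElse(0L);
        }

        public void process(long records) {
            counter += records;
        }

        public void checkpoint() {
            store.storeCheckpoint(counter);
        }

        public void revokeLeadership() {
            if (cleanUpOnRevoke) {
                store.cleanUp(); // the problematic path: HA data is lost on leadership loss
            }
        }

        public boolean restoredFromCheckpoint() {
            return restoredFromCheckpoint;
        }

        public long counter() {
            return counter;
        }
    }
}
```

With cleanUpOnRevoke set to false, a runner that checkpoints, loses leadership, and regains it reports restoredFromCheckpoint() as true. With the flag set to true the same sequence starts from scratch, which is the property that makes the test fail without a fix.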

We already have a pretty similar test for the application mode, which you can use as inspiration (ApplicationDispatcherBootstrapITCase).

WDYT?

dmvk · 2022-01-14T15:05:37Z

flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/MiniDispatcherTest.java

+     * signals job termination if the JobStatus is not globally terminal state.
+     */
+    @Test
+    public void testNotTerminationWithoutGloballyTerminalState() throws Exception {


the test passes without the fix as well

I have modified the code to test that the shutDownFuture is not completed.

dmvk · 2022-01-14T15:21:18Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/MiniDispatcher.java

+                    executionMode);
+            shutDownFuture.complete(ApplicationStatus.fromJobStatus(jobStatus));
+        } else {
+            log.warn(


Why do we need to print a warning here?

Always try to think about it from the user's perspective. If a user sees this log message, is it relevant to what's going on? Is it something they should investigate?

dmvk · 2022-01-14T15:25:53Z

flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/MiniDispatcher.java


-        if (jobCancelled || executionMode == ClusterEntrypoint.ExecutionMode.DETACHED) {
+        JobStatus jobStatus = archivedExecutionGraph.getState();
+        if ((jobStatus != null && jobStatus.isGloballyTerminalState())


Can the job status be null here? My intuition would be that for the terminal state this should never be the case and would actually signal an underlying issue. Maybe adding a safeguard (Objects.requireNonNull) with a reasonable message could be a better fit?
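
A safeguard of that kind could look roughly like the following. This is a sketch under assumptions: NullGuardSketch and its Status enum are hypothetical stand-ins, and the message text is illustrative rather than the wording used in the PR. Only Objects.requireNonNull is a real JDK call.

```java
import java.util.Objects;

/** Illustrates failing fast on a null status instead of silently tolerating it. */
public class NullGuardSketch {

    /** Hypothetical, reduced stand-in for a job status. */
    public enum Status {
        SUSPENDED,
        FINISHED
    }

    /** Returns whether the status is globally terminal; rejects null with a clear message. */
    public static boolean isGloballyTerminal(Status jobStatus) {
        // A null status here would point to an underlying bug, so surface it immediately.
        Objects.requireNonNull(jobStatus, "jobStatus must not be null when a job reaches a terminal state");
        return jobStatus == Status.FINISHED;
    }
}
```

Compared with a jobStatus != null check inside the condition, this turns an unexpected null into an immediate NullPointerException with a descriptive message instead of quietly taking the other branch.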

tillrohrmann · 2022-01-25T07:57:08Z

What's the state of this PR? Can we resolve the open comments to merge it soon?

Myracle · 2022-01-27T02:27:51Z

@tillrohrmann Sorry for the late reply. I have written most of the code and will finish it soon.

Myracle · 2022-01-27T11:56:44Z

@flinkbot run azure

…state when zookeeper leader changes

Myracle · 2022-01-28T06:19:37Z

@flinkbot run azure

dmvk

Thanks for the update @Myracle, I really like it! <3 I've added a few more suggestions for the test cases, to speed them up and to align them with the community code style (JUnit 5 for the new tests).

I think once we address these, the PR should be good to go! Once more, thanks for the contribution, this is a really great improvement.

flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/MiniDispatcherTest.java

dmvk · 2022-01-28T07:40:48Z

flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/MiniDispatcherTest.java

+     * signals job termination if the JobStatus is not globally terminal state.
+     */
+    @Test
+    public void testNotTerminationWithoutGloballyTerminalState() throws Exception {


flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/JobDispatcherITCase.java

Myracle · 2022-01-28T10:10:51Z

@dmvk Thanks for the valuable suggestions, which make the code cleaner. I have modified the code. Could you please review again?

dmvk

LGTM 👍 Thanks for the update

dmvk · 2022-01-28T12:01:23Z

:( The compile step is failing, AT_LEAST_ONE_CHECKPOINT_COMPLETED needs to be renamed (-> camel case). Sorry for confusing you.

Myracle · 2022-01-29T08:14:41Z

@dmvk Fixed and the CI passed.

tillrohrmann

Thanks for updating the PR @Myracle. Merging it now.

…state when zookeeper leader changes This closes apache#18296.

rmetzger added the component=Runtime/Coordination label Jan 7, 2022

Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 16eed82 to e12d974 Compare January 10, 2022 02:19

dmvk requested changes Jan 10, 2022

Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from e12d974 to ac842b5 Compare January 11, 2022 02:51

dmvk requested changes Jan 14, 2022

Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from ac842b5 to 6ffad61 Compare January 27, 2022 02:22

Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 6ffad61 to 216076e Compare January 27, 2022 08:42

[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost …

f098fea

…state when zookeeper leader changes

Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 216076e to f098fea Compare January 27, 2022 12:03

dmvk reviewed Jan 28, 2022

dmvk approved these changes Jan 28, 2022

[hotfix][Runtime/Coordination] Minor fix

226244d

Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 477c124 to 226244d Compare January 29, 2022 01:33

tillrohrmann approved these changes Jan 29, 2022

tillrohrmann closed this in 8ba13f3 Jan 29, 2022

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 29, 2022

[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost …

93f8d97

…state when zookeeper leader changes This closes apache#18296.

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 29, 2022

[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost …

d655778

…state when zookeeper leader changes This closes apache#18296.

This was referenced Jan 29, 2022

[BP-1.14][FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost state when zookeeper leader changes #18559

Merged

[BP-1.13][FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost state when zookeeper leader changes #18560

Closed

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 30, 2022

[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost …

440967f

…state when zookeeper leader changes This closes apache#18296.

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 30, 2022

[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost …

7b4e54b

…state when zookeeper leader changes This closes apache#18296.

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 30, 2022

[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost …

965155a

…state when zookeeper leader changes This closes apache#18296.

niklassemmler pushed a commit to niklassemmler/flink that referenced this pull request Feb 3, 2022

[FLINK-25486][Runtime/Coordination] Fix the bug that flink will lost …

ec7fa8b

…state when zookeeper leader changes This closes apache#18296.
