Conversation

@Myracle
Contributor

@Myracle Myracle commented Jan 7, 2022

What is the purpose of the change

When high-availability.zookeeper.client.tolerate-suspended-connections is left at its default of false, the appMaster fails over whenever the ZooKeeper leader changes. In this case, the old appMaster cleans up all the HA data, so the new appMaster cannot recover from the latest checkpoint. This PR fixes that.

Brief change log

  • When the cluster shuts down with ApplicationStatus.UNKNOWN, cleanupHaData is set to false so that the HA data is retained.

Verifying this change

This change added tests and can be verified as follows:

  • Added a test that validates that HA data is not cleaned up after the cluster finishes with ApplicationStatus.UNKNOWN.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@flinkbot
Collaborator

flinkbot commented Jan 7, 2022

CI report:

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

@flinkbot
Collaborator

flinkbot commented Jan 7, 2022

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 16eed82 (Fri Jan 07 10:50:10 UTC 2022)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@dmvk
Member

dmvk commented Jan 7, 2022

@flinkbot run azure

@Myracle Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 16eed82 to e12d974 Compare January 10, 2022 02:19
Member

@dmvk dmvk left a comment

Thanks for the PR @Myracle! I think we need to take a slightly different approach here. Loss of the JobMaster leadership (see JobMasterServiceLeadershipRunner) shouldn't lead to process termination. Instead, the JobMasterServiceLeadershipRunner should be able to re-participate in the new election and potentially get re-elected.

I guess the fix could be as simple as not completing the shutdownFuture in MiniDispatcher#jobReachedTerminalState for non-globally terminal states (already suggested by @tillrohrmann in the JIRA).
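
As a rough illustration of that suggestion, here is a minimal, self-contained sketch in plain Java. It is not Flink code: ShutdownGuardSketch, its reduced JobStatus enum, and this jobReachedTerminalState helper are hypothetical stand-ins, and the real MiniDispatcher method has a different signature. The only point is that the shutdown future is left incomplete for a non-globally terminal state such as SUSPENDED.

```java
import java.util.concurrent.CompletableFuture;

/** Toy model of the suggested guard; none of these types are Flink's own classes. */
final class ShutdownGuardSketch {

    /** Hypothetical stand-in for Flink's JobStatus, reduced to four states. */
    enum JobStatus {
        FINISHED(true),
        CANCELED(true),
        FAILED(true),
        SUSPENDED(false); // assumed here to be the state reached when leadership is lost

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    /**
     * Completes the shutdown future only for globally terminal states.
     * Returns true if this call completed the future.
     */
    static boolean jobReachedTerminalState(
            JobStatus status, CompletableFuture<JobStatus> shutDownFuture) {
        if (status.isGloballyTerminalState()) {
            return shutDownFuture.complete(status);
        }
        // Leave the future incomplete so the process keeps running
        // and can take part in the next leader election.
        return false;
    }

    public static void main(String[] args) {
        CompletableFuture<JobStatus> shutDownFuture = new CompletableFuture<>();

        jobReachedTerminalState(JobStatus.SUSPENDED, shutDownFuture);
        System.out.println("done after SUSPENDED: " + shutDownFuture.isDone());

        jobReachedTerminalState(JobStatus.FINISHED, shutDownFuture);
        System.out.println("done after FINISHED: " + shutDownFuture.isDone());
    }
}
```

In this toy model, whatever waits on the future (and would terminate the process and clean up HA data) is only triggered once the job has truly finished, failed, or been cancelled.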

WDYT?

@Myracle Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from e12d974 to ac842b5 Compare January 11, 2022 02:51
@Myracle
Contributor Author

Myracle commented Jan 11, 2022

@dmvk Thanks for the review and suggestions. I agree with you that the process should not exit in this case. I have modified the code. Could you review again?

@Myracle
Contributor Author

Myracle commented Jan 13, 2022

@dmvk Could you review again? Thank you.

Member

@dmvk dmvk left a comment

Thanks for updating the PR, this is headed in a good direction. I've added a few more comments.

Even though I think this fixes the underlying issue, I would feel more comfortable if we could create a simple integration test for the actual scenario that has been reported. This would ensure that we don't introduce the same problem in the future by accident and that we've really fixed the problem.

Basically we'd test the following:

  • Set up a MiniCluster with the JobDispatcherFactory and the corresponding job
  • Wait for the checkpoint
  • Revoke & Grant leadership
  • Assert that we have recovered from checkpoint
  • The test should fail without a fix (this currently doesn't hold for the newly introduced unit test, so this made me think that we need to put more effort here)

We already have a pretty similar test for the application mode, which you can use as inspiration (ApplicationDispatcherBootstrapITCase).
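
The steps listed above can be modelled, very loosely, without a cluster. The sketch below is a toy in plain Java, not a Flink integration test: ToyHaStore, ToyJobRunner, and the cleanUpOnRevoke flag are hypothetical names introduced only for illustration, and a real test would use a MiniCluster as described. It only shows the shape of the assertion: after a revoke and a grant, the job should still see the checkpoint it completed earlier, and the same scenario should fail when HA data is wiped on leadership loss.

```java
import java.util.OptionalLong;

/** Toy model of the revoke/grant scenario; none of these types exist in Flink. */
final class LeadershipRecoverySketch {

    /** Hypothetical in-memory stand-in for the HA store holding the latest checkpoint id. */
    static final class ToyHaStore {
        private OptionalLong latestCheckpoint = OptionalLong.empty();

        void storeCheckpoint(long id) {
            latestCheckpoint = OptionalLong.of(id);
        }

        OptionalLong latestCheckpoint() {
            return latestCheckpoint;
        }

        void cleanUp() {
            latestCheckpoint = OptionalLong.empty();
        }
    }

    /** Hypothetical job runner: restores on grant, optionally wipes HA data on revoke. */
    static final class ToyJobRunner {
        private final ToyHaStore store;
        private final boolean cleanUpOnRevoke;
        private OptionalLong restoredFrom = OptionalLong.empty();

        ToyJobRunner(ToyHaStore store, boolean cleanUpOnRevoke) {
            this.store = store;
            this.cleanUpOnRevoke = cleanUpOnRevoke;
        }

        void grantLeadership() {
            // On (re-)election, restore from whatever the HA store still holds.
            restoredFrom = store.latestCheckpoint();
        }

        void completeCheckpoint(long id) {
            store.storeCheckpoint(id);
        }

        void revokeLeadership() {
            // Losing leadership is not a globally terminal state, so HA data
            // should survive. The flag models the reported bug.
            if (cleanUpOnRevoke) {
                store.cleanUp();
            }
        }

        OptionalLong restoredFrom() {
            return restoredFrom;
        }
    }

    /** Runs the scenario and returns the checkpoint restored after re-election. */
    static OptionalLong runScenario(boolean cleanUpOnRevoke) {
        ToyHaStore store = new ToyHaStore();
        ToyJobRunner runner = new ToyJobRunner(store, cleanUpOnRevoke);
        runner.grantLeadership();      // start the job as leader
        runner.completeCheckpoint(1L); // wait for the checkpoint
        runner.revokeLeadership();     // revoke leadership ...
        runner.grantLeadership();      // ... and grant it again
        return runner.restoredFrom();  // what did the job recover from?
    }

    public static void main(String[] args) {
        System.out.println("HA data retained: restored from " + runScenario(false));
        System.out.println("HA data cleaned up: restored from " + runScenario(true));
    }
}
```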

WDYT?

* signals job termination if the JobStatus is not globally terminal state.
*/
@Test
public void testNotTerminationWithoutGloballyTerminalState() throws Exception {
Member

the test passes without the fix as well

Contributor Author

I have modified the test to check that the shutDownFuture is not completed.

Member

👍

executionMode);
shutDownFuture.complete(ApplicationStatus.fromJobStatus(jobStatus));
} else {
log.warn(
Member

Why do we need to print a warning here?

Always try to think about it from the user's perspective. If a user sees this log message, is it relevant to what's going on? Is it something he / she should investigate?


if (jobCancelled || executionMode == ClusterEntrypoint.ExecutionMode.DETACHED) {
    JobStatus jobStatus = archivedExecutionGraph.getState();
    if ((jobStatus != null && jobStatus.isGloballyTerminalState())
Member

Can the job status be null here? My intuition would be that for the terminal state this should never be the case and would actually signal an underlying issue. Maybe adding a safeguard (Objects.requireNonNull) with a reasonable message could be a better fit?
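
A minimal sketch of such a safeguard, assuming a stand-in JobStatus enum; the helper name requireJobStatus and the message text are hypothetical and not taken from the PR. Objects.requireNonNull throws a NullPointerException carrying the given message, so an unexpected null surfaces immediately instead of being silently skipped by a null check.

```java
import java.util.Objects;

/** Toy illustration of a fail-fast null check; JobStatus here is a stand-in enum. */
final class JobStatusNullGuardSketch {

    /** Hypothetical, reduced stand-in for Flink's JobStatus. */
    enum JobStatus {
        FINISHED,
        SUSPENDED
    }

    /** Returns the given status, or fails with a descriptive message if it is null. */
    static JobStatus requireJobStatus(JobStatus jobStatus) {
        return Objects.requireNonNull(
                jobStatus, "JobStatus should not be null when the job reaches a terminal state.");
    }

    public static void main(String[] args) {
        System.out.println(requireJobStatus(JobStatus.FINISHED));
        try {
            requireJobStatus(null);
        } catch (NullPointerException e) {
            // The message points directly at the violated assumption.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```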

@tillrohrmann
Contributor

What's the state of this PR? Can we resolve the open comments to merge it soon?

@Myracle Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from ac842b5 to 6ffad61 Compare January 27, 2022 02:22
@Myracle
Contributor Author

Myracle commented Jan 27, 2022

@tillrohrmann Sorry for the late reply. I have written most of the code and will finish it soon.

@Myracle Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 6ffad61 to 216076e Compare January 27, 2022 08:42
@Myracle
Contributor Author

Myracle commented Jan 27, 2022

@flinkbot run azure

@Myracle Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 216076e to f098fea Compare January 27, 2022 12:03
@Myracle
Contributor Author

Myracle commented Jan 28, 2022

@flinkbot run azure

Member

@dmvk dmvk left a comment

Thanks for the update @Myracle, I really like it! <3 I've added a few more suggestions for the test cases, to speed them up and to align them with the community code style (JUnit 5 for the new tests).

I think once we address these, the PR should be good to go! Once more, thanks for the contribution, this is a really great improvement.

* signals job termination if the JobStatus is not globally terminal state.
*/
@Test
public void testNotTerminationWithoutGloballyTerminalState() throws Exception {
Member

👍

@Myracle
Contributor Author

Myracle commented Jan 28, 2022

@dmvk Thanks for the valuable suggestions, which make the code cleaner. I have modified the code. Could you please review again?

Member

@dmvk dmvk left a comment

LGTM 👍 Thanks for the update

@dmvk
Member

dmvk commented Jan 28, 2022

:( compile step is failing, AT_LEAST_ONE_CHECKPOINT_COMPLETED needs to be renamed (-> camel case). Sorry for confusing you.

@Myracle Myracle force-pushed the FLINK-25486-lost-state-when-zookeeper-leader-changes-bugfix branch from 477c124 to 226244d Compare January 29, 2022 01:33
@Myracle
Contributor Author

Myracle commented Jan 29, 2022

@dmvk Fixed and the CI passed.

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for updating the PR @Myracle. Merging it now.

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 29, 2022
tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 30, 2022
niklassemmler pushed a commit to niklassemmler/flink that referenced this pull request Feb 3, 2022
5 participants