
[FLINK-17707][k8s] Support configuring replicas of JobManager deployment when HA enabled #15286

Closed

Conversation

wangyang0918
Contributor

@wangyang0918 wangyang0918 commented Mar 19, 2021

What is the purpose of the change

At the moment, in the native K8s setups we hard-code the replicas of the JobManager Deployment to 1. However, when users enable the ZooKeeper HighAvailabilityServices or the Kubernetes HA services, they may also want to configure the replicas of the JobManager Deployment for faster recovery.

In #15248, we added documentation for how to start standby JobManagers for a standalone Flink cluster on K8s. This PR makes the replicas of the JobManager Deployment configurable for the native mode.
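For illustration, a minimal sketch of what the new option could look like (an assumption based on the PR description and Flink's existing kubernetes.* option naming, not the exact code from the diff):

// import org.apache.flink.configuration.ConfigOption;
// import org.apache.flink.configuration.ConfigOptions;
public static final ConfigOption<Integer> KUBERNETES_JOBMANAGER_REPLICAS =
        ConfigOptions.key("kubernetes.jobmanager.replicas")
                .intType()
                .defaultValue(1)
                .withDescription(
                        "Specify how many JobManager pods will be started. "
                                + "A value greater than one requires high availability to be enabled.");

The default of 1 keeps today's behavior; only users who enable an HA service would raise it to start standby JobManagers.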

Brief change log

  • Support configuring replicas of JobManager deployment when HA enabled

Verifying this change

  • Add dedicated tests for the changes

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Mar 19, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 9742b4a (Thu Sep 23 17:22:53 UTC 2021)

✅no warnings

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Mar 19, 2021

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@wangyang0918
Contributor Author

cc @tillrohrmann could you please have a look at your convenience?

@wangyang0918
Contributor Author

Rebased onto the latest master.

@wangyang0918
Contributor Author

cc @xintongsong Could you please have a look?

@xintongsong
Contributor

Thanks for preparing the PR, @wangyang0918.

I'm afraid we are not ready for this feature at the moment. As FLINK-21667 reveals, a standby RM can currently perform modifying actions on the native Kubernetes deployment. I have not started working on FLINK-21667, because I'd rather not make significant changes to the RM lifecycle management right before the 1.13 feature freeze.

I'd suggest postponing this feature as well, marking it as blocked by FLINK-21667.

@wangyang0918
Contributor Author

wangyang0918 commented Mar 26, 2021

@xintongsong Thanks for your comments. Even though the standby currently only stops terminated pods, this is not by-design behavior. The standby ResourceManagers should not perform modifying actions. I agree with you that we could defer this PR until after FLINK-21667.

@wangyang0918 wangyang0918 marked this pull request as draft March 26, 2021 05:01
@wangyang0918 wangyang0918 marked this pull request as ready for review June 3, 2021 12:02
@wangyang0918
Contributor Author

Since FLINK-21667 has been merged, I think this PR is ready for review. cc @xintongsong

Contributor

@xintongsong xintongsong left a comment

Thanks @wangyang0918. Changes LGTM.
I have two additional questions.

  • Has the feature been verified on real clusters?
  • What documentation changes, in addition to the config option description, need to be made for this feature?

Comment on lines 167 to 177
checkArgument(
        replicas > 0,
        "'%s' should not be configured less than one.",
        KubernetesConfigOptions.KUBERNETES_JOBMANAGER_REPLICAS.key());
if (replicas > 1 && !HighAvailabilityMode.isHighAvailabilityModeActivated(flinkConfig)) {
    throw new IllegalArgumentException(
            "High availability should be enabled when starting standby JobManagers.");
}
Contributor

IllegalConfigurationException is preferred over IllegalArgumentException here.
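A possible shape of the check after applying this suggestion (a sketch only; the names come from the quoted diff above and the surrounding method is assumed):

// import org.apache.flink.configuration.IllegalConfigurationException;
// import org.apache.flink.runtime.jobmanager.HighAvailabilityMode;
final int replicas = flinkConfig.get(KubernetesConfigOptions.KUBERNETES_JOBMANAGER_REPLICAS);
if (replicas < 1) {
    // Reject non-positive replica counts as a configuration error rather than an argument error.
    throw new IllegalConfigurationException(
            String.format(
                    "'%s' should not be configured less than one.",
                    KubernetesConfigOptions.KUBERNETES_JOBMANAGER_REPLICAS.key()));
}
if (replicas > 1 && !HighAvailabilityMode.isHighAvailabilityModeActivated(flinkConfig)) {
    // Standby JobManagers only make sense with an HA service enabled; fail fast otherwise.
    throw new IllegalConfigurationException(
            "High availability should be enabled when starting standby JobManagers.");
}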

@wangyang0918
Contributor Author

@xintongsong Yes, this feature has been verified on a real K8s cluster. Actually, I think the Flink config option description is enough, but I will add a brief introduction in native_kubernetes.md.

@wangyang0918
Contributor Author

@xintongsong Thanks for your review. PR updated.

Contributor

@xintongsong xintongsong left a comment

LGTM. Merging this.
