[FLINK-17707][k8s] Support configuring replicas of JobManager deployment when HA enabled #15286
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit 9742b4a (Thu Sep 23 17:22:53 UTC 2021)
✅ no warnings
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
cc @tillrohrmann could you please have a look at your convenience?
519d96d to 6a1950e
Rebased onto the latest master.
6a1950e to 71743c8
cc @xintongsong Could you please have a look?
Thanks for preparing the PR, @wangyang0918. I'm afraid we are not ready for this feature ATM. As FLINK-21667 reveals, a standby RM can currently perform modifying actions on a native Kubernetes deployment. I have not started working on FLINK-21667, because I'd rather not make significant changes to the RM lifecycle management right before the 1.13 feature freeze. I'd suggest postponing this feature as well and marking it as blocked by FLINK-21667.
@xintongsong Thanks for your comments. Even though the standby currently only stops terminated pods, this is not by-design behavior. Standby ResourceManagers should not perform modifying actions. I agree with you that we could defer this PR until after FLINK-21667.
71743c8 to 64a9b36
64a9b36 to 967f28c
Since FLINK-21667 has been merged, I think this PR is ready for review. cc @xintongsong
Thanks @wangyang0918. Changes LGTM.
I have two additional questions.
- Has the feature been verified on real clusters?
- What documentation changes, in addition to the config option description, need to be made for this feature?
checkArgument(
        replicas > 0,
        "'%s' should not be configured less than one.",
        KubernetesConfigOptions.KUBERNETES_JOBMANAGER_REPLICAS.key());
if (replicas > 1 && !HighAvailabilityMode.isHighAvailabilityModeActivated(flinkConfig)) {
    throw new IllegalArgumentException(
            "High availability should be enabled when starting standby JobManagers.");
}
IllegalConfigurationException is preferred rather than IllegalArgumentException.
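For illustration, the two checks in the snippet under review could be written so that both failures surface as a configuration error. This is only a sketch, not the code that was merged: `IllegalConfigurationException` is redefined locally as a stand-in for Flink's class so the example compiles on its own, the `haEnabled` flag is a hypothetical replacement for `HighAvailabilityMode.isHighAvailabilityModeActivated(flinkConfig)`, and the option key string is an assumption about what `KUBERNETES_JOBMANAGER_REPLICAS.key()` returns.

```java
// Local stand-in for Flink's IllegalConfigurationException (assumption: a RuntimeException).
class IllegalConfigurationException extends RuntimeException {
    IllegalConfigurationException(String message) {
        super(message);
    }
}

final class JobManagerReplicasCheck {
    // Assumed value of KubernetesConfigOptions.KUBERNETES_JOBMANAGER_REPLICAS.key().
    static final String REPLICAS_KEY = "kubernetes.jobmanager.replicas";

    private JobManagerReplicasCheck() {}

    /**
     * Returns the replica count unchanged if it is valid, otherwise throws.
     * haEnabled is a hypothetical stand-in for the HA check on the Flink configuration.
     */
    static int validateReplicas(int replicas, boolean haEnabled) {
        if (replicas < 1) {
            throw new IllegalConfigurationException(
                    String.format("'%s' should not be configured less than one.", REPLICAS_KEY));
        }
        // More than one JobManager only makes sense when leader election is available.
        if (replicas > 1 && !haEnabled) {
            throw new IllegalConfigurationException(
                    "High availability should be enabled when starting standby JobManagers.");
        }
        return replicas;
    }
}
```

With this shape, `validateReplicas(1, false)` and `validateReplicas(3, true)` pass through, while `validateReplicas(2, false)` is rejected.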
@xintongsong Yes, this feature has been verified in a real K8s cluster. Actually, I think the Flink config option description is enough. But I will add a brief introduction in the
967f28c to 9742b4a
@xintongsong Thanks for your review. PR updated.
LGTM. Merging this.
What is the purpose of the change
At the moment, native K8s setups hard-code the replica count of the JobManager Deployment to 1. However, users who enable the ZooKeeper HighAvailabilityServices or the Kubernetes HA service would also like to configure the number of JobManager replicas, so that standby JobManagers allow faster recovery.
In #15248, we added documentation on how to start standby JobManagers for a standalone Flink cluster on K8s. This PR makes the replica count of the JobManager deployment configurable for native mode.
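As a rough illustration of the intended behavior, the sketch below resolves the replica count from a flat key/value configuration and falls back to 1 when nothing is set, which matches the previously hard-coded value. It does not use Flink's `Configuration` API: the class name, the plain `Map`, and both key strings (`kubernetes.jobmanager.replicas`, `high-availability`) are assumptions made for this example, and the HA detection is a simplification that treats an unset value or `NONE` as disabled.

```java
import java.util.Map;

final class JobManagerReplicasConfig {
    // Assumed option keys; the PR itself refers to them through Flink's config option classes.
    static final String REPLICAS_KEY = "kubernetes.jobmanager.replicas";
    static final String HA_KEY = "high-availability";
    // Fallback that mirrors the previously hard-coded Deployment replica count.
    static final int DEFAULT_REPLICAS = 1;

    private JobManagerReplicasConfig() {}

    /** Reads the configured replica count, or the default when the key is absent. */
    static int replicas(Map<String, String> conf) {
        String raw = conf.get(REPLICAS_KEY);
        return raw == null ? DEFAULT_REPLICAS : Integer.parseInt(raw.trim());
    }

    /** Simplified HA check: any non-empty value other than NONE counts as enabled. */
    static boolean haEnabled(Map<String, String> conf) {
        String mode = conf.get(HA_KEY);
        return mode != null && !mode.trim().isEmpty() && !"NONE".equalsIgnoreCase(mode.trim());
    }
}
```

A configuration that sets the replica key to `2` together with an HA mode would then describe one leader plus one standby JobManager; without the replica key, the count stays at 1 as before.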
Brief change log
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation