
[FLINK-20357][docs] Split HA documentation up into a general overview and the specific implementations #14254

Closed
wants to merge 2 commits

Conversation

tillrohrmann
Contributor

This commit splits the HA documentation up into a general overview and the specific implementations:

  • ZooKeeper HA services
  • Kubernetes HA services

Moreover, this commit moves resource-provider specific documentation to the respective resource-provider
documentation. This is done in order not to lose this information; it should be properly incorporated once the resource-provider documentation is updated.

cc @rmetzger, @XComp, @wangyang0918

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 9f8f123 (Fri Nov 27 17:55:04 UTC 2020)

✅ no warnings

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Nov 27, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Contributor

@wangyang0918 wangyang0918 left a comment

Thanks @tillrohrmann for creating this ticket. The separation of the HA documentation makes sense to me. I just left some minor comments; please have a look.

<pre>high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory</pre>

- **Storage directory** (required):
JobManager metadata is persisted in the file system [`high-availability.storageDir`]({% link deployment/config.md %}#high-availability-storagedir) and only a pointer to this state is stored in ZooKeeper.
Contributor

Suggested change
JobManager metadata is persisted in the file system [`high-availability.storageDir`]({% link deployment/config.md %}#high-availability-storagedir) and only a pointer to this state is stored in ZooKeeper.
JobManager metadata is persisted in the file system [`high-availability.storageDir`]({% link deployment/config.md %}#high-availability-storagedir) and only a pointer to this state is stored in Kubernetes.

Contributor Author

Good catch :-)

- **Cluster id** (required):
In order to identify the Flink cluster, you have to specify a [`kubernetes.cluster-id`]({% link deployment/config.md %}#kubernetes-cluster-id).

<pre>kubernetes.cluster-id: Cluster1337</pre>
Contributor

The `kubernetes.cluster-id` should only contain lowercase alphanumeric characters, `-` or `.`. This is a Kubernetes limitation.

Contributor Author

Very good information. I guess we should add this information to the KubernetesConfigOptions.CLUSTER_ID.
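
For reference, a minimal flink-conf.yaml sketch combining the options discussed in this thread could look as follows; the cluster id and storage path are placeholders, and the cluster id is kept lowercase to respect the Kubernetes naming restriction mentioned above:

<pre>
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.cluster-id: cluster-1337
high-availability.storageDir: hdfs:///flink/recovery
</pre>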

@@ -2,7 +2,7 @@
title: "Kubernetes HA Services"
nav-title: Kubernetes HA Services
nav-parent_id: ha
nav-pos: 2
nav-pos: 6
Contributor

nit: I am not sure why we have nav-pos 6 for kubernetes_ha.md and 5 for zookeeper_ha.md. Would 2 and 1 also make sense?

Contributor Author

It does not make a difference, since only the relative positioning with respect to zookeeper_ha.md matters. I'll update it to avoid future confusion, though.


### How to configure Kubernetes HA Services

Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml]({% link deployment/resource-providers/standalone/kubernetes.md %}#common-cluster-resource-definitions). All other yamls do not need to be updated.
Contributor

Suggested change
Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml]({% link deployment/resource-providers/standalone/kubernetes.md %}#common-cluster-resource-definitions). All other yamls do not need to be updated.
Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml](#common-cluster-resource-definitions). All other yamls do not need to be updated.

Contributor Author

Good catch. I'll update it.
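
As an illustration, the addition to flink-configuration-configmap.yaml might look roughly like the sketch below; the ConfigMap name, labels and concrete values are assumptions based on the standalone Kubernetes guide rather than part of this PR:

<pre>
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: flink
data:
  flink-conf.yaml: |+
    # ... existing entries from the standalone Kubernetes guide ...
    kubernetes.cluster-id: standalone-k8s-ha-cluster
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: hdfs:///flink/recovery
</pre>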

Contributor

@rmetzger rmetzger left a comment

Thanks a lot for splitting the HA configuration pages and reworking them!

I had some minor cleanups and questions.


3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):
For more information on Flink configuration for Kerberos security, please see [here]({% link deployment/config.md %}).
Contributor

Suggested change
For more information on Flink configuration for Kerberos security, please see [here]({% link deployment/config.md %}).
For more information on Flink configuration for Kerberos security, please refer to the [security section of the Flink configuration page]({% link deployment/config.md %}#security).

Contributor Author

Sounds good. Will update it.


3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):
For more information on Flink configuration for Kerberos security, please see [here]({% link deployment/config.md %}).
You can also find [here]({% link deployment/security/security-kerberos.md %}) further details on how Flink internally setups Kerberos-based security.
Contributor

Suggested change
You can also find [here]({% link deployment/security/security-kerberos.md %}) further details on how Flink internally setups Kerberos-based security.
You can also find further details on [how Flink sets up Kerberos-based security internally]({% link deployment/security/security-kerberos.md %}).

Contributor Author

Will update it.
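
For readers looking for the concrete knobs, a rough flink-conf.yaml sketch of the Kerberos/ZooKeeper related options is shown below; the keytab path and principal are placeholders, and the exact option set should be verified against the security section of the configuration page:

<pre>
security.kerberos.login.keytab: /path/to/flink.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM
security.kerberos.login.contexts: Client
zookeeper.sasl.service-name: zookeeper
</pre>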

@@ -23,77 +23,50 @@ specific language governing permissions and limitations
under the License.
-->

## Kubernetes Cluster High Availability
Kubernetes high availability service could support both [standalone Flink on Kubernetes]({% link deployment/resource-providers/standalone/kubernetes.md %}) and [native Kubernetes integration]({% link deployment/resource-providers/native_kubernetes.md %}).
Flink's Kubernetes HA services use [Kubernetes](https://kubernetes.io/) for high availability services.
Contributor

Are there any restrictions on the K8s versions supported? Or on the required K8s features? (I guess the answer is that we only need ConfigMaps?)
I'm asking, so that users can evaluate if it also works with implementations such as https://k3s.io/

Contributor Author

I need @wangyang0918 to answer this question here.

Contributor

ConfigMaps and resource versions have been supported since the very beginning (versions much lower than 1.9). I believe very few users are still running a Kubernetes version lower than 1.9, since the latest stable version is now 1.19.

BTW, the native K8s integration requires Kubernetes 1.9 or above.

Contributor Author

I guess to answer whether https://k3s.io/ works, one needs to try it out. I wouldn't block the PR on this, though. Thanks for the answer @wangyang0918.

- **high-availability mode** (required): The *high-availability mode* has to be set in `conf/flink-conf.yaml` to *zookeeper* in order to enable high availability mode.
Alternatively this option can be set to FQN of factory class Flink should use to create HighAvailabilityServices instance.
- **high-availability mode** (required):
The `high-availability` option has to be set to *zookeeper*.
Contributor

Suggested change
The `high-availability` option has to be set to *zookeeper*.
The `high-availability` option has to be set to `zookeeper`.

Contributor Author

Sounds good.

you have to manually configure separate cluster-ids for each cluster.
**Important**:
You should not set this value manually when running on YARN, native Kubernetes or on another cluster manager.
In those cases a cluster-id is automatically being generated.
Contributor

Suggested change
In those cases a cluster-id is automatically being generated.
In those cases a cluster-id is being automatically generated.

I'm not sure about this one.
"automatically being generated" has 6k google hits, "being automatically generated" 48k ;)

Contributor Author

being automatically generated sounds better to me.



<pre>high-availability.zookeeper.quorum: address1:2181[,...],addressX:2181</pre>

Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port.
Contributor

Suggested change
Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port.
Each `addressX:port` refers to a ZooKeeper server, which is reachable by Flink at the given address and port.

Contributor Author

Good idea.
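
A concrete quorum entry would then read, for example (the hostnames are placeholders):

<pre>high-availability.zookeeper.quorum: zk-node-1:2181,zk-node-2:2181,zk-node-3:2181</pre>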

@tillrohrmann
Contributor Author

Thanks for your review @rmetzger and @wangyang0918. I've addressed your comments and resolved the merge conflict.

Contributor

@XComp XComp left a comment

Thanks for refactoring the documentation. 👍 Only minor things popped up during my review.

docs/deployment/ha/index.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/resource-providers/standalone/index.md (outdated)

### Example: Standalone Cluster with 2 JobManagers

1. **Configure high availability mode and ZooKeeper quorum** in `conf/flink-conf.yaml`:
Contributor

Suggested change
1. **Configure high availability mode and ZooKeeper quorum** in `conf/flink-conf.yaml`:
1. **Configure high availability mode and ZooKeeper quorum** in `${FLINK_HOME}/conf/flink-conf.yaml`:

What about the proposal of adding a ${*_HOME} variable to paths to have a clearer pointer to the actual file/script.

Contributor Author

I think this is a good point to discuss for the general documentation overhaul. If we decide to do it, then these things need to be updated.

Comment on lines +177 to +217
<pre>
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /cluster_one # important: customize per cluster
high-availability.storageDir: hdfs:///flink/recovery</pre>

2. **Configure masters** in `conf/masters`:

<pre>
localhost:8081
localhost:8082</pre>

3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):

<pre>server.0=localhost:2888:3888</pre>

4. **Start ZooKeeper quorum**:

<pre>
$ bin/start-zookeeper-quorum.sh
Starting zookeeper daemon on host localhost.</pre>

5. **Start an HA-cluster**:

<pre>
$ bin/start-cluster.sh
Starting HA cluster with 2 masters and 1 peers in ZooKeeper quorum.
Starting standalonesession daemon on host localhost.
Starting standalonesession daemon on host localhost.
Starting taskexecutor daemon on host localhost.</pre>

6. **Stop ZooKeeper quorum and cluster**:

<pre>
$ bin/stop-cluster.sh
Stopping taskexecutor daemon (pid: 7647) on localhost.
Stopping standalonesession daemon (pid: 7495) on host localhost.
Stopping standalonesession daemon (pid: 7349) on host localhost.
$ bin/stop-zookeeper-quorum.sh
Stopping zookeeper daemon (pid: 7101) on host localhost.</pre>
Contributor

There's also

{% highlight bash %}
# ...
{% endhighlight %}

which might be the more appropriate syntax to create code blocks.

Contributor Author

True. I won't change it though because I only added it to not lose this information. It should be properly reformatted once the standalone resource provider documentation is written.
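
For reference, step 4 of the block above rendered with the suggested syntax would look like:

{% highlight bash %}
$ bin/start-zookeeper-quorum.sh
Starting zookeeper daemon on host localhost.
{% endhighlight %}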

… and the specific implementations

This commit splits the HA documentation up into a general overview and the specific implementations:
* ZooKeeper HA services
* Kubernetes HA services

Moreover, this commit moves resource-provider specific documentation to the respective resource-provider
documentation.
…ster-id

Only lowercase alphanumeric characters, "-" or "." are allowed.
@tillrohrmann
Contributor Author

Thanks for the review @XComp. I've addressed most of your comments.

@tillrohrmann
Contributor Author

Thanks for the review @wangyang0918, @XComp and @rmetzger. Merging this PR now.

tillrohrmann added a commit that referenced this pull request Dec 1, 2020
… and the specific implementations

This commit splits the HA documentation up into a general overview and the specific implementations:
* ZooKeeper HA services
* Kubernetes HA services

Moreover, this commit moves resource-provider specific documentation to the respective resource-provider
documentation.

This closes #14254.