
[FLINK-20357][docs] Split HA documentation up into a general overview and the specific implementations #14254

Closed
wants to merge 2 commits

Conversation

tillrohrmann
Contributor

This commit splits the HA documentation up into a general overview and the specific implementations:

  • ZooKeeper HA services
  • Kubernetes HA services

Moreover, this commit moves resource-provider specific documentation to the respective resource-provider
documentation. This is done in order not to lose this information; it should be properly incorporated once the resource-provider documentation is updated.

cc @rmetzger, @XComp, @wangyang0918

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 9f8f123 (Fri Nov 27 17:55:04 UTC 2020)

✅ no warnings

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Nov 27, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Contributor

@wangyang0918 wangyang0918 left a comment

Thanks @tillrohrmann for creating this ticket. The separation of the HA documentation makes sense to me. I just left some minor comments; please have a look.

<pre>high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory</pre>

- **Storage directory** (required):
JobManager metadata is persisted in the file system [`high-availability.storageDir`]({% link deployment/config.md %}#high-availability-storagedir) and only a pointer to this state is stored in ZooKeeper.
Contributor

Suggested change
JobManager metadata is persisted in the file system [`high-availability.storageDir`]({% link deployment/config.md %}#high-availability-storagedir) and only a pointer to this state is stored in ZooKeeper.
JobManager metadata is persisted in the file system [`high-availability.storageDir`]({% link deployment/config.md %}#high-availability-storagedir) and only a pointer to this state is stored in Kubernetes.

Contributor Author

Good catch :-)

- **Cluster id** (required):
In order to identify the Flink cluster, you have to specify a [`kubernetes.cluster-id`]({% link deployment/config.md %}#kubernetes-cluster-id).

<pre>kubernetes.cluster-id: Cluster1337</pre>
Contributor

The `kubernetes.cluster-id` should only contain lowercase alphanumeric characters, `-` or `.`. This is a Kubernetes limitation.

Contributor Author

Very good information. I guess we should add this information to the KubernetesConfigOptions.CLUSTER_ID.
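
For reference, a minimal flink-conf.yaml sketch combining the options discussed in this thread could look as follows; the cluster id and storage path are placeholders, and the cluster id is kept lowercase to respect the Kubernetes naming restriction mentioned above:

<pre>
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.cluster-id: cluster-1337
high-availability.storageDir: hdfs:///flink/recovery
</pre>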

@@ -2,7 +2,7 @@
title: "Kubernetes HA Services"
nav-title: Kubernetes HA Services
nav-parent_id: ha
nav-pos: 2
nav-pos: 6
Contributor

nit: I am not sure why we have nav-pos 6 for kubernetes_ha.md and 5 for zookeeper_ha.md. Would 2 and 1 also make sense?

Contributor Author

It does not make a difference, since only the relative positioning with respect to zookeeper_ha.md matters. I'll update it to avoid future confusion, though.


### How to configure Kubernetes HA Services

Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml]({% link deployment/resource-providers/standalone/kubernetes.md %}#common-cluster-resource-definitions). All other yamls do not need to be updated.
Contributor

Suggested change
Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml]({% link deployment/resource-providers/standalone/kubernetes.md %}#common-cluster-resource-definitions). All other yamls do not need to be updated.
Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml](#common-cluster-resource-definitions). All other yamls do not need to be updated.

Contributor Author

Good catch. I'll update it.
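
As an illustration, the addition to flink-configuration-configmap.yaml might look roughly like the sketch below; the ConfigMap name, labels and concrete values are assumptions based on the standalone Kubernetes guide rather than part of this PR:

<pre>
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: flink
data:
  flink-conf.yaml: |+
    # ... existing entries from the standalone Kubernetes guide ...
    kubernetes.cluster-id: standalone-k8s-ha-cluster
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: hdfs:///flink/recovery
</pre>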

Contributor

@rmetzger rmetzger left a comment

Thanks a lot for splitting the HA configuration pages and reworking them!

I had some minor cleanups and questions.


3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):
For more information on Flink configuration for Kerberos security, please see [here]({% link deployment/config.md %}).
Contributor

Suggested change
For more information on Flink configuration for Kerberos security, please see [here]({% link deployment/config.md %}).
For more information on Flink configuration for Kerberos security, please refer to the [security section of the Flink configuration page]({% link deployment/config.md %}#security).

Contributor Author

Sounds good. Will update it.


3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):
For more information on Flink configuration for Kerberos security, please see [here]({% link deployment/config.md %}).
You can also find [here]({% link deployment/security/security-kerberos.md %}) further details on how Flink internally setups Kerberos-based security.
Contributor

Suggested change
You can also find [here]({% link deployment/security/security-kerberos.md %}) further details on how Flink internally setups Kerberos-based security.
You can also find further details on [how Flink sets up Kerberos-based security internally]({% link deployment/security/security-kerberos.md %}).

Contributor Author

Will update it.
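
For readers looking for the concrete knobs, a rough flink-conf.yaml sketch of the Kerberos/ZooKeeper related options is shown below; the keytab path and principal are placeholders, and the exact option set should be verified against the security section of the configuration page:

<pre>
security.kerberos.login.keytab: /path/to/flink.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM
security.kerberos.login.contexts: Client
zookeeper.sasl.service-name: zookeeper
</pre>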

@@ -23,77 +23,50 @@ specific language governing permissions and limitations
under the License.
-->

## Kubernetes Cluster High Availability
Kubernetes high availability service could support both [standalone Flink on Kubernetes]({% link deployment/resource-providers/standalone/kubernetes.md %}) and [native Kubernetes integration]({% link deployment/resource-providers/native_kubernetes.md %}).
Flink's Kubernetes HA services use [Kubernetes](https://kubernetes.io/) for high availability services.
Contributor

Are there any restrictions on the K8s versions supported? Or on the required K8s features? (I guess the answer is that we only need ConfigMaps?)
I'm asking, so that users can evaluate if it also works with implementations such as https://k3s.io/

Contributor Author

I need @wangyang0918 to answer this question here.

Contributor

ConfigMaps and resource versions have been supported since the very beginning (versions much lower than 1.9). I believe very few users are still running a Kubernetes version lower than 1.9, since the latest stable version is now 1.19.

BTW, the native K8s integration requires Kubernetes 1.9 or above.

Contributor Author

I guess to answer whether https://k3s.io/ works, one needs to try it out. I wouldn't block the PR on this, though. Thanks for the answer @wangyang0918.

- **high-availability mode** (required): The *high-availability mode* has to be set in `conf/flink-conf.yaml` to *zookeeper* in order to enable high availability mode.
Alternatively this option can be set to FQN of factory class Flink should use to create HighAvailabilityServices instance.
- **high-availability mode** (required):
The `high-availability` option has to be set to *zookeeper*.
Contributor

Suggested change
The `high-availability` option has to be set to *zookeeper*.
The `high-availability` option has to be set to `zookeeper`.

Contributor Author

Sounds good.

you have to manually configure separate cluster-ids for each cluster.
**Important**:
You should not set this value manually when running on YARN, native Kubernetes or on another cluster manager.
In those cases a cluster-id is automatically being generated.
Contributor

Suggested change
In those cases a cluster-id is automatically being generated.
In those cases a cluster-id is being automatically generated.

I'm not sure about this one.
"automatically being generated" has 6k google hits, "being automatically generated" 48k ;)

Contributor Author

being automatically generated sounds better to me.



<pre>high-availability.zookeeper.quorum: address1:2181[,...],addressX:2181</pre>

Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port.
Contributor

Suggested change
Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port.
Each `addressX:port` refers to a ZooKeeper server, which is reachable by Flink at the given address and port.

Contributor Author

Good idea.
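
A concrete quorum entry would then read, for example (the hostnames are placeholders):

<pre>high-availability.zookeeper.quorum: zk-node-1:2181,zk-node-2:2181,zk-node-3:2181</pre>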

@tillrohrmann
Contributor Author

Thanks for your review @rmetzger and @wangyang0918. I've addressed your comments and resolved the merge conflict.

Contributor

@XComp XComp left a comment

Thanks for refactoring the documentation. 👍 Only minor things popped up during my review.

docs/deployment/ha/index.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/ha/zookeeper_ha.md (outdated)
docs/deployment/resource-providers/standalone/index.md (outdated)

### Example: Standalone Cluster with 2 JobManagers

1. **Configure high availability mode and ZooKeeper quorum** in `conf/flink-conf.yaml`:
Contributor

Suggested change
1. **Configure high availability mode and ZooKeeper quorum** in `conf/flink-conf.yaml`:
1. **Configure high availability mode and ZooKeeper quorum** in `${FLINK_HOME}/conf/flink-conf.yaml`:

What about the proposal of adding a ${*_HOME} variable to paths to have a clearer pointer to the actual file/script.

Contributor Author

I think this is a good point to discuss for the general documentation overhaul. If we decide to do it, then these things need to be updated.

Comment on lines +177 to +217
<pre>
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /cluster_one # important: customize per cluster
high-availability.storageDir: hdfs:///flink/recovery</pre>

2. **Configure masters** in `conf/masters`:

<pre>
localhost:8081
localhost:8082</pre>

3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):

<pre>server.0=localhost:2888:3888</pre>

4. **Start ZooKeeper quorum**:

<pre>
$ bin/start-zookeeper-quorum.sh
Starting zookeeper daemon on host localhost.</pre>

5. **Start an HA-cluster**:

<pre>
$ bin/start-cluster.sh
Starting HA cluster with 2 masters and 1 peers in ZooKeeper quorum.
Starting standalonesession daemon on host localhost.
Starting standalonesession daemon on host localhost.
Starting taskexecutor daemon on host localhost.</pre>

6. **Stop ZooKeeper quorum and cluster**:

<pre>
$ bin/stop-cluster.sh
Stopping taskexecutor daemon (pid: 7647) on localhost.
Stopping standalonesession daemon (pid: 7495) on host localhost.
Stopping standalonesession daemon (pid: 7349) on host localhost.
$ bin/stop-zookeeper-quorum.sh
Stopping zookeeper daemon (pid: 7101) on host localhost.</pre>
Contributor

There's also

{% highlight bash %}
# ...
{% endhighlight %}

which might be the more appropriate syntax to create code blocks.

Contributor Author

True. I won't change it though because I only added it to not lose this information. It should be properly reformatted once the standalone resource provider documentation is written.
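
For reference, step 4 of the block above rendered with the suggested syntax would look like:

{% highlight bash %}
$ bin/start-zookeeper-quorum.sh
Starting zookeeper daemon on host localhost.
{% endhighlight %}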

… and the specific implementations

This commit splits the HA documentation up into a general overview and the specific implementations:
* ZooKeeper HA services
* Kubernetes HA services

Moreover, this commit moves resource-provider specific documentation to the respective resource-provider
documentation.
…ster-id

Only lowercase alphanumeric characters, "-" or "." are allowed.
@tillrohrmann
Contributor Author

Thanks for the review @XComp. I've addressed most of your comments.

@tillrohrmann
Contributor Author

Thanks for the review @wangyang0918, @XComp and @rmetzger. Merging this PR now.

tillrohrmann added a commit that referenced this pull request Dec 1, 2020
… and the specific implementations

This commit splits the HA documentation up into a general overview and the specific implementations:
* ZooKeeper HA services
* Kubernetes HA services

Moreover, this commit moves resource-provider specific documentation to the respective resource-provider
documentation.

This closes #14254.