Add a section about best practice and liveness probes

Signed-off-by: Richard Wall <richard.wall@jetstack.io>
cert-manager · Apr 28, 2023 · 42497d0
1 parent 0bb1720
commit 42497d0
Showing 2 changed files with 91 additions and 5 deletions.
diff --git a/.spelling b/.spelling
@@ -124,6 +124,7 @@ DNSPod
 DNSimple
 DaemonSet
 DataDog
+Datree
 Dean-Coakley
 DigitalOcean
 OVHCloud
@@ -185,6 +186,7 @@ k8s
 KubeCon
 Kubernetes
 Kyverno
+Learnk8s
 LuCI
 Maartje
 MacOS
@@ -313,6 +315,7 @@ google-cas-issuer
 goroutine
 hardcodes
 hardcoded
+healthz
 honour
 hostname
 https
@@ -359,6 +362,7 @@ labelled
 lalitadithya
 ldflag
 lifecycle
+liveness
 loadbalancer
 longkai
 loopback

diff --git a/content/docs/installation/best-practice.md b/content/docs/installation/best-practice.md
@@ -1,19 +1,102 @@
 ---
 title: Best Practice
-description: Learn how to deploy cert-manager to comply with popular security standards such as those produced by the CIS, NSA, and BSI.
+description: |
+    Learn about best practices for deploying cert-manager in production,
+    and how to configure cert-manager to comply with popular security standards
+    such as those produced by the CIS, NSA, and BSI.
 ---
 
-Learn how to deploy cert-manager to comply with popular security standards such as
+In this section you will learn how to configure cert-manager to comply with popular security standards such as
 the [CIS Kubernetes Benchmark](https://www.cisecurity.org/benchmark/kubernetes/),
 the [NSA Kubernetes Hardening Guide](https://media.defense.gov/2022/Aug/29/2003066362/-1/-1/0/CTR_KUBERNETES_HARDENING_GUIDANCE_1.2_20220829.PDF), or
 the [BSI Kubernetes Security Recommendations](https://www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/Grundschutz/International/bsi_it_gs_comp_2022.pdf?__blob=publicationFile&v=2#page=475).
 
+You will also learn about best practices for deploying cert-manager in production,
+such as those enforced by tools like [Datree and its built-in rules](https://hub.datree.io/built-in-rules),
+and those documented by the likes of [Learnk8s in their "Kubernetes production best practices" checklist](https://learnk8s.io/production-best-practices/).
+
 ## Overview
 
-The default cert-manager resources in the Helm chart or YAML manifests (Deployment, Pod, ServiceAccount etc) are designed for backwards compatibility rather than for best practice or maximum security.
+The default cert-manager resources in the Helm chart or YAML manifests (Deployment, Pod, ServiceAccount, etc.)
+are designed for backwards compatibility rather than for best practice or maximum security.
 You may find that the default resources do not comply with the security policy on your Kubernetes cluster
 and in that case you can modify the installation configuration using Helm chart values to override the defaults.
 
+## Use Liveness Probes
+
+An example of this recommendation is found in the
+[Datree Documentation: Ensure each container has a configured liveness probe](https://hub.datree.io/built-in-rules/ensure-liveness-probe):
+> Liveness probes allow Kubernetes to determine when a pod should be replaced.
+> They are fundamental in configuring a resilient cluster architecture.
+
+The cert-manager webhook and controller Pods do have liveness probes,
+but only the webhook liveness probe is enabled by default.
+The cainjector Pod does not yet have a liveness probe.
+Each component is described in more detail below.
+
+### webhook
+
+The [cert-manager webhook](../concepts/webhook.md) has a [liveness probe which is enabled by default](https://github.com/cert-manager/cert-manager/blob/eafe0d0aae4b7a9411825424f6b43fb623e1ba65/deploy/charts/cert-manager/templates/webhook-deployment.yaml#L108C1-L121)
+and the [timings and thresholds can be configured using Helm values](https://github.com/cert-manager/cert-manager/blob/eafe0d0aae4b7a9411825424f6b43fb623e1ba65/deploy/charts/cert-manager/README.template.md?plain=1#L181-L185).
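+
+As a minimal sketch, the webhook probe timings could be overridden in a Helm values file like this.
+The key names are assumed to match the `webhook.livenessProbe.*` values listed in the chart README linked above,
+and the numbers are illustrative rather than recommendations:
+
+```yaml
+# values.yaml (hypothetical file name), supplied with `helm upgrade --values values.yaml`
+webhook:
+  livenessProbe:
+    initialDelaySeconds: 60  # wait before the first probe
+    periodSeconds: 10        # interval between probes
+    timeoutSeconds: 1        # how long to wait for each probe response
+    successThreshold: 1      # must be 1 for liveness probes
+    failureThreshold: 3      # consecutive failures before the container is restarted
+```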
+
+### controller
+
+> ℹ️ The cert-manager controller liveness probe was introduced in cert-manager `v1.12.0`.
+
+The cert-manager controller has a liveness probe, but it is **disabled by default**.
+You can enable it using the Helm chart value `livenessProbe.enabled=true`,
+but first read the background information below.
+
+The liveness probe for the cert-manager controller is an HTTP probe which connects
+to the `/livez` endpoint of a healthz server which listens on port 9443 and runs in its own thread.
+The `/livez` endpoint currently reports the combined status of the following sub-systems,
+each of which also has its own endpoint under `/livez`:
+
+* `/livez/leaderElection`: Returns an error if the leader election record has not been renewed
+  or if the leader election thread has exited without also crashing the parent process.
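+
+For illustration, a liveness probe matching the description above might look like the following fragment of the controller container spec.
+The path and port come from the text above; the timing fields are assumptions chosen for the example, not necessarily what the chart renders:
+
+```yaml
+# Sketch of a container-level livenessProbe stanza (Kubernetes Pod spec)
+livenessProbe:
+  httpGet:
+    path: /livez         # combined status of all checked sub-systems
+    port: 9443           # port on which the healthz server listens
+    scheme: HTTP
+  periodSeconds: 10      # assumed value
+  timeoutSeconds: 15     # assumed value
+  failureThreshold: 8    # assumed value
+```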
+
+> ℹ️ In future, more sub-systems could be checked by the `/livez` endpoint,
+> similar to how Kubernetes [ensures logging is not blocked](https://github.com/kubernetes/kubernetes/pull/64946)
+> and has [health checks for each controller](https://github.com/kubernetes/kubernetes/pull/104667).
+>
+> 📖 Read about [how to access individual health checks and verbose status information](https://kubernetes.io/docs/reference/using-api/health-checks/) (cert-manager uses the same healthz endpoint multiplexer as Kubernetes).
+
+### cainjector
+
+The cainjector Pod does not have a liveness probe or a `/livez` healthz endpoint,
+but there is a justification for adding one in the GitHub issue:
+[cainjector in a zombie state after attempting to shut down](https://github.com/cert-manager/cert-manager/issues/5889).
+Please add your remarks to that issue if you have also experienced this specific problem,
+and add your remarks to [Helm: Allow configuration of readiness, liveness and startup probes for all created Pods](https://github.com/cert-manager/cert-manager/issues/5626) if you have a general request for a liveness probe in cainjector.
+
+### Background Information
+
+The cert-manager `controller` process and the `cainjector` process
+both use the Kubernetes [leader election library](https://pkg.go.dev/k8s.io/client-go/tools/leaderelection)
+to ensure that only one replica of each process can be active at any one time.
+The Kubernetes control-plane components also use this library.
+
+The leader election code runs in a loop in a separate thread (goroutine).
+If it initially wins the leader election race and later fails to renew its leader election lease, it exits.
+If the leader election thread exits, all the other threads are gracefully shut down and then the process exits.
+Similarly, if any of the other main threads exit unexpectedly,
+that will trigger the orderly shutdown of the remaining threads and the process will exit.
+
+This adheres to the principle that [Containers should crash when there's a fatal error](https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-revisited-how-to-avoid-shooting-yourself-in-the-other-foot/#letitcrash).
+Kubernetes will restart the crashed container, and if it crashes repeatedly,
+there will be increasing time delays between successive restarts.
+
+For this reason, the liveness probe should only be needed if there is a bug in this orderly shutdown process,
+or if there is a bug in one of the other threads which causes the process to deadlock and not shut down.
+
+You may want to enable the liveness probe anyway, as a defense against unforeseen bugs and deadlocks,
+but you will need to monitor the processes closely
+and, if necessary, tweak the [various liveness probe time settings and thresholds](https://github.com/cert-manager/cert-manager/blob/eafe0d0aae4b7a9411825424f6b43fb623e1ba65/deploy/charts/cert-manager/values.yaml#L254-L268).
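+
+A minimal sketch of such a configuration, as a Helm values file, is shown below.
+`livenessProbe.enabled` is the value named earlier on this page; the timing keys are assumed to be the ones defined in the linked `values.yaml`,
+and the numbers are only a starting point to tune against your own monitoring:
+
+```yaml
+# values.yaml (hypothetical file name): enable and tune the controller liveness probe
+livenessProbe:
+  enabled: true
+  initialDelaySeconds: 10  # wait before the first probe
+  periodSeconds: 10        # interval between probes
+  timeoutSeconds: 15       # how long to wait for each probe response
+  successThreshold: 1      # must be 1 for liveness probes
+  failureThreshold: 8      # consecutive failures before the container is restarted
+```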
+
+> 📖 Read [Configure Liveness, Readiness and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#before-you-begin) in the Kubernetes documentation, paying particular attention to the notes and cautions in that document.
+>
+> 📖 Read [Shooting Yourself in the Foot with Liveness Probes](https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/#shootingyourselfinthefootwithlivenessprobes) for more cautionary information about liveness probes.
+
 ## Restrict Auto-Mount of Service Account Tokens
 
 This recommendation is described in the [Kyverno Policy Catalogue](https://kyverno.io/policies/other/restrict_automount_sa_token/restrict_automount_sa_token/) as follows:
@@ -25,7 +108,7 @@ This recommendation is described in the [Kyverno Policy Catalogue](https://kyver
 
 The cert-manager components *do* need to speak to the API server but we still recommend setting `automountServiceAccountToken: false` for the following reasons:
 1. Setting `automountServiceAccountToken: false` will allow cert-manager to be installed on clusters where Kyverno (or some other policy system) is configured to deny Pods that have this field set to `true`. The Kubernetes default value is `true`.
-2. With `automountServiceAccountToken: true`, *all* the containers in the Pod will mount the ServiceAccount token, including side-car and init containers that might have been injected into the cert-manager Pod resources by Kubernetes admission controllers. 
+2. With `automountServiceAccountToken: true`, *all* the containers in the Pod will mount the ServiceAccount token, including side-car and init containers that might have been injected into the cert-manager Pod resources by Kubernetes admission controllers.
    The principle of least privilege suggests that it is better to explicitly mount the ServiceAccount token into the cert-manager containers.
 
 So it is recommended to set `automountServiceAccountToken: false` and manually add a projected `Volume` to each of the cert-manager Deployment resources, containing the ServiceAccount token, CA certificate and namespace files that would normally be [added automatically by the Kubernetes ServiceAccount controller](https://github.com/kubernetes/kubernetes/blob/3992eda8e61725c470fb6141a7fe4e7f9ee31ea5/plugin/pkg/admission/serviceaccount/admission.go#L421-L460),
@@ -45,4 +128,3 @@ Download the following Helm chart values file and supply it to `helm install`, `
 
 This list of recommendations is a work-in-progress.
 If you have other best practice recommendations please [contribute to this page](../contributing/contributing-flow.md).
-