Commit

Explain the memory use of the default cert-manager installation
Signed-off-by: Richard Wall <richard.wall@venafi.com>
wallrj committed Apr 9, 2024
1 parent ffd2b18 commit 131a9ad
Showing 3 changed files with 81 additions and 0 deletions.
77 changes: 77 additions & 0 deletions content/docs/devops-tips/large-clusters.md
@@ -0,0 +1,77 @@
---
title: Deploying cert-manager on Large Clusters
description: |
  Learn how to optimize cert-manager for deployment on large clusters,
  with thousands of Certificate and Secret resources.
---

Learn how to optimize cert-manager for deployment on large clusters,
with thousands of Certificate and Secret resources.

## Overview

The defaults in the Helm chart or YAML manifests are intended for general use.
You will need to modify the configuration if your Kubernetes cluster has thousands of Certificate resources and TLS Secrets.

## Memory

### Recommendations

Here are some memory request (`resources.requests.memory`) recommendations for each of the cert-manager components in different scenarios:

| Scenario | controller (Mi) | cainjector (Mi) | webhook (Mi) |
|----------------------------|-----------------|-----------------|--------------|
| 2000 RSA 4096 Certificates | 350 | 150 | 50 |

> 📖️ Read [What Everyone Should Know About Kubernetes Memory Limits](https://home.robusta.dev/blog/kubernetes-memory-limit)
> to learn why the best practice is to set the memory limit equal to the memory request.
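
The recommendations above can be applied when installing or upgrading cert-manager with Helm.
Below is a minimal sketch of a `values.yaml` for the 2000 RSA 4096 Certificates scenario,
assuming the chart's `resources`, `cainjector.resources` and `webhook.resources` values
(check the values file of the chart version you deploy):

```yaml
# Sketch: memory requests and limits for the 2000 RSA 4096 Certificates scenario.
# The limit is set equal to the request, as recommended above.
resources: # cert-manager controller
  requests:
    memory: 350Mi
  limits:
    memory: 350Mi
cainjector:
  resources:
    requests:
      memory: 150Mi
    limits:
      memory: 150Mi
webhook:
  resources:
    requests:
      memory: 50Mi
    limits:
      memory: 50Mi
```

Treat these figures as a starting point and re-measure after any significant change
in the number of Certificates or in the key algorithm.
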
### Rationale

**When Certificate resources are the dominant use-case**,
such as when workloads need to mount the TLS Secret or when gateway-shim is used,
the memory consumption of the cert-manager controller will be roughly
proportional to the total size of those Secret resources that contain the TLS
key pairs.
Why? Because the cert-manager controller caches the entire content of these Secret resources in memory.
If large TLS keys are used (e.g. RSA 4096), the memory use will be higher than if smaller TLS keys are used (e.g. ECDSA).
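
For example, switching a Certificate to an ECDSA private key produces a smaller TLS Secret
and therefore a smaller cache entry in the controller.
Here is a sketch of such a Certificate; the resource name, DNS name and issuer reference are placeholders:

```yaml
# Sketch: a Certificate using an ECDSA P-256 private key instead of RSA 4096.
# example-com, example.com and example-issuer are placeholders.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
spec:
  secretName: example-com-tls
  dnsNames:
    - example.com
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: example-issuer
    kind: ClusterIssuer
```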

The other Secrets in the cluster, such as those used for Helm chart configurations or for other workloads,
will not significantly increase the memory consumption, because cert-manager will only cache the metadata of these Secrets.

**When CertificateRequest resources are the dominant use-case**,
such as with csi-driver or with istio-csr,
the memory consumption of the cert-manager controller will be much lower,
because there will be fewer TLS Secrets and fewer resources to be cached.

### Evidence

This chart shows the memory consumption of the cert-manager controller (1.14)
during an experiment where 2000 RSA 4096 Certificates are created, signed and
then deleted.

<img src="/docs/devops-tips/large-clusters/default-memory-1.png" alt="Scatter chart showing cert-manager memory usage and cluster resource counts over time" />

The pattern of memory consumption can be explained as follows:

1. `0min`: `~50MiB`: There are 0 Certificates.
There are 13 incidental Secret resources: Helm chart configuration Secrets, and
Secrets belonging to the metrics-server and Prometheus stack, which are also installed in the test cluster.
All the cert-manager Deployments were restarted before the experiment,
so the components have only cached these few incidental resources.
1. `33min`: `~260MiB`: All 2000 Certificate resources have been reconciled.
Every Certificate now has a corresponding CertificateRequest and TLS Secret.
There are `~3600` Secret resources -- `~1600` more than can be explained by the 2000 TLS Secrets.
**Why?**
Possibly because cert-manager creates a temporary Secret resource for each Certificate.
The temporary Secret is where cert-manager stores the private key when it is first generated.
After the TLS certificate has been signed, the temporary Secret is deleted.
1. `40min`: `~280MiB`: The remaining temporary Secret resources are deleted.
This causes a spike in memory use. **Why?**
1. `42min`: `~225MiB`: The Go garbage collector eventually frees the memory which had been allocated for the recently deleted Secrets.
1. `46min`: `~300MiB`: The Certificates and Secrets are now being rapidly deleted.
This causes another spike in memory use. **Why?**
1. `48min`: `~280MiB`: All the Certificate, CertificateRequest and Secret resources have now been deleted,
but the memory consumption remains at roughly the peak size.
**Why?**
The memory which had been allocated to cache the resources is not immediately freed.
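
To observe the memory use of your own installation in a similar way,
one option is to scrape the cert-manager metrics endpoints with the Prometheus Operator.
Below is a sketch of Helm values, assuming the chart's `prometheus.servicemonitor`
options are available in the version you deploy:

```yaml
# Sketch: let the Prometheus Operator scrape cert-manager metrics.
# Assumes the prometheus.servicemonitor values exist in your chart version.
prometheus:
  enabled: true
  servicemonitor:
    enabled: true
    interval: 60s
```

The controller exposes the standard Go runtime metrics (for example `go_memstats_alloc_bytes`),
and the kubelet's cAdvisor metric `container_memory_working_set_bytes` shows container-level
memory use comparable to the chart above.
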
4 changes: 4 additions & 0 deletions content/docs/manifest.json
@@ -646,6 +646,10 @@
{
"title": "Best Practice Installation Options",
"path": "/docs/installation/best-practice.md"
},
{
"title": "Large Cluster Configuration",
"path": "/docs/devops-tips/large-clusters.md"
}
]
},
content/docs/devops-tips/large-clusters/default-memory-1.png: binary image file not shown.
