Cert-manager causes API server panic on clusters with more than 20000 secrets. #3748
Hi @mvukadinoff, Sorry to hear you're having trouble and thanks for raising the issue. I'll see if I can reproduce it, it'd be interesting to know which query that was.
It'd definitely be interesting to know whether it's possible to reduce the number of queries with that many secrets. I wonder if allowing the sync period to be modified would also help for larger deployments like this.
Is the limit 20,000 globally, or 20,000 in a single namespace?
For our case it's 20,000 globally, across more than 1,500 namespaces. I don't think it's a hard limit; it's more about how powerful the masters are and how much load they can withstand. Regarding the query, I believe it's: /api/v1/secrets?limit=500&resourceVersion=0 We saw there is a newer version of cert-manager; we'll try that as well.
But it seems to be the case that at least cainjector watches all secrets. We are currently in the process of introducing cert-manager to our OpenShift clusters and have no (or very few) Certificate objects, but the memory consumption of cainjector seems to scale with the number of Secrets a cluster has in general. OpenShift creates a lot of serviceaccounts per namespace by default, which results in a lot of sa token secrets. On clusters with more namespaces (and therefore more serviceaccounts/secrets), cainjector requires more memory than on clusters with fewer namespaces. We are running cert-manager 1.3.1.
On a cluster that we've only installed the CRDs and don't have any certificates actually managed by cert-manager, the controller makes a call for all secrets - on that cluster we have about 130k secrets.
We've been affected by this now too. I don't know the secrets-querying specifics, but can't cert-manager add a type filter to it? At least the cert-manager secrets on our cluster have
Adding
With 90,000 secrets, I wasn't able to overload etcd:
💣 Secret bomb (90,000 secrets)
As of today, cert-manager does not "filter" secrets. Listing all secrets on startup and every 10 hours has two consequences:
Looking at ways to alleviate that: as @Jancis suggested, one improvement could be to only watch secrets that have a certain label. However, filtering doesn't seem to reduce the load: filter on = HTTP call takes 41 seconds, filter off = HTTP call takes 41 seconds too.
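A plausible explanation for those identical timings: the apiserver applies label selectors only after reading every stored object, so the scan cost scales with the total number of secrets whether or not a filter is set. A minimal simulation of that flow (plain Go, not cert-manager or apiserver code; the `cert-manager.io/owned` label is hypothetical):

```go
package main

import "fmt"

// secret is a stand-in for a Kubernetes Secret's metadata.
type secret struct {
	labels map[string]string
}

// listSecrets mimics the server-side flow: it scans every stored object
// and only afterwards applies the label selector. The scan count is the
// cost driver, and it is identical with or without a selector.
func listSecrets(store []secret, selector map[string]string) (matched, scanned int) {
	for _, s := range store {
		scanned++
		ok := true
		for k, v := range selector {
			if s.labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			matched++
		}
	}
	return matched, scanned
}

func main() {
	store := make([]secret, 90000)
	for i := range store {
		if i%1000 == 0 {
			store[i].labels = map[string]string{"cert-manager.io/owned": "true"}
		}
	}
	m1, s1 := listSecrets(store, nil) // no filter
	m2, s2 := listSecrets(store, map[string]string{"cert-manager.io/owned": "true"})
	fmt.Println(m1, s1) // 90000 90000
	fmt.Println(m2, s2) // 90 90000: fewer results, but the same scan cost
}
```

In other words, a label selector shrinks the response body, which helps client memory, but not the etcd/apiserver read work itself.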
The same issue appeared in Rancher: Other related links:
Issues go stale after 90d of inactivity. |
/remove-lifecycle stale |
We are also affected by this; in our case cainjector fails (see attached logs: cert-manager-cainjector-6cdc477bff-6vr5h.log). Also interesting: upon that failure, the cainjector Pod (resp. container) does not terminate. It just runs on, dysfunctional, forever, while still holding the leader lease, which prevents other (still functional) cainjector Pods from taking over. So it looks like there are several bugs acting together :-(, but of course if it did not crash the API server, this would not happen at all. Thus, it would be great if this bug could be fixed by removing
@alex-berger Please read the new memory scalability section of the Best Practice documentation which now explains how to configure cainjector to only watch Secret resources in the cert-manager namespace. Let me know if it helps:
@wallrj No, that is not applicable in our case, as we use cainjector to inject things for webhooks in multiple namespaces.
Describe the bug:
On clusters with more than 20,000 secrets this becomes a problem. The query that cert-manager makes is not optimal:
/api/v1/secrets?limit=500&resourceVersion=0
resourceVersion=0 will always cause all secrets to be queried, and limit=500 will not be taken into account. This way cert-manager is not scalable for large deployments, since secrets are used not only for certificates.
As mentioned in kubernetes/kubernetes#56278 and https://kubernetes.io/docs/reference/using-api/api-concepts/
I suggest removing resourceVersion=0 from the query, which should make it much faster.
Furthermore, cert-manager will retry those queries without waiting for them to complete; they pile up and cause significant load, even crashes, on the API server. Cert-manager basically DDoSes the API server.
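To make the semantics concrete: per the Kubernetes API concepts documentation, a list with resourceVersion=0 is served from the apiserver's watch cache, which returns the full collection and does not support pagination, so limit is ignored; omitting resourceVersion allows a paginated, continue-token-driven list. The difference is visible in the query parameters alone. A minimal sketch with Go's standard library (listURL is a hypothetical helper, not cert-manager code):

```go
package main

import (
	"fmt"
	"net/url"
)

// listURL builds the apiserver path for listing all Secrets.
// With resourceVersion="0" the request is answered from the watch cache,
// which returns the full list and ignores `limit`. Leaving resourceVersion
// unset lets the server honor limit/continue pagination against etcd.
func listURL(limit int, resourceVersion, continueToken string) string {
	q := url.Values{}
	q.Set("limit", fmt.Sprint(limit))
	if resourceVersion != "" {
		q.Set("resourceVersion", resourceVersion)
	}
	if continueToken != "" {
		q.Set("continue", continueToken)
	}
	return "/api/v1/secrets?" + q.Encode()
}

func main() {
	// The call reported in this issue: full list, limit ignored.
	fmt.Println(listURL(500, "0", ""))
	// A paginated alternative: first page, then follow continue tokens.
	fmt.Println(listURL(500, "", ""))
	fmt.Println(listURL(500, "", "CONTINUE_TOKEN"))
}
```

The trade-off is that paginated lists hit etcd on every page rather than the cache, so they are slower per request but bound the response size and memory use.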
We're hitting the same issue with:
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-controller:v0.11.0
and
quay.io/jetstack/cert-manager-controller:v1.1.0
quay.io/jetstack/cert-manager-cainjector:v1.1.0
Logs from API server
Logs from ETCD:
Logs from cert-manager:
Expected behaviour:
cert-manager should not make heavy queries that list all secrets across all namespaces, but should instead work per namespace.
Steps to reproduce the bug:
Generate 15,000 secrets; they don't need to be TLS certificates, any secret will do.
Look at the API server load and the cert-manager logs.
Anything else we need to know?:
Environment details:
/kind bug