
Cert-manager causes API server panic on clusters with more than 20000 secrets. #3748

Open
mvukadinoff opened this issue Mar 8, 2021 · 31 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.
Milestone

Comments

@mvukadinoff

mvukadinoff commented Mar 8, 2021

📢 UPDATE: 2023-11-02
Please read the memory scalability section of the Best Practice documentation which now explains how to configure cainjector to only watch Secret resources in the cert-manager namespace. This should resolve some of the problems described in this issue.

Describe the bug:
On clusters with more than 20,000 Secrets this becomes a problem. The query that cert-manager makes is not optimal:
/api/v1/secrets?limit=500&resourceVersion=0

resourceVersion=0 causes the API server to always return all Secrets, and limit=500 is not taken into account. This makes cert-manager hard to scale for large deployments, since Secrets are used for much more than certificates.

As mentioned in kubernetes/kubernetes#56278 and https://kubernetes.io/docs/reference/using-api/api-concepts/

I suggest removing resourceVersion=0 from the query, which should make it a lot faster.

Furthermore, cert-manager retries those queries without waiting for the previous ones to complete; they pile up and cause significant load, and even crashes, on the API server. cert-manager basically DDoSes the API server.
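A quick way to observe the difference is with kubectl get --raw (a rough probe, assuming permission to list Secrets cluster-wide and jq installed):

# Plain paginated list: the apiserver returns at most 500 items plus a
# continue token for the next page.
kubectl get --raw '/api/v1/secrets?limit=500' | jq '.items | length, .metadata.continue'

# The list the informer issues: with resourceVersion=0 the request is served
# from the apiserver watch cache, the limit is not honoured, and every Secret
# in the cluster comes back in one response.
kubectl get --raw '/api/v1/secrets?limit=500&resourceVersion=0' | jq '.items | length'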

We're hitting the same issue with:
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-controller:v0.11.0
and
quay.io/jetstack/cert-manager-controller:v1.1.0
quay.io/jetstack/cert-manager-cainjector:v1.1.0

Logs from API server


E0115 18:27:27.893242       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
goroutine 79221267 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3b1fda0, 0xc0001c6650)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc0feb65c90, 0x1, 0x1)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc08dc08740, 0xc09ea59b80)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1cf
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc019eb1960, 0x4edf040, 0xc0a1206af0, 0xc07749d900)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:141 +0x310
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x4edf040, 0xc0a1206af0, 0xc07749d800)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:47 +0x10f
net/http.HandlerFunc.ServeHTTP(0xc0434bf3e0, 0x4edf040, 0xc0a1206af0, 0xc07749d800)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x274
net/http.HandlerFunc.ServeHTTP(0xc0434bf470, 0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc019eb1a20, 0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x2ca
net/http.HandlerFunc.ServeHTTP(0xc019eb1a40, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0x13e
net/http.HandlerFunc.ServeHTTP(0xc019eb1a60, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc0434bf4a0, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc009896a80, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2802 +0xa4
net/http.initNPNRequest.ServeHTTP(0x4eeb300, 0xc06df08a50, 0xc07a0df180, 0xc009896a80, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:3366 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc094106480, 0xc0c4b6b240, 0xc07749d500, 0xc08dc08340)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2149 +0x9f
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1883 +0x4eb
E0115 18:27:27.893364       1 wrap.go:39] apiserver panic'd on GET /api/v1/secrets?limit=500&resourceVersion=0
I0115 18:27:27.893567       1 log.go:172] http2: panic serving 10.148.0.16:53202: killing connection/stream because serving request timed out and response had been started
goroutine 79221267 [running]:
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler.func1(0xc0c4b6b240, 0xc0feb65f67, 0xc094106480)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2142 +0x16b
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc0feb65c90, 0x1, 0x1)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc08dc08740, 0xc09ea59b80)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1cf
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*time

Logs from ETCD:

...
Dec 11 09:21:08 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:21:08.106948 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 5.348150525s)
Dec 11 09:21:08 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:21:08.106954 W | etcdserver: server is likely overloaded
after that, slow but successful ones ...
Dec 11 09:23:26 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:23:26.433315 W | etcdserver: read-only range request "key:\"/registry/persistentvolumes/pvc-f31decea-7a39-4d11-bbbf-8eb45f433239\" " with result "range_response_count:1 size:1017" took too long (13.750148565s) to execute

Logs from cert-manager:

E0203 15:18:34.063192       1 wrap.go:39] apiserver panic'd on GET /api/v1/secrets?limit=500&resourceVersion=0

E0203 15:18:33.969252       1 reflector.go:123] external/io_k8s_client_go/tools/cache/reflector.go:96: Failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 37511; INTERNAL_ERROR

Expected behaviour:
cert-manager should not make heavy queries that list all Secrets across all namespaces; it should instead work per namespace.

Steps to reproduce the bug:

Generate 15,000 Secrets; they don't need to be TLS certificates, any Secret will do (see the sketch below).
Watch the API server load and the cert-manager logs.
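A minimal sketch of the first step (the namespace and Secret names are arbitrary; any opaque Secret works):

kubectl create namespace secret-flood
for i in $(seq 1 15000); do
  kubectl create secret generic "dummy-${i}" \
    --namespace secret-flood \
    --from-literal=key="value-${i}"
done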

Anything else we need to know?:

Environment details::

  • Kubernetes version: Kubernetes v1.16.13
  • Cloud-provider/provisioner: Vanilla K8s
  • cert-manager version: v1.1.0
  • Install method: helm (with CRDs applied before that)

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 8, 2021
@irbekrm
Collaborator

irbekrm commented Mar 10, 2021

Hi @mvukadinoff,

Sorry to hear you're having trouble and thanks for raising the issue.

I'll see if I can reproduce it; it would be interesting to know exactly which query that was.
Generally I wasn't aware that we watch all Kubernetes Secrets in any of our controllers; I thought it was usually only the ones owned by a Certificate, but I guess 20,000 Certificates might be too many as well.

cert-manager basically DDoSes the API server

The number of queries from cert-manager to the API server is also rate limited (by default 20 QPS with a burst of 50; you can modify that with the --kube-api-qps and --kube-api-burst flags, though I'm not sure that would help here).
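For reference, a hedged sketch of lowering those limits through the Helm chart's extraArgs value (the flag names are the ones above; the chart value and the numbers here are only an example, check the values for your chart version):

helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --set 'extraArgs={--kube-api-qps=5,--kube-api-burst=10}'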

It'd definitely be interesting to know whether it's possible to have that many Certificates in one cluster.

I wonder if allowing the sync period to be modified would also help for larger deployments like this.

@irbekrm irbekrm added the triage/needs-information Indicates an issue needs more information in order to work on it. label Mar 10, 2021
@justinkillen

Is the limit 20,000 globally, or 20,000 in a single namespace?

@mvukadinoff
Author

In our case it's 20,000 globally, across more than 1,500 namespaces. I don't think there's a hard limit; it's more a question of how powerful the masters are and how much load they can withstand.

Regarding the query, I believe it's /api/v1/secrets?limit=500&resourceVersion=0, but I couldn't trace exactly where cert-manager makes it. Although the query has a limit parameter, the limit is not honored because of resourceVersion=0.

We saw there is a newer version of cert-manager; we'll try that as well.

@mgruener

mgruener commented May 21, 2021

@irbekrm

Generally I wasn't aware that we watch all Kubernetes Secrets in any of our controllers; I thought it was usually only the ones owned by a Certificate, but I guess 20,000 Certificates might be too many as well.

But it seems that at least cainjector watches all Secrets. We are currently introducing cert-manager to our OpenShift clusters and have no (or very few) Certificate objects, yet the memory consumption of cainjector scales with the number of Secrets a cluster has overall. OpenShift creates a lot of service accounts per namespace by default, which results in a lot of service-account token Secrets. On clusters with more namespaces (and therefore more service accounts and Secrets) cainjector requires more memory than on clusters with fewer namespaces.

We are running cert-manager 1.3.1.

@fvlaicu
Contributor

fvlaicu commented May 28, 2021

On a cluster where we've only installed the CRDs and don't have any certificates actually managed by cert-manager, the controller still lists all Secrets; that cluster has about 130k Secrets.
Here's the log from the controller:

W0528 09:54:32.599392       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0528 09:54:32.600405       1 controller.go:171] cert-manager/controller/build-context "msg"="configured acme dns01 nameservers" "nameservers"=["192.168.0.2:53"]
I0528 09:54:32.600928       1 controller.go:72] cert-manager/controller "msg"="enabled controllers: [certificaterequests-approver certificaterequests-issuer-acme certificaterequests-issuer-ca certificaterequests-issuer-selfsigned certificaterequests-issuer-vault certificaterequests-issuer-venafi certificates-issuing certificates-key-manager certificates-metrics certificates-readiness certificates-request-manager certificates-revision-manager certificates-trigger challenges clusterissuers ingress-shim issuers orders]"
I0528 09:54:32.601253       1 controller.go:131] cert-manager/controller "msg"="starting leader election"
I0528 09:54:32.601454       1 metrics.go:166] cert-manager/controller/build-context/metrics "msg"="listening for connections on" "address"={"IP":"::","Port":9402,"Zone":""}
I0528 09:54:32.601794       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/cert-manager-controller...
I0528 09:55:37.453032       1 leaderelection.go:253] successfully acquired lease kube-system/cert-manager-controller
I0528 09:55:37.453538       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-ca"
I0528 09:55:37.453606       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-key-manager"
I0528 09:55:37.453656       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-revision-manager"
I0528 09:55:37.453668       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="orders"
I0528 09:55:37.453695       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="ingress-shim"
I0528 09:55:37.453763       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-request-manager"
I0528 09:55:37.453774       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-approver"
I0528 09:55:37.453878       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="clusterissuers"
I0528 09:55:37.453894       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-acme"
I0528 09:55:37.453918       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-vault"
I0528 09:55:37.453935       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-metrics"
I0528 09:55:37.453980       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-readiness"
I0528 09:55:37.454010       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="issuers"
I0528 09:55:37.454038       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-selfsigned"
I0528 09:55:37.454097       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-venafi"
I0528 09:55:37.454115       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-issuing"
I0528 09:55:37.454163       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="certificates-trigger"
I0528 09:55:37.454975       1 reflector.go:207] Starting reflector *v1.Secret (5m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:46.968845       1 trace.go:205] Trace[1041222873]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:156 (28-May-2021 09:55:37.454) (total time: 69513ms):
Trace[1041222873]: ---"Objects listed" 69132ms (09:56:00.587)
Trace[1041222873]: [1m9.513751159s] [1m9.513751159s] END
I0528 09:56:47.655791       1 reflector.go:207] Starting reflector *v1beta1.Ingress (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655818       1 controller.go:105] cert-manager/controller "msg"="starting controller" "controller"="challenges"
I0528 09:56:47.655860       1 reflector.go:207] Starting reflector *v1.Certificate (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655790       1 reflector.go:207] Starting reflector *v1.ClusterIssuer (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655918       1 reflector.go:207] Starting reflector *v1.Service (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655935       1 reflector.go:207] Starting reflector *v1.Secret (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655955       1 reflector.go:207] Starting reflector *v1.Challenge (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655971       1 reflector.go:207] Starting reflector *v1.Pod (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.655860       1 reflector.go:207] Starting reflector *v1.Order (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.656192       1 reflector.go:207] Starting reflector *v1.CertificateRequest (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
I0528 09:56:47.656238       1 reflector.go:207] Starting reflector *v1.Issuer (10h0m0s) from external/io_k8s_client_go/tools/cache/reflector.go:156
W0528 09:56:47.661623       1 warnings.go:67] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W0528 09:56:47.665090       1 warnings.go:67] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
I0528 09:58:04.063808       1 trace.go:205] Trace[1899623133]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:156 (28-May-2021 09:56:47.655) (total time: 76407ms):
Trace[1899623133]: ---"Objects listed" 76019ms (09:58:00.674)
Trace[1899623133]: [1m16.407768851s] [1m16.407768851s] END
W0528 10:02:23.667984       1 warnings.go:67] networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress

Later edit:
To be clear, in my case the API server doesn't crash; however, the fact that cert-manager makes this call at all is still problematic.
We're using k8s 1.19 and cert-manager 1.3.1.

@Jancis

Jancis commented Aug 19, 2021

We've been affected by this now too. I don't know the specifics of the Secret queries, but couldn't cert-manager add a type filter to them? At least the cert-manager Secrets on our cluster have the kubernetes.io/tls type, which can be selected with --field-selector type=kubernetes.io/tls.

kubectl get secret -A --field-selector type=kubernetes.io/tls --no-headers | wc -l 
4477
kubectl get secret -A --no-headers | wc -l                                        
12661

Prefixing the commands with time shows that querying only the TLS Secrets takes roughly half the time of querying all Secrets.
Yes, I'm aware that it's the API that selects the resources, not kubectl.
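For completeness, the same filter expressed as a raw list request, roughly what a type-filtered client list would look like (a sketch, assuming jq; the filtering still happens server-side in the apiserver):

kubectl get --raw '/api/v1/secrets?fieldSelector=type%3Dkubernetes.io%2Ftls&limit=500' | jq '.items | length'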

@maelvls
Member

maelvls commented Sep 6, 2021

With 90,000 secrets, I wasn't able to overload etcd:

$ kubectl get secret -A | wc -l
90796

$ kubectl top pod -A
NAMESPACE     NAME                                        CPU(cores)  MEMORY(bytes)
cert-manager  cert-manager-66b6d6bf59-jnjhg               1m          5207Mi
cert-manager  cert-manager-cainjector-856d4df858-k25kc    2m          2645Mi
kube-system   etcd-bomb-control-plane                     13m         717Mi
kube-system   kube-apiserver-bomb-control-plane           35m         8391Mi
kube-system   kube-controller-manager-bomb-control-plane  6m          2176Mi
💣 Secret bomb (90,000 secrets)
#! /bin/bash
#
# A Secret bomb to reproduce an issue where etcd would be overloaded by the apiserver
# due to cert-manager listing all the secrets.
#
# See: https://github.com/jetstack/cert-manager/issues/3748
#

set -e

kind create cluster --name bomb
helm upgrade --install cert-manager jetstack/cert-manager --version 1.5.3 --namespace cert-manager --set installCRDs=true --create-namespace --wait
kubectl wait --for=condition=available deploy/cert-manager-webhook -n cert-manager --timeout=5m

kubectl apply -f- <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: bomb-secrets
EOF

for j in $(seq 0 10000 90000); do
  kubectl apply -f <(
    for i in $(seq $j $((j + 10000))); do
      cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: bomb-${i}
  namespace: bomb-secrets
type: kubernetes.io/tls
stringData:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    MIIC9zCCAd+gAwIBAgIJAKVJWUdeCPLNMA0GCSqGSIb3DQEBCwUAMBIxEDAOBgNV
    BAMMB2V4YW1wbGUwHhcNMjEwMzE2MTcyNjE2WhcNMjEwNDE1MTcyNjE2WjASMRAw
    DgYDVQQDDAdleGFtcGxlMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
    34SJ4NhxaIkHXlhoslJIm746fMAOR/25REmxPEZvx/lwgzutAxUi7D39yaeyUJwj
    TNUWZXtlCp/Vr7fmuApaps2vQ/Q/jUGO+UmhSFobKhajboTRpQ+sIPcjYfdhLT/x
    zVQw27acKt8nWVyocN5U8lVC4okv3ItXNHLh5a0jac/amMyFAOB42u0xK4fjnV/a
    qtQ2el1IDfQgnlom9Vt6Dl/bztF83NpSl40euE2v8vZH84KiWocP1OHA1BdxWoaj
    NrI7oHpDdKkia76OwQ/TfQBHJUNcQUoW6YUoRtW07b/CAzqFV8PGUjpdfw508Yco
    C8VJJx9MTdLmOUeE1qQpxQIDAQABo1AwTjAdBgNVHQ4EFgQUjhESkkI8juzNLCdB
    qfu997UZGGMwHwYDVR0jBBgwFoAUjhESkkI8juzNLCdBqfu997UZGGMwDAYDVR0T
    BAUwAwEB/zANBgkqhkiG9w0BAQsFAAOCAQEARJYelaj5otonmb+ZF9Lg+66pNHET
    Gir/+kgPeFog+++63Rkl9E82b9mDql+lccNQo5/yuU0znP+NaB0OnVTpvfMfGYTL
    w9NIPJ+qVQG9P7tOQUas43Zk3oTPR2wz27Pu6/fyAL3LMxTfhWRj4IplSWs0Ipia
    k0qn+29PZA0m1ZSw0BrPjsnjZcL+ZZ5UejGdKyT4UWODYO6W+QUq+MPFNELX55qt
    /cv569ywZojbMUi6E+QO4U4Av722qQaoEqGuN49VDSigm/fb/kRc5Tn/oOnpS1rX
    tfJ8ExaeXxbhTjRHAarNZCUF/u5suH1NY6XfsxbedWaLmlmdGVfZ4Txdqg==
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN RSA PRIVATE KEY-----
    MIIEpAIBAAKCAQEA34SJ4NhxaIkHXlhoslJIm746fMAOR/25REmxPEZvx/lwgzut
    AxUi7D39yaeyUJwjTNUWZXtlCp/Vr7fmuApaps2vQ/Q/jUGO+UmhSFobKhajboTR
    pQ+sIPcjYfdhLT/xzVQw27acKt8nWVyocN5U8lVC4okv3ItXNHLh5a0jac/amMyF
    AOB42u0xK4fjnV/aqtQ2el1IDfQgnlom9Vt6Dl/bztF83NpSl40euE2v8vZH84Ki
    WocP1OHA1BdxWoajNrI7oHpDdKkia76OwQ/TfQBHJUNcQUoW6YUoRtW07b/CAzqF
    V8PGUjpdfw508YcoC8VJJx9MTdLmOUeE1qQpxQIDAQABAoIBAQC6M+u4yBcSArWE
    vxnZE/sw43RN4KEFEDV60fk4QWV1rjMw4FHtM3p4W9xEVdOSm8A8jXeu6vDtvOGD
    FSy7PMTwGIFdlugqgObefZxCbe4bTeiwdS1A2KGIhNmRD0iBLbf+WZiqMKJAhM5+
    /1XDUTRq/ORPXAHnNJ1dMCdH8siBp/ulhZkfdsCDzwClpcOsvJsnAzIy/q0QIzbg
    68HjGEot90kbH3HDvdyb8iw5yRBdGnT2oiZ16BQ3v3NPc5cRWqTOs0v8vCfw85qV
    hmXjvYOOm0jNQqyflw6AC/j6DeKFJIcnVEjbp8ZbXRVbCbZ9uRXWU/w8LwIn5dUz
    gFntIGABAoGBAPeMrmJw6Rw4oXuaitsqWN+i0y8TPR8VAqP9zOMEvAVtSAcYeLIO
    qaConIEQlv/j12+0nvCExgvKrP38fP/w41ELRi7JH4bINfEB8KkZFGRDvdv1EnNj
    rjir5MJGALjHQyzREp6zvJWllf4r1/oTCYJvtPy8QRiP150WtRQD1LvFAoGBAOcl
    2FTqyH2mtKWLtOkAq8CNT3Eer5/Cax+w1qAhfZ0meoKkyee7Hf5klTxsqfE+nbgS
    95bwTOzbuvutifHZVySWGv467dsD/HOF0jHoaWKZsHGjQuNECV/UbnvrYWh4dGKT
    A6UR+Pa0KnSOZGf3O12NUBztNOZhAiUscvr0QpYBAoGAWU40Xykyv86iWzAepgB5
    /XwFSfdb1onC4RyfvMqpdh+9m2m1qS7m/SG3DEzK3Nf6kb8Mk+KifACLNjnPcpoZ
    t9QkZp6CNCKoayDzDF4S4DUcGm0oUd6FLMa+iWOtwPuJ/XITkJNxFl+dZAu3J+2U
    Qa1BEuhrZ4wFEhPuEaFsLq0CgYEAwoQ/k95sUAks9i8mU/pDjuucEda/9pKWsXmQ
    c/sbCVdrO2vPmVoG+KDOUaYkMSb/dPtJLdUU9zJGHSvB7St4QQqstosCxQ+Kr/DK
    nUM3BEnPiSHZ1QTZWrKbM182fsL3Nkj/hTclqv6cx69YYYFVjPmxlFYt8T1rn7rT
    G8rYCgECgYAA9wRv5nJOdM0YqISDexRPXjdXqiMgzwtH/11vrObhZPnKneGQ0Euf
    iGKHqxsI2tGBtcOHRxVKy4GzjJPSRMa5j8RXRiuMZNbikUnYCfZoEKcbHPEaS9mZ
    iSJ0/Hks7Xg2Iz1/q2aYb4HTIjMOXCGxXnC5IY+dqdKlpXPcigno0Q==
    -----END RSA PRIVATE KEY-----
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    MIIC9zCCAd+gAwIBAgIJAKVJWUdeCPLNMA0GCSqGSIb3DQEBCwUAMBIxEDAOBgNV
    BAMMB2V4YW1wbGUwHhcNMjEwMzE2MTcyNjE2WhcNMjEwNDE1MTcyNjE2WjASMRAw
    DgYDVQQDDAdleGFtcGxlMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
    34SJ4NhxaIkHXlhoslJIm746fMAOR/25REmxPEZvx/lwgzutAxUi7D39yaeyUJwj
    TNUWZXtlCp/Vr7fmuApaps2vQ/Q/jUGO+UmhSFobKhajboTRpQ+sIPcjYfdhLT/x
    zVQw27acKt8nWVyocN5U8lVC4okv3ItXNHLh5a0jac/amMyFAOB42u0xK4fjnV/a
    qtQ2el1IDfQgnlom9Vt6Dl/bztF83NpSl40euE2v8vZH84KiWocP1OHA1BdxWoaj
    NrI7oHpDdKkia76OwQ/TfQBHJUNcQUoW6YUoRtW07b/CAzqFV8PGUjpdfw508Yco
    C8VJJx9MTdLmOUeE1qQpxQIDAQABo1AwTjAdBgNVHQ4EFgQUjhESkkI8juzNLCdB
    qfu997UZGGMwHwYDVR0jBBgwFoAUjhESkkI8juzNLCdBqfu997UZGGMwDAYDVR0T
    BAUwAwEB/zANBgkqhkiG9w0BAQsFAAOCAQEARJYelaj5otonmb+ZF9Lg+66pNHET
    Gir/+kgPeFog+++63Rkl9E82b9mDql+lccNQo5/yuU0znP+NaB0OnVTpvfMfGYTL
    w9NIPJ+qVQG9P7tOQUas43Zk3oTPR2wz27Pu6/fyAL3LMxTfhWRj4IplSWs0Ipia
    k0qn+29PZA0m1ZSw0BrPjsnjZcL+ZZ5UejGdKyT4UWODYO6W+QUq+MPFNELX55qt
    /cv569ywZojbMUi6E+QO4U4Av722qQaoEqGuN49VDSigm/fb/kRc5Tn/oOnpS1rX
    tfJ8ExaeXxbhTjRHAarNZCUF/u5suH1NY6XfsxbedWaLmlmdGVfZ4Txdqg==
    -----END CERTIFICATE-----
---
EOF
    done
  )
done

As of today, cert-manager does not "filter" secrets. Listing all secrets on startup and every 10 hours has two consequences:

  • high memory usage for both the cert-manager controller and cainjector (5GiB and 2.6GiB respectively with 90,000 secrets).
  • high CPU usage of etcd and kube-apiserver when the listing happens.

Looking at ways to alleviate that:

As @Jancis suggested, one improvement could be to only watch secrets that have type: kubernetes.io/tls. The memory usage would be lowered, but the etcd CPU usage would stay the same since the filtering happens in the apiserver.

Filter on: the HTTP call takes 41 seconds:

$ time kubectl get secret -A --field-selector type=kubernetes.io/tls --no-headers >/dev/null
kubectl get secret -A --field-selector type=kubernetes.io/tls --no-headers >   24.91s user 0.56s system 61% cpu 41.524 total

Filter off: the HTTP call also takes 41 seconds:

$ time kubectl get secret -A --no-headers >/dev/null
kubectl get secret -A --no-headers > /dev/null  24.69s user 0.67s system 60% cpu 41.618 total

@wallrj
Member

wallrj commented Sep 7, 2021

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 18, 2022
@wallrj
Member

wallrj commented Jan 19, 2022

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2022
@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2022
@fvlaicu
Contributor

fvlaicu commented Apr 19, 2022

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2022
@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2022
@fvlaicu
Contributor

fvlaicu commented Jul 19, 2022

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 19, 2022
@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 17, 2022
@evanfoster

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2022
@alex-berger

We are also affected by this; in our case cainjector fails (see attached logs: cert-manager-cainjector-6cdc477bff-6vr5h.log).

Also interesting: upon that failure, the cainjector Pod (or rather its container) does not terminate. It just keeps running in a dysfunctional state forever while still holding the leader lease, which prevents other (still functional) cainjector Pods from taking over.

So it looks like several bugs are interacting here :-(, though of course none of this would happen if the API server weren't being crashed in the first place. It would be great if this bug could be fixed by removing resourceVersion=0 as suggested above.

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2023
@evanfoster

/remove-lifecycle stale

@jetstack-bot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 5, 2023
@fvlaicu
Contributor

fvlaicu commented Apr 5, 2023

/remove-lifecycle stale

@fvlaicu
Contributor

fvlaicu commented Apr 5, 2023

/remove-lifecycle rotten

@jetstack-bot jetstack-bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 5, 2023
@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2023
@jetstack-bot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 3, 2023
@alex-berger

/remove-lifecycle rotten

@jetstack-bot jetstack-bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 4, 2023
@wallrj
Member

wallrj commented Nov 2, 2023

@alex-berger Please read the new memory scalability section of the Best Practice documentation which now explains how to configure cainjector to only watch Secret resources in the cert-manager namespace. Let me know if it helps:

You can also reduce the memory consumption of cainjector by configuring it to only watch resources in the cert-manager namespace, and by configuring it to not watch Certificate resources.
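For anyone who wants to try it, a hedged sketch of the Helm values that scoping corresponds to (the flag names are taken from the linked Best Practice page and are assumptions here; verify them against your cert-manager and chart version before applying):

helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --set 'cainjector.extraArgs={--namespace=cert-manager,--enable-certificates-data-source=false}'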

@alex-berger

@wallrj No, that is not applicable in our case, as we use cainjector to inject things for webhooks in multiple namespaces.

⚠️️ This optimization is only appropriate if cainjector is being used exclusively for the cert-manager webhook. It is not appropriate if cainjector is also being used to manage the TLS certificates for webhooks of other software. For example, some Kubebuilder derived projects may depend on cainjector to inject TLS certificates for their webhooks.

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2024
@fvlaicu
Contributor

fvlaicu commented Feb 1, 2024

/remove-lifecycle stale.

@fvlaicu
Contributor

fvlaicu commented Feb 1, 2024

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2024
@wallrj wallrj added this to the 1.15 milestone Feb 2, 2024