Cert-manager causes API server panic on clusters with more than 20000 secrets. #3748
Hi @mvukadinoff, Sorry to hear you're having trouble and thanks for raising the issue. I'll see if I can reproduce it, it'd be interesting to know which query that was.
It'd definitely be interesting to know whether it's possible to reduce the number of queries with that many secrets. I wonder if allowing the sync period to be modified would also help for larger deployments like this.
Is the limit 20,000 globally, or 20,000 in a single namespace?
For our case it's 20,000 globally, across more than 1,500 namespaces. I don't think it's a hard limit; it's more about how powerful the masters are and how much load they can withstand. Regarding the query, I believe it's: /api/v1/secrets?limit=500&resourceVersion=0 We saw there is a newer version of cert-manager; we'll try that as well.
But it seems to be the case that at least cainjector watches all secrets. We are currently in the process of introducing cert-manager to our OpenShift clusters and have no (or very few) Certificate objects, but the memory consumption of cainjector seems to scale with the number of Secrets a cluster has in general. OpenShift creates a lot of serviceaccounts per namespace by default, which results in a lot of sa token secrets. On clusters with more namespaces (and therefore more serviceaccounts/secrets), cainjector requires more memory than on clusters with fewer namespaces. We are running cert-manager 1.3.1.
On a cluster that we've only installed the CRDs and don't have any certificates actually managed by cert-manager, the controller makes a call for all secrets - on that cluster we have about 130k secrets.
We've been affected by this now too. I don't know the secrets-querying specifics, but can't cert-manager add a type filter to it? At least the cert-manager secrets on our cluster have
Adding
With 90,000 secrets, I wasn't able to overload etcd:
💣 Secret bomb (90,000 secrets)
As of today, cert-manager does not "filter" secrets. Listing all secrets on startup and every 10 hours has two consequences:
Looking at ways to alleviate that: as @Jancis suggested, one improvement could be to only watch secrets that have a certain label. However, filtering doesn't seem to reduce the load: filter on = HTTP call takes 41 seconds, filter off = HTTP call takes 41 seconds too.
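A plausible explanation for those identical timings: the apiserver applies label selectors only after reading every stored object, so the scan cost scales with the total number of secrets whether or not a filter is set. A minimal simulation of that flow (plain Go, not cert-manager or apiserver code; the `cert-manager.io/owned` label is hypothetical):

```go
package main

import "fmt"

// secret is a stand-in for a Kubernetes Secret's metadata.
type secret struct {
	labels map[string]string
}

// listSecrets mimics the server-side flow: it scans every stored object
// and only afterwards applies the label selector. The scan count is the
// cost driver, and it is identical with or without a selector.
func listSecrets(store []secret, selector map[string]string) (matched, scanned int) {
	for _, s := range store {
		scanned++
		ok := true
		for k, v := range selector {
			if s.labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			matched++
		}
	}
	return matched, scanned
}

func main() {
	store := make([]secret, 90000)
	for i := range store {
		if i%1000 == 0 {
			store[i].labels = map[string]string{"cert-manager.io/owned": "true"}
		}
	}
	m1, s1 := listSecrets(store, nil) // no filter
	m2, s2 := listSecrets(store, map[string]string{"cert-manager.io/owned": "true"})
	fmt.Println(m1, s1) // 90000 90000
	fmt.Println(m2, s2) // 90 90000: fewer results, but the same scan cost
}
```

In other words, a label selector shrinks the response body, which helps client memory, but not the etcd/apiserver read work itself.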
The same issue appeared in Rancher: Other related links:
Issues go stale after 90d of inactivity. |
/remove-lifecycle stale |
We are also affected by this; in our case cainjector fails (see attached logs: cert-manager-cainjector-6cdc477bff-6vr5h.log). Also interesting: upon that failure, the cainjector Pod (resp. container) does not terminate. It just runs on, dysfunctional, forever, while still holding the leader lease, which prevents other (still functional) cainjector Pods from taking over. So it looks like there are several bugs acting together :-(, but of course if it did not crash the API server, this would not happen at all. Thus, it would be great if this bug could be fixed by removing
@alex-berger Please read the new memory scalability section of the Best Practice documentation which now explains how to configure cainjector to only watch Secret resources in the cert-manager namespace. Let me know if it helps:
@wallrj No, that is not applicable in our case, as we use cainjector to inject things for webhooks in multiple namespaces.
Describe the bug:
On clusters with more than 20,000 secrets this becomes a problem. The query that cert-manager makes is not optimal:
/api/v1/secrets?limit=500&resourceVersion=0
resourceVersion=0 will always cause all secrets to be queried, and limit=500 will not be taken into account. This way cert-manager is not scalable for large deployments, since secrets are used not only for certificates.
As mentioned in kubernetes/kubernetes#56278 and https://kubernetes.io/docs/reference/using-api/api-concepts/
I suggest removing resourceVersion=0 from the query, which should make it much faster.
Furthermore, cert-manager will retry those queries without waiting for them to complete; they pile up and cause significant load, even crashes, on the API server. Cert-manager basically DDoSes the API server.
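To make the semantics concrete: per the Kubernetes API concepts documentation, a list with resourceVersion=0 is served from the apiserver's watch cache, which returns the full collection and does not support pagination, so limit is ignored; omitting resourceVersion allows a paginated, continue-token-driven list. The difference is visible in the query parameters alone. A minimal sketch with Go's standard library (listURL is a hypothetical helper, not cert-manager code):

```go
package main

import (
	"fmt"
	"net/url"
)

// listURL builds the apiserver path for listing all Secrets.
// With resourceVersion="0" the request is answered from the watch cache,
// which returns the full list and ignores `limit`. Leaving resourceVersion
// unset lets the server honor limit/continue pagination against etcd.
func listURL(limit int, resourceVersion, continueToken string) string {
	q := url.Values{}
	q.Set("limit", fmt.Sprint(limit))
	if resourceVersion != "" {
		q.Set("resourceVersion", resourceVersion)
	}
	if continueToken != "" {
		q.Set("continue", continueToken)
	}
	return "/api/v1/secrets?" + q.Encode()
}

func main() {
	// The call reported in this issue: full list, limit ignored.
	fmt.Println(listURL(500, "0", ""))
	// A paginated alternative: first page, then follow continue tokens.
	fmt.Println(listURL(500, "", ""))
	fmt.Println(listURL(500, "", "CONTINUE_TOKEN"))
}
```

The trade-off is that paginated lists hit etcd on every page rather than the cache, so they are slower per request but bound the response size and memory use.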
We're hitting the same issue with:
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-controller:v0.11.0
and
quay.io/jetstack/cert-manager-controller:v1.1.0
quay.io/jetstack/cert-manager-cainjector:v1.1.0
Logs from API server
Logs from ETCD:
Logs from cert-manager:
Expected behaviour:
cert-manager should not make heavy queries that list all secrets across all namespaces, but should instead work per namespace.
Steps to reproduce the bug:
Generate 15,000 secrets; they don't need to be TLS certificates, any secret will do.
Look at the API server load and the cert-manager logs.
Anything else we need to know?:
Environment details:
/kind bug