cert-manager created multiple CertificateRequest objects with the same certificate-revision #4956
Hi @jayme-github thanks for creating the issue.
There shouldn't be two CertificateRequests with the same revision. It looks like what may have happened is that one CertificateRequest was created while the apiserver connectivity problems were ongoing and cert-manager retried the creation. After that, once the connectivity stabilized, the two CertificateRequests with the same revision both ended up existing. At the moment I am not sure if there is any way around this kind of issue.
Had the certificate actually expired? The ready status reflects the status of the issued certificate, not whether any issuance succeeded or failed.
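Since the ready status does not capture a stalled renewal like this, one way to get an early warning is to alert on the remaining certificate lifetime directly. Below is a minimal sketch using cert-manager's certmanager_certificate_expiration_timestamp_seconds metric with the Prometheus Operator's PrometheusRule resource; the resource names, threshold, and label usage are assumptions to adapt to your own setup.

```yaml
# Minimal sketch: warn when a managed certificate has little validity left,
# regardless of whether its Ready condition still reports True.
# Assumes the Prometheus Operator PrometheusRule CRD and that your
# cert-manager version exposes certmanager_certificate_expiration_timestamp_seconds.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-expiry      # hypothetical name
  namespace: monitoring          # adjust to your monitoring namespace
spec:
  groups:
    - name: cert-manager-expiry
      rules:
        - alert: CertificateNearExpiry
          # fires when fewer than 4 hours of validity remain
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 4 * 3600
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires soon"
```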
Hi @irbekrm thanks for taking the time.
AIUI this would have meant that the result of one of the two CertificateRequests was never used. Yes, it did expire. From my naive point of view and looking at the logs, it might be an option to add a new metric that counts events that end up being re-queued or logged as errors.
Hi, thanks for the extra information, this is all very useful.
So the cert in the secret had expired because the renewal never completed? Do you know whether the apiserver connectivity issues had resolved by the time the second CertificateRequest was created?
Thank you, this is all useful insight. We don't manage production cert-manager installations, so for metrics in particular we rely on what users tell us they would find useful. For this particular case, I would like to get rid of the need for human intervention.
Exactly, yes.
That would be awesome!
Not sure if it makes a difference in all of this, but since it comes from the default config, let me point out that all our Certificates use the defaults for this.
It would be really nice if cert-manager were able to resolve this situation on its own in the future.
Introducing a new metric controller_requeue_count counting the number of re-queuing events issued per controller and reason. Current reasons can be either "optimistic-locking" (logged as INFO) or "processing-error" (logged as ERROR). This adds more visibility to potential issues ranging from things like connection problems to the API or webhooks to possible hard errors. For context, please see cert-manager#4956 Signed-off-by: Janis Meybohm <jmeybohm@wikimedia.org>
Introducing a new metric controller_sync_error_count counting the number of errors during sync() of a controller. This adds more visibility to potential issues ranging from things like connection problems to the API or webhooks to possible hard errors. For context, please see cert-manager#4956 Signed-off-by: Janis Meybohm <jmeybohm@wikimedia.org>
Hi, just noting that I've also seen this issue, and also on a cluster with k8s API latency/availability issues. The workaround was to delete the two duplicate CertificateRequests. For one certificate we actually had three CertificateRequests with the same revision.
We also see this issue. We use Regional GKE clusters so don't have much control over k8s API latency. We use some fairly short-lived (1hr) client certificates generated by namespaced cert-manager CA Issuers. When this problem occurs we only have a limited time before our services start failing because of expired certificates if we don't delete the duplicate CertificateRequests in time. We use the new metric to set up an alertmanager alert, so we do at least have a chance of preventing an outage, but it would be better if cert-manager were able to de-dupe these CertificateRequests itself.
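For reference, an alert along the lines described above could look roughly like this. It assumes the metric from the linked change is exported as certmanager_controller_sync_error_count with a controller label; both the name and the labels are assumptions to verify against the cert-manager version actually deployed.

```yaml
# Sketch of an alerting rule on the sync-error metric discussed in this
# thread. The metric name (certmanager_controller_sync_error_count) and its
# labels are assumptions; check what your cert-manager build actually exposes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-sync-errors   # hypothetical name
spec:
  groups:
    - name: cert-manager-errors
      rules:
        - alert: CertManagerSyncErrors
          # sustained sync errors can precede stuck renewals like the
          # duplicate-CertificateRequest situation in this issue
          expr: sum by (controller) (increase(certmanager_controller_sync_error_count[15m])) > 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "cert-manager controller {{ $labels.controller }} is reporting sync errors"
```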
I want to report the same (bad) behaviour. Are there any plans to fix this, please?
Hey @irbekrm, we are seeing this fairly frequently in our code base. At this point, we have seen it in versions v1.5.3 and v1.8.0. For example, we see:
In addition, when we look at the certificates, we see:
In our particular case, we are using self-signed issuers, so we don't care if we have to delete the duplicates. We have not been able to consistently get this error, however. It is very much intermittent and happens to different certificates within the system. Part of our problem here is that we cannot rely on manually deleting these requests. We need to get to the root cause of this, but I could imagine a workaround where we specify in the YAML specification of the certificate to "auto-resolve" the issue by deleting the duplicate certificate. Environment details:
We are seeing the same issue as well. We are using version 1.8.0.
There were multiple CertificateRequests created by cert-manager for common-web-ui-ca-cert with the same revision. We manually deleted them. After that, the certificate creation was successful. Is there any plan to fix the root cause?
@munnerz @irbekrm per our discussion today in the bi-weekly, I was able to dig up some logs that did indeed contain the log statement:
Then, a few lines below that, I see:
So, it does appear that the apiserver could play a role in this issue.
Issues go stale after 90d of inactivity.
/remove-lifecycle stale
@ksauzz, this has been fixed, and the fix is disabled by default. It can be enabled using the feature flag StableCertificateRequestName.
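For anyone else landing here, enabling the gate looks roughly like the following when cert-manager is installed with the official Helm chart. The featureGates value is assumed to be wired through to the controller's --feature-gates flag in the chart version you run, so verify this against your installation method.

```yaml
# Sketch: enabling the StableCertificateRequestName feature gate via Helm
# values (assumes the chart passes this through to the controller's
# --feature-gates flag; for manifest-based installs, add
# --feature-gates=StableCertificateRequestName=true to the controller args).
featureGates: "StableCertificateRequestName=true"
```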
@sathyanarays Thank you for the fix! I didn't notice that. Let me test it.
We had issues with cert-manager creating multiple CertificateRequest objects with the same certificate-revision in the past, see: * https://phabricator.wikimedia.org/T304092 * cert-manager/cert-manager#4956 Upstream introduced a fix that ensures CertificateRequest objects are created with predictable names, so no duplicates are possible: * cert-manager/cert-manager#5487 This fix is hidden behind a feature gate which this change opens for wikikube staging clusters. Bug: T304092 Change-Id: Ibb063cc653fc24dc306282154892c6a6b25f705e
Issues go stale after 90d of inactivity.
We have used StableCertificateRequestName for 3 months, and it works nicely for us so far. Thank you!
Stale issues rot after 30d of inactivity.
Rotten issues close after 30d of inactivity.
@jetstack-bot: Closing this issue. In response to this:
We've seen this again running cert-manager
Reading through the comments in the code (https://github.com/cert-manager/cert-manager/blob/master/pkg/controller/certificates/requestmanager/requestmanager_controller.go#L424-L426), it looks like disabling the flag will actually help us prevent this issue.
Hi all, like @ffilippopoulos I'm still seeing this issue in v1.13.3 as well. Does disabling the flag work?
Describe the bug:
At Wikimedia we run cert-manager with our own issuer and a cfssl PKI: https://gerrit.wikimedia.org/g/operations/software/cfssl-issuer
We've got a staging cluster with short-lived certificates (24h) where I noticed one not being refreshed.
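For context, a short-lived Certificate in a setup like this would look roughly as follows. The resource names, DNS name, issuer reference, and renewBefore value are illustrative assumptions; only the 24h lifetime comes from this report.

```yaml
# Illustrative sketch of a short-lived (24h) Certificate; resource names,
# issuerRef and renewBefore are assumptions, not taken from the actual cluster.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: toolhub
  namespace: toolhub
spec:
  secretName: toolhub-tls
  duration: 24h
  renewBefore: 8h            # renewal is attempted with 8h of validity left
  dnsNames:
    - toolhub.example.org    # hypothetical DNS name
  issuerRef:
    kind: Issuer
    name: example-issuer     # placeholder; the report uses Wikimedia's cfssl-issuer here
```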
From the data in the API objects it seems as if the certificate was triggered for renewal at 2022-02-06T17:00:03Z, which led to the creation of CertificateRequest/toolhub-l8xjm first (2022-02-06T17:01:59Z) and CertificateRequest/toolhub-wvz2q second (2022-02-06T17:04:19Z), both sharing the same cert-manager.io/certificate-revision: 49 and cert-manager.io/private-key-secret-name: toolhub-rrvtj. See kubernetes_objects.yaml.

During that time we had some pretty elevated latency on the kubernetes apiserver, mainly for CREATE and UPDATE on cert-manager.io/certificaterequest resources, and apparently some connectivity issues with the kubernetes apiserver, as seen in cert-manager.log below.
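To make the situation concrete, the relevant metadata of the two objects described above looks roughly like this (a trimmed sketch reconstructed from the names, timestamps, and annotations in this report, not a verbatim excerpt of kubernetes_objects.yaml):

```yaml
# Trimmed reconstruction of the two duplicate CertificateRequest objects;
# only fields mentioned in this report are shown.
apiVersion: cert-manager.io/v1
kind: CertificateRequest
metadata:
  name: toolhub-l8xjm
  creationTimestamp: "2022-02-06T17:01:59Z"
  annotations:
    cert-manager.io/certificate-revision: "49"
    cert-manager.io/private-key-secret-name: toolhub-rrvtj
---
apiVersion: cert-manager.io/v1
kind: CertificateRequest
metadata:
  name: toolhub-wvz2q
  creationTimestamp: "2022-02-06T17:04:19Z"
  annotations:
    cert-manager.io/certificate-revision: "49"   # same revision as the first request
    cert-manager.io/private-key-secret-name: toolhub-rrvtj
```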
Expected behaviour:
I'd have expected that even if two CertificateRequests were created, they would not be allowed to share the same cert-manager.io/certificate-revision. Also, I would have expected some kind of error metric telling me that something was wrong, and/or the certmanager_certificate_ready_status being False or Unknown.

Steps to reproduce the bug:
Unfortunately I don't know how to reproduce.
Anything else we need to know?:
Environment details:
/kind bug