
cert-manager created multiple CertificateRequest objects with the same certificate-revision #4956

Closed
jayme-github opened this issue Mar 17, 2022 · 21 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jayme-github
Contributor

Describe the bug:
At Wikimedia we run cert-manager with our own issuer and a cfssl PKI: https://gerrit.wikimedia.org/g/operations/software/cfssl-issuer

We've got a staging cluster with short-lived certificates (24h) where I noticed one not being refreshed.
From the data in the API objects it seems as if the certificate was triggered for renewal at 2022-02-06T17:00:03Z, which led to the creation of CertificateRequest/toolhub-l8xjm first (2022-02-06T17:01:59Z) and CertificateRequest/toolhub-wvz2q second (2022-02-06T17:04:19Z), both sharing the same cert-manager.io/certificate-revision: 49 and cert-manager.io/private-key-secret-name: toolhub-rrvtj. See kubernetes_objects.yaml.
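
(For reference, one way to spot duplicates like this is to dump the relevant cert-manager.io/* annotations of all CertificateRequests in the namespace. A rough sketch, assuming jq is available; the namespace and annotation keys are the ones from this report, plus the cert-manager.io/certificate-name annotation, which is assumed to be set as well:)

# List each CertificateRequest with its owning Certificate, its certificate-revision
# and its private-key-secret-name annotation, one per line.
kubectl -n istio-system get certificaterequests.cert-manager.io -o json \
  | jq -r '.items[] | [
      .metadata.name,
      (.metadata.annotations["cert-manager.io/certificate-name"] // "-"),
      (.metadata.annotations["cert-manager.io/certificate-revision"] // "-"),
      (.metadata.annotations["cert-manager.io/private-key-secret-name"] // "-")
    ] | @tsv'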

During that time we had pretty elevated latency on the Kubernetes apiserver, mainly for CREATE and UPDATE on cert-manager.io/certificaterequest resources, and apparently some connectivity issues with the apiserver as well, as seen in cert-manager.log below.

I0206 17:00:03.399283       1 conditions.go:201] Setting lastTransitionTime for Certificate "toolhub" condition "Issuing" to 2022-02-06 17:00:03.399274083 +0000 UTC m=+10735.693974692
I0206 17:00:03.399216       1 trigger_controller.go:181] cert-manager/controller/certificates-trigger "msg"="Certificate must be re-issued" "key"="istio-system/toolhub" "message"="Renewing certificate as renewal was scheduled at 2022-02-06 17:00:00 +0000 UTC" "reason"="Renewing"
E0206 17:00:20.630289       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=10s: context deadline exceeded" "key"="istio-system/toolhub" 
E0206 17:00:31.650166       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded" "key"="istio-system/toolhub" 
E0206 17:00:43.668839       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded" "key"="istio-system/toolhub" 
E0206 17:00:48.729312       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: dial tcp 10.192.76.185:443: connect: connection refused" "key"="istio-system/toolhub" 
E0206 17:00:57.816629       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: dial tcp 10.192.76.185:443: connect: connection refused" "key"="istio-system/toolhub" 
E0206 17:02:20.402566       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="failed whilst waiting for CertificateRequest to exist - this may indicate an apiserver running slowly. Request will be retried" "key"="istio-system/toolhub" 
E0206 17:06:26.899074       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="failed whilst waiting for CertificateRequest to exist - this may indicate an apiserver running slowly. Request will be retried" "key"="istio-system/toolhub" 
I0206 17:07:34.907870       1 trace.go:205] Trace[861803746]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (06-Feb-2022 17:07:21.603) (total time: 13304ms):
Trace[861803746]: [13.30407288s] [13.30407288s] END
Trace[861803746]: ---"Objects listed" 13303ms (17:07:34.906)
I0206 17:07:57.199752       1 trace.go:205] Trace[782326899]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (06-Feb-2022 17:07:35.007) (total time: 22100ms):
Trace[782326899]: [22.100469212s] [22.100469212s] END
Trace[782326899]: ---"Objects listed" 22095ms (17:07:57.102)
I0206 17:08:05.299156       1 conditions.go:261] Setting lastTransitionTime for CertificateRequest "toolhub-l8xjm" condition "Approved" to 2022-02-06 17:08:04.500992384 +0000 UTC m=+44.594675153
I0206 17:08:05.199563       1 conditions.go:261] Setting lastTransitionTime for CertificateRequest "toolhub-wvz2q" condition "Approved" to 2022-02-06 17:08:03.500823508 +0000 UTC m=+43.594506243
I0206 17:08:05.904575       1 trace.go:205] Trace[850423915]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (06-Feb-2022 17:07:35.006) (total time: 30897ms):
Trace[850423915]: ---"Objects listed" 30895ms (17:08:05.902)
Trace[850423915]: [30.897705171s] [30.897705171s] END
I0206 17:08:08.699125       1 requestmanager_controller.go:210] cert-manager/controller/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="istio-system/toolhub" 
I0206 17:08:09.201178       1 requestmanager_controller.go:210] cert-manager/controller/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="istio-system/toolhub" 
I0206 17:08:09.300472       1 requestmanager_controller.go:210] cert-manager/controller/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="istio-system/toolhub" 
E0206 17:08:10.599180       1 controller.go:163] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 49, issuance is skipped until there are no more duplicates" "key"="istio-system/toolhub" 
E0206 17:08:13.400912       1 controller.go:163] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 49, issuance is skipped until there are no more duplicates" "key"="istio-system/toolhub"
...this message keeps repeating ever since
2022-02-06T17:02:00.026484+00:00 controller.certificaterequest istio-system toolhub-l8xjm CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:02:00.423743+00:00 controller.certificaterequest istio-system toolhub-l8xjm CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:04:19.921125+00:00 controller.certificaterequest istio-system toolhub-wvz2q CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:04:20.280381+00:00 controller.certificaterequest istio-system toolhub-wvz2q CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:08:06.328312+00:00 controller.certificaterequest istio-system toolhub-wvz2q Initialising Ready condition
2022-02-06T17:08:07.376063+00:00 controller.certificaterequest istio-system toolhub-l8xjm Initialising Ready condition
2022-02-06T17:08:07.441243+00:00 controller.certificaterequest istio-system toolhub-wvz2q Signing cert with k8s_staging discovery true
2022-02-06T17:08:08.047227+00:00 controller.certificaterequest istio-system toolhub-l8xjm Signing cert with k8s_staging discovery true
2022-02-06T17:08:08.536874+00:00 controller.certificaterequest istio-system toolhub-l8xjm CertificateRequest is Ready. Ignoring.
2022-02-06T17:08:08.536832+00:00 controller.certificaterequest istio-system toolhub-wvz2q CertificateRequest is Ready. Ignoring.

Expected behaviour:
I'd have expected that even if two CertificateRequests had been created, they would not be allowed to share the same cert-manager.io/certificate-revision. I would also have expected some kind of error metric telling me that something was wrong, and/or certmanager_certificate_ready_status being False or Unknown.
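
(For context, both of these metrics can be checked straight from the controller's metrics endpoint. A minimal sketch, assuming a default Helm install with the controller deployment named cert-manager in the cert-manager namespace and metrics served on the default port 9402:)

# Forward the metrics port and grep for the affected certificate's metrics.
kubectl -n cert-manager port-forward deploy/cert-manager 9402:9402 &
curl -s http://127.0.0.1:9402/metrics \
  | grep 'certmanager_certificate_' | grep 'name="toolhub"'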

Steps to reproduce the bug:
Unfortunately I don't know how to reproduce.

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.16
  • Cloud-provider/provisioner: n/a
  • cert-manager version: 1.5.4
  • Install method: helm

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 17, 2022
@irbekrm
Contributor

irbekrm commented Mar 23, 2022

Hi @jayme-github, thanks for creating the issue.

I'd have expected that even if two CertificateRequests had been created, they would not be allowed to share the same cert-manager.io/certificate-revision.

There shouldn't be two CertificateRequests with the same revision that have both succeeded. If a CertificateRequest fails, a new one will be created with the same revision in most cases. But this isn't relevant to what you are reporting as I see that both CRs have eventually succeeded.

E0206 17:02:20.402566 1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="failed whilst waiting for CertificateRequest to exist - this may indicate an apiserver running slowly. Request will be retried" "key"="istio-system/toolhub"

It looks like what may have happened is that one CertificateRequest got created, cert-manager timed out waiting for it to be available due to the apiserver being slow, the reconciler was triggered again and, I guess due to the same apiserver slowness, cert-manager did not get the first CertificateRequest from the apiserver here; as it saw no CertificateRequests with that revision, it created another one.

After that, once the connectivity stabilized, the two CertificateRequests would be reconciled and fulfilled separately by another controller, which isn't responsible for checking the number of CertificateRequests per revision, but just for issuing certs.

At the moment I am not sure if there is any way around this kind of issue.

I would also have expected some kind of error metric telling me that something was wrong, and/or certmanager_certificate_ready_status being False or Unknown.

Had the certificate actually expired? The ready status reflects the status of the issued certificate, not whether any issuance succeeded or failed.
The main risk when this kind of issue happens is that too many certificate requests are made to an external issuer. It isn't really a failure state of the Certificate as such. I am not really sure how we could warn about this beyond what is already being logged.

@jayme-github
Contributor Author

Hi @irbekrm, thanks for taking the time.

After that, once the connectivity stabilized, the two CertificateRequests would be reconciled and fulfilled separately by another controller, which isn't responsible for checking the number of CertificateRequests per revision, but just for issuing certs.

AIUI this would have meant that the result of one CertificateRequest (e.g. the signed certificate) should have ended up in a kubernetes.io/tls secret, right? As far as I can tell that never happened.

Had the certificate actually expired? The ready status reflects the status of the issued certificate, not whether any issuance succeeded or failed. The main risk when this kind of issue happens is that too many certificate requests are made to an external issuer. It isn't really a failure state of the Certificate as such. I am not really sure how we could warn about this beyond what is already being logged.

Yes, it did expire and certmanager_certificate_expiration_timestamp_seconds was reflecting that. But the certificate never entered a not-ready state (metrics-wise).
As I see it, there were no new CertificateRequests created (apart from the two already in existence), so no further calls to the issuer should have been made in this case.

From my naive point of view, and looking at the logs, it might be an option to add a new metric that counts events that end up being logged as re-queuing item due to error processing, as elevated rates of those errors (over a longer period of time, maybe) might point towards an underlying issue.
The re-queuing event issuance is skipped until there are no more duplicates still seems kind of special, as cert-manager is unable to get out of this situation without human intervention. While I do appreciate that being logged so clearly, I would still like to be able to alert on a corresponding metric to call out for said human. :) I have not looked at the code at all, but could it be an option to just count "hard errors" like that, making clear it's not going to go away just by retrying?

@irbekrm
Contributor

irbekrm commented Mar 24, 2022

Hi, thanks for the extra information, this is all very useful.

AIUI this would have meant that the result of one CertificateRequest (e.g. the signed certificate) should have ended up in a kubernetes.io/tls secret, right? As far as I can tell that never happened.

So the cert in toolhub-tls-certificate was never updated?
I had another look at the code and I see that it is possible that the controller responsible for writing the cert to the secret hit this case here, which is pretty unhelpful: there is actually no error, it just silently returns nil, and yes, in that case the cert never gets written to the secret.
So that needs improving. Ideally I'd like there to be no human intervention needed at this point; I'll see if it might make sense to actually read the cert from one of the CertificateRequests at that point anyway, or else to delete the other CertificateRequest(s) in requestmanager.

Yes, it did expire and certmanager_certificate_expiration_timestamp_seconds was reflecting that. But the certificate never entered a not-ready state (metrics-wise).

Do you know whether the Ready status of the Certificate was eventually set to false and if not, do you know for how long after the cert expired the Ready status remained true?
There is a controller responsible for updating the Ready condition of the certificate on the basis of the contents of the Secret. It is possible that there was no event at the time the cert expired that would have caused a reconcile of that controller; however, we do re-sync all objects every 10 hours, which, I think, should have resulted in the controller updating the Ready condition. (This also makes me realize that 10 hours is too long if we are relying on this to catch expired short-lived certs in edge cases where the reconciler is not being triggered by events.)
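
(For completeness, the current Ready condition of a Certificate can be inspected directly; a small sketch using the names from this issue:)

# Print the status and message of the Certificate's Ready condition.
kubectl -n istio-system get certificate toolhub \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}'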

From my naive point of view, and looking at the logs, it might be an option to add a new metric that counts events that end up being logged as re-queuing item due to error processing, as elevated rates of those errors (over a longer period of time, maybe) might point towards an underlying issue.
The re-queuing event issuance is skipped until there are no more duplicates still seems kind of special, as cert-manager is unable to get out of this situation without human intervention. While I do appreciate that being logged so clearly, I would still like to be able to alert on a corresponding metric to call out for said human. :) I have not looked at the code at all, but could it be an option to just count "hard errors" like that, making clear it's not going to go away just by retrying?

Thank you, this is all useful insight - we don't manage production cert-manager installations, so for metrics in particular we rely on what users tell us they would find useful. For this particular case, I would like to get rid of the need for human intervention.
A metric for hard errors could be useful though; is this something that you have and use for other software?

@jayme-github
Contributor Author

So the cert in toolhub-tls-certificate was never updated?

Exactly, yes.

Ideally I'd like there to be no human intervention needed at this point; I'll see if it might make sense to actually read the cert from one of the CertificateRequests at that point anyway, or else to delete the other CertificateRequest(s) in requestmanager.

That would be awesome!

Do you know whether the Ready status of the Certificate was eventually set to false and if not, do you know for how long after the cert expired the Ready status remained true?

The Certificate object that I attached was dumped ~30 days after the actual certificate had expired. So as far as I can tell the Ready status never changed to false (and certmanager_certificate_ready_status stayed True the whole time for that certificate as well). So that should definitely have been caught by the resync...

Not sure if it makes a difference in all of this, but as it differs from the default config let me point out that all our Certificate objects have revisionHistoryLimit: 2 set.

Thank you, this is all useful insight - we don't manage production cert-manager installations, so for metrics in particular we rely on what users tell us they would find useful. For this particular case, I would like to get rid of the need for human intervention. A metric for hard errors could be useful though; is this something that you have and use for other software?

It would be really nice if cert-manager were able to resolve this situation on its own in the future.
Having thought a bit more about a possible metric that would have exposed this, I think having a counter for re-queuing events (which is what other software does as well) would have helped in this case too.
We would have seen a steep increase in the rate of those events, leading to someone investigating (and finding the root cause nicely printed in the logs). As this would also make other issues visible, I guess it would be a good signal to have.

jayme-github added a commit to wikimedia/cert-manager that referenced this issue Mar 25, 2022
Introducing a new metric controller_requeue_count counting the
number of re-queuing events issued per controller and reason. Current
reasons can be either "optimistic-locking" (logged as INFO) or
"processing-error" (logged as ERROR).

This adds more visibility to potential issues ranging from things like
connection problems to the API or webhooks to possible hard errors.

For context, please see cert-manager#4956

Signed-off-by: Janis Meybohm <jmeybohm@wikimedia.org>
jayme-github added a commit to wikimedia/cert-manager that referenced this issue Mar 29, 2022
Introducing a new metric controller_sync_error_count counting the
number of errors during sync() of a controller.

This adds more visibility to potential issues ranging from things like
connection problems to the API or webhooks to possible hard errors.

For context, please see cert-manager#4956

Signed-off-by: Janis Meybohm <jmeybohm@wikimedia.org>
@smlx

smlx commented Jun 9, 2022

Hi, just noting that I've also seen this issue, also on a cluster with k8s API latency/availability issues.

The workaround was to delete the two certificaterequest objects with duplicate revisions and allow cert-manager to generate a new request successfully.

For one certificate we actually had three certificaterequests with duplicate revisions.
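
(For anyone else hitting this, the manual workaround looks roughly like the following; the resource names are the ones from the original report, so substitute your own duplicates:)

# Delete the CertificateRequests that share the same certificate-revision;
# cert-manager will create a fresh request for the Certificate afterwards.
kubectl -n istio-system delete certificaterequest toolhub-l8xjm toolhub-wvz2q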

@nathan-c

We also see this issue. We use regional GKE clusters, so we don't have much control over k8s API latency. We use some fairly short-lived (1h) client certificates generated by namespaced cert-manager CA Issuers. When this problem occurs we only have a limited time before our services start failing because of expired certificates. If we don't delete the duplicate CertificateRequest objects in that time, we can experience outages. I can see how this is low priority for long-lived certs, but it's critical for short-lived ones.

We use the new metric to set up an Alertmanager alert, so we do at least have a chance of preventing an outage, but it would be better if cert-manager were able to de-dupe these CertificateRequest objects itself (or never create them in the first place).

@ptsk5

ptsk5 commented Aug 1, 2022

I want to report the same (bad) behaviour. Are there any plans to fix this, please?

I0801 10:01:36.376261       1 requestmanager_controller.go:210] cert-manager/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="my-ns/my-cert"

@AcidLeroy
Contributor

Hey @irbekrm, we are seeing this fairly frequently in our code base. At this point, we have seen it in versions v1.5.3 and v1.8.0. For example, we see:

I0928 19:24:21.113681       1 requestmanager_controller.go:210] cert-manager/controller/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="vmware-system-nsop/vmware-system-nsop-serving-cert"
E0928 19:24:21.125432       1 controller.go:163] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 1, issuance is skipped until there are no more duplicates" "key"="vmware-system-nsop/vmware-system-nsop-serving-cert"
E0928 19:24:22.126395       1 controller.go:163] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 1, issuance is skipped until there are no more duplicates" "key"="vmware-system-nsop/vmware-system-nsop-serving-cert"

In addition, when we look at the certificates, we see:

root@422ae4b32e9127ab363a62c945891ae0 [ ~ ]# kubectl -n vmware-system-nsop get certificaterequests.cert-manager.io
NAME                                    APPROVED   DENIED   READY   ISSUER                                 REQUESTOR                                                       AGE
vmware-system-nsop-serving-cert-q82rg   True                True    vmware-system-nsop-selfsigned-issuer   system:serviceaccount:vmware-system-cert-manager:cert-manager   4h9m
vmware-system-nsop-serving-cert-tkbkr   True                True    vmware-system-nsop-selfsigned-issuer   system:serviceaccount:vmware-system-cert-manager:cert-manager   4h10m

In our particular case, we are using self-signed issuers, so we don't care if we have to delete the duplicates. We have not been able to consistently get this error, however. It is very much intermittent and happens to different certificates within the system. Part of our problem here is that we cannot rely on manually deleting these requests. We need to get to the root cause of this, but I could imagine a workaround where we specify in the YAML specification of the Certificate to "auto-resolve" the issue by deleting the duplicate CertificateRequest.
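
(Until the root cause is fixed, a rough way to detect affected Certificates cluster-wide, assuming jq, is something like the sketch below; it prints every namespace/certificate/revision combination that has more than one CertificateRequest:)

# Print namespace/certificate-name plus revision for every CertificateRequest,
# then keep only the combinations that occur more than once.
kubectl get certificaterequests.cert-manager.io -A -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.annotations["cert-manager.io/certificate-name"]) revision \(.metadata.annotations["cert-manager.io/certificate-revision"])"' \
  | sort | uniq -d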

Environment details:

  • cert-manager: v1.5.3
  • Kubernetes version: v1.22.6

@yannizhang2019

We are seeing the same issue as well. We are using version 1.8.0.
There were numerous repeating error messages:

E1004 15:35:00.237428       1 controller.go:163] cert-manager/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 1, issuance is skipped until there are no more duplicates" "key"="cp4i/common-web-ui-ca-cert" 

There were multiple CertificateRequests created by cert-manager for common-web-ui-ca-cert with the same revision. We manually deleted them; after that, the certificate creation was successful.

Is there any plan to fix the root cause?

@AcidLeroy
Contributor

@munnerz @irbekrm per our discussion today in the bi-weekly, I was able to dig up some logs that did indeed contain the log statement:

 1 controller.go:166] cert-manager/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="failed whilst waiting for CertificateRequest to exist - this may indicate an apiserver running slowly. Request will be retried" "key"="vmware-system-tkg/tkr-resolver-cluster-webhook-serving-cert" 

Then, a few lines below that, I see:

1 requestmanager_controller.go:217] cert-manager/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="vmware-system-tkg/tkr-resolver-cluster-webhook-serving-cert"

So, it does appear that the apiserver could play a role in this issue.

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2023
@ksauzz

ksauzz commented Jan 4, 2023

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2023
@sathyanarays
Contributor

@ksauzz, this has been fixed, but the fix is disabled by default. It can be enabled using the feature flag StableCertificateRequestName. Please let us know if this works!
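
(A minimal sketch of turning the gate on for a Helm-managed install; this assumes the chart's featureGates value, which is passed through to the controller's --feature-gates flag, and the usual jetstack/cert-manager chart and release names:)

# Enable predictable CertificateRequest names on the controller.
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --set featureGates="StableCertificateRequestName=true"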

@ksauzz

ksauzz commented Jan 5, 2023

@sathyanarays Thank you for the fix! I didn't notice that. Let me test it.

wmfgerrit pushed a commit to wikimedia/operations-deployment-charts that referenced this issue Mar 13, 2023
We had issues with cert-manager creating multiple CertificateRequest
objects with the same certificate-revision in the past, see:
* https://phabricator.wikimedia.org/T304092
* cert-manager/cert-manager#4956

Upstream introduced a fix that ensures CertificateRequest objects are
created with predictable names, so no duplicates are possible:
* cert-manager/cert-manager#5487

This fix is hidden behind a feature gate, which this change enables for wikikube staging clusters.

Bug: T304092
Change-Id: Ibb063cc653fc24dc306282154892c6a6b25f705e
@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2023
@ksauzz

ksauzz commented May 2, 2023

We have used StableCertificateRequestName for 3 months, and it has worked nicely for us so far. Thank you!

@jetstack-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 1, 2023
@jetstack-bot
Contributor

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@jetstack-bot
Contributor

@jetstack-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ffilippopoulos

We've seen this again running cert-manager v1.13.1, which has the StableCertificateRequestName feature gate enabled by default.

E1024 15:15:56.304905       1 controller.go:167] "cert-manager/certificates-readiness: re-queuing item due to error processing" err="multiple CertificateRequests were found for the 'next' revision 117, issuance is skipped until there are no more duplicates" key="otel/otel-collector-kafka-client"

Reading through the comments in the code (https://github.com/cert-manager/cert-manager/blob/master/pkg/controller/certificates/requestmanager/requestmanager_controller.go#L424-L426), it looks like disabling the flag would actually help us prevent this issue.
@sathyanarays Are we missing something here, or shall I try disabling the feature?

@atsang36

atsang36 commented Feb 28, 2024

Hi all,

Like @ffilippopoulos I'm still seeing this issue in v1.13.3 as well. Does disabling the flag work?

@ksauzz @sathyanarays @irbekrm
