Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recognise finalized ACME Orders and gracefully recover by updating the Order's status when they are already in a "valid" state #2765

Closed
devfans opened this issue Mar 30, 2020 · 8 comments
Labels
area/acme Indicates a PR directly modifies the ACME Issuer code kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@devfans
Copy link

devfans commented Mar 30, 2020

Describe the bug:
A clear and concise description of what the bug is.
I got:

E0330 11:50:35.237661       1 sync.go:447] cert-manager/controller/orders "msg"="failed to finalize Order resource due to bad request, marking Order as failed" "error"="403 urn:ietf:params:acme:error:orderNotReady: Order's status (\"valid\") is not acceptable for finalization" "resource_kind"="Order" "resource_name"="domain-com-1755579778-2867024102" "resource_namespace"="xx"
I0330 11:50:35.237704       1 sync.go:56] cert-manager/controller/orders "msg"="updating Order resource status" "resource_kind"="Order" "resource_name"="domain-com-1755579778-2867024102" "resource_namespace"="xx" 
E0330 11:50:35.249918       1 sync.go:59] cert-manager/controller/orders "msg"="failed to update status" "error"=null "resource_kind"="Order" "resource_name"="domain-com-c-1755579778-2867024102" "resource_namespace"="xx" 
E0330 11:50:35.249945       1 controller.go:140] cert-manager/controller/orders "msg"="re-queuing item  due to error processing" "error"="Operation cannot be fulfilled on orders.acme.cert-manager.io \"domain-com-1755579778-2867024102\": the object has been modified; please apply your changes to the latest version and try again" "key"="xx/domain-com-1755579778-2867024102" 

Looks similiar as one old bug: #758
Expected behaviour:
Order should be completed

Steps to reproduce the bug:

after cert manager 0.14.1 is installed with helm. create cluster issuer with dns01 resolver and using route53 to generate a cert for *.domain.com

Anything else we need to know?:
It works well for single domain name like for a.domain.com, and. i specified dns01 with route53 as the only resolver in cluster issuer.

Environment details::

  • Kubernetes version (e.g. v1.10.2): v1.15.x
  • Cloud-provider/provisioner (e.g. GKE, kops AWS, etc): eks on aws
  • cert-manager version (e.g. v0.4.0): 0.14.1
  • Install method (e.g. helm or static manifests): helm

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 30, 2020
@gigaSproule
Copy link

gigaSproule commented Apr 16, 2020

I'm also seeing this same error for the http01 resolver and using a single domain. https://community.letsencrypt.org/t/orders-status-valid-is-not-acceptable-for-finalization/103419 suggests that everything is fine, but cert-manager isn't handling it properly.

@munnerz
Copy link
Member

munnerz commented Apr 23, 2020

This should only happen if cert-manager has a cached status value of Ready when the Order is actually in a Valid (i.e. already finalized) state: https://github.com/jetstack/cert-manager/blob/49e1a7a51c71307718d29c4275bf3916906011c2/pkg/controller/acmeorders/sync.go#L442-L452

We could definitely look at trying to handle this better, e.g. by updating the Order's status to reflect the current state of the Order to allow for a later reconciliation to fetch the already issued certificate.

I'm interested to understand what has caused this case to come up in the first place however - it shouldn't really occur unless a previous attempt to finalize the Order succeeded but updating the Order's status to reflect this change failed. We could also consider using a PATCH instead of an UPDATE operation on the Order resource to try and avoid this case.

Can you see any logs that might indicate this has happened, and if so, any ideas what could cause it? Are you running multiple instances of cert-manager at once, or anything like that?

@munnerz munnerz added area/acme Indicates a PR directly modifies the ACME Issuer code priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Apr 23, 2020
@munnerz munnerz changed the title Route53 resolver shows Order's status (\"valid\") is not acceptable for finalization Recognised finalized ACME Orders and gracefully recover by updating the Order's status when "already finalized" errors occur Apr 23, 2020
@munnerz munnerz changed the title Recognised finalized ACME Orders and gracefully recover by updating the Order's status when "already finalized" errors occur Recognised finalized ACME Orders and gracefully recover by updating the Order's status when they are already in a "valid" state Apr 23, 2020
@munnerz munnerz changed the title Recognised finalized ACME Orders and gracefully recover by updating the Order's status when they are already in a "valid" state Recognise finalized ACME Orders and gracefully recover by updating the Order's status when they are already in a "valid" state Apr 23, 2020
@Nuru
Copy link

Nuru commented May 8, 2020

@munnerz I have seen this, too. I believe it is due to #2741.

It appears that cert-manger is using some kind of hash of the SANs to use to name the CertificateRequest and Order; in any case, if you delete and re-create a certificate, the re-created CertificateRequest, Order, and Challenge all get the same names they had before.

I do not have the sequence exactly nailed down, but it goes something like this:

  • You fiddle around long enough creating and deleting certificates that you hit the Duplicate Certificate limit of 5 per week
  • You create another duplicate. Because of Certificate creation is stuck with Order having no state and no challenges are created for a specific URL #2741, that request just lingers in the background
  • You delete your Certificate because it was not issued and you are trying to goose it. This does not cancel the pending order (POST to new-order API) because it is stuck in a retry loop.
  • While the certificate is deleted, the rate limit window passes, and the order is accepted
  • You create another Certificate, trying to get it to work. Now you have the same order in your system twice. The abandoned order from earlier and the new one. Because the order is very recent, Let's Encrypt will return the same order with the same authorizations as before, and you end up finalizing it twice.

That does not seem exactly right to me, as I do not know all the details about how cert-manager processing works, but I think it is something like that. The key being that you can easily create duplicate orders.

@JoshVanL JoshVanL added triage/support Indicates an issue that is a support question. and removed triage/support Indicates an issue that is a support question. labels Jan 28, 2021
@jetstack-bot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2021
@Nuru
Copy link

Nuru commented Sep 16, 2021

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 16, 2021
@PSanetra
Copy link

PSanetra commented Nov 29, 2021

We observed exactly the same issue with v1.6.1 when we tried to renew the certificate with kubectl cert-manager -n my-namespace renew my-cert. It worked in most cases but failed with this issue in some.

We have recovered from this by deleting the certificate resource and issuing the renew command again.

@jetstack-bot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2022
@irbekrm
Copy link
Collaborator

irbekrm commented Mar 11, 2022

This should have been fixed in cert-manager v1.7, see #4697

@irbekrm irbekrm closed this as completed Mar 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/acme Indicates a PR directly modifies the ACME Issuer code kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

8 participants