Description
Describe the bug:
We've been using cert-manager with ZeroSSL as the ACME provider and http01 challenges for several months now, very successfully.
However, a couple of weeks ago ZeroSSL appears to have changed their ACME API:
they now enforce a fairly strict request rate limit.
Whenever we issue a new certificate containing 3 or more domains using the http01 challenge, we run into 429 responses from their API, which completely breaks the certificate issuance flow.
Note: The problem does not occur when issuing a cert containing <=2 domains.
Expected behaviour:
The controller should respect 429 responses and try again later.
In my case, retrying 2-3 seconds later would already solve the issue.
Steps to reproduce the bug:
This is the certificate resource:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    service: tls-cert
  labels:
    service: tls-cert
  name: tls-cert
spec:
  dnsNames:
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: zerossl
  secretName: tls-cert
  usages:
  - digital signature
  - key encipherment

And this is the ClusterIssuer resource:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: zerossl
spec:
  acme:
    externalAccountBinding:
      keyID: xxxxx
      keySecretRef:
        key: eab-hmac-key
        name: zerossl
    privateKeySecretRef:
      name: zerossl-account
    server: https://acme.zerossl.com/v2/DV90
    solvers:
    - http01:
        ingress:
          class: nginx

After applying the certificate to the cluster, the corresponding CertificateRequest, Order, and Challenge resources are created as expected.
However, during processing of the challenges, the ACME client hits the request limit of the ZeroSSL API:

# failed challenge status:
status:
  presented: false
  processing: false
  reason: 'Failed to retrieve Order resource: 429 : 429 Too Many Requests'
  state: errored

Once the first challenge fails, the error state is propagated to the Order and Certificate resources:
# Order status:
status:
  authorizations:
  ....
  failureTime: "2023-03-16T10:26:15Z"
  finalizeURL: https://acme.zerossl.com/v2/DV90/order/xxxxx/finalize
  reason: "Failed to retrieve Order resource: 429 : <html>\r\n<head><title>429 Too
    Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"
  state: errored
  url: https://acme.zerossl.com/v2/DV90/order/xxxxx

# Certificate status:
status:
  conditions:
  - lastTransitionTime: "2023-03-16T10:26:08Z"
    message: Issuing certificate as Secret does not exist
    observedGeneration: 1
    reason: DoesNotExist
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-03-16T10:26:15Z"
    message: "The certificate request has failed to complete and will be retried:
      Failed to wait for order resource \"tls-cert-twhmq-1698200363\" to become ready:
      order is in \"errored\" state: Failed to retrieve Order resource: 429 : <html>\r\n<head><title>429
      Too Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"
    observedGeneration: 1
    reason: Failed
    status: "False"
    type: Issuing
  failedIssuanceAttempts: 1
  lastFailureTime: "2023-03-16T10:26:15Z"

Anything else we need to know?:
It seems that the order is retrieved from the ACME API once for every challenge.
The more domains the certificate contains, the more challenges are spawned, and thus the more requests are made to fetch the same order object.
I see two technical issues here:
- upon receiving a 429 response code, the controller should retry instead of giving up immediately
- in order to ease the pressure on the ACME API, the order response should be cached
I have informed ZeroSSL's technical support about this issue.
Their suggestion was to throttle the requests and/or implement a retry.
Environment details:
- Kubernetes version: 1.24
- Cloud-provider/provisioner: GKE
- cert-manager version: 1.11.0
- Install method: helm 1.11.0
/kind bug