Controller can't handle hitting request rate limits of zerossl ACME API #5867

@hnicke

Description

Describe the bug:

We've been using cert-manager with zerossl as the ACME provider, with http01 challenges, very successfully for several months.
However, a couple of weeks ago zerossl appears to have changed their ACME API: they introduced a quite strict request rate limit.
Whenever we issue a new certificate containing 3 or more domains using the http01 challenge, we run into 429 responses from their API, which completely breaks the certificate issuance flow.
Note: the problem does not occur when issuing a certificate containing 2 or fewer domains.

Expected behaviour:
The controller should respect 429 responses and try again later.
In my case, retrying 2-3 seconds later would already solve the issue.

Steps to reproduce the bug:
This is the certificate resource:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    service: tls-cert
  labels:
    service: tls-cert
  name: tls-cert
spec:
  dnsNames:
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  - xxx
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: zerossl
  secretName: tls-cert
  usages:
  - digital signature
  - key encipherment

And this is the ClusterIssuer resource:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: zerossl
spec:
  acme:
    externalAccountBinding:
      keyID: xxxxx
      keySecretRef:
        key: eab-hmac-key
        name: zerossl
    privateKeySecretRef:
      name: zerossl-account
    server: https://acme.zerossl.com/v2/DV90
    solvers:
    - http01:
        ingress:
          class: nginx

After applying the certificate to the cluster, the corresponding CertificateRequest, Order, and Challenge resources are created as expected.
However, during processing of the challenges, the ACME client hits the request limit of the zerossl API:

# failed challenge status:
status:
  presented: false
  processing: false
  reason: 'Failed to retrieve Order resource: 429 : 429 Too Many Requests'
  state: errored

Once the first challenge fails, the error state propagates to the Order and Certificate resources:

# Order status:
status:
  authorizations:
    ....
  failureTime: "2023-03-16T10:26:15Z"
  finalizeURL: https://acme.zerossl.com/v2/DV90/order/xxxxx/finalize
  reason: "Failed to retrieve Order resource: 429 : <html>\r\n<head><title>429 Too
    Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"
  state: errored
  url: https://acme.zerossl.com/v2/DV90/order/xxxxx

# Certificate status:
status:
  conditions:
  - lastTransitionTime: "2023-03-16T10:26:08Z"
    message: Issuing certificate as Secret does not exist
    observedGeneration: 1
    reason: DoesNotExist
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-03-16T10:26:15Z"
    message: "The certificate request has failed to complete and will be retried:
      Failed to wait for order resource \"tls-cert-twhmq-1698200363\" to become ready:
      order is in \"errored\" state: Failed to retrieve Order resource: 429 : <html>\r\n<head><title>429
      Too Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"
    observedGeneration: 1
    reason: Failed
    status: "False"
    type: Issuing
  failedIssuanceAttempts: 1
  lastFailureTime: "2023-03-16T10:26:15Z"

Anything else we need to know?:

It seems that for every challenge, the order is retrieved from the ACME API.
The more domains in the certificate, the more challenges are spawned, and thus the more requests are made to fetch the order object.

I see two technical issues here:

  • upon receiving a 429 response, the controller should retry instead of giving up immediately
  • to ease the pressure on the ACME API, the order response should be cached

I have informed the technical support of zerossl about this issue.
Their suggestion was to throttle the requests and/or implement a retry.

Environment details:

  • Kubernetes version: 1.24
  • Cloud-provider/provisioner: GKE
  • cert-manager version: 1.11.0
  • Install method: helm 1.11.0

/kind bug


    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
    priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
