Traefik is not retrying the ACME caserver connection upon failure #2670

Open
igoratencompass opened this issue Jan 7, 2018 · 6 comments

@igoratencompass igoratencompass commented Jan 7, 2018

Do you want to request a feature or report a bug?

I would say both, a feature and a bug

What did you do?

What did you expect to see?

If the connection fails for some reason, I would expect the Traefik pod that holds the ACME lock in the backend KV store (Kubernetes etcd in my case) to retry after a configurable time interval.

What did you see instead?

That is not happening.

Output of traefik version: (What version of Traefik are you using?)

v1.5.0-rc4


What is your environment & configuration (arguments, toml, provider, platform, ...)?

Traefik with etcd backend in Kubernetes.


If applicable, please paste the log output in debug mode (--debug switch)


@alex88 alex88 commented Dec 8, 2018

Same here with 1.7.5: ACME failed due to a 500 in the challenge response and Traefik never tried again. After a restart everything worked.

@costela costela commented Dec 12, 2018

FYI: I'm also seeing the same behavior on 1.7.5 and I'm not using a KV store (using JSON on an EBS volume; single instance), so the issue label might be misleading for anyone trying to debug this a bit further.

@hjacobs hjacobs commented Sep 17, 2019

The behavior is pretty bad IMHO for my case: there is a major race condition between Traefik ACME and the actual server startup (Kubernetes ingress). I repeatedly tried to get my certs issued and got as far as 3/4 certs (hostnames). I'm now trying to get 4/4 by killing the Traefik pod and checking the logs 😞

{"level":"error","msg":"Unable to obtain ACME certificate for domains \"kube-resource-report.demo.j-serv.de\" detected thanks to rule \"Host:kube-resource-report.demo.j-serv.de\" : unable to generate a certificate for the domains [kube-resource-report.demo.j-serv.de]: acme: Error -\u003e One or more domains had a problem:\n[kube-resource-report.demo.j-serv.de] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Connection refused, url: \n","time":"2019-09-17T15:07:30Z"}

My Traefik config: https://codeberg.org/hjacobs/k3s-demo/src/branch/master/manifests/traefik-config.yaml

An option to delay the cert creation (wait X seconds) would already help.

@hjacobs hjacobs commented Sep 17, 2019

How I fixed this in my case: I edited the Ingress YAML, changed the hostname to something invalid (like xxx.de), and then changed it back to the right one.

Result: Traefik detected the change and issued the cert (onHostRule = true), and there was no race condition this time (as the server was already running).

@costela costela commented Sep 17, 2019

@hjacobs just to complement: a similar race condition might also happen between the ingress and its DNS entry, if it's also being managed by k8s (via something like cloud provider integration or external-dns).

The onDemand option (deprecated in 1.7.*; dropped in 2.*) seems to help in this case. It would be nice if it were "revived". I couldn't find a rationale for its removal.

@hjacobs hjacobs commented Sep 17, 2019

Write-up of my Traefik/ACME problem I mentioned above: https://srcco.de/posts/k3s-outage-traefik-acme-lets-encrypt-local-path.html
