Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

letsEncrypt not working with etcd #926

Closed
vincentvdk opened this issue Dec 2, 2016 · 26 comments
Closed

letsEncrypt not working with etcd #926

vincentvdk opened this issue Dec 2, 2016 · 26 comments

Comments

@vincentvdk
Copy link

@vincentvdk vincentvdk commented Dec 2, 2016

When I try to configure Traefik in combination with 'etcd' as a backend store, i can't get LetsEncrypt working.

traefik.toml (imported with storeconfig)

[web]
address = ":8080"
defaultEntryPoints = ["https"]
[entryPoints]
  [entryPoints.http]
  address = ":80"
    [entryPoints.http.redirect]
    entryPoint = "https"
  [entryPoints.https]
  address = ":443"
    [entryPoints.https.tls]
[acme]
email = "me@domain.tld"
storage= "/traefik/acme/account"
entryPoint = "https"
onDemand = false
OnHostRule = true
caServer = "https://acme-staging.api.letsencrypt.org/directory"
[[acme.domains]]
   main = "domain.tld"
   sans = ["www.domain.tld", "ww2.domain.tld"]
[etcd]
endpoint = "127.0.0.1:4001"
watch = true
prefix = "/traefik"

Output in the logs

Dec 02 09:23:18 core-01 rkt[21448]: [131075.886503] traefik[5]: time="2016-12-02T10:23:18+01:00" level=info msg="Skipping same configuration for provider etcd"
Dec 02 09:23:22 core-01 rkt[21448]: [131079.650257] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge GetCertificate www.mydomain.tld"
Dec 02 09:23:22 core-01 rkt[21448]: [131079.650881] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Loading ACME certificates [www.mydomain.tld]..."
Dec 02 09:23:22 core-01 rkt[21448]: [131079.661423] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge GetCertificate www.mydomain.tld"
Dec 02 09:23:22 core-01 rkt[21448]: [131079.662657] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Loading ACME certificates [www.mydomain.tld]..."
Dec 02 09:23:22 core-01 rkt[21448]: [131080.212813] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge Present www.mydomain.tld"
Dec 02 09:23:22 core-01 rkt[21448]: [131080.238591] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge Present www.mydomain.tld"

From the output in the logs I can't really find out what is exactly going on. I had a look at the source code to see where it might go wrong, but my Go knowledge is no-exiting atm.

Slack reference : https://traefik.slack.com/archives/support/p1480674299005971

@marc-j
Copy link

@marc-j marc-j commented Dec 24, 2016

+1

@one000mph
Copy link

@one000mph one000mph commented Dec 28, 2016

+1. I seem to have the same general issue
Slack reference: https://traefik.slack.com/archives/support/p1482885115000714

I’m trying to configure a etcd cluster of three servers to use acme with traefik. The servers seem to be disagreeing on leadership and also failing to get certificates for the first time.

time="2016-12-27T23:15:11Z" level=error msg="Error calling Leadership listener: Unlock error: 101: Compare failed ([255659 != 255661]) [255662]" 

time="2016-12-28T00:12:08Z" level=error msg="Datastore sync error: Object lock value: expected <value1>, got <value2>, retrying in 293.253134ms" 

The Datastore seems to be refreshing rapidly when all three traefik instances are running so the object lock never matches, therefore the cluster never gets any certificates. Also the onDemand flag is on. there’s a redirect set from http to https, and I’m also using the acme-staging server, just in case those details are relevant. thanks.

@genti-t
Copy link

@genti-t genti-t commented Jan 6, 2017

I can confirm that problem is regarding the usage of etcd and acme.
I tried with a single replica with no etcd usage for KV store, and it works perfectly.

Is someone working on this ?

@emilevauge
Copy link
Member

@emilevauge emilevauge commented Jan 11, 2017

Sadly, it seems that https://github.com/docker/libkv is not managing etcd locks and consul locks in the same way :'( More tests have to be done to confirm this but this is what I suspect...

@genti-t
Copy link

@genti-t genti-t commented Jan 19, 2017

Any news about this ?

@ringods
Copy link

@ringods ringods commented Mar 7, 2017

This issue in libkv tracks this: docker/libkv#106

@andrewl3wis
Copy link

@andrewl3wis andrewl3wis commented Mar 15, 2017

Just want to add that this appears to be happening on zookeeper as well:

ERRO[2017-03-15T00:31:49Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 588.410415ms 
ERRO[2017-03-15T00:31:50Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 738.289049ms 
ERRO[2017-03-15T00:31:51Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 1.151435232s 
ERRO[2017-03-15T00:31:52Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 1.965909627s 
ERRO[2017-03-15T00:31:54Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 2.138171143s 
ERRO[2017-03-15T00:31:56Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 5.3727141s 

EDIT: Appears to be unrelated making new issue

@apognu
Copy link

@apognu apognu commented Apr 16, 2017

I may experience the same issue. I'll check when I get to my computer, but from memory, the account JSON is properly generated, but no certificate is ever requested from LE (from what I can gather from the debug logs).

Will post my etcd config and logfiles.

@lorenz
Copy link

@lorenz lorenz commented Apr 18, 2017

Full trace:

2017-04-18T03:11:45.562113969Z time="2017-04-18T03:11:45Z" level=debug msg="Transaction a751858a-c444-4675-acdf-c057f65efa5d begins" 
2017-04-18T03:11:45.939267691Z time="2017-04-18T03:11:45Z" level=debug msg="Challenge GetCertificate grafana.kube.[redacted].org" 
2017-04-18T03:11:45.939292422Z time="2017-04-18T03:11:45Z" level=debug msg="Loading ACME certificates [grafana.kube.[redacted].org]..." 
2017-04-18T03:11:45.939295958Z legolog: 2017/04/18 03:11:45 [INFO][grafana.kube.[redacted].org] acme: Obtaining bundled SAN certificate
2017-04-18T03:11:46.255160863Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] AuthURL: https://acme-v01.api.letsencrypt.org/acme/authz/[redacted]
2017-04-18T03:11:46.255181716Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] acme: Could not find solver for: dns-01
2017-04-18T03:11:46.255184646Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] acme: Trying to solve TLS-SNI-01
2017-04-18T03:11:46.255187033Z time="2017-04-18T03:11:46Z" level=debug msg="Challenge Present grafana.kube.[redacted].org" 

I looked through the code, it seems like it gets stuck in the acme code after here, which calls this traefik function.

At that point we might get stuck in either the KV transaction, the Golang lock or if none of that happens, execution jumps back to here , which might loop forever if the request is permanently pending. Because after validate() has completed the defer hook for CleanUp() is run (which would log a message), we know that validate() never terminates.

When I get more time I might insert some print statements to see exactly where it gets stuck.

@lorenz
Copy link

@lorenz lorenz commented Apr 18, 2017

Narrowed it down to the etcd transaction, adding further debug output

@apognu
Copy link

@apognu apognu commented Apr 18, 2017

My logs are a lot less verbose than yours (or fail before yours):

level=debug msg="Building ACME client..."
level=info msg="Starting ACME renew job..."
level=debug msg="Challenge GetCertificate redacted.com" 
level=debug msg="ACME got nothing redacted.com" 

Is there anything I could do to get more verbose logs?

@lorenz
Copy link

@lorenz lorenz commented Apr 18, 2017

You can enable debug logs in etcd for both traefik and acme. Set /traefik/loglevel to DEBUG and /traefik/acme/acmelogging to true.
I'm currently running a custom build with even more logging to figure out why the etcd transaction fails.

@apognu
Copy link

@apognu apognu commented Apr 18, 2017

Both are already set, /traefik/debug and /traefik/acme/acmelogging are true, and /traefik/loglevel is set to DEBUG.

@lorenz
Copy link

@lorenz lorenz commented Apr 18, 2017

You might need to also set the debug flag in the configuration file and not just etcd. The above log output was produced by a stock traefik with these flags.

@apognu
Copy link

@apognu apognu commented Apr 19, 2017

Yeah, that doesn't change anything, even with debug and log options set on the command line. The only log line I see regarding ACME is the Challenge GetCertificate one.

For reference, I got traefik running in k8s pods, on GCE, with the following ACME config in etcd:

/ # etcdctl get /traefik/acme/email
redacted@redacted.com
/ # etcdctl get /traefik/acme/ondemand
true
/ # etcdctl get /traefik/acme/onhostrule
true
/ # etcdctl get /traefik/acme/storage
traefik/acme/account
/ # etcdctl get /traefik/acme/entrypoint
https
/ # etcdctl get /traefik/acme/domains/0/main
redacted.com

Everything was working fine when configuring through the config file and storing LE certificates in a plain file. So I guess this might be related to etcd as well.

@apognu
Copy link

@apognu apognu commented Apr 20, 2017

I deployed a freshly built binary of traefik (through make build) and here the logs show up. It seems to be stuck at the exact same point as @lorenz (Challenge Present redacted.com).

@lorenz
Copy link

@lorenz lorenz commented Apr 20, 2017

Well I debugged the whole thing, it essentially boils down to the mentioned docker/libkv#106 issue. I started the etcdv3 upgrade for libkv to actually have proper locking, but the libkv interface maps badly onto etcdv3 concurrency primitives. I'll see what I'm going to do, maybe I'm also going to rewrite cluster.Store, since the original interface abstraction would make a ridiculous number of etcdv3 calls for an essentially simple operation.

@ldez ldez added kind/bug/confirmed and removed bug labels Apr 25, 2017
@ldez ldez added priority/P2 and removed priority/P1 labels May 30, 2017
@kelchm
Copy link
Contributor

@kelchm kelchm commented Aug 5, 2017

Ran into this issue today. Might be worth noting the issues with etcd in the docs somewhere

@trusch
Copy link

@trusch trusch commented Aug 16, 2017

+1 on this. Is there a good workaround?

@ldez ldez added the area/cluster label Aug 16, 2017
@kelchm
Copy link
Contributor

@kelchm kelchm commented Aug 16, 2017

@trusch, the only workaround I've found is to just use Consul instead. It seems to be the only key/value provider that works correctly with ACME.

Here's a rough proof of concept that will get Consul up and running in Rancher.

@abronan
Copy link

@abronan abronan commented Aug 17, 2017

I'm catching up on this issue, trying to slowly get the libkv repository back on track. In the meantime I'll start working on the etcd v3 backend which should hopefully fix the problem (I see it has a proper Mutex API now which was not the case when I originally opened docker/libkv#103 and docker/libkv#106).

As @lorenz mentioned above, the interface probably maps badly with the new lock implementation but I'll be aiming for something that works at first. We can later discuss about a possible interface change to better map with new primitives.

I'll update as soon as I have some results.

@abronan
Copy link

@abronan abronan commented Aug 18, 2017

Update: I got the Lock working with the v3 client (using the same model as for zookeeper: a side-key for the lock). It works as intended and it's pretty clean now compared to v2, just a few lines of code.

What's left:

  • Finish, Polish and Test the v3 backend: I'm missing calls like Watch/WatchTree and List/DeleteTree.
  • I realized I can also probably fix the v2 custom lock feature using the same method with a side-key (Yay!)

Will update again as soon as I drop a PR.

@abronan
Copy link

@abronan abronan commented Aug 20, 2017

Initial PR for etcd V3 is up: abronan/libkv#4

I would appreciate some help on testing in real usage (and with traefik maybe?). Will wait for feedback and also (re)-reviewing myself before merging.


PR for etcd v2 lock up as well: abronan/libkv#5

trusch added a commit to trusch/traefik that referenced this issue Aug 21, 2017
docker/libkv is dead and abronan/libkv is the repository where development is happening.
this solves containous#1975 and therefore hopefully containous#926 once the etcdv2 fix is merged.
this also adds support for redis and etcdv3 storage backends.
@trusch
Copy link

@trusch trusch commented Aug 29, 2017

everything is merged into abronan/libkv! I hope this fixes will be integrated in near future :)

@abronan
Copy link

@abronan abronan commented Aug 29, 2017

Let me know if everything works out fine or if there are still remaining issues and I'll take a look. Feel free to open an issue on the fork as well. I may create a new repository as some have issues with import paths and the fact that it still points to the docker/libkv repository (some package managers support aliasing, some not). But we'll migrate issues if we go forward with that approach.

@nmengin nmengin mentioned this issue Nov 14, 2017
1 of 1 task complete
@traefiker traefiker added this to the 1.5 milestone Nov 17, 2017
@KlavsKlavsen
Copy link

@KlavsKlavsen KlavsKlavsen commented Nov 7, 2018

This seems to still be an issue. Tried with traefik 1.7.4 and an etcd-3.3.9.

@containous containous locked and limited conversation to collaborators Sep 1, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

You can’t perform that action at this time.