New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

letsEncrypt not working with etcd #926

Closed
vincentvdk opened this Issue Dec 2, 2016 · 25 comments

Comments

Projects
None yet
@vincentvdk

vincentvdk commented Dec 2, 2016

When I try to configure Traefik in combination with 'etcd' as a backend store, i can't get LetsEncrypt working.

traefik.toml (imported with storeconfig)

[web]
address = ":8080"
defaultEntryPoints = ["https"]
[entryPoints]
  [entryPoints.http]
  address = ":80"
    [entryPoints.http.redirect]
    entryPoint = "https"
  [entryPoints.https]
  address = ":443"
    [entryPoints.https.tls]
[acme]
email = "me@domain.tld"
storage= "/traefik/acme/account"
entryPoint = "https"
onDemand = false
OnHostRule = true
caServer = "https://acme-staging.api.letsencrypt.org/directory"
[[acme.domains]]
   main = "domain.tld"
   sans = ["www.domain.tld", "ww2.domain.tld"]
[etcd]
endpoint = "127.0.0.1:4001"
watch = true
prefix = "/traefik"

Output in the logs

Dec 02 09:23:18 core-01 rkt[21448]: [131075.886503] traefik[5]: time="2016-12-02T10:23:18+01:00" level=info msg="Skipping same configuration for provider etcd"
Dec 02 09:23:22 core-01 rkt[21448]: [131079.650257] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge GetCertificate www.mydomain.tld"
Dec 02 09:23:22 core-01 rkt[21448]: [131079.650881] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Loading ACME certificates [www.mydomain.tld]..."
Dec 02 09:23:22 core-01 rkt[21448]: [131079.661423] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge GetCertificate www.mydomain.tld"
Dec 02 09:23:22 core-01 rkt[21448]: [131079.662657] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Loading ACME certificates [www.mydomain.tld]..."
Dec 02 09:23:22 core-01 rkt[21448]: [131080.212813] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge Present www.mydomain.tld"
Dec 02 09:23:22 core-01 rkt[21448]: [131080.238591] traefik[5]: time="2016-12-02T10:23:22+01:00" level=debug msg="Challenge Present www.mydomain.tld"

From the output in the logs I can't really find out what is exactly going on. I had a look at the source code to see where it might go wrong, but my Go knowledge is no-exiting atm.

Slack reference : https://traefik.slack.com/archives/support/p1480674299005971

@marc-j

This comment has been minimized.

Show comment
Hide comment

marc-j commented Dec 24, 2016

+1

@one000mph

This comment has been minimized.

Show comment
Hide comment
@one000mph

one000mph Dec 28, 2016

+1. I seem to have the same general issue
Slack reference: https://traefik.slack.com/archives/support/p1482885115000714

I’m trying to configure a etcd cluster of three servers to use acme with traefik. The servers seem to be disagreeing on leadership and also failing to get certificates for the first time.

time="2016-12-27T23:15:11Z" level=error msg="Error calling Leadership listener: Unlock error: 101: Compare failed ([255659 != 255661]) [255662]" 

time="2016-12-28T00:12:08Z" level=error msg="Datastore sync error: Object lock value: expected <value1>, got <value2>, retrying in 293.253134ms" 

The Datastore seems to be refreshing rapidly when all three traefik instances are running so the object lock never matches, therefore the cluster never gets any certificates. Also the onDemand flag is on. there’s a redirect set from http to https, and I’m also using the acme-staging server, just in case those details are relevant. thanks.

one000mph commented Dec 28, 2016

+1. I seem to have the same general issue
Slack reference: https://traefik.slack.com/archives/support/p1482885115000714

I’m trying to configure a etcd cluster of three servers to use acme with traefik. The servers seem to be disagreeing on leadership and also failing to get certificates for the first time.

time="2016-12-27T23:15:11Z" level=error msg="Error calling Leadership listener: Unlock error: 101: Compare failed ([255659 != 255661]) [255662]" 

time="2016-12-28T00:12:08Z" level=error msg="Datastore sync error: Object lock value: expected <value1>, got <value2>, retrying in 293.253134ms" 

The Datastore seems to be refreshing rapidly when all three traefik instances are running so the object lock never matches, therefore the cluster never gets any certificates. Also the onDemand flag is on. there’s a redirect set from http to https, and I’m also using the acme-staging server, just in case those details are relevant. thanks.

@genti-t

This comment has been minimized.

Show comment
Hide comment
@genti-t

genti-t Jan 6, 2017

I can confirm that problem is regarding the usage of etcd and acme.
I tried with a single replica with no etcd usage for KV store, and it works perfectly.

Is someone working on this ?

genti-t commented Jan 6, 2017

I can confirm that problem is regarding the usage of etcd and acme.
I tried with a single replica with no etcd usage for KV store, and it works perfectly.

Is someone working on this ?

@emilevauge

This comment has been minimized.

Show comment
Hide comment
@emilevauge

emilevauge Jan 11, 2017

Member

Sadly, it seems that https://github.com/docker/libkv is not managing etcd locks and consul locks in the same way :'( More tests have to be done to confirm this but this is what I suspect...

Member

emilevauge commented Jan 11, 2017

Sadly, it seems that https://github.com/docker/libkv is not managing etcd locks and consul locks in the same way :'( More tests have to be done to confirm this but this is what I suspect...

@genti-t

This comment has been minimized.

Show comment
Hide comment
@genti-t

genti-t Jan 19, 2017

Any news about this ?

genti-t commented Jan 19, 2017

Any news about this ?

@ringods

This comment has been minimized.

Show comment
Hide comment
@ringods

ringods Mar 7, 2017

This issue in libkv tracks this: docker/libkv#106

ringods commented Mar 7, 2017

This issue in libkv tracks this: docker/libkv#106

@andrewl3wis

This comment has been minimized.

Show comment
Hide comment
@andrewl3wis

andrewl3wis Mar 15, 2017

Just want to add that this appears to be happening on zookeeper as well:

ERRO[2017-03-15T00:31:49Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 588.410415ms 
ERRO[2017-03-15T00:31:50Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 738.289049ms 
ERRO[2017-03-15T00:31:51Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 1.151435232s 
ERRO[2017-03-15T00:31:52Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 1.965909627s 
ERRO[2017-03-15T00:31:54Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 2.138171143s 
ERRO[2017-03-15T00:31:56Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 5.3727141s 

EDIT: Appears to be unrelated making new issue

andrewl3wis commented Mar 15, 2017

Just want to add that this appears to be happening on zookeeper as well:

ERRO[2017-03-15T00:31:49Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 588.410415ms 
ERRO[2017-03-15T00:31:50Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 738.289049ms 
ERRO[2017-03-15T00:31:51Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 1.151435232s 
ERRO[2017-03-15T00:31:52Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 1.965909627s 
ERRO[2017-03-15T00:31:54Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 2.138171143s 
ERRO[2017-03-15T00:31:56Z] Datastore sync error: Object lock value: expected c6ff697f-3880-422d-9e06-2e44bb8ce18a, got , retrying in 5.3727141s 

EDIT: Appears to be unrelated making new issue

@apognu

This comment has been minimized.

Show comment
Hide comment
@apognu

apognu Apr 16, 2017

I may experience the same issue. I'll check when I get to my computer, but from memory, the account JSON is properly generated, but no certificate is ever requested from LE (from what I can gather from the debug logs).

Will post my etcd config and logfiles.

apognu commented Apr 16, 2017

I may experience the same issue. I'll check when I get to my computer, but from memory, the account JSON is properly generated, but no certificate is ever requested from LE (from what I can gather from the debug logs).

Will post my etcd config and logfiles.

@lorenz

This comment has been minimized.

Show comment
Hide comment
@lorenz

lorenz Apr 18, 2017

Full trace:

2017-04-18T03:11:45.562113969Z time="2017-04-18T03:11:45Z" level=debug msg="Transaction a751858a-c444-4675-acdf-c057f65efa5d begins" 
2017-04-18T03:11:45.939267691Z time="2017-04-18T03:11:45Z" level=debug msg="Challenge GetCertificate grafana.kube.[redacted].org" 
2017-04-18T03:11:45.939292422Z time="2017-04-18T03:11:45Z" level=debug msg="Loading ACME certificates [grafana.kube.[redacted].org]..." 
2017-04-18T03:11:45.939295958Z legolog: 2017/04/18 03:11:45 [INFO][grafana.kube.[redacted].org] acme: Obtaining bundled SAN certificate
2017-04-18T03:11:46.255160863Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] AuthURL: https://acme-v01.api.letsencrypt.org/acme/authz/[redacted]
2017-04-18T03:11:46.255181716Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] acme: Could not find solver for: dns-01
2017-04-18T03:11:46.255184646Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] acme: Trying to solve TLS-SNI-01
2017-04-18T03:11:46.255187033Z time="2017-04-18T03:11:46Z" level=debug msg="Challenge Present grafana.kube.[redacted].org" 

I looked through the code, it seems like it gets stuck in the acme code after here, which calls this traefik function.

At that point we might get stuck in either the KV transaction, the Golang lock or if none of that happens, execution jumps back to here , which might loop forever if the request is permanently pending. Because after validate() has completed the defer hook for CleanUp() is run (which would log a message), we know that validate() never terminates.

When I get more time I might insert some print statements to see exactly where it gets stuck.

lorenz commented Apr 18, 2017

Full trace:

2017-04-18T03:11:45.562113969Z time="2017-04-18T03:11:45Z" level=debug msg="Transaction a751858a-c444-4675-acdf-c057f65efa5d begins" 
2017-04-18T03:11:45.939267691Z time="2017-04-18T03:11:45Z" level=debug msg="Challenge GetCertificate grafana.kube.[redacted].org" 
2017-04-18T03:11:45.939292422Z time="2017-04-18T03:11:45Z" level=debug msg="Loading ACME certificates [grafana.kube.[redacted].org]..." 
2017-04-18T03:11:45.939295958Z legolog: 2017/04/18 03:11:45 [INFO][grafana.kube.[redacted].org] acme: Obtaining bundled SAN certificate
2017-04-18T03:11:46.255160863Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] AuthURL: https://acme-v01.api.letsencrypt.org/acme/authz/[redacted]
2017-04-18T03:11:46.255181716Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] acme: Could not find solver for: dns-01
2017-04-18T03:11:46.255184646Z legolog: 2017/04/18 03:11:46 [INFO][grafana.kube.[redacted].org] acme: Trying to solve TLS-SNI-01
2017-04-18T03:11:46.255187033Z time="2017-04-18T03:11:46Z" level=debug msg="Challenge Present grafana.kube.[redacted].org" 

I looked through the code, it seems like it gets stuck in the acme code after here, which calls this traefik function.

At that point we might get stuck in either the KV transaction, the Golang lock or if none of that happens, execution jumps back to here , which might loop forever if the request is permanently pending. Because after validate() has completed the defer hook for CleanUp() is run (which would log a message), we know that validate() never terminates.

When I get more time I might insert some print statements to see exactly where it gets stuck.

@lorenz

This comment has been minimized.

Show comment
Hide comment
@lorenz

lorenz Apr 18, 2017

Narrowed it down to the etcd transaction, adding further debug output

lorenz commented Apr 18, 2017

Narrowed it down to the etcd transaction, adding further debug output

@apognu

This comment has been minimized.

Show comment
Hide comment
@apognu

apognu Apr 18, 2017

My logs are a lot less verbose than yours (or fail before yours):

level=debug msg="Building ACME client..."
level=info msg="Starting ACME renew job..."
level=debug msg="Challenge GetCertificate redacted.com" 
level=debug msg="ACME got nothing redacted.com" 

Is there anything I could do to get more verbose logs?

apognu commented Apr 18, 2017

My logs are a lot less verbose than yours (or fail before yours):

level=debug msg="Building ACME client..."
level=info msg="Starting ACME renew job..."
level=debug msg="Challenge GetCertificate redacted.com" 
level=debug msg="ACME got nothing redacted.com" 

Is there anything I could do to get more verbose logs?

@lorenz

This comment has been minimized.

Show comment
Hide comment
@lorenz

lorenz Apr 18, 2017

You can enable debug logs in etcd for both traefik and acme. Set /traefik/loglevel to DEBUG and /traefik/acme/acmelogging to true.
I'm currently running a custom build with even more logging to figure out why the etcd transaction fails.

lorenz commented Apr 18, 2017

You can enable debug logs in etcd for both traefik and acme. Set /traefik/loglevel to DEBUG and /traefik/acme/acmelogging to true.
I'm currently running a custom build with even more logging to figure out why the etcd transaction fails.

@apognu

This comment has been minimized.

Show comment
Hide comment
@apognu

apognu Apr 18, 2017

Both are already set, /traefik/debug and /traefik/acme/acmelogging are true, and /traefik/loglevel is set to DEBUG.

apognu commented Apr 18, 2017

Both are already set, /traefik/debug and /traefik/acme/acmelogging are true, and /traefik/loglevel is set to DEBUG.

@lorenz

This comment has been minimized.

Show comment
Hide comment
@lorenz

lorenz Apr 18, 2017

You might need to also set the debug flag in the configuration file and not just etcd. The above log output was produced by a stock traefik with these flags.

lorenz commented Apr 18, 2017

You might need to also set the debug flag in the configuration file and not just etcd. The above log output was produced by a stock traefik with these flags.

@apognu

This comment has been minimized.

Show comment
Hide comment
@apognu

apognu Apr 19, 2017

Yeah, that doesn't change anything, even with debug and log options set on the command line. The only log line I see regarding ACME is the Challenge GetCertificate one.

For reference, I got traefik running in k8s pods, on GCE, with the following ACME config in etcd:

/ # etcdctl get /traefik/acme/email
redacted@redacted.com
/ # etcdctl get /traefik/acme/ondemand
true
/ # etcdctl get /traefik/acme/onhostrule
true
/ # etcdctl get /traefik/acme/storage
traefik/acme/account
/ # etcdctl get /traefik/acme/entrypoint
https
/ # etcdctl get /traefik/acme/domains/0/main
redacted.com

Everything was working fine when configuring through the config file and storing LE certificates in a plain file. So I guess this might be related to etcd as well.

apognu commented Apr 19, 2017

Yeah, that doesn't change anything, even with debug and log options set on the command line. The only log line I see regarding ACME is the Challenge GetCertificate one.

For reference, I got traefik running in k8s pods, on GCE, with the following ACME config in etcd:

/ # etcdctl get /traefik/acme/email
redacted@redacted.com
/ # etcdctl get /traefik/acme/ondemand
true
/ # etcdctl get /traefik/acme/onhostrule
true
/ # etcdctl get /traefik/acme/storage
traefik/acme/account
/ # etcdctl get /traefik/acme/entrypoint
https
/ # etcdctl get /traefik/acme/domains/0/main
redacted.com

Everything was working fine when configuring through the config file and storing LE certificates in a plain file. So I guess this might be related to etcd as well.

@apognu

This comment has been minimized.

Show comment
Hide comment
@apognu

apognu Apr 20, 2017

I deployed a freshly built binary of traefik (through make build) and here the logs show up. It seems to be stuck at the exact same point as @lorenz (Challenge Present redacted.com).

apognu commented Apr 20, 2017

I deployed a freshly built binary of traefik (through make build) and here the logs show up. It seems to be stuck at the exact same point as @lorenz (Challenge Present redacted.com).

@lorenz

This comment has been minimized.

Show comment
Hide comment
@lorenz

lorenz Apr 20, 2017

Well I debugged the whole thing, it essentially boils down to the mentioned docker/libkv#106 issue. I started the etcdv3 upgrade for libkv to actually have proper locking, but the libkv interface maps badly onto etcdv3 concurrency primitives. I'll see what I'm going to do, maybe I'm also going to rewrite cluster.Store, since the original interface abstraction would make a ridiculous number of etcdv3 calls for an essentially simple operation.

lorenz commented Apr 20, 2017

Well I debugged the whole thing, it essentially boils down to the mentioned docker/libkv#106 issue. I started the etcdv3 upgrade for libkv to actually have proper locking, but the libkv interface maps badly onto etcdv3 concurrency primitives. I'll see what I'm going to do, maybe I'm also going to rewrite cluster.Store, since the original interface abstraction would make a ridiculous number of etcdv3 calls for an essentially simple operation.

@ldez ldez added kind/bug/confirmed and removed bug labels Apr 25, 2017

@ldez ldez added priority/P2 and removed priority/P1 labels May 30, 2017

@kelchm

This comment has been minimized.

Show comment
Hide comment
@kelchm

kelchm Aug 5, 2017

Contributor

Ran into this issue today. Might be worth noting the issues with etcd in the docs somewhere

Contributor

kelchm commented Aug 5, 2017

Ran into this issue today. Might be worth noting the issues with etcd in the docs somewhere

@trusch

This comment has been minimized.

Show comment
Hide comment
@trusch

trusch Aug 16, 2017

+1 on this. Is there a good workaround?

trusch commented Aug 16, 2017

+1 on this. Is there a good workaround?

@ldez ldez added the area/cluster label Aug 16, 2017

@kelchm

This comment has been minimized.

Show comment
Hide comment
@kelchm

kelchm Aug 16, 2017

Contributor

@trusch, the only workaround I've found is to just use Consul instead. It seems to be the only key/value provider that works correctly with ACME.

Here's a rough proof of concept that will get Consul up and running in Rancher.

Contributor

kelchm commented Aug 16, 2017

@trusch, the only workaround I've found is to just use Consul instead. It seems to be the only key/value provider that works correctly with ACME.

Here's a rough proof of concept that will get Consul up and running in Rancher.

@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Aug 17, 2017

I'm catching up on this issue, trying to slowly get the libkv repository back on track. In the meantime I'll start working on the etcd v3 backend which should hopefully fix the problem (I see it has a proper Mutex API now which was not the case when I originally opened docker/libkv#103 and docker/libkv#106).

As @lorenz mentioned above, the interface probably maps badly with the new lock implementation but I'll be aiming for something that works at first. We can later discuss about a possible interface change to better map with new primitives.

I'll update as soon as I have some results.

abronan commented Aug 17, 2017

I'm catching up on this issue, trying to slowly get the libkv repository back on track. In the meantime I'll start working on the etcd v3 backend which should hopefully fix the problem (I see it has a proper Mutex API now which was not the case when I originally opened docker/libkv#103 and docker/libkv#106).

As @lorenz mentioned above, the interface probably maps badly with the new lock implementation but I'll be aiming for something that works at first. We can later discuss about a possible interface change to better map with new primitives.

I'll update as soon as I have some results.

@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Aug 18, 2017

Update: I got the Lock working with the v3 client (using the same model as for zookeeper: a side-key for the lock). It works as intended and it's pretty clean now compared to v2, just a few lines of code.

What's left:

  • Finish, Polish and Test the v3 backend: I'm missing calls like Watch/WatchTree and List/DeleteTree.
  • I realized I can also probably fix the v2 custom lock feature using the same method with a side-key (Yay!)

Will update again as soon as I drop a PR.

abronan commented Aug 18, 2017

Update: I got the Lock working with the v3 client (using the same model as for zookeeper: a side-key for the lock). It works as intended and it's pretty clean now compared to v2, just a few lines of code.

What's left:

  • Finish, Polish and Test the v3 backend: I'm missing calls like Watch/WatchTree and List/DeleteTree.
  • I realized I can also probably fix the v2 custom lock feature using the same method with a side-key (Yay!)

Will update again as soon as I drop a PR.

@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Aug 20, 2017

Initial PR for etcd V3 is up: abronan/libkv#4

I would appreciate some help on testing in real usage (and with traefik maybe?). Will wait for feedback and also (re)-reviewing myself before merging.


PR for etcd v2 lock up as well: abronan/libkv#5

abronan commented Aug 20, 2017

Initial PR for etcd V3 is up: abronan/libkv#4

I would appreciate some help on testing in real usage (and with traefik maybe?). Will wait for feedback and also (re)-reviewing myself before merging.


PR for etcd v2 lock up as well: abronan/libkv#5

trusch added a commit to trusch/traefik that referenced this issue Aug 21, 2017

[vendor] use abronan/libkv instead of docker/libkv;
docker/libkv is dead and abronan/libkv is the repository where development is happening.
this solves #1975 and therefore hopefully #926 once the etcdv2 fix is merged.
this also adds support for redis and etcdv3 storage backends.
@trusch

This comment has been minimized.

Show comment
Hide comment
@trusch

trusch Aug 29, 2017

everything is merged into abronan/libkv! I hope this fixes will be integrated in near future :)

trusch commented Aug 29, 2017

everything is merged into abronan/libkv! I hope this fixes will be integrated in near future :)

@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Aug 29, 2017

Let me know if everything works out fine or if there are still remaining issues and I'll take a look. Feel free to open an issue on the fork as well. I may create a new repository as some have issues with import paths and the fact that it still points to the docker/libkv repository (some package managers support aliasing, some not). But we'll migrate issues if we go forward with that approach.

abronan commented Aug 29, 2017

Let me know if everything works out fine or if there are still remaining issues and I'll take a look. Feel free to open an issue on the fork as well. I may create a new repository as some have issues with import paths and the fact that it still points to the docker/libkv repository (some package managers support aliasing, some not). But we'll migrate issues if we go forward with that approach.

@nmengin nmengin referenced this issue Nov 14, 2017

Merged

Support Etcd v3, enhance KV support #2407

1 of 1 task complete

@traefiker traefiker closed this in #2407 Nov 17, 2017

@traefiker traefiker added this to the 1.5 milestone Nov 17, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment