2.0 is released, is cluster HA ready ? #5426

Closed
Jamlee opened this issue Sep 19, 2019 · 33 comments

Comments

@Jamlee Jamlee commented Sep 19, 2019

2.0 is released, is cluster HA ready ?

Contributor

@bradjones1 bradjones1 commented Sep 19, 2019

My understanding is that, while it hasn't been officially summarized in one place, clustering is a major selling point for EE and may not be supported in CE 2.x, though that itself would need to be confirmed.

I recently received a note from a Containous sales rep putting it this way:

TraefikEE is a clustered and highly available version of Traefik CE.
You are right there were some issues with the HA mode in Traefik 1.X, that's why we deprecated it and decided to work on it on the EE edition.

There are indeed a number of not-obvious bugs in the 1.x HA implementation, especially with respect to ACME/Let's Encrypt: #4851 #3487 #5047 #3833 and likely others. While it's entirely possible to run Traefik 1.x in production with HA, you do need to navigate these not-insignificant issues.

The most definitive statement I've found regarding clustering in 2.x is in this post in the discussion forum: it will be an EE feature.

To be clear, I am not disputing Containous' right to decide what is and is not in their CE vs EE product, but I think it's worth summarizing the information available and perhaps for a note about the loss of HA support in 2.x to be made more prominent in the documentation.

bradjones1 added a commit to bradjones1/traefik that referenced this issue Sep 19, 2019

Contributor

@bradjones1 bradjones1 commented Sep 19, 2019

By way of example where the docs could be updated, there is a note:

For concurrency reason, this file cannot be shared across multiple instances of Traefik. Use a key value store entry instead.

on https://docs.traefik.io/https/acme/#storage - if this is not an option, it should probably be removed. I opened a PR for this.
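One plausible failure mode behind that warning is a lost update. This is an assumption for illustration; the documentation quoted above does not spell out the exact reason. If each instance reads the file, adds its own certificate, and writes the whole file back, the last writer silently discards the other's change. The sketch below simulates that interleaving with an in-memory stand-in for the file; `SharedFile` and `simulate_lost_update` are hypothetical names, not Traefik code.

```python
import json


class SharedFile:
    """In-memory stand-in for a JSON file on a shared volume (hypothetical)."""

    def __init__(self):
        self.content = json.dumps({"certificates": []})

    def read(self):
        return json.loads(self.content)

    def write(self, data):
        # Whole-file overwrite, as a naive writer would do.
        self.content = json.dumps(data)


def simulate_lost_update():
    shared = SharedFile()

    # Both instances load the file before either has written.
    view_a = shared.read()
    view_b = shared.read()

    # Each appends the certificate it just obtained to its own stale copy.
    view_a["certificates"].append("a.example.com")
    view_b["certificates"].append("b.example.com")

    # Instance A writes first, then B overwrites the file with its copy.
    shared.write(view_a)
    shared.write(view_b)

    return shared.read()["certificates"]
```

After `simulate_lost_update()` runs, only `b.example.com` remains in the file; the certificate written by instance A is gone.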

Contributor

@SantoDE SantoDE commented Sep 20, 2019

Thanks a lot for your question.

As already mentioned on the Community Forums (https://community.containo.us/t/please-explain-distributed-lets-encrypt-feature-of-traefikee/872/2), our intent with Traefik 2.0 has been to make Traefik Community Edition a good old stateless data plane again.

We want Traefik to be the best at what it's designed for, which is providing a simple solution to get your applications reachable from the outside world.

Even if nothing stops you from deploying multiple Traefik 2.0 nodes at the same time to serve all your incoming requests, some components (such as Let's Encrypt) require synchronisation (on top of other things). This synchronisation mechanism requires proven production-ready algorithms and tools to manage your Traefik cluster (to keep things simple). For the Enterprise Edition we worked on a new architecture that enabled us to provide these mechanisms, and we leverage them to handle everything that is distributed. (You can find an overview of the Enterprise Edition features here: https://containo.us/traefikee/.)

Since these tools go beyond the premise stated above (Traefik should be a good old stateless data plane), and since the synchronisation mechanism used in v1 was unsatisfactory and painful to maintain, we're currently not planning to bring a distributed synchronisation to 2.0 Community Edition as the Enterprise Edition is doing that part very well.

Feel free to drop me an email if you have more questions about that matter: manuel@containo.us

Thanks!

@SantoDE SantoDE closed this Sep 20, 2019

@nabsul nabsul commented Sep 23, 2019

I find this very disappointing:

we're currently not planning to bring a distributed synchronisation to 2.0 Community Edition as the Enterprise Edition is doing that part very well.

I don't see how you can call Traefik "Cloud Native" if it can only run a single instance.

Contributor

@geraldcroes geraldcroes commented Sep 23, 2019

I don't see how you can call Traefik "Cloud Native" if it can only run a single instance.

Sorry if it was unclear.

You can run as many Traefik instances as you want on your clusters and these instances will all do their job for you.

Here, we're talking about a very specific feature (Let's Encrypt integration with synchronization across multiple instances that used to leverage external KV stores ... ).

Even without this feature in 2.0, we believe and confidently say Traefik is cloud native.

@nabsul nabsul commented Sep 24, 2019

If I'm understanding you correctly, if I want to deploy 3 replicas/instances of Traefik, I would just let them each request certs separately from LetsEncrypt?

@nabsul nabsul commented Sep 24, 2019

Well, no, that can't be right. Regardless of which auth mechanism I use (http/tls/dns), it feels like there would be a high probability of the instances stepping on each other's toes...
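To make that concern concrete for the HTTP challenge: under ACME, the CA fetches `/.well-known/acme-challenge/<token>` from the domain, and behind a load balancer that request can land on any replica. The toy model below is a sketch under that assumption, not Traefik code; `Replica`, `start_challenge`, `serve_challenge`, and the token values are hypothetical names chosen for illustration.

```python
class Replica:
    """Toy model of one proxy instance answering HTTP-01 challenges."""

    def __init__(self, name, token_store=None):
        self.name = name
        # Private store by default; pass a shared dict to model shared storage.
        self.tokens = {} if token_store is None else token_store

    def start_challenge(self, token, key_authorization):
        # The replica that requested the certificate records the expected response.
        self.tokens[token] = key_authorization

    def serve_challenge(self, path):
        # Answer the validation request for /.well-known/acme-challenge/<token>.
        prefix = "/.well-known/acme-challenge/"
        if not path.startswith(prefix):
            return 404, ""
        token = path[len(prefix):]
        if token in self.tokens:
            return 200, self.tokens[token]
        return 404, ""


def validation_results(replicas, token):
    """Status each replica would return if the load balancer picked it."""
    path = "/.well-known/acme-challenge/" + token
    return [replica.serve_challenge(path)[0] for replica in replicas]


# Independent replicas: only the one that started the challenge can answer.
independent = [Replica("r1"), Replica("r2"), Replica("r3")]
independent[0].start_challenge("tok123", "tok123.thumbprint")
independent_statuses = validation_results(independent, "tok123")

# Replicas sharing one token store: any of them can answer.
shared_store = {}
shared = [Replica(n, shared_store) for n in ("r1", "r2", "r3")]
shared[0].start_challenge("tok123", "tok123.thumbprint")
shared_statuses = validation_results(shared, "tok123")
```

With independent replicas, two out of three possible routes for the validation request return 404, which is the "stepping on each other's toes" scenario; with a shared store every replica can answer.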

Contributor

@bradjones1 bradjones1 commented Sep 24, 2019

@nabsul I think the point @geraldcroes is making is, Traefik is cloud-native in the sense that it can configure itself using cloud-native workflows; the component under discussion above is HA, which while related, is not a requirement to be "cloud-native." (Not that there's a single unified definition of that, anyway.)

@nabsul nabsul commented Sep 24, 2019

I respectfully disagree. As you said, the definition is a little fuzzy but I think HA is pretty crucial. For example in my case: I run a few very small personal sites on my Kubernetes cluster (personal blog, experiments for learning, etc.). If I use Traefik and its node goes down, then all those sites go down.

Obviously, none of the sites are mission critical and an outage of minutes/seconds while Kubernetes redeploys the container and storage volume to a new node isn't going to hurt anything. But I'm trying to learn how to deploy reliable services in K8s and I can't do that with Traefik.

@Coksnuss Coksnuss commented Sep 25, 2019

@nabsul
I am in a similar situation. My hope and wish is that - eventually - traefik can use a KV store to read/write certificates from/to but without the logic that would be needed for concurrent access (which would be exclusive to TraefikEE). Users of the CE edition that want to use traefik in HA-mode would then have to configure a designated instance of traefik to be the "master" instance which is responsible for updating/writing to the KV store while all other instances are configured in read-only mode.

In my mind this would not require too much effort to be implemented (once there is support for KV stores) but it opens the possibility for CE users to deploy a HA setup. This is certainly more complicated to maintain/configure but at least it won't exclude people like you and me from upgrading to version 2 just because the new version does not support this kind of setup anymore.
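The single-writer idea proposed here can be sketched roughly as follows. This is only an illustration of the proposal, not an existing Traefik feature: `KVStore` is an in-memory stand-in for an external key-value store, and `CertStore`, its `writable` flag, and the `certs/` key prefix are hypothetical.

```python
class KVStore:
    """In-memory stand-in for an external key-value store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class CertStore:
    """Certificate storage on top of a KV store, writable only if designated."""

    def __init__(self, kv, writable):
        self._kv = kv
        self._writable = writable

    def load(self, domain):
        return self._kv.get("certs/" + domain)

    def save(self, domain, pem):
        if not self._writable:
            # Read-only instances never write, so no concurrent-write logic is needed.
            raise PermissionError("this instance is configured read-only")
        self._kv.put("certs/" + domain, pem)


kv = KVStore()
writer = CertStore(kv, writable=True)    # the single designated "master"
reader = CertStore(kv, writable=False)   # every other instance

writer.save("example.com", "-----BEGIN CERTIFICATE-----...")
served = reader.load("example.com")
```

The trade-off is the one described above: the designated writer becomes a single point of failure for issuance and renewal, while serving traffic with already-issued certificates stays available on every instance.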

@nabsul nabsul commented Sep 25, 2019

In the meantime, I'm going to be exploring a few options:

  • See if other k8s ingress controllers provide this feature out of the box
  • Try out the more manual approach with cert-manager or lego
  • See if I can get my load balancer (DigitalOcean) to handle the Let's Encrypt certs without giving it full DNS control of my domain

For those who might be interested, I'll share my conclusions. Hopefully in a week or two.

Longer term (and depending on how painful the alternatives are), I also might try to find time to contribute this feature to Traefik :-)

@Vetal-ca Vetal-ca commented Oct 2, 2019

Essentially, for us it is a show stopper: we are working with 1.X and switching to something else, such as Istio.

@gentunian gentunian commented Oct 7, 2019

This is a blocker too.

It's such a shame that open-source Traefik benefited from the open-source community to gain its fame and glory, only to then drop the HA feature like this and force you to consider the Enterprise Edition if you previously used v1 with HA and want to update your Traefik version.

@prune998 prune998 commented Oct 8, 2019

When Istio (Envoy)/Linkerd and others do HA for FREE, I really think you're shooting yourself in the foot by keeping HA for enterprise.
You're doing a great job with Traefik, except for this specific one :)

@danassetms danassetms commented Oct 8, 2019

Member

@emilevauge emilevauge commented Oct 8, 2019

Hi everyone!

Based on the conversation happening here, there seems to be a lot of confusion and I want to emphasize that HA is still very much available in Traefik open source, nothing has changed, and this will never change.

Why? Because Traefik is stateless. So you can deploy as many instances as you like, and each instance will do its job (just like Envoy / Nginx / other solutions).

Also, KV stores providers will be re-introduced in 2.1 like some other missing providers.

Now, as far as the "distributed let's encrypt with KV stores" feature is concerned, it will continue to live in the 1.7 branch, and we will support the 1.7 branch for a year. This experimental feature has never been battle-proof and was a pain to maintain. Working on v2, we realised it was more of a hacky way of providing a distributed let's encrypt than a real solution. This is the reason why we decided to drop it in v2, to keep Traefik stateless and rock solid. Oh, and by the way, this specific Let's Encrypt feature has never been supported by Istio (Envoy) nor Linkerd so switching to these tools won't change anything on this topic.

Yes, the Enterprise Edition provides this feature, but only because its architecture is distributed by design, and "distributed Let's Encrypt" is only a bonus feature in the EE version, not the main selling point. I could list features that EE customers enjoy the most, but here is not the place.

TL;DR: We didn't remove HA from Traefik, we dropped a super specific (and buggy) synchronisation feature around Let's Encrypt. This decision was not business-driven but led by the engineering team to keep Traefik clean.

@prune998 prune998 commented Oct 9, 2019

Thanks for this clear answer @emilevauge

Istio supports dynamic SSL certs from Let's Encrypt by using CertManager and the "new" SDS API. This works for both internal and Ingress traffic using the Gateway. This is possible thanks to Envoy rolling update.

So it's a feature Traefik 2.0 CE will not have, maybe until someone contributes to the code. At least it's an opensource project right ? :)

My thinking is that this "feature" should be high in the TODO of the CE edition of Traefik. But it's just me... I'll stay on Istio until then.

Member

@emilevauge emilevauge commented Oct 9, 2019

CertManager is a separate project, and can be therefore used with Istio but also Traefik! I'm sorry but you are comparing apples to oranges.

At least it's an opensource project right ?

I will stop answering you from now. I really tried to expose transparently all the details.

@nabsul nabsul commented Oct 9, 2019

I'm starting to think that building Let's Encrypt integration into the ingress controller might be a bad idea anyways. Traefik's approach is super convenient, but it does this by storing the certs in a JSON file (or KV storage), while the "official/recommended" Kubernetes way of doing this is via secrets.

Additionally, certificate management is pretty different from load balancing. Cert management requires state and storage. It also doesn't need to be highly available (doesn't matter if your cert is renewed today or tomorrow if it's expiring in a week). High availability is however very important for a load balancer.

Maybe we should all be using CertManager with whatever ingress controller we decide on?
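The arithmetic behind the "today or tomorrow" point can be sketched as follows. The 30-day renewal window and the dates are assumptions chosen for illustration, and `needs_renewal` and `renewal_slack` are hypothetical helper names, not part of Traefik or CertManager.

```python
from datetime import datetime, timedelta


def needs_renewal(not_after, now, renew_before=timedelta(days=30)):
    """True once the certificate is inside the renewal window."""
    return now >= not_after - renew_before


def renewal_slack(not_after, now):
    """Time left before expiry: how long renewal can be unavailable without impact."""
    return max(not_after - now, timedelta(0))


# A certificate expiring in a week: renewal is due, but not urgent to the minute.
not_after = datetime(2019, 10, 16)
today = datetime(2019, 10, 9)
due = needs_renewal(not_after, today)
slack = renewal_slack(not_after, today)
```

Here the renewal component could be down for days before any visitor notices, whereas a load balancer outage is visible immediately, which is the asymmetry the comment describes.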

@gentunian gentunian commented Oct 9, 2019

Thanks for clarifying @emilevauge.

I think the confusion arises from tying HA to Let's Encrypt. It also arises because a feature was dropped. From your comments, I understand why you dropped it.

For people using it, having to rethink and rework certificate provisioning to stay up to date with new versions of Traefik is a big issue. That is why people like me have to plan for more work than just a migration.

What comes to my mind is to mount a shared volume for let's encrypt across all nodes until a better approach could be found.

@prune998 prune998 commented Oct 9, 2019

Well, I'm sorry you did not take my comments well, @emilevauge.

While I do love both apples and oranges, I may mix them sometimes.
I'm not against Traefik, which I've been using almost since it was created (alongside Istio). I'm sad/frustrated to see that some 1.x features are not there anymore while they are still part of the product if you buy the enterprise version.

Cert-Manager can be used with any proxy (with K8s) as it's an external application. Right. I'm just comparing features here. It's more work to set this up with Istio, but it does work. If you know how to make it work with Traefik 2.0, maybe you should answer, not me, but all the other persons reading this thread, so they can use Traefik 2.0 with the features they had with 1.7.

Now that the situation is clearer, I'll try to recap all this:

  • Traefik CE 2.0 will have HA (kv store for config) in an upcoming release
  • Traefik is not only for Kubernetes, so it may be convenient to embed Let's Encrypt "provider" for some use cases
  • You can use Traefik 1.7 + Let's Encrypt as you did previously (but KV storage is buggy, so they say), just don't use 2.0
  • You can go with (pay) Traefik Enterprise and skip reading this issue
  • You can use Cert-Manager to provide Let's Encrypt certificates and use them with Traefik 2.0 right now, as you would do with Istio

I'll investigate how to use Cert-Manager with Traefik CE 2.0 for Let's Encrypt SSL certs with auto-renew without traefik restart. I'll let you know my findings. Any comment/doc is welcome.

of course :

  • You can add this feature yourself and contribute to the project

Member

@emilevauge emilevauge commented Oct 9, 2019

@prune998

If you know how to make it work with Traefik 2.0, maybe you should answer

Here is an example, but it's in French and with Traefik 1.x:
https://www.cerenit.fr/blog/kubernetes-ovh-traefik-cert-manager-secrets/

You can make it work with 2.0 with the new CRD provider.

@LincolnBruce LincolnBruce commented Oct 14, 2019

looking forward to the next version ( 2.1 kv provider)

@foxos42 foxos42 commented Oct 14, 2019

I read "2.1 kv provider", but I don't read "distributed Let's Encrypt in kv store" support in CE. Hopefully I am wrong and Traefik at least reads the certs from this source. A little service that checks the labels, gets the cert and writes it into the kv store is not a big deal. I did it for ha-proxy and switched to Traefik because it was integrated... and now... boooring...

@nabsul nabsul commented Oct 17, 2019

Well, I gave nginx+CertManager a whirl and the on-boarding experience was terrible. I tried the official tutorial as well as the DigitalOcean one, and was not able to successfully set it all up.

I'm sure if I banged my head against it for long enough I'd figure it out, but that's a bad new user experience.

So I've decided to stick with Traefik. I'll live with a single instance for now (a few minutes of outage per month on average), and start looking at the source code to see if I can find a way to make it HA myself.

@lawliet89 lawliet89 commented Oct 31, 2019

@emilevauge and @prune998 I think a way to mitigate this would be to fix #5495 which allows users to have some external mechanism to provision and renew LE certificates while having a way to tell Traefik to reload the certificates.
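The general pattern suggested here, an external tool writing the certificate and the proxy picking up changes, can be sketched as below. This is a generic illustration and not how Traefik or #5495 actually work; `CertReloader`, `poll`, and the file name `tls.crt` are hypothetical, and the file contents are placeholders rather than real PEM data.

```python
import hashlib
import os
import tempfile


class CertReloader:
    """Reloads a certificate file only when its contents change (sketch)."""

    def __init__(self, path):
        self.path = path
        self.digest = None
        self.pem = None
        self.reloads = 0

    def poll(self):
        """Check the file once; return True if a new certificate was loaded."""
        with open(self.path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        if digest == self.digest:
            return False
        self.digest = digest
        self.pem = data
        self.reloads += 1
        return True


def demo():
    # An external tool (cert-manager, lego, ...) would write this file.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tls.crt")
        with open(path, "wb") as f:
            f.write(b"cert-v1")
        reloader = CertReloader(path)
        first = reloader.poll()      # initial load
        unchanged = reloader.poll()  # nothing new on disk
        with open(path, "wb") as f:
            f.write(b"cert-v2")      # simulated renewal
        renewed = reloader.poll()
        return first, unchanged, renewed, reloader.pem
```

Comparing a content hash rather than a modification time avoids missing a renewal that happens within the file system's timestamp granularity, at the cost of reading the file on every poll.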

Member

@emilevauge emilevauge commented Nov 6, 2019

We will add some documentation on how to set up Traefik and CertManager to help on this topic: #5792 :)

@aphistic aphistic commented Nov 6, 2019

@emilevauge does this mean that it's currently possible to use CertManager with the existing Traefik v2 releases, it's just a matter of documenting how to do it? I'm one of those that used the 1.7 Let's Encrypt sync and was sad to see it wasn't in v2 but I ended up using v2 anyway with just a single instance and a PersistentVolumeClaim for the key storage. It's annoying and it causes issues with availability so I'd love to use CertManager if there's a way to do it right now.

Contributor

@SantoDE SantoDE commented Nov 6, 2019

Yes, it is possible, @aphistic. My colleague @mmatur and I tested it today and it’s working fine. As @emilevauge mentioned, we will add docs to help with that.

Contributor

@bradjones1 bradjones1 commented Nov 12, 2019

@SantoDE Is there a PR for this against the docs? I would be happy to help create this if you have an outline worked up?

traefiker added a commit to bradjones1/traefik that referenced this issue Nov 14, 2019

@elthariel elthariel commented Nov 15, 2019

@aphistic it works pretty well with cert-manager. I recommend enabling the classical Kubernetes ingress to do so.

@emilevauge Probably a dumb question (#sorry), but wouldn't writing the certificates received from letsencrypt into a k8s secret be a solution to this issue ?

@gentunian gentunian commented Nov 16, 2019

@nabsul nabsul commented Nov 18, 2019

For those interested: I've decided to experiment with extending Traefik's Let's Encrypt integration to store certs and challenges in Azure Table storage instead of file/memory. So far I THINK this can work and allow deploying multiple instances of Traefik that share the storage.

A good chunk of code is written (but definitely not ready to run yet): master...nabsul:nabsul/add-cloud-storage

The refresh cert logic needs to be improved and the TLS challenge is not implemented at all. I hope to start testing in the next couple of weeks.

I also wrote about my reasoning around this approach here: https://nabeel.blog/2019/11/traefik/

I'll let you all know if this experiment succeeds or fails miserably. Also, I'm happy to take advice or feedback on details I might be forgetting while building this.

@containous containous locked and limited conversation to collaborators Dec 19, 2019