2.0 is released, is cluster HA ready ? #5426

Closed
Jamlee opened this issue Sep 19, 2019 · 34 comments

@Jamlee

Jamlee commented Sep 19, 2019

2.0 is released, is cluster HA ready ?

@bradjones1
Contributor

My understanding is that, while it hasn't been officially summarized all in one place, clustering is a major selling point for EE and may not be supported in CE 2.x, though that itself would need to be confirmed.

I recently received a note from a Containous sales rep putting it this way:

TraefikEE is a clustered and highly available version of Traefik CE.
You are right there were some issues with the HA mode in Traefik 1.X, that's why we deprecated it and decided to work on it on the EE edition.

There are indeed a number of not-obvious bugs in the 1.x HA implementation, especially with respect to ACME/Let's Encrypt: #4851 #3487 #5047 #3833 and likely others. While it's entirely possible to run Traefik 1.x in production with HA, you do need to navigate these not-insignificant issues.

This post in the discussion forum has the most definitive statement I've found regarding clustering in 2.x, which is that it will be an EE feature.

To be clear, I am not disputing Containous' right to decide what is and is not in their CE vs EE product, but I think it's worth summarizing the information available and perhaps for a note about the loss of HA support in 2.x to be made more prominent in the documentation.

@bradjones1
Contributor

By way of example where the docs could be updated, there is a note:

For concurrency reason, this file cannot be shared across multiple instances of Traefik. Use a key value store entry instead.

on https://docs.traefik.io/https/acme/#storage - if this is not an option, it should probably be removed. I opened a PR for this.

@SantoDE
Collaborator

SantoDE commented Sep 20, 2019

Thanks a lot for your question.

As already mentioned on the Community Forums, (https://community.containo.us/t/please-explain-distributed-lets-encrypt-feature-of-traefikee/872/2), our intent with Traefik 2.0 has been to make Traefik Community Edition a good old stateless data plane again.

We want Traefik to be the best at what it's designed for, which is providing a simple solution to get your applications reachable from the outside world.

Even if nothing stops you from deploying multiple Traefik 2.0 nodes at the same time to serve all your incoming requests, some components (such as Let's Encrypt) require synchronisation (on top of other things). This synchronisation mechanism requires proven production-ready algorithms and tools to manage your Traefik cluster (to keep things simple). For the Enterprise Edition we worked on a new architecture that enabled us to provide these mechanisms, and we leverage them to handle everything that is distributed. (You can find an overview of the Enterprise Edition features here: https://containo.us/traefikee/.)

Since these tools go beyond the premise stated above (Traefik should be a good old stateless data plane), and since the synchronisation mechanism used in v1 was unsatisfactory and painful to maintain, we're currently not planning to bring a distributed synchronisation to 2.0 Community Edition as the Enterprise Edition is doing that part very well.

Feel free to drop me an email if you have more questions about that matter: manuel@containo.us

Thanks!

@SantoDE SantoDE closed this as completed Sep 20, 2019
@nabsul

nabsul commented Sep 23, 2019

I find this very disappointing:

we're currently not planning to bring a distributed synchronisation to 2.0 Community Edition as the Enterprise Edition is doing that part very well.

I don't see how you can call Traefik "Cloud Native" if it can only run a single instance.

@geraldcroes
Contributor

I don't see how you can call Traefik "Cloud Native" if it can only run a single instance.

Sorry if it was unclear.

You can run as many Traefik instances as you want on your clusters and these instances will all do their job for you.

Here, we're talking about a very specific feature (Let's Encrypt integration with synchronization across multiple instances that used to leverage external KV stores ... ).

Even without this feature in 2.0, we believe and confidently say Traefik is cloud native.

@nabsul

nabsul commented Sep 24, 2019

If I'm understanding you correctly, if I want to deploy 3 replicas/instances of Traefik, I would just let them each request certs separately from LetsEncrypt?

@nabsul

nabsul commented Sep 24, 2019

Well, no, that can't be right. Regardless of which auth mechanism I use (http/tls/dns), it feels like there would be a high probability of the instances stepping on each other's toes...

@bradjones1
Contributor

@nabsul I think the point @geraldcroes is making is, Traefik is cloud-native in the sense that it can configure itself using cloud-native workflows; the component under discussion above is HA, which while related, is not a requirement to be "cloud-native." (Not that there's a single unified definition of that, anyway.)

@nabsul

nabsul commented Sep 24, 2019

I respectfully disagree. As you said, the definition is a little fuzzy but I think HA is pretty crucial. For example in my case: I run a few very small personal sites on my Kubernetes cluster (personal blog, experiments for learning, etc.). If I use Traefik and its node goes down, then all those sites go down.

Obviously, none of the sites are mission critical and an outage of minutes/seconds while Kubernetes redeploys the container and storage volume to a new node isn't going to hurt anything. But I'm trying to learn how to deploy reliable services in K8s and I can't do that with Traefik.

@schnz

schnz commented Sep 25, 2019

@nabsul
I am in a similar situation. My hope and wish is that - eventually - traefik can use a KV store to read/write certificates from/to but without the logic that would be needed for concurrent access (which would be exclusive to TraefikEE). Users of the CE edition that want to use traefik in HA-mode would then have to configure a designated instance of traefik to be the "master" instance which is responsible for updating/writing to the KV store while all other instances are configured in read-only mode.

In my mind this would not require too much effort to be implemented (once there is support for KV stores) but it opens the possibility for CE users to deploy a HA setup. This is certainly more complicated to maintain/configure but at least it won't exclude people like you and me from upgrading to version 2 just because the new version does not support this kind of setup anymore.
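
The single-writer idea described above can be sketched as follows. This is only an illustration, not an existing Traefik feature: `CertStore`, `Instance`, and the `writer` flag are hypothetical names, and a plain dictionary stands in for a real KV store such as Consul or etcd.

```python
class CertStore:
    """In-memory stand-in for a shared KV store (assumption: a real
    deployment would use something like Consul or etcd instead)."""

    def __init__(self):
        self._certs = {}

    def get(self, domain):
        return self._certs.get(domain)

    def put(self, domain, cert_pem):
        self._certs[domain] = cert_pem


class Instance:
    """Hypothetical proxy instance; only the designated writer may
    store certificates, every other instance is read-only."""

    def __init__(self, name, store, writer=False):
        self.name = name
        self.store = store
        self.writer = writer

    def load_cert(self, domain):
        # Any instance may read a certificate from the shared store.
        return self.store.get(domain)

    def save_cert(self, domain, cert_pem):
        # Refusing writes on replicas avoids concurrent updates entirely.
        if not self.writer:
            raise PermissionError(f"{self.name} is read-only")
        self.store.put(domain, cert_pem)
```

With one instance created as `writer=True` and the rest left at the default, only the designated instance can update a certificate while all of them can read it. The trade-off is that nothing gets written while that one instance is down.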

@nabsul

nabsul commented Sep 25, 2019

In the meantime, I'm going to be exploring a few options:

  • See if other k8s ingress controllers provide this feature out of the box
  • Try out the more manual approach with cert-manager or lego
  • See if I can get my load balancer (digital ocean) to handle the letsencrypt certs without giving it full DNS control of my domain

For those who might be interested, I'll share my conclusions. Hopefully in a week or two.

Longer term (and depending on how painful the alternatives are), I also might try to find time to contribute this feature to Traefik :-)

@Vetal-ca

Vetal-ca commented Oct 2, 2019

Essentially, for us it is a show stopper: we are staying on 1.X for now and switching to something else, such as Istio.

@gentunian

This is a blocker too.

Such a shame that open-source Traefik benefited from the open source community to build its fame and glory, only to then drop the HA feature like this and force anyone who previously used v1 with HA to consider the Enterprise Edition in order to update their Traefik version.

@prune998

prune998 commented Oct 8, 2019

When Istio (Envoy)/Linkerd and others do HA for FREE, I really think you're shooting yourself in the foot by keeping HA for enterprise.
You're doing a great job with Traefik, except for this specific one :)

@DanOrsborne

DanOrsborne commented Oct 8, 2019 via email

@emilevauge
Member

emilevauge commented Oct 8, 2019

Hi everyone!

Based on the conversation happening here, there seems to be a lot of confusion and I want to emphasize that HA is still very much available in Traefik open source, nothing has changed, and this will never change.

Why? Because Traefik is stateless. So you can deploy as many instances as you like, and each instance will do its job (just like Envoy / Nginx / other solutions).

Also, KV stores providers will be re-introduced in 2.1 like some other missing providers.

Now, as far as the "distributed let's encrypt with KV stores" feature is concerned, it will continue to live in the 1.7 branch, and we will support the 1.7 branch for a year. This experimental feature has never been battle-proof and was a pain to maintain. Working on v2, we realised it was more of a hacky way of providing a distributed let's encrypt than a real solution. This is the reason why we decided to drop it in v2, to keep Traefik stateless and rock solid. Oh, and by the way, this specific Let's Encrypt feature has never been supported by Istio (Envoy) nor Linkerd so switching to these tools won't change anything on this topic.

Yes, the Enterprise Edition provides this feature, but only because its architecture is distributed by design, and "distributed Let's Encrypt" is only a bonus feature in the EE version, not the main selling point. I could list features that EE customers enjoy the most, but here is not the place.

TL;DR: We didn't remove HA from Traefik, we dropped a super specific (and buggy) synchronisation feature around Let's Encrypt. This decision was not business-driven but led by the engineering team to keep Traefik clean.

@prune998

prune998 commented Oct 9, 2019

Thanks for this clear answer @emilevauge

Istio supports dynamic SSL certs from Let's Encrypt by using CertManager and the "new" SDS API. This works for both internal and Ingress traffic using the Gateway. This is possible thanks to Envoy rolling update.

So it's a feature Traefik 2.0 CE will not have, maybe until someone contributes to the code. At least it's an opensource project right ? :)

My thinking is that this "feature" should be high in the TODO of the CE edition of Traefik. But it's just me... I'll stay on Istio until then.

@emilevauge
Member

emilevauge commented Oct 9, 2019

CertManager is a separate project, and can therefore be used with Istio but also Traefik! I'm sorry but you are comparing apples to oranges.

At least it's an opensource project right ?

I will stop answering you from now. I really tried to expose transparently all the details.

@nabsul

nabsul commented Oct 9, 2019

I'm starting to think that building Let's Encrypt integration into the ingress controller might be a bad idea anyways. Traefik's approach is super convenient, but it does this by storing the certs in a JSON file (or KV storage), while the "official/recommended" Kubernetes way of doing this is via secrets.

Additionally, certificate management is pretty different from load balancing. Cert management requires state and storage. It also doesn't need to be highly available (doesn't matter if your cert is renewed today or tomorrow if it's expiring in a week). High availability is however very important for a load balancer.

Maybe we should all be using CertManager with whatever ingress controller we decide on?
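
As a rough illustration of why renewal does not need to be highly available: a renewal job only has to run at some point inside a window before expiry. The sketch below assumes a 30-day window and a hypothetical `needs_renewal` helper; neither comes from Traefik or CertManager.

```python
from datetime import datetime, timedelta, timezone

# Assumption: start renewing 30 days before expiry.
RENEW_BEFORE = timedelta(days=30)


def needs_renewal(not_after, now, renew_before=RENEW_BEFORE):
    """Return True once the certificate is inside its renewal window."""
    return not_after - now <= renew_before


expiry = datetime(2019, 12, 1, tzinfo=timezone.utc)
today = datetime(2019, 11, 24, tzinfo=timezone.utc)  # expires in a week
tomorrow = today + timedelta(days=1)

# Missing today's run is harmless: tomorrow's run still renews in time.
print(needs_renewal(expiry, today), needs_renewal(expiry, tomorrow))
```

A load balancer, by contrast, has to answer every request as it arrives, which is why the two concerns tolerate downtime so differently.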

@gentunian

Thanks for clarifying @emilevauge.

I think the confusion arises from tying HA to Let's Encrypt. It also comes from the fact that a feature was dropped. From your comments, I understand why you dropped it.

For people using it, it's a big issue to rethink and rework certificate provisioning in order to move to new versions of Traefik and stay updated. This is why people like me have to plan for more work than just a migration.

What comes to my mind is to mount a shared volume for let's encrypt across all nodes until a better approach could be found.

@prune998

prune998 commented Oct 9, 2019

Well, I'm sorry you did not take my comments well, @emilevauge.

While I do love both apples and oranges, I may mix them sometimes.
I'm not against Traefik, which I've been using almost since it was created (alongside Istio). I'm sad/frustrated to see that some features of 1.x are not there anymore while they are still part of the product if you buy the enterprise version.

Cert-Manager can be used with any proxy (with K8s) as it's an external application. Right. I'm just comparing features here. It's more work to set this up with Istio, but it does work. If you know how to make it work with Traefik 2.0, maybe you should answer, not me, but all the other persons reading this thread, so they can use Traefik 2.0 with the features they had with 1.7.

Now that the situation is clearer, I'll try to recap all this:

  • Traefik CE 2.0 will have HA (kv store for config) in an upcoming release
  • Traefik is not only for Kubernetes, so it may be convenient to embed Let's Encrypt "provider" for some use cases
  • You can use Traefik 1.7 + Let's Encrypt as you did previously (but KV storage is buggy, so they say), just don't use 2.0
  • You can go with (pay) Traefik Enterprise and skip reading this issue
  • You can use Cert-Manager to provide Let's Encrypt certificates and use them with Traefik 2.0 right now, as you would do with Istio

I'll investigate how to use Cert-Manager with Traefik CE 2.0 for Let's Encrypt SSL certs with auto-renew without traefik restart. I'll let you know my findings. Any comment/doc is welcome.

of course :

  • You can add this feature yourself and contribute to the project

@emilevauge
Member

@prune998

If you know how to make it work with Traefik 2.0, maybe you should answer

Here is an example, but it's in French and with Traefik 1.x:
https://www.cerenit.fr/blog/kubernetes-ovh-traefik-cert-manager-secrets/

You can make it work with 2.0 with the new CRD provider.

@LincolnBruce

LincolnBruce commented Oct 14, 2019

looking forward to the next version ( 2.1 kv provider)

@foxos42

foxos42 commented Oct 14, 2019

I read 2.1 kv provider, but I don't read distributed Let's Encrypt in kv store support in CE. Hopefully I am wrong and Traefik at least reads the certs from this source. A little service that checks the labels, gets the cert and writes it into the kv store is not a big deal. Did it for ha-proxy and switched to Traefik because it was integrated... and now... boooring...

@nabsul

nabsul commented Oct 17, 2019

Well, I gave nginx+CertManager a whirl and the on-boarding experience was terrible. I tried the official tutorial as well as the DigitalOcean one, and was not able to successfully set it all up.

I'm sure if I banged my head against it for long enough I'd figure it out, but that's a bad new user experience.

So I've decided to stick with Traefik. I'll live with a single instance for now (a few minutes outage per month on average), and start looking at the source code to see if I can find a way to make it HA myself.

@lawliet89

@emilevauge and @prune998 I think a way to mitigate this would be to fix #5495 which allows users to have some external mechanism to provision and renew LE certificates while having a way to tell Traefik to reload the certificates.
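
The "external mechanism plus reload" idea can be sketched generically as below. This is not Traefik's API and not what #5495 specifies; `CertReloader` and its `on_change` callback are hypothetical names used to show one way a process could pick up a certificate file that another tool renews on disk.

```python
import hashlib
from pathlib import Path


class CertReloader:
    """Detects changes to a certificate file by comparing content hashes."""

    def __init__(self, path, on_change):
        self.path = Path(path)
        self.on_change = on_change  # called with the new file contents
        self._digest = None

    def poll(self):
        """Re-read the file; invoke the callback only if it changed."""
        data = self.path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if digest == self._digest:
            return False
        self._digest = digest
        self.on_change(data)
        return True
```

Calling `poll()` periodically would let the serving process swap in a renewed certificate without a restart, while the renewal itself stays entirely outside the proxy.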

@emilevauge
Member

We will add some documentation on how to set up Traefik and CertManager to help on this topic: #5792 :)

@aphistic

aphistic commented Nov 6, 2019

@emilevauge does this mean that it's currently possible to use CertManager with the existing Traefik v2 releases, it's just a matter of documenting how to do it? I'm one of those that used the 1.7 Let's Encrypt sync and was sad to see it wasn't in v2 but I ended up using v2 anyway with just a single instance and a PersistentVolumeClaim for the key storage. It's annoying and it causes issues with availability so I'd love to use CertManager if there's a way to do it right now.

@SantoDE
Collaborator

SantoDE commented Nov 6, 2019

Yes it is possible @aphistic. Me and my colleague @mmatur gave it a test today and it’s working fine. As @emilevauge mentioned, we will add docs to help with that

@bradjones1
Contributor

@SantoDE Is there a PR for this against the docs? I would be happy to help create this if you have an outline worked up?

traefiker pushed a commit to bradjones1/traefik that referenced this issue Nov 14, 2019
@elthariel

@aphistic it works pretty well with cert-manager. I recommend enabling the classical Kubernetes ingress to do so.

@emilevauge Probably a dumb question (#sorry), but wouldn't writing the certificates received from letsencrypt into a k8s secret be a solution to this issue ?

@gentunian

gentunian commented Nov 16, 2019 via email

@nabsul

nabsul commented Nov 18, 2019

For those interested: I've decided to experiment with extending Traefik's lets encrypt to store certs and challenges in Azure Table storage instead of file/memory. So far I THINK this can work and allow deploying multiple instances of traefik that share the storage.

A good chunk of code is written (but definitely not ready to run yet): master...nabsul:nabsul/add-cloud-storage

The refresh cert logic needs to be improved and the TLS challenge is not implemented at all. I hope to start testing in the next couple of weeks.

I also wrote about my reasoning around this approach here: https://nabeel.blog/2019/11/traefik/

I'll let you all know if this experiment succeeds or fails miserably. Also, I'm happy to take advice or feedback on details I might be forgetting while building this.
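
To illustrate why sharing challenge state matters for multiple instances: with the ACME HTTP-01 challenge, the validation request for `/.well-known/acme-challenge/<token>` may reach any instance behind the load balancer, not only the one that requested the certificate. The sketch below is not the code from the branch above; `SharedStore` and `ChallengeResponder` are hypothetical names, and a dictionary stands in for the shared storage (Azure Table storage in the experiment described here).

```python
class SharedStore:
    """Stand-in for storage shared by all instances (assumption)."""

    def __init__(self):
        self._challenges = {}

    def put_challenge(self, token, key_authorization):
        self._challenges[token] = key_authorization

    def get_challenge(self, token):
        return self._challenges.get(token)


class ChallengeResponder:
    """Hypothetical per-instance handler for HTTP-01 validation requests."""

    PREFIX = "/.well-known/acme-challenge/"

    def __init__(self, store):
        self.store = store

    def start_challenge(self, token, key_authorization):
        # The instance requesting the certificate publishes the expected response.
        self.store.put_challenge(token, key_authorization)

    def handle_request(self, path):
        # Any instance can answer, because the lookup goes to shared storage.
        if not path.startswith(self.PREFIX):
            return 404, ""
        body = self.store.get_challenge(path[len(self.PREFIX):])
        return (200, body) if body is not None else (404, "")
```

If each instance kept challenges only in its own memory, a validation request routed to a different instance would get a 404 and the challenge would fail.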

@emilevauge
Member
