2.0 is released, is cluster HA ready ? #5426
Comments
My understanding is that while it hasn't been officially summarized all in one place, clustering is a major selling point for EE and may not be supported in CE 2.x, though that itself would need to be confirmed. I recently received a note from a Containous sales rep putting it this way:
There are indeed a number of not-obvious bugs in the 1.x HA implementation, especially with respect to ACME/Let's Encrypt: #4851 #3487 #5047 #3833 and likely others. While it's entirely possible to run Traefik 1.x in production with HA, you do need to navigate these not-insignificant issues. Per this post in the discussion forum, I've found the most definitive statements regarding clustering in 2.x, which is that it will be an EE feature. To be clear, I am not disputing Containous' right to decide what is and is not in their CE vs EE product, but I think it's worth summarizing the information available and perhaps for a note about the loss of HA support in 2.x to be made more prominent in the documentation. |
By way of example where the docs could be updated, there is a note:
on https://docs.traefik.io/https/acme/#storage - if this is not an option, it should probably be removed. I opened a PR for this. |
Thanks a lot for your question. As already mentioned on the Community Forums, (https://community.containo.us/t/please-explain-distributed-lets-encrypt-feature-of-traefikee/872/2), our intent with Traefik 2.0 has been to make Traefik Community Edition a good old stateless data plane again. We want Traefik to be the best at what it's designed for, which is providing a simple solution to get your applications reachable from the outside world. Even if nothing stops you from deploying multiple Traefik 2.0 nodes at the same time to serve all your incoming requests, some components (such as Let's Encrypt) require synchronisation (on top of other things). This synchronisation mechanism requires proven production-ready algorithms and tools to manage your Traefik cluster (to keep things simple). For the Enterprise Edition we worked on a new architecture that enabled us to provide these mechanisms, and we leverage them to handle everything that is distributed. (You can find an overview of the Enterprise Edition features here: https://containo.us/traefikee/.) Since these tools go beyond the premise stated above (Traefik should be a good old stateless data plane), and since the synchronisation mechanism used in v1 was unsatisfactory and painful to maintain, we're currently not planning to bring distributed synchronisation to 2.0 Community Edition, as the Enterprise Edition is doing that part very well. Feel free to drop me an email if you have more questions about that matter: manuel@containo.us Thanks! |
I find this very disappointing:
I don't see how you can call Traefik "Cloud Native" if it can only run a single instance. |
Sorry if it was unclear. You can run as many Traefik instances as you want on your clusters and these instances will all do their job for you. Here, we're talking about a very specific feature (Let's Encrypt integration with synchronization across multiple instances that used to leverage external KV stores ... ). Even without this feature in 2.0, we believe and confidently say Traefik is cloud native. |
If I'm understanding you correctly, if I want to deploy 3 replicas/instances of Traefik, I would just let them each request certs separately from LetsEncrypt? |
Well, no, that can't be right. Regardless of which auth mechanism I use (http/tls/dns), it feels like there would be a high probability of the instances stepping on each other's toes... |
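The concern above can be illustrated with a small simulation. This is only a sketch, not Traefik code: `Replica`, `validate`, and the token values are hypothetical names. It assumes an ACME HTTP-01 flow in which the CA fetches `/.well-known/acme-challenge/<token>` through a round-robin load balancer, and each replica keeps its challenge tokens in local memory only.

```python
from itertools import cycle


class Replica:
    """Hypothetical proxy instance that keeps ACME challenge tokens in local memory only."""

    def __init__(self, name):
        self.name = name
        self.tokens = {}  # token -> key authorization

    def start_challenge(self, token, key_authorization):
        # The replica that requested the certificate remembers the challenge.
        self.tokens[token] = key_authorization

    def serve(self, path):
        # Answer an HTTP-01 validation request, or 404 if the token is unknown here.
        prefix = "/.well-known/acme-challenge/"
        token = path[len(prefix):] if path.startswith(prefix) else None
        if token in self.tokens:
            return 200, self.tokens[token]
        return 404, ""


def validate(replicas, token, attempts):
    """Send `attempts` validation requests through a round-robin load balancer."""
    backends = cycle(replicas)
    return [next(backends).serve("/.well-known/acme-challenge/" + token)[0]
            for _ in range(attempts)]


replicas = [Replica("a"), Replica("b"), Replica("c")]
replicas[0].start_challenge("tok123", "tok123.thumbprint")
statuses = validate(replicas, "tok123", attempts=3)
print(statuses)  # only the request that lands on replica "a" succeeds
```

Under these assumptions, two out of three validation requests hit a replica that has never heard of the token, which is the "stepping on each other's toes" problem in its simplest form.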
@nabsul I think the point @geraldcroes is making is, Traefik is cloud-native in the sense that it can configure itself using cloud-native workflows; the component under discussion above is HA, which while related, is not a requirement to be "cloud-native." (Not that there's a single unified definition of that, anyway.) |
I respectfully disagree. As you said, the definition is a little fuzzy but I think HA is pretty crucial. For example in my case: I run a few very small personal sites on my Kubernetes cluster (personal blog, experiments for learning, etc.). If I use Traefik and its node goes down, then all those sites go down. Obviously, none of the sites are mission critical and an outage of minutes/seconds while Kubernetes redeploys the container and storage volume to a new node isn't going to hurt anything. But I'm trying to learn how to deploy reliable services in K8s and I can't do that with Traefik. |
@nabsul In my mind this would not require too much effort to implement (once there is support for KV stores), and it opens the possibility for CE users to deploy an HA setup. This is certainly more complicated to maintain/configure but at least it won't exclude people like you and me from upgrading to version 2 just because the new version does not support this kind of setup anymore. |
In the meantime, I'm going to be exploring a few options:
For those who might be interested, I'll share my conclusions. Hopefully in a week or two. Longer term (and depending on how painful the alternatives are), I also might try to find time to contribute this feature to Traefik :-) |
Essentially, for us it is a show stopper, working with 1.X and switching to something else, such as Istio. |
This is a blocker too. It's such a shame that open source Traefik benefited from the open source community to build its fame and glory, only to then drop the HA feature like this and force anyone who previously used v1 with HA to consider the Enterprise Edition if they want to upgrade their Traefik version. |
When Istio (Envoy)/Linkerd and others do HA for FREE, I really think you're shooting yourself in the foot by keeping HA for enterprise. |
+1 to this. It should be out of the box to be able to do this
When Istio (Envoy)/Linkerd and others do HA for FREE, I really think you're shooting yourself in the foot by keeping HA for enterprise.
You're doing a great job with Traefik, except for this specific one :)
|
Hi everyone! Based on the conversation happening here, there seems to be a lot of confusion and I want to emphasize that HA is still very much available in Traefik open source, nothing has changed, and this will never change. Why? Because Traefik is stateless. So you can deploy as many instances as you like, and each instance will do its job (just like Envoy / Nginx / other solutions). Also, KV store providers will be re-introduced in 2.1 like some other missing providers. Now, as far as the "distributed let's encrypt with KV stores" feature is concerned, it will continue to live in the 1.7 branch, and we will support the 1.7 branch for a year. This experimental feature has never been battle-tested and was a pain to maintain. Working on v2, we realised it was more of a hacky way of providing a distributed let's encrypt than a real solution. This is the reason why we decided to drop it in v2, to keep Traefik stateless and rock solid. Oh, and by the way, this specific Let's Encrypt feature has never been supported by Istio (Envoy) nor Linkerd so switching to these tools won't change anything on this topic. Yes, the Enterprise Edition provides this feature, but only because its architecture is distributed by design, and "distributed Let's Encrypt" is only a bonus feature in the EE version, not the main selling point. I could list features that EE customers enjoy the most, but here is not the place. TL;DR: We didn't remove HA from Traefik, we dropped a super specific (and buggy) synchronisation feature around Let's Encrypt. This decision was not business-driven but led by the engineering team to keep Traefik clean. |
Thanks for this clear answer @emilevauge. Istio supports dynamic SSL certs from Let's Encrypt by using CertManager and the "new" SDS API. This works for both internal and Ingress traffic using the Gateway. This is possible thanks to Envoy's rolling update. So it's a feature Traefik 2.0 CE will not have, maybe until someone contributes the code. At least it's an open source project, right? :) My thinking is that this "feature" should be high on the TODO list of the CE edition of Traefik. But it's just me... I'll stay on Istio until then. |
CertManager is a separate project, and can be therefore used with Istio but also Traefik! I'm sorry but you are comparing apples to oranges.
I will stop answering you from now. I really tried to expose transparently all the details. |
I'm starting to think that building Let's Encrypt integration into the ingress controller might be a bad idea anyways. Traefik's approach is super convenient, but it does this by storing the certs in a JSON file (or KV storage), while the "official/recommended" Kubernetes way of doing this is via secrets. Additionally, certificate management is pretty different from load balancing. Cert management requires state and storage. It also doesn't need to be highly available (doesn't matter if your cert is renewed today or tomorrow if it's expiring in a week). High availability is however very important for a load balancer. Maybe we should all be using CertManager with whatever ingress controller we decide on? |
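To make the "today or tomorrow" point concrete, here is a minimal sketch of a renewal check. The function names `needs_renewal` and `tolerable_outage`, the dates, and the 30-day renewal window are all assumptions chosen for illustration; nothing in this thread says Traefik or CertManager works exactly this way.

```python
from datetime import datetime, timedelta


def needs_renewal(not_after, now, window=timedelta(days=30)):
    """Return True once the certificate is within `window` of expiring."""
    return not_after - now <= window


def tolerable_outage(not_after, now):
    """How long the renewal process could be down before the cert actually expires."""
    return max(not_after - now, timedelta(0))


now = datetime(2019, 11, 1)
expires = datetime(2019, 11, 8)  # a certificate expiring in a week

print(needs_renewal(expires, now))     # True
print(tolerable_outage(expires, now))  # 7 days, 0:00:00
```

The renewal job can be unavailable for days without any visible effect, whereas the proxy serving traffic cannot, which is the asymmetry the comment describes.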
Thanks for clarifying @emilevauge. I think the confusion arises from tying HA to Let's Encrypt, and also from the fact that a feature was dropped. From your comments, I understand why you dropped it. For people using it, though, it's a big issue: they have to rethink and rework certificate provisioning in order to move to new versions of Traefik and stay up to date. That is why people like me have to plan for more work than just a migration. What comes to mind is mounting a shared volume for Let's Encrypt across all nodes until a better approach can be found. |
Well, I'm sorry you did not take my comments @emilevauge. While I do love both apples and oranges, I may mix them sometimes. Cert-Manager can be used with any proxy (with K8s) as it's an external application. Right. I'm just comparing features here. It's more work to set this up with Istio, but it does work. If you know how to make it work with Traefik 2.0, maybe you should answer, not me, but all the other persons reading this thread, so they can use Traefik 2.0 with the features they had with 1.7. Now the situation is cleared, I'll try to recap all this :
I'll investigate how to use Cert-Manager with Traefik CE 2.0 for Let's Encrypt SSL certs with auto-renew without traefik restart. I'll let you know my findings. Any comment/doc is welcome. of course :
|
Here is an example, but it's in French and with Traefik 1.x: You can make it work with 2.0 with the new CRD provider. |
looking forward to the next version ( 2.1 kv provider) |
I read "2.1 KV provider", but I don't read "distributed Let's Encrypt in KV store" support in CE. Hopefully I am wrong and Traefik at least reads the certs from this source. A little service that checks the labels, gets the cert and writes it into the KV store is not a big deal. I did it for HAProxy and switched to Traefik because it was integrated... and now... boooring... |
Well, I gave nginx+CertManager a whirl and the on-boarding experience was terrible. I tried the official tutorial as well as the DigitalOcean one, and was not able to successfully set it all up. I'm sure if I banged my head against it for long enough I'd figure it out, but that's a bad new user experience. So I've decided to stick with Traefik. I'll live with single instance for now (a few minutes outage per month on average), and start looking at the source code to see if I can find a way to make it HA myself. |
@emilevauge and @prune998 I think a way to mitigate this would be to fix #5495 which allows users to have some external mechanism to provision and renew LE certificates while having a way to tell Traefik to reload the certificates. |
We will add some documentation on how to set up Traefik and CertManager to help on this topic: #5792 :) |
@emilevauge does this mean that it's currently possible to use CertManager with the existing Traefik v2 releases, it's just a matter of documenting how to do it? I'm one of those that used the 1.7 Let's Encrypt sync and was sad to see it wasn't in v2 but I ended up using v2 anyway with just a single instance and a PersistentVolumeClaim for the key storage. It's annoying and it causes issues with availability so I'd love to use CertManager if there's a way to do it right now. |
Yes it is possible @aphistic. Me and my colleague @mmatur gave it a test today and it’s working fine. As @emilevauge mentioned, we will add docs to help with that |
@SantoDE Is there a PR for this against the docs? I would be happy to help create this if you have an outline worked up? |
@aphistic it works pretty well with cert-manager. I recommend enabling the classical Kubernetes ingress to do so. @emilevauge Probably a dumb question (#sorry), but wouldn't writing the certificates received from Let's Encrypt into a k8s secret be a solution to this issue? |
@emilevauge Probably a dumb question
(#sorry), but wouldn't writing the certificates received from letsencrypt
into a k8s secret be a solution to this issue ?
I don't think so. K8s is just one orchestrator among others.
|
For those interested: I've decided to experiment with extending Traefik's Let's Encrypt support to store certs and challenges in Azure Table storage instead of file/memory. So far I THINK this can work and allow deploying multiple instances of Traefik that share the storage. A good chunk of code is written (but definitely not ready to run yet): master...nabsul:nabsul/add-cloud-storage The refresh cert logic needs to be improved and the TLS challenge is not implemented at all. I hope to start testing in the next couple of weeks. I also wrote about my reasoning around this approach here: https://nabeel.blog/2019/11/traefik/ I'll let you all know if this experiment succeeds or fails miserably. Also, I'm happy to take advice or feedback on details I might be forgetting while building this. |
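The shared-storage idea described in the comment above could look roughly like the following. This is a hedged sketch only: `InMemorySharedStore` is an in-process stand-in, not the Azure Table storage SDK, and `Instance`, `try_issue`, and the `lease/` and `cert/` key names are hypothetical. It assumes the backend offers an atomic "insert if absent" operation so that only one instance requests a certificate for a given domain.

```python
import threading


class InMemorySharedStore:
    """Stand-in for a shared backend (e.g. a cloud table); NOT a real SDK client."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        # Atomic insert: returns True only for the first writer of `key`.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


class Instance:
    """Hypothetical proxy instance that reads and writes ACME state via the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def try_issue(self, domain):
        # Only the instance that wins the lease talks to the CA (faked here).
        if not self.store.put_if_absent("lease/" + domain, self.name):
            return False
        self.store.put("cert/" + domain, "certificate issued by " + self.name)
        return True

    def certificate(self, domain):
        return self.store.get("cert/" + domain)


store = InMemorySharedStore()
a, b = Instance("a", store), Instance("b", store)
results = [a.try_issue("example.com"), b.try_issue("example.com")]
print(results)                       # [True, False]
print(b.certificate("example.com"))  # certificate issued by a
```

The lease in this sketch never expires; a real implementation would presumably need lease expiry and renewal handling, which matches the comment's remark that the refresh logic still needs work.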
Hi all, we forgot to update this issue, but we have updated our doc on Traefik + Cert-Manager: https://docs.traefik.io/providers/kubernetes-ingress/#letsencrypt-support-with-the-ingress-provider |