Traefik v2 loadBalancer not refreshed when endpoint becomes dead #5675
Comments
Hi! I'm Træfiker 🤖 the bot in charge of communication regulation. Thanks for your interest in Traefik! Issue templates help us help you by providing all necessary information. Please edit your issue and use the available templates: And remember: each time someone ignores the template, a cute little bunny dies. |
Output of
|
Hi @danperry72 thanks a lot, but this is still partial information. Are you able to reproduce this using a tool like Vagrant? Also, can you share the following information:
|
Hi @dduportal , apologies for the delay,
|
Hi @danperry72 , thanks a lot. I'm not implying at all that you did something fancy that could break the use case! But the number of different configurations that exist in the wild for a given OS makes it complicated for us to reproduce (and fix, if we can reproduce) such behavior. For instance, CentOS 7 kernel defaults differ between the official images in AWS EC2, Azure and GCE... So in order for us to analyze what is happening, we need the exact settings of your situation (or a reproduction case with Vagrant or a Terraform recipe, for instance).
Thanks! |
sysctl -a:
No errors in journalctl, or in dmesg since boot. systemd-resolved is not installed, running, or listed in systemctl. sssd_nss is running, as the machines are joined to AD for Kerberos authentication. No errors were found in its log files. We can set up a test machine without it and see if the issue persists with the default CentOS 7 resolver setup. |
Any update on this? |
Hello @Planktonette, I will close this issue in favor of #7354; as you will see, the issue lies in the actual behavior of the TCP routers. Anyway, thanks for your contribution and your interest in Traefik! Duplicate of #7354 |
Let's try this again; opening a new issue based on #5638 which mistakenly attributed the problem to DNS caching and was therefore closed with no further investigation. It would appear that the problem is actually that traefik has no handling for the proxy connection that it creates ending up dead when the remote end dies or is replaced/IP changes/etc.
Sometimes we have to redeploy kubernetes clusters and they come up with the same hostname but different IPs, leaving Traefik sending traffic to the wrong IP as it is maintaining the proxy and not detecting that the backend is dead.
Test is as follows:
Have an endpoint with DNS of long-k8ma-l001.domain.co.uk -> 10.9.10.11
Start Traefik, it creates a proxy to 10.9.10.11, and routing works fine.
Redeploy the VM in question with a new IP of 10.9.10.12
Send traffic to Traefik which continues to try and use the existing proxy long-k8ma-l001.domain.co.uk -> 10.9.10.11 resulting in a connection refused.
Restart Traefik, it starts a proxy to the new DNS mapping and routing now works fine
Dynamic config:
When instantiated, it appears some kind of proxy is created that maps to the loadBalancer.servers.address
This happens in https://github.com/containous/traefik/blob/4e9166759dca1a2e7bdba1780c6a08b655d20522/pkg/server/service/tcp/service.go#L62
The IP for this host is looked up in https://github.com/containous/traefik/blob/56e0580aa5ef5fc40a1969ec78014fef693c8a09/pkg/tcp/proxy.go#L19
Expected behavior: When the remote end dies or is rebuilt, the proxy gets timed out and a new one gets brought up.
Actual behavior: all connections to this load balancer fail for eternity (we left it for an hour and it was still broken) until traefik is restarted and a new connection is instantiated with the correct IP.
It does not appear that there is any support for a timeout or healthcheck for loadbalancer.servers.
Output of traefik version: (What version of Traefik are you using?)
What is your environment & configuration (arguments, toml, provider, platform, ...)?
Our Traefik is run on two CentOS 7 VMs with round-robin load balancing between the two; however, when isolating one for testing purposes, the issue still occurs.
We use PCS for managing the Virtual IPs for the VMs.
If applicable, please paste the log output in DEBUG level (--log.level=DEBUG switch)
When the VM is up and Traefik can resolve OLD_IP, we just get as it's configured to passthrough
Then we redeploy the VM, and its IP changes to NEW_IP and we just get this forever.
However a quick "host" command reveals that the OS can resolve the new IP just fine.