periodic master/backup flaps or spontaneous failovers #2220
Comments
I left a packet capture running; here you can see the VRRP packet flow and related keepalived log entries.
Capture on cloudgw2002-dev (208.80.153.189):
Logs on cloudgw2002-dev:
Capture on cloudgw2003-dev (208.80.153.188):
Logs on cloudgw2003-dev:
I've checked system logs, our internal datacenter logs, and even other system logs. There were no relevant operations during this event. I'd be happy to check anything else you may suggest.
Do you have packet captures that include the adverts being sent by the system as well as the adverts being received?
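For reference, VRRP adverts (IP protocol 112) can be captured on both peers with something like the following; the interface name is a placeholder:

```
# Capture VRRP adverts (IP protocol 112) in both directions.
# "eno1" stands in for the interface keepalived uses.
tcpdump -ni eno1 'ip proto 112'
```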
I believe I found the problem: a firewall misconfiguration on our side. The firewall is stateful, but we didn't have an explicit rule accepting the VRRP traffic from the other peer. When a node sends its own VRRP advert packet, it creates a local conntrack entry that the adverts from the remote peer can use. That conntrack entry would eventually expire (no original-direction traffic for too long), blocking any further incoming adverts. The advert timeout would then kick in, triggering the local keepalived to send its own VRRP advert again, reopening the conntrack hole and restarting the loop.

This scenario also explains why this wasn't always the case despite the setup being the same for years: it depends on which node is master and which node sends the advert first. And bonus point: this may partially explain the issues I've experienced in the past with #2032.

I just made the changes to fix the problem. I'll wait a few days, then come back to confirm that this had nothing to do with keepalived.
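For anyone hitting the same symptom: with a stateful ruleset you need an explicit accept for inbound VRRP rather than relying on the conntrack entry created by your own outbound adverts. A minimal nftables sketch, with a hypothetical chain layout and the peer address from this report standing in:

```
# Illustrative stateful input chain; 208.80.153.188 stands in for the VRRP peer.
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        # Without this rule, inbound adverts only pass while the conntrack
        # entry created by our own outbound adverts is still alive.
        ip saddr 208.80.153.188 ip protocol 112 accept  # 112 = VRRP
    }
}
```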
Months later: the setup is extremely stable. Clearly the firewall rule was missing.
Describe the bug
We're experiencing weird periodic master/backup flaps or spontaneous failovers.
We have 2 sets of servers in 2 different datacenters using the exact same configuration (via Puppet; only addresses/NICs differ), showing the exact same behavior.
The only unusual bit of this setup is that the `unicast_peer` route uses a Linux VRF (l3mdev), and the interface that keepalived uses (in `interface`) is also part of the VRF; see the sketch below.

The network is not down. The servers are mostly idle. There is no packet loss: we've tested sending 1M ICMP packets with 0% loss.
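For context, a minimal sketch of the kind of `vrrp_instance` involved; the interface, addresses, VRID and VIP here are placeholders, not the actual attached configs:

```
# Illustrative only: interface, addresses and virtual_router_id are placeholders.
vrrp_instance VI_1 {
    state BACKUP
    interface eno1            # interface enslaved to the VRF (l3mdev)
    virtual_router_id 51
    priority 100
    advert_int 1
    unicast_src_ip 208.80.153.189
    unicast_peer {
        208.80.153.188        # the other cloudgw node
    }
    virtual_ipaddress {
        208.80.153.190/29 dev eno1   # hypothetical VIP
    }
}
```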
I can be convinced this is not a bug in keepalived, and that something in our network triggers it from time to time. If so, I don't know what or…
To Reproduce
Start 2 daemons with the attached config.
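To reproduce in the foreground with full logging, an invocation like the following works (the config path is a placeholder):

```
# Run keepalived without forking, logging details to the console.
# "/etc/keepalived/keepalived.conf" stands in for the attached config.
keepalived --dont-fork --log-console --log-detail -f /etc/keepalived/keepalived.conf
```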
Expected behavior
No flaps.
Keepalived version
Distro (please complete the following information):
Details of any containerisation or hosted service (e.g. AWS)
None.
Configuration file:
Server A configuration:
Server B configuration:
Notify and track scripts
None.
System Log entries
Server A:
Server B:
Did keepalived coredump?
No
Additional context
Server A network config:
Server B network config:
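For illustration, a VRF (l3mdev) setup of the kind described looks roughly like this; the device name and table ID are hypothetical:

```
# Hypothetical names: "vrf-cloudgw" and table 10 are illustrative.
ip link add vrf-cloudgw type vrf table 10
ip link set vrf-cloudgw up
# Enslave the interface keepalived uses to the VRF.
ip link set eno1 master vrf-cloudgw
```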