For almost 23 sec, contention in VRRP state for a particular VIP instance, with all nodes reflecting VRRP_MASTER #1810
Logs:
192.168.101.1: Thu Dec 10 23:49:24 2020: Starting Keepalived v2.0.20 (01/22,2020)
192.168.101.2: Thu Dec 10 23:49:24 2020: Starting Keepalived v2.0.20 (01/22,2020)
192.168.101.3: Thu Dec 10 23:49:24 2020: Starting Keepalived v2.0.20 (01/22,2020)
@pqarmitage: Can you please take a look at this issue? I'd appreciate any help in this regard.
@rajivginotra The first thing you need to do is to set the priorities of the VRRP instances appropriately. With your current configuration, which vrrp instance is master is sometimes determined by which system has the higher IP address on the interface being used. So although the VRRP protocol allows the nodes of the same vrrp instance to share a priority, when the MASTER instance stops being master the other two nodes will both try to become master at the same time; the one with the lower IP address will then back off and revert to backup, and this will all cause some flapping. You can see this happening in the logs at 23:49:29, when vip_10.199.193.234 becomes master simultaneously on all three systems, and then the ones with IP addresses 10.199.193.231 and .232 drop back to backup when the advert is received from 10.199.193.233. If the priorities are different, then the next-highest-priority vrrp instance will take over as master cleanly. If you don't want one vrrp instance to take over as MASTER when another one is already MASTER, then use the

Somehow keepalived is starting up at exactly the same time (give or take a few milliseconds) on all three nodes, so an understanding of the environment this is all running in might be helpful. I am guessing here, but the PIDs of the keepalived processes are 36, 31 and 31, which makes me think that keepalived is starting at system boot time (or is it being run in containers, since the PIDs are so low?). We quite often see problems when that happens, due to the network taking time to settle down and pass traffic reliably, and that appears to be what is happening here. The reason that all three instances of vip_14.1.1.234 are in MASTER state is that none of them is seeing adverts from the other two nodes. At 23:49:53 traffic starts being received on the 14.1.1.0/24 network, and so the two nodes with lower IP addresses revert to backup. Using tcpdump or wireshark might help see what is happening.
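As a minimal sketch of the distinct-priority suggestion above (the values 150/100/50 are illustrative assumptions, not taken from the original configs), each node would give the same instance a different priority:

```
# On 14.1.1.231 -- pick a distinct priority per node,
# e.g. 100 on 14.1.1.232 and 50 on 14.1.1.233 (illustrative values)
vrrp_instance vip_14.1.1.234 {
    state BACKUP
    priority 150
    nopreempt
    advert_int 1
    # ... rest of the instance unchanged ...
}
```

With distinct priorities, when the master fails only the next-highest-priority node wins the election, avoiding the simultaneous-master flapping described above.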
RG> Thanks to @pqarmitage for the prompt response, as always. One interesting finding I should share: with the v2.0.18 image we don't see this issue; in the recent past we migrated to v2.0.20. And if you look at the keepalived.conf files above, we are already using the 'nopreempt' option, but yes, we can explore the 'priority' option.
**RG> So we are running a K8s cluster with 3 nodes, and each node has 3 interfaces, named management, enterprise, and cluster. All 3 interfaces have a VIP that keepalived manages, and we are running keepalived pods (as DaemonSets), so each of the 3 nodes runs one instance of the keepalived daemon inside a container. But I have one suspect case where keepalived is restarting at the same time; I will check on this and update you.**
RG> Already mentioned in the last point
RG> Sure, we will try to fetch the network dumps the next time we see this issue.
I am closing this issue now since there has been no update for over 1 month. If the problem recurs and you can post your network dumps, then we can reopen this issue if necessary.
Describe the bug
A clear and concise description of what the bug is.
We have 3 nodes in the cluster, and for the VRRP instance vip_14.1.1.234 we see that for more than 23 seconds there was no quorum: all 3 nodes reported VRRP MASTER for that instance.
Once the contention resolved, 192.168.101.3 became the VRRP master, but no state-change notification was delivered, so applications that depend on the VRRP MASTER state are left inconsistent.
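The notify script itself (/keepalivednotify.py) is not included in this issue, so the following is only a hypothetical sketch for illustration. keepalived invokes a notify script with positional arguments of the form INSTANCE|GROUP, the instance name, and the new state (MASTER, BACKUP or FAULT); a minimal transition logger built on that convention might look like:

```python
#!/usr/bin/env python3
# Hypothetical sketch of a keepalived notify script; the real
# /keepalivednotify.py used in this cluster is not shown in the issue.
# keepalived calls: <script> INSTANCE|GROUP <name> <MASTER|BACKUP|FAULT> [...]
import sys
from datetime import datetime

def handle_transition(argv):
    """Format a one-line record of a VRRP state transition."""
    if len(argv) < 4:
        return "usage: notify.py TYPE NAME STATE"
    kind, name, state = argv[1], argv[2], argv[3]
    stamp = datetime.now().isoformat(timespec="seconds")
    return f"{stamp} {kind} {name} -> {state}"

if __name__ == "__main__":
    # In production this line would typically append to a log file
    # or trigger the dependent application instead of printing.
    print(handle_transition(sys.argv))
```

Logging every invocation this way makes it easy to confirm whether keepalived actually fired the notify when the master changed at 23:49:53, or whether the transition happened without a notification.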
192.168.101.1 Node
root@maglev-master-192-168-101-2:~# cat /tmp/kp.log | grep Entering | grep 14.1.1.234
Thu Dec 10 23:49:25 2020: (vip_14.1.1.234) Entering BACKUP STATE
Thu Dec 10 23:49:30 2020: (vip_14.1.1.234) Entering MASTER STATE. ==========> Next VRRP state change after 23 Sec and all the 3 nodes declared as VRRP master
Thu Dec 10 23:49:53 2020: (vip_14.1.1.234) Entering BACKUP STATE
Thu Dec 10 23:49:56 2020: (vip_14.1.1.234) Entering FAULT STATE
Thu Dec 10 23:49:57 2020: (vip_14.1.1.234) Entering BACKUP STATE
192.168.101.2 Node
root@maglev-master-192-168-101-2:~# cat /tmp/kp.log | grep Entering | grep 14.1.1.234
Thu Dec 10 23:49:25 2020: (vip_14.1.1.234) Entering BACKUP STATE
Thu Dec 10 23:49:30 2020: (vip_14.1.1.234) Entering MASTER STATE ==========> Next VRRP state change after 23 Sec and all the 3 nodes declared as VRRP master
Thu Dec 10 23:49:53 2020: (vip_14.1.1.234) Entering BACKUP STATE
Thu Dec 10 23:49:56 2020: (vip_14.1.1.234) Entering FAULT STATE
Thu Dec 10 23:49:57 2020: (vip_14.1.1.234) Entering BACKUP STATE
192.168.101.3 Node
cat /tmp/kp.log | grep Entering | grep 14.1.1.234
Thu Dec 10 23:49:25 2020: (vip_14.1.1.234) Entering BACKUP STATE
Thu Dec 10 23:49:29 2020: (vip_14.1.1.234) Entering MASTER STATE ==========> Next VRRP state change after 23 Sec and all the 3 nodes declared as VRRP master
To Reproduce
Any steps necessary to reproduce the behaviour:
Expected behavior
A clear and concise description of what you expected to happen.
Keepalived version
Output of
keepalived -v
Keepalived v2.0.20 (01/22,2020)
Distro (please complete the following information):
Name: Ubuntu
Version: 16.04.1-Ubuntu
Architecture: x86_64
Linux 4.15.0-74-generic #83~16.04.1-Ubuntu SMP Wed Dec 18 04:56:23 UTC 2019 (built for Linux 4.4.211)
Details of any containerisation or hosted service (e.g. AWS)
If keepalived is being run in a container or on a hosted service, provide full details
Configuration file:
A full copy of the configuration file, obfuscated if necessary to protect passwords and IP addresses
192-168-101-1:/# cat /etc/keepalived/keepalived.conf
global_defs {
vrrp_version 3
vrrp_iptables MAGLEV-KEEPALIVED-VIP
enable_script_security
script_user keepalived_script
vrrp_garp_master_delay 40
vrrp_garp_master_refresh 60
}
vrrp_script node_health_check {
script "/node_health_check.py"
interval 60 # check every 60 seconds
timeout 40 # Script Timeout of 40 seconds
fall 3 # require 3 failures for FAULT Transition
}
vrrp_instance vip_10.199.193.234 {
state BACKUP
interface management
virtual_router_id 119
nopreempt
advert_int 1
track_interface {
management
}
virtual_ipaddress {
10.199.193.234 dev management scope global
}
unicast_src_ip 10.199.193.231
unicast_peer {
10.199.193.233
10.199.193.232
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
vrrp_instance vip_14.1.1.234 {
state BACKUP
interface enterprise
virtual_router_id 44
nopreempt
advert_int 1
track_interface {
enterprise
}
virtual_ipaddress {
14.1.1.234 dev enterprise scope global
}
unicast_src_ip 14.1.1.231
unicast_peer {
14.1.1.233
14.1.1.232
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
vrrp_instance vip_192.168.101.4 {
state BACKUP
interface cluster
virtual_router_id 41
nopreempt
advert_int 1
track_interface {
cluster
}
virtual_ipaddress {
192.168.101.4 dev cluster scope global
}
unicast_src_ip 192.168.101.1
unicast_peer {
192.168.101.3
192.168.101.2
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
192-168-101-2:/# cat /etc/keepalived/keepalived.conf
global_defs {
vrrp_version 3
vrrp_iptables MAGLEV-KEEPALIVED-VIP
enable_script_security
script_user keepalived_script
vrrp_garp_master_delay 40
vrrp_garp_master_refresh 60
}
vrrp_script node_health_check {
script "/node_health_check.py"
interval 60 # check every 60 seconds
timeout 40 # Script Timeout of 40 seconds
fall 3 # require 3 failures for FAULT Transition
}
vrrp_instance vip_10.199.193.234 {
state BACKUP
interface management
virtual_router_id 119
nopreempt
advert_int 1
track_interface {
management
}
virtual_ipaddress {
10.199.193.234 dev management scope global
}
unicast_src_ip 10.199.193.232
unicast_peer {
10.199.193.231
10.199.193.233
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
vrrp_instance vip_14.1.1.234 {
state BACKUP
interface enterprise
virtual_router_id 44
nopreempt
advert_int 1
track_interface {
enterprise
}
virtual_ipaddress {
14.1.1.234 dev enterprise scope global
}
unicast_src_ip 14.1.1.232
unicast_peer {
14.1.1.231
14.1.1.233
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
vrrp_instance vip_192.168.101.4 {
state BACKUP
interface cluster
virtual_router_id 41
nopreempt
advert_int 1
track_interface {
cluster
}
virtual_ipaddress {
192.168.101.4 dev cluster scope global
}
unicast_src_ip 192.168.101.2
unicast_peer {
192.168.101.1
192.168.101.3
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
192-168-101-3:/# cat /etc/keepalived/keepalived.conf
global_defs {
vrrp_version 3
vrrp_iptables MAGLEV-KEEPALIVED-VIP
enable_script_security
script_user keepalived_script
vrrp_garp_master_delay 40
vrrp_garp_master_refresh 60
}
vrrp_script node_health_check {
script "/node_health_check.py"
interval 60 # check every 60 seconds
timeout 40 # Script Timeout of 40 seconds
fall 3 # require 3 failures for FAULT Transition
}
vrrp_instance vip_10.199.193.234 {
state BACKUP
interface management
virtual_router_id 119
nopreempt
advert_int 1
track_interface {
management
}
virtual_ipaddress {
10.199.193.234 dev management scope global
}
unicast_src_ip 10.199.193.233
unicast_peer {
10.199.193.231
10.199.193.232
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
vrrp_instance vip_14.1.1.234 {
state BACKUP
interface enterprise
virtual_router_id 44
nopreempt
advert_int 1
track_interface {
enterprise
}
virtual_ipaddress {
14.1.1.234 dev enterprise scope global
}
unicast_src_ip 14.1.1.233
unicast_peer {
14.1.1.231
14.1.1.232
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
vrrp_instance vip_192.168.101.4 {
state BACKUP
interface cluster
virtual_router_id 41
nopreempt
advert_int 1
track_interface {
cluster
}
virtual_ipaddress {
192.168.101.4 dev cluster scope global
}
unicast_src_ip 192.168.101.3
unicast_peer {
192.168.101.1
192.168.101.2
}
track_script {
node_health_check
}
notify /keepalivednotify.py root
}
Notify and track scripts
If any notify or track scripts are in use, please provide copies of them
System Log entries
Full keepalived system log entries from when keepalived started
Did keepalived coredump?
If so, can you please provide a stacktrace from the coredump, using gdb.
Additional context
Add any other context about the problem here.