sentinels votes for themselves #2243

Open
wangjn01 opened this issue Dec 24, 2014 · 7 comments

Comments

@wangjn01

I have 2 Redis instances and 3 Sentinels. Sometimes, when the master goes down, each Sentinel votes for itself, so no leader is elected and the failover is aborted.

sentinel 1 log

[17193] 24 Dec 20:12:07.864 # Sentinel runid is d2e257f5af8becf766f350139af3526efa5d2741
[17193] 24 Dec 20:12:07.864 # +monitor master mymaster 10.120.45.88 6379 quorum 1
[17193] 24 Dec 20:12:08.469 * +sentinel sentinel 10.120.45.89:26379 10.120.45.89 26379 @ mymaster 10.120.45.88 6379
[17193] 24 Dec 20:12:09.084 * +sentinel sentinel 10.120.45.88:26379 10.120.45.88 26379 @ mymaster 10.120.45.88 6379
[17193] 24 Dec 20:15:48.849 * +slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[17193] 24 Dec 20:16:51.401 # +new-epoch 289
[17193] 24 Dec 20:16:51.443 # +vote-for-leader 32f2d6d6bb62b7404263df90ca7ef23a64827276 289
[17193] 24 Dec 20:16:51.444 # +sdown master mymaster 10.120.45.88 6379
[17193] 24 Dec 20:16:51.444 # +odown master mymaster 10.120.45.88 6379 #quorum 1/1
[17193] 24 Dec 20:16:51.444 # Next failover delay: I will not start a failover before Wed Dec 24 20:22:52 2014
[17193] 24 Dec 20:16:52.293 # +config-update-from sentinel 10.120.45.88:26379 10.120.45.88 26379 @ mymaster 10.120.45.88 6379
[17193] 24 Dec 20:16:52.293 # +switch-master mymaster 10.120.45.88 6379 10.120.45.89 6379
[17193] 24 Dec 20:16:52.293 * +slave slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[17193] 24 Dec 20:16:53.344 # +sdown slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[17193] 24 Dec 20:17:34.985 # -sdown slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[17193] 24 Dec 20:18:07.375 # +sdown master mymaster 10.120.45.89 6379
[17193] 24 Dec 20:18:07.375 # +odown master mymaster 10.120.45.89 6379 #quorum 1/1
[17193] 24 Dec 20:18:07.375 # +new-epoch 290
[17193] 24 Dec 20:18:07.375 # +try-failover master mymaster 10.120.45.89 6379
[17193] 24 Dec 20:18:07.429 # +vote-for-leader d2e257f5af8becf766f350139af3526efa5d2741 290
[17193] 24 Dec 20:18:07.430 # 10.120.45.89:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 290
[17193] 24 Dec 20:18:07.430 # 10.120.45.88:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 290
[17193] 24 Dec 20:18:08.478 # +config-update-from sentinel 10.120.45.89:26379 10.120.45.89 26379 @ mymaster 10.120.45.89 6379
[17193] 24 Dec 20:18:08.478 # +switch-master mymaster 10.120.45.89 6379 10.120.45.88 6379
[17193] 24 Dec 20:18:08.479 * +slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[17193] 24 Dec 20:18:09.525 # +sdown slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[17193] 24 Dec 20:18:31.254 # -sdown slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[17193] 24 Dec 20:20:35.242 # +sdown master mymaster 10.120.45.88 6379
[17193] 24 Dec 20:20:35.243 # +odown master mymaster 10.120.45.88 6379 #quorum 1/1
[17193] 24 Dec 20:20:35.267 # +new-epoch 291
[17193] 24 Dec 20:20:35.267 # +try-failover master mymaster 10.120.45.88 6379
[17193] 24 Dec 20:20:35.315 # +vote-for-leader d2e257f5af8becf766f350139af3526efa5d2741 291
[17193] 24 Dec 20:20:35.316 # 10.120.45.88:26379 voted for 32f2d6d6bb62b7404263df90ca7ef23a64827276 291
[17193] 24 Dec 20:20:35.316 # 10.120.45.89:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 291
[17193] 24 Dec 20:20:45.616 # -failover-abort-not-elected master mymaster 10.120.45.88 6379
[17193] 24 Dec 20:20:45.688 # Next failover delay: I will not start a failover before Wed Dec 24 20:26:35 2014

sentinel 2 log

[6564] 24 Dec 20:11:20.096 # Sentinel runid is 32f2d6d6bb62b7404263df90ca7ef23a64827276
[6564] 24 Dec 20:11:20.096 # +monitor master mymaster 10.120.45.88 6379 quorum 1
[6564] 24 Dec 20:11:48.768 - Accepted 10.120.45.89:49820
[6564] 24 Dec 20:11:50.039 * +sentinel sentinel 10.120.45.89:26379 10.120.45.89 26379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:12:09.167 - Accepted 10.120.42.85:47526
[6564] 24 Dec 20:12:09.898 * +sentinel sentinel 10.120.42.85:26379 10.120.42.85 26379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:15:51.127 * +slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:51.199 # +sdown master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:51.200 # +odown master mymaster 10.120.45.88 6379 #quorum 1/1
[6564] 24 Dec 20:16:51.200 # +new-epoch 289
[6564] 24 Dec 20:16:51.200 # +try-failover master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:51.209 # +vote-for-leader 32f2d6d6bb62b7404263df90ca7ef23a64827276 289
[6564] 24 Dec 20:16:51.213 # 10.120.45.89:26379 voted for 32f2d6d6bb62b7404263df90ca7ef23a64827276 289
[6564] 24 Dec 20:16:51.272 # +elected-leader master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:51.272 # +failover-state-select-slave master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:51.363 # +selected-slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:51.363 * +failover-state-send-slaveof-noone slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:51.461 # 10.120.42.85:26379 voted for 32f2d6d6bb62b7404263df90ca7ef23a64827276 289
[6564] 24 Dec 20:16:51.464 * +failover-state-wait-promotion slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:52.228 - -role-change slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379 new reported role is master
[6564] 24 Dec 20:16:52.233 # +promoted-slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:52.233 # +failover-state-reconf-slaves master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:52.294 # +failover-end master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:16:52.294 # +switch-master mymaster 10.120.45.88 6379 10.120.45.89 6379
[6564] 24 Dec 20:16:52.294 * +slave slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[6564] 24 Dec 20:16:53.334 # +sdown slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[6564] 24 Dec 20:17:34.937 - -role-change slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379 new reported role is master
[6564] 24 Dec 20:17:35.003 # -sdown slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[6564] 24 Dec 20:17:44.978 - +role-change slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379 new reported role is slave
[6564] 24 Dec 20:18:07.417 # +new-epoch 290
[6564] 24 Dec 20:18:07.418 # +vote-for-leader ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 290
[6564] 24 Dec 20:18:07.423 # +sdown master mymaster 10.120.45.89 6379
[6564] 24 Dec 20:18:07.423 # +odown master mymaster 10.120.45.89 6379 #quorum 1/1
[6564] 24 Dec 20:18:07.423 # Next failover delay: I will not start a failover before Wed Dec 24 20:24:08 2014
[6564] 24 Dec 20:18:08.494 # +config-update-from sentinel 10.120.45.89:26379 10.120.45.89 26379 @ mymaster 10.120.45.89 6379
[6564] 24 Dec 20:18:08.494 # +switch-master mymaster 10.120.45.89 6379 10.120.45.88 6379
[6564] 24 Dec 20:18:08.494 * +slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:18:09.525 # +sdown slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:18:31.246 - -role-change slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379 new reported role is master
[6564] 24 Dec 20:18:31.308 # -sdown slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[6564] 24 Dec 20:18:41.249 - +role-change slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379 new reported role is slave
[6564] 24 Dec 20:20:35.287 # +sdown master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:20:35.287 # +odown master mymaster 10.120.45.88 6379 #quorum 1/1
[6564] 24 Dec 20:20:35.287 # +new-epoch 291
[6564] 24 Dec 20:20:35.287 # +try-failover master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:20:35.297 # +vote-for-leader 32f2d6d6bb62b7404263df90ca7ef23a64827276 291
[6564] 24 Dec 20:20:35.298 # 10.120.45.89:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 291
[6564] 24 Dec 20:20:35.332 # 10.120.42.85:26379 voted for d2e257f5af8becf766f350139af3526efa5d2741 291
[6564] 24 Dec 20:20:46.065 # -failover-abort-not-elected master mymaster 10.120.45.88 6379
[6564] 24 Dec 20:20:46.149 # Next failover delay: I will not start a failover before Wed Dec 24 20:26:36 2014

sentinel 3 log

[20847] 24 Dec 20:11:47.864 # Sentinel runid is ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd
[20847] 24 Dec 20:11:47.864 # +monitor master mymaster 10.120.45.88 6379 quorum 1
[20847] 24 Dec 20:11:48.730 * +sentinel sentinel 10.120.45.88:26379 10.120.45.88 26379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:12:09.902 * +sentinel sentinel 10.120.42.85:26379 10.120.42.85 26379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:15:48.960 * +slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:16:51.216 # +new-epoch 289
[20847] 24 Dec 20:16:51.217 # +vote-for-leader 32f2d6d6bb62b7404263df90ca7ef23a64827276 289
[20847] 24 Dec 20:16:51.223 # +sdown master mymaster 10.120.45.88 6379
[20847] 24 Dec 20:16:51.223 # +odown master mymaster 10.120.45.88 6379 #quorum 1/1
[20847] 24 Dec 20:16:51.223 # Next failover delay: I will not start a failover before Wed Dec 24 20:22:52 2014
[20847] 24 Dec 20:16:52.314 # +config-update-from sentinel 10.120.45.88:26379 10.120.45.88 26379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:16:52.314 # +switch-master mymaster 10.120.45.88 6379 10.120.45.89 6379
[20847] 24 Dec 20:16:52.314 * +slave slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:16:53.334 # +sdown slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:17:34.938 # -sdown slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:17:44.886 * +convert-to-slave slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:07.418 # +sdown master mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:07.418 # +odown master mymaster 10.120.45.89 6379 #quorum 1/1
[20847] 24 Dec 20:18:07.418 # +new-epoch 290
[20847] 24 Dec 20:18:07.418 # +try-failover master mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:07.419 # +vote-for-leader ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 290
[20847] 24 Dec 20:18:07.423 # 10.120.45.88:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 290
[20847] 24 Dec 20:18:07.451 # 10.120.42.85:26379 voted for d2e257f5af8becf766f350139af3526efa5d2741 290
[20847] 24 Dec 20:18:07.477 # +elected-leader master mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:07.477 # +failover-state-select-slave master mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:07.578 # +selected-slave slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:07.578 * +failover-state-send-slaveof-noone slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:07.662 * +failover-state-wait-promotion slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:08.435 # +promoted-slave slave 10.120.45.88:6379 10.120.45.88 6379 @ mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:08.435 # +failover-state-reconf-slaves master mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:08.496 # +failover-end master mymaster 10.120.45.89 6379
[20847] 24 Dec 20:18:08.496 # +switch-master mymaster 10.120.45.89 6379 10.120.45.88 6379
[20847] 24 Dec 20:18:08.496 * +slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:18:09.518 # +sdown slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:18:31.275 # -sdown slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:18:41.215 * +convert-to-slave slave 10.120.45.89:6379 10.120.45.89 6379 @ mymaster 10.120.45.88 6379
[20847] 24 Dec 20:20:35.270 # +sdown master mymaster 10.120.45.88 6379
[20847] 24 Dec 20:20:35.270 # +odown master mymaster 10.120.45.88 6379 #quorum 1/1
[20847] 24 Dec 20:20:35.270 # +new-epoch 291
[20847] 24 Dec 20:20:35.270 # +try-failover master mymaster 10.120.45.88 6379
[20847] 24 Dec 20:20:35.294 # +vote-for-leader ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 291
[20847] 24 Dec 20:20:35.302 # 10.120.45.88:26379 voted for 32f2d6d6bb62b7404263df90ca7ef23a64827276 291
[20847] 24 Dec 20:20:35.337 # 10.120.42.85:26379 voted for d2e257f5af8becf766f350139af3526efa5d2741 291
[20847] 24 Dec 20:20:45.577 # -failover-abort-not-elected master mymaster 10.120.45.88 6379
[20847] 24 Dec 20:20:45.641 # Next failover delay: I will not start a failover before Wed Dec 24 20:26:35 2014
@wangjn01 wangjn01 changed the title from "sentinels votes for each other" to "sentinels votes for themselves" on Dec 24, 2014
@KeepPeace

I ran into the same problem.
I have 2 virtual machines, A and B:
A: 192.168.163.91
redis (master) 6379
sentinel 26379: sentinel monitor master1 192.168.163.91 6379 2
sentinel 26380: sentinel monitor master1 192.168.163.91 6379 2
B: 192.168.163.90
redis (slave) 6379
sentinel 26379: sentinel monitor master1 192.168.163.91 6379 2
sentinel 26380: sentinel monitor master1 192.168.163.91 6379 2

When I kill all Redis processes on machine A ('pkill -9 redis'), B's Sentinels log the following:

log-26380--------------------------

[4761] 26 Dec 17:41:18.927 # +sdown sentinel 192.168.163.91:26380 192.168.163.91 26380 @ master1 192.168.163.91 6379
[4761] 26 Dec 17:41:19.046 # +sdown master master1 192.168.163.91 6379
[4761] 26 Dec 17:41:19.120 # +new-epoch 50
[4761] 26 Dec 17:41:19.122 # +vote-for-leader 6c46bafb874d7b183f02ca5c084e7637a1825c9e 50
[4761] 26 Dec 17:41:20.165 # +odown master master1 192.168.163.91 6379 #quorum 2/2
[4761] 26 Dec 17:41:20.165 # Next failover delay: I will not start a failover before Fri Dec 26 17:47:19 2014

log-26379--------------------------

[4760] 26 Dec 17:41:18.984 # +sdown sentinel 192.168.163.91:26380 192.168.163.91 26380 @ master1 192.168.163.91 6379
[4760] 26 Dec 17:41:18.985 # +sdown sentinel 192.168.163.91:26379 192.168.163.91 26379 @ master1 192.168.163.91 6379
[4760] 26 Dec 17:41:19.048 # +sdown master master1 192.168.163.91 6379
[4760] 26 Dec 17:41:19.111 # +odown master master1 192.168.163.91 6379 #quorum 2/2
[4760] 26 Dec 17:41:19.111 # +new-epoch 50
[4760] 26 Dec 17:41:19.111 # +try-failover master master1 192.168.163.91 6379
[4760] 26 Dec 17:41:19.117 # +vote-for-leader 6c46bafb874d7b183f02ca5c084e7637a1825c9e 50
[4760] 26 Dec 17:41:19.123 # 192.168.163.90:26380 voted for 6c46bafb874d7b183f02ca5c084e7637a1825c9e 50
[4760] 26 Dec 17:41:29.601 # -failover-abort-not-elected master master1 192.168.163.91 6379
[4760] 26 Dec 17:41:29.653 # Next failover delay: I will not start a failover before Fri Dec 26 17:47:19 2014
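
A possible reading of the two-machine log above, as a minimal sketch: it assumes the rule described in the Redis Sentinel documentation, that a leader needs votes from a majority of all known Sentinels as well as at least the configured quorum. The function names `votes_needed` and `can_elect` are hypothetical and are not part of Redis.

```python
def votes_needed(known_sentinels: int, quorum: int) -> int:
    """Votes a candidate needs under the assumed rule:
    a strict majority of all known Sentinels, and never fewer than the quorum."""
    majority = known_sentinels // 2 + 1
    return max(majority, quorum)


def can_elect(known_sentinels: int, reachable_sentinels: int, quorum: int) -> bool:
    """True if the reachable Sentinels alone could elect a leader."""
    return reachable_sentinels >= votes_needed(known_sentinels, quorum)


# Two-machine setup from the comment above: 4 Sentinels, quorum 2,
# and the 2 Sentinels on machine A are killed together with the master.
print(votes_needed(4, 2))   # 3
print(can_elect(4, 2, 2))   # False

# Setup from the original report: 3 Sentinels, quorum 1, all still running.
print(can_elect(3, 3, 1))   # True
```

Under that assumption, the two surviving Sentinels on B can both vote for 6c46baf... and still fall one vote short, which would match the `-failover-abort-not-elected` line. In the original report all three Sentinels are reachable, so that case has to be explained by something else.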

@mattsta
Contributor

mattsta commented Jan 9, 2015

At epoch 289, every Sentinel votes for the same leader:

[17193] 24 Dec 20:16:51.443 # +vote-for-leader 32f2d6d6bb62b7404263df90ca7ef23a64827276 289

At epoch 290, two Sentinels vote ccdbe and the other votes d2e257, so ccdbe is elected.

But at epoch 291, every sentinel votes only for itself (!):

[17193] 24 Dec 20:20:35.315 # +vote-for-leader d2e257f5af8becf766f350139af3526efa5d2741 291
[17193] 24 Dec 20:20:35.316 # 10.120.45.88:26379 voted for 32f2d6d6bb62b7404263df90ca7ef23a64827276 291
[17193] 24 Dec 20:20:35.316 # 10.120.45.89:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 291
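
The per-epoch comparison above can also be done mechanically. The following is a rough sketch that tallies votes from the lines of a single Sentinel log. The regular expressions are based only on the two line shapes visible in the logs in this thread, and `tally_votes` and `winner` are hypothetical helper names. For simplicity `winner` counts a majority of the votes recorded in the log, which is an assumption and not necessarily how Sentinel itself counts.

```python
import re
from collections import defaultdict

# The Sentinel's own vote: "+vote-for-leader <runid> <epoch>"
OWN_VOTE = re.compile(r"\+vote-for-leader (?P<runid>[0-9a-f]{40}) (?P<epoch>\d+)")
# A vote reported by a peer: "<ip:port> voted for <runid> <epoch>"
PEER_VOTE = re.compile(r"(?P<voter>\S+) voted for (?P<runid>[0-9a-f]{40}) (?P<epoch>\d+)")


def tally_votes(log_lines, local_name="self"):
    """Return {epoch: {voter: runid}} for the votes visible in one Sentinel log."""
    votes = defaultdict(dict)
    for line in log_lines:
        m = OWN_VOTE.search(line)
        if m:
            votes[int(m["epoch"])][local_name] = m["runid"]
            continue
        m = PEER_VOTE.search(line)
        if m:
            votes[int(m["epoch"])][m["voter"]] = m["runid"]
    return dict(votes)


def winner(epoch_votes):
    """Return the runid with a strict majority of the recorded votes, or None."""
    counts = defaultdict(int)
    for runid in epoch_votes.values():
        counts[runid] += 1
    needed = len(epoch_votes) // 2 + 1
    for runid, count in counts.items():
        if count >= needed:
            return runid
    return None


# Vote lines for epochs 290 and 291, copied from the "sentinel 1 log" above.
SENTINEL_1_LOG = """\
[17193] 24 Dec 20:18:07.429 # +vote-for-leader d2e257f5af8becf766f350139af3526efa5d2741 290
[17193] 24 Dec 20:18:07.430 # 10.120.45.89:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 290
[17193] 24 Dec 20:18:07.430 # 10.120.45.88:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 290
[17193] 24 Dec 20:20:35.315 # +vote-for-leader d2e257f5af8becf766f350139af3526efa5d2741 291
[17193] 24 Dec 20:20:35.316 # 10.120.45.88:26379 voted for 32f2d6d6bb62b7404263df90ca7ef23a64827276 291
[17193] 24 Dec 20:20:35.316 # 10.120.45.89:26379 voted for ccdbea66faa41eabc7482e07b47bf3c3c59d9ebd 291
""".splitlines()

votes_by_epoch = tally_votes(SENTINEL_1_LOG)
for epoch in sorted(votes_by_epoch):
    print(epoch, winner(votes_by_epoch[epoch]))
```

On these six lines the sketch reports ccdbea... as the winner of epoch 290 and no winner for epoch 291, which is the three-way split described above.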

The sentinel.c source says Sentinels are allowed to vote for themselves:

char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
.
.
.
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,server.runid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;

I'm not sure why at epoch 291 each Sentinel has given up believing in all the others.

@antirez
Contributor

antirez commented Jan 9, 2015

Hello, this is usually the result of slow interaction or poor desynchronization between Sentinels. A Sentinel votes for itself unless it has already received a vote request from another Sentinel. Usually one is faster than the others, since they are desynchronized, but if they are slow to communicate, the communication time becomes larger than the desynchronization time and a split brain condition happens, followed by a new desynchronization and a new vote attempt (so the failover will eventually happen anyway).
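
The race described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (one epoch, a fixed one-way message delay, each Sentinel grants its vote to the first candidate it hears about, itself included). It is not Sentinel's actual implementation, and `run_election` and `elected` are hypothetical names.

```python
import heapq


def run_election(start_times_ms, latency_ms):
    """Simulate one voting epoch.

    start_times_ms[i] is when Sentinel i would start its own failover attempt
    (the desynchronization); latency_ms is the one-way message delay.
    Returns a list where entry i is the index of the Sentinel that i voted for.
    """
    n = len(start_times_ms)
    votes = [None] * n
    queue = []  # entries: (time, sequence, kind, target, candidate)
    seq = 0
    for i, t in enumerate(start_times_ms):
        heapq.heappush(queue, (t, seq, "start", i, i))
        seq += 1
    while queue:
        t, _, kind, target, candidate = heapq.heappop(queue)
        if votes[target] is not None:
            continue  # already voted in this epoch; the first vote stands
        votes[target] = candidate
        if kind == "start":
            # The Sentinel voted for itself, so it asks its peers for votes.
            for peer in range(n):
                if peer != target:
                    heapq.heappush(queue, (t + latency_ms, seq, "request", peer, target))
                    seq += 1
    return votes


def elected(votes):
    """Return the candidate with a strict majority of votes, or None."""
    needed = len(votes) // 2 + 1
    for candidate in set(votes):
        if votes.count(candidate) >= needed:
            return candidate
    return None


starts = [0, 300, 600]  # desynchronized start times in milliseconds
for latency in (50, 400, 1000):
    votes = run_election(starts, latency)
    print(latency, votes, elected(votes))
```

With a 50 ms delay every Sentinel hears the first candidate before its own timer fires, so all three vote for Sentinel 0. With 400 ms one Sentinel has already voted for itself but a majority still forms. With 1000 ms the delay exceeds the whole desynchronization window, each Sentinel votes for itself, and nobody is elected, as at epoch 291.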

It is probably possible to improve the desynchronization (a long-standing TODO item of mine...), but here it would be more interesting to see why the Sentinels are sometimes slow to communicate, assuming this is the case. Otherwise, we need to understand whether the desynchronization is simply not effective enough...

I'll investigate this issue next week and report back. By the way, in your environment, is this easy to reproduce by running the Sentinel unit tests? Thanks.

@NicGobbi

NicGobbi commented Aug 2, 2021

Hello! How is the work on this going? I'm currently having the same issue: the sentinels keep voting for themselves until at some point they both vote for the same one.

@bharatkhanna7

bharatkhanna7 commented Feb 11, 2023

@NicGobbi @antirez Please let me know if you were able to find the solution for this issue? I am facing the same issue

@spary

spary commented Jan 9, 2024

@NicGobbi @antirez Please let me know if you were able to find the solution for this issue? I am facing the same issue

@spary

spary commented Jan 9, 2024

@wangjn01 Please let me know if you were able to find the solution for this issue? I am facing the same issue
