Quorum lost in majority partition #10
Comments
@sorcky Thank you for the report! Yes, it is a known bug and your analysis is correct. The main problem is dpd_interval - the plan is to remove it completely - and the too-large device.timeout, which translates into the heartbeat interval (0.8 * device.timeout). That should be set closer to the corosync token timeout (because it effectively is a token timeout). Your proposed workaround is also correct. The problem is that the required change is quite big and must be done carefully to keep backwards compatibility. I'll keep this open as a reminder (and to raise its priority on my todo list).
Current master (28d4914) includes improved dpd timer detection (a8b7513), so the problem should be fixed now. Every client now has its own timer whose timeout is the (configurable) dpd_interval_coefficient multiplied by the client's heartbeat timeout. When a message is received from the client, the timer is rescheduled; when the timer callback fires (= the client didn't send a message during the timeout), the client is disconnected. The qdevice timeout is not based on the corosync token timeout yet, but because corosync 3.1.0 got a default 3 sec token timeout, it is now (with the gather/merge/consensus timeouts, =~ 7 sec in total) really close to the new default qdevice timeout of 12 sec, so basing the qdevice timeout on the corosync token timeout may not be so needed anymore. It is also important to note that qdevice sends its heartbeat every 8 secs, independently of the corosync heartbeat, so when a node dies just before qdevice would normally send a heartbeat, qnetd will detect it via DPD in ~4 secs, which can sometimes be faster than corosync's own detection (because of gather/merge/consensus). Closing this issue for now.
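For illustration, a minimal sketch of that per-client scheme (this is not the actual qnetd code; the struct, the helper names, and the 1.5 coefficient are invented for the example) could look like this:

    /* Per-client DPD timer as described above (commit a8b7513), modelled
     * as an absolute deadline: rescheduled on every message, client is
     * disconnected when it fires. */
    #include <stdio.h>

    struct client {
        const char *name;
        double heartbeat_interval;   /* seconds, as negotiated with the client */
        double dpd_deadline;         /* absolute time at which DPD fires */
        int connected;
    };

    /* (Re)arm the client's DPD timer: deadline = now + coeff * heartbeat. */
    static void dpd_reschedule(struct client *c, double now, double coeff)
    {
        c->dpd_deadline = now + coeff * c->heartbeat_interval;
    }

    /* Called whenever any message arrives from the client. */
    static void on_client_msg(struct client *c, double now, double coeff)
    {
        dpd_reschedule(c, now, coeff);
    }

    /* Called when the per-client timer fires: no message arrived in time. */
    static void dpd_timer_fired(struct client *c)
    {
        c->connected = 0;
        printf("%s: no message within DPD timeout, disconnecting\n", c->name);
    }

    int main(void)
    {
        const double coeff = 1.5;            /* illustrative dpd_interval_coefficient */
        struct client c = { "test2", 8.0, 0.0, 1 };

        dpd_reschedule(&c, 0.0, coeff);      /* client connects at t=0 */
        on_client_msg(&c, 1.0, coeff);       /* heartbeat received at t=1 */
        /* client dies at t=2; let simulated time pass and check the deadline */
        for (double t = 2.0; c.connected; t += 1.0) {
            if (t >= c.dpd_deadline) {
                printf("t=%.0f: ", t);
                dpd_timer_fired(&c);
            }
        }
        return 0;
    }

With an 8 sec heartbeat and a coefficient of 1.5, this detects the dead client about 12 sec after its last heartbeat, which is far better than the up-to-40-seconds behaviour analysed in the report below.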
Sometimes corosync loses quorum in the majority partition in the following scenario:
Two node cluster:
test1 - node 1 with corosync v3.0.3 + corosync-qdevice v3.0.0
test2 - node 2 with corosync v3.0.3 + corosync-qdevice v3.0.0
test3 - node with corosync-qnetd v3.0.0
When test2 fails, we see the following on test1:
As you can see, at this moment the qdevice on test1 receives a callback from corosync and sends the changed node list to qnetd.
Sometimes we don't receive the vote info from qnetd within 30000 ms (the default quorum.device.sync_timeout). In such cases we lose quorum and get:
And then in a second:
And we get quorum back.
The root cause description:
By default, the heartbeat from qdevice to qnetd is sent every 8 sec (0.8 * the default quorum.device.timeout of 10000 ms).
The function qnetd_dpd_timer_cb in qdevices/qnetd-dpd-timer.c is called every 10 secs (the default dpd_interval).
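To make the race easier to follow, here is a simplified model of that periodic check (the field names follow this report; the real qnetd-dpd-timer.c differs in detail, and accumulating elapsed time by adding dpd_interval in the callback is a simplification):

    /* Simplified model of the old periodic DPD check. Replaying the
     * timeline below shows that schedule_disconnect is only set at t=40. */
    #include <stdio.h>

    struct client {
        unsigned heartbeat_interval;          /* seconds */
        unsigned dpd_time_since_last_check;   /* seconds since the counters were reset */
        int dpd_msg_received_since_last_check;
        int schedule_disconnect;
    };

    /* Called every dpd_interval seconds for each connected client. */
    static void qnetd_dpd_timer_cb(struct client *c, unsigned dpd_interval)
    {
        c->dpd_time_since_last_check += dpd_interval;

        if (c->dpd_time_since_last_check > c->heartbeat_interval * 2) {
            if (!c->dpd_msg_received_since_last_check)
                c->schedule_disconnect = 1;   /* no heartbeat seen for two intervals */
            c->dpd_time_since_last_check = 0;
            c->dpd_msg_received_since_last_check = 0;
        }
    }

    int main(void)
    {
        const unsigned dpd_interval = 10;     /* seconds, the default quoted above */
        struct client c = { .heartbeat_interval = 8 };

        /* 00:00: counters were just reset; 00:01: heartbeat arrives;
         * 00:02: test2 dies, so no further heartbeats are received. */
        c.dpd_msg_received_since_last_check = 1;

        for (unsigned t = dpd_interval; !c.schedule_disconnect; t += dpd_interval) {
            qnetd_dpd_timer_cb(&c, dpd_interval);
            printf("t=%02u: schedule_disconnect = %d\n", t, c.schedule_disconnect);
        }
        return 0;
    }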
Let's suppose the following sequence of events:
00:00 - qnetd_dpd_timer_cb is called for test2
let's suppose the condition (client->dpd_time_since_last_check > (client->heartbeat_interval * 2)) is true, so it sets
client->dpd_time_since_last_check = 0;
client->dpd_msg_received_since_last_check = 0;
00:01 - test2 sends a heartbeat to qnetd, which sets client->dpd_msg_received_since_last_check = 1
00:02 - test2 fails
00:10 - qnetd_dpd_timer_cb is called for test2
the condition (client->dpd_time_since_last_check > (client->heartbeat_interval * 2)) is false, so nothing happens this time
00:20 - qnetd_dpd_timer_cb is called for test2
the condition (client->dpd_time_since_last_check > (client->heartbeat_interval * 2)) is true now, but because client->dpd_msg_received_since_last_check == 1 (the heartbeat from 00:01), it only resets
client->dpd_time_since_last_check = 0;
client->dpd_msg_received_since_last_check = 0;
00:30 - qnetd_dpd_timer_cb is called for test2
the condition (client->dpd_time_since_last_check > (client->heartbeat_interval * 2)) is false, so nothing happens again
00:40 - qnetd_dpd_timer_cb is called for test2
the condition (client->dpd_time_since_last_check > (client->heartbeat_interval * 2)) is true again, and because client->dpd_msg_received_since_last_check == 0 (no heartbeat has been received since the node failed), it sets
client->schedule_disconnect = 1
Thus it can take up to 40 seconds (roughly 4 * dpd_interval in the worst case) for qnetd to disconnect the failed node test2 and respond to the qdevice on the live node test1.
I propose changing the default timeouts, e.g.:
dpd_interval = 1000
quorum.device.timeout = 4000
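For reference, an illustrative corosync.conf excerpt with the proposed value is shown below (not taken from the reporter's configuration); dpd_interval is a qnetd-side setting, so it is not configured in corosync.conf but on the qnetd host (how exactly depends on the qnetd version and packaging):

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            timeout: 4000         # proposed quorum.device.timeout
            sync_timeout: 30000   # default quorum.device.sync_timeout mentioned above
            net {
                host: test3
            }
        }
    }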