Quorum lost in majority partition #10

Closed

sorcky opened this issue Apr 2, 2020 · 2 comments

sorcky commented Apr 2, 2020

Sometimes corosync loses quorum in the majority partition in the following scenario:
Two-node cluster:
test1 - node 1 with corosync v3.0.3 + corosync-qdevice v3.0.0
test2 - node 2 with corosync v3.0.3 + corosync-qdevice v3.0.0
test3 - node with corosync-qnetd v3.0.0
When test2 fails, we see the following on test1:

Apr  2 19:17:25 test1 corosync[16840]:  [TOTEM ] Token has not been received in 751 ms
Apr  2 19:17:25 test1 corosync[16840]:  [TOTEM ] A processor failed, forming new configuration.
Apr  2 19:17:27 test1 corosync[16840]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Apr  2 19:17:27 test1 corosync[16840]:  [TOTEM ] A new membership (1.9d) was formed. Members left: 2
Apr  2 19:17:27 test1 corosync[16840]:  [TOTEM ] Failed to receive the leave message. failed: 2
Apr  2 19:17:27 test1 corosync[16840]:  [CPG   ] downlist left_list: 1 received
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Votequorum nodelist notify callback:
Apr  2 19:17:27 test1 corosync-qdevice[12199]:  Ring_id = (1.9d)
Apr  2 19:17:27 test1 corosync-qdevice[12199]:  Node list (size = 1):
Apr  2 19:17:27 test1 corosync-qdevice[12199]:    0 nodeid = 1
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Algorithm decided to pause cast vote timer and result vote is No change
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Cast vote timer is now paused.
Apr  2 19:17:27 test1 corosync-qdevice[12199]: worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "9" and timeout "15000"
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Received heuristics exec result command with seq_no "9" and result "Disabled"
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Votequorum heuristics exec result callback:
Apr  2 19:17:27 test1 corosync-qdevice[12199]:  seq_number = 9, exec_result = Disabled
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Algorithm decided to send list, result vote is Wait for reply and heuristics is Undefined
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Sending membership node list seq = 18, ringid = (1.9d), heuristics = Undefined.
Apr  2 19:17:27 test1 corosync-qdevice[12199]:  Node list:
Apr  2 19:17:27 test1 corosync-qdevice[12199]:    0 node_id = 1, data_center_id = 0, node_state = not set
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Cast vote timer is no longer paused.
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Cast vote timer is now stopped.
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Not scheduling heuristics timer because mode is not enabled
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Received membership node list reply
Apr  2 19:17:27 test1 corosync-qdevice[12199]:  seq = 18
Apr  2 19:17:27 test1 corosync-qdevice[12199]:  vote = Wait for reply
Apr  2 19:17:27 test1 corosync-qdevice[12199]:  ring id = (1.9d)
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Algorithm result vote is Wait for reply
Apr  2 19:17:27 test1 corosync-qdevice[12199]: Cast vote timer remains stopped.

As you can see, at this moment the qdevice on test1 receives a callback from corosync and sends the changed node list to qnetd.
Sometimes we don't receive vote info from qnetd within 30000 ms (the default quorum.device.sync_timeout). In such cases we lose quorum and get:

Apr  2 19:17:57 test1 corosync[16840]:  [VOTEQ ] lost contact with quorum device Qdevice
Apr  2 19:17:57 test1 corosync[16840]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr  2 19:17:57 test1 corosync[16840]:  [QUORUM] Members[1]: 1
Apr  2 19:17:57 test1 corosync[16840]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  2 19:17:57 test1 corosync-qdevice[12199]: Votequorum quorum notify callback:
Apr  2 19:17:57 test1 corosync-qdevice[12199]:  Quorate = 0
Apr  2 19:17:57 test1 corosync-qdevice[12199]:  Node list (size = 3):
Apr  2 19:17:57 test1 corosync-qdevice[12199]:    0 nodeid = 2, state = 2
Apr  2 19:17:57 test1 corosync-qdevice[12199]:    1 nodeid = 1, state = 1
Apr  2 19:17:57 test1 corosync-qdevice[12199]:    2 nodeid = 0, state = 0
Apr  2 19:17:57 test1 corosync-qdevice[12199]: Algorithm decided to send list and result vote is No change
Apr  2 19:17:57 test1 corosync-qdevice[12199]: Sending quorum node list seq = 19, quorate = 0
Apr  2 19:17:57 test1 corosync-qdevice[12199]:  Node list:
Apr  2 19:17:57 test1 corosync-qdevice[12199]:    0 node_id = 2, data_center_id = 0, node_state = dead
Apr  2 19:17:57 test1 corosync-qdevice[12199]:    1 node_id = 1, data_center_id = 0, node_state = member
Apr  2 19:17:57 test1 corosync-qdevice[12199]: Received quorum node list reply
Apr  2 19:17:57 test1 corosync-qdevice[12199]:  seq = 19
Apr  2 19:17:57 test1 corosync-qdevice[12199]:  vote = No change
Apr  2 19:17:57 test1 corosync-qdevice[12199]:  ring id = (1.9d)
Apr  2 19:17:57 test1 corosync-qdevice[12199]: Algorithm result vote is No change

And then, a few seconds later:

Apr  2 19:18:00 test1 corosync-qdevice[12199]: Received vote info
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  seq = 10
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  vote = ACK
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  ring id = (1.9d)
Apr  2 19:18:00 test1 corosync-qdevice[12199]: Algorithm result vote is ACK
Apr  2 19:18:00 test1 corosync-qdevice[12199]: Cast vote timer is now scheduled every 5000ms voting ACK.
Apr  2 19:18:00 test1 corosync[16840]:  [QUORUM] This node is within the primary component and will provide service.
Apr  2 19:18:00 test1 corosync[16840]:  [QUORUM] Members[1]: 1
Apr  2 19:18:00 test1 corosync-qdevice[12199]: Votequorum quorum notify callback:
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  Quorate = 1
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  Node list (size = 3):
Apr  2 19:18:00 test1 corosync-qdevice[12199]:    0 nodeid = 2, state = 2
Apr  2 19:18:00 test1 corosync-qdevice[12199]:    1 nodeid = 1, state = 1
Apr  2 19:18:00 test1 corosync-qdevice[12199]:    2 nodeid = 0, state = 0
Apr  2 19:18:00 test1 corosync-qdevice[12199]: Algorithm decided to send list and result vote is No change
Apr  2 19:18:00 test1 corosync-qdevice[12199]: Sending quorum node list seq = 20, quorate = 1
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  Node list:
Apr  2 19:18:00 test1 corosync-qdevice[12199]:    0 node_id = 2, data_center_id = 0, node_state = dead
Apr  2 19:18:00 test1 corosync-qdevice[12199]:    1 node_id = 1, data_center_id = 0, node_state = member
Apr  2 19:18:00 test1 corosync-qdevice[12199]: Received quorum node list reply
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  seq = 20
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  vote = No change
Apr  2 19:18:00 test1 corosync-qdevice[12199]:  ring id = (1.9d)
Apr  2 19:18:00 test1 corosync-qdevice[12199]: Algorithm result vote is No change

And we get quorum back.

The root cause:
By default, the heartbeat from qdevice to qnetd is sent every 8 seconds, and the function qnetd_dpd_timer_cb in qdevices/qnetd-dpd-timer.c is called every 10 seconds (the default dpd_interval).
Let's suppose the following sequence of events:
00:00 - qnetd_dpd_timer_cb is called for test2.
Suppose the condition client->dpd_time_since_last_check > (client->heartbeat_interval * 2) is true, so it sets
client->dpd_time_since_last_check = 0;
client->dpd_msg_received_since_last_check = 0;
00:01 - test2 sends a heartbeat to qnetd, which sets client->dpd_msg_received_since_last_check = 1
00:02 - test2 fails
00:10 - qnetd_dpd_timer_cb is called for test2.
The condition client->dpd_time_since_last_check > (client->heartbeat_interval * 2) is false; nothing happens this time.
00:20 - qnetd_dpd_timer_cb is called for test2.
The condition is true now, but because client->dpd_msg_received_since_last_check == 1 (the heartbeat from 00:01), it only resets
client->dpd_time_since_last_check = 0;
client->dpd_msg_received_since_last_check = 0;
00:30 - qnetd_dpd_timer_cb is called for test2.
The condition is false; nothing happens again.
00:40 - qnetd_dpd_timer_cb is called for test2.
The condition is true, and because client->dpd_msg_received_since_last_check == 0 (no heartbeat has been received since the node failed), it finally sets
client->schedule_disconnect = 1

Thus it can take up to 40 seconds for qnetd to disconnect the failed node test2 and respond to the qdevice on the live node test1.
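
To make the failure mode concrete, here is a minimal sketch of the per-tick check described above (field names follow the report; the surrounding structure is abbreviated and approximate, not the verbatim corosync-qnetd code):

static void
dpd_check_client(struct client *client, uint32_t dpd_interval)
{
	client->dpd_time_since_last_check += dpd_interval;

	/* With a 10 s tick and an 8 s heartbeat, this window
	 * (2 * 8 s = 16 s) is only exceeded on every other tick. */
	if (client->dpd_time_since_last_check > client->heartbeat_interval * 2) {
		if (!client->dpd_msg_received_since_last_check) {
			/* No heartbeat during the whole window -> disconnect. */
			client->schedule_disconnect = 1;
		}
		/* Reset the window. A heartbeat that arrived just before the
		 * node died keeps it "alive" for one more full window, which
		 * is how detection stretches to ~40 s. */
		client->dpd_time_since_last_check = 0;
		client->dpd_msg_received_since_last_check = 0;
	}
}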

I propose changing the default timeouts, e.g.:
dpd_interval = 1000
quorum.device.timeout = 4000
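
For reference, the quorum.device.timeout part of this workaround goes into corosync.conf on the cluster nodes, e.g. (a sketch showing only the relevant keys; model and the other device settings depend on the setup):

quorum {
    provider: corosync_votequorum
    device {
        model: net
        timeout: 4000
    }
}

dpd_interval, on the other hand, is internal to corosync-qnetd (qdevices/qnetd-dpd-timer.c), not a corosync.conf key, so changing its default means changing it on the qnetd side.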

jfriesse commented Apr 2, 2020

@sorcky Thank you for the report! Yes, it is a known bug, and your analysis is correct. The main problem is dpd_interval (the plan is to remove it completely) together with the too-large device.timeout, which translates into the heartbeat interval (0.8 * device.timeout); this should be set closer to the token timeout (because it effectively is a token timeout). Your proposed workaround is correct.

The problem is that the required change is quite big, and it must be done carefully to keep backwards compatibility.

I'll keep this open as a reminder (and to increase its priority in my todo list).

jfriesse commented Nov 18, 2020

Current master (28d4914) includes improved dpd timer detection (a8b7513), so the problem should be fixed now.

Now every client has its own timer, with a timeout based on the (configurable) dpd_interval_coefficient multiplied by that client's heartbeat timeout. When a message is received from a client, its timer is rescheduled. When the timer callback fires (i.e. the client didn't send any message within the timeout), the client is disconnected.
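
A minimal sketch of that per-client scheme (illustrative names only, not the actual corosync-qnetd API):

/* Called when the client connects: one DPD timer per client,
 * sized from that client's own heartbeat interval. */
static void
client_dpd_timer_start(struct client *client, double dpd_interval_coefficient)
{
	uint32_t timeout = (uint32_t)(client->heartbeat_interval *
	    dpd_interval_coefficient);

	timer_schedule(&client->dpd_timer, timeout, client_dpd_timeout_cb, client);
}

/* Called on every message from the client: liveness proven,
 * so push the deadline out by a full timeout again. */
static void
client_msg_received(struct client *client)
{
	timer_reschedule(&client->dpd_timer);
}

/* Timer fired: no message within the timeout -> disconnect. */
static void
client_dpd_timeout_cb(void *data)
{
	struct client *client = data;

	client->schedule_disconnect = 1;
}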

The qdevice timeout is not based on the corosync token timeout yet, but because corosync 3.1.0 got a default 3-second token timeout, corosync's failure detection (~7 seconds including the gather/merge/consensus timeouts) is now really close to the new default qdevice timeout of 12 seconds, so deriving the qdevice timeout from the corosync token timeout may no longer be so necessary. It is also important to note that qdevice sends its heartbeat every 8 seconds, independently of the corosync heartbeat, so when a node dies just before qdevice would normally send a heartbeat, qnetd will detect the dead peer in ~4 seconds; that can sometimes be faster than corosync's own detection (again because of gather/merge/consensus).
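
To make the ~4 second figure concrete, assuming the per-client timeout is dpd_interval_coefficient * heartbeat interval with a coefficient of 1.5 (the coefficient value here is an assumption, chosen to match the numbers above):

timeout = 1.5 * 8 s = 12 s   (measured from the last received heartbeat)
worst case: node dies right after sending a heartbeat -> detected after ~12 s
best case:  node dies just before its next heartbeat  -> detected after 12 s - 8 s = ~4 s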

Closing this issue for now.
