cpg: Inform clients about left nodes during pause #347

jfriesse · 2018-04-24T16:26:43Z

Patch tries to fix incorrect behaviour during following test-case:

3 nodes
Node 1 is paused
Node 2 and 3 detects node 1 as failed and informs CPG clients
Node 1 is unpaused
Node 1 clients are informed about new membership, but not about Node 1
being paused, so from Node 1 point-of-view, Node 2 and 3 failure

Solution is to:

Remove downlist master choose and always choose local node downlist.
For Node 1 in example above, downlist contains Node 2 and 3.
Keep code which informs clients about left nodes
Use joinlist as a authoritative source of nodes/clients which exists
in membership

This patch doesn't break backwards compatibility.

I've walked thru all the patches which changed behavior of cpg to ensure
patch does not break CPG behavior. Most important were:

058f503 - Base. Code was significantly
changed to handle double free by split group_info into two structures
cpg_pd (local node clients) and process_info (all clients). Joinlist
was
97c28ea - This patch removed
confchg_fn and made CPG sync correct
feff0e8 - I've tested described
behavior without any issues
6bbbfcb - Added idea of using
heuristics to choose same downlist on all nodes. Sadly this idea
was beginning of the problems described in
040fda8,
ac1d79e,
559d408,
02c5dff,
64d0e5a and
b55f32f
02c5dff - Made joinlist as
authoritative source of nodes/clients but left downlist_master_choose
as a source of information about left nodes

Long story made short. This patch basically reverts
idea of using heuristics to choose same downlist on all nodes.

Signed-off-by: Jan Friesse jfriesse@redhat.com

Patch tries to fix incorrect behaviour during following test-case: - 3 nodes - Node 1 is paused - Node 2 and 3 detects node 1 as failed and informs CPG clients - Node 1 is unpaused - Node 1 clients are informed about new membership, but not about Node 1 being paused, so from Node 1 point-of-view, Node 2 and 3 failure Solution is to: - Remove downlist master choose and always choose local node downlist. For Node 1 in example above, downlist contains Node 2 and 3. - Keep code which informs clients about left nodes - Use joinlist as a authoritative source of nodes/clients which exists in membership This patch doesn't break backwards compatibility. I've walked thru all the patches which changed behavior of cpg to ensure patch does not break CPG behavior. Most important were: - 058f503 - Base. Code was significantly changed to handle double free by split group_info into two structures cpg_pd (local node clients) and process_info (all clients). Joinlist was - 97c28ea - This patch removed confchg_fn and made CPG sync correct - feff0e8 - I've tested described behavior without any issues - 6bbbfcb - Added idea of using heuristics to choose same downlist on all nodes. Sadly this idea was beginning of the problems described in 040fda8, ac1d79e, 559d408, 02c5dff, 64d0e5a and b55f32f - 02c5dff - Made joinlist as authoritative source of nodes/clients but left downlist_master_choose as a source of information about left nodes Long story made short. This patch basically reverts idea of using heuristics to choose same downlist on all nodes. Signed-off-by: Jan Friesse <jfriesse@redhat.com>

chrissie-c

I've reviewed this as best I can and it looks OK, and tests out on my VMs. I never did like those downlist heuristics anyway!

jfriesse · 2018-04-30T12:42:14Z

@chrissie-c Thank you a lot for the review (I know how hard this one was)!

I've merged patch into needle as 9c2a97f and ported also to master as 23e1795.

Honza

The cpg config change callback where not made correctly for nodes with a low member id (lowest id == master node) after corosync on said node hung, due to IO, artifical suspend or other scheduling related hangs. This is releated to an heuristic for choosing/syncing the CPG member downlist added in 6bbbfcb6b4af72cf35ab9fdb4412fa6c6bdacc12 (corosync repository). See whole issue thread: https://lists.clusterlabs.org/pipermail/users/2018-March/014594.html Upstream pull-request: corosync/corosync#347

chrissie-c approved these changes Apr 27, 2018

View reviewed changes

jfriesse closed this Apr 30, 2018

jfriesse deleted the remove-cpg-downlist-heuristics branch April 30, 2018 12:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cpg: Inform clients about left nodes during pause #347

cpg: Inform clients about left nodes during pause #347

jfriesse commented Apr 24, 2018

chrissie-c left a comment

jfriesse commented Apr 30, 2018

cpg: Inform clients about left nodes during pause #347

cpg: Inform clients about left nodes during pause #347

Conversation

jfriesse commented Apr 24, 2018

chrissie-c left a comment

Choose a reason for hiding this comment

jfriesse commented Apr 30, 2018