Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpg: Inform clients about left nodes during pause #347

Closed

Conversation

jfriesse
Copy link
Member

Patch tries to fix incorrect behaviour during following test-case:

  • 3 nodes
  • Node 1 is paused
  • Node 2 and 3 detects node 1 as failed and informs CPG clients
  • Node 1 is unpaused
  • Node 1 clients are informed about new membership, but not about Node 1
    being paused, so from Node 1 point-of-view, Node 2 and 3 failure

Solution is to:

  • Remove downlist master choose and always choose local node downlist.
    For Node 1 in example above, downlist contains Node 2 and 3.
  • Keep code which informs clients about left nodes
  • Use joinlist as a authoritative source of nodes/clients which exists
    in membership

This patch doesn't break backwards compatibility.

I've walked thru all the patches which changed behavior of cpg to ensure
patch does not break CPG behavior. Most important were:

  • 058f503 - Base. Code was significantly
    changed to handle double free by split group_info into two structures
    cpg_pd (local node clients) and process_info (all clients). Joinlist
    was
  • 97c28ea - This patch removed
    confchg_fn and made CPG sync correct
  • feff0e8 - I've tested described
    behavior without any issues
  • 6bbbfcb - Added idea of using
    heuristics to choose same downlist on all nodes. Sadly this idea
    was beginning of the problems described in
    040fda8,
    ac1d79e,
    559d408,
    02c5dff,
    64d0e5a and
    b55f32f
  • 02c5dff - Made joinlist as
    authoritative source of nodes/clients but left downlist_master_choose
    as a source of information about left nodes

Long story made short. This patch basically reverts
idea of using heuristics to choose same downlist on all nodes.

Signed-off-by: Jan Friesse jfriesse@redhat.com

Patch tries to fix incorrect behaviour during following test-case:
- 3 nodes
- Node 1 is paused
- Node 2 and 3 detects node 1 as failed and informs CPG clients
- Node 1 is unpaused
- Node 1 clients are informed about new membership, but not about Node 1
  being paused, so from Node 1 point-of-view, Node 2 and 3 failure

Solution is to:
- Remove downlist master choose and always choose local node downlist.
  For Node 1 in example above, downlist contains Node 2 and 3.
- Keep code which informs clients about left nodes
- Use joinlist as a authoritative source of nodes/clients which exists
  in membership

This patch doesn't break backwards compatibility.

I've walked thru all the patches which changed behavior of cpg to ensure
patch does not break CPG behavior. Most important were:
- 058f503 - Base. Code was significantly
  changed to handle double free by split group_info into two structures
  cpg_pd (local node clients) and process_info (all clients). Joinlist
  was
- 97c28ea - This patch removed
  confchg_fn and made CPG sync correct
- feff0e8 - I've tested described
  behavior without any issues
- 6bbbfcb - Added idea of using
  heuristics to choose same downlist on all nodes. Sadly this idea
  was beginning of the problems described in
  040fda8,
  ac1d79e,
  559d408,
  02c5dff,
  64d0e5a and
  b55f32f
- 02c5dff - Made joinlist as
  authoritative source of nodes/clients but left downlist_master_choose
  as a source of information about left nodes

Long story made short. This patch basically reverts
idea of using heuristics to choose same downlist on all nodes.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Copy link
Contributor

@chrissie-c chrissie-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed this as best I can and it looks OK, and tests out on my VMs. I never did like those downlist heuristics anyway!

@jfriesse
Copy link
Member Author

@chrissie-c Thank you a lot for the review (I know how hard this one was)!

I've merged patch into needle as 9c2a97f and ported also to master as 23e1795.

Honza

@jfriesse jfriesse closed this Apr 30, 2018
@jfriesse jfriesse deleted the remove-cpg-downlist-heuristics branch April 30, 2018 12:42
ProxBot pushed a commit to proxmox/corosync-pve that referenced this pull request Sep 8, 2020
The cpg config change callback where not made correctly for nodes
with a low member id (lowest id == master node) after corosync on
said node hung, due to IO, artifical suspend or other scheduling
related hangs.

This is releated to an heuristic for choosing/syncing the CPG member
downlist added in 6bbbfcb6b4af72cf35ab9fdb4412fa6c6bdacc12 (corosync
repository).

See whole issue thread:
https://lists.clusterlabs.org/pipermail/users/2018-March/014594.html

Upstream pull-request:
corosync/corosync#347
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants