votequorum: set wfa status only on startup #542

Closed

Conversation

jfriesse (Member)

Previously, a configuration reload with wait_for_all enabled resulted in wait_for_all_status being set, which set cluster_is_quorate to 0 but did not inform the quorum service, so the votequorum and quorum information could get out of sync.

An example is a 1-node cluster which is extended to 3 nodes. The quorum service reports the cluster as quorate (incorrect) and votequorum reports it as not quorate (correct). Similar behaviour happens when extending a cluster in general, but some configurations are less incorrect (3->4).

One solution that was discussed was to inform the quorum service, but that would mean every reload would cause a loss of quorum until all nodes had been seen again.

Such behaviour is consistent but seems to be a bit too strict.

The proposed solution sets wait_for_all_status only on startup and doesn't touch it during reload.
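
In rough code form, the idea is the following (a minimal sketch with made-up names, not the actual patch; the real change is inside corosync's votequorum service):

```c
/*
 * Illustrative sketch only -- not the corosync patch itself.  All names
 * (vq_cfg, vq_readconfig, VQ_READCONFIG_STARTUP, ...) are hypothetical.
 */
#include <stdbool.h>

enum vq_readconfig_context {
	VQ_READCONFIG_STARTUP,	/* first configuration read at daemon start */
	VQ_READCONFIG_RELOAD	/* runtime reload of corosync.conf */
};

struct vq_cfg {
	bool wait_for_all;		/* value of the wait_for_all option */
	bool wait_for_all_status;	/* runtime flag: still waiting for all nodes */
};

static void vq_readconfig(struct vq_cfg *cfg, bool wfa_enabled,
			  enum vq_readconfig_context ctx)
{
	cfg->wait_for_all = wfa_enabled;

	/*
	 * Initialize wait_for_all_status only on startup.  On reload the
	 * current status is left alone, so votequorum does not mark an
	 * already-quorate cluster unquorate behind the quorum service's back.
	 */
	if (ctx == VQ_READCONFIG_STARTUP) {
		cfg->wait_for_all_status = cfg->wait_for_all;
	}
}
```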

This solution fulfills the requirement that "the cluster will be quorate for the first time only after all nodes have been visible at least once at the same time", because a node clears wait_for_all_status only after it sees all other nodes or joins a cluster which is already quorate. It also solves the problem with extending the cluster, because when the cluster becomes unquorate (1->3), wait_for_all_status is set.
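
The clearing side of that rule can be summarized as below (a hedged sketch; helper and parameter names are made up, not votequorum's real symbols, and the path that re-sets the flag when quorum is lost is not shown):

```c
#include <stdbool.h>

/*
 * Sketch of when wait_for_all_status is cleared; returns the new value.
 * All names are illustrative, not the real votequorum code.
 */
static bool vq_next_wfa_status(bool wait_for_all, bool current_status,
			       unsigned int members_seen,
			       unsigned int members_configured,
			       bool joined_quorate_cluster)
{
	if (!wait_for_all) {
		return false;	/* feature disabled: nothing to wait for */
	}

	/*
	 * The flag is cleared only once every configured node has been seen
	 * at the same time, or when this node joins an already-quorate
	 * cluster; otherwise the current status is kept.
	 */
	if (members_seen == members_configured || joined_quorate_cluster) {
		return false;
	}
	return current_status;
}
```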

The added assert is only to ensure that I haven't missed any case in which a quorate cluster may become unquorate.
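
As a purely illustrative sketch of that kind of guard (names are made up, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative guard: on this code path a cluster that was quorate before
 * the recalculation must never come out unquorate.
 */
static void check_quorate_transition(bool was_quorate, bool is_quorate_now)
{
	assert(!(was_quorate && !is_quorate_now));
}
```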

Signed-off-by: Jan Friesse <jfriesse@redhat.com>

jfriesse (Member, Author)

Just some examples I've tested:

  1. 1->3 - both vq and the quorum service reported the same status (unquorate)
  2. 3 nodes configured, 2 running, config extended to 6 nodes -> both vq and the quorum service reported the same status (unquorate)
  3. 3 nodes configured, 3 running, config extended to 6 nodes -> both vq and the quorum service reported the same status (unquorate)
  4. 3 nodes configured, 3 running, config extended to 4 nodes -> both vq and the quorum service reported the same status (quorate)
  5. 3 nodes configured, 3 running, 1 rebooted without the ability to contact the other nodes after reboot -> both vq and the quorum service report quorate status on the 2 surviving nodes; the rebooted node reports unquorate for both services

chrissie-c (Contributor) left a comment

Having played with this for a while and thought about it, it does seem to make sense, so ACK. I hope other people will get a chance to test that it does what they expect, but it does seem sane.

jfriesse (Member, Author)

@chrissie-c Thank you for the review! I've merged the patch as ca320be (master) and 6894792 (needle).

jfriesse closed this Mar 24, 2020
jfriesse deleted the vq-wfa branch March 24, 2020 13:14