RATIS-1446. Avoid leader election for invalid conf #560

Xushaohong · 2021-12-09T09:20:32Z

What changes were proposed in this pull request?

Candidate shall not start leader election in these cases in case of possible NPE caused by conf.getPeer().getPriority().
With this patch, the candidate will become a follower again when receiving the future appendEntries request.

Phenomenon:
The bootstrapped follower becomes the leader after the election timeout somehow, with an empty raft conf. Then in the process of yieldLeaderToHigherPriorityPeer, NPE happens as (null).getPriority().
Setconfiguration action takes place at the end of appendEntriesAsync. If we make it possible to raise electiontimeout before this action, we can replay the NPE case.

1private void yieldLeaderToHigherPriorityPeer() {
2  if (!server.getInfo().isLeader()) {
3    return;
4  }
5  final RaftConfigurationImpl conf = server.getRaftConf();
6  int leaderPriority = conf.getPeer(server.getId()).getPriority();

Possible reason:
Network delay/loss? The leader did not send entries in time or the follower did not receive entries.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-1446

How was this patch tested?

Manual simulation in UT.

Xushaohong · 2021-12-09T09:26:23Z

@szetszwo Actually, I am not sure whether we should make a specific check for each getPriority() use. Right now, I just prevent the NPE from the source. Currently, conf.getPeer().getPriority() only exists in leader election and leader state.

szetszwo · 2021-12-09T11:33:36Z

ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java

+    // Candidate shall not start leader election in these cases in case of
+    // possible NPE caused by conf.getPeer().getPriority()
+    if (!getRaftConf().containsInBothConfs(getId())) {
+      LOG.warn("{} find invalid configuration {}, skip start leader election", this, getRaftConf());
+      return;
+    }


The conf may be changed later on. Therefore, let's don't check it here and check null in LeaderElection.getHigherPriorityPeers(..) as below.

private Set<RaftPeerId> getHigherPriorityPeers(RaftConfiguration conf) { final Optional<Integer> priority = Optional.ofNullable(conf.getPeer(server.getId())).map(RaftPeer::getPriority); return conf.getAllPeers().stream() .filter(peer -> priority.filter(p -> peer.getPriority() > p).isPresent()) .map(RaftPeer::getId) .collect(Collectors.toSet()); }

szetszwo · 2021-12-09T11:48:40Z

..., I am not sure whether we should make a specific check for each getPriority() use.

That's a good point. We should check. I just have walked through the code. There are two other cases needed to be fixed:

diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java
index 03850114..d1413040 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java
@@ -916,13 +916,13 @@ class LeaderStateImpl implements LeaderState {
 
     for (LogAppender logAppender : senders.getSenders()) {
       FollowerInfo followerInfo = logAppender.getFollower();
-      RaftPeerId followerID = followerInfo.getPeer().getId();
-      int followerPriority = conf.getPeer(followerID).getPriority();
-
+      final RaftPeer follower = followerInfo.getPeer();
+      final int followerPriority = follower.getPriority();
       if (followerPriority <= leaderPriority) {
         continue;
       }
 
+      final RaftPeerId followerID = follower.getId();
       final TermIndex leaderLastEntry = server.getState().getLastEntry();
       if (leaderLastEntry == null) {
         LOG.info("{} send StartLeaderElectionRequest to follower:{} on term:{} because follower's priority:{} " +
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/VoteContext.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/VoteContext.java
index 1ef721aa..fde198fe 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/impl/VoteContext.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/VoteContext.java
@@ -144,7 +144,11 @@ class VoteContext {
     }
 
     // Check priority
-    final int priority = impl.getRaftConf().getPeer(impl.getId()).getPriority();
+    final RaftPeer peer = conf.getPeer(impl.getId());
+    if (peer == null) {
+      return reject("our server " + impl.getId() + " is not in the conf");
+    }
+    final int priority = peer.getPriority();
     if (priority <= candidate.getPriority()) {
       return log(true, "our priority " + priority + " <= candidate's priority " + candidate.getPriority());
     } else {

Xushaohong · 2021-12-10T06:45:17Z

The conf may be changed later on. Therefore, let's don't check it here and check null in LeaderElection.getHigherPriorityPeers(..) as below.

The check here in the getHigherPriorityPeers is not enough. The possible case I tested is that bootstrapped follower changes role from follower to candidate and start leader election, as it has an empty conf, it will become leader directly due to if (others.isEmpty()) { r = new ResultAndTerm(Result.PASSED, electionTerm);. Once it becomes another leader, it won't receive any requests from the original leader as the term is increased. Meanwhile, the original leader stuck in sending appendEntries to 'follower'.

I sightly changed the check place. PTAL
@szetszwo

szetszwo · 2021-12-10T07:48:11Z

..., as it has an empty conf ...

This is an invalid test case since the conf (i.e. the group) is incorrect in the beginning. We must either set the group correct before startup or use AdminApi.setConfiguration to change the conf after startup.

Xushaohong · 2021-12-10T07:55:57Z

..., as it has an empty conf ...

This is an invalid test case since the conf (i.e. the group) is incorrect in the beginning. We must either set the group correct before startup or use AdminApi.setConfiguration to change the conf after startup.

The bootstrapped follower could start with empty conf and set conf through leader's appendEntries request or Installsnapshot request. Somehow caused election timeout would let this follower start leader with empty conf.
@szetszwo

szetszwo

The additional changes are good although the empty-conf test case is invalid. Some comments inlined; see also https://issues.apache.org/jira/secure/attachment/13037236/560_reivew.patch

ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java

ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderElection.java

Xushaohong · 2021-12-10T09:20:52Z

Thanks for all the comments above. I have fixed all. @szetszwo

With NOT_IN_CONF check added in submitRequestAndWaitResult, RaftReconfigurationBaseTest would have some UT failed. I checked it and found it clashes with PRE_VOTE and thus added the phase check.

szetszwo · 2021-12-10T14:13:04Z

@Xushaohong , thanks for the update.

TestLeaderElectionWithNetty.testTransferLeader failed. I also can reproduce it locally. Could you take a look?

Xushaohong · 2021-12-13T06:58:55Z

@Xushaohong , thanks for the update.

TestLeaderElectionWithNetty.testTransferLeader failed. I also can reproduce it locally. Could you take a look?

Currently, the setConf request would not update the FollowerInfo in LogAppenderBase, so we have to get the latest followerInfo from conf like the previous version. I have reverted it back and added an NPE check.
@szetszwo Could you help confirm it?

szetszwo · 2021-12-13T07:08:56Z

With NOT_IN_CONF check added in submitRequestAndWaitResult, RaftReconfigurationBaseTest would have some UT failed. I checked it and found it clashes with PRE_VOTE and thus added the phase check.

We should fix the RaftReconfigurationBaseTest but not having the phase check.

szetszwo · 2021-12-13T07:12:06Z

ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderElection.java

@@ -271,6 +272,9 @@ private boolean shouldRun(long electionTerm) {

  private ResultAndTerm submitRequestAndWaitResult(Phase phase, RaftConfigurationImpl conf, long electionTerm)
      throws InterruptedException {
+    if (!conf.containsInConf(server.getId()) && phase == Phase.ELECTION) {


We should fix the unit tests but not adding the phase check here.

We should fix the unit tests but not adding the phase check here.

OK, let's find out what's wrong with UT. @szetszwo

Could you help approve the workflow? : )

@Xushaohong , the build finished. Please take a look.

Xushaohong · 2021-12-15T09:54:03Z

I reviewed the failure UT of testRemovePeers. The fault is due to after removing the peers out of the conf, the division which should quit the raft group will become a candidate and start the election. During the prevote process, originally it will send out the request and then the other peers should send back should shutdown to turn this server down. But for our case, we just return new ResultAndTerm(Result.NOT_IN_CONF), and won't go to the SHUTDOWN branch.
@szetszwo Shall we move up the line case NOT_IN_CONF: above case SHUTDOWN or still use my original phase check?

szetszwo · 2021-12-15T11:09:37Z

... Shall we move up the line case NOT_IN_CONF: above case SHUTDOWN or still use my original phase check?

This is a good idea. Let's shutdown the server when it is not in conf. Thanks.

szetszwo

+1 the change looks good.

RATIS-1446. Avoid leader election for invalid conf

f89e846

szetszwo reviewed Dec 9, 2021

View reviewed changes

RATIS-1446. Avoid leader election for invalid conf(v2)

07e0a11

Xushaohong force-pushed the RATIS-1446 branch from 86e0c18 to 07e0a11 Compare December 10, 2021 06:48

szetszwo reviewed Dec 10, 2021

View reviewed changes

fix some comments

7432712

Xushaohong force-pushed the RATIS-1446 branch from b85396a to 7432712 Compare December 10, 2021 09:16

fix yieldLeaderToHigherPriorityPeer

846a3c5

szetszwo reviewed Dec 13, 2021

View reviewed changes

remove phase check

409c557

NOT_IN_CONF case should close server

5af366d

szetszwo approved these changes Dec 15, 2021

View reviewed changes

szetszwo merged commit 478749c into apache:master Dec 15, 2021

Brokenice0415 mentioned this pull request Oct 20, 2023

Example to show infinity election when perform membership change. #943

Closed

symious pushed a commit to symious/ratis that referenced this pull request Feb 27, 2024

RATIS-1446. Avoid leader election for invalid conf (apache#560)

f96b4fd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RATIS-1446. Avoid leader election for invalid conf #560

RATIS-1446. Avoid leader election for invalid conf #560

Xushaohong commented Dec 9, 2021 •

edited

Xushaohong commented Dec 9, 2021

szetszwo Dec 9, 2021

szetszwo commented Dec 9, 2021

Xushaohong commented Dec 10, 2021

szetszwo commented Dec 10, 2021

Xushaohong commented Dec 10, 2021

szetszwo left a comment

Xushaohong commented Dec 10, 2021 •

edited

szetszwo commented Dec 10, 2021

Xushaohong commented Dec 13, 2021

szetszwo commented Dec 13, 2021

szetszwo Dec 13, 2021

Xushaohong Dec 13, 2021 •

edited

Xushaohong Dec 13, 2021

szetszwo Dec 14, 2021

Xushaohong commented Dec 15, 2021

szetszwo commented Dec 15, 2021

szetszwo left a comment

RATIS-1446. Avoid leader election for invalid conf #560

RATIS-1446. Avoid leader election for invalid conf #560

Conversation

Xushaohong commented Dec 9, 2021 • edited

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Xushaohong commented Dec 9, 2021

szetszwo Dec 9, 2021

Choose a reason for hiding this comment

szetszwo commented Dec 9, 2021

Xushaohong commented Dec 10, 2021

szetszwo commented Dec 10, 2021

Xushaohong commented Dec 10, 2021

szetszwo left a comment

Choose a reason for hiding this comment

Xushaohong commented Dec 10, 2021 • edited

szetszwo commented Dec 10, 2021

Xushaohong commented Dec 13, 2021

szetszwo commented Dec 13, 2021

szetszwo Dec 13, 2021

Choose a reason for hiding this comment

Xushaohong Dec 13, 2021 • edited

Choose a reason for hiding this comment

Xushaohong Dec 13, 2021

Choose a reason for hiding this comment

szetszwo Dec 14, 2021

Choose a reason for hiding this comment

Xushaohong commented Dec 15, 2021

szetszwo commented Dec 15, 2021

szetszwo left a comment

Choose a reason for hiding this comment

Xushaohong commented Dec 9, 2021 •

edited

Xushaohong commented Dec 10, 2021 •

edited

Xushaohong Dec 13, 2021 •

edited