RATIS-2548. Stabilize timing-sensitive Ratis tests #1475

Open
CRZbulabula wants to merge 3 commits into apache:master from CRZbulabula:ratis-2548

CRZbulabula (Contributor) commented May 29, 2026

What changes were proposed in this pull request?

This PR stabilizes several timing-sensitive tests by replacing fixed sleeps or immediate assertions with waits for the concrete condition each test needs. It also preserves the existing single-mode election semantics during the leader heartbeat-majority check for the transitional single -> HA configuration.

The changes include:

  • Wait for restart futures before continuing to log assertions in RaftBasicTests.
  • Wait for async append replies before injecting the kill-leader restart in RaftBasicTests.
  • Wait for commit index / state machine count in linearizable read tests instead of relying on fixed sleeps.
  • Wait for the transaction context map to become empty in RaftLogTruncateTests.
  • Pause the current leader before election stepDown and wait until a different leader is elected.
  • Allow more time for the install-snapshot follower next-index assertion on slow CI.
  • Treat transitional single -> HA configurations as single mode in LeaderStateImpl.checkLeadership() so a slow new peer does not make the leader step down before the existing reconfiguration test observes it.
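
The common pattern behind these changes, waiting for a concrete condition instead of sleeping for a fixed time, can be sketched roughly as follows. This is a minimal illustration only: `ConditionWaiter` and `waitFor` are hypothetical names invented for this example and are not the helpers used in the PR or part of the Ratis API.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Ratis): polls a condition instead of sleeping a fixed time. */
final class ConditionWaiter {
  private ConditionWaiter() {}

  /** Returns true as soon as the condition holds, or false if the timeout elapses first. */
  static boolean waitFor(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
  }

  public static void main(String[] args) {
    // Stand-in for asynchronous work, such as a state machine applying entries.
    final AtomicInteger applied = new AtomicInteger();
    final Thread worker = new Thread(() -> {
      for (int i = 0; i < 5; i++) {
        try {
          Thread.sleep(20);
        } catch (InterruptedException e) {
          return;
        }
        applied.incrementAndGet();
      }
    });
    worker.start();

    // The test proceeds once the count is reached, however long that takes (up to the timeout).
    final boolean ok = waitFor(() -> applied.get() == 5, Duration.ofSeconds(5), Duration.ofMillis(10));
    System.out.println("condition met: " + ok);
  }
}
```

A fast machine returns from such a wait almost immediately, while a slow CI machine gets the full timeout, so the test no longer depends on a guessed delay.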

Why are the changes needed?

These tests can fail on slower CI machines when asynchronous restart, log cleanup, state machine application, leadership transition, install snapshot progress, or single -> HA reconfiguration completes later than the fixed delay assumed by the test.

How was this patch tested?

mvn -pl ratis-server,ratis-test -am -Dtest=TestLinearizableReadRepliedIndexWithGrpc,ElectionCommandIntegrationTest,RaftLogTruncateTests test

mvn -pl ratis-test -am -Dtest=TestRaftLogTruncateWithGrpc,TestElectionCommandIntegrationWithGrpc,TestInstallSnapshotNotificationWithGrpc,TestRaftWithGrpc test

mvn -pl ratis-test -am -Dtest=TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestElectionCommandIntegrationWithGrpc,TestInstallSnapshotNotificationWithGrpc,TestRaftLogTruncateWithGrpc,TestRaftWithGrpc test

mvn -pl ratis-test -am -Dtest=TestRaftReconfigurationWithSimulatedRpc#testKillLeaderDuringReconf test

mvn -pl ratis-test -am -Dtest=TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestRaftReconfigurationWithSimulatedRpc#testLeaderElectionWhenChangeFromSingleToHA test

szetszwo (Contributor) left a comment

@CRZbulabula , thanks for fixing the tests! Please see the comments inlined.

Comment on lines +1157 to +1159
    if (conf.isSingleMode(server.getId())) {
      return true;
    }

szetszwo (Contributor), May 29, 2026

Let's do this change separately. Then, this PR changes only the test code.

Comment on lines +157 to 160
    if (killLeader) {
      log.info("killAndRestart leader " + leader.getId());
      killAndRestartLeader = killAndRestartServer(leader.getId(), 0, 4000, cluster, log);
    }

Wait for async append replies before injecting the kill-leader restart in RaftBasicTests.

Before this change, killLeader happened at the beginning; this change moves it to the end. That makes the test easier to pass but does not fix a bug.

It is good to test killLeader before the client sends messages. So, let's not make this change?

Comment on lines +156 to +160
    int ret = shell.run("election", "pause", "-peers", sb.toString(), "-address",
        leader.getPeer().getAddress());
    Assertions.assertEquals(0, ret);

    ret = shell.run("election", "stepDown", "-peers", sb.toString());

This change is good. Could you also remove the redundant toString() calls?

    int ret = shell.run("election", "pause", "-peers", sb, "-address", leader.getPeer().getAddress());
    Assertions.assertEquals(0, ret);

    ret = shell.run("election", "stepDown", "-peers", sb);
    Assertions.assertEquals(0, ret);

Comment on lines +161 to -169
Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
} finally {
CompletableFuture.allOf(killAndRestartFollower, killAndRestartLeader).join();
}
Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
log.info(cluster.printAllLogs());
killAndRestartFollower.join();
killAndRestartLeader.join();

Wait for restart futures before continuing to log assertions in RaftBasicTests.

You are right that we should join before printing the log.

How about we simply move cluster.printAllLogs() up? The try-finally makes the code harder to read.

     Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
-    log.info(cluster.printAllLogs());
     killAndRestartFollower.join();
     killAndRestartLeader.join();
+    log.info(cluster.printAllLogs());
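
The ordering concern in this suggestion, joining the restart futures before printing the logs, can be illustrated with a small standalone sketch. The class and task names below are placeholders invented for this example, and the sleeps merely simulate slow restarts; none of it is Ratis code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Toy illustration only; the task names are placeholders, not Ratis APIs. */
final class JoinBeforePrint {
  private JoinBeforePrint() {}

  /** Simulates a slow restart that records an event when it finishes. */
  static CompletableFuture<Void> restart(String name, long millis, List<String> events) {
    return CompletableFuture.runAsync(() -> {
      try {
        Thread.sleep(millis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
      events.add(name + " restarted");
    });
  }

  static List<String> run() {
    final List<String> events = new CopyOnWriteArrayList<>();
    final CompletableFuture<Void> follower = restart("follower", 40, events);
    final CompletableFuture<Void> leader = restart("leader", 80, events);

    // Join first, so that both restarts have finished ...
    follower.join();
    leader.join();
    // ... and only then record the step that stands in for printing the logs.
    events.add("print logs");
    return events;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Because both `join()` calls return before the final step runs, the "print logs" event is always last, regardless of which restart finishes first.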

} else if (asyncReplyCount.incrementAndGet() == messages.length) {
f.complete(null);
}
CompletableFuture.allOf(asyncReplies.toArray(new CompletableFuture<?>[0])).join();

Since join() is called below, this allOf is not needed. Let's remove it.

BTW, changing

      final AtomicInteger asyncReplyCount = new AtomicInteger();
      final CompletableFuture<Void> f = new CompletableFuture<>();

to

      final List<CompletableFuture<RaftClientReply>> asyncReplies = new ArrayList<>();

does make the code easier to understand (although the original code is also correct).
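
For comparison, the two bookkeeping styles discussed here can be sketched side by side. This is a simplified illustration: `sendAsync` is a hypothetical stand-in for an asynchronous client call and returns an `Integer` rather than a `RaftClientReply`, and the class is not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified comparison; sendAsync is a hypothetical stand-in, not a Ratis client call. */
final class ReplyTracking {
  private ReplyTracking() {}

  /** Pretend async send that completes with the message length. */
  static CompletableFuture<Integer> sendAsync(String message) {
    return CompletableFuture.supplyAsync(message::length);
  }

  /** Counter style: one shared future, completed by whichever reply arrives last. */
  static int withCounter(String[] messages) {
    final AtomicInteger replyCount = new AtomicInteger();
    final CompletableFuture<Void> done = new CompletableFuture<>();
    if (messages.length == 0) {
      done.complete(null);
    }
    for (String m : messages) {
      sendAsync(m).whenComplete((reply, e) -> {
        if (e != null) {
          done.completeExceptionally(e);
        } else if (replyCount.incrementAndGet() == messages.length) {
          done.complete(null);
        }
      });
    }
    done.join();
    return replyCount.get();
  }

  /** List style: keep every reply future and join each one; no allOf is required. */
  static int withList(String[] messages) {
    final List<CompletableFuture<Integer>> replies = new ArrayList<>();
    for (String m : messages) {
      replies.add(sendAsync(m));
    }
    int completed = 0;
    for (CompletableFuture<Integer> reply : replies) {
      reply.join();
      completed++;
    }
    return completed;
  }

  public static void main(String[] args) {
    final String[] messages = {"m0", "m1", "m2"};
    System.out.println(withCounter(messages) + " " + withList(messages));
  }
}
```

Both variants wait for every reply. The list version keeps each future visible to the caller, which is why joining the elements individually already makes a separate `allOf(...).join()` redundant.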

2 participants