RATIS-2548. Stabilize timing-sensitive Ratis tests #1475
szetszwo
left a comment
@CRZbulabula, thanks for fixing the tests! Please see the inline comments.
if (conf.isSingleMode(server.getId())) {
  return true;
}
Let's do this change separately. Then, this PR changes only the test code.
if (killLeader) {
  log.info("killAndRestart leader " + leader.getId());
  killAndRestartLeader = killAndRestartServer(leader.getId(), 0, 4000, cluster, log);
}
Wait for async append replies before injecting the kill-leader restart in RaftBasicTests.
Before this change, killLeader happens at the beginning; this change moves it to the end. That makes the test easier to pass but does not fix a bug.
It is good to test killLeader before the client sends messages. So, let's not make this change?
int ret = shell.run("election", "pause", "-peers", sb.toString(), "-address",
    leader.getPeer().getAddress());
Assertions.assertEquals(0, ret);

ret = shell.run("election", "stepDown", "-peers", sb.toString());
This change is good. Could you also remove the redundant toString() calls?
int ret = shell.run("election", "pause", "-peers", sb, "-address", leader.getPeer().getAddress());
Assertions.assertEquals(0, ret);
ret = shell.run("election", "stepDown", "-peers", sb);
Assertions.assertEquals(0, ret);

  Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
} finally {
  CompletableFuture.allOf(killAndRestartFollower, killAndRestartLeader).join();
}
Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
log.info(cluster.printAllLogs());
killAndRestartFollower.join();
killAndRestartLeader.join();
Wait for restart futures before continuing to log assertions in RaftBasicTests.
You are right that we should join before printing the log.
How about we simply move cluster.printAllLogs() up? The try-finally makes the code harder to read.
Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
- log.info(cluster.printAllLogs());
killAndRestartFollower.join();
killAndRestartLeader.join();
+ log.info(cluster.printAllLogs());

} else if (asyncReplyCount.incrementAndGet() == messages.length) {
  f.complete(null);
}
CompletableFuture.allOf(asyncReplies.toArray(new CompletableFuture<?>[0])).join();
Since join() is called below, this allOf is not needed. Let's remove it.
BTW, changing
final AtomicInteger asyncReplyCount = new AtomicInteger();
final CompletableFuture<Void> f = new CompletableFuture<>();
to
final List<CompletableFuture<RaftClientReply>> asyncReplies = new ArrayList<>();
does make the code easier to understand (although the original code is also correct).
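For illustration, the list-of-futures pattern discussed in this comment could look roughly like the sketch below. It uses only java.util.concurrent; sendAsync and the class name AsyncReplyJoinSketch are hypothetical stand-ins, not the Ratis client API or the actual test code. The point is that joining each future in the list already waits for every reply, so a preceding allOf(...).join() adds nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Illustrative sketch only; the names here are assumptions, not Ratis code. */
final class AsyncReplyJoinSketch {
  private AsyncReplyJoinSketch() {}

  /** Hypothetical stand-in for an asynchronous client send. */
  static CompletableFuture<String> sendAsync(String message) {
    return CompletableFuture.supplyAsync(() -> "reply:" + message);
  }

  /** Sends every message asynchronously, then waits for each reply in order. */
  static List<String> sendAllAndJoin(String[] messages) {
    // Keep one future per message instead of a shared counter plus a single future.
    final List<CompletableFuture<String>> asyncReplies = new ArrayList<>();
    for (String m : messages) {
      asyncReplies.add(sendAsync(m));
    }

    final List<String> replies = new ArrayList<>();
    for (CompletableFuture<String> f : asyncReplies) {
      // join() blocks until this particular reply has completed,
      // so no separate CompletableFuture.allOf(...).join() is required first.
      replies.add(f.join());
    }
    return replies;
  }

  public static void main(String[] args) {
    System.out.println(sendAllAndJoin(new String[] {"m0", "m1", "m2"}));
  }
}
```

Because the replies are collected in list order, the result order matches the send order even if the futures complete out of order.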
What changes were proposed in this pull request?
This PR stabilizes several timing-sensitive tests by replacing fixed sleeps or immediate assertions with waits for the concrete condition each test needs. It also preserves the existing single-mode election semantics during the leader heartbeat-majority check for the transitional single -> HA configuration.
The changes include:
RaftBasicTests.
RaftBasicTests.
RaftLogTruncateTests.
LeaderStateImpl.checkLeadership() so a slow new peer does not make the leader step down before the existing reconfiguration test observes it.
Why are the changes needed?
These tests can fail on slower CI machines when asynchronous restart, log cleanup, state machine application, leadership transition, install snapshot progress, or single -> HA reconfiguration completes later than the fixed delay assumed by the test.
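As a rough sketch of the general idea (waiting for a concrete condition instead of assuming a fixed delay), a polling helper might look like the following. The waitFor method and the WaitForCondition class are hypothetical and written against the plain JDK; they are not the helpers this PR actually uses, and the simulated restart is only a placeholder for real cluster activity.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of Ratis: poll a condition instead of sleeping a fixed time. */
final class WaitForCondition {
  private WaitForCondition() {}

  /** Polls the condition until it holds, or throws once the timeout has elapsed. */
  static void waitFor(BooleanSupplier condition, Duration timeout, Duration pollInterval)
      throws InterruptedException, TimeoutException {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Condition not met within " + timeout);
      }
      TimeUnit.NANOSECONDS.sleep(pollInterval.toNanos());
    }
  }

  public static void main(String[] args) throws Exception {
    final AtomicBoolean restarted = new AtomicBoolean(false);

    // Simulate an asynchronous restart whose duration the test cannot predict.
    final CompletableFuture<Void> restart = CompletableFuture.runAsync(() -> {
      try {
        TimeUnit.MILLISECONDS.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      restarted.set(true);
    });

    // Wait for the condition itself, with a generous upper bound, rather than Thread.sleep(fixed).
    waitFor(restarted::get, Duration.ofSeconds(5), Duration.ofMillis(10));
    restart.join();
    System.out.println("restarted = " + restarted.get());
  }
}
```

On a fast machine the wait returns almost immediately; on a slow one it keeps polling up to the timeout, which is the property a fixed sleep lacks.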
How was this patch tested?
mvn -pl ratis-server,ratis-test -am -Dtest=TestLinearizableReadRepliedIndexWithGrpc,ElectionCommandIntegrationTest,RaftLogTruncateTests test
mvn -pl ratis-test -am -Dtest=TestRaftLogTruncateWithGrpc,TestElectionCommandIntegrationWithGrpc,TestInstallSnapshotNotificationWithGrpc,TestRaftWithGrpc test
mvn -pl ratis-test -am -Dtest=TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestElectionCommandIntegrationWithGrpc,TestInstallSnapshotNotificationWithGrpc,TestRaftLogTruncateWithGrpc,TestRaftWithGrpc test
mvn -pl ratis-test -am -Dtest=TestRaftReconfigurationWithSimulatedRpc#testKillLeaderDuringReconf test
mvn -pl ratis-test -am -Dtest=TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestRaftReconfigurationWithSimulatedRpc#testLeaderElectionWhenChangeFromSingleToHA test