Skip to content

RATIS-1305. Leader stuck in infinite install snapshot cycle when logs have been purged#420

Merged
szetszwo merged 7 commits intoapache:masterfrom
errose28:RATIS-1305
Feb 12, 2021
Merged

RATIS-1305. Leader stuck in infinite install snapshot cycle when logs have been purged#420
szetszwo merged 7 commits intoapache:masterfrom
errose28:RATIS-1305

Conversation

@errose28
Copy link
Copy Markdown
Contributor

@errose28 errose28 commented Feb 5, 2021

What changes were proposed in this pull request?

After RATIS-1241, a leader with no logs will send install snapshot notifications to followers on every heartbeat, regardless of whether or not they are behind. Fix this so that followers only get install snapshot notifications when they are actually behind.

What is the link to the Apache JIRA

RATIS-1305

How was this patch tested?

Integration test added. The test still has an issue that is marked with a TODO. Leaving this PR as a draft while I fix this.

Adding followers in the test still does not pass, but that could be a test specific issue.
Adding peers to cluster still fails.
@errose28
Copy link
Copy Markdown
Contributor Author

errose28 commented Feb 5, 2021

cc @szetszwo @runzhiwang for review and any ideas on why the leader enters a closed state and adding peers to the configuration fails in the integration test.

@errose28
Copy link
Copy Markdown
Contributor Author

errose28 commented Feb 8, 2021

The added integration test InstallSnapshotNotificationTests#testInstallSnapshotNotificationCount shows two log lines from the leader (s2 in this example):

[s2@group-DF2C3007074F-SegmentedRaftLogWorker] INFO  RaftServer$Division (ServerState.java:close(405)) - s2@group-DF2C3007074F: closes. applyIndex: 20
[s2@group-DF2C3007074F-SegmentedRaftLogWorker] INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(236)) - s2@group-DF2C3007074F-SegmentedRaftLogWorker close()

When it tries to add two new nodes to the cluster later in the test, RaftServerTestUtil#waitAndCheckNewConf fails because s2 thinks it is the leader, but it is closed so the rest of the cluster has elected a new leader, and the check deems the configuration unstable. It looks like s2 could have closed because of the java.lang.IllegalStateException: lastWrittenIndex == 19, entry == term: 2 index: 21 which occurred on s2 when trying to push the new configuration. This is probably due to how the logs are being truncated and purged, but I have not found the exact cause.

@szetszwo
Copy link
Copy Markdown
Contributor

@errose28 , I just have tried the test. The two new servers, s3 and s4, have failed to start due to "Unexpected gap in segments" as shown below. It seems the changes might have some bug.

2021-02-10 15:32:49,699 [s4@group-9F91CFE6147C-StateMachineUpdater] ERROR impl.StateMachineUpdater (StateMachineUpdater.java:run(189)) - s4@group-9F91CFE6147C-StateMachineUpdater caught a Throwable.
java.lang.IllegalStateException: Unexpected gap in segments: binarySearch(20) returns -1, segments=[log-21_21]
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache$LogSegmentList.purge(SegmentedRaftLogCache.java:340)
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.purge(SegmentedRaftLogCache.java:559)
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.purgeImpl(SegmentedRaftLog.java:362)
	at org.apache.ratis.server.raftlog.RaftLogBase.purge(RaftLogBase.java:307)
	at org.apache.ratis.server.impl.StateMachineUpdater.takeSnapshot(StateMachineUpdater.java:292)
	at org.apache.ratis.server.impl.StateMachineUpdater.checkAndTakeSnapshot(StateMachineUpdater.java:255)
	at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:178)
	at java.lang.Thread.run(Thread.java:748)
2021-02-10 15:32:49,699 [s3@group-9F91CFE6147C-StateMachineUpdater] ERROR impl.StateMachineUpdater (StateMachineUpdater.java:run(189)) - s3@group-9F91CFE6147C-StateMachineUpdater caught a Throwable.
java.lang.IllegalStateException: Unexpected gap in segments: binarySearch(20) returns -1, segments=[log-21_21]
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache$LogSegmentList.purge(SegmentedRaftLogCache.java:340)
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.purge(SegmentedRaftLogCache.java:559)
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.purgeImpl(SegmentedRaftLog.java:362)
	at org.apache.ratis.server.raftlog.RaftLogBase.purge(RaftLogBase.java:307)
	at org.apache.ratis.server.impl.StateMachineUpdater.takeSnapshot(StateMachineUpdater.java:292)
	at org.apache.ratis.server.impl.StateMachineUpdater.checkAndTakeSnapshot(StateMachineUpdater.java:255)
	at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:178)
	at java.lang.Thread.run(Thread.java:748)

@errose28 errose28 marked this pull request as ready for review February 10, 2021 19:06
@errose28
Copy link
Copy Markdown
Contributor Author

New integration test is passing. Ready for review. Current CI failure looks unrelated.

@szetszwo szetszwo closed this Feb 11, 2021
@szetszwo szetszwo reopened this Feb 11, 2021
Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good. Thanks for the update.

@szetszwo
Copy link
Copy Markdown
Contributor

Just have reopened this for triggering a new build. Will wait for it.

@szetszwo szetszwo closed this Feb 11, 2021
@szetszwo szetszwo reopened this Feb 11, 2021
@szetszwo szetszwo closed this Feb 11, 2021
@szetszwo szetszwo reopened this Feb 11, 2021
@szetszwo szetszwo merged commit 371fbfe into apache:master Feb 12, 2021
symious pushed a commit to symious/ratis that referenced this pull request Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants