-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure leader always commits when replicated to quorum #8357
Conversation
Code freeze is next Monday - @oleschoenburg or @Zelldon can one of you review and take this over so it can make it into 1.3.0? Thanks! Let me know if you don't have time, Roman or me could also look into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @deepthidevaki for your fix and work 🎖️
I have some comments and questions, @oleschoenburg could you also please review this PR and than we can discuss them maybe you have an answer or different opinion. Afterwards we could adjust the PR and merge it :) What do you think ?
atomix/cluster/src/test/java/io/atomix/raft/ControllableRaftContexts.java
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/protocol/ControllableRaftServerProtocol.java
Outdated
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/protocol/ControllableRaftServerProtocol.java
Outdated
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/ControllableRaftContexts.java
Outdated
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/RandomizedRaftTest.java
Outdated
Show resolved
Hide resolved
atomix/cluster/src/test/java/io/atomix/raft/RandomizedRaftTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix itself looks great but I think some of the test code could be improved. Nothing blocking from my side though, so I'm approving.
atomix/cluster/src/test/java/io/atomix/raft/protocol/ControllableRaftServerProtocol.java
Outdated
Show resolved
Hide resolved
boolean hasCommittedAllEntries() { | ||
return raftServers.values().stream() | ||
.noneMatch( | ||
s -> { | ||
final var lastCommittedEntry = getLastCommittedEntry(s); | ||
final var lastUncommittedEntry = getLastUncommittedEntry(s); | ||
|
||
return lastUncommittedEntry != null | ||
&& !lastUncommittedEntry.equals(lastCommittedEntry); | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔧 I found this method to be a bit difficult to understand, maybe we could simplify?
boolean hasCommittedAllEntries() {
return raftServers.values().stream().allMatch(s -> getLastUncommittedEntry(s) == getLastCommittedEntry(s));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ==
doesn't work here because each reader creates a different object for the same entry. But I will use allMatch
instead of noneMatch
.
Thanks @oleschoenburg @Zelldon for your review. I will apply your comments and try to get it merge this before code-freeze. |
@Zelldon @oleschoenburg I have updated the PR. Please take a look again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @deepthidevaki 🎖️
bors merge |
bors r- |
Canceled. |
Raft expects the messaging layer to timeout the requests. In our mock implementation of the messaging layer, we did not implement the timeout correctly. As a result the requests are pending in raft. This will break some of the liveness guarantees. Ideally raft should timeout requests itself. But since that is a bigger change, we will just fix the test setup to mimic the actual implementation.
This is similar to the correctness test that already exists. First we execute a random sequence of operation, which may contain "failures" like messages are lost or delayed. Then we execute a fixed sequence of operations without any "failures". The idea is that if there are no message lost, then eventually a leader should be elected, and eventually all entries are replicated to all replicas, and all entries are committed.
If a previous ack was lost, and the followers have already received the latest logs, then the leader will send only heartbeats. In this case, previously, the leader did not commit because it did not update the commit index when an ack for heartbeat is received. This can cause problems because as long as new entries are not written the existing last few entries will never get committed.
64f3fab
to
3e4556c
Compare
bors merge |
Build succeeded: |
Successfully created backport PR #8434 for |
Successfully created backport PR #8435 for |
8411: Release 1.1.8 r=npepinpe a=npepinpe ## Description This PR includes improvements/fixes to the release pipeline, as well as the release commits preparing for the next patch release. The `.gocompat.json` file is expected to be modified since the ordering changes, but that's a unfortunate side effect of the tool. It's still the same, and is checked by the pipeline (both release and test) beforehand. 8434: [Backport stable/1.1] Ensure leader always commits when replicated to quorum r=deepthidevaki a=github-actions[bot] # Description Backport of #8357 to `stable/1.1`. closes #8324 Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com> Co-authored-by: camunda-jenkins <ci@camunda.com> Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Description
Related issues
closes #8324
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/0.25
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation: