raft: implement fast log rejection #11964
Conversation
For context, here is the passage from the thesis:
Thanks @hicqu! I think this needs some more work to clarify and document though. For one, I think you should state the main idea instead of referring to the Raft thesis, where the idea - in my opinion - is not presented in sufficient detail to fully explain the code. Generally I also like to see examples in the description of such improvements.
Similarly, the code itself needs to be explained better. The idea here isn't so complicated, but it's mostly inscrutable whether an index is the one to append to, or the one before, and what index the term belongs to, why a given loop has a certain exit condition etc - most of this is not your doing, but I'm just shit scared (pardon my french) to accept any PR that I am not 100% comfortable is correct. And it took me around 2 hours to page my current understanding of things back in, which is not acceptable. So, since we're trying to change the code, the burden is on us to make it clear that the change is correct, and I think the missing piece here is better documentation and naming.
@@ -138,6 +138,21 @@ func (l *raftLog) findConflict(ents []pb.Entry) uint64 {
	return 0
}

func (l *raftLog) findConflictByTerm(index uint64, term uint64) uint64 {
Add a comment: findConflictByTerm takes an (index, term) pair (indicating a conflicting log entry on a follower during an append) and finds the last index for that term (or the first index of the log, whichever is larger).
I'm not even sure that comment is correct. But you get the idea, explain the inputs and outputs and what it's for.
It appears that there are some preconditions that this method needs to hold. For instance, index <= l.lastIndex(). What about index < l.firstIndex()?

And the comment on this should mention what happens if the last index for the provided term is greater than the provided index. In that case, we return the provided index and not the last index in the log with that term. Right?

Should this method be called findLastConflictingIndexAtOrBelowTerm? Then we can define a conflicting index as one <= the provided index in the comment.

Also, s/for that term/for that term in the log l/
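For what it's worth, the behavior the suggested comment describes can be sketched as a standalone approximation. This is a hypothetical, simplified stand-in (including the `entry` type), not the actual etcd/raft implementation:

```go
package main

import "fmt"

// entry is a hypothetical stand-in for pb.Entry, carrying only what the
// sketch needs.
type entry struct {
	Index, Term uint64
}

// findConflictByTerm sketches the helper under review: given an (index,
// term) pair from a rejected append, return the largest log index that is
// <= index and whose term is <= term, falling back to the log's first
// index (or 0 for an empty log) when no such entry exists.
func findConflictByTerm(log []entry, index, term uint64) uint64 {
	for i := len(log) - 1; i >= 0; i-- {
		if log[i].Index > index {
			continue // only consider indexes at or below the hint
		}
		if log[i].Term <= term {
			return log[i].Index
		}
	}
	if len(log) > 0 {
		return log[0].Index // clamp to the first index of the log
	}
	return 0
}

func main() {
	log := []entry{{1, 1}, {2, 1}, {3, 2}, {4, 4}, {5, 4}}
	fmt.Println(findConflictByTerm(log, 5, 2)) // 3: last index with term <= 2
	fmt.Println(findConflictByTerm(log, 5, 4)) // 5: index 5 itself qualifies
}
```

Note how the fallback to the first index covers the compacted-log precondition raised above: when every surviving entry at or below the hint has a larger term, the sketch clamps rather than walking off the front of the log.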
raft/raft.go
Outdated
if m.Index < reject {
	reject = m.Index
}
logTerm, _ := r.raftLog.term(reject)
Explicitly fall back to zero. I know the go convention is to return the zero value with errors, but we should not rely on it being true.
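The explicit fallback being asked for could take this shape. A self-contained sketch with a hypothetical `term` helper standing in for `raftLog.term`, not the actual etcd code:

```go
package main

import (
	"errors"
	"fmt"
)

var errCompacted = errors.New("requested index is unavailable due to compaction")

// term is a stand-in for raftLog.term: it returns the term at the given
// index, or an error when the index has been compacted away.
func term(terms map[uint64]uint64, index uint64) (uint64, error) {
	t, ok := terms[index]
	if !ok {
		return 0, errCompacted
	}
	return t, nil
}

func main() {
	terms := map[uint64]uint64{5: 2, 6: 2}

	// Explicitly fall back to zero on error instead of relying on the
	// convention that the value half of the return is zero when err != nil.
	logTerm, err := term(terms, 3)
	if err != nil {
		logTerm = 0
	}
	fmt.Println(logTerm) // 0
}
```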
	index uint64
}{
	// append reject, return index that is not conflict
	{
Please add narration to at least a few of the test cases, similar to how I did it in the comment suggestion. As is, these tests are inscrutable for anyone who isn't willing to stare at them for 20 minutes. That is not ideal.
I'm also curious whether you've considered using data-driven testing for this (raft/testdata). That tends to be much more user-friendly IMO.
raft/raft_test.go
Outdated
s1.Append(test.logs)

n1 := newTestRaft(1, []uint64{1, 2, 3}, 10, 1, s1)
n2 := newTestRaft(2, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
Are you intentionally not appending test.logs on n2? I thought test.logs was a shared prefix and ents1 and ents2 would be possibly different suffixes. It doesn't seem to matter for this test but seems cleaner (less confusing) anyway.
raft/raft.go
Outdated
r.logger.Debugf("%x received MsgAppResp(MsgApp was rejected, lastindex: %d) from %x for index %d",
	r.id, m.RejectHint, m.From, m.Index)
r.logger.Debugf("%x received MsgAppResp(MsgApp was rejected, lastindex: %d) from %x for index %d, log term %d",
	r.id, m.RejectHint, m.From, m.Index, m.LogTerm)
if pr.MaybeDecrTo(m.Index, m.RejectHint) {
PS while we're here - the comment on MaybeDecrTo is wrong. It says that the first argument (rejected) is the index that we tried to append. But that's not true: we tried to append rejected+1 (i.e. the m.Index of the MsgApp was rejected), and it failed due to a mismatch at rejected itself.
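To make the rejected/rejected+1 relationship concrete, here is a hedged sketch of the decrement logic (an illustrative stand-in, not etcd's actual Progress.MaybeDecrTo):

```go
package main

import "fmt"

// maybeDecrTo sketches the semantics under discussion: `rejected` is the
// m.Index of the rejected MsgApp, i.e. the index at which the follower
// found a mismatch, so what actually failed was appending the entries
// starting at rejected+1. lastHint is the follower's last log index.
// It returns the new Next index for the leader to probe from.
func maybeDecrTo(next, rejected, lastHint uint64) uint64 {
	if rejected+1 != next {
		// Stale rejection for an older probe; ignore it.
		return next
	}
	// Back off to the rejected index, but no further than one past the
	// follower's last index.
	n := rejected
	if lastHint+1 < n {
		n = lastHint + 1
	}
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	// The leader probed with m.Index=10 (so Next was 11); the follower's
	// log ends at index 7, so it rejects with that hint.
	fmt.Println(maybeDecrTo(11, 10, 7)) // 8: probe from the hint next time
}
```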
@tbg thanks for your review! Here is a short list of changes in the latest commit:
I have also updated the PR description to show which cases this PR improves.
Friendly ping @tbg.
Could you take a look again? @tbg
@hicqu can you resolve the merge conflicts? I will go over it again next week.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Hi @hicqu,
Sorry to let this PR sit. It still seems like a good idea, as it should minimize probing time after leader elections and prevent long uncommitted log tails from causing problems. Are you still interested in pushing this change over the finish line?
I'll rebase this - given that Nathan and I are both engaging now and are coworkers, I think it's reasonable that the two of us just take over this PR and get it over the finish line.
Force-pushed from 3af28ac to ac6e2c8.
Rebased. I didn't address any review comments yet but it should pass tests.
This commit creates a new probe_and_replicate.txt interactive test. The test creates a complete Raft log configuration and demonstrates how a leader probes and replicates to each of its followers. The log configuration constructed is identical to the one present in Figure 7 of the Raft paper (https://raft.github.io/raft.pdf), which looks like:

```
     1  2  3  4  5  6  7  8  9 10 11 12
n1: [1][1][1][4][4][5][5][6][6][6]
n2: [1][1][1][4][4][5][5][6][6]
n3: [1][1][1][4]
n4: [1][1][1][4][4][5][5][6][6][6][6]
n5: [1][1][1][4][4][5][5][6][7][7][7][7]
n6: [1][1][1][4][4][4][4]
n7: [1][1][1][2][2][2][3][3][3][3][3]
```

Once in this state, we then elect node 1 as the leader and stabilize the entire Raft group. This demonstrates how a newly elected leader probes for matching indexes, overwrites conflicting entries, and catches up all followers. This will be useful to demonstrate the impact of more efficient probing behavior.
Signed-off-by: qupeng <qupeng@pingcap.com>
@tbg I addressed my suggestions from earlier and added some more testing, including a nice interactive test that demonstrates the improvements of this change on the log configuration in Figure 7 of the Raft paper. I'm not a maintainer of this repo, so I wasn't able to push changes directly to this PR. So they're over in https://github.com/etcd-io/etcd/compare/master...nvanbenschoten:fast-log-rejection?expand=1.
Force-pushed from ac6e2c8 to 6828517.
@nvanbenschoten I added an extra commit to explain things better (fallout from explaining it to myself), PTAL
Force-pushed from 9e4b54a to 30c0ea5.
Nice improvements to the comments. They make this much easier to understand.
Force-pushed from 30c0ea5 to ec4ef67.
- Add a large detailed comment about the use and necessity of both the follower and leader probing optimization
- Fix the log message in stepLeader that previously mixed up the log term for the rejection and the index of the append
- Improve the test via subtests
- Add some verbiage in findConflictByTerm around first index
Force-pushed from ec4ef67 to c1e8d3a.
I'm seeing the functional tests fail here for the x-th time, last failure here. These logs make it really hard to parse what's going on (what test is this even running?) but it passes locally:
I would blame it on this target being flaky, but I see that while master is pretty red, the functional suite hasn't failed in any of the last five iterations at the time of writing: https://travis-ci.com/github/etcd-io/etcd/branches. @ptabor do you have any insights here?
Managed to reproduce locally (4/5 passed, then this):
I managed to reproduce a similar flake on
Going to merge this PR then, as it is unlikely that the failure is causally related to this PR.
60581: go.mod: bump etcd/raft to pick up probing improvements r=nvanbenschoten a=tbg

This picks up this PR: etcd-io/etcd#11964. There were no other code changes affecting raft.

Release note (bug fix): fixed a deficiency in the replication layer that could result in ranges becoming unavailable for prolonged periods of time (hours) when a write burst occurred under system overload. While unavailable, the range status page for the affected range would show a last index much larger than the committed index and no movement in these indexes on a quorum of the replicas. Note that this should be distinguished from the case in which enough replicas are offline to constitute a loss of quorum, where the replication layer cannot make progress due to the loss of quorum itself.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
The fast log rejection algorithm is described in the Raft paper; it reduces network round trips when a peer with fewer log entries becomes leader.

There are two cases that this PR improves:

A. The conflict point can be found quickly by the leader: when the leader appends entries from index=120, the follower rejects them with reject=110, term=4. The leader can then find the conflict position itself and append entries from index=101.

B. The conflict point can be found quickly by the follower: when the leader appends entries from index=105, the follower rejects them with reject=100, term=2 if it can find the conflict position itself. The leader then continues to append entries from index=101.

Signed-off-by: qupeng qupeng@pingcap.com
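Case A can be illustrated with a small self-contained sketch. The logs and helper below are hypothetical simplifications (terms stored per index, not the actual etcd/raft structures); they show how the leader uses the follower's (reject, term) hint to skip back over an entire term in one step instead of probing one index per round trip:

```go
package main

import "fmt"

// terms[i] holds the term of the entry at index i+1 (1-based log indexes).
// These logs are illustrative only.
var (
	leaderTerms   = []uint64{1, 1, 2, 2, 5, 5, 5} // leader log, indexes 1..7
	followerTerms = []uint64{1, 1, 2, 2, 4, 4}    // follower log, diverges at index 5
)

// termAt returns the term at a 1-based index, or 0 if out of range.
func termAt(terms []uint64, index uint64) uint64 {
	if index == 0 || index > uint64(len(terms)) {
		return 0
	}
	return terms[index-1]
}

// findConflictByTerm walks back to the largest index <= idx whose term
// is <= term.
func findConflictByTerm(terms []uint64, idx, term uint64) uint64 {
	for idx > 0 && termAt(terms, idx) > term {
		idx--
	}
	return idx
}

func main() {
	// The leader tries to append from index 7; the follower rejects,
	// hinting its last index (6) and the term of the entry there (4).
	rejectHint := uint64(6)
	logTerm := termAt(followerTerms, rejectHint)

	// Without the hint, the leader would retry at 6, then 5, ... costing
	// one round trip each. With it, the leader jumps directly past all of
	// its own entries whose term exceeds the hinted term.
	next := findConflictByTerm(leaderTerms, rejectHint, logTerm)
	fmt.Println(next) // 4: the leader resumes probing here in a single step
}
```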
Please read https://github.com/etcd-io/etcd/blob/master/CONTRIBUTING.md#contribution-flow.