
raft: implement fast log rejection #11964

Merged — 3 commits merged from hicqu:fast-log-rejection into etcd-io:master on Feb 15, 2021

Conversation

@hicqu (Contributor) commented Jun 2, 2020

The fast log rejection algorithm is described in the Raft paper; it can reduce network round trips when a peer with fewer log entries becomes leader.

There are two cases that this PR improves:

A. The conflict point can be found by the leader quickly:

leader log   : (idx=100,term=2)(idx=101,term=5)[...](idx=120,term=5)
follower logs: (idx=100,term=2)(idx=101,term=4)[...](idx=110,term=4)

When the leader appends entries starting from index=120, the follower will reject it with reject=110, term=4. With that hint the leader can find the conflict position and then append entries from index=101.

B. The conflict point can be found by the follower quickly:

leader logs  : (idx=100,term=2)[...](idx=105,term=2)
follower logs: (idx=100,term=2)(idx=101,term=2)(idx=102,term=4)

When the leader appends entries starting from index=105, the follower will reject it with reject=100, term=2 if it can find the conflict position itself. After that the leader can continue to append entries from index=101.
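To make the mechanics concrete, here is a small self-contained Go sketch (a toy model for illustration only, not the code in this PR; the slice-of-terms log representation and the helpers `term` and `findConflictByTerm` are invented for this example) that reproduces case A: the follower hints (index=110, term=4), and the leader uses that hint to jump its probe back to index 100 in a single step, so the next append starts at 101.

```go
package main

import "fmt"

// term returns the term of entry `index` in a toy log modeled as a slice of
// terms, where entry i (1-based) has term log[i-1]. Out-of-range indexes
// (including index 0) are reported as term 0.
func term(log []uint64, index uint64) uint64 {
	if index == 0 || index > uint64(len(log)) {
		return 0
	}
	return log[index-1]
}

// findConflictByTerm walks `index` downwards until it reaches an entry whose
// term is <= t (or it falls off the front of the log). Both the follower (to
// compute its hint) and the leader (to consume the hint) can use this shape
// of helper.
func findConflictByTerm(log []uint64, index, t uint64) uint64 {
	for index > 0 && term(log, index) > t {
		index--
	}
	return index
}

func main() {
	// Case A from the description: entries 1-99 have term 1, entry 100 has
	// term 2; the leader then has term 5 at 101-120, the follower term 4 at
	// 101-110.
	leader := make([]uint64, 120)
	follower := make([]uint64, 110)
	for i := range leader {
		leader[i] = 1
	}
	copy(follower, leader)
	leader[99], follower[99] = 2, 2
	for i := 100; i < 120; i++ {
		leader[i] = 5
	}
	for i := 100; i < 110; i++ {
		follower[i] = 4
	}

	// The leader probes at index 120. The follower doesn't have that entry,
	// so it rejects and hints with its last index (110) and that entry's
	// term (4).
	rejectHint := uint64(110)
	hintTerm := term(follower, rejectHint)

	// The leader skips all of its own entries with a term above the hinted
	// term in one step instead of probing one index per round trip.
	nextProbe := findConflictByTerm(leader, rejectHint, hintTerm)
	fmt.Printf("hint=(%d, %d) -> leader probes at %d, appends from %d\n",
		rejectHint, hintTerm, nextProbe, nextProbe+1) // 100, appends from 101
}
```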

Signed-off-by: qupeng <qupeng@pingcap.com>

Please read https://github.com/etcd-io/etcd/blob/master/CONTRIBUTING.md#contribution-flow.

@gyuho (Contributor) commented Jun 2, 2020

/cc @bdarnell @tbg

@hicqu (Contributor, Author) commented Jun 5, 2020

Friendly ping @bdarnell @tbg .

@tbg (Contributor) commented Jun 5, 2020

For context, here is the passage from the thesis:

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs.
For example, when rejecting an AppendEntries request, the follower can include the term of the
conflicting entry and the first index it stores for that term. With this information, the leader can
decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC
will be required for each term with conflicting entries, rather than one RPC per entry.
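One common reading of that passage, as a hedged sketch (the helper `leaderLastIndexOfTerm` and all names here are invented for illustration; this PR's actual hint handling differs in its details):

```go
// nextIndexAfterReject sketches the thesis optimization: the follower's
// rejection carries the term of its conflicting entry (conflictTerm) and the
// first index it stores for that term (conflictFirstIndex). The leader can
// then bypass a whole term's worth of conflicting entries per RPC instead of
// decrementing nextIndex one entry at a time.
func nextIndexAfterReject(conflictTerm, conflictFirstIndex uint64,
	leaderLastIndexOfTerm func(term uint64) (uint64, bool)) uint64 {
	if idx, ok := leaderLastIndexOfTerm(conflictTerm); ok {
		// The leader also has entries from conflictTerm: probe just past its
		// last entry of that term.
		return idx + 1
	}
	// The leader has no entries from conflictTerm: skip all of the
	// follower's entries for that term in one step.
	return conflictFirstIndex
}
```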

@tbg (Contributor) left a review:

Thanks @hicqu! I think this needs some more work to clarify and document though. For one, I think you should state the main idea instead of referring to the Raft thesis, where the idea - in my opinion - is not presented in sufficient detail to fully explain the code. Generally I also like to see examples in the description of such improvements.
Similarly, the code itself needs to be explained better. The idea here isn't so complicated, but it's mostly inscrutable whether an index is the one to append to or the one before, which index the term belongs to, why a given loop has a certain exit condition, etc. Most of this is not your doing, but I'm just shit scared (pardon my French) to accept any PR that I am not 100% comfortable is correct. And it took me around 2 hours to page my current understanding of things back in, which is not acceptable. So, since we're trying to change the code, the burden is on us to make it clear that the change is correct, and I think the missing piece here is better documentation and naming.

```
@@ -138,6 +138,21 @@ func (l *raftLog) findConflict(ents []pb.Entry) uint64 {
	return 0
}

func (l *raftLog) findConflictByTerm(index uint64, term uint64) uint64 {
```
@tbg (Contributor):

Add a comment:

findConflictByTerm takes an (index, term) pair (indicating a conflicting log entry on a follower during an append) and finds the last index for that term (or the first index of the log, whichever is larger).

@tbg (Contributor): I'm not even sure that comment is correct. But you get the idea: explain the inputs and outputs and what it's for.

@nvanbenschoten (Contributor):

It appears that there are some preconditions that this method needs to hold. For instance, index <= l.lastIndex(). What about index < l.firstIndex()?

And the comment on this should mention what happens if the last index for the provided term is greater than the provided index. In that case, we return the provided index and not the last index in the log with that term. Right?

Should this method be called findLastConflictingIndexAtOrBelowTerm? Then we can define a conflicting index as one <= the provided index in the comment.

Also, s/for that term/for that term in the log l/
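For reference, a minimal sketch of a `findConflictByTerm` that addresses the points above (clamping an out-of-range index and stopping once the scan drops below the log's first entry); this mirrors the direction the review settled on, but is not necessarily the exact code that landed:

```go
// findConflictByTerm returns the largest index <= the given index in log l
// whose term is <= the given term. Callers are expected to pass
// index <= l.lastIndex(); if they don't, the index is returned unchanged.
// If the scan walks below the log's first index, term() starts returning an
// error and the scan stops there.
func (l *raftLog) findConflictByTerm(index uint64, term uint64) uint64 {
	if li := l.lastIndex(); index > li {
		l.logger.Warningf("index(%d) is out of range [0, lastIndex(%d)] in findConflictByTerm",
			index, li)
		return index
	}
	for {
		logTerm, err := l.term(index)
		if logTerm <= term || err != nil {
			break
		}
		index--
	}
	return index
}
```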

raft/raft.go (outdated):
```
if m.Index < reject {
	reject = m.Index
}
logTerm, _ := r.raftLog.term(reject)
```
@tbg (Contributor): Explicitly fall back to zero. I know the Go convention is to return the zero value with errors, but we should not rely on it being true.
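In other words, something along these lines (a sketch of the requested change, not the exact diff):

```go
logTerm, err := r.raftLog.term(reject)
if err != nil {
	// The entry may have been compacted away or be out of range; fall back
	// to zero explicitly instead of relying on term() returning the zero
	// value alongside the error.
	logTerm = 0
}
```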

```
	index uint64
}{
	// append reject, return index that is not conflict
	{
```
@tbg (Contributor):

Please add narration to at least a few of the test cases, similar to how I did it in the comment suggestion. As is, these tests are inscrutable for anyone who isn't willing to stare at them for 20 minutes. That is not ideal.
I'm also curious whether you've considered using env-driven testing for this (raft/testdata). That tends to be much more user-friendly IMO.

```
s1.Append(test.logs)

n1 := newTestRaft(1, []uint64{1, 2, 3}, 10, 1, s1)
n2 := newTestRaft(2, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
```
@tbg (Contributor): Are you intentionally not appending test.logs on n2? I thought test.logs was a shared prefix and ents1 and ents2 would possibly be different suffixes. It doesn't seem to matter for this test, but it seems cleaner (less confusing) anyway.

raft/raft.go (outdated):
```
r.logger.Debugf("%x received MsgAppResp(MsgApp was rejected, lastindex: %d) from %x for index %d",
	r.id, m.RejectHint, m.From, m.Index)
r.logger.Debugf("%x received MsgAppResp(MsgApp was rejected, lastindex: %d) from %x for index %d, log term %d",
	r.id, m.RejectHint, m.From, m.Index, m.LogTerm)
if pr.MaybeDecrTo(m.Index, m.RejectHint) {
```
@tbg (Contributor): PS while we're here: the comment on MaybeDecrTo is wrong. It says that the first argument (rejected) is the index that we tried to append. But that's not true; we tried to append rejected+1 (i.e. rejected is the m.Index of the MsgApp that was rejected), and it failed due to a mismatch at rejected itself.
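One possible corrected wording, as a sketch (the parameter names assumed here may not match what was eventually committed):

```go
// MaybeDecrTo adjusts the Progress on receipt of a rejected MsgApp. The
// first argument is the index the rejected MsgApp carried (m.Index), i.e.
// the index of the entry immediately preceding the ones we tried to append;
// the append failed because of a mismatch at that index itself. The second
// argument is the follower's hint about where the leader should probe next.
func (pr *Progress) MaybeDecrTo(rejected, probeHint uint64) bool {
	// ...
}
```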

@hicqu (Contributor, Author) commented Jun 8, 2020

@tbg thanks for your review! Here is a short list of changes in the latest commit:

  1. Document findConflictByTerm correctly;
  2. Comment logTerm when it's filled into MsgAppResp and read out from MsgAppResp;
  3. Simplify test cases;
  4. Document logTerm in proto file.

I have also updated the PR description to show which cases can be improved by this PR.

@hicqu (Contributor, Author) commented Jun 11, 2020

Friendly ping @tbg .

@hicqu (Contributor, Author) commented Jun 19, 2020

Could you take a look again? @tbg

@xiang90 (Contributor) commented Jul 26, 2020

@hicqu can you resolve the merge conflicts? I will go over it again next week.

stale bot commented Oct 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Oct 24, 2020
stale bot closed this Nov 14, 2020
@nvanbenschoten (Contributor) left a review:

Hi @hicqu,

Sorry to let this PR sit. It still seems like a good idea, as it should minimize probing time after leader elections and prevent long uncommitted log tails from causing problems. Are you still interested in pushing this change over the finish line?

@ptabor reopened this Feb 10, 2021
stale bot removed the stale label Feb 10, 2021
@tbg (Contributor) commented Feb 10, 2021

I'll rebase this - given that Nathan and I are both engaging now and are coworkers, I think it's reasonable that the two of us just take over this PR and get it over the finish line.

@tbg (Contributor) commented Feb 10, 2021

Rebased. I didn't address any review comments yet but it should pass tests.

nvanbenschoten and others added 2 commits February 10, 2021 15:02
This commit creates a new probe_and_replicate.txt interactive test. The
test creates a complete Raft log configuration and demonstrates how a
leader probes and replicates to each of its followers. The log
configuration constructed is identical to the one present in Figure 7 of
the raft paper (https://raft.github.io/raft.pdf), which looks like:

```
     1  2  3  4  5  6  7  8  9  10 11 12
n1: [1][1][1][4][4][5][5][6][6][6]
n2: [1][1][1][4][4][5][5][6][6]
n3: [1][1][1][4]
n4: [1][1][1][4][4][5][5][6][6][6][6]
n5: [1][1][1][4][4][5][5][6][7][7][7][7]
n6: [1][1][1][4][4][4][4]
n7: [1][1][1][2][2][2][3][3][3][3][3]
```

Once in this state, we then elect node 1 as the leader and stabilize the
entire raft group. This demonstrates how a newly elected leader probes
for matching indexes, overwrites conflicting entries, and catches up all
followers.

This will be useful to demonstrate the impact of more efficient probing
behavior.
Signed-off-by: qupeng <qupeng@pingcap.com>
@nvanbenschoten (Contributor) commented:

@tbg I addressed my suggestions from earlier and added some more testing, including a nice interactive test that demonstrates the improvements of this change on the log configuration in Figure 7 of the raft paper:

[screenshot: interactive test output]

I'm not a maintainer of this repo, so I wasn't able to push changes directly to this PR. So they're over in https://github.com/etcd-io/etcd/compare/master...nvanbenschoten:fast-log-rejection?expand=1.

@tbg (Contributor) commented Feb 11, 2021

@nvanbenschoten I added an extra commit to explain things better (fallout from explaining it to myself), PTAL

@nvanbenschoten (Contributor) left a review:

Nice improvements to the comments. They make this much easier to understand.

- Add a large detailed comment about the use and necessity of
  both the follower and leader probing optimizations
- Fix the log message in stepLeader that previously mixed up the
  log term for the rejection and the index of the append
- Improve the test via subtests
- Add some verbiage in findConflictByTerm around first index
@tbg (Contributor) commented Feb 15, 2021

I'm seeing the functional tests fail here for the x-th time, last failure here. These logs make it really hard to parse what's going on (what test is this even running?) but it passes locally:

```
docker run --rm       --volume=`pwd`:/go/src/go.etcd.io/etcd gcr.io/etcd-development/etcd-test:go${TRAVIS_GO_VERSION}       /bin/bash -c "GO_BUILD_FLAGS='-v -mod=readonly' ./build && GOARCH=amd64 PASSES='functional' ./test"
[...]
'etcd-tester' succeeded
./test.sh: line 174: kill: (5011) - No such process
./test.sh: line 174: kill: (5012) - No such process
./test.sh: line 174: kill: (5013) - No such process
functional test PASS!
'functional' completed at Mon Feb 15 08:52:32 UTC 2021
SUCCESS
```

I would blame it on this target being flaky, but I see that while master is pretty red, the functional suite hasn't failed in any of the last five iterations at the time of writing: https://travis-ci.com/github/etcd-io/etcd/branches

@ptabor do you have any insights here?

@tbg (Contributor) commented Feb 15, 2021

stderr: {"level":"error","ts":1613151559.1980429,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":10,"case-total":22,"error":"recovery error: rpc error: code = Unknown desc = write tcp 127.0.0.1:2381->127.0.0.1:45082: write: connection reset by peer","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run

	/go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50

main.main

	/go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59

runtime.main

	/usr/local/go/src/runtime/proc.go:204"}
stderr: {"level":"error","ts":1613047421.8999214,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":9,"case-total":22,"error":"injection error: cluster too slow: only 133 commits in 6 retries","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run

	/go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50

main.main

	/go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59

runtime.main

	/usr/local/go/src/runtime/proc.go:204"}

@tbg (Contributor) commented Feb 15, 2021

Managed to reproduce locally (4/5 passed, then this):

{"level":"info","ts":1613382879.2866938,"caller":"agent/handler.go:41","msg":"handler success","operation":"BLACKHOLE_PEER_PORT_TX_RX"}
stderr: {"level":"info","ts":1613382879.286739,"caller":"tester/cluster.go:470","msg":"received response","operation":"BLACKHOLE_PEER_PORT_TX_RX
","from":"127.0.0.1:1379","success":true,"status":"blackholed peer port tx/rx"}
stderr: {"level":"info","ts":1613382879.286763,"caller":"tester/case.go:252","msg":"trigger snapshot START","desc":"BLACKHOLE_PEER_PORT_TX_RX_LE
ADER_UNTIL_TRIGGER_SNAPSHOT","etcd-snapshot-count":2000}
stderr: {"level":"info","ts":1613382879.4084778,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":0,"committed-entries":0,"
etcd-snapshot-count":2000,"start-revision":101708,"last-revision":101708,"took":0.121712851}
stderr: {"level":"info","ts":1613382880.4792778,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":1,"committed-entries":0,"
etcd-snapshot-count":2000,"start-revision":101708,"last-revision":101708,"took":1.192513894}
stderr: {"level":"info","ts":1613382881.5366092,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":2,"committed-entries":337
,"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":102045,"took":2.249845412}
stderr: {"level":"info","ts":1613382882.599872,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":3,"committed-entries":762,
"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":102470,"took":3.313107066}
stderr: {"level":"info","ts":1613382883.6700158,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":4,"committed-entries":121
4,"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":102922,"took":4.383252135}
stderr: {"level":"info","ts":1613382884.7321687,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":5,"committed-entries":163
3,"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":103341,"took":5.445403496}
stderr: {"level":"error","ts":1613382885.73243,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":11,"case-total":22,"error
":"injection error: cluster too slow: only 1633 commits in 6 retries","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run
        /go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50
main.main
        /go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59
runtime.main
        /usr/local/go/src/runtime/proc.go:204"}

@tbg (Contributor) commented Feb 15, 2021

I managed to reproduce a similar flake on master within three runs:

stderr: {"level":"error","ts":1613391270.6197243,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":10,"case-total":22,"err
or":"injection error: cluster too slow: only 1100 commits in 6 retries","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run
        /go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50
main.main
        /go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59
runtime.main
        /usr/local/go/src/runtime/proc.go:204"}
stderr: {"level":"info","ts":1613391270.6198068,"caller":"tester/cluster_run.go:342","msg":"closing stressers before archiving failure data","ro
und":0,"case":10,"case-total":22}

Going to merge this PR then, as it is unlikely that the failure is causally related to this PR.

@tbg merged commit 719f6ce into etcd-io:master on Feb 15, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Feb 16, 2021
This picks up this PR:

etcd-io/etcd#11964

There were no other code changes affecting raft.

Release note (bug fix): fixed a deficiency in the replication layer
that could result in ranges becoming unavailable for prolonged periods
of time (hours) when a write burst occurred under system overload.
While unavailable, the range status page for the affected range would
show a last index much larger than the committed index and no movement
in these indexes on a quorum of the replicas. Note that this should
be distinguished from the case in which enough replicas are offline
to constitute a loss of quorum, where the replication layer can not
make progress due to the loss of quorum itself.
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Feb 16, 2021
60581: go.mod: bump etcd/raft to pick up probing improvements r=nvanbenschoten a=tbg

This picks up this PR:

etcd-io/etcd#11964

There were no other code changes affecting raft.

Release note (bug fix): fixed a deficiency in the replication layer
that could result in ranges becoming unavailable for prolonged periods
of time (hours) when a write burst occurred under system overload.
While unavailable, the range status page for the affected range would
show a last index much larger than the committed index and no movement
in these indexes on a quorum of the replicas. Note that this should
be distinguished from the case in which enough replicas are offline
to constitute a loss of quorum, where the replication layer can not
make progress due to the loss of quorum itself.


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@hicqu deleted the fast-log-rejection branch on March 9, 2021 at 11:58