
raft: implement fast log rejection #11964

Merged — 3 commits merged from hicqu:fast-log-rejection into etcd-io:master on Feb 15, 2021

Conversation

@hicqu (Contributor) commented Jun 2, 2020

The fast log rejection algorithm is described in the Raft paper; it can reduce network round trips when a peer with fewer log entries becomes leader.

There are two cases that this PR improves:

A. The conflict point can be found by the leader quickly:

leader log   : (idx=100,term=2)(idx=101,term=5)[...](idx=120,term=5)
follower logs: (idx=100,term=2)(idx=101,term=4)[...](idx=110,term=4)

When the leader appends entries starting from index=120, the follower will reject it with reject=110, term=4. With that hint the leader can find the conflict position and then append entries from index=101.

B. The conflict point can be found by the follower quickly:

leader logs  : (idx=100,term=2)[...](idx=105,term=2)
follower logs: (idx=100,term=2)(idx=101,term=2)(idx=102,term=4)

When the leader appends entries starting from index=105, the follower will reject it with reject=100, term=2 if it can find the conflict position itself. After that the leader can continue to append entries from index=101.
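To make the mechanics concrete, here is a small self-contained Go sketch (a toy model for illustration only, not the code in this PR; the slice-of-terms log representation and the helpers `term` and `findConflictByTerm` are invented for this example) that reproduces case A: the follower hints (index=110, term=4), and the leader uses that hint to jump its probe back to index 100 in a single step, so the next append starts at 101.

```go
package main

import "fmt"

// term returns the term of entry `index` in a toy log modeled as a slice of
// terms, where entry i (1-based) has term log[i-1]. Out-of-range indexes
// (including index 0) are reported as term 0.
func term(log []uint64, index uint64) uint64 {
	if index == 0 || index > uint64(len(log)) {
		return 0
	}
	return log[index-1]
}

// findConflictByTerm walks `index` downwards until it reaches an entry whose
// term is <= t (or it falls off the front of the log). Both the follower (to
// compute its hint) and the leader (to consume the hint) can use this shape
// of helper.
func findConflictByTerm(log []uint64, index, t uint64) uint64 {
	for index > 0 && term(log, index) > t {
		index--
	}
	return index
}

func main() {
	// Case A from the description: entries 1-99 have term 1, entry 100 has
	// term 2; the leader then has term 5 at 101-120, the follower term 4 at
	// 101-110.
	leader := make([]uint64, 120)
	follower := make([]uint64, 110)
	for i := range leader {
		leader[i] = 1
	}
	copy(follower, leader)
	leader[99], follower[99] = 2, 2
	for i := 100; i < 120; i++ {
		leader[i] = 5
	}
	for i := 100; i < 110; i++ {
		follower[i] = 4
	}

	// The leader probes at index 120. The follower doesn't have that entry,
	// so it rejects and hints with its last index (110) and that entry's
	// term (4).
	rejectHint := uint64(110)
	hintTerm := term(follower, rejectHint)

	// The leader skips all of its own entries with a term above the hinted
	// term in one step instead of probing one index per round trip.
	nextProbe := findConflictByTerm(leader, rejectHint, hintTerm)
	fmt.Printf("hint=(%d, %d) -> leader probes at %d, appends from %d\n",
		rejectHint, hintTerm, nextProbe, nextProbe+1) // 100, appends from 101
}
```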

Signed-off-by: qupeng <qupeng@pingcap.com>

Please read https://github.com/etcd-io/etcd/blob/master/CONTRIBUTING.md#contribution-flow.

@gyuho (Contributor) commented Jun 2, 2020

/cc @bdarnell @tbg

@hicqu (Contributor, Author) commented Jun 5, 2020

Friendly ping @bdarnell @tbg .

@tbg (Contributor) commented Jun 5, 2020

For context, here is the passage from the thesis:

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs.
For example, when rejecting an AppendEntries request, the follower can include the term of the
conflicting entry and the first index it stores for that term. With this information, the leader can
decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC
will be required for each term with conflicting entries, rather than one RPC per entry.
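One common reading of that passage, as a hedged sketch (the helper `leaderLastIndexOfTerm` and all names here are invented for illustration; this PR's actual hint handling differs in its details):

```go
// nextIndexAfterReject sketches the thesis optimization: the follower's
// rejection carries the term of its conflicting entry (conflictTerm) and the
// first index it stores for that term (conflictFirstIndex). The leader can
// then bypass a whole term's worth of conflicting entries per RPC instead of
// decrementing nextIndex one entry at a time.
func nextIndexAfterReject(conflictTerm, conflictFirstIndex uint64,
	leaderLastIndexOfTerm func(term uint64) (uint64, bool)) uint64 {
	if idx, ok := leaderLastIndexOfTerm(conflictTerm); ok {
		// The leader also has entries from conflictTerm: probe just past its
		// last entry of that term.
		return idx + 1
	}
	// The leader has no entries from conflictTerm: skip all of the
	// follower's entries for that term in one step.
	return conflictFirstIndex
}
```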

@tbg (Contributor) left a review:

Thanks @hicqu! I think this needs some more work to clarify and document though. For one, I think you should state the main idea instead of referring to the Raft thesis, where the idea - in my opinion - is not presented in sufficient detail to fully explain the code. Generally I also like to see examples in the description of such improvements.
Similarly, the code itself needs to be explained better. The idea here isn't so complicated, but it's mostly inscrutable whether an index is the one to append to or the one before, which index the term belongs to, why a given loop has a certain exit condition, etc. Most of this is not your doing, but I'm just shit scared (pardon my French) to accept any PR that I am not 100% comfortable is correct. And it took me around 2 hours to page my current understanding of things back in, which is not acceptable. So, since we're trying to change the code, the burden is on us to make it clear that the change is correct, and I think the missing piece here is better documentation and naming.

```
@@ -138,6 +138,21 @@ func (l *raftLog) findConflict(ents []pb.Entry) uint64 {
	return 0
}

func (l *raftLog) findConflictByTerm(index uint64, term uint64) uint64 {
```
@tbg (Contributor):

Add a comment:

findConflictByTerm takes an (index, term) pair (indicating a conflicting log entry on a follower during an append) and finds the last index for that term (or the first index of the log, whichever is larger).

@tbg (Contributor): I'm not even sure that comment is correct. But you get the idea: explain the inputs and outputs and what it's for.

@nvanbenschoten (Contributor):

It appears that there are some preconditions that this method needs to hold. For instance, index <= l.lastIndex(). What about index < l.firstIndex()?

And the comment on this should mention what happens if the last index for the provided term is greater than the provided index. In that case, we return the provided index and not the last index in the log with that term. Right?

Should this method be called findLastConflictingIndexAtOrBelowTerm? Then we can define a conflicting index as one <= the provided index in the comment.

Also, s/for that term/for that term in the log l/
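For reference, a minimal sketch of a `findConflictByTerm` that addresses the points above (clamping an out-of-range index and stopping once the scan drops below the log's first entry); this mirrors the direction the review settled on, but is not necessarily the exact code that landed:

```go
// findConflictByTerm returns the largest index <= the given index in log l
// whose term is <= the given term. Callers are expected to pass
// index <= l.lastIndex(); if they don't, the index is returned unchanged.
// If the scan walks below the log's first index, term() starts returning an
// error and the scan stops there.
func (l *raftLog) findConflictByTerm(index uint64, term uint64) uint64 {
	if li := l.lastIndex(); index > li {
		l.logger.Warningf("index(%d) is out of range [0, lastIndex(%d)] in findConflictByTerm",
			index, li)
		return index
	}
	for {
		logTerm, err := l.term(index)
		if logTerm <= term || err != nil {
			break
		}
		index--
	}
	return index
}
```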

raft/raft.go (outdated):
```
if m.Index < reject {
	reject = m.Index
}
logTerm, _ := r.raftLog.term(reject)
```
@tbg (Contributor): Explicitly fall back to zero. I know the Go convention is to return the zero value with errors, but we should not rely on it being true.
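In other words, something along these lines (a sketch of the requested change, not the exact diff):

```go
logTerm, err := r.raftLog.term(reject)
if err != nil {
	// The entry may have been compacted away or be out of range; fall back
	// to zero explicitly instead of relying on term() returning the zero
	// value alongside the error.
	logTerm = 0
}
```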

```
	index uint64
}{
	// append reject, return index that is not conflict
	{
```
@tbg (Contributor):

Please add narration to at least a few of the test cases, similar to how I did it in the comment suggestion. As is, these tests are inscrutable for anyone who isn't willing to stare at them for 20 minutes. That is not ideal.
I'm also curious whether you've considered using env-driven testing for this (raft/testdata). That tends to be much more user-friendly IMO.

```
s1.Append(test.logs)

n1 := newTestRaft(1, []uint64{1, 2, 3}, 10, 1, s1)
n2 := newTestRaft(2, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
```
@tbg (Contributor): Are you intentionally not appending test.logs on n2? I thought test.logs was a shared prefix and ents1 and ents2 would possibly be different suffixes. It doesn't seem to matter for this test, but it seems cleaner (less confusing) anyway.

raft/raft.go (outdated):
```
r.logger.Debugf("%x received MsgAppResp(MsgApp was rejected, lastindex: %d) from %x for index %d",
	r.id, m.RejectHint, m.From, m.Index)
r.logger.Debugf("%x received MsgAppResp(MsgApp was rejected, lastindex: %d) from %x for index %d, log term %d",
	r.id, m.RejectHint, m.From, m.Index, m.LogTerm)
if pr.MaybeDecrTo(m.Index, m.RejectHint) {
```
@tbg (Contributor): PS while we're here: the comment on MaybeDecrTo is wrong. It says that the first argument (rejected) is the index that we tried to append. But that's not true; we tried to append rejected+1 (i.e. rejected is the m.Index of the MsgApp that was rejected), and it failed due to a mismatch at rejected itself.
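One possible corrected wording, as a sketch (the parameter names assumed here may not match what was eventually committed):

```go
// MaybeDecrTo adjusts the Progress on receipt of a rejected MsgApp. The
// first argument is the index the rejected MsgApp carried (m.Index), i.e.
// the index of the entry immediately preceding the ones we tried to append;
// the append failed because of a mismatch at that index itself. The second
// argument is the follower's hint about where the leader should probe next.
func (pr *Progress) MaybeDecrTo(rejected, probeHint uint64) bool {
	// ...
}
```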

@hicqu (Contributor, Author) commented Jun 8, 2020

@tbg thanks for your review! Here is a short list of changes in the latest commit:

  1. Document findConflictByTerm correctly;
  2. Comment logTerm when it's filled into MsgAppResp and read out from MsgAppResp;
  3. Simplify test cases;
  4. Document logTerm in proto file.

I have also updated the PR description to show which cases can be improved by this PR.

@hicqu (Contributor, Author) commented Jun 11, 2020

Friendly ping @tbg .

@hicqu (Contributor, Author) commented Jun 19, 2020

Could you take a look again? @tbg

@xiang90 (Contributor) commented Jul 26, 2020

@hicqu can you resolve the merge conflicts? I will go over it again next week.

stale bot commented Oct 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Oct 24, 2020
stale bot closed this Nov 14, 2020
@nvanbenschoten (Contributor) left a review:

Hi @hicqu,

Sorry to let this PR sit. It still seems like a good idea, as it should minimize probing time after leader elections and prevent long uncommitted log tails from causing problems. Are you still interested in pushing this change over the finish line?

@ptabor reopened this Feb 10, 2021
stale bot removed the stale label Feb 10, 2021
@tbg (Contributor) commented Feb 10, 2021

I'll rebase this - given that Nathan and I are both engaging now and are coworkers, I think it's reasonable that the two of us just take over this PR and get it over the finish line.

@tbg (Contributor) commented Feb 10, 2021

Rebased. I didn't address any review comments yet but it should pass tests.

nvanbenschoten and others added 2 commits February 10, 2021 15:02
This commit creates a new probe_and_replicate.txt interactive test. The
test creates a complete Raft log configuration and demonstrates how a
leader probes and replicates to each of its followers. The log
configuration constructed is identical to the one present in Figure 7 of
the raft paper (https://raft.github.io/raft.pdf), which looks like:

```
     1  2  3  4  5  6  7  8  9  10 11 12
n1: [1][1][1][4][4][5][5][6][6][6]
n2: [1][1][1][4][4][5][5][6][6]
n3: [1][1][1][4]
n4: [1][1][1][4][4][5][5][6][6][6][6]
n5: [1][1][1][4][4][5][5][6][7][7][7][7]
n6: [1][1][1][4][4][4][4]
n7: [1][1][1][2][2][2][3][3][3][3][3]
```

Once in this state, we then elect node 1 as the leader and stabilize the
entire raft group. This demonstrates how a newly elected leader probes
for matching indexes, overwrites conflicting entries, and catches up all
followers.

This will be useful to demonstrate the impact of more efficient probing
behavior.
Signed-off-by: qupeng <qupeng@pingcap.com>
@nvanbenschoten (Contributor) commented:

@tbg I addressed my suggestions from earlier and added some more testing, including a nice interactive test that demonstrates the improvements of this change on the log configuration in Figure 7 of the raft paper:

[screenshot: interactive test output]

I'm not a maintainer of this repo, so I wasn't able to push changes directly to this PR. So they're over in https://github.com/etcd-io/etcd/compare/master...nvanbenschoten:fast-log-rejection?expand=1.

@tbg (Contributor) commented Feb 11, 2021

@nvanbenschoten I added an extra commit to explain things better (fallout from explaining it to myself), PTAL

@nvanbenschoten (Contributor) left a review:

Nice improvements to the comments. They make this much easier to understand.

- Add a large detailed comment about the use and necessity of
  both the follower and leader probing optimizations
- Fix the log message in stepLeader that previously mixed up the
  log term for the rejection and the index of the append
- Improve the test via subtests
- Add some verbiage in findConflictByTerm around first index
@tbg (Contributor) commented Feb 15, 2021

I'm seeing the functional tests fail here for the x-th time, last failure here. These logs make it really hard to parse what's going on (what test is this even running?) but it passes locally:

```
docker run --rm       --volume=`pwd`:/go/src/go.etcd.io/etcd gcr.io/etcd-development/etcd-test:go${TRAVIS_GO_VERSION}       /bin/bash -c "GO_BUILD_FLAGS='-v -mod=readonly' ./build && GOARCH=amd64 PASSES='functional' ./test"
[...]
'etcd-tester' succeeded
./test.sh: line 174: kill: (5011) - No such process
./test.sh: line 174: kill: (5012) - No such process
./test.sh: line 174: kill: (5013) - No such process
functional test PASS!
'functional' completed at Mon Feb 15 08:52:32 UTC 2021
SUCCESS
```

I would blame it on this target being flaky, but I see that while master is pretty red, the functional suite hasn't failed in any of the last five iterations at the time of writing: https://travis-ci.com/github/etcd-io/etcd/branches

@ptabor do you have any insights here?

@tbg (Contributor) commented Feb 15, 2021

stderr: {"level":"error","ts":1613151559.1980429,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":10,"case-total":22,"error":"recovery error: rpc error: code = Unknown desc = write tcp 127.0.0.1:2381->127.0.0.1:45082: write: connection reset by peer","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run

	/go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50

main.main

	/go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59

runtime.main

	/usr/local/go/src/runtime/proc.go:204"}
stderr: {"level":"error","ts":1613047421.8999214,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":9,"case-total":22,"error":"injection error: cluster too slow: only 133 commits in 6 retries","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run

	/go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50

main.main

	/go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59

runtime.main

	/usr/local/go/src/runtime/proc.go:204"}

@tbg (Contributor) commented Feb 15, 2021

Managed to reproduce locally (4/5 passed, then this):

{"level":"info","ts":1613382879.2866938,"caller":"agent/handler.go:41","msg":"handler success","operation":"BLACKHOLE_PEER_PORT_TX_RX"}
stderr: {"level":"info","ts":1613382879.286739,"caller":"tester/cluster.go:470","msg":"received response","operation":"BLACKHOLE_PEER_PORT_TX_RX
","from":"127.0.0.1:1379","success":true,"status":"blackholed peer port tx/rx"}
stderr: {"level":"info","ts":1613382879.286763,"caller":"tester/case.go:252","msg":"trigger snapshot START","desc":"BLACKHOLE_PEER_PORT_TX_RX_LE
ADER_UNTIL_TRIGGER_SNAPSHOT","etcd-snapshot-count":2000}
stderr: {"level":"info","ts":1613382879.4084778,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":0,"committed-entries":0,"
etcd-snapshot-count":2000,"start-revision":101708,"last-revision":101708,"took":0.121712851}
stderr: {"level":"info","ts":1613382880.4792778,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":1,"committed-entries":0,"
etcd-snapshot-count":2000,"start-revision":101708,"last-revision":101708,"took":1.192513894}
stderr: {"level":"info","ts":1613382881.5366092,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":2,"committed-entries":337
,"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":102045,"took":2.249845412}
stderr: {"level":"info","ts":1613382882.599872,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":3,"committed-entries":762,
"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":102470,"took":3.313107066}
stderr: {"level":"info","ts":1613382883.6700158,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":4,"committed-entries":121
4,"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":102922,"took":4.383252135}
stderr: {"level":"info","ts":1613382884.7321687,"caller":"tester/case.go:307","msg":"trigger snapshot RETRY","retries":5,"committed-entries":163
3,"etcd-snapshot-count":2000,"start-revision":101708,"last-revision":103341,"took":5.445403496}
stderr: {"level":"error","ts":1613382885.73243,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":11,"case-total":22,"error
":"injection error: cluster too slow: only 1633 commits in 6 retries","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run
        /go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50
main.main
        /go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59
runtime.main
        /usr/local/go/src/runtime/proc.go:204"}

@tbg (Contributor) commented Feb 15, 2021

I managed to reproduce a similar flake on master within three runs:

stderr: {"level":"error","ts":1613391270.6197243,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":10,"case-total":22,"err
or":"injection error: cluster too slow: only 1100 commits in 6 retries","stacktrace":"go.etcd.io/etcd/tests/v3/functional/tester.(*Cluster).Run
        /go/src/go.etcd.io/etcd/tests/functional/tester/cluster_run.go:50
main.main
        /go/src/go.etcd.io/etcd/tests/functional/cmd/etcd-tester/main.go:59
runtime.main
        /usr/local/go/src/runtime/proc.go:204"}
stderr: {"level":"info","ts":1613391270.6198068,"caller":"tester/cluster_run.go:342","msg":"closing stressers before archiving failure data","ro
und":0,"case":10,"case-total":22}

Going to merge this PR then, as it is unlikely that the failure is causally related to this PR.

@tbg merged commit 719f6ce into etcd-io:master on Feb 15, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Feb 16, 2021
This picks up this PR:

etcd-io/etcd#11964

There were no other code changes affecting raft.

Release note (bug fix): fixed a deficiency in the replication layer
that could result in ranges becoming unavailable for prolonged periods
of time (hours) when a write burst occurred under system overload.
While unavailable, the range status page for the affected range would
show a last index much larger than the committed index and no movement
in these indexes on a quorum of the replicas. Note that this should
be distinguished from the case in which enough replicas are offline
to constitute a loss of quorum, where the replication layer can not
make progress due to the loss of quorum itself.
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Feb 16, 2021
60581: go.mod: bump etcd/raft to pick up probing improvements r=nvanbenschoten a=tbg

This picks up this PR:

etcd-io/etcd#11964

There were no other code changes affecting raft.

Release note (bug fix): fixed a deficiency in the replication layer
that could result in ranges becoming unavailable for prolonged periods
of time (hours) when a write burst occurred under system overload.
While unavailable, the range status page for the affected range would
show a last index much larger than the committed index and no movement
in these indexes on a quorum of the replicas. Note that this should
be distinguished from the case in which enough replicas are offline
to constitute a loss of quorum, where the replication layer can not
make progress due to the loss of quorum itself.


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@hicqu deleted the fast-log-rejection branch on March 9, 2021 at 11:58