
raft: support safe readonly request #6275

Merged 1 commit on Sep 13, 2016

Conversation

xiang90 (Contributor) commented Aug 26, 2016

Implement raft readonly request described in raft thesis 6.4
along with the existing clock/lease based approach.

/cc @siddontang @swingbach

Based on my experiments, performance is comparable to the lease-based approach when all the nodes are in the same DC.
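For readers, the safe read-only flow from §6.4 of the thesis boils down to: the leader records its commit index as the read index, confirms it is still leader with a quorum round of heartbeats, and serves the read once the state machine has applied up to that index. A minimal standalone Go sketch of that bookkeeping (illustrative names, not the actual etcd/raft code):

```go
package main

import "fmt"

// readIndexStatus tracks one pending read-only request on the leader.
type readIndexStatus struct {
	index uint64              // leader's commit index when the read arrived
	acks  map[uint64]struct{} // follower IDs that acked the confirming heartbeat
}

func newRead(commit uint64) *readIndexStatus {
	return &readIndexStatus{index: commit, acks: map[uint64]struct{}{}}
}

// recvAck records a heartbeat response; duplicate acks from the same
// follower are counted only once.
func (rs *readIndexStatus) recvAck(from uint64) int {
	rs.acks[from] = struct{}{}
	return len(rs.acks)
}

// ready reports whether a quorum (the leader counts itself) has
// confirmed leadership, so the read index is safe to serve.
func (rs *readIndexStatus) ready(clusterSize int) bool {
	return len(rs.acks)+1 > clusterSize/2
}

func main() {
	rs := newRead(42)
	rs.recvAck(2)
	rs.recvAck(2) // duplicate ack is deduplicated
	fmt.Println(rs.ready(3)) // leader + one follower is a quorum of 3
}
```

Counting acks per node rather than as a bare counter matters once the same context can be retried on later heartbeats, which comes up later in the review.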

@xiang90 xiang90 force-pushed the xiang90:raft_l branch from b57a6e3 to 3dfa95c Aug 26, 2016
@xiang90 xiang90 force-pushed the xiang90:raft_l branch 2 times, most recently from 34bd82f to 5bce93b Aug 26, 2016

func (ro *readOnly) recvAck(m pb.Message) int {
ctx := string(m.Context)
rs, ok := ro.pendingReadIndex[ctx]

gyuho (Member) commented Aug 26, 2016

rs, ok := ro.pendingReadIndex[string(m.Context)]? golang/go@f5f5a8b
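gyuho's suggestion leans on a Go compiler optimization (golang/go@f5f5a8b): converting a []byte to string directly inside a map index expression lets the compiler skip allocating the intermediate string. A small illustrative example:

```go
package main

import "fmt"

// lookup indexes the map with the []byte converted in place; the
// compiler can avoid allocating a temporary string for the key.
func lookup(m map[string]int, b []byte) (int, bool) {
	v, ok := m[string(b)] // no intermediate string allocation
	return v, ok
}

func main() {
	m := map[string]int{"ctx1": 7}
	v, ok := lookup(m, []byte("ctx1"))
	fmt.Println(v, ok)
}
```

Assigning `ctx := string(m.Context)` first, as the original diff does, forces the allocation on every call, which is why the reviewer points at the direct-index form.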

@xiang90 xiang90 force-pushed the xiang90:raft_l branch from 5bce93b to ee5eead Aug 26, 2016
siddontang (Contributor) commented Aug 27, 2016

@xiang90 Should raw_node support this too? I think we can add it after this PR is merged; we'll port it to Rust first.

@xiang90 xiang90 force-pushed the xiang90:raft_l branch from ee5eead to df21a14 Aug 27, 2016
xiang90 (Contributor) commented Aug 27, 2016

@siddontang I can add support for raw_node after this PR.

@@ -0,0 +1,86 @@
// Copyright 2015 The etcd Authors

gyuho (Member) commented Aug 29, 2016

2016

@@ -750,6 +775,25 @@ func stepLeader(r *raft, m pb.Message) {
if pr.Match < r.raftLog.lastIndex() {
r.sendAppend(m.From)
}

if r.readOnly.option != ReadOnlySafe || len(m.Context) == 0 {

gyuho (Member) commented Aug 29, 2016

Why do we skip here, when r.readOnly.option == ReadOnlyLeaseBased?

gyuho (Member) commented Aug 29, 2016

ah, nvm. addRequest is only called for ReadOnlySafe

@xiang90 xiang90 force-pushed the xiang90:raft_l branch from df21a14 to 988d20a Aug 29, 2016
xiang90 (Contributor) commented Aug 29, 2016

@bdarnell Can anyone from Cockroach help review this feature?

@xiang90 xiang90 force-pushed the xiang90:raft_l branch from 988d20a to dc8db61 Aug 29, 2016
import pb "github.com/coreos/etcd/raft/raftpb"

// ReadState provides state for read only query.
// It's caller's responsibility to send MsgReadIndex first before getting

bdarnell (Member) commented Aug 30, 2016

s/send MsgReadIndex/call ReadIndex/

The comments on the ReadIndex method in the node interface need to be updated for this new mode.

rss = append(rss, rs)
delete(ro.pendingReadIndex, okctx)
if okctx == ctx {
break

bdarnell (Member) commented Aug 30, 2016

If we get a message with an unknown context (for example, if we've just restarted and the message was delayed), this will clear all pending reads, when it should do nothing. This loop needs to go backwards and only do something if it finds a match.

xiang90 (Author) commented Aug 30, 2016

Only the leader keeps the readonly state. If a leader restarts, it comes back with an empty read-only list. But I agree with you; I should make this more robust.
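The more robust shape bdarnell is asking for can be sketched like this (a standalone illustration, not the actual etcd/raft code): first scan the pending queue for the acked context, and treat an unknown context as a no-op instead of draining the queue:

```go
package main

import "fmt"

type readIndexStatus struct{ ctx string }

type readOnly struct {
	pendingReadIndex map[string]*readIndexStatus
	readIndexQueue   []string // contexts in arrival order
}

// advance pops every pending read up to and including the matched
// context. If ctx is unknown (e.g. a delayed message arriving after a
// restart), it returns nil and leaves the queue untouched.
func (ro *readOnly) advance(ctx string) []*readIndexStatus {
	found := -1
	for i, okctx := range ro.readIndexQueue {
		if okctx == ctx {
			found = i
			break
		}
	}
	if found == -1 {
		return nil // unknown context: do nothing
	}
	var rss []*readIndexStatus
	for _, okctx := range ro.readIndexQueue[:found+1] {
		rss = append(rss, ro.pendingReadIndex[okctx])
		delete(ro.pendingReadIndex, okctx)
	}
	ro.readIndexQueue = ro.readIndexQueue[found+1:]
	return rss
}

func main() {
	ro := &readOnly{
		pendingReadIndex: map[string]*readIndexStatus{"a": {ctx: "a"}, "b": {ctx: "b"}},
		readIndexQueue:   []string{"a", "b"},
	}
	fmt.Println(len(ro.advance("zzz")), len(ro.readIndexQueue)) // unknown ctx is a no-op
	fmt.Println(len(ro.advance("b")), len(ro.readIndexQueue))   // pops "a" and "b"
}
```

Popping the whole prefix is safe because an ack for a later read implies quorum confirmation for every earlier read in the queue.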

return rs.ackCount
}

func (ro *readOnly) advance(m pb.Message) []*readIndexStatus {

bdarnell (Member) commented Aug 30, 2016

All the methods in this file could use some comments.

xiang90 (Author) commented Aug 30, 2016

will do.

@@ -390,11 +400,15 @@ func (r *raft) bcastAppend() {

// bcastHeartbeat sends RPC, without entries to all the peers.
func (r *raft) bcastHeartbeat() {
r.bcastHeartbeatWithCtx(nil)

bdarnell (Member) commented Aug 30, 2016

If there are any pending reads in readOnly, we should send the context of the last one with each heartbeat until we get a response. Otherwise if the single MsgHeartbeat is dropped, the read won't be able to finish until the next ReadIndex succeeds.

I was originally going to ask why this reused MsgHeartbeat instead of a new message type, but retrying the last context on each heartbeat seems like a good reason.

xiang90 (Author) commented Aug 30, 2016

agree. will do.

xiang90 (Author) commented Aug 30, 2016

Originally I wanted to keep the first implementation simple, so we only needed an ack count. Now we need to know where each ack comes from, since duplicate acks are possible.
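A rough sketch of the retry idea under discussion (illustrative, not the etcd/raft code): keep pending contexts in a queue and attach the newest one to every heartbeat, so a single dropped MsgHeartbeat cannot stall the read:

```go
package main

import "fmt"

type readOnly struct{ readIndexQueue []string }

// lastPendingRequestCtx returns the context of the newest pending
// read-only request, or "" when nothing is pending.
func (ro *readOnly) lastPendingRequestCtx() string {
	if len(ro.readIndexQueue) == 0 {
		return ""
	}
	return ro.readIndexQueue[len(ro.readIndexQueue)-1]
}

// The leader's heartbeat broadcast would then always carry this context,
// along the lines of:
//
//	func (r *raft) bcastHeartbeat() {
//		r.bcastHeartbeatWithCtx([]byte(r.readOnly.lastPendingRequestCtx()))
//	}

func main() {
	ro := &readOnly{readIndexQueue: []string{"ctx1", "ctx2"}}
	fmt.Println(ro.lastPendingRequestCtx())
}
```

This is exactly why a bare ack counter no longer suffices: the same follower may now ack the same context several times, so acks must be tracked per node and deduplicated.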

To: to,
Type: pb.MsgHeartbeat,
Commit: commit,
Context: ctx,

bdarnell (Member) commented Aug 30, 2016

Do we need to send the user-supplied context on the wire with the heartbeat? I know we need to use the context to connect ReadIndex and the result in Ready, but I think for the heartbeat we can express this in terms of the term and index instead of a user-supplied value. This would allow multiple reads to piggyback on the same message.

xiang90 (Author) commented Aug 30, 2016

we do not have to. I will think about this more.

@@ -434,6 +448,7 @@ func (r *raft) reset(term uint64) {
}
}
r.pendingConf = false
r.readOnly = newReadOnly(r.readOnly.option)

bdarnell (Member) commented Aug 30, 2016

Don't we need to do something here to tell the application that any previous reads have failed and will need to be retried?

xiang90 (Author) commented Aug 30, 2016

Only the leader can do an efficient l-read. If we wanted raft to retry, we would have to rely on the previous leader sending its pending read info to the new leader. I feel that is worse than a simple application-layer retry.

bdarnell (Member) commented Sep 13, 2016

How would the application retry if you never tell it the node lost its leadership and the read failed? Is the application expected to watch SoftState.Status to see when leadership changes, or would it retry based on a timeout?

xiang90 (Author) commented Sep 13, 2016

Oh, I misunderstood your original comment. Right now we treat readIndexRequests like proposals, so the client won't really know whether one was dropped. In etcd we use a simple timeout. We can monitor the term to fail faster: if the term changes but the read index response has not returned, we know the request was dropped.

@@ -26,7 +26,8 @@ This raft implementation is a full feature implementation of Raft protocol. Feat
- Log compaction
- Membership changes
- Leadership transfer extension
- Lease-based linearizable read-only queries served by both the leader and followers
- Efficient linearizable read-only queries served by both the leader and followers
- More efficient lease-based linearizable read-only queries served by both the leader and followers

bdarnell (Member) commented Aug 30, 2016

Be clear about what "efficient" and "more efficient" mean here (or somewhere). Mentioning followers sounds like this is a purely local operation, but followers still need to talk to the leader (so it spreads the load a little bit but for lowest latency it's still important to talk to the leader).

xiang90 (Author) commented Aug 30, 2016

Will do. I probably also need to link to the original thesis.

bdarnell (Member) commented Aug 30, 2016

This generally looks good, although I don't think we'll be able to use it in cockroachdb any time soon. (we have our own leasing mechanism with subtle interactions between reads and writes that would be difficult to translate to raft-level leases).

xiang90 (Contributor) commented Aug 30, 2016

@bdarnell Thanks for the kind review. I will clean this up soon.

return
}

ackCount := r.readOnly.recvAck(m)

hhkbp2 (Contributor) commented Sep 12, 2016

Is it necessary to check duplicated heartbeats from the same follower?

xiang90 (Author) commented Sep 12, 2016

Originally, no: we never sent a heartbeat with the same ctx twice. I just made a change that might introduce duplicates, so I have added the dedup.

@xiang90 xiang90 force-pushed the xiang90:raft_l branch 2 times, most recently from 70217d9 to a2a4e58 Sep 12, 2016
Implement raft readonly request described in raft thesis 6.4
along with the existing clock/lease based approach.
@xiang90 xiang90 force-pushed the xiang90:raft_l branch from a2a4e58 to 710b14c Sep 12, 2016
xiang90 (Contributor) commented Sep 12, 2016

@bdarnell All fixed except the context thing; I left that as "thinking". The heartbeat now sends the latest context, and we keep a queue in the readonly struct: when a read-only request in the middle of the queue is unblocked, all earlier requests are unblocked with it. Applications can coalesce the heartbeat responses at will, so using term/index to order the read-only requests seems unnecessary.

bdarnell (Member) commented Sep 13, 2016

LGTM

@xiang90 xiang90 merged commit cfe717e into etcd-io:master Sep 13, 2016
2 checks passed
continuous-integration/travis-ci/pr: The Travis CI build passed
semaphoreci: The build passed on Semaphore.
@xiang90 xiang90 deleted the xiang90:raft_l branch Sep 13, 2016