*: support raft learner in etcd - part 2 #10727

Merged 15 commits on May 15, 2019

Contributor

@jingyih jingyih commented May 15, 2019

Part 2 of #10645. Continuation of #10725.

This PR includes the following 16 commits from #10645. The last 6 commits are just formatting and bug fixes for the earlier 10 commits.

(ordered by: latest commits on top)

f058b7fbfe9bb3a1ea82f262514110aa91b0c8d5 clientv3/integration: deflake TestKVForLearner
1a080384e20ca55f48c4effe448c87392600ba3e clientv3/integration: fix cluster tests
11e41685030846ba495a4263d8db352d0e9fc7d0 clientv3/integration: update MemberAdd call
5d1f6c73c1dcdd889a28138b0ba6e79f814a704a integration: remove unnecessary type conversion
4f9eb2f7c8b7175ce7eb690c72461e654bd0a0ff etcdserver: remove unnecessary bool comparison
ab030455051cc034704cc4fc9206786b12ee88f6 words: whitelist words to fix goword test.

b0be8067e40c068eb2180b192d792aec21a1e3a2 clientv3, etcdctl: MemberPromote for learner
f1d6e70e271342691c3f1b8cf7e0074f5804cc15 integration: add TestTransferLeadershipWithLearner
2db0ca918be883749c72ede61a38148550be3811 integration: add TestMoveLeaderToLearnerError
3d2d3718be4073e8c97a1e9eecee7b1d4b281619 etcdserver: exclude learner from leader transfer
cc8508d7a2469b944a1fec431f0c8346e4b7ae4b clientv3: add member promote
a1c0ea8636b6607baee91bc38b381aed22c17ae3 etcdserver: support MemberPromote for learner
01a7b3e8dcc29725e590dda2c03dbc83aa00bd38 integration: add TestKVForLearner
c8fbc4d50ef97bc2ba78f0e71f8b9b78de0494ad functional: fix MemberAdd call in tests.
2c19071219c8eeb6ef7bcc99c97a74a4d0d26757 etcdserver: filter rpc request to learner
c5001530d74b827c4949d490c7e523f3f91d112a *: add learner field in endpoint status

This PR is rebased onto part 1 (#10725). Since we modified some function signatures at the end of part 1, I had to edit some of the commits in this PR so that they still make sense (a small history rewrite of the learner feature branch). During this process, the following 2 commits (out of the total 16) were no longer needed, so they were dropped.

11e41685030846ba495a4263d8db352d0e9fc7d0 clientv3/integration: update MemberAdd call
c8fbc4d50ef97bc2ba78f0e71f8b9b78de0494ad functional: fix MemberAdd call in tests.

cc @xiang90

jingyih and others added 14 commits May 15, 2019 13:13
Added learner field to endpoint status API.

Hardcoded allowed rpc for learner node. Added filtering in grpc
interceptor to check if rpc is allowed for learner node.

Adding TestKVForLearner. Also adding test utility functions for clientv3
integration tests.

1. Maintenance API MoveLeader() returns ErrBadLeaderTransferee if
transferee does not exist or is raft learner.

2. etcdserver TransferLeadership() only chooses a voting member as
transferee.

Adding integration test TestMoveLeaderToLearnerError, which ensures that
leader transfer to learner member will fail.

Adding integration test TestTransferLeadershipWithLearner, which ensures
that TransferLeadership does not time out because a learner is automatically
picked by leader as transferee.

Fixes TestMemberAddForLearner and TestMemberPromoteForLearner.

Adding delay in the test for the newly started learner member to catch
up applying config change entries in raft log.

Use: "promote <memberID>",
Short: "Promotes a non-voting member in the cluster",
Long: `Promotes a non-voting learner member to a voting one in the cluster.
`,
Contributor

why a new line here?

Contributor Author

I did not write this part of the code. But it seems consistent with the existing NewMemberListCommand.

ErrIDNotFound = errors.New("membership: ID not found")
ErrPeerURLexists = errors.New("membership: peerURL exists")
ErrMemberNotLearner = errors.New("membership: can only promote a learner member")
ErrLearnerNotReady = errors.New("membership: can only promote a learner member which catches up with leader")
Contributor

which is in sync with the leader

Contributor Author

sounds good.

@@ -40,6 +40,8 @@ var (
ErrGRPCMemberNotEnoughStarted = status.New(codes.FailedPrecondition, "etcdserver: re-configuration failed due to not enough started members").Err()
ErrGRPCMemberBadURLs = status.New(codes.InvalidArgument, "etcdserver: given member URLs are invalid").Err()
ErrGRPCMemberNotFound = status.New(codes.NotFound, "etcdserver: member not found").Err()
ErrGRPCMemberNotLearner = status.New(codes.FailedPrecondition, "etcdserver: can only promote a learner member").Err()
ErrGRPCLearnerNotReady = status.New(codes.FailedPrecondition, "etcdserver: can only promote a learner member which catches up with leader").Err()
Contributor

is in sync with leader

Contributor

can we reuse the previously defined error message here? maybe define the error message as a const

Contributor Author

> can we reuse the previously defined error message here? maybe define the error message as a const

This is just following the existing pattern. If we want to change it, it should be a separate PR for all existing errors.

Contributor Author

> is in sync with leader

sounds good.

Contributor

for this one, a const is better since we repeat the same message twice. but it might also apply to other error messages. i am not sure.
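
For illustration, a minimal sketch of the const-based approach suggested here. It is not what the PR does (the PR keeps two separate string literals), the name `learnerNotReadyMsg` is hypothetical, and the gRPC-level error is stood in for by a plain `errors.New` value so the example needs only the standard library (the real code builds it with `status.New`).

```go
package main

import (
	"errors"
	"fmt"
)

// learnerNotReadyMsg is a hypothetical shared constant; the PR itself
// repeats the message in two places.
const learnerNotReadyMsg = "can only promote a learner member which is in sync with leader"

var (
	// ErrLearnerNotReady mirrors the membership-level error from the diff.
	ErrLearnerNotReady = errors.New("membership: " + learnerNotReadyMsg)
	// ErrGRPCLearnerNotReady is a stand-in for the gRPC-level error, which
	// the real code constructs with status.New(codes.FailedPrecondition, ...).
	ErrGRPCLearnerNotReady = errors.New("etcdserver: " + learnerNotReadyMsg)
)

func main() {
	fmt.Println(ErrLearnerNotReady)
	fmt.Println(ErrGRPCLearnerNotReady)
}
```

With a shared constant, a wording change such as the one requested above only has to be made once.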

@@ -116,3 +120,15 @@ func isClientCtxErr(ctxErr error, err error) bool {
}
return false
}

// in v3.4, learner is allowed to serve serializable read and endpoint status
func isRPCEnabledForLearner(req interface{}) bool {
Contributor

isRPCSupportedForLearner?

Contributor Author

sounds good.
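
As a rough sketch of the kind of filter being discussed, assuming (per the code comment in the diff) that a learner may serve only serializable reads and endpoint status. The request types below are hypothetical stand-ins for the generated protobuf types, and the body is not taken from the PR.

```go
package main

import "fmt"

// Hypothetical stand-ins for the generated protobuf request types.
type RangeRequest struct {
	Serializable bool
}
type PutRequest struct{}
type StatusRequest struct{}

// isRPCSupportedForLearner reports whether a learner node may serve req.
// Only endpoint status and serializable (non-linearizable) reads pass.
func isRPCSupportedForLearner(req interface{}) bool {
	switch r := req.(type) {
	case *StatusRequest:
		return true
	case *RangeRequest:
		return r.Serializable
	default:
		return false
	}
}

func main() {
	fmt.Println(isRPCSupportedForLearner(&StatusRequest{}))
	fmt.Println(isRPCSupportedForLearner(&RangeRequest{Serializable: true}))
	fmt.Println(isRPCSupportedForLearner(&RangeRequest{}))
	fmt.Println(isRPCSupportedForLearner(&PutRequest{}))
}
```

A gRPC interceptor would call a check like this only when the local member is a learner, and reject the request otherwise.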

// ConfigChangeContext represents a context for confChange.
type ConfigChangeContext struct {
Member
IsPromote bool `json:"isPromote"`
Contributor

add a comment on this field?

Contributor Author

sure.

if urls[u] {
return ErrPeerURLexists
// A ConfChangeAddNode to a existing learner node promotes it to a voting member.
if confChangeContext.IsPromote {
Contributor

move the validation to the top of this func?

Contributor

we should do the validation first, then do the context decoding.

Contributor Author

We need to decode confChangeContext first, which will tell us if the config change is for promoting an existing member or for adding a new member. The validation logic is different for these two scenarios.

Contributor

ok. sgtm.

Contributor

but then, can you move the decoding thing even before the urls construction?

Contributor Author

Right, the urls construction should be moved to a later part of the logic. It is only needed when adding a new member, not when promoting.
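
The ordering agreed on in this thread can be sketched roughly as follows: decode the context first, branch on `IsPromote`, and build the peer URL set only on the add-member path. This is a simplified illustration, not the PR's code: `Member` is reduced to three fields, and `validateAddNode` and `mustEncode` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Member is a simplified, hypothetical version of the membership type.
type Member struct {
	ID        uint64   `json:"id"`
	PeerURLs  []string `json:"peerURLs"`
	IsLearner bool     `json:"isLearner"`
}

// ConfigChangeContext represents a context for confChange.
type ConfigChangeContext struct {
	Member
	// IsPromote indicates the change promotes an existing learner
	// rather than adding a new member.
	IsPromote bool `json:"isPromote"`
}

var (
	ErrIDExists         = errors.New("membership: ID exists")
	ErrIDNotFound       = errors.New("membership: ID not found")
	ErrPeerURLexists    = errors.New("membership: peerURL exists")
	ErrMemberNotLearner = errors.New("membership: can only promote a learner member")
)

// validateAddNode decodes the context before any validation, because the
// checks differ for promoting an existing member and adding a new one.
func validateAddNode(existing map[uint64]Member, raw []byte) error {
	var ctx ConfigChangeContext
	if err := json.Unmarshal(raw, &ctx); err != nil {
		return err
	}
	if ctx.IsPromote {
		m, ok := existing[ctx.ID]
		if !ok {
			return ErrIDNotFound
		}
		if !m.IsLearner {
			return ErrMemberNotLearner
		}
		return nil
	}
	if _, ok := existing[ctx.ID]; ok {
		return ErrIDExists
	}
	// The URL set is only needed when adding a new member.
	urls := make(map[string]bool)
	for _, m := range existing {
		for _, u := range m.PeerURLs {
			urls[u] = true
		}
	}
	for _, u := range ctx.PeerURLs {
		if urls[u] {
			return ErrPeerURLexists
		}
	}
	return nil
}

// mustEncode is a small helper for the example; it panics on encode failure.
func mustEncode(ctx ConfigChangeContext) []byte {
	b, err := json.Marshal(ctx)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	existing := map[uint64]Member{
		1: {ID: 1, PeerURLs: []string{"http://a:2380"}, IsLearner: true},
	}
	promote := ConfigChangeContext{Member: Member{ID: 1}, IsPromote: true}
	fmt.Println(validateAddNode(existing, mustEncode(promote)))
}
```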

@@ -693,3 +735,44 @@ func mustDetectDowngrade(lg *zap.Logger, cv *semver.Version) {
}
}
}

// IsLearner returns if the local member is raft learner
func (c *RaftCluster) IsLearner() bool {
Contributor

IsLocalMemberLearner

Contributor Author

sounds good.

@@ -1435,20 +1440,20 @@ func (s *EtcdServer) TransferLeadership() error {
return nil
}

- if !s.isMultiNode() {
+ if s.cluster == nil || len(s.cluster.VotingMemberIDs()) <= 1 {
Contributor

probably still encapsulate this logic into a func.

Contributor Author

ok.
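
One way the check could be encapsulated, sketched over a plain slice instead of the real cluster type. The `member` type and the helper names `votingMemberIDs` and `hasMultipleVotingMembers` are assumptions for this example, not names confirmed by the PR.

```go
package main

import "fmt"

// member is a hypothetical, minimal view of a cluster member.
type member struct {
	ID        uint64
	IsLearner bool
}

// votingMemberIDs returns the IDs of all non-learner members.
func votingMemberIDs(members []member) []uint64 {
	var ids []uint64
	for _, m := range members {
		if !m.IsLearner {
			ids = append(ids, m.ID)
		}
	}
	return ids
}

// hasMultipleVotingMembers reports whether leadership could be transferred
// at all: a transferee must be a voting member other than the leader.
// A nil slice plays the role of the nil-cluster case in the diff.
func hasMultipleVotingMembers(members []member) bool {
	return len(votingMemberIDs(members)) > 1
}

func main() {
	cluster := []member{{ID: 1}, {ID: 2, IsLearner: true}}
	fmt.Println(hasMultipleVotingMembers(cluster))
	cluster = append(cluster, member{ID: 3})
	fmt.Println(hasMultipleVotingMembers(cluster))
}
```

Counting only voting members matters here: a cluster with one voter and one learner has two members but no valid transferee.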

// TODO add more checks whether the member can be promoted.
// like learner progress check or if cluster is ready to promote a learner
// this is an example to get progress
fmt.Printf("raftStatus, %#v\n", raftStatus())
Contributor

we should not simply use fmt.Printf for logging.

Contributor Author

This part is unfinished. This line should be removed for now. The actual implementation is in a later commit (not included in this part 2 PR): dc79587

Contributor

but checking this before the apply phase is ok.

Contributor Author

> but checking this before the apply phase is ok.

Not sure I understand. The member promote request should be rejected (if the reason is that the learner is not ready) before it is appended to the raft log. I think it is too late if we check learner progress in the apply stage. In other words, applying a particular conf change raft entry should not depend on the progress of the learner node in the cluster. Otherwise some nodes may apply this entry whereas others may not, depending on when they started applying it.

Contributor

yes, that is what i mean. checking this at the apply phase will not work since the cluster might not have a consistent view of progress unless the leader keeps propagating this information.
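
To make the point concrete, here is a minimal sketch of a readiness check that the leader could run before proposing the promote entry, so that the decision never depends on per-node state at apply time. The function name, the use of raft match indexes as the progress measure, and the 90% threshold are all assumptions for this example; the actual check lives in a later commit that is not part of this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// readyPercent is an assumed threshold: the learner's match index must be
// at least this fraction of the leader's before promotion is proposed.
const readyPercent = 0.9

var ErrLearnerNotReady = errors.New("membership: can only promote a learner member which is in sync with leader")

// checkLearnerReady is evaluated on the leader only, before the conf change
// is appended to the raft log. Once the entry is in the log, every node
// applies it unconditionally.
func checkLearnerReady(leaderMatch, learnerMatch uint64) error {
	if float64(learnerMatch) < float64(leaderMatch)*readyPercent {
		return ErrLearnerNotReady
	}
	return nil
}

func main() {
	fmt.Println(checkLearnerReady(100, 95))
	fmt.Println(checkLearnerReady(100, 50))
}
```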

Contributor

> This part is unfinished. This line should be removed for now. The actual implementation is in later commit (not included in part 2 PR):

can you just make this a noop or a TODO panic?

Contributor Author

Currently it is a noop: this checking function always returns nil.

Contributor

it will print something out. can we remove that?

Contributor Author

Yes this print should be removed.

Contributor

xiang90 commented May 15, 2019

i did not look at all the test changes. but the core path is good in general.

defer cli.Close()

// waiting for learner member to catch up applying the config change entries in raft log.
time.Sleep(3 * time.Second)
Contributor

it is not good to make this kind of timing assumption in general. we tried to get rid of all time.sleep in tests before.

Contributor Author

@jingyih jingyih May 15, 2019

This is replaced in later commits with the server's ready notify channel: 90956f7 (not included in this part 2 PR)

Contributor

add a todo?

Contributor Author

ok
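
For reference, the general shape of replacing a fixed sleep with a readiness signal might look like the sketch below. It assumes the server exposes a channel that is closed once the member is ready; `waitReady` and `errNotReady` are hypothetical names, and this is not the code from the later commit.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNotReady = errors.New("timed out waiting for member to become ready")

// waitReady blocks until ready is closed (or receives a value) and returns
// nil, or returns errNotReady once timeout elapses. Unlike time.Sleep, it
// returns as soon as the member is ready and fails loudly if it never is.
func waitReady(ready <-chan struct{}, timeout time.Duration) error {
	select {
	case <-ready:
		return nil
	case <-time.After(timeout):
		return errNotReady
	}
}

func main() {
	ready := make(chan struct{})
	go func() {
		// Simulates the learner catching up on config change entries.
		time.Sleep(10 * time.Millisecond)
		close(ready)
	}()
	fmt.Println(waitReady(ready, time.Second))
}
```

The timeout is an upper bound rather than an expected duration, so the test does not encode an assumption about how long catch-up takes.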

Contributor

xiang90 commented May 15, 2019

tests look good too.

Contributor Author

jingyih commented May 15, 2019

I'll create an additional commit to address all the comments.

Contributor Author

jingyih commented May 15, 2019

@xiang90 Pushed an additional commit to address the comments. PTAL.

Contributor

xiang90 commented May 15, 2019

lgtm


3 participants