Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task list: support raft learner in etcd. #10537

Open
jingyih opened this issue Mar 12, 2019 · 28 comments

Comments

@jingyih
Copy link
Member

commented Mar 12, 2019

Background

Open questions

Task List
The list is subject to change.

Phase I

  • Create new feature branch - The feature branch will be merged back to master when the development is done. We will not squash commits to keep track of contribution.
  • API changes and regenerate go bindings.
    • add isLearner flag to MemberAdd API request.
    • add MemberPromote API.
    • add isLearner field in Status maintenance API response.

Phase II
Multiple developers can work in parallel on these tasks.

  • etcdserver and clientv3 support for MemberAdd with isLearner flag. Includes both routing the request to raft, and applying the result.
  • Add etcdctl member add --learner.
  • etcdserver and clientv3 implementation for MemberPromote.
  • Add etcdctl member promote.
  • Add isLearner field to etcdctl member list.
  • etcdserver and clientv3 support for Status maintenance API with isLearner field.
  • Add isLearner field to etcdctl endpoint status.
  • Learner reject all client requests except s-reads and status requests.
  • Exclude learner in leadership transfer.
  • Implement canPromote() check (before sending promote request to Raft consensus).
  • Add learner limit to etcdserver.
  • Exclude learner nodes from clientv3 balancer endpoints. -> clientv3 balancer retry when request send to learner fails.

Phase III
Multiple developers can work in parallel on these tasks.

  • Add metrics.
  • Integration test: add learner
  • Integration test: promote learner
  • Integration test: learner only accepts s-reads and status requests.
  • Integration test: cannot transfer leadership to learner
  • Integration test: add learner with large delta (CanPromote=false) requiring snapshot catchup
  • Integration test: clientv3 balancer with learner member
  • Integration test: partitioned learner
  • e2e test: etcdctl member add --learner, etcdctl endpoint status, etcdctl member list
  • Add auto-promote, but put behind a feature flag? (moved to future work)
  • Throttle process of a learner catching up from leader. (moved to future work)
  • Update etcdctl documentation
  • Update operations documentation
  • Update API documentation (auto generated when generating go bindings)
  • Update clientv3 documentation

Future Work
These items will likely not be included in v3.4 release.

  • Add auto-promote, but put behind a feature flag?
  • Throttle process of a learner catching up from leader.

Pull Requests
Incremental PRs opened against feature branch.
Aggregated PR of feature branch against master branch.

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Mar 12, 2019

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Mar 12, 2019

Thanks @jpbetz for helping on the task list.

@WIZARD-CXY and @jingyih will be working on this. Please let us know if you want to help on some of the tasks.

@jingyih jingyih changed the title Task list: support raft leaner in etcd. Task list: support raft learner in etcd. Mar 13, 2019

@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented Mar 13, 2019

LGTM, the task is clear. I can help in phase II and III.

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Mar 14, 2019

@WIZARD-CXY

I created the feature branch here:
https://github.com/jingyih/etcd/tree/learner

For our development work, we can create pull request against jingyih/etcd/learner. In the end, we will merge all the changes from jingyih/etcd/learner to etcd-io/etcd/master. Since we will not squash commits, all the commits in feature branch will appear in master branch. Here is an example: #9860

Please let me know if you have any question.

@jpbetz

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2019

@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented Mar 15, 2019

Roger that

@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented Mar 21, 2019

@jingyih Maybe add modify existing code to work with the learner implementation to phase II.
Because I find that TransferLeadership function may transfer leader to a learner node which is unexpected.

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Mar 21, 2019

@WIZARD-CXY Thanks for pointing out:) Added. I changed the wording to be more specific: "Learner reject leadership transfer."

@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented Mar 22, 2019

@jingyih maybe exclude the learner when transfer leadership. I think it is more accurate in this case.

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Mar 22, 2019

@jingyih maybe exclude the learner when transfer leadership. I think it is more accurate in this case.

Sounds good.

@WIZARD-CXY

This comment has been minimized.

@xiang90

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2019

For anyone not working "face to face" internally on etcd, it's not obvious what this is. It was mentioned in the etcd community meeting, but I think everyone who attended was a paid RH employee. Thanks!

This feature is currently driven by @jingyih from Google and @WIZARD-CXY from Alibaba. So it is a joint effort from the community :P. If you are interested, we definitely want your help.

@purpleidea

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2019

@WIZARD-CXY Thanks for the links!

@xiang90 I didn't realize it wasn't just RH people. Will remove my comment, thanks for the info!

@purpleidea

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2019

I just read

https://etcd.readthedocs.io/en/latest/server-learner.html

I have some thoughts about this, based on my experience building etcd clusters automatically in https://github.com/purpleidea/mgmt/ If these insights are useful, I am happy to share. I don't have a lot of time to contribute new code for this particular implementation at the moment.

Some background:

I am about one week away from releasing a re-write of this code to remove all the cruft caused by my lack of knowledge in golang.

Some thoughts:

  • Instead of doing member add as a follower, wouldn't it be better for the new prospective member to connect as a client, and do some sort of "sync" to pull most of the data down, and then re-join as a member? This means existing cluster automation tools don't need to change much, they just need to add a single etcd client sync which is optional if they want the initial log data. It also means we DON'T need to break API for this.

  • Currently there's a difficult race:
    Suppose you have an N member cluster (perhaps N is 1 and hostname is A). You decide to add member B. You run member add API from A, and also member B starts up server. If there is any problem with startup, member A is in a borked state. What can you do to unbork? Raft log is blocked. A way to force remove the pending member would be useful.
    If you did this operation with a "raft learner" the server is already started but not voting, so doing the switchover from non-voting server to voting server might be a more difficult problem to deal with, since it's already part of the cluster specification!

  • I think the best way to model all this is to add MemberAPI operations into etcd.Op so that they can happen in a Txn. If this is something that intrigues anyone, then I can write up a proposal. A related proposal is: #10571 and #5277

Thanks for reading. If this is not helpful, please be blunt and tell me, and I won't make more noise.

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Apr 5, 2019

This task list was meant to be used for tracking the implementation progress. But I just added some context in the beginning. Sorry for the confusion.

@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented May 15, 2019

@jingyih seems we are in the end of phase III, what else need I do? From your task list, we left 3 things

Add metrics.
integration test: partitioned learner
e2e test: etcdctl member add --learner, etcdctl endpoint status, etcdctl member list
@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented May 15, 2019

how about I work on the add metrics

@jingyih

This comment has been minimized.

Copy link
Member Author

commented May 15, 2019

@WIZARD-CXY I am thinking about dropping integration test: partitioned learner and the e2e tests. We already added e2e test for etcdctl member add --learner, I do not feel like we need to add e2e test for etcdctl endpoint status and etcdctl member list. So probably the last one is Add metrics. Maybe first come up with a list of metrics that we want to add?

@jingyih

This comment has been minimized.

Copy link
Member Author

commented May 15, 2019

We might want to add:

  • etcd_server_is_learner
    Ref:

    # name: "etcd_server_is_leader"

  • We might also want some debugging metrics to count the number of total/failed/successful learner promotion.

@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented May 15, 2019

Will do. I will submit a pr collecting the metrics

@jingyih

This comment has been minimized.

Copy link
Member Author

commented May 16, 2019

Will do. I will submit a pr collecting the metrics

Thanks. When you are ready, please create PR against etcd-io/master. Let's stop using the learner feature branch.

@WIZARD-CXY

This comment has been minimized.

Copy link
Contributor

commented May 16, 2019

ok

@fabriziopandini

This comment has been minimized.

Copy link

commented Jun 7, 2019

@jingyih is it possible that an etcd member answer to request from clients while still "learning"
We are experiencing something that might be related in Kubernetes see kubernetes-sigs/kind#588 (comment) ...

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Jun 7, 2019

@fabriziopandini Are you testing with etcd binary built from the master branch? The learner feature is not released yet.

Learner does accept certain types of requests in the current implementation. I can provide more info if needed. Just want to first make sure if this is related to the issue you are facing.

@fabriziopandini

This comment has been minimized.

Copy link

commented Jun 10, 2019

@jingyih thanks for the quick answer.My use case is the following:
I'm a kubeadm contributor and I'm investigating an issue on kind, where they are agreesively (in parallel) joining control-plane plane nodes to a K8s.

Joining control-plane nodes is a new kubeadm feature that creates a new etcd member on the joining node and then calls AddMember of the etcd V3 API (currently on etcd 3.1.10).

Now, the problem we are experiencing is that etcd / the API server starts to answer before the new etcd member is aligned with the rest of the nodes, giving false answers (after few seconds everything starts to behave as expected). From my understanding, what we are experiencing is etcd answering while still learning.

What I'm looking for is:

  • possibly a confirmation that the above analysis makes sense (or at least that this is a possible condition)
  • any suggestions about how to prevent this; the "learner" API is really promising, but I understand it still WIP and I'm open to any suggestion

Thank you in advance for your help

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

@jingyih thanks for the quick answer.My use case is the following:
I'm a kubeadm contributor and I'm investigating an issue on kind, where they are agreesively (in parallel) joining control-plane plane nodes to a K8s.

Joining control-plane nodes is a new kubeadm feature that creates a new etcd member on the joining node and then calls AddMember of the etcd V3 API (currently on etcd 3.1.10).

Now, the problem we are experiencing is that etcd / the API server starts to answer before the new etcd member is aligned with the rest of the nodes, giving false answers (after few seconds everything starts to behave as expected). From my understanding, what we are experiencing is etcd answering while still learning.

What I'm looking for is:

  • possibly a confirmation that the above analysis makes sense (or at least that this is a possible condition)
  • any suggestions about how to prevent this; the "learner" API is really promising, but I understand it still WIP and I'm open to any suggestion

Thank you in advance for your help

Please follow the instruction [1] on how to add a new etcd member to cluster. The AddMember API should be called before starting the new etcd member, otherwise the existing cluster will not be able to recognize and verify the new member.

[1] https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#add-a-new-member

maxenglander added a commit to maxenglander/etcd that referenced this issue Jul 10, 2019
etcdserver: add ability to auto-promote learners to voters
It looks as though the etcd team was planning to implement
"auto promote" (see etcd-io#10537). This commit attempts to lay the
groundwork for auto-promoting learners to voters by adding an
--auto-promote flag to the add member command. One reason
for having the ability to automatically promote learners to
voters is that it would enable implementing the Raft paper's
recommendation handle reconfiguration changes by first adding
new members as non-voters until they have caught up with the
leader.
@arjunsingri

This comment has been minimized.

Copy link

commented Aug 31, 2019

How many learner nodes can an etcd cluster support? Does it make sense to replace watches with learner nodes? Can a learner remain a learner indefinitely?

@jingyih

This comment has been minimized.

Copy link
Member Author

commented Sep 1, 2019

How many learner nodes can an etcd cluster support?

Currently in 3.4, the maximum number of learner is 1. This may increase in future.

Does it make sense to replace watches with learner nodes?

Currently learner does not support watch API. To scale watch you might want to use: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/grpc_proxy.md#scalable-watch-api

Can a learner remain a learner indefinitely?

Yes. Currently in 3.4 a leaner will not be auto promoted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
7 participants
You can’t perform that action at this time.