
Placement Services: Run multiple instances and use Leader election using Raft #663

Closed
amanbha opened this issue Oct 22, 2019 · 4 comments · Fixed by #2364, #2416 or #2417

Comments

@amanbha
Contributor

amanbha commented Oct 22, 2019

Describe the proposal

Currently the Placement service runs as a single instance, which can lead to unavailability when its node goes down. The proposal is to run it with multiple instances and use leader election.

cc: @yaron2 @youngbupark @mkosieradzki

RELEASE NOTE: Enables placement service to run with multiple instances.

@youngbupark
Contributor

youngbupark commented Oct 22, 2019

From Gitter conversation:

@mkosieradzki - https://gitter.im/Dapr/community?at=5dadeb55ef84ab37867027f5

  1. I am trying to go through the DHT algorithm and trying to reason about it, but it is fairly difficult to understand as it utilizes an eventual-consistency approach. I am having some problems understanding what will happen in case we have multiple instances of the placement service (I see there are safety checks for the placement table version, and also a lock command distributed across all dapr sidecars).
    Helm charts set the replica set size to 1 - is this on purpose? Why not 2 by default? Can we have 2 instances all the time? Or should we use only one instance, with a second one appearing only temporarily when the first one goes down?
  2. Shouldn't the placement service use platform-native leader election, e.g. https://godoc.org/k8s.io/client-go/tools/leaderelection on k8s, and be a stateful service on Service Fabric? Or is it unnecessary?

@yaron2 - https://gitter.im/Dapr/community?at=5dae0943a03ae1584fe4374d

  1. You only need to have one instance of the placement service. The underlying hosting platform (Kubernetes, for example) guarantees the desired number of replicas (1) are running.
  2. Because of the above, there's no leader election needed. The placement service does not hold external state, and can recover from crashes by rebuilding its hashing table. This frees us from state management and coordination issues.

@mkosieradzki - https://gitter.im/Dapr/community?at=5dae22599825bd6baca3b46b

ad 2, 3. You really trust guarantees for ReplicaSet. Imagine a node goes down (the one that was hosting placement-service). The node will stop sending heartbeats. After 3 minutes (AFAIR) k8s will stop scheduling new pods to this node ;-)))). But ReplicaSet will not schedule a new pod until the node comes back ;-). How about liveness probes? AFAIK liveness probes are issued by the kubelet, which lives on the dead node ;). So you just lost the only copy of placement-service. This is why you should consider running multiple instances and using leader election to elect the one instance doing the real thing. K8S is not Service Fabric, which detects a node outage in under 4 seconds :)


@youngp - https://gitter.im/Dapr/community?at=5dae9d8efb4dab784ae26357

@yaron2 (quoting @mkosieradzki's message above about ReplicaSet guarantees and losing the only copy of placement-service)

@mkosieradzki Could you please create an issue to track this in dapr/dapr? It would be good to discuss this via an issue.
@yaron2 to me, his scenario makes sense. Don't you think?


@mkosieradzki - https://gitter.im/Dapr/community?at=5daea3a92a6494729c1952be

In case you agree with my scenario, we can move to the next problem: API server based leader election ties you to the API server (and underlying etcd/Cosmos DB) availability, which in AKS has an SLO of 99.5% ;-), and AKS folks are refusing to fix this, because "your workloads don't need the API server to work" ;-), so you end up deploying your own zk/etcd ;-) or implementing your own clustering and leader election on top of k8s.


@yaron - https://gitter.im/Dapr/community?at=5daf2dd89c39821509655c3c

actually the default pod eviction time for nodes is 5 minutes, so it's even "worse" in that sense. Liveness probes are indeed dispatched from the kubelet on the dead node, so they would be of no help. In any case, the placement service being down does not mean actors stop working or are unreachable; it means there won't be any rebalancing done. What we could do is run the placement service as a DaemonSet and use leader election for active/N-passive. Leader election won't be API server leader election; there are many ways to achieve that.
DaemonSet means that the placement service lives on every node, and gets automatically deployed to every new node

@yaron2 yaron2 added the size/XL 4 weeks of work (should this be an epic?) label Oct 30, 2019
@amanbha amanbha added the P2 label Mar 25, 2020
@youngbupark youngbupark added P1 and removed P2 labels Jun 19, 2020
@youngbupark youngbupark added size/L 3 weeks of work and removed size/XL 4 weeks of work (should this be an epic?) labels Jul 20, 2020
@youngbupark youngbupark changed the title Placement Services: Run multiple instances and use Leader election. Placement Services: Run multiple instances and use Leader election using Raft Aug 14, 2020
@youngbupark
Contributor

@amanbha @yaron2 - JFYI - I will reuse this issue to adopt the Raft algorithm. I will work on this item first.

@youngbupark
Contributor

@yaron2 @amanbha - I will use etcd's raft implementation for the Raft consensus algorithm in placement, because the etcd one looks like the most popular one. Let me know if you have any concerns.

https://raft.github.io/


@youngbupark
Contributor

youngbupark commented Oct 14, 2020

Overview of placement service

The Placement service acts as the membership service for actor service instances and the partitioning service for actors. When the Dapr runtime starts, it connects to the placement service gRPC server and keeps this connection for its entire life cycle in order to synchronize the local actor partition table in the Dapr runtime sidecar. The Dapr runtime piggybacks on the placement service by reporting the current host information, such as the actor types it serves, in a heartbeat request sent every second, and receives the updated partition table in return. This partition table is used for placing actors on actor service member nodes and as a look-up table when the user application invokes actors.
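
For illustration, a minimal sketch of this heartbeat/stream interaction on the sidecar side could look like the code below. The Host, PlacementOrder, and PlacementStream types are hypothetical stand-ins for the generated gRPC client and proto messages, not the actual dapr definitions.

```go
package main

import (
	"context"
	"time"
)

// Host and PlacementOrder are hypothetical stand-ins for the proto messages
// exchanged on the placement stream.
type Host struct {
	Name     string   // unique sidecar address, e.g. IP:port
	Entities []string // actor types served by this sidecar
}

type PlacementOrder struct {
	Tables map[string][]string // simplified partition table
}

// PlacementStream mimics the bi-directional gRPC stream client.
type PlacementStream interface {
	Send(*Host) error
	Recv() (*PlacementOrder, error)
}

// reportStatus keeps the life-long connection to the placement leader: it
// pushes a heartbeat with the current host info every second and applies any
// partition table update pushed back on the same stream.
func reportStatus(ctx context.Context, stream PlacementStream, apply func(*PlacementOrder)) error {
	go func() {
		for {
			order, err := stream.Recv()
			if err != nil {
				return // stream broken; the caller reconnects
			}
			apply(order) // rebuild the local actor look-up table
		}
	}()

	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			host := &Host{Name: "10.0.0.5:50002", Entities: []string{"cart", "order"}}
			if err := stream.Send(host); err != nil {
				return err
			}
		}
	}
}
```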

Role of Raft in Placement

The Raft consensus algorithm plays two roles in the Placement service:

  1. Leader election: All Dapr sidecars establish bi-directional stream channels only to the leader of the Placement instances. The stream channel from placement to the Dapr sidecar is used for updating the local hashing tables to look up the available actor service instances. Non-leader nodes must return an error when a Dapr sidecar tries to connect to them, which lets the sidecar find the leader node by trying the placement addresses in a round-robin manner.

  2. Consensus on Dapr sidecar membership changes, to keep the list of Dapr sidecars consistent: the leader of the placement Raft nodes sends each membership-change command to the other Raft nodes, and the state machine in each Raft node processes the series of commands so the nodes reach consensus. When the state machine in a Raft node processes a Dapr sidecar membership change, it updates the sidecar state and constructs the consistent hashing table. The membership-change commands are stored in logs, and the state is snapshotted periodically. When a node fails, the Raft consensus algorithm elects a new leader; when the failed node comes back, it restores the snapshot and replays the logs to produce the same state.
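
As an illustration of the non-leader rejection in item 1, the placement server could gate the stream endpoint on its Raft role roughly as follows. hashicorp/raft is used here only as an example; the implementation choice is still an open question further below.

```go
package main

import (
	"errors"

	"github.com/hashicorp/raft"
)

var errNotLeader = errors.New("placement: this node is not the raft leader")

// placementServer wraps the raft node backing a placement instance.
type placementServer struct {
	raft *raft.Raft
}

// checkLeader is called when a dapr sidecar opens the stream; non-leader
// nodes return an error so the sidecar moves on to the next placement
// address in a round-robin manner until it reaches the leader.
func (s *placementServer) checkLeader() error {
	if s.raft.State() != raft.Leader {
		return errNotLeader
	}
	return nil
}
```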

Raft input command in logs and state machine

  1. Raft input command (data) in the log: the leader placement node proposes a command with the properties below to upsert or remove a Dapr sidecar member.

    • Operation: upsert, remove
    • DaprMemberInfo:
      • Name: the unique name of the Dapr sidecar - e.g. IP:port
      • Entities: actor types (optional for remove operation)
      • CreatedAt: the time when the new Dapr sidecar member is added (optional for remove operation)
      • UpdatedAt: the timestamp when the member info needs to be updated (optional for remove operation)

Note: additional properties can be added.

  2. State machine: When the Raft leader node proposes a new command, the state machine updates the current Dapr sidecar member list as its state. Whenever the member list changes, it rebuilds the consistent hashing table. Only the membership list is treated as state in the state machine and snapshotted, because the consistent hashing table can be reconstructed as long as the membership list is the same. A sketch of the command and state machine is shown below.
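
For illustration, the membership-change command and the state machine could look roughly like the sketch below. hashicorp/raft's FSM interface is used purely as an example (the etcd/raft vs hashicorp/raft choice is still an open question), and all field names are tentative.

```go
package main

import (
	"encoding/json"
	"io"
	"time"

	"github.com/hashicorp/raft"
)

// MemberCommand is a tentative raft log entry for a sidecar membership change.
type MemberCommand struct {
	Operation string         `json:"operation"` // "upsert" or "remove"
	Member    DaprMemberInfo `json:"member"`
}

type DaprMemberInfo struct {
	Name      string    `json:"name"`               // unique sidecar name, e.g. IP:port
	Entities  []string  `json:"entities,omitempty"`  // actor types (optional for remove)
	CreatedAt time.Time `json:"createdAt,omitempty"`
	UpdatedAt time.Time `json:"updatedAt,omitempty"`
}

// membershipFSM keeps only the membership list as raft state; the consistent
// hashing table is rebuilt from it whenever the list changes.
type membershipFSM struct {
	members map[string]DaprMemberInfo
}

func (f *membershipFSM) Apply(log *raft.Log) interface{} {
	var cmd MemberCommand
	if err := json.Unmarshal(log.Data, &cmd); err != nil {
		return err
	}
	switch cmd.Operation {
	case "upsert":
		f.members[cmd.Member.Name] = cmd.Member
	case "remove":
		delete(f.members, cmd.Member.Name)
	}
	rebuildConsistentHashingTable(f.members) // derived state, never snapshotted
	return nil
}

// Snapshot/Restore would (de)serialize only f.members; omitted for brevity.
func (f *membershipFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (f *membershipFSM) Restore(r io.ReadCloser) error       { return nil }

func rebuildConsistentHashingTable(map[string]DaprMemberInfo) { /* sketch only */ }
```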

Initial Raft nodes

The default number of Placement nodes will be three, which can tolerate a single node failure; a quorum of the nodes (3/2+1 = 2 nodes) needs to agree to commit a log entry. Placement will not provide cluster join/leave options. Instead, Placement needs to start with the initial three node addresses when running in highly-available mode.
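
As an illustration of bootstrapping such a static three-node cluster (again using hashicorp/raft only as an example; the exact flags and address wiring for placement are not decided here):

```go
package main

import "github.com/hashicorp/raft"

// bootstrapConfig builds the initial, static cluster membership from the
// three placement addresses every node is started with. Runtime join/leave
// is intentionally not supported in this design.
func bootstrapConfig(peers []string) raft.Configuration {
	cfg := raft.Configuration{}
	for _, addr := range peers {
		cfg.Servers = append(cfg.Servers, raft.Server{
			ID:      raft.ServerID(addr),      // the address doubles as a stable ID
			Address: raft.ServerAddress(addr), // e.g. "placement-0:50005"
		})
	}
	return cfg
}

// Usage sketch: at first start, each node calls
//   r.BootstrapCluster(bootstrapConfig([]string{"placement-0:50005", "placement-1:50005", "placement-2:50005"}))
// BootstrapCluster returns an error if the node already has raft state.
```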

Dapr runtime client behavior change

The Dapr sidecar must know the addresses of the three placement nodes to find and connect to the leader node. When the leader node fails, the sidecar reconnects to the placement nodes in a round-robin manner until it reaches the new leader.
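
A minimal sketch of that round-robin reconnect loop on the sidecar side, where connectAndReport is a hypothetical helper that dials one placement address and runs the heartbeat stream until it fails:

```go
package main

import (
	"context"
	"time"
)

// connectAndReport is a hypothetical helper: it dials one placement address,
// runs the heartbeat/stream loop, and returns when the connection fails
// (non-leader rejection, leader crash, etc.).
func connectAndReport(ctx context.Context, addr string) error {
	panic("sketch only")
}

// maintainPlacementConnection cycles through the configured placement
// addresses in a round-robin manner until it reaches the leader, and starts
// over whenever the current connection is lost.
func maintainPlacementConnection(ctx context.Context, addrs []string) {
	for i := 0; ctx.Err() == nil; i = (i + 1) % len(addrs) {
		if err := connectAndReport(ctx, addrs[i]); err != nil {
			time.Sleep(time.Second) // small back-off before trying the next address
		}
	}
}
```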

Raft transport layer

For the first iteration, we will use an insecure transport channel and then encrypt the channel with mTLS workload certificates.

Etcd raft vs Hashicorp raft

There are two well-known Raft protocol implementations in Go. Both implementations have been adopted by many projects and proven in production environments.

Production adoption

  • etcd/raft: etcd, Kubernetes, Docker Swarm, Cloud Foundry Diego, CockroachDB, TiDB, Project Calico, Flannel, Hyperledger, etc.
  • hashicorp/raft: HashiCorp Consul, HashiCorp Vault, other HashiCorp products, InfluxDB, nats-streaming-server, rqlite, etc.

Raft feature implementation

etcd/raft provides more features, such as request forwarding to the leader, and has been tested in more diverse environments than hashicorp/raft.

Both Raft implementations provide the Raft protocol features that dapr placement requires.

  1. Raft server and state machine - etcd/raft needs more boilerplate code in the application than hashicorp/raft, because etcd/raft is designed to give the application more flexibility. As a result, application code using hashicorp/raft looks cleaner because it needs less boilerplate.

  2. Transport layer - etcd/raft provides a default HTTP transport layer, while hashicorp/raft uses a TCP layer with MessagePack to implement RPC among the raft nodes. If we want to encrypt this transport layer with an mTLS cert, etcd/raft only requires setting the cert config in its transport configuration, whereas hashicorp/raft needs a custom transport layer for TLS; implementing that transport layer does not seem difficult (see the sketch below).
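
A rough sketch of what such a TLS transport for hashicorp/raft could look like (illustrative only; issuing and rotating the mTLS workload certificates is not shown, and the pool size and timeout values are arbitrary):

```go
package main

import (
	"crypto/tls"
	"net"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// tlsStreamLayer is a sketch of a custom hashicorp/raft StreamLayer that
// wraps both inbound and outbound raft connections in TLS.
type tlsStreamLayer struct {
	net.Listener             // TLS listener for inbound raft traffic
	config *tls.Config       // client-side TLS config for outbound dials
}

func (t *tlsStreamLayer) Dial(address raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	dialer := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(dialer, "tcp", string(address), t.config)
}

func newTLSTransport(bindAddr string, cfg *tls.Config) (*raft.NetworkTransport, error) {
	ln, err := tls.Listen("tcp", bindAddr, cfg)
	if err != nil {
		return nil, err
	}
	layer := &tlsStreamLayer{Listener: ln, config: cfg}
	// 3 pooled connections and a 10s timeout are arbitrary sketch values.
	return raft.NewNetworkTransport(layer, 3, 10*time.Second, os.Stderr), nil
}
```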

Action items

  • Evaluate Go raft implementation - prototyping
  • Write design spec
  • Adopt Raft in placement
  • Improve placement client to support multiple placement servers
  • Improve dapr sidecar actor service membership to add and remove dapr sidecars reliably
  • Encrypt Raft transport layer using mTLS cert
  • Update helm chart for no-downtime upgrade
  • Update docs

Open questions

  • Etcd/raft vs hashicorp/raft


@youngbupark youngbupark reopened this Nov 3, 2020