Placement Services: Run multiple instances and use Leader election using Raft #663
From Gitter conversation: @mkosieradzki - https://gitter.im/Dapr/community?at=5dadeb55ef84ab37867027f5
@yaron2 - https://gitter.im/Dapr/community?at=5dae0943a03ae1584fe4374d
@mkosieradzki - https://gitter.im/Dapr/community?at=5dae22599825bd6baca3b46b

ad 2, 3. You really trust guarantees for ReplicaSet. Imagine a node goes down (the one that was hosting placement-service). The node will stop sending heartbeats. After 3 minutes (AFAIR) k8s will stop scheduling new pods to this node ;-)))). But ReplicaSet will not schedule a new pod until the node comes back ;-). How about liveness probes? AFAIK liveness probes are issued by the kubelet, which lives on the dead node ;). So you just lost the only copy of placement-service. This is why you should consider running multiple instances and using leader election to elect the one instance doing the real thing. K8S is not Service Fabric, which detects node outages in under 4 seconds :)

@youngp - https://gitter.im/Dapr/community?at=5dae9d8efb4dab784ae26357
@mkosieradzki Could you please create an issue to track this in dapr/dapr? It would be good to discuss this via an issue.

@mkosieradzki - https://gitter.im/Dapr/community?at=5daea3a92a6494729c1952be

In case you agree with my scenario, we can move to the next problem: API-server-based leader election ties you to the API server's (and underlying etcd/Cosmos DB's) availability, which in AKS has an SLO of 99.5% ;-), and AKS folks are refusing to fix this, because "your workloads don't need API server to work" ;-), so you end up deploying your own zk/etcd ;-) or implementing your own clustering and leader election on top of k8s.

@yaron - https://gitter.im/Dapr/community?at=5daf2dd89c39821509655c3c

actually the default pod eviction time for nodes is 5 minutes.. so it's even "worse" in that sense. liveness probes are indeed dispatched from the kubelet on the dead node, so they would be of no help. in any case, the placement service being down does not mean actors stop working or are unreachable. It means there won't be any rebalancing done. What we could do is run the placement service as a DaemonSet and use leader election for active-passive. Leader election won't be API server leader election; there are many ways to achieve that.
@yaron2 @amanbha - I will use etcd's Raft implementation for the Raft consensus algorithm in placement, because it looks like the most popular one. Let me know if you have any concerns.
### Overview of placement service

The placement service acts as the membership service for actor service instances and the partitioning service for actors. When the Dapr runtime starts, it connects to the placement service gRPC server and keeps that connection open for its entire life-cycle in order to synchronize the local actor partition table in the Dapr runtime sidecar. The Dapr runtime piggybacks on the placement service by reporting its current host information, such as the actor types it serves, in the heartbeat request, and receives the updated partition table while sending a heartbeat every second. This partition table is used for placing actors on actor service member nodes and as a look-up table when a user application invokes an actor.

### Role of Raft in Placement

The Raft consensus algorithm plays two roles in the placement service:
#### Raft input command in logs and state machine
### Initial Raft nodes

The default number of placement nodes will be three, which can tolerate a single node failure; a quorum of the nodes (⌊3/2⌋ + 1 = 2 of the 3) must agree before a log entry is committed. Placement will not provide cluster join/leave options. Instead, placement needs to start with the initial three node addresses in highly available mode.

### Dapr runtime client behavior change

The Dapr sidecar must know the addresses of the three placement nodes to find and connect to the leader node. It reconnects to these placement nodes in a round-robin manner when the leader node fails.

### Raft transport layer

For the first iteration, we will use an insecure transport channel and then encrypt the channel with mTLS workload certificates.

### Etcd raft vs Hashicorp raft

There are two Raft protocol implementations in Go. Both implementations have been adopted by many projects and proven in production environments.

#### Production adoption
#### Raft feature implementation

Etcd/raft provides more features, such as request forwarding to the leader, and has been tested in more diverse environments than hashicorp/raft has. Both Raft implementations provide the Raft protocol features that Dapr placement requires.
### Action items
### Open questions
### Reference
Describe the proposal
Currently the Placement service runs as a single instance, which can make it unavailable when its node goes down. The proposal is to run it with multiple instances and use leader election.
cc: @yaron2 @youngbupark @mkosieradzki
RELEASE NOTE: Enables placement service to run with multiple instances.