
[DSIP-9][Feature][Server] Add new registry plugin based on raft #10874

Open · 3 tasks done · Tracked by #14102
zhuxt2015 opened this issue Jul 10, 2022 · 19 comments · May be fixed by #16352

@zhuxt2015
Contributor

zhuxt2015 commented Jul 10, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

The role of zookeeper

  1. Store service-related information of masters and workers: host IP, port, CPU load, etc.
  2. Master and worker health checks
  3. Master failover
  4. Distributed locks

Problems caused by zookeeper

  1. It increases the complexity of deployment, operations, and maintenance. Alongside the DS cluster, a zookeeper cluster must also be deployed, and both clusters have to be monitored and maintained at the same time.
  2. It increases the chance of errors. If the network between the DS cluster and the zookeeper cluster is unstable, errors may occur.

Advantages of removing zookeeper

  1. The deployment architecture is simpler
  2. Lower maintenance cost
  3. No more DS errors due to zookeeper cluster errors

Blueprint for removing zookeeper

[architecture diagram]

  1. Follow Kafka's KRaft approach to implement leader election and data synchronization between masters
  2. Worker servers register with the leader master and maintain a heartbeat with it
  3. The API server obtains cluster information from the leader master

Use case

No response

Related issues

#6680

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@zhuxt2015 zhuxt2015 added the feature and Waiting for reply labels Jul 10, 2022

@github-actions

Thank you for your feedback, we have received your issue. Please wait patiently for a reply.

  • In order for us to understand your request as soon as possible, please provide detailed information, versions, or pictures.
  • If you haven't received a reply for a long time, you can join our slack and send your question to the channel #troubleshooting

@ruanwenjun ruanwenjun removed the Waiting for reply label Jul 10, 2022
@ruanwenjun
Member

Great feature. This has been discussed before; it needs a detailed design, for example how we store the log and how we solve split-brain. This is a good beginning.

@ruanwenjun ruanwenjun added the discussion label Jul 10, 2022
@zhuxt2015
Contributor Author

@ruanwenjun OK, I'll provide a detailed design later

@lgcareer
Contributor

@zhuxt2015 Great feature, could you please give more detail about the leader master and the other masters?

@zhuxt2015
Contributor Author

zhuxt2015 commented Jul 10, 2022


Leader election

  1. The masters elect a leader using the raft algorithm, which guarantees that no split-brain can occur (raft election demonstration).
  2. Raft requires more than half of the nodes to be alive for the cluster to keep running: with 3 masters at most 1 may be down, with 5 masters at most 2, and so on (see the quorum sketch after this list).
  3. If fewer than half of the masters survive, all remaining masters shut themselves down.
  4. If the leader master dies, a new leader is elected from the remaining followers, and re-election completes within a few hundred milliseconds.
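To make the quorum arithmetic in point 2 concrete, here is a minimal sketch; the class and method names are mine, purely for illustration:

    // Quorum arithmetic for a raft cluster: a majority of nodes must stay alive.
    public final class RaftQuorum {

        // Smallest number of nodes that forms a majority in a cluster of size n.
        static int majority(int n) {
            return n / 2 + 1;
        }

        // Maximum number of nodes that may fail while the cluster stays available.
        static int tolerableFailures(int n) {
            return n - majority(n); // equals (n - 1) / 2
        }

        public static void main(String[] args) {
            for (int n : new int[] {3, 5, 7}) {
                System.out.printf("%d masters: majority=%d, tolerable failures=%d%n",
                        n, majority(n), tolerableFailures(n));
            }
            // prints: 3 masters: majority=2, tolerable failures=1
            //         5 masters: majority=3, tolerable failures=2
            //         7 masters: majority=4, tolerable failures=3
        }
    }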

Monitoring and information storage

  1. Workers and followers send the heartbeat information that used to be written to zookeeper (HeartBeat.class) to the leader node via Netty, and the leader keeps it in memory.
  2. Followers periodically fetch the master and worker information from the leader, keep it in memory, and serve as hot standbys for the leader.
  3. The API server obtains the monitoring information of masters and workers from any master node.
  4. If the leader does not receive heartbeats from a worker or follower for more than a certain period of time, it moves them to a dead list so that failover can be handled (a minimal sketch follows this list).
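As a rough illustration of points 1 and 4 (heartbeats kept in the leader's memory plus a dead list), here is a minimal leader-side sketch. All class and field names are hypothetical, the timeout is an assumed value, and the real heartbeats would arrive over Netty:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Leader-side bookkeeping: last heartbeat time per node, plus a dead list.
    public class ClusterHeartbeatTracker {

        private static final long HEARTBEAT_TIMEOUT_MS = 30_000L; // assumed threshold

        private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
        private final Map<String, Long> deadList = new ConcurrentHashMap<>();

        // Called whenever a heartbeat arrives from a worker or follower.
        public void onHeartbeat(String nodeAddress) {
            lastHeartbeat.put(nodeAddress, System.currentTimeMillis());
            deadList.remove(nodeAddress); // the node came back if it was considered dead
        }

        // Invoked periodically by a scheduled thread on the leader.
        public void evictExpiredNodes() {
            long now = System.currentTimeMillis();
            lastHeartbeat.forEach((node, lastSeen) -> {
                if (now - lastSeen > HEARTBEAT_TIMEOUT_MS) {
                    lastHeartbeat.remove(node);
                    deadList.put(node, lastSeen); // handed over to failover handling
                }
            });
        }
    }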

@ruanwenjun ruanwenjun changed the title [Feature][Server] Remove the Apache Zookeeper dependency [Feature][Server] Add new registry plugin based on raft Jul 15, 2022
@leo-lao

leo-lao commented Jul 19, 2022

Hi @zhuxt2015 I am also interested in implementing this issue. Maybe I can join the discussion of design and implementation

@zhuxt2015
Contributor Author

Hi @zhuxt2015 I am also interested in implementing this issue. Maybe I can join the discussion of design and implementation

@leo-lao Great! Thank you for joining. I've completed most of the functions and will submit a PR this week; then let's discuss the subsequent division of development work.

@ruanwenjun
Member

Hi @zhuxt2015 I am also interested in implementing this issue. Maybe I can join the discussion of design and implementation

@leo-lao Great! Thank you for joining. I've completed most of the functions and will submit a PR this week; then let's discuss the subsequent division of development work.

Before submitting a PR, it's better to provide a detailed design. The current design is not enough; we need to consider how to persist data to disk, how to implement the lock, and how to maintain the data. Do we need to use some library, or will we implement raft ourselves?

@zhuxt2015
Contributor Author

I will use the sofa-jraft library; here are the GitHub repository and user guide.

sofa-jraft introduction

SOFAJRaft is a production-grade Java implementation of the RAFT consensus algorithm, licensed under Apache License 2.0. It relies on some third-party components whose open-source licenses are also Apache License 2.0.

The core components are StateMachine and RheaKV.

StateMachine is an implementation of users’ core logic. It calls the onApply(Iterator) method to apply log entries that are submitted with Node#apply(task) to the business state machine.

RheaKV is a lightweight, distributed, and embedded KV storage library, which is included in the JRaft project as a submodule.
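For illustration, a registry state machine built on StateMachineAdapter could look roughly like the sketch below. The class name, the toy "key=value" encoding, and the in-memory map are my assumptions, not code from SOFAJRaft or from the eventual PR:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import com.alipay.sofa.jraft.Iterator;
    import com.alipay.sofa.jraft.Status;
    import com.alipay.sofa.jraft.core.StateMachineAdapter;

    // Sketch: each committed raft log entry carries a "key=value" registry update
    // that is applied to an in-memory map replicated on every master.
    public class RegistryStateMachine extends StateMachineAdapter {

        private final Map<String, String> registryData = new ConcurrentHashMap<>();

        @Override
        public void onApply(final Iterator iter) {
            while (iter.hasNext()) {
                final ByteBuffer data = iter.getData();
                // Toy encoding for the sketch; a real plugin would use a proper codec.
                final String[] kv = StandardCharsets.UTF_8.decode(data).toString().split("=", 2);
                registryData.put(kv[0], kv[1]);
                if (iter.done() != null) {
                    // The entry was proposed on this node: signal the caller it is committed.
                    iter.done().run(Status.OK());
                }
                iter.next();
            }
        }
    }

On the proposing side, the leader submits such an entry with Node#apply(task), wrapping the serialized command in a Task together with a Closure that is called back once the entry is committed.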

Ephemeral Node

All node information is stored in the StateMachine's memory, and the StateMachine manages node registration and removal. When a new node joins the cluster, it sends a heartbeat packet to the leader master; the node's last update time is recorded and synchronized to all masters. When a node goes down, the Ephemeral Node Refresh Thread scans the records in the StateMachine; when a node's last update time differs from the current time by more than a configured threshold, the node is removed and the removal is synchronized to the other masters (this is the dead-list mechanism sketched earlier).
[ephemeral node diagram]

Subscribe/Notify

The design of Subscribe/Notify is the same as for ephemeral nodes: when the leader master's StateMachine senses a data change on the server, it triggers the subscribed listeners.
[subscribe/notify diagram]
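A rough sketch of what the notify path could look like; the SubscriptionRegistry class and its listener interface are hypothetical:

    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Sketch: subscribers register a listener per key prefix; when the leader's
    // state machine applies a change, every matching listener is notified.
    public class SubscriptionRegistry {

        public interface DataChangeListener {
            void onChange(String key, String newValue); // newValue == null means removal
        }

        private final ConcurrentHashMap<String, List<DataChangeListener>> listeners =
                new ConcurrentHashMap<>();

        public void subscribe(String keyPrefix, DataChangeListener listener) {
            listeners.computeIfAbsent(keyPrefix, k -> new CopyOnWriteArrayList<>()).add(listener);
        }

        // Called from the state machine after a change has been applied.
        public void notifyChange(String key, String newValue) {
            listeners.forEach((prefix, ls) -> {
                if (key.startsWith(prefix)) {
                    ls.forEach(l -> l.onChange(key, newValue));
                }
            });
        }
    }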

Global Lock

The design of the global lock is the same as for ephemeral nodes: a KV store inside the StateMachine holds the lock info. RheaKVStore stores the master servers' locks and clears expired locks.
[global lock diagram]
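Based on the RheaKV user guide, acquiring such a lock might look roughly like the sketch below; the wrapper class is mine, and the exact RheaKV signatures may differ between versions:

    import java.util.concurrent.TimeUnit;

    import com.alipay.sofa.jraft.rhea.client.DefaultRheaKVStore;
    import com.alipay.sofa.jraft.rhea.client.RheaKVStore;
    import com.alipay.sofa.jraft.rhea.options.RheaKVStoreOptions;
    import com.alipay.sofa.jraft.rhea.util.concurrent.DistributedLock;

    // Sketch: a registry-level lock backed by RheaKV's DistributedLock.
    public class RaftRegistryLock {

        private final RheaKVStore rheaKVStore;

        public RaftRegistryLock(RheaKVStoreOptions options) {
            this.rheaKVStore = new DefaultRheaKVStore();
            this.rheaKVStore.init(options);
        }

        // Run an action under a lock with a lease; the lease lets RheaKV
        // expire locks held by masters that have crashed.
        public boolean runWithLock(String lockKey, Runnable action) {
            final DistributedLock<byte[]> lock =
                    rheaKVStore.getDistributedLock(lockKey, 30, TimeUnit.SECONDS);
            if (!lock.tryLock()) {
                return false; // another master currently holds the lock
            }
            try {
                action.run();
                return true;
            } finally {
                lock.unlock();
            }
        }
    }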

@zhongjiajie zhongjiajie changed the title [Feature][Server] Add new registry plugin based on raft [DSIP][Feature][Server] Add new registry plugin based on raft Jul 20, 2022
@zhongjiajie zhongjiajie changed the title [DSIP][Feature][Server] Add new registry plugin based on raft [DSIP-9][Feature][Server] Add new registry plugin based on raft Jul 20, 2022
@zhongjiajie
Member

@zhuxt2015 Please follow the DSIP process at https://dolphinscheduler.apache.org/en-us/community/DSIP.html to create a DSIP, thanks

@leo-lao

leo-lao commented Jul 20, 2022

@zhuxt2015 I searched the available raft implementations and found that apache ratis may be a better choice.
SOFAJRaft requires more dependencies, like

            <dependency>
                <groupId>com.alipay.sofa</groupId>
                <artifactId>bolt</artifactId>
                <version>${bolt.version}</version>
            </dependency>
            <dependency>
                <groupId>com.alipay.sofa</groupId>
                <artifactId>hessian</artifactId>
                <version>${hessian.version}</version>
            </dependency>

This is not ideal for an Apache project.

On the other hand, I found that apache ratis is used by apache ozone and alluxio, which adds more credibility.

@zhuxt2015
Contributor Author

@leo-lao com.alipay.sofa.bolt and com.alipay.sofa.hessian use Apache License 2.0, so I think they can be used. I also looked at ratis, but it does not implement a distributed lock, so I won't use it.

@leo-lao

leo-lao commented Jul 21, 2022

Is it a must to implement a distributed lock in dolphinscheduler?
As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events.
Is it OK if we just put the above logic in the leader master?

@ruanwenjun
Member

Is it a must to implement a distributed lock in dolphinscheduler?
As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events.
Is it OK if we just put the above logic in the leader master?

This is not a good idea; you would need to introduce the leader role into the other registry plugins.

@leo-lao

leo-lao commented Jul 22, 2022

Is it a must to implement a distributed lock in dolphinscheduler?
As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events.
Is it OK if we just put the above logic in the leader master?

This is not a good idea; you would need to introduce the leader role into the other registry plugins.

In fact no: with raft introduced, we introduce a leader and followers into this system itself, so there is no need to rely on other registry plugins.

| method | details | pros | cons |
| --- | --- | --- | --- |
| raft with distributed lock | multiple masters handling tasks | no pressure on a single master | extra dependency/implementation of distributed lock |
| raft without distributed lock | only the raft leader handling tasks | simple, popular way (like Apache Ozone, Alibaba RemoteShuffleService) | pressure on a single master |

I feel both are OK

@ruanwenjun
Member

Is it a must to implement a distributed lock in dolphinscheduler?
As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events.
Is it OK if we just put the above logic in the leader master?

This is not a good idea; you would need to introduce the leader role into the other registry plugins.

In fact no: with raft introduced, we introduce a leader and followers into this system itself, so there is no need to rely on other registry plugins.

| method | details | pros | cons |
| --- | --- | --- | --- |
| raft with distributed lock | multiple masters handling tasks | no pressure on a single master | extra dependency/implementation of distributed lock |
| raft without distributed lock | only the raft leader handling tasks | simple, popular way (like Apache Ozone, Alibaba RemoteShuffleService) | pressure on a single master |

I feel both are OK

We need to reach a consensus that we introduce raft just as a new registry plugin; this will not affect our existing plugins.

@zhuxt2015
Contributor Author

@leo-lao For the time being, backward compatibility should be guaranteed: the user can choose to use either the raft registry or the zookeeper registry.

@leo-lao

leo-lao commented Jul 23, 2022

Gotcha, then your idea works

@MatthewAden MatthewAden linked a pull request Jul 20, 2024 that will close this issue
3 tasks
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Aug 19, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 8, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 8, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 11, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 11, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 11, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 13, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 25, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 25, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 25, 2024
MatthewAden added a commit to MatthewAden/dolphinscheduler that referenced this issue Sep 25, 2024