
RIP-67 jRaft-Controller Implementation


Status

Background & Motivation

What do we need to do

  • Will we add a new module?

No new modules will be added, but a new implementation will be added for the Controller interface.

  • Will we add new APIs?

No client-level or admin-tool APIs will be added or modified. Some new internal interfaces and APIs will be introduced.

  • Will we add a new feature?

No. The JRaft Controller is a new implementation of an existing feature, not a new feature.

Why should we do that

  • Are there any problems of our current project?

Yes, there are some issues with the current DLedger Controller:

DLedger is a Raft-based log store purpose-built for the RocketMQ CommitLog, and it makes several design choices that are specific to that use case. In particular, DLedger does not implement snapshot-based log truncation; instead it relies on an expiration mechanism that simply discards logs once they exceed their retention time. As a CommitLog store this works well, since expired logs can safely be dropped.

However, this approach is poorly suited to providing distributed consensus and consistency guarantees for an upper-layer state machine. After an unexpected machine failure, restoring the in-memory state machine requires replaying the Raft logs one by one, so every log entry must be retained and the timeout-based deletion mechanism cannot be enabled. Without a snapshot interface, DLedger's logs grow without bound, eventually exceeding the machine's disk capacity, and fault-recovery time grows without bound as well. This is unacceptable.

As shown in the following figure, the design of the DLedger Controller does not guarantee linearizability:

[Figure: DLedger Controller request-handling workflow]

The core function of the Controller is to manage the liveness status of the nodes and the SyncStateSet to achieve automatic election of the Master.

Let's describe the workflow of the DLedger Controller using the example of an AlterSyncStateSet request:

1. The Master Broker generates an AlterSyncStateSet request, which includes the desired SyncStateSet to switch to.

2. DLedgerController queries the current SyncStateSet from ReplicasInfoManager and generates a response action based on it (e.g., adding/removing nodes from the SyncStateSet).

3. DLedgerController submits this response action (event) to DLedger Raft, which replicates the event to the other Controller nodes. Once consensus is reached, the event is submitted to the DLedgerControllerStateMachine.

4. DLedgerControllerStateMachine modifies the data in ReplicasInfoManager based on the event.

5. The Broker's heartbeat reaches the BrokerLifecycleListener through a separate path that is not routed through Raft, and node liveness is tracked by ReplicasInfoManager.
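
A minimal sketch of steps 2 and 3, with hypothetical type and method names standing in for the actual DLedgerController code, makes the check-then-submit pattern explicit:

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hedged sketch of DLedgerController's steps 2-3. All names and types here are
// simplified stand-ins, not the actual RocketMQ code. The key point: the
// current SyncStateSet is read BEFORE the event goes through Raft.
public class AlterSyncStateSetFlowSketch {

    interface ReplicasInfoView { Set<String> getSyncStateSet(String brokerName); }
    interface RaftLayer { CompletableFuture<Boolean> replicate(byte[] event); }

    private final ReplicasInfoView replicasInfoManager;
    private final RaftLayer raft;

    AlterSyncStateSetFlowSketch(ReplicasInfoView replicasInfoManager, RaftLayer raft) {
        this.replicasInfoManager = replicasInfoManager;
        this.raft = raft;
    }

    CompletableFuture<Boolean> alterSyncStateSet(String broker, Set<String> requested) {
        // Step 2: read the current state from ReplicasInfoManager, outside of Raft.
        Set<String> current = replicasInfoManager.getSyncStateSet(broker);
        if (current.equals(requested)) {
            // Looks like a no-op: fail fast WITHOUT submitting anything to Raft.
            return CompletableFuture.completedFuture(false);
        }
        // Step 3: build an event from the (possibly stale) read and replicate it.
        byte[] event = encodeEvent(broker, requested); // hypothetical encoder
        return raft.replicate(event); // applied to the state machine in step 4
    }

    private byte[] encodeEvent(String broker, Set<String> set) {
        return (broker + "->" + set).getBytes();
    }
}
```

Because the read in step 2 and the apply in step 4 are not serialized through Raft together, two concurrent requests can interleave between them, as the example below shows.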

The above workflow has a significant flaw: it does not satisfy linearizability. Because the request is processed (step 2) before it passes through Raft (step 3), the response action may be generated from stale data. Let's illustrate this with an example:

Suppose there is a Broker Master A that triggers two consecutive AlterSyncStateSet requests.

The initial SyncStateSet in ReplicasInfoManager is {A, B}.

For the two AlterSyncStateSet requests, the first one is {A, B, C}, and the second one is {A, B} (removing node C).

Assume that the first request completes step 2, generating an event to insert node C into the SyncStateSet. It is currently in the process of Raft replication (step 3) and has not reached step 4 yet.

At this point, the second request arrives at the Controller. Since the SyncStateSet it reads is still {A, B}, the Controller concludes that the SyncStateSet is unchanged and directly returns a failure response to the requester, without submitting anything to Raft (per the code logic).

Finally, the first request completes step 3, the data is broadcast to all Controller nodes, and step 4 eventually completes by inserting node C into the SyncStateSet.

As a result, the final state of the SyncStateSet is {A, B, C}, while the expected state is {A, B}.

The Controller metadata inconsistency reported in "[Summer of code] Let controller become role state after append initial logs" by hzh0425 (apache/rocketmq PR #4442 on github.com) stems from this problem.

Similarly, heartbeat management, which operates entirely outside the Raft path, is exposed to the same class of problems.

  • What benefits can we get from the proposed changes?

With this proposal, users can replace the DLedger Controller with the JRaft Controller. The JRaft Controller implements the snapshot function: it periodically snapshots the state machine and truncates the Raft logs, avoiding unbounded log growth. In addition, the JRaft Controller was restructured during its design to avoid the linearizability issue described above.
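
For reference, a minimal sketch of the jRaft snapshot hooks involved (SOFAJRaft's StateMachineAdapter; serialize/deserialize are hypothetical placeholders for dumping and restoring the Controller metadata, e.g. the contents of ReplicasInfoManager):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.alipay.sofa.jraft.Closure;
import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.error.RaftError;
import com.alipay.sofa.jraft.storage.snapshot.SnapshotReader;
import com.alipay.sofa.jraft.storage.snapshot.SnapshotWriter;

// Minimal sketch of jRaft's snapshot hooks; the real JRaftController
// implementation differs.
public class ControllerSnapshotSketch extends StateMachineAdapter {

    @Override
    public void onSnapshotSave(SnapshotWriter writer, Closure done) {
        try {
            // Persist a full image of the state machine. Once the snapshot is
            // saved, jRaft may truncate all Raft logs older than it.
            Files.write(Paths.get(writer.getPath(), "metadata"), serialize());
            done.run(writer.addFile("metadata")
                ? Status.OK()
                : new Status(RaftError.EIO, "Fail to add snapshot file"));
        } catch (IOException e) {
            done.run(new Status(RaftError.EIO, "Fail to save snapshot: %s", e.getMessage()));
        }
    }

    @Override
    public boolean onSnapshotLoad(SnapshotReader reader) {
        try {
            // On restart, restore from the latest snapshot and then replay only
            // the logs written after it, instead of the entire history.
            deserialize(Files.readAllBytes(Paths.get(reader.getPath(), "metadata")));
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    @Override
    public void onApply(Iterator iter) {
        while (iter.hasNext()) { iter.next(); } // event handling elided here
    }

    private byte[] serialize() { return new byte[0]; }            // placeholder
    private void deserialize(byte[] bytes) { /* placeholder */ }  // placeholder
}
```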

Goals

  • What problem is this proposal designed to solve?
  1. Unbounded growth of Raft logs, caused by DLedger's lack of snapshot-based log truncation.

  2. Linearizability violations, caused by the DLedger Controller processing requests outside the Raft path.

Non-Goals

  • What problem is this proposal NOT designed to solve?

This proposal does not introduce a new multi-replica storage mechanism; it is an improvement on the existing architecture.

Changes

New Configuration

```
// Controller type: jRaft or DLedger; defaults to DLedger
controllerType=jRaft

// jRaft-related settings

// Election timeout, default 1 second
jRaftElectionTimeoutMs=1000

// Snapshot interval, default 1 hour; a longer interval, such as 1 or 3 days, is recommended here
jRaftSnapshotIntervalSecs=3600

// Group id
jRaftGroupId=jRaft-Controller

// jRaft address of the local node
jRaftServerId=localhost:9880

// Addresses of the jRaft group
jRaftInitConf=localhost:9880,localhost:9881,localhost:9882

// In the jRaft Controller, jRaft and the RPCService exposed to Brokers do not
// share socket resources. The setting below lists the addresses on which the
// Controllers listen for Broker RPC; note that the IPs and ports must
// correspond one-to-one with jRaftInitConf.
jRaftControllerRPCAddr=localhost:9770,localhost:9771,localhost:9772
```
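
For example, under the three-node setup above, the second Controller node's configuration would differ only in its own jRaft address (illustrative values):

```
controllerType=jRaft
jRaftGroupId=jRaft-Controller
jRaftInitConf=localhost:9880,localhost:9881,localhost:9882
jRaftControllerRPCAddr=localhost:9770,localhost:9771,localhost:9772

// This node is the second entry of jRaftInitConf, so it serves Broker RPC on
// the second entry of jRaftControllerRPCAddr, i.e. localhost:9771.
jRaftServerId=localhost:9881
```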

Architecture

[Figure: JRaft Controller architecture]

In the JRaft Controller, all request handling is pushed down to the state machine layer. In addition, the Broker's heartbeat is treated as an ordinary request: it goes through Raft replication and broadcast before being submitted to the state machine.

This design ensures two points:

  1. All requests are processed at the state machine layer, which eliminates the possibility of generating response actions from stale data and ensures linearizability (see the sketch below).

  2. Node liveness status is reported to the state machine through Raft, so it can be recovered from the Raft logs. Together, these mechanisms ensure that both request processing and node liveness are handled consistently and reliably across Controller nodes.
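
A minimal sketch of this pattern, assuming SOFAJRaft's StateMachineAdapter with hypothetical decodeEvent/applyEvent helpers (the actual JRaftController classes differ):

```java
import java.nio.ByteBuffer;

import com.alipay.sofa.jraft.Closure;
import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;

// Hedged sketch of the "decide inside the state machine" design.
public class ControllerStateMachineSketch extends StateMachineAdapter {

    @Override
    public void onApply(Iterator iter) {
        while (iter.hasNext()) {
            // Every request -- AlterSyncStateSet, elections, even Broker
            // heartbeats -- reaches this point only AFTER Raft consensus, so
            // the state it is checked against can never be stale.
            ByteBuffer data = iter.getData();
            applyEvent(decodeEvent(data));

            // done() is non-null only on the node that proposed the task
            // (the leader); running it completes the pending response.
            Closure done = iter.done();
            if (done != null) {
                done.run(Status.OK());
            }
            iter.next();
        }
    }

    private Object decodeEvent(ByteBuffer data) { return data; }          // placeholder
    private void applyEvent(Object event) { /* mutate metadata here */ }  // placeholder
}
```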

For heartbeats, JRaftController fixes the heartbeat timestamp at the RaftBrokerLifecycleListener, before the heartbeat enters Raft, rather than reading the clock inside the StateMachine. This guarantees that every Controller node observes the same heartbeat time for a given Broker, as sketched below.
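
A hedged sketch of that idea; the listener name comes from this proposal, while the method shape and payload encoding are illustrative only:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.entity.Task;

// Sketch of fixing the heartbeat timestamp before it enters Raft.
public class RaftBrokerLifecycleListenerSketch {

    private final Node raftNode; // this Controller's jRaft node

    public RaftBrokerLifecycleListenerSketch(Node raftNode) {
        this.raftNode = raftNode;
    }

    public void onBrokerHeartbeat(String brokerAddr) {
        // Stamp the time HERE, on the node that received the heartbeat. If
        // each replica instead read its own clock inside the StateMachine,
        // the replicas would record slightly different times for the same
        // heartbeat and could make divergent liveness decisions.
        long heartbeatTimeMs = System.currentTimeMillis();
        byte[] payload = (brokerAddr + '|' + heartbeatTimeMs).getBytes(StandardCharsets.UTF_8);

        Task task = new Task();
        task.setData(ByteBuffer.wrap(payload));
        task.setDone(status -> { /* log replication failures in real code */ });
        raftNode.apply(task); // replicated via Raft, then applied by the state machine
    }
}
```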

Rejected Alternatives

Other Raft libraries could also be used to implement the Controller.

How do the alternatives solve the issue?

Same as JRaft Controller.

Pros and Cons of alternatives

This depends on the Raft library actually used, and different Raft libraries have different characteristics.

Why should we reject the above alternatives?

JRaft is mature, has been validated by large-scale production use, and provides the functionality we need.

Attachments

Fault injection test report
