[Summer of Code] Dledger controller #4195

Merged 23 commits, Apr 30, 2022

@hzh0425 (Member) commented Apr 22, 2022

What is the purpose of the change

tracking issue: #4330

Add a controller for the HA service. The controller is based on DLedger.
The architecture is:
(image: architecture diagram)

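DledgerController contains several components, related as follows:

  • EventScheduler: the event scheduler. It holds a BlockingQueue, and every controller API can submit an EventHandler to it.
  • EventHandler: the event handler, the unit that EventScheduler schedules. It defines how to run an event, how to append the event to DLedger, how to return the result, and so on.
  • RoleChangeHandler: a listener for DLedger role changes. When the controller becomes leader, it starts the EventScheduler.
  • ControllerStateMachine: the state machine implementation. It is only responsible for reading log entries from DLedger, decoding them into events, and applying those events to ReplicasInfoManager.
  • ReplicasInfoManager: the controller's actual in-memory state machine. It has two kinds of functions:
    1. Functions that do not modify the state machine, such as alterSyncStateSet and electMaster. The controller calls these to generate events from the in-memory metadata.
    The controller then wraps these events in an EventHandler and submits it to the EventScheduler queue, and the EventHandler appends them to DLedger.
    2. Functions that modify the state machine, such as handleXXXX (for example handleElectMaster). These are called by ControllerStateMachine: once an event has been successfully appended to DLedger, the corresponding handleXXX function is called to update the in-memory metadata.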
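
The split between event-generating and state-mutating functions described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ElectMasterEventSketch`, `ReplicasInfoSketch`, `LogSketch` and their methods are hypothetical names, and an in-memory list stands in for the DLedger log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event: "brokerName now has this master". Not a class from the PR.
final class ElectMasterEventSketch {
    final String brokerName;
    final String newMaster;

    ElectMasterEventSketch(String brokerName, String newMaster) {
        this.brokerName = brokerName;
        this.newMaster = newMaster;
    }
}

// Stand-in for ReplicasInfoManager: one event-generating method, one mutating method.
final class ReplicasInfoSketch {
    private final Map<String, String> masterByBroker = new HashMap<>();

    // Kind 1: only produces an event; the in-memory state is left untouched.
    ElectMasterEventSketch electMaster(String brokerName, String candidate) {
        return new ElectMasterEventSketch(brokerName, candidate);
    }

    // Kind 2: called by the state machine once the event is in the log.
    void handleElectMaster(ElectMasterEventSketch event) {
        masterByBroker.put(event.brokerName, event.newMaster);
    }

    String masterOf(String brokerName) {
        return masterByBroker.get(brokerName);
    }
}

// Stand-in for the log plus ControllerStateMachine: append first, then apply.
final class LogSketch {
    private final List<ElectMasterEventSketch> entries = new ArrayList<>();
    private final ReplicasInfoSketch replicas;

    LogSketch(ReplicasInfoSketch replicas) {
        this.replicas = replicas;
    }

    void appendAndApply(ElectMasterEventSketch event) {
        entries.add(event);                 // assumed equivalent of "append to dledger"
        replicas.handleElectMaster(event);  // assumed equivalent of "apply to state machine"
    }

    int size() {
        return entries.size();
    }
}

class ComponentFlowDemo {
    public static void main(String[] args) {
        ReplicasInfoSketch replicas = new ReplicasInfoSketch();
        LogSketch log = new LogSketch(replicas);
        ElectMasterEventSketch event = replicas.electMaster("broker-a", "10.0.0.1:10911");
        // Nothing has been applied yet, so no master is recorded at this point.
        System.out.println(replicas.masterOf("broker-a"));
        log.appendAndApply(event);
        System.out.println(replicas.masterOf("broker-a"));
    }
}
```

The point of the sketch is the ordering: state changes only inside the apply step, so whatever is in memory always corresponds to entries already in the log.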

Brief changelog

  • Add a controller, based on dledger
  • Add some options in name-srv config
  • Add controller into name-srv, add controllerRequestProcessor.

Verifying this change

XXXX

Follow this checklist to help us incorporate your contribution quickly and easily. Note: it would be helpful if you could complete the following 5 checklist items (the last one is not required) before requesting a community review of your PR.

  • Make sure there is a Github issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a Github issue. Your pull request should address just this issue, without pulling in other changes - one PR resolves one issue.
  • Format the pull request title like [ISSUE #123] Fix UnknownException when host config not exist. Each commit in the pull request should have a meaningful subject line and body.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write the necessary unit tests (over 80% coverage) to verify your logic; prefer mocks where cross-module dependencies exist. If a new feature or significant change is committed, please remember to add an integration test in the test module.
  • Run mvn -B clean apache-rat:check findbugs:findbugs checkstyle:checkstyle to make sure basic checks pass. Run mvn clean install -DskipITs to make sure unit-test pass. Run mvn clean test-compile failsafe:integration-test to make sure integration-test pass.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@hzh0425 (Member, Author) commented Apr 22, 2022

@RongtongJin Hi, please take a look.

@hzh0425 changed the title from "Feature/ha controller" to "Dledger controller" on Apr 22, 2022

@dugenkui03 (Contributor)

Some suggestions (details in the review):

  1. Remove unnecessary setters;
  2. Return a copy or an unmodifiable view from getters;
  3. Use defensive copies in constructors.

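Details (translated from Chinese)

The review comments contain three suggestions that apply to all entity definitions. Please evaluate them.

  1. Remove unnecessary setters; use final + constructor or final + builder instead;
  2. Return a defensive copy or an unmodifiable view from getters;
  3. Make defensive copies of lists in constructors.

The goal of all three is to make objects safe to create and publish, so that untrusted code holding a reference to the object, or to one of its collection fields, cannot corrupt the object's state through improper use.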
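
A minimal sketch of an entity that follows all three suggestions. `SyncStateSketch` and its fields are hypothetical and chosen only for illustration; they are not classes from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical entity used only for illustration.
final class SyncStateSketch {
    // Suggestion 1: final fields set in the constructor, no setters.
    private final String brokerName;
    private final List<String> syncStateSet;

    SyncStateSketch(String brokerName, List<String> syncStateSet) {
        this.brokerName = brokerName;
        // Suggestion 3: defensive copy, so later changes to the caller's list do not leak in.
        this.syncStateSet = new ArrayList<>(syncStateSet);
    }

    String getBrokerName() {
        return brokerName;
    }

    // Suggestion 2: unmodifiable view, so callers can read but not mutate the internal list.
    List<String> getSyncStateSet() {
        return Collections.unmodifiableList(syncStateSet);
    }
}

class DefensiveCopyDemo {
    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("10.0.0.1:10911");
        SyncStateSketch state = new SyncStateSketch("broker-a", input);
        input.add("10.0.0.2:10911"); // does not affect `state`
        System.out.println(state.getSyncStateSet().size());
        try {
            state.getSyncStateSet().add("10.0.0.3:10911");
        } catch (UnsupportedOperationException e) {
            System.out.println("getter view is read-only");
        }
    }
}
```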

@RongtongJin RongtongJin self-requested a review April 22, 2022 06:25
Comment on lines +171 to +176
if (!brokerIdTable.containsKey(brokerAddress)) {
    // If this broker replicas is first time come online, we need to apply a new id for this replicas.
    brokerId = brokerInfo.newBrokerId();
    final ApplyBrokerIdEvent applyIdEvent = new ApplyBrokerIdEvent(request.getBrokerName(),
        brokerAddress, brokerId);
    result.addEvent(applyIdEvent);
Contributor

Will generating events based on the values in memory cause duplicate brokerId or epoch?

Member Author

No, because the EventScheduler dispatches in FIFO order.
The next EventHandler is dispatched only after the events generated by the previous EventHandler have been appended to DLedger and applied to the state machine.
So the duplication you describe cannot happen.
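
The argument can be illustrated with a single worker thread draining a FIFO queue. This is a sketch under assumptions, not the PR's implementation: `FifoIdAllocatorSketch` is a hypothetical name, `Executors.newSingleThreadExecutor()` stands in for the EventScheduler, and the append to the log is reduced to a comment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the scheduler plus the in-memory broker id table.
final class FifoIdAllocatorSketch {
    // One worker thread executing submitted tasks sequentially, in submission order.
    private final ExecutorService scheduler = Executors.newSingleThreadExecutor();
    // Touched only by the scheduler thread, so no extra locking is used here.
    private final Map<String, Long> brokerIdTable = new HashMap<>();
    private long nextBrokerId = 1;

    Future<Long> register(String brokerAddress) {
        Callable<Long> task = () -> {
            Long existing = brokerIdTable.get(brokerAddress);
            if (existing != null) {
                return existing;
            }
            long id = nextBrokerId;                // generate the event from memory
            // (a real controller would append the event to the log at this point)
            brokerIdTable.put(brokerAddress, id);  // apply it before the next task runs
            nextBrokerId = id + 1;
            return id;
        };
        return scheduler.submit(task);
    }

    void shutdown() {
        scheduler.shutdown();
    }
}

class FifoIdAllocatorDemo {
    public static void main(String[] args) throws Exception {
        FifoIdAllocatorSketch allocator = new FifoIdAllocatorSketch();
        Future<Long> a = allocator.register("10.0.0.1:10911");
        Future<Long> b = allocator.register("10.0.0.2:10911");
        Future<Long> again = allocator.register("10.0.0.1:10911");
        System.out.println(a.get() + " " + b.get() + " " + again.get()); // prints "1 2 1"
        allocator.shutdown();
    }
}
```

Because each task finishes applying its result before the next task starts, a later task always sees the id handed out by the earlier one.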

Member Author

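This depends on DLedger snapshot support: restore the state from the most recent snapshot, then replay the log entries after it.
DLedger does not support this yet, so the log can only be replayed from the beginning.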

Member Author @hzh0425, Apr 24, 2022

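Proposed approach: when the role changes to leader, the controller proactively submits an empty proposal. Only once that proposal is committed is the controller's metadata considered fully recovered, and only then can it serve external requests.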
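
A rough sketch of that idea, assuming a plain in-memory list in place of the DLedger log. `LeaderReadinessSketch`, `NO_OP` and all method names are hypothetical; the sketch only shows why applying the empty proposal implies that every earlier entry has been replayed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and the in-memory log are illustrative assumptions.
final class LeaderReadinessSketch {
    static final String NO_OP = "NO_OP";

    private final List<String> log = new ArrayList<>();
    private int appliedIndex = 0;   // number of log entries applied so far
    private int noOpIndex = -1;     // position of the current leader's empty proposal
    private boolean ready = false;

    // Simulates entries written by a previous leader that are not yet applied here.
    void appendFromPreviousLeader(String entry) {
        log.add(entry);
    }

    // Called on role change to leader: propose an empty entry behind everything else.
    void onBecomeLeader() {
        ready = false;
        log.add(NO_OP);
        noOpIndex = log.size() - 1;
    }

    // Applies the next entry. Reaching the empty proposal means all earlier
    // entries have been replayed, so the in-memory metadata is up to date.
    boolean applyNext() {
        if (appliedIndex >= log.size()) {
            return false;
        }
        if (appliedIndex == noOpIndex) {
            ready = true;
        }
        appliedIndex++;
        return true;
    }

    // Request processing would be gated on this flag.
    boolean isReady() {
        return ready;
    }
}

class LeaderReadinessDemo {
    public static void main(String[] args) {
        LeaderReadinessSketch controller = new LeaderReadinessSketch();
        controller.appendFromPreviousLeader("event-1");
        controller.appendFromPreviousLeader("event-2");
        controller.onBecomeLeader();
        while (controller.applyNext()) {
            System.out.println("ready = " + controller.isReady());
        }
    }
}
```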

@RongtongJin added the soc label (Summer of Code, hosted by Google, Alibaba, Chinese Academy of Sciences and so on) on Apr 22, 2022
@RongtongJin changed the title from "Dledger controller" to "[Summer of Code] Dledger controller" on Apr 22, 2022

@hzh0425 (Member, Author) commented Apr 23, 2022

@RongtongJin Thanks a lot

1.remove originMasterId in replicasInfo
2.add DledgerControllerConfig
1.add option isProcessReadEvent.
2.add ControllerConfig
add namesrv into dledgerController to predict whether the broker is alive.
Comment on lines 360 to 364
tryTimes++;
if (tryTimes > 3) {
    log.error("Controller leader append initial log failed too many times");
    break;
}
Contributor

So it breaks out of the loop after more than 3 consecutive failures? Shouldn't it keep looping until the append succeeds?

Member Author

Right after becoming leader, it may take a while before an append succeeds, but there is no known bound on when it will. Could it possibly never succeed?

Contributor

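Yes, but I think the controller should not serve requests before the append succeeds. Alternatively, log once every x failures to alert the user.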

Member Author

good idea
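
One way the agreed behavior could look, as a hedged sketch rather than the code that was eventually merged: retry until the append succeeds and warn every `warnEvery` failures. `AppendRetrySketch`, `appendUntilSuccess` and its parameters are hypothetical, and the append itself is abstracted as a `BooleanSupplier`.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; the names and signature are assumptions for illustration.
final class AppendRetrySketch {
    // Retries appendOnce until it returns true, printing a warning every
    // warnEvery failures. Returns the total number of attempts made.
    static int appendUntilSuccess(BooleanSupplier appendOnce, int warnEvery, long backoffMillis)
            throws InterruptedException {
        int failures = 0;
        while (!appendOnce.getAsBoolean()) {
            failures++;
            if (failures % warnEvery == 0) {
                System.err.println("Controller leader append initial log failed "
                        + failures + " times, still retrying");
            }
            Thread.sleep(backoffMillis);
        }
        return failures + 1;
    }
}

class AppendRetryDemo {
    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated append that fails four times and then succeeds.
        int attempts = AppendRetrySketch.appendUntilSuccess(() -> ++calls[0] > 4, 3, 1L);
        System.out.println("attempts = " + attempts); // prints "attempts = 5"
    }
}
```

A real loop would presumably also need an exit condition for shutdown or for losing the leader role; that is left out here to keep the sketch short.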

@codecov-commenter commented Apr 30, 2022

Codecov Report

Merging #4195 (bbf5c1a) into 5.0.0-beta-dledger-controller (bcce3d3) will increase coverage by 0.15%.
The diff coverage is 53.40%.

@@                         Coverage Diff                         @@
##             5.0.0-beta-dledger-controller    #4195      +/-   ##
===================================================================
+ Coverage                            43.13%   43.28%   +0.15%     
- Complexity                            6025     6138     +113     
===================================================================
  Files                                  795      818      +23     
  Lines                                56859    57559     +700     
  Branches                              7787     7852      +65     
===================================================================
+ Hits                                 24524    24917     +393     
- Misses                               29141    29404     +263     
- Partials                              3194     3238      +44     

| Impacted Files | Coverage Δ |
|---|---|
| ...rg/apache/rocketmq/common/constant/LoggerName.java | 0.00% <ø> (ø) |
| ...ache/rocketmq/common/namesrv/ControllerConfig.java | 0.00% <0.00%> (ø) |
| ...g/apache/rocketmq/common/protocol/RequestCode.java | 0.00% <ø> (ø) |
| .../apache/rocketmq/common/protocol/ResponseCode.java | 0.00% <ø> (ø) |
| ...srv/controller/AlterSyncStateSetRequestHeader.java | 0.00% <0.00%> (ø) |
| ...rv/controller/AlterSyncStateSetResponseHeader.java | 0.00% <0.00%> (ø) |
| ...r/namesrv/controller/ElectMasterRequestHeader.java | 0.00% <0.00%> (ø) |
| .../namesrv/controller/ElectMasterResponseHeader.java | 0.00% <0.00%> (ø) |
| .../namesrv/controller/GetMetaDataResponseHeader.java | 0.00% <0.00%> (ø) |
| ...amesrv/controller/GetReplicaInfoRequestHeader.java | 0.00% <0.00%> (ø) |
| ... and 35 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bcce3d3...bbf5c1a.

@coveralls

Coverage increased (+0.2%) to 47.365% when pulling bbf5c1a on hzh0425:feature/ha-controller into bcce3d3 on apache:5.0.0-beta-dledger-controller.
