Overview of cluster unavailability caused by JRaft protocol initialization failures in 1.4.x #5344

realJackSun · 2021-04-12T13:15:18Z

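(The parent issue of this one is #5343.)
One of the most common issues in Nacos 1.4.x is that JRaft protocol initialization frequently fails, leaving one or more machines in the cluster unable to work properly.
The error messages include:
"server is DOWN now, please try again later!"
"old raft protocol already stop "
"did not find the Leader node", and others.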

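The Nacos community's list of planned fixes includes:

Replace the "Server is Down" prompt with the specific reason the machine failed and pass it to the user; this will be done by adding an interface. In version 1.4.1, modify the "Server is Down" prompt to the specific reason for the machine failure and send it to the user #5350
Fix the problem where, after the leader has been elected and initialization has completed, the "Down" status is not changed to "UP". Nacos1.4.1 selected the leader successfully, but the peer status of "Down" is not changed to "UP" #5351
jRaft group members are inconsistent across nodes.
Other initialization failures in which no leader can be elected. (To be added; continuously updated.)

Issues related to this one include: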

CherishCai · 2021-06-13T03:11:01Z

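jRaft group members are inconsistent across nodes.

@KomachiSion
In the node information shown in the console, the local node's entry is clearly out of date. One cause, I believe, is that in allMembers() the Set.add(self) call does not overwrite the existing entry because the key is the same; what should actually happen is that the local node's entry in serverList is updated more promptly so that it stays consistent with self.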
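
The Set.add behaviour described above can be reproduced with plain Java collections. The sketch below is only an illustration: DemoMember, its raftState field, and the two merge helpers are hypothetical stand-ins rather than Nacos classes, and it assumes member equality is based on the address alone.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for a cluster member (not the Nacos Member class).
final class DemoMember {
    final String address;   // identity: the only field used by equals/hashCode (assumption)
    final String raftState; // stand-in for extended info that changes over time

    DemoMember(String address, String raftState) {
        this.address = address;
        this.raftState = raftState;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof DemoMember && address.equals(((DemoMember) o).address);
    }

    @Override
    public int hashCode() {
        return address.hashCode();
    }
}

final class StaleSelfDemo {

    // Set.add leaves the set unchanged when an equal element is already present,
    // so the stale copy of the local node survives.
    static Set<DemoMember> naiveMerge(Set<DemoMember> serverList, DemoMember self) {
        Set<DemoMember> all = new HashSet<>(serverList);
        all.add(self);
        return all;
    }

    // Removing the equal element first guarantees the current self object is stored.
    static Set<DemoMember> replacingMerge(Set<DemoMember> serverList, DemoMember self) {
        Set<DemoMember> all = new HashSet<>(serverList);
        all.remove(self);
        all.add(self);
        return all;
    }

    // Returns the state recorded for the given address, or null if it is absent.
    static String stateOf(Set<DemoMember> members, String address) {
        for (DemoMember m : members) {
            if (m.address.equals(address)) {
                return m.raftState;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<DemoMember> serverList = new HashSet<>();
        serverList.add(new DemoMember("10.0.0.1:8848", "DOWN")); // stale copy
        DemoMember self = new DemoMember("10.0.0.1:8848", "UP"); // current state

        System.out.println("naive merge:     " + stateOf(naiveMerge(serverList, self), "10.0.0.1:8848"));
        System.out.println("replacing merge: " + stateOf(replacingMerge(serverList, self), "10.0.0.1:8848"));
    }
}
```

Whether the real fix should replace the entry at read time, as here, or keep serverList itself up to date, as the comment suggests, is a design decision this sketch does not settle.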

CherishCai · 2021-09-29T09:12:48Z

jRaft group members are inconsistent across nodes.

This is how I currently handle it; it has been returning data correctly for a long time.

zrlw · 2022-01-09T11:08:35Z

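jRaft group members are inconsistent across nodes.

This is how I currently handle it; it has been returning data correctly for a long time!

serverList stores the self object when it is initialized. Apart from the case where the local IP changes and updating serverList removes the self entry for the old address, what other situation could make the self object stored in serverList differ from the current self object?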

CherishCai · 2022-01-17T08:28:13Z

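serverList stores the self object when it is initialized. Apart from the case where the local IP changes and updating serverList removes the self entry for the old address, what other situation could make the self object stored in serverList differ from the current self object?

Part of it comes from changes in Raft election state; look at the places where extendInfo gets updated and you will see.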

inkinworld · 2022-02-09T11:41:12Z

So 1.4.3 has not fully fixed these problems, is that right?

CherishCai · 2022-04-20T05:51:20Z

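In k8s cloud environments, where members go online/offline or restart and member IPs change frequently, I often run into the problems described in this issue;
Nacos is still used in AP mode, so CP should not become a roadblock for the availability of the cluster as a whole;
and since nacos-client keeps sending beats that fill the registration data back in, I added resetPeers, an entry point that resets the election for the whole cluster (a last-resort operation for when a leader really cannot be elected, which is still better than the whole cluster going down).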

    /**
     * resetPeers. ## nacos-enhance ##.
     * <p>Use only in a real emergency where availability matters more: https://www.bookstack.cn/read/sofa-jraft/3.md#6.3%20多数节点故障</p>
     */
    RESET_PEERS(JRaftConstants.RESET_PEERS) {
        @Override
        public RestResult<String> execute(CliService cliService, String groupId, Node node, Map<String, String> args) {
            final Configuration newConf = new Configuration();
            String peers = args.get(JRaftConstants.COMMAND_VALUE);
            for (String peer : peers.split(",")) {
                newConf.addPeer(PeerId.parsePeer(peer.trim()));
            }
            
            final PeerId nodePeerId = node.getNodeId().getPeerId();
            Status status = cliService.resetPeer(groupId, nodePeerId, newConf);
            if (status.isOk()) {
                return RestResultUtils.success();
            }
            return RestResultUtils.failed(status.getErrorMsg());
        }
    };
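
The snippet above calls peers.split(",") directly, so a request without the command value would fail with a NullPointerException before reaching JRaft. One possible pre-check is sketched below in plain Java. PeerListParser and parsePeers are hypothetical names that are not part of Nacos or JRaft, and the sketch assumes every entry is a plain `ip:port` pair, which is narrower than what PeerId.parsePeer accepts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Nacos or JRaft): validates the comma-separated
// peer list before it would be handed to the reset operation.
final class PeerListParser {

    // Returns the trimmed, de-duplicated "ip:port" entries in their original order.
    // Throws IllegalArgumentException for a missing list or a malformed entry.
    static List<String> parsePeers(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("peer list must not be empty");
        }
        List<String> peers = new ArrayList<>();
        for (String raw : value.split(",")) {
            String peer = raw.trim();
            if (peer.isEmpty()) {
                continue; // tolerate stray commas
            }
            int idx = peer.lastIndexOf(':');
            if (idx <= 0 || idx == peer.length() - 1) {
                throw new IllegalArgumentException("expected ip:port but got: " + peer);
            }
            int port;
            try {
                port = Integer.parseInt(peer.substring(idx + 1));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("invalid port in: " + peer, e);
            }
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("port out of range in: " + peer);
            }
            if (!peers.contains(peer)) {
                peers.add(peer);
            }
        }
        if (peers.isEmpty()) {
            throw new IllegalArgumentException("peer list must not be empty");
        }
        return peers;
    }

    public static void main(String[] args) {
        // Example addresses are made up for illustration.
        System.out.println(parsePeers(" 10.0.0.1:7848, 10.0.0.2:7848 ,10.0.0.1:7848"));
    }
}
```

Rejecting bad input up front matters here because resetting peers is a last-resort operation; a typo in the list should produce a clear error rather than a half-applied configuration.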

KomachiSion · 2022-08-10T01:48:03Z

From the community feedback, the problem has been greatly reduced

realJackSun mentioned this issue Apr 12, 2021

Overview of cluster unavailability caused by JRaft protocol initialization failures #5343

Closed

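realJackSun changed the title ~~Cluster unavailability caused by JRaft protocol initialization failures in 1.4.x~~ Overview of cluster unavailability caused by JRaft protocol initialization failures in 1.4.x Apr 12, 2021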

KomachiSion added this to the 1.4.2 milestone Apr 13, 2021

KomachiSion added kind/bug Category issues or prs related to bug. kind/enhancement Category issues or prs related to enhancement. area/Nacos Core labels Apr 13, 2021

KomachiSion modified the milestones: 1.4.2, 1.4.3 Apr 29, 2021

realJackSun modified the milestones: 1.4.3, 1.4.4 Jan 27, 2022

KomachiSion mentioned this issue Apr 12, 2022

Nacos lacks a graceful startup (online) mechanism #8125

Closed

CherishCai added a commit to CherishCai/nacos that referenced this issue Apr 20, 2022

feat(alibaba#5344): reset raft cluster ops for no leader, use JRaft A…

9768788

…pi resetPeers.

CherishCai mentioned this issue Apr 20, 2022

feat(#5344): reset raft cluster ops for no leader, use JRaft Api rese… #8221

Merged

KomachiSion pushed a commit that referenced this issue Apr 22, 2022

feat(#5344): reset raft cluster ops for no leader, use JRaft Api rese…

9d5a8da

…tPeers. (#8221)

KomachiSion closed this as completed Aug 10, 2022

KomachiSion mentioned this issue Apr 17, 2024

Error on project startup: ErrMsg:server is DOWNnow, detailed error message: Optional[Distro protocol is not initialized] #6794

Closed
