Overview of JRaft protocol initialization failures that leave the cluster unavailable in 1.4.x #5344

Closed
2 of 4 tasks
realJackSun opened this issue Apr 12, 2021 · 7 comments
Labels
area/Nacos Core kind/bug Category issues or prs related to bug. kind/enhancement Category issues or prs related to enhancement.
Milestone

Comments

@realJackSun
Collaborator

realJackSun commented Apr 12, 2021

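(The parent issue of this one is #5343.)
One of the most common issues in Nacos 1.4.x is that JRaft protocol initialization frequently fails, leaving one or more machines in the cluster unable to work properly.
The error messages include:
"server is DOWN now, please try again later!"
"old raft protocol already stop "
"did not find the Leader node", among others.

The Nacos community's list of planned fixes includes:

Issues related to this problem include: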

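@realJackSun realJackSun changed the title from "JRaft protocol initialization failures that leave the cluster unavailable in 1.4.x" to "Overview of JRaft protocol initialization failures that leave the cluster unavailable in 1.4.x" Apr 12, 2021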
@KomachiSion KomachiSion added this to the 1.4.2 milestone Apr 13, 2021
@KomachiSion KomachiSion added kind/bug Category issues or prs related to bug. kind/enhancement Category issues or prs related to enhancement. area/Nacos Core labels Apr 13, 2021
@KomachiSion KomachiSion modified the milestones: 1.4.2, 1.4.3 Apr 29, 2021
@CherishCai
Contributor

CherishCai commented Jun 13, 2021

The jRaft group members are inconsistent across nodes.

@KomachiSion
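In the console's node list, the local node's own information is clearly stale. One suspected cause is that when allMembers() calls Set.add(self), the key is identical, so the new content does not overwrite the existing entry. What should actually happen, though, is that the node's own entry in serverList is updated more promptly so that it stays consistent with self.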
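
The Set.add(self) behavior described here can be reproduced with the JDK alone. The following is a minimal sketch, not Nacos code: `SelfMemberDemo`, `Member`, and the `info` field are hypothetical stand-ins, and it assumes that member equality is based on the address only. It shows that `HashSet.add` leaves an equal element in place, and that removing the old entry first is one way to make the fresh `self` win.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class SelfMemberDemo {

    /** Hypothetical stand-in for a cluster member; equality uses the address only (assumption). */
    public static final class Member {
        public final String address;
        public final String info; // stands in for mutable metadata such as extendInfo

        public Member(String address, String info) {
            this.address = address;
            this.info = info;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Member && address.equals(((Member) o).address);
        }

        @Override
        public int hashCode() {
            return Objects.hash(address);
        }
    }

    /** Copies serverList and adds self; add() is a no-op if an equal element is already present. */
    public static Set<Member> allMembersNaive(Set<Member> serverList, Member self) {
        Set<Member> result = new HashSet<>(serverList);
        result.add(self);
        return result;
    }

    /** Removes the stale entry first so that the current self object ends up in the result. */
    public static Set<Member> allMembersRefreshed(Set<Member> serverList, Member self) {
        Set<Member> result = new HashSet<>(serverList);
        result.remove(self);
        result.add(self);
        return result;
    }

    /** Returns the info of the member with the given address, or null if it is absent. */
    public static String infoOf(Set<Member> members, String address) {
        for (Member m : members) {
            if (m.address.equals(address)) {
                return m.info;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<Member> serverList = new HashSet<>();
        serverList.add(new Member("10.0.0.1:8848", "stale"));
        Member self = new Member("10.0.0.1:8848", "fresh");

        System.out.println(infoOf(allMembersNaive(serverList, self), "10.0.0.1:8848"));     // prints "stale"
        System.out.println(infoOf(allMembersRefreshed(serverList, self), "10.0.0.1:8848")); // prints "fresh"
    }
}
```

Keeping the entry in serverList itself up to date, as suggested above, avoids the problem at the source; the remove-then-add variant only patches the returned copy.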

image

image

image

@CherishCai
Contributor

CherishCai commented Sep 29, 2021

The jRaft group members are inconsistent across nodes.

This is how I handle it at the moment; it has been returning correct data for a long time.
image

@zrlw
Contributor

zrlw commented Jan 9, 2022

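The jRaft group members are inconsistent across nodes.

This is how I handle it at the moment; it has been returning correct data for a long time.

serverList stores the self object when it is initialized. Apart from the case where the local IP changes and the serverList update removes the self entry for the old address, what other situation could make the self object stored in serverList differ from the current self object?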

@CherishCai
Contributor

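serverList stores the self object when it is initialized. Apart from the case where the local IP changes and the serverList update removes the self entry for the old address, what other situation could make the self object stored in serverList differ from the current self object?

Part of it comes from changes in Raft election state; look at where extendInfo gets updated and you will see.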

@realJackSun realJackSun modified the milestones: 1.4.3, 1.4.4 Jan 27, 2022
@inkinworld

So 1.4.3 did not fully fix these problems, is that right?

@CherishCai
Contributor

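In a k8s cloud environment, where members go online and offline or restart and their IPs change frequently, we often run into the problems described in this issue.
Nacos is still used in AP mode here, so CP should not become a roadblock to the availability of the cluster as a whole.
In addition, nacos-client keeps sending beats that refill the registration data, so I added an entry point, resetPeers, that resets the election for the whole cluster (a last-resort operation for when a leader truly cannot be elected, which is still better than the whole cluster going down).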

image

    /**
     * resetPeers. ## nacos-enhance ##.
     * <p>Use only in a real emergency where availability matters most: https://www.bookstack.cn/read/sofa-jraft/3.md#6.3%20多数节点故障</p>
     */
    RESET_PEERS(JRaftConstants.RESET_PEERS) {
        @Override
        public RestResult<String> execute(CliService cliService, String groupId, Node node, Map<String, String> args) {
            // Comma-separated list of the peers that should form the new configuration.
            String peers = args.get(JRaftConstants.COMMAND_VALUE);
            if (peers == null || peers.trim().isEmpty()) {
                return RestResultUtils.failed("peers must not be empty");
            }
            final Configuration newConf = new Configuration();
            for (String peer : peers.split(",")) {
                newConf.addPeer(PeerId.parsePeer(peer.trim()));
            }
            
            // Force the new configuration onto this node only; no majority agreement is involved.
            final PeerId nodePeerId = node.getNodeId().getPeerId();
            Status status = cliService.resetPeer(groupId, nodePeerId, newConf);
            if (status.isOk()) {
                return RestResultUtils.success();
            }
            return RestResultUtils.failed(status.getErrorMsg());
        }
    };
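
Because resetPeers forces a configuration onto a node, a mistyped peer list is costly. Below is a minimal sketch of validating the comma-separated value before it reaches the handler, using only the JDK. `PeerListParser` is a hypothetical helper that is not part of Nacos or JRaft, and it assumes plain `host:port` entries (JRaft peer ids may carry additional fields, which this sketch does not handle).

```java
import java.util.ArrayList;
import java.util.List;

public class PeerListParser {

    /**
     * Splits a comma-separated "host:port" list, trims each entry, drops duplicates,
     * and rejects malformed entries with an IllegalArgumentException.
     */
    public static List<String> parse(String peers) {
        if (peers == null || peers.trim().isEmpty()) {
            throw new IllegalArgumentException("peer list is empty");
        }
        List<String> result = new ArrayList<>();
        for (String raw : peers.split(",")) {
            String peer = raw.trim();
            int sep = peer.lastIndexOf(':');
            // Both a host part and a port part are required.
            if (sep <= 0 || sep == peer.length() - 1) {
                throw new IllegalArgumentException("expected host:port but got '" + peer + "'");
            }
            int port;
            try {
                port = Integer.parseInt(peer.substring(sep + 1));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("port is not a number in '" + peer + "'", e);
            }
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("port out of range in '" + peer + "'");
            }
            if (!result.contains(peer)) {
                result.add(peer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example addresses are made up for illustration.
        System.out.println(parse("10.0.0.1:7848, 10.0.0.2:7848,10.0.0.3:7848"));
    }
}
```

Failing fast on a bad entry, rather than skipping it, is a deliberate choice in this sketch: silently dropping a peer would reset the group to a smaller configuration than the operator intended.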

@KomachiSion
Collaborator

According to community feedback, this problem now occurs far less often.

5 participants