Overview of cluster unavailability caused by JRaft protocol initialization failures in 1.4.x #5344
Comments
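@KomachiSion
serverList stores the self object when it is initialized. Apart from the case where changing the local IP updates serverList and deletes the self entry for the old address, what other situations can make the self object stored in serverList differ from the current self object?
Part of it comes from changes in Raft election state. Look at where extendInfo gets updated and you will see.
So 1.4.3 has not fully fixed these problems, is that right?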
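In k8s cloud environments, where member IPs change frequently as nodes go online/offline or restart, we often run into the problems described in this issue:

```java
/**
 * resetPeers. ## nacos-enhance ##.
 * <p>Use only in a real emergency where availability matters more:
 * https://www.bookstack.cn/read/sofa-jraft/3.md#6.3%20多数节点故障</p>
 */
RESET_PEERS(JRaftConstants.RESET_PEERS) {
    @Override
    public RestResult<String> execute(CliService cliService, String groupId, Node node, Map<String, String> args) {
        final Configuration newConf = new Configuration();
        String peers = args.get(JRaftConstants.COMMAND_VALUE);
        // Guard against a missing value: calling split() on null would throw.
        if (peers == null || peers.trim().isEmpty()) {
            return RestResultUtils.failed("peers must not be empty");
        }
        // Build the new configuration from the comma-separated peer list.
        for (String peer : peers.split(",")) {
            newConf.addPeer(PeerId.parsePeer(peer.trim()));
        }
        // Force the new configuration onto this node only.
        final PeerId nodePeerId = node.getNodeId().getPeerId();
        Status status = cliService.resetPeer(groupId, nodePeerId, newConf);
        if (status.isOk()) {
            return RestResultUtils.success();
        }
        return RestResultUtils.failed(status.getErrorMsg());
    }
};
```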
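The command above hands each comma-separated entry directly to `PeerId.parsePeer`. Below is a minimal sketch of how such a peer list might be validated before it is submitted. It assumes every entry is in plain `host:port` form, and the class `PeerListCheck`, the method `parsePeers`, and the sample addresses are all hypothetical, not part of Nacos or JRaft.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Nacos or JRaft. */
public class PeerListCheck {

    /**
     * Splits a comma-separated peer list, trims each entry, and checks that
     * it looks like host:port. Duplicates are dropped. Throws
     * IllegalArgumentException on the first malformed entry.
     */
    public static List<String> parsePeers(String peers) {
        if (peers == null || peers.trim().isEmpty()) {
            throw new IllegalArgumentException("peer list is empty");
        }
        List<String> result = new ArrayList<>();
        for (String raw : peers.split(",")) {
            String peer = raw.trim();
            int idx = peer.lastIndexOf(':');
            // Require a non-empty host before the colon and a port after it.
            if (idx <= 0 || idx == peer.length() - 1) {
                throw new IllegalArgumentException("expected host:port, got: " + peer);
            }
            int port;
            try {
                port = Integer.parseInt(peer.substring(idx + 1));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("invalid port in: " + peer, e);
            }
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("port out of range in: " + peer);
            }
            if (!result.contains(peer)) {
                result.add(peer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample addresses are made up for the example.
        System.out.println(parsePeers("10.0.0.1:7848, 10.0.0.2:7848,10.0.0.3:7848"));
    }
}
```

Rejecting a malformed list up front matters here because resetting peers overrides the group's configuration on the target node, so a typo in one address is better caught before the call than after.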
Judging from community feedback, the problem has been greatly reduced.
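(The parent issue of this issue is #5343.)
One of the most common issues in Nacos 1.4.x is that JRaft protocol initialization frequently fails, leaving one or more machines in the cluster unable to work properly.
Error messages include:
"server is DOWN now, please try again later!"
"old raft protocol already stop "
"did not find the Leader node", and so on.
The Nacos community's fix plan includes:
Issues related to this problem include: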
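Nacos 1.4 cluster service registration: server is DOWN now, please try again later! #4888
1.4.0Version Server List 503 #4573
After upgrading a Nacos cluster (containers) from 1.4.0 to 1.4.1, nodes start normally but election fails; rolling back to 1.4.0 resolves it #4730
1.4.1 cluster fails to start #4995
Nacos cluster deployed on swarm ends up with multiple leaders, and some nodes stop responding or go down #4974
Nacos 1.4.1 election anomaly #5300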
https://app.yinxiang.com/fx/60dcf61e-cf73-4343-875e-f4ff0885e7e5
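Nacos 1.4.1 cluster election fails #5339
k8s Nacos 1.4.1 cluster suddenly fails #5311
When the k8s node hosting one pod of a Nacos cluster crashes unexpectedly and then recovers, an extra abnormal node appears in the Nacos cluster #5302
Nacos 1.4.1 k8s cluster deployment problem #5290
When will the hostname bug of Nacos on k8s be fixed? It has persisted across many versions. #4962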