
Nacos 1.4.1 leader election anomaly #5300

Closed
1019822077 opened this issue Apr 8, 2021 · 11 comments
Labels
status/invalid This doesn't seem right

Comments

@1019822077

1019822077 commented Apr 8, 2021

First, install 1.4.1:

export NACOS_VERSION=1.4.1
git clone --depth 1 https://github.com/nacos-group/nacos-docker.git
cd nacos-docker
docker-compose -f example/cluster-hostname.yaml up

After startup the cluster leader is nacos3. Run docker stop nacos3; the remaining two nodes hold an election and nacos2 becomes the new leader.
Now stop nacos2 as well. Looking at nacos1, the leader is still shown as nacos2; nacos1 does not become leader automatically.

Notes on the whole split-brain process:
https://app.yinxiang.com/fx/60dcf61e-cf73-4343-875e-f4ff0885e7e5
https://www.yinxiang.com/everhub/note/60dcf61e-cf73-4343-875e-f4ff0885e7e5

I tested the same procedure on Nacos 2.0.0 and there was no problem: when only 2 nodes are left, a re-election happens and there is no split brain, and with all 3 nodes up there is no split brain either. It appears the 1.4.1 election bug has already been fixed in Nacos 2.0.0.

@realJackSun
Collaborator

> Run docker stop nacos3

At this point one of the three nodes is down, and nacos2 is elected leader.

> Now stop nacos2 as well

At this point the Raft protocol can no longer work, because fewer than a majority of the machines are still alive.

> Looking at nacos1, the leader is still shown as nacos2

That is because, even though the cluster has lost quorum, SOFAJRaft still caches the last known cluster information.
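The quorum rule in the reply above can be checked with a quick sketch (a minimal illustration of the Raft majority arithmetic, not Nacos code):

```shell
# Raft can only elect a leader while a strict majority of nodes is alive.
# Smallest majority of an N-node cluster:
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # -> 2: with only nacos3 stopped, the 2 survivors still form a majority
# After nacos2 is also stopped, 1 survivor < 2, so no election can succeed.
```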

@1019822077
Author

1019822077 commented Apr 8, 2021

https://app.yinxiang.com/fx/60dcf61e-cf73-4343-875e-f4ff0885e7e5

> Run docker stop nacos3
>
> At this point one of the three nodes is down, and nacos2 is elected leader.
>
> Now stop nacos2 as well
>
> At this point the Raft protocol can no longer work, because fewer than a majority of the machines are still alive.
>
> Looking at nacos1, the leader is still shown as nacos2
>
> That is because, even though the cluster has lost quorum, SOFAJRaft still caches the last known cluster information.

But after I restarted all the stopped nodes, the cluster was still in a split-brain state. Detailed notes on the whole process: https://app.yinxiang.com/fx/60dcf61e-cf73-4343-875e-f4ff0885e7e5
For example, when deploying to Kubernetes, pods are upgraded in a rolling fashion whenever the configuration changes; if a restart takes longer than the election timeout, the cluster can end up in this split-brain state.

With the same steps, there is no split brain on Nacos 2.0.0.

@KomachiSion
Collaborator

KomachiSion commented Apr 8, 2021

Do you have logs? Please take a look at alipay-jraft.log and protocol-raft.log.
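For anyone collecting these files from the compose setup above, a hypothetical helper like the following could build the dump commands. The container name and the /home/nacos/logs path are assumptions about the layout of the official nacos-docker image, not something stated in this issue:

```shell
# Hypothetical helper: build the command that dumps a Nacos log file from a
# container started by example/cluster-hostname.yaml.
# Assumption: logs live under /home/nacos/logs inside the container.
nacos_log_cmd() {
  local container=$1 logfile=$2
  echo "docker exec ${container} cat /home/nacos/logs/${logfile}"
}

# e.g. the two files requested above, from node nacos1:
nacos_log_cmd nacos1 alipay-jraft.log
nacos_log_cmd nacos1 protocol-raft.log
```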

@1019822077
Author

1019822077 commented Apr 9, 2021

> Do you have logs? Please take a look at alipay-jraft.log and protocol-raft.log.

I reproduced the split-brain problem again; it is reproducible every time, so I rewrote the note with all the logs you need. See: https://app.yinxiang.com/fx/60dcf61e-cf73-4343-875e-f4ff0885e7e5

@1019822077
Author

1019822077 commented Apr 9, 2021

[root@nacos1 nacos]# cat logs/protocol-raft.log
2021-04-08 09:19:08,872 INFO Initializes the Raft protocol, raft-config info : {}
2021-04-08 09:19:11,636 INFO ========= The raft protocol is starting... =========
2021-04-08 09:20:40,388 INFO Initializes the Raft protocol, raft-config info : {}
2021-04-08 09:20:43,305 INFO ========= The raft protocol is starting... =========
2021-04-08 09:20:55,342 INFO ========= The raft protocol start finished... =========
2021-04-08 09:20:55,393 INFO create raft group : naming_persistent_service
2021-04-08 09:21:18,820 ERROR Fail to refresh raft metadata info for group : naming_persistent_service, error is : {}
java.util.concurrent.TimeoutException: null
at com.alipay.sofa.jraft.rpc.impl.FutureImpl.get(FutureImpl.java:214)
at com.alipay.sofa.jraft.RouteTable.refreshLeader(RouteTable.java:255)
at com.alibaba.nacos.core.distributed.raft.JRaftServer.refreshRouteTable(JRaftServer.java:501)
at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$1(JRaftServer.java:277)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-04-08 09:21:19,837 ERROR Fail to refresh leader for group : naming_persistent_service, status is : Status[UNKNOWN<-1>: Fail to init channel to nacos3:7848, Unknown leader, Unknown leader]
2021-04-08 09:21:20,049 ERROR Fail to refresh route configuration for group : naming_persistent_service, status is : Status[UNKNOWN<-1>: Fail to get leader of group naming_persistent_service]
2021-04-08 09:21:21,146 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='nacos1:7848', term=1, raftClusterInfo=[nacos3:7848, nacos1:7848, nacos2:7848]}
2021-04-08 09:21:42,511 ERROR Fail to refresh route configuration for group : naming_persistent_service, status is : Status[UNKNOWN<-1>: handleRequest internal error]
2021-04-08 09:21:46,354 ERROR Fail to refresh route configuration for group : naming_persistent_service, status is : Status[UNKNOWN<-1>: handleRequest internal error]
2021-04-08 09:21:48,842 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='nacos1:7848', term=2, raftClusterInfo=[nacos3:7848, nacos1:7848, nacos2:7848]}
2021-04-08 09:53:28,898 INFO Initializes the Raft protocol, raft-config info : {}
2021-04-08 09:53:29,103 INFO ========= The raft protocol is starting... =========
2021-04-08 09:53:29,454 INFO ========= The raft protocol start finished... =========
2021-04-08 09:53:29,458 INFO create raft group : naming_persistent_service
2021-04-08 09:53:30,001 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='null', term=null, raftClusterInfo=[nacos3:7848, nacos1:7848, nacos2:7848]}
2021-04-08 09:53:32,205 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='nacos2:7848', term=4, raftClusterInfo=[nacos1:7848]}
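The last line above is the split-brain signature: raftClusterInfo shrinks from three peers to [nacos1:7848]. One way to spot such events in a log is to count the peers inside the brackets; a small self-contained sketch, run here against two lines copied from the excerpt above:

```shell
# Flag RaftEvent lines whose raftClusterInfo lists fewer than the expected
# 3 peers. Sample lines are copied from the protocol-raft.log excerpt above.
cat <<'EOF' > /tmp/protocol-raft-sample.log
2021-04-08 09:21:48,842 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='nacos1:7848', term=2, raftClusterInfo=[nacos3:7848, nacos1:7848, nacos2:7848]}
2021-04-08 09:53:32,205 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='nacos2:7848', term=4, raftClusterInfo=[nacos1:7848]}
EOF

awk -F'raftClusterInfo=\\[' '/RaftEvent/ {
  n = split($2, peers, ",")          # count the peers inside [...]
  if (n < 3) print "shrunk cluster view: " $0
}' /tmp/protocol-raft-sample.log
```

On this sample, only the term=4 line (one peer instead of three) is flagged.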

@KomachiSion
Collaborator

raftClusterInfo has been modified into an abnormal state. This problem shows up in Kubernetes environments; we are still locating the root cause.

@1019822077
Author

> raftClusterInfo has been modified into an abnormal state. This problem shows up in Kubernetes environments; we are still locating the root cause.

This problem appears not only in Kubernetes but also under docker-compose; the notes above are the docker-compose steps that reproduce the split brain every time: https://app.yinxiang.com/fx/60dcf61e-cf73-4343-875e-f4ff0885e7e5

@realJackSun
Collaborator

@1019822077 Hello, your Yinxiang (Evernote) note with the reproduction record no longer seems to open. Could you restore it?

@1019822077
Author

1019822077 commented Apr 13, 2021

> @1019822077 Hello, your Yinxiang (Evernote) note with the reproduction record no longer seems to open. Could you restore it?

Try this link: https://app.yinxiang.com/fx/60dcf61e-cf73-4343-875e-f4ff0885e7e5
If that one is not accessible, see: https://www.yinxiang.com/everhub/note/60dcf61e-cf73-4343-875e-f4ff0885e7e5

@KomachiSion KomachiSion added the follow up this problem requires continuous follow-up label Apr 14, 2021
@KomachiSion
Collaborator

Can you retry with 1.4.2?

@KomachiSion
Collaborator

There has been no response from the author for a long time, so the community assumes that 1.4.2 has solved this problem. If a newer version shows the same problem, please submit a new issue to let us know.

@KomachiSion KomachiSion added status/invalid This doesn't seem right and removed follow up this problem requires continuous follow-up status/need feedback labels May 14, 2021