Nacos 2.0.3 ERROR_TYPE_STATE_MACHINE #6877

Closed
warmonipa opened this issue Sep 14, 2021 · 14 comments
Labels
kind/question Category issues related to questions or problems status/need feedback

Comments

@warmonipa

warmonipa commented Sep 14, 2021

Describe the bug
A Nacos 2.0.3 cluster running on k8s hit an unrecoverable error. Error log:

"naming_persistent_service": {
    "errMsg": "Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: StateMachine meet critical error when applying one or more tasks since index=1365161, 
    Status[ESTATEMACHINE<10002>: StateMachine meet critical error: java.lang.IllegalArgumentException: No enum constant com.alibaba.nacos.naming.consistency.persistent.impl.
    BasePersistentServiceProcessor.Op.\n\tat java.lang.Enum.valueOf(Enum.java:238)\n\tat com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor$Op
    .valueOf(BasePersistentServiceProcessor.java:63)\n\tat com.alibaba.nacos.naming.consistency.persistent.impl.BasePersistentServiceProcessor.onApply(BasePersistentServiceProcessor.java:170)
    \n\tat com.alibaba.nacos.core.distributed.raft.NacosStateMachine.onApply(NacosStateMachine.java:115)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:539)
    \n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:508)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:440)
    \n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
    \n\tat com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)\n\tat com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    \n\tat java.lang.Thread.run(Thread.java:748)\n.]]]",
    "leader": "nacos-0.nacos-headless.nacos.svc.cluster.local:7848",

Expected behavior
No error.

Actual behavior
The JRaft state machine encountered an unrecoverable error.

Desktop (please complete the following information):

  • OS: [K8S 1.19.6]
  • Version: nacos-server 2.0.3
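
A note on reading the stack trace above: the message ends in `BasePersistentServiceProcessor.Op.` with nothing after the final dot. `Enum.valueOf` builds its message from the enum's canonical name, a dot, and the requested name, so an empty suffix suggests the operation string passed to `Op.valueOf` was empty. The sketch below reproduces that message shape with a hypothetical stand-in enum; the class name, the `parseFailure` helper, and the constants are assumptions for illustration, not Nacos code.

```java
public class EnumValueOfDemo {
    // Hypothetical stand-in for the Op enum named in the stack trace.
    enum Op { Write, Read, Delete }

    // Returns the exception message produced by Enum.valueOf,
    // or null when the name maps to a constant.
    static String parseFailure(String name) {
        try {
            Op.valueOf(name);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A valid name parses without an error.
        System.out.println(parseFailure("Write"));
        // An empty name fails, and the message ends right after the dot.
        System.out.println(parseFailure(""));
    }
}
```

If this reading is right, the log entry at the failing index carried an empty operation field, which is why replaying it fails every time.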
@warmonipa warmonipa changed the title Nacos 2.0.3 Nacos 2.0.3 ERROR_TYPE_STATE_MACHINE Sep 14, 2021
@realJackSun
Collaborator

Delete ~/nacos/data/protocol, then restart.

@realJackSun realJackSun added status/need feedback kind/question Category issues related to questions or problems labels Sep 14, 2021
@wangdongyun

wangdongyun commented Sep 15, 2021

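Delete ~/nacos/data/protocol, then restart.

As it stands, once a log entry fails while the state machine is applying it, the node goes down. Even if that entry is retried, it may never replay successfully, so the node stays unavailable.

More broadly:
if a bad committed log entry appears in the cluster and no node's state machine can apply it, the whole cluster goes down!

As far as I can tell, there is currently no way to recover!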

@warmonipa
Author

warmonipa commented Sep 15, 2021

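Delete ~/nacos/data/protocol, then restart.

The NFS volume was rebuilt yesterday, and the cluster went down again today with the same error. We have downgraded to 1.4.2, but 1.4.2 still has the "stop" bug. I hope the maintainers can release 1.4.3 soon; we will upgrade again once 2.0.3 has a fix for the problem above.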

@warmonipa
Author

Don't let this sink, bump.

@horizonzy
Collaborator

I'll take a look at this issue.

@warmonipa
Author

bump

@suanyi001

Any update? We have run into the same problem.

@horizonzy
Collaborator

Any update? We have run into the same problem.

Can you reproduce it reliably?

@suanyi001

suanyi001 commented Nov 15, 2021

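Any update? We have run into the same problem.

Can you reproduce it reliably?

We have hit it several times. I have written up the detailed steps in this issue; could you please take a look?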

@bizhenchao1201

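Delete ~/nacos/data/protocol, then restart.

The NFS volume was rebuilt yesterday, and the cluster went down again today with the same error. We have downgraded to 1.4.2, but 1.4.2 still has the "stop" bug. I hope the maintainers can release 1.4.3 soon; we will upgrade again once 2.0.3 has a fix for the problem above.

With the data stored on NFS, could network jitter cause writes to fail? Shouldn't the Nacos cluster's consistency data be kept on local disk?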

@zrlw
Contributor

zrlw commented Jan 9, 2022

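the "stop" bug

What does "1.4.2 still has the stop bug" refer to? Is there a link to a related issue?
If you mean I/O reactor status: STOPPED, our approach was to backport patch #6441 from the 2.0 branch to the 1.4.2 branch. That patch catches not only IOException (ConnectionClosedException is an IOException) but also every RuntimeException, which may be more robust than the httpcore version upgrade described in #7299.

Update: #6441 alone is not enough. You also need to change the pom file the way 2.0 does: remove the version definitions for httpcore and httpclient and use the versions that Spring Boot defines. Otherwise it will still end up STOPPED because of version incompatibility.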

@warmonipa
Author

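the "stop" bug

What does "1.4.2 still has the stop bug" refer to? Is there a link to a related issue? If you mean I/O reactor status: STOPPED, our approach was to backport patch #6441 from the 2.0 branch to the 1.4.2 branch. That patch catches not only IOException (ConnectionClosedException is an IOException) but also every RuntimeException, which may be more robust than the httpcore version upgrade described in #7299.

Right, it is the I/O reactor status: STOPPED problem.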

@zrlw
Contributor

zrlw commented Feb 13, 2022

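An update: with a very large volume of heartbeats, version 1 shows the same problem. Only the leader's console page shows registered Dubbo ephemeral service instances (and the count is unstable); every other node is in the ERROR_TYPE_STATE_MACHINE state, and its console page shows no registered service instances at all.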

@zrlw
Contributor

zrlw commented Feb 13, 2022

We solved this problem by modifying the code to add exception handling. #7757
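
For readers who want to see the general idea of "adding exception handling" around the apply step, here is a minimal sketch. It is not the patch in #7757, whose contents are not shown in this thread; the class `GuardedApplyDemo`, the `applyAll` method, and the enum constants are hypothetical. It assumes each log entry carries an operation name, and that a failure on one entry should be reported for that entry instead of propagating out of the apply loop.

```java
import java.util.ArrayList;
import java.util.List;

public class GuardedApplyDemo {
    // Hypothetical stand-in for the operation enum in the stack trace.
    enum Op { Write, Delete }

    // Applies each operation name in order. A bad entry is recorded as a
    // failure and the loop continues, so one unparsable entry does not
    // abort the entries that follow it.
    static List<String> applyAll(List<String> operations) {
        List<String> results = new ArrayList<>();
        for (String operation : operations) {
            try {
                Op op = Op.valueOf(operation);
                results.add("applied " + op);
            } catch (RuntimeException e) {
                // Report the failure for this entry only.
                results.add("failed: " + e.getMessage());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        log.add("Write");
        log.add("");       // the kind of entry that triggers the error in this issue
        log.add("Delete");
        for (String line : applyAll(log)) {
            System.out.println(line);
        }
    }
}
```

Whether it is safe to skip or fail a single entry depends on what that entry was supposed to change, so treat this only as an illustration of the control flow.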
