Auto-panic when a newly added ordinary node syncs blocks ## cita process terminates on its own #534

Closed
mfuuzy opened this issue Apr 30, 2019 · 18 comments
Assignees
Labels
bug Something isn't working

Comments

@mfuuzy

mfuuzy commented Apr 30, 2019

Description

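I added an ordinary node to test block synchronization performance. After the new node started, blocks synced continuously, but the logs show the process crashed at 01:37 the next day.
The error log follows: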
2019-04-30T01:37:18.546364357+08:00 - INFO - public block txs height 1618892 with 0 transactions
2019-04-30T01:37:23.111085504+08:00 - INFO - get block tx hashes for height 1618893
2019-04-30T01:37:23.111146491+08:00 - INFO - public block txs height 1618893 with 0 transactions
2019-04-30T01:37:24.189629995+08:00 - ERROR - Error in reading loop: Protocol("Connection reset by peer (os error 104)")
2019-04-30T01:37:24.191299405+08:00 - ERROR - Error dispatching closing packet to a channel
2019-04-30T01:37:24.191327491+08:00 - ERROR - Error consuming Protocol("Connection reset by peer (os error 104)")
2019-04-30T01:37:24.191611845+08:00 - ERROR - Error in reading loop: Protocol("Connection reset by peer (os error 104)")
2019-04-30T01:37:24.191633697+08:00 - ERROR - Error dispatching closing packet to a channel
2019-04-30T01:37:24.750045126+08:00 - INFO - CITA:auth
2019-04-30T01:37:24.750086919+08:00 - INFO - Version: v0.22.0-151-g53de566-dev
2019-04-30T01:37:25.086629431+08:00 - ERROR -

position:
Thread main panicked at failed to open url amqp://guest:guest@localhost/test-chain/0 : IoError(AddrNotAvailable), /opt/.cargo/git/checkouts/cita-common-1aad419f3e80ba17/3ec4e19/pubsub_rabbitmq/src/lib.rs:60
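
The panic above is raised when the AMQP client cannot open `amqp://guest:guest@localhost/test-chain/0`. One quick way to see whether the broker is reachable at all is a plain TCP probe of the host and port named in that URL. The following is only a sketch and not part of CITA: the helper names `parse_amqp_url` and `broker_reachable` are made up for illustration, and port 5672 is assumed because it is the AMQP default when the URL gives no port.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_AMQP_PORT = 5672  # assumption: AMQP default, used when the URL has no port


def parse_amqp_url(url):
    """Split an amqp:// URL into (host, port, vhost)."""
    parts = urlsplit(url)
    if parts.scheme != "amqp":
        raise ValueError(f"not an amqp URL: {url!r}")
    host = parts.hostname or "localhost"
    port = parts.port or DEFAULT_AMQP_PORT
    vhost = parts.path[1:]  # drop the leading "/"
    return host, port, vhost


def broker_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host, port, vhost = parse_amqp_url("amqp://guest:guest@localhost/test-chain/0")
    state = "reachable" if broker_reachable(host, port) else "NOT reachable"
    print(f"{host}:{port} (vhost {vhost!r}) is {state}")
```

A successful probe only shows that something is listening on the port; it does not prove that the vhost or the credentials are valid.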

Versions

os:Ubuntu 16.04.3 LTS
cita: v0.23.0 (the log shows v0.22.0, but the version actually running is v0.23.0)
docker : cita/cita-run:ubuntu-18.04-20190419

Additional Information

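We initially ran cita v0.22.0. When disk space ran out, the node directories were migrated, and the chain was then upgraded to v0.23.0. I added one node to test block sync performance, and it ended with the error above.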

@mfuuzy mfuuzy added the bug Something isn't working label Apr 30, 2019
@kaikai1024
Contributor

Please check the cita-op-helper and attach the file.
Thanks!

@mfuuzy
Author

mfuuzy commented Apr 30, 2019

cita_info.tar.gz

@kaikai1024
Contributor

Did the migration report any errors?

Did you run setup?

@mfuuzy
Author

mfuuzy commented Apr 30, 2019

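I did not do the migration myself, but as far as I can see everything ran successfully afterwards. The migration was done last year and the chain has been running ever since.
I did run setup for the newly added node.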

@leeyr338
Contributor

leeyr338 commented Apr 30, 2019 via email

@jiangxianliang007
Contributor

rabbitmq crashed while it was running. It had been running all along and went down at 01:37.

@jiangxianliang007
Contributor

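The migration and the upgrade can be ignored; both were done earlier. This time we only added one node and let it sync blocks, and then found that all nodes had crashed in the early morning because rabbitmq went down. The cause is unknown.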

@kaikai1024
Contributor

Current node ip is 127.0.0.1

id_card = 4
port = 4004
[[peers]]
id_card = 0
ip = "127.0.0.1"
port = 4000

[[peers]]
id_card = 1
ip = "127.0.0.1"
port = 4001

[[peers]]
id_card = 2
ip = "127.0.0.1"
port = 4002

[[peers]]
id_card = 3
ip = "127.0.0.1"
port = 4003

[[peers]]
id_card = 4
ip = "127.0.0.1"
port = 4004

[[peers]]
id_card = 5
ip = "127.0.0.1"
port = 4005

Why are there six peer entries?

And here is the network log of node 4:

[P2pProtocol] Dialed Error in V4(127.0.0.1:4005) : IoError(Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }).

@leeyr338 Do you have any ideas about these?

@kaikai1024 kaikai1024 changed the title from "cita process terminates on its own" to "Auto-panic when a newly added ordinary node syncs blocks ## cita process terminates on its own" Apr 30, 2019
@mfuuzy
Author

mfuuzy commented Apr 30, 2019

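I first created two nodes following the 0.22.0 documentation, but setup failed, so I removed both of them with rm test-chain/4 test-chain/5.
I deleted the IP and port entries for nodes 4 and 5 from the network.toml of nodes 0, 1, 2 and 3.
Then I started one node again following the 0.23.0 documentation. The network.toml of the restarted node 4 still contained the entries for nodes 4 and 5 from the first attempt, which is why the log still shows them.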

@leeyr338
Contributor

leeyr338 commented May 5, 2019

@leeyr338 Do you have any ideas about these?

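This is the node dialing a peer that has not been started or does not exist, which is normal behavior. In theory this error should not cause network to terminate on its own.
I will track and handle this issue.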

@leeyr338 leeyr338 assigned leeyr338 and unassigned kaikai1024 May 5, 2019
@driftluo
Contributor

driftluo commented May 5, 2019

@leeyr338
Contributor

leeyr338 commented May 5, 2019

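Thanks @driftluo
Hi @mfuuzy, could you paste the contents of node 4's .env file so I can take a look?
If you have already run setup correctly, the cause may be an incorrect rabbitmq configuration in that file.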

@mfuuzy
Author

mfuuzy commented May 5, 2019

See the image:
node4_env

@leeyr338
Contributor

leeyr338 commented May 5, 2019

Analysis of cita-forever.log shows that bft exited before network exited abnormally. Excerpt from cita-forever.log:

2019-04-29T16:31:05.218011802+08:00 - INFO - process id: 12838
2019-04-29T16:31:05.218055676+08:00 - INFO - cita-chain started
2019-04-29T16:31:11.422652242+08:00 - WARN - cita-bft exit status is ExitStatus(ExitStatus(256))
2019-04-29T16:31:11.423988713+08:00 - INFO - process id: 12944
2019-04-29T16:31:11.424048196+08:00 - INFO - cita-bft started
2019-04-29T16:31:14.549753937+08:00 - WARN - cita-bft exit status is ExitStatus(ExitStatus(256))
2019-04-29T16:31:14.549992295+08:00 - INFO - process id: 12957
2019-04-29T16:31:14.550027953+08:00 - INFO - cita-bft started
2019-04-29T16:31:17.681909591+08:00 - WARN - cita-bft exit status is ExitStatus(ExitStatus(256))
2019-04-29T16:31:17.682800908+08:00 - INFO - process id: 12968
2019-04-29T16:31:17.682855037+08:00 - INFO - cita-bft started
2019-04-29T16:31:20.819867152+08:00 - WARN - cita-bft exit status is ExitStatus(ExitStatus(256))
2019-04-29T16:31:20.819973949+08:00 - WARN - cita-bft reach max respawn limit
2019-04-29T16:31:20.820005683+08:00 - WARN - ==>: Child process cita-bft exited unexpectedly!
2019-04-30T01:37:24.259220872+08:00 - WARN - cita-network exit status is ExitStatus(ExitStatus(0))
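
To read the excerpt above mechanically, the sketch below extracts every `exit status` line and decodes the number. It assumes that the value inside `ExitStatus(ExitStatus(N))` is a raw POSIX wait status, so that 256 corresponds to exit code 1; that is an assumption about how the supervisor printed the status, not something stated in this thread. The name `exit_events` is illustrative only, and `os.waitstatus_to_exitcode` needs Python 3.9 or later.

```python
import os
import re

EXIT_RE = re.compile(
    r"^(?P<ts>\S+) - WARN - (?P<proc>[\w-]+) exit status is "
    r"ExitStatus\(ExitStatus\((?P<raw>\d+)\)\)"
)


def exit_events(log_text):
    """Yield (timestamp, process, exit_code) for every exit line in the log."""
    for line in log_text.splitlines():
        match = EXIT_RE.match(line.strip())
        if match:
            raw = int(match["raw"])
            # Assumption: raw is a POSIX wait status, decoded as after waitpid().
            yield match["ts"], match["proc"], os.waitstatus_to_exitcode(raw)


# Three lines copied from the cita-forever.log excerpt above.
LOG = """\
2019-04-29T16:31:20.819867152+08:00 - WARN - cita-bft exit status is ExitStatus(ExitStatus(256))
2019-04-29T16:31:20.819973949+08:00 - WARN - cita-bft reach max respawn limit
2019-04-30T01:37:24.259220872+08:00 - WARN - cita-network exit status is ExitStatus(ExitStatus(0))
"""

if __name__ == "__main__":
    for ts, proc, code in exit_events(LOG):
        print(ts, proc, "exit code", code)
```

Read this way, cita-bft exited four times within about ten seconds at 16:31 on 29 April and hit the respawn limit, roughly nine hours before cita-network exited at 01:37 on 30 April.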

Excerpt from the bft log:

 2019-04-29T16:31:20.806725735+08:00 - INFO - address[0xef3aed75587ea544b9d8df507b42e29baeded7d1] is not consensus power !
2019-04-29T16:31:20.816863399+08:00 - ERROR -
============================
stack backtrace:
   0:     0x56145b125a4d - backtrace::backtrace::trace::hb70975eebea9c40e
   1:     0x56145b125632 - backtrace::capture::Backtrace::new::h9f9ed123ff1e30ee
   2:     0x56145aecb284 - panic_hook::panic_hook::h2e58e0f6c6bffba5
   3:     0x56145aecafb8 - core::ops::function::Fn::call::hb9f6eae8621c8e50
   4:     0x56145b1c46c8 - rust_panic_with_hook
                        at src/libstd/panicking.rs:482
   5:     0x56145b1c4161 - continue_panic_fmt
                        at src/libstd/panicking.rs:385
   6:     0x56145b1c4045 - rust_begin_unwind
   7:     0x56145b1dd8dc - panic_fmt
                        at src/libcore/panicking.rs:85
   8:     0x56145b1dd81b - panic
                        at src/libcore/panicking.rs:49
   9:     0x56145ae2a2c2 - cita_bft::core::votetime::WaitTimer::start::hd8b90ccaa5f178ec
  10:     0x56145ae318cc - std::sys_common::backtrace::__rust_begin_short_backtrace::h9ec9f7185c88019d
  11:     0x56145ae32fbb - std::panicking::try::do_call::hc8730e50693ed203
  12:     0x56145b1ca049 - __rust_maybe_catch_panic
                        at src/libpanic_unwind/lib.rs:87
  13:     0x56145ae35e6f - <F as alloc::boxed::FnBox<A>>::call_box::hd81aa90afe319fe7
  14:     0x56145b1c93bd - call_once<(),()>
                        at /rustc/91856ed52c58aa5ba66a015354d1cc69e9779bdf/src/liballoc/boxed.rs:759
                         - start_thread
                        at src/libstd/sys_common/thread.rs:14
                         - thread_start
                        at src/libstd/sys/unix/thread.rs:81
  15:     0x7f6c472e36da - start_thread
  16:     0x7f6c46df488e - __clone
  17:                0x0 - <unknown>

position:
Thread <unnamed> panicked at called `Option::unwrap()` on a `None` value, src/libcore/option.rs:345

This is a bug. Please report it at:

    https://github.com/cryptape/cita/issues/new?labels=bug&template=bug_report.md
============================

@leeyr338
Contributor

leeyr338 commented May 5, 2019

This looks somewhat similar to an earlier issue:
https://talk.citahub.com/t/topic/228/7
@KaoImin, could you please take a look?

@KaoImin
Contributor

KaoImin commented May 5, 2019

This looks somewhat similar to an earlier issue:
https://talk.citahub.com/t/topic/228/7
@KaoImin, could you please take a look?

This one is not a system time problem; it is somewhat different.

@kaikai1024
Contributor

@KaoImin, please add a description of the cause and the solution.

@leeyr338
Contributor

leeyr338 commented May 7, 2019

We need to merge these changes into CITA and verify that the problem is fixed before closing this issue.


6 participants