Segmentation fault when running several witness nodes on the same machine #377

Closed
devos50 opened this Issue Aug 28, 2017 · 5 comments

devos50 commented Aug 28, 2017

Hi all,

I'm trying to set up a local network to play around a bit with the BitShares software. I've compiled commit 5e67be7 and copied the witness_node executable to another location.

Next, I did the following. I created a genesis file with the command ./witness_node --create-genesis-json my-genesis.json, initialised the data directory with ./witness_node --data-dir=data --genesis-json my-genesis.json, and then closed the program. This left me with a data directory, which I renamed to data-clean. I then use the following bash script to run a witness node with a specific ID:

echo "I am peer $1"
rm -rf instances_test/data$1
cp -r data-clean instances_test/data$1

rm instances_test/data$1/config.ini
cp instances_test/data$1/config-template.ini instances_test/data$1/config.ini

let pn="$1 + 11000"
sed -i -e "s/<BITSHARES_P2P_ENDPOINT>/0.0.0.0:$pn/g" instances_test/data$1/config.ini
sed -i -e "s/<BITSHARES_SEED_NODE>/127.0.0.1:11001/g" instances_test/data$1/config.ini
let pn="$1 + 12000"
sed -i -e "s/<BITSHARES_RPC_ENDPOINT>/0.0.0.0:$pn/g" instances_test/data$1/config.ini
sed -i -e "s/<BITSHARES_WITNESS_ID>/1.6.$1/g" instances_test/data$1/config.ini

gdb -ex r --args /home/pouwelse/bitshares-test/witness_node --data-dir "instances_test/data$1" --partial-operations true --replay-blockchain
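
The port and ID arithmetic in the script above can be checked in isolation. The following is a minimal sketch, assuming the same offsets the script uses (11000 for the P2P port, 12000 for the RPC port) and the 1.6.N witness ID pattern; the function name peer_endpoints is made up for illustration and is not part of BitShares:

```shell
#!/usr/bin/env bash
# Derive the per-peer values that the spawn script substitutes into config.ini.
# peer_endpoints is a hypothetical helper used only for illustration.
peer_endpoints() {
    local id="$1"
    local p2p_port=$((id + 11000))   # same offset as the script's first "let pn"
    local rpc_port=$((id + 12000))   # same offset as the script's second "let pn"
    echo "p2p=0.0.0.0:${p2p_port} rpc=0.0.0.0:${rpc_port} witness=1.6.${id}"
}
```

Calling peer_endpoints 1 prints p2p=0.0.0.0:11001 rpc=0.0.0.0:12001 witness=1.6.1, so peer 1 listens on the same address that every instance uses as its seed node (127.0.0.1:11001).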

Now I open five ssh sessions to my server and I run the script five times, with five different IDs. Note that node 1 is always the seed node in my experiment. However, after a while, node 1 segfaults. I've attached gdb to this process and I get the following stack trace:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000a5a211 in graphene::app::detail::application_impl::get_blockchain_synopsis(fc::ripemd160 const&, unsigned int) ()
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 readline-6.2-9.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) bt
#0  0x0000000000a5a211 in graphene::app::detail::application_impl::get_blockchain_synopsis(fc::ripemd160 const&, unsigned int) ()
#1  0x0000000001050909 in fc::detail::functor_run<graphene::net::detail::statistics_gathering_node_delegate_wrapper::get_blockchain_synopsis(fc::ripemd160 const&, unsigned int)::{lambda()#1}>::run(void*, fc::detail::functor_run<graphene::net::detail::statistics_gathering_node_delegate_wrapper::get_blockchain_synopsis(fc::ripemd160 const&, unsigned int)::{lambda()#1}>) ()
#2  0x0000000000ef2ad5 in fc::task_base::run_impl() ()
#3  0x0000000000ef065f in fc::thread_d::process_tasks() ()
#4  0x0000000000ef08a1 in fc::thread_d::start_process_tasks(long) ()
#5  0x0000000001171db1 in make_fcontext ()
#6  0x0000000000000000 in ?? ()
(gdb) 

This issue is reproducible on my server. When I run a modified version of the script that spawns up to 40 witness_node instances (my server has enough CPU, memory, and file descriptor capacity), there is usually one node that segfaults. With 40 instances, however, the crash is slightly harder to capture using gdb.

The output of the crashing witness node looks like this:

[pouwelse@fs0 bitshares-test]$ bash spawn_instance.sh 1
I am peer 1
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/pouwelse/bitshares-test/witness_node...done.
Starting program: /home/pouwelse/bitshares-test/witness_node --data-dir instances_test/data1 --partial-operations true --replay-blockchain
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/cm/local/apps/gcc/5.2.0/lib64/libstdc++.so.6.0.21-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
	add-auto-load-safe-path /cm/local/apps/gcc/5.2.0/lib64/libstdc++.so.6.0.21-gdb.py
line to your configuration file "/home/pouwelse/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/pouwelse/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
1814845ms th_a       witness.cpp:88                plugin_initialize    ] witness plugin:  plugin_initialize() begin
1814845ms th_a       witness.cpp:99                plugin_initialize    ] Public Key: BTS6MRyAjQq8ud7hVNYcfnVPJqcVpscN5So8BhtHuGYqET5GDW5CV
1814845ms th_a       witness.cpp:117               plugin_initialize    ] witness plugin:  plugin_initialize() end
1814846ms th_a       application.cpp:441           startup              ] Replaying blockchain due to: replay-blockchain argument specified
1814846ms th_a       application.cpp:330           operator()           ] Initializing database...
1814849ms th_a       db_management.cpp:51          reindex              ] reindexing blockchain
1814849ms th_a       db_management.cpp:104         wipe                 ] Wiping database
1814849ms th_a       db_management.cpp:170         close                ] Database close unexpected exception: {"code":10,"name":"assert_exception","message":"Assert Exception","stack":[{"context":{"level":"error","file":"index.hpp","line":111,"method":"get","hostname":"","thread_name":"th_a","timestamp":"2017-08-28T19:30:14"},"format":"maybe_found != nullptr: Unable to find Object","data":{"id":"2.1.0"}}]}
1814853ms th_a       object_database.cpp:87        wipe                 ] Wiping object database...
1814864ms th_a       object_database.cpp:89        wipe                 ] Done wiping object databse.
1814864ms th_a       object_database.cpp:94        open                 ] Opening object database from /home/pouwelse/bitshares-test/instances_test/data1/blockchain ...
1814864ms th_a       object_database.cpp:100       open                 ] Done opening object database.
1814868ms th_a       db_debug.cpp:85               debug_dump           ] total_balances[asset_id_type()].value: 0 core_asset_data.current_supply.value: 1000000000000000 
1814868ms th_a       db_management.cpp:58          reindex              ] !no last block
1814868ms th_a       db_management.cpp:59          reindex              ] last_block:  
[New Thread 0x2aaab33b1700 (LWP 19400)]
[New Thread 0x2aaab35b2700 (LWP 19401)]
[New Thread 0x2aaab37b3700 (LWP 19402)]
[New Thread 0x2aaab39b4700 (LWP 19403)]
[New Thread 0x2aaab3bb5700 (LWP 19404)]
[New Thread 0x2aaab3db6700 (LWP 19405)]
[New Thread 0x2aaab3fb7700 (LWP 19406)]
[New Thread 0x2aaacc200700 (LWP 19407)]
[New Thread 0x2aaacc401700 (LWP 19408)]
[New Thread 0x2aaacc602700 (LWP 19409)]
1814880ms th_a       application.cpp:131           reset_p2p_node       ] Adding seed node 127.0.0.1:11001
1814880ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 104.236.144.84:1777
1814880ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 128.199.143.47:2015
1814881ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 23.92.53.182:1776
1814881ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 192.121.166.162:1776
1814881ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 51.15.61.160:1776
1814937ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 114.92.234.195:62015
1814944ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 144.76.29.248:4243
1814945ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 51.15.42.228:50696
1814945ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 212.47.249.84:50696
1814993ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 45.76.37.29:1776
1814994ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 71.197.2.119:1776
1815253ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 104.145.234.6:1777
1815297ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 149.56.17.159:1776
1815297ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 108.61.176.106:1776
1815298ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 188.241.58.128:1776
1815298ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 82.211.1.99:1776
1815299ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 45.32.117.94:1776
1815299ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 45.32.159.93:1776
1815299ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 116.62.226.52:1776
1815299ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 45.32.117.94:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 149.56.17.159:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 192.121.166.162:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 188.241.58.128:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 139.162.183.240:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 213.167.243.194:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 108.61.176.106:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 45.76.37.29:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 82.211.1.99:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 45.79.174.179:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 94.130.15.169:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 51.15.61.160:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 81.89.101.133:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 209.105.239.13:1776
1815300ms th_a       application.cpp:189           reset_p2p_node       ] Adding seed node 71.197.2.119:1776
1815300ms th_a       application.cpp:204           reset_p2p_node       ] Configured p2p node to listen on 0.0.0.0:11001
1815302ms th_a       application.cpp:279           reset_websocket_serv ] Configured websocket rpc to listen on 0.0.0.0:12001
1815302ms th_a       witness.cpp:122               plugin_startup       ] witness plugin:  plugin_startup() begin
1815302ms th_a       witness.cpp:127               plugin_startup       ] Launching block production for 1 witnesses.

********************************
*                              *
*   ------- NEW CHAIN ------   *
*   - Welcome to Graphene! -   *
*   ------------------------   *
*                              *
********************************

Your genesis seems to have an old timestamp
Please consider using the --genesis-timestamp option to give your genesis a recent timestamp

1815302ms th_a       witness.cpp:138               plugin_startup       ] witness plugin:  plugin_startup() end
1815302ms th_a       main.cpp:179                  main                 ] Started witness node on a chain with 0 blocks.
1815302ms th_a       main.cpp:180                  main                 ] Chain ID is 6f5da6616ad5f1723091d5a683116e20428e771ccf46e9439bd8b7a8c7b9d921
1860164ms th_a       witness.cpp:184               block_production_loo ] Generated block #1 with timestamp 2017-08-28T19:31:00 at time 2017-08-28T19:31:00
1880002ms th_a       application.cpp:538           handle_block         ] Got block: #2 time: 2017-08-28T19:31:20 latency: 2 ms from: init1  irreversible: 0 (-2)

Program received signal SIGSEGV, Segmentation fault.

Am I doing something wrong here or did I encounter a bug in graphene? I can provide other log files if necessary.

Member

abitmore commented Aug 29, 2017

AFAIK there is a bug that causes a crash when the chain has too few blocks and another p2p node tries to connect. It's related to get_blockchain_synopsis(). A workaround is to wait several minutes after the first (initial) producing node has started before launching the other nodes. It's not high priority, so we didn't look deeper into it.
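
The workaround could be scripted roughly as follows. This is only a sketch under stated assumptions: SEED_WARMUP_SECONDS, LAUNCHER, and launch_staggered are illustrative names, the 300-second default is one reading of "several minutes", and the launcher is assumed to be a non-interactive variant of the spawn script (the script in the report runs gdb in the foreground, which would not work as a background job).

```shell
#!/usr/bin/env bash
# Start the seed node first, wait, then start the remaining nodes.
# All names below are hypothetical, not BitShares settings.

SEED_WARMUP_SECONDS="${SEED_WARMUP_SECONDS:-300}"   # "several minutes" (assumed value)
LAUNCHER="${LAUNCHER:-bash spawn_instance.sh}"      # assumed non-interactive launcher

launch_staggered() {
    local seed_id="$1"
    shift
    # Start the seed (first producing) node in the background.
    $LAUNCHER "$seed_id" &
    # Give the seed node time to produce blocks before any peer connects.
    sleep "$SEED_WARMUP_SECONDS"
    local id
    for id in "$@"; do
        $LAUNCHER "$id" &
    done
    # Block until every launched node has exited.
    wait
}
```

For the five-node setup in the report this would be invoked as launch_staggered 1 2 3 4 5. A fixed delay is the simplest reading of the workaround; it does not check how many blocks the seed node has actually produced.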

@abitmore abitmore added the bug label Nov 13, 2017

@abitmore abitmore added this to the Future Non-Consensus-Changing Release milestone Nov 27, 2017

dyakovitsky added a commit to openledger/bitshares-core that referenced this issue Aug 23, 2018

dyakovitsky added a commit to openledger/bitshares-core that referenced this issue Aug 23, 2018

dyakovitsky added a commit to openledger/bitshares-core that referenced this issue Aug 24, 2018

@abitmore abitmore added this to To do in Feature release (201810) via automation Aug 29, 2018

Member

abitmore commented Aug 29, 2018

Fixed by #1286.

@abitmore abitmore closed this Aug 29, 2018

Feature release (201810) automation moved this from To do to Done Aug 29, 2018

Member

ryanRfox commented Sep 5, 2018

Sorry this is after the fact, but I failed to transcribe the email discussions into the Issue Description per standard:

  • Assigned: OpenLedger
  • Estimated:
    • Developer: 35 hours
    • Project Manager: 5 hours

Member

abitmore commented Sep 6, 2018

My estimation on this issue is 5-10 hours.

Member

ryanRfox commented Sep 6, 2018

Per conversations with Core Team Devs, the invoice for weeks 34-35 reflects 10 hours remitted for this issue. In the future, I will work to ensure estimates are reviewed prior to merge to eliminate discrepancies after the fact.
