Pegasus Coredump at boost.asio epoll_reactor 1.11.5 #387

Open
neverchanje opened this issue Aug 29, 2019 · 5 comments
Labels
type/bug This issue reports a bug.

Comments

@neverchanje
Contributor

neverchanje commented Aug 29, 2019

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.

Normal bootstrap of the c3srv-xiaoai cluster, on a replica-server (2019/08/29).

  2. What did you expect to see?

No coredump.

  3. What did you see instead?

Coredump Stack:

(gdb) bt
#0  0x00007fcc302b7ad6 in boost::asio::detail::epoll_reactor::start_op (this=0x309a6e0, op_type=op_type@entry=1, descriptor=1352, descriptor_data=@0x583560e88: 0x0, 
    op=op@entry=0x6cdf6e900, is_continuation=is_continuation@entry=true, allow_speculative=allow_speculative@entry=true)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/impl/epoll_reactor.ipp:219
#1  0x00007fcc302bca55 in start_op (noop=false, is_non_blocking=true, is_continuation=true, op=0x6cdf6e900, op_type=1, impl=..., this=0x309c418)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/impl/reactive_socket_service_base.ipp:214
#2  async_send<boost::asio::detail::consuming_buffers<boost::asio::const_buffer, std::vector<boost::asio::const_buffer> >, boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3> > (flags=0, 
    handler=..., buffers=..., impl=..., this=0x309c418) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/reactive_socket_service_base.hpp:216
#3  async_send<boost::asio::detail::consuming_buffers<boost::asio::const_buffer, std::vector<boost::asio::const_buffer> >, boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3> > (flags=0, 
    handler=<unknown type in /home/work/app/pegasus/c3srv-xiaoai/replica/package/bin/libdsn_replica_server.so, CU 0x34e2bd9, DIE 0x355c99c>, buffers=..., impl=..., this=0x309c3f0)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/stream_socket_service.hpp:330
#4  async_write_some<boost::asio::detail::consuming_buffers<boost::asio::const_buffer, std::vector<boost::asio::const_buffer> >, boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3> > (
    handler=<unknown type in /home/work/app/pegasus/c3srv-xiaoai/replica/package/bin/libdsn_replica_server.so, CU 0x34e2bd9, DIE 0x355bab4>, buffers=..., this=0x583560e80)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/basic_stream_socket.hpp:732
#5  boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >, std::vector<boost::asio::const_buffer, std::allocator<boost::asio::const_buffer> >, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3>::operator()(const boost::system::error_code &, std::size_t, int) (this=0x7fcc185cf030, ec=..., bytes_transferred=<optimized out>, start=<optimized out>)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/impl/write.hpp:181
#6  0x00007fcc302bd39d in operator() (this=0x7fcc185cf030) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/bind_handler.hpp:127
#7  asio_handler_invoke<boost::asio::detail::binder2<boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3>, boost::system::error_code, long unsigned int> > (function=...)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/handler_invoke_hook.hpp:69
#8  invoke<boost::asio::detail::binder2<boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3>, boost::system::error_code, long unsigned int>, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3> (context=..., function=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#9  asio_handler_invoke<boost::asio::detail::binder2<boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3>, boost::system::error_code, long unsigned int>, boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3> (
    this_handler=<optimized out>, function=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/impl/write.hpp:565
#10 invoke<boost::asio::detail::binder2<boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3>, boost::system::error_code, long unsigned int>, boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, std::vector<boost::asio::const_buffer>, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3> > (
    context=..., function=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#11 boost::asio::detail::reactive_socket_send_op<boost::asio::detail::consuming_buffers<boost::asio::const_buffer, std::vector<boost::asio::const_buffer, std::allocator<boost::asio::const_buffer> > >, boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >, std::vector<boost::asio::const_buffer, std::allocator<boost::asio::const_buffer> >, boost::asio::detail::transfer_all_t, dsn::tools::asio_rpc_session::send(uint64_t)::__lambda3> >::do_complete(boost::asio::detail::io_service_impl *, boost::asio::detail::operation *, const boost::system::error_code &, std::size_t) (owner=0x30b6620, base=<optimized out>)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/reactive_socket_send_op.hpp:107
#12 0x000000000074fec9 in complete (bytes_transferred=<optimized out>, ec=..., owner=..., this=<optimized out>)
    at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/task_io_service_operation.hpp:38
#13 do_run_one (ec=..., this_thread=..., lock=..., this=0x30b6620) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/impl/task_io_service.ipp:372
#14 boost::asio::detail::task_io_service::run (this=0x30b6620, ec=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/detail/impl/task_io_service.ipp:149
#15 0x00007fcc302b2cc6 in run (this=<optimized out>, ec=...) at /home/wutao1/boost_1_58_0/output/include/boost/asio/impl/io_service.ipp:66
#16 operator() (__closure=0x30958b0) at /home/wutao1/pegasus-release/rdsn/src/core/tools/common/asio_net_provider.cpp:73
#17 _M_invoke<> (this=0x30958b0) at /home/wutao1/app/include/c++/4.8.2/functional:1732
#18 operator() (this=0x30958b0) at /home/wutao1/app/include/c++/4.8.2/functional:1720
#19 std::thread::_Impl<std::_Bind_simple<dsn::tools::asio_network_provider::start(dsn::rpc_channel, int, bool)::__lambda2()> >::_M_run(void) (this=0x3095898)
    at /home/wutao1/app/include/c++/4.8.2/thread:115
#20 0x00007fcc2d035600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
    at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#21 0x00007fcc2dca8dc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007fcc2c79f73d in clone () from /lib64/libc.so.6

  4. What version of Pegasus are you using?

Pegasus-server 1.11.5

@neverchanje neverchanje added the type/bug This issue reports a bug. label Aug 29, 2019
@neverchanje
Contributor Author

The coredump point (boost/asio/detail/impl/epoll_reactor.ipp:219) is:


void epoll_reactor::start_op(int op_type, socket_type descriptor,
    epoll_reactor::per_descriptor_data& descriptor_data, reactor_op* op,
    bool is_continuation, bool allow_speculative)
{
  if (!descriptor_data)
  {
    op->ec_ = boost::asio::error::bad_descriptor;
    post_immediate_completion(op, is_continuation);
    return;
  }

  mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);

  if (descriptor_data->shutdown_)
  {
    post_immediate_completion(op, is_continuation);
    return;
  }

@qinzuoyan
Contributor

I suspect the locks are being used incorrectly. We currently use a read-write lock, and both async_read_some() and async_write() are called while holding the read lock. If we switched to a plain exclusive lock, would the problem go away? Of course, an exclusive lock may hurt performance, but we would need to test to know.
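
A minimal sketch of what this proposal might look like, assuming a hypothetical session class modeled on asio_rpc_session (the names below mirror the real code, but this is illustrative, not the actual rDSN source): replace the read-write lock with a single exclusive mutex so that starting an async operation and closing the socket can never run concurrently.

// Sketch only: exclusive lock instead of the current read-write lock.
// Names (_socket_lock, do_read, close) mirror asio_rpc_session, but this
// is not the actual rDSN implementation.
#include <boost/asio.hpp>
#include <cstddef>
#include <mutex>

class session_sketch
{
public:
    explicit session_sketch(boost::asio::io_service &ios) : _socket(ios) {}

    void do_read(char *ptr, std::size_t remaining)
    {
        // Exclusive lock: close() cannot shut the socket down while this
        // thread is issuing the asynchronous read.
        std::lock_guard<std::mutex> guard(_socket_lock);
        _socket.async_read_some(
            boost::asio::buffer(ptr, remaining),
            [this](boost::system::error_code ec, std::size_t /*length*/) {
                if (ec) {
                    // handle the failure (on_failure() in the real code)
                }
            });
    }

    void close()
    {
        // Same exclusive lock, so shutdown/close is serialized against any
        // thread that is starting an async operation on the socket.
        std::lock_guard<std::mutex> guard(_socket_lock);
        boost::system::error_code ec;
        _socket.shutdown(boost::asio::socket_base::shutdown_both, ec);
        _socket.close(ec);
    }

private:
    std::mutex _socket_lock;
    boost::asio::ip::tcp::socket _socket;
};

Note that the lock only covers initiating the operation; the completion handler still runs later on an io_service thread, which is part of why the later comments argue that a lock alone cannot fully prevent the race.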

@foreverneverer
Contributor

foreverneverer commented Dec 10, 2019

@Smityz
Contributor

Smityz commented Jan 7, 2022

A lock alone probably cannot prevent the race condition here.
asio_rpc_session.cpp:86

  utils::auto_read_lock socket_guard(_socket_lock);

    _socket->async_read_some(
        boost::asio::buffer(ptr, remaining),
        [this](boost::system::error_code ec, std::size_t length) {
            if (!!ec) {
                if (ec == boost::asio::error::make_error_code(boost::asio::error::eof)) {
                    ddebug("asio read from %s failed: %s",
                           _remote_addr.to_string(),
                           ec.message().c_str());
                } else {
                    derror("asio read from %s failed: %s",
                           _remote_addr.to_string(),
                           ec.message().c_str());
                }
                on_failure();
            } else {

async_read_some is queued to run on the io_service, but there is a synchronous method that closes the socket immediately:

void asio_rpc_session::close()
{
    utils::auto_write_lock socket_guard(_socket_lock);

    boost::system::error_code ec;
    _socket->shutdown(boost::asio::socket_base::shutdown_type::shutdown_both, ec);
    if (ec)
        dwarn("asio socket shutdown failed, error = %s", ec.message().c_str());
    _socket->close(ec);
    if (ec)
        dwarn("asio socket close failed, error = %s", ec.message().c_str());
}

Because our io_service runs with multiple worker threads, thread A can close the socket and thread B can then try to read from it, triggering the coredump described in this issue.

The simplest fix is to set worker_count to 1 so the io_service runs single-threaded, but the performance impact needs to be tested.
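
For illustration, a minimal sketch of the threading model being discussed, loosely mirroring what asio_network_provider::start does around asio_net_provider.cpp:73 (the code below is illustrative, not the actual rDSN source): N worker threads all call run() on the same io_service, so with worker_count > 1 one thread can close the socket while another is starting an operation on it; with worker_count = 1 everything is serialized on a single thread.

// Illustrative only: several threads sharing one io_service, which is the
// multi-threaded setup that lets close() race with async_read_some().
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_service ios;
    boost::asio::io_service::work keep_alive(ios); // keeps run() from returning early

    const int worker_count = 4; // setting this to 1 is the single-threaded workaround
    std::vector<std::thread> workers;
    for (int i = 0; i < worker_count; ++i) {
        workers.emplace_back([&ios]() {
            boost::system::error_code ec;
            ios.run(ec); // each worker dispatches completion handlers concurrently
        });
    }

    // Sockets bound to `ios` have their handlers executed by any of the workers
    // above; a synchronous close() from one thread can race with an async
    // operation being started from another.

    ios.stop();
    for (auto &t : workers)
        t.join();
    return 0;
}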

@Smityz
Copy link
Contributor

Smityz commented Jan 7, 2022

The performance test results were not good: even with the asio thread saturating a single core, it could not handle the traffic that the original 4-thread asio setup sustained.
