Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpp client core when meta closes connection of query config if connection threshold is hit #286

Open
mentoswang opened this issue Feb 22, 2019 · 7 comments
Assignees

Comments

@mentoswang
Copy link
Contributor

The problem can be reproduced on the temp branch https://github.com/XiaoMi/pegasus/tree/timeout_core

Steps:

  1. build
  2. start onebox (3 meta, 5 replica, connection_threshold_endpoint is set to 7 for testing)
    [network]
    ; how many network threads for network library(used by asio)
    io_service_worker_count = 4
    connection_threshold_endpoint = 7
  3. start a shell, so primary meta connection count will be 6
  4. build cpp client sample, run it (query config from primary meta will hit the connection threshold)
@mentoswang mentoswang self-assigned this Feb 22, 2019
@mentoswang
Copy link
Contributor Author

change client log level to LOG_LEVEL_INFORMATION, to see more log info
change set timeout to 10000 for cpp client sample, more easy to recreate
int ret = client->set(hash_key, sort_key, value, 10000);

@mentoswang
Copy link
Contributor Author

mentoswang commented Feb 22, 2019

Two different core stacks, one of them is as below:

Core was generated by `./pegasus_cpp_sample onebox temp'.
Program terminated with signal SIGSEGV, Segmentation fault.
-0 0x00007f5fe1ba2b88 in main_arena () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f5fb37fe700 (LWP 15557))]
(gdb) bt
-0 0x00007f5fe1ba2b88 in main_arena () from /lib/x86_64-linux-gnu/libc.so.6
-1 0x00000000004c5aa1 in dsn::task_queue::increase_count (count=1, this=0x2223bf0) at /home/mi/pegasus/rdsn/include/dsn/tool-api/task_queue.h:85
-2 dsn::task_queue::enqueue_internal (this=0x2223bf0, task=0x7f5f98000998) at /home/mi/pegasus/rdsn/src/core/core/task_queue.cpp:97
-3 0x000000000056ae12 in dsn::tools::simple_timer_service::<lambda(const boost::system::error_code&)>::operator() (__closure=0x7f5fb37fbd80, ec=...)
at /home/mi/pegasus/rdsn/src/core/tools/common/simple_task_queue.cpp:72
-4 boost::asio::detail::binder1<dsn::tools::simple_timer_service::add_timer(dsn::task*)::<lambda(const boost::system::error_code&)>, boost::system::error_code>::operator() (
this=0x7f5fb37fbd80) at /usr/local/include/boost/asio/detail/bind_handler.hpp:47
-5 boost::asio::asio_handler_invoke<boost::asio::detail::binder1<dsn::tools::simple_timer_service::add_timer(dsn::task*)::<lambda(const boost::system::error_code&)>, boost::system::error_code> > (function=...) at /usr/local/include/boost/asio/handler_invoke_hook.hpp:69
-6 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder1<dsn::tools::simple_timer_service::add_timer(dsn::task*)::<lambda(const boost::system::error_code&)>, boost::system::error_code>, dsn::tools::simple_timer_service::add_timer(dsn::task*)::<lambda(const boost::system::error_code&)> > (context=..., function=...)
at /usr/local/include/boost/asio/detail/handler_invoke_helpers.hpp:37
-7 boost::asio::detail::wait_handler<dsn::tools::simple_timer_service::add_timer(dsn::task*)::<lambda(const boost::system::error_code&)> >::do_complete(boost::asio::detail::io_service_impl *, boost::asio::detail::operation *, const boost::system::error_code &, std::size_t) (owner=0x2225a40, base=0x7f5f94010ab0)
at /usr/local/include/boost/asio/detail/wait_handler.hpp:70
-8 0x000000000056a6ef in boost::asio::detail::task_io_service_operation::complete (bytes_transferred=0, ec=..., owner=..., this=0x7f5f94010ab0)
at /usr/local/include/boost/asio/detail/task_io_service_operation.hpp:38
-9 boost::asio::detail::task_io_service::do_run_one (ec=..., this_thread=..., lock=..., this=0x2225a40) at /usr/local/include/boost/asio/detail/impl/task_io_service.ipp:372
-10 boost::asio::detail::task_io_service::run (ec=..., this=0x2225a40) at /usr/local/include/boost/asio/detail/impl/task_io_service.ipp:149
-11 boost::asio::io_service::run (this=, ec=...) at /usr/local/include/boost/asio/impl/io_service.ipp:66
-12 dsn::tools::simple_timer_service::<lambda()>::operator() (__closure=) at /home/mi/pegasus/rdsn/src/core/tools/common/simple_task_queue.cpp:50
-13 std::_Bind_simple<dsn::tools::simple_timer_service::start()::<lambda()>()>::_M_invoke<> (this=) at /usr/include/c++/5/functional:1531
-14 std::_Bind_simple<dsn::tools::simple_timer_service::start()::<lambda()>()>::operator() (this=) at /usr/include/c++/5/functional:1520
-15 std::thread::_Impl<std::_Bind_simple<dsn::tools::simple_timer_service::start()::<lambda()>()> >::_M_run(void) (this=) at /usr/include/c++/5/thread:115
-16 0x00007f5fe1e76c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
-17 0x00007f5fe296d6ba in start_thread (arg=0x7f5fb37fe700) at pthread_create.c:333
-18 0x00007f5fe18e541d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

client log:
I2019-02-22 13:04:57.662 (1550811897662068659 3cab) mimic.io-thrd.15531: temp.client: start query config, gpid = -1.-1
I2019-02-22 13:04:57.662 (1550811897662116335 3cab) mimic.io-thrd.15531: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 127.0.0.1:34602, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 2, trace_id = 407952f1ef59b91d
D2019-02-22 13:04:57.662 (1550811897662178095 3cab) mimic.io-thrd.15531: client session created, remote_server = 127.0.0.1:34602, current_count = 1
I2019-02-22 13:04:57.662 (1550811897662416421 3cad) mimic.io-thrd.15533: client session 127.0.0.1:34602 connected
I2019-02-22 13:04:57.662 (1550811897662461366 3cad) mimic.io-thrd.15533: boost asio send buffer size is 1313280, set as 16MB, now is 212992
I2019-02-22 13:04:57.662 (1550811897662480418 3cad) mimic.io-thrd.15533: boost asio recv buffer size is 530904, set as 16MB, now is 212992
I2019-02-22 13:04:57.662 (1550811897662512809 3cad) mimic.io-thrd.15533: boost asio set no_delay = true
D2019-02-22 13:04:57.662 (1550811897662527924 3cad) mimic.io-thrd.15533: client session connected, remote_server = 127.0.0.1:34602, current_count = 1
I2019-02-22 13:04:57.663 (1550811897663228862 3caf) mimic.io-thrd.15535: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 10.239.35.124:34601, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 2, trace_id = 407952f1ef59b91d
D2019-02-22 13:04:57.663 (1550811897663301738 3caf) mimic.io-thrd.15535: client session created, remote_server = 10.239.35.124:34601, current_count = 2
I2019-02-22 13:04:57.663 (1550811897663665688 3cb0) mimic.io-thrd.15536: client session 10.239.35.124:34601 connected
I2019-02-22 13:04:57.663 (1550811897663718216 3cb0) mimic.io-thrd.15536: boost asio send buffer size is 1313280, set as 16MB, now is 212992
I2019-02-22 13:04:57.663 (1550811897663739619 3cb0) mimic.io-thrd.15536: boost asio recv buffer size is 530904, set as 16MB, now is 212992
I2019-02-22 13:04:57.663 (1550811897663753633 3cb0) mimic.io-thrd.15536: boost asio set no_delay = true
D2019-02-22 13:04:57.663 (1550811897663767529 3cb0) mimic.io-thrd.15536: client session connected, remote_server = 10.239.35.124:34601, current_count = 2
D2019-02-22 13:04:57.663 (1550811897663955032 3cb0) mimic.io-thrd.15536: asio read from 10.239.35.124:34601 failed: End of file
D2019-02-22 13:04:57.663 (1550811897663981374 3cb0) mimic.io-thrd.15536: client session disconnected, remote_server = 10.239.35.124:34601, current_count = 1
W2019-02-22 13:04:57.664 (1550811897664023999 3cb0) mimic.io-thrd.15536: asio socket shutdown failed, error = Transport endpoint is not connected

The other one is as below:

Core was generated by `./pegasus_cpp_sample onebox temp'.
Program terminated with signal SIGABRT, Aborted.
-0 0x00007f7067008428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: 没有那个文件或目录.
[Current thread is 1 (Thread 0x7f7068a43080 (LWP 13369))]
(gdb) bt
-0 0x00007f7067008428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
-1 0x00007f706700a02a in __GI_abort () at abort.c:89
-2 0x00000000004b203e in dsn_coredump () at /home/mi/temp/pegasus/rdsn/src/core/core/service_api_c.cpp:73
-3 0x000000000056816a in dsn::rpc_client_matcher::~rpc_client_matcher (this=0x17ddf10, __in_chrg=) at /home/mi/temp/pegasus/rdsn/src/core/core/rpc_engine.cpp:71
-4 0x00000000004be548 in dsn::rpc_engine::~rpc_engine (this=0x17ddeb0, __in_chrg=) at /home/mi/temp/pegasus/rdsn/src/core/core/rpc_engine.h:130
-5 std::default_deletedsn::rpc_engine::operator() (this=, __ptr=0x17ddeb0) at /usr/include/c++/5/bits/unique_ptr.h:76
-6 std::unique_ptr<dsn::rpc_engine, std::default_deletedsn::rpc_engine >::~unique_ptr (this=0x17d9b40, __in_chrg=) at /usr/include/c++/5/bits/unique_ptr.h:236
-7 dsn::service_node::io_engine::io_engine (this=0x17d9b40, __in_chrg=) at /home/mi/temp/pegasus/rdsn/src/core/core/service_engine.h:67
-8 dsn::service_node::service_node (this=0x17d9940, __in_chrg=) at /home/mi/temp/pegasus/rdsn/src/core/core/service_engine.h:64
-9 0x00000000004b740a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x17d9930) at /usr/include/c++/5/bits/shared_ptr_base.h:150
-10 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::
__shared_count (this=, __in_chrg=) at /usr/include/c++/5/bits/shared_ptr_base.h:659
-11 std::__shared_ptr<dsn::service_node, (__gnu_cxx::_Lock_policy)2>::
__shared_ptr (this=, __in_chrg=) at /usr/include/c++/5/bits/shared_ptr_base.h:925
-12 std::shared_ptrdsn::service_node::~shared_ptr (this=, __in_chrg=) at /usr/include/c++/5/bits/shared_ptr.h:93
-13 std::pair<int const, std::shared_ptrdsn::service_node >::pair (this=, __in_chrg=) at /usr/include/c++/5/bits/stl_pair.h:96
-14 __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<int const, std::shared_ptrdsn::service_node > > >::destroy<std::pair<int const, std::shared_ptrdsn::service_node > > (this=, __p=)
at /usr/include/c++/5/ext/new_allocator.h:124
-15 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<int const, std::shared_ptrdsn::service_node > > > >::destroy<std::pair<int const, std::shared_ptrdsn::service_node > > (__a=..., __p=)
at /usr/include/c++/5/bits/alloc_traits.h:542
-16 std::_Rb_tree<int, std::pair<int const, std::shared_ptrdsn::service_node >, std::_Select1st<std::pair<int const, std::shared_ptrdsn::service_node > >, std::less, std::allocator<std::pair<int const, std::shared_ptrdsn::service_node > > >::_M_destroy_node (this=0x816408 <dsn::utils::singletondsn::service_engine::instance()::_instance+808>, __p=0x17f93c0) at /usr/include/c++/5/bits/stl_tree.h:553
-17 std::_Rb_tree<int, std::pair<int const, std::shared_ptrdsn::service_node >, std::_Select1st<std::pair<int const, std::shared_ptrdsn::service_node > >, std::less, std::allocator<std::pair<int const, std::shared_ptrdsn::service_node > > >::_M_drop_node (this=0x816408 <dsn::utils::singletondsn::service_engine::instance()::_instance+808>, __p=0x17f93c0) at /usr/include/c++/5/bits/stl_tree.h:561
-18 std::_Rb_tree<int, std::pair<int const, std::shared_ptrdsn::service_node >, std::_Select1st<std::pair<int const, std::shared_ptrdsn::service_node > >, std::less, std::allocator<std::pair<int const, std::shared_ptrdsn::service_node > > >::_M_erase (
this=this@entry=0x816408 <dsn::utils::singletondsn::service_engine::instance()::_instance+808>, __x=0x17f93c0) at /usr/include/c++/5/bits/stl_tree.h:1614
-19 0x00000000004ba957 in std::_Rb_tree<int, std::pair<int const, std::shared_ptrdsn::service_node >, std::_Select1st<std::pair<int const, std::shared_ptrdsn::service_node > >, std::less, std::allocator<std::pair<int const, std::shared_ptrdsn::service_node > > >::
_Rb_tree (this=0x816408 <dsn::utils::singletondsn::service_engine::instance()::_instance+808>, __in_chrg=) at /usr/include/c++/5/bits/stl_tree.h:858
-20 std::map<int, std::shared_ptrdsn::service_node, std::less, std::allocator<std::pair<int const, std::shared_ptrdsn::service_node > > >::~map (this=0x816408 <dsn::utils::singletondsn::service_engine::instance()::_instance+808>,
__in_chrg=) at /usr/include/c++/5/bits/stl_map.h:96
-21 dsn::service_engine::~service_engine (this=0x8160e0 <dsn::utils::singletondsn::service_engine::instance()::_instance>, __in_chrg=) at /home/mi/temp/pegasus/rdsn/src/core/core/service_engine.h:121
-22 0x00007f706700cff8 in __run_exit_handlers (status=-1, listp=0x7f70673975f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
-23 0x00007f706700d045 in __GI_exit (status=) at exit.c:104
-24 0x00007f7066ff3837 in __libc_start_main (main=0x45fe72 <main(int, char const**)>, argc=3, argv=0x7fff6c8b5a98, init=, fini=, rtld_fini=, stack_end=0x7fff6c8b5a88) at ../csu/libc-start.c:325
-25 0x000000000045c789 in _start ()

client log:
I2019-02-22 12:57:01.483 (1550811421483019254 3439) mimic.io-thrd.13369: temp.client: start query config, gpid = -1.-1, timeout_ms = 9999
I2019-02-22 12:57:01.483 (1550811421483033448 3439) mimic.io-thrd.13369: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 127.0.0.1:34602, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 2, trace_id = 8f6aae70d46ec050
D2019-02-22 12:57:01.483 (1550811421483060980 3439) mimic.io-thrd.13369: client session created, remote_server = 127.0.0.1:34602, current_count = 1
I2019-02-22 12:57:01.483 (1550811421483164247 343b) mimic.io-thrd.13371: client session 127.0.0.1:34602 connected
I2019-02-22 12:57:01.483 (1550811421483186340 343b) mimic.io-thrd.13371: boost asio send buffer size is 1313280, set as 16MB, now is 212992
I2019-02-22 12:57:01.483 (1550811421483193136 343b) mimic.io-thrd.13371: boost asio recv buffer size is 530904, set as 16MB, now is 212992
I2019-02-22 12:57:01.483 (1550811421483197325 343b) mimic.io-thrd.13371: boost asio set no_delay = true
D2019-02-22 12:57:01.483 (1550811421483202544 343b) mimic.io-thrd.13371: client session connected, remote_server = 127.0.0.1:34602, current_count = 1
I2019-02-22 12:57:01.483 (1550811421483363470 343d) mimic.io-thrd.13373: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 10.239.35.124:34601, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 2, trace_id = 8f6aae70d46ec050
D2019-02-22 12:57:01.483 (1550811421483379830 343d) mimic.io-thrd.13373: client session created, remote_server = 10.239.35.124:34601, current_count = 2
I2019-02-22 12:57:01.483 (1550811421483444047 343c) mimic.io-thrd.13372: client session 10.239.35.124:34601 connected
I2019-02-22 12:57:01.483 (1550811421483456477 343c) mimic.io-thrd.13372: boost asio send buffer size is 1313280, set as 16MB, now is 212992
I2019-02-22 12:57:01.483 (1550811421483461921 343c) mimic.io-thrd.13372: boost asio recv buffer size is 530904, set as 16MB, now is 212992
I2019-02-22 12:57:01.483 (1550811421483465960 343c) mimic.io-thrd.13372: boost asio set no_delay = true
D2019-02-22 12:57:01.483 (1550811421483469867 343c) mimic.io-thrd.13372: client session connected, remote_server = 10.239.35.124:34601, current_count = 2
D2019-02-22 12:57:01.483 (1550811421483518515 343e) mimic.io-thrd.13374: asio read from 10.239.35.124:34601 failed: End of file
D2019-02-22 12:57:01.483 (1550811421483531251 343e) mimic.io-thrd.13374: client session disconnected, remote_server = 10.239.35.124:34601, current_count = 1
W2019-02-22 12:57:01.483 (1550811421483550167 343e) mimic.io-thrd.13374: asio socket shutdown failed, error = Transport endpoint is not connected
E2019-02-22 12:57:06.483 (1550811426483692005 3456) mimic.default2.0100343900030003: temp.client: query config reply, gpid = -1.-1, err = ERR_TIMEOUT
I2019-02-22 12:57:07.484 (1550811427484237893 3454) mimic.default0.0101000200000001: temp.client: start query config, gpid = -1.-1, timeout_ms = 3998
I2019-02-22 12:57:07.484 (1550811427484275368 3454) mimic.default0.0101000200000001: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 10.239.35.124:34601, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 4, trace_id = c18b7b02ed7164ef
D2019-02-22 12:57:07.484 (1550811427484292568 3454) mimic.default0.0101000200000001: client session created, remote_server = 10.239.35.124:34601, current_count = 2
I2019-02-22 12:57:07.484 (1550811427484410818 343c) mimic.io-thrd.13372: client session 10.239.35.124:34601 connected
I2019-02-22 12:57:07.484 (1550811427484426118 343c) mimic.io-thrd.13372: boost asio send buffer size is 1313280, set as 16MB, now is 212992
I2019-02-22 12:57:07.484 (1550811427484440641 343c) mimic.io-thrd.13372: boost asio recv buffer size is 530904, set as 16MB, now is 212992
I2019-02-22 12:57:07.484 (1550811427484445384 343c) mimic.io-thrd.13372: boost asio set no_delay = true
D2019-02-22 12:57:07.484 (1550811427484450224 343c) mimic.io-thrd.13372: client session connected, remote_server = 10.239.35.124:34601, current_count = 2
D2019-02-22 12:57:07.484 (1550811427484486857 343c) mimic.io-thrd.13372: asio read from 10.239.35.124:34601 failed: End of file
D2019-02-22 12:57:07.484 (1550811427484496456 343c) mimic.io-thrd.13372: client session disconnected, remote_server = 10.239.35.124:34601, current_count = 1
W2019-02-22 12:57:07.484 (1550811427484510121 343c) mimic.io-thrd.13372: asio socket shutdown failed, error = Transport endpoint is not connected
W2019-02-22 12:57:11.484 (1550811431484775537 343a) mimic.io-thrd.13370: io_getevents returns -22, you probably want to try on another machine:-(
F2019-02-22 12:57:11.547 (1550811431547641883 3439) mimic.io-thrd.13369: assertion expression: _requests[i].size() == 0
F2019-02-22 12:57:11.547 (1550811431547736586 3439) mimic.io-thrd.13369: all rpc entries must be removed before the matcher ends

@mentoswang
Copy link
Contributor Author

mentoswang commented Feb 22, 2019

The problem happened probably because we setup a timer for query_config in partition_resolver_simple::call according to the client request timeout, however the internal rpc request timeout is not set in partition_resolver_simple::query_config, so it will be 5 sec as default. In some of the cases, the timer will be triggered before the rpc timeout and back. In this case the client will core, I didn't find the exact root cause yet.

    // calculate timeout
    int timeout_ms;
    if (nts + 1000 >= request->timeout_ts_us)
        timeout_ms = 1;
    else
        timeout_ms = static_cast<int>(request->timeout_ts_us - nts) / 1000;

    // init timeout timer only when necessary
    {
        zauto_lock l(request->lock);
        if (request->timeout_timer == nullptr) {
            derror("wss: i am here in call 3, timeout_ms = %d", timeout_ms);
            request->timeout_timer =
                tasking::enqueue(LPC_REPLICATION_CLIENT_REQUEST_TIMEOUT,
                                 &_tracker,
                                 [ =, req2 = request ]() mutable { on_timeout(std::move(req2)); },
                                 0,
                                 std::chrono::milliseconds(timeout_ms));
        }
    }

auto msg = dsn::message_ex::create_request(RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX);

@mentoswang
Copy link
Contributor Author

Found the root cause of the 2nd core:

bad case

data/log/log.3.txt:E2019-02-22 19:55:48.331 (1550836548331735179 738f) mimic.io-thrd.29583: wss: i am here in partition_resolver_simple::call
data/log/log.3.txt:E2019-02-22 19:55:48.331 (1550836548331759499 738f) mimic.io-thrd.29583: wss: i am here in partition_resolver_simple::call setup timer, timeout_ms = 9999 >>> setup timer as per client request timeout
data/log/log.3.txt:E2019-02-22 19:55:48.331 (1550836548331847643 738f) mimic.io-thrd.29583: wss: i am here in rpc_client_matcher::on_call, insert rpc code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, hdr.id = 2
data/log/log.3.txt:E2019-02-22 19:55:48.332 (1550836548332311378 7394) mimic.io-thrd.29588: wss: i am here in rpc_client_matcher::on_recv_reply, remove hdr.id = 2
data/log/log.3.txt:E2019-02-22 19:55:48.332 (1550836548332348397 7394) mimic.io-thrd.29588: wss: i am here in rpc_client_matcher::on_call, insert rpc code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, hdr.id = 3
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332442548 73ac) mimic.default2.0100739400010001: wss: i am here in rpc_client_matcher::on_rpc_timeout >>> rpc timeout after 5 sec
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332671392 73ac) mimic.default2.0100739400010001: wss: i am here in rpc_client_matcher::on_rpc_timeout, remove hdr.id = 3
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332725140 73ac) mimic.default2.0100738f00030003: wss: i am here in partition_resolver_simple::query_config_reply >>> callback of rpc call
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332753246 73ac) mimic.default2.0100738f00030003: wss: i am here in partition_resolver_simple::query_config_reply, ERR_TIMEOUT
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332866460 73ac) mimic.default2.0100738f00030003: wss: i am here in partition_resolver_simple::handle_pending_requests ERR_OK
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332899569 73ac) mimic.default2.0100738f00030003: wss: i am here in partition_resolver_simple::handle_pending_requests partition_index = -1
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332919252 73ac) mimic.default2.0100738f00030003: wss: i am here in partition_resolver_simple::handle_pending_requests, ERR_OK, ERR_OBJECT_NOT_FOUND
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332943926 73ac) mimic.default2.0100738f00030003: wss: i am here in partition_resolver_simple::call
data/log/log.3.txt:E2019-02-22 19:55:53.332 (1550836553332986783 73ac) mimic.default2.0100738f00030003: wss: i am here in partition_resolver_simple::call setup retry
data/log/log.3.txt:E2019-02-22 19:55:54.333 (1550836554333122157 73aa) mimic.default0.0101000200000001: wss: i am here in partition_resolver_simple::call >>> retry
data/log/log.3.txt:E2019-02-22 19:55:54.333 (1550836554333331803 73aa) mimic.default0.0101000200000001: wss: i am here in rpc_client_matcher::on_call, insert rpc code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, hdr.id = 4
data/log/log.3.txt:E2019-02-22 19:55:58.330 (1550836558330870275 73ab) mimic.default1.0100738f00030002: wss: i am here in partition_resolver_simple::on_timeout >>> timer timeout after 9999ms, back before retry rpc timeout
data/log/log.3.txt:E2019-02-22 19:55:58.330 (1550836558330935300 73ab) mimic.default1.0100738f00030002: wss: i am here in partition_resolver_simple::end_request
data/log/log.3.txt:E2019-02-22 19:55:58.330 (1550836558330950440 73ab) mimic.default1.0100738f00030001: wss: i am here in partition_resolver::call_task callback, gpid = 0, err = ERR_TIMEOUT
data/log/log.3.txt:E2019-02-22 19:55:58.330 (1550836558330958699 73ab) mimic.default1.0100738f00030001: wss: i am here in partition_resolver::call_task old callback
data/log/log.3.txt:E2019-02-22 19:55:58.379 (1550836558379533479 738f) mimic.io-thrd.29583: wss: i am here in rpc_client_matcher::~rpc_client_matcher, hdr.id = 4, timeout_task = LPC_RPC_TIMEOUT, response_task = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK >>> the problem is the retry rpc is still on the way
F2019-02-22 19:55:58.379 (1550836558379565885 738f) mimic.io-thrd.29583: assertion expression: _requests[i].size() == 0
F2019-02-22 19:55:58.379 (1550836558379572153 738f) mimic.io-thrd.29583: all rpc entries must be removed before the matcher ends

good case

data/log/log.5.txt:E2019-02-22 19:58:09.528 (1550836689528468694 74f4) mimic.io-thrd.29940: wss: i am here in partition_resolver_simple::call
data/log/log.5.txt:E2019-02-22 19:58:09.528 (1550836689528482169 74f4) mimic.io-thrd.29940: wss: i am here in partition_resolver_simple::call setup timer, timeout_ms = 5006
data/log/log.5.txt:E2019-02-22 19:58:09.528 (1550836689528528215 74f4) mimic.io-thrd.29940: wss: i am here in rpc_client_matcher::on_call, insert rpc code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, hdr.id = 2
data/log/log.5.txt:E2019-02-22 19:58:09.528 (1550836689528818469 74f8) mimic.io-thrd.29944: wss: i am here in rpc_client_matcher::on_recv_reply, remove hdr.id = 2
data/log/log.5.txt:E2019-02-22 19:58:09.528 (1550836689528854286 74f8) mimic.io-thrd.29944: wss: i am here in rpc_client_matcher::on_call, insert rpc code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, hdr.id = 3
data/log/log.5.txt:E2019-02-22 19:58:14.528 (1550836694528882565 7511) mimic.default2.010074f800010001: wss: i am here in rpc_client_matcher::on_rpc_timeout >>> rpc timeout after 5 sec
data/log/log.5.txt:E2019-02-22 19:58:14.528 (1550836694528918756 7511) mimic.default2.010074f800010001: wss: i am here in rpc_client_matcher::on_rpc_timeout, remove hdr.id = 3
data/log/log.5.txt:E2019-02-22 19:58:14.528 (1550836694528934697 7511) mimic.default2.010074f400030003: wss: i am here in partition_resolver_simple::query_config_reply >>> callback of rpc call
data/log/log.5.txt:E2019-02-22 19:58:14.528 (1550836694528944567 7511) mimic.default2.010074f400030003: wss: i am here in partition_resolver_simple::query_config_reply, ERR_TIMEOUT
data/log/log.5.txt:E2019-02-22 19:58:14.528 (1550836694528990155 7511) mimic.default2.010074f400030003: wss: i am here in partition_resolver_simple::handle_pending_requests ERR_OK
data/log/log.5.txt:E2019-02-22 19:58:14.529 (1550836694529003450 7511) mimic.default2.010074f400030003: wss: i am here in partition_resolver_simple::handle_pending_requests partition_index = -1
data/log/log.5.txt:E2019-02-22 19:58:14.529 (1550836694529013904 7511) mimic.default2.010074f400030003: wss: i am here in partition_resolver_simple::handle_pending_requests, ERR_OK, ERR_OBJECT_NOT_FOUND
data/log/log.5.txt:E2019-02-22 19:58:14.529 (1550836694529024170 7511) mimic.default2.010074f400030003: wss: i am here in partition_resolver_simple::call
data/log/log.5.txt:E2019-02-22 19:58:14.529 (1550836694529039427 7511) mimic.default2.010074f400030003: wss: i am here in partition_resolver_simple::call setup retry
data/log/log.5.txt:E2019-02-22 19:58:14.534 (1550836694534534820 750f) mimic.default0.010074f400030002: wss: i am here in partition_resolver_simple::on_timeout >>> timer timeout after 5006ms, back after rpc timeout
data/log/log.5.txt:E2019-02-22 19:58:14.534 (1550836694534573070 750f) mimic.default0.010074f400030002: wss: i am here in partition_resolver_simple::end_request
data/log/log.5.txt:E2019-02-22 19:58:14.534 (1550836694534591296 750f) mimic.default0.010074f400030001: wss: i am here in partition_resolver::call_task callback, gpid = 0, err = ERR_TIMEOUT
data/log/log.5.txt:E2019-02-22 19:58:14.534 (1550836694534599697 750f) mimic.default0.010074f400030001: wss: i am here in partition_resolver::call_task old callback

@mentoswang
Copy link
Contributor Author

mentoswang commented Feb 26, 2019

Temporary fix the problem by passing the client request timeout into partition_resolver_simple::query_config to set the coresponding timeout for the actual rpc call.

        if (-1 != pindex) {
            // put into pending queue of querying target partition
            auto it = _pending_requests.find(pindex);
            if (it == _pending_requests.end()) {
                auto pc = new partition_context();
                it = _pending_requests.emplace(pindex, pc).first;
            }
            it->second->requests.push_back(std::move(request));

            // init configuration query task if necessary
            if (nullptr == it->second->query_config_task) {
                it->second->query_config_task = query_config(pindex, timeout_ms);
            }
        } else {
            _pending_requests_before_partition_count_unknown.push_back(std::move(request));
            if (_pending_requests_before_partition_count_unknown.size() == 1) {
                _query_config_task = query_config(pindex, timeout_ms);
            }
        }
    task_spec *sp = task_spec::get(RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX);
    if (timeout_ms >= sp->rpc_timeout_milliseconds)
        timeout_ms = 0;
    auto msg = dsn::message_ex::create_request(RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, timeout_ms);

E2019-02-26 18:59:30.886 (1551178770886220913 4ca2) mimic.io-thrd.19618: wss: i am here in partition_resolver_simple::call
E2019-02-26 18:59:30.886 (1551178770886270790 4ca2) mimic.io-thrd.19618: wss: i am here in partition_resolver_simple::call setup timer, timeout_ms = 9999
I2019-02-26 18:59:30.886 (1551178770886358214 4ca2) mimic.io-thrd.19618: temp.client: start query config, gpid = -1.-1, timeout_ms = 9994
I2019-02-26 18:59:30.886 (1551178770886393692 4ca2) mimic.io-thrd.19618: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 127.0.0.1:34603, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 2, trace_id = 83786c678ddd03ba
E2019-02-26 18:59:30.886 (1551178770886412859 4ca2) mimic.io-thrd.19618: wss: i am here in rpc_client_matcher::on_call, insert hdr.id = 2, request_code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, timeout_task = LPC_RPC_TIMEOUT, response_task = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK
E2019-02-26 18:59:30.887 (1551178770887449353 4ca6) mimic.io-thrd.19622: wss: i am here in rpc_client_matcher::on_recv_reply, remove hdr.id = 2, timeout_task = LPC_RPC_TIMEOUT, response_task = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK
I2019-02-26 18:59:30.887 (1551178770887540004 4ca6) mimic.io-thrd.19622: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 10.239.35.124:34601, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 2, trace_id = 83786c678ddd03ba
E2019-02-26 18:59:30.887 (1551178770887563375 4ca6) mimic.io-thrd.19622: wss: i am here in rpc_client_matcher::on_call, insert hdr.id = 3, request_code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, timeout_task = LPC_RPC_TIMEOUT, response_task = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK
E2019-02-26 18:59:35.887 (1551178775887753170 4cc0) mimic.default2.01004ca600010001: wss: i am here in rpc_client_matcher::on_rpc_timeout1, remove hdr.id = 3, timeout_task = LPC_RPC_TIMEOUT, response_task = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK
E2019-02-26 18:59:35.887 (1551178775887890489 4cc0) mimic.default2.01004ca200030003: wss: i am here in partition_resolver_simple::query_config_reply
E2019-02-26 18:59:35.887 (1551178775887920602 4cc0) mimic.default2.01004ca200030003: temp.client: query config reply, gpid = -1.-1, err = ERR_TIMEOUT
E2019-02-26 18:59:35.887 (1551178775887988750 4cc0) mimic.default2.01004ca200030003: wss: i am here in partition_resolver_simple::handle_pending_requests ERR_OK
E2019-02-26 18:59:35.888 (1551178775888019130 4cc0) mimic.default2.01004ca200030003: wss: i am here in partition_resolver_simple::handle_pending_requests, ERR_OK, ERR_OBJECT_NOT_FOUND
E2019-02-26 18:59:35.888 (1551178775888042311 4cc0) mimic.default2.01004ca200030003: wss: i am here in partition_resolver_simple::call
E2019-02-26 18:59:35.888 (1551178775888063263 4cc0) mimic.default2.01004ca200030003: wss: i am here in partition_resolver_simple::call setup retry
E2019-02-26 18:59:36.888 (1551178776888341178 4cbe) mimic.default0.0101000200000001: wss: i am here in partition_resolver_simple::call
I2019-02-26 18:59:36.888 (1551178776888433571 4cbe) mimic.default0.0101000200000001: temp.client: start query config, gpid = -1.-1, timeout_ms = 3992
I2019-02-26 18:59:36.888 (1551178776888525895 4cbe) mimic.default0.0101000200000001: rpc_name = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, remote_addr = 10.239.35.124:34601, header_format = NET_HDR_DSN, channel = RPC_CHANNEL_TCP, seq_id = 4, trace_id = 12b04fc20186042d
E2019-02-26 18:59:36.888 (1551178776888548869 4cbe) mimic.default0.0101000200000001: wss: i am here in rpc_client_matcher::on_call, insert hdr.id = 4, request_code = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX, timeout_task = LPC_RPC_TIMEOUT, response_task = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK
E2019-02-26 18:59:40.880 (1551178780880812576 4cbf) mimic.default1.0101000000000002: wss: i am here in rpc_client_matcher::on_rpc_timeout1, remove hdr.id = 4, timeout_task = LPC_RPC_TIMEOUT, response_task = RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK
E2019-02-26 18:59:40.880 (1551178780880941179 4cbf) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::query_config_reply
E2019-02-26 18:59:40.880 (1551178780880968663 4cbf) mimic.default1.0101000000000001: temp.client: query config reply, gpid = -1.-1, err = ERR_TIMEOUT
E2019-02-26 18:59:40.881 (1551178780881054311 4cbf) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::handle_pending_requests ERR_OK
E2019-02-26 18:59:40.881 (1551178780881086681 4cbf) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::handle_pending_requests, ERR_OK, ERR_OBJECT_NOT_FOUND
E2019-02-26 18:59:40.881 (1551178780881112056 4cbf) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::call
E2019-02-26 18:59:40.881 (1551178780881135929 4cbf) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::call setup retry
E2019-02-26 18:59:40.885 (1551178780885621160 4cc0) mimic.default2.01004ca200030002: wss: i am here in partition_resolver_simple::on_timeout
E2019-02-26 18:59:40.885 (1551178780885710194 4cc0) mimic.default2.01004ca200030002: wss: i am here in partition_resolver_simple::end_request
E2019-02-26 18:59:40.885 (1551178780885751662 4cc0) mimic.default2.01004ca200030001: wss: i am here in partition_resolver::call_task callback, gpid = 0, err = ERR_TIMEOUT
W2019-02-26 18:59:40.888 (1551178780888077128 4ca3) mimic.io-thrd.19619: io_getevents returns -22, you probably want to try on another machine:-(

@mentoswang
Copy link
Contributor Author

Found the root cause of core1, it's a different problem, need to reconsider the solution to fix the two problems.

@mentoswang
Copy link
Contributor Author

The second problem is because end_request is called as timeout will happen within 100 us, and then the timer task is trying to enqueue shorlty. Considering to not end up the request in this case, but wait for the timer to clean up instead, but still need to make sure rpc in query_config is back before the timer.

// timeout will happen very soon, no way to get the rpc call done
if (nts + 100 >= request->timeout_ts_us) // within 100 us
{
    end_request(std::move(request), ERR_TIMEOUT, rpc_address());
    return;
}

// delay 1 second for further config query
if (from_meta_ack) {
    tasking::enqueue(LPC_REPLICATION_DELAY_QUERY_CONFIG,
                     &_tracker,
                     [ =, req2 = request ]() mutable { call(std::move(req2), false); },
                     0,
                     std::chrono::seconds(1));
    return;
}

// calculate timeout
int timeout_ms;
if (nts + 1000 >= request->timeout_ts_us)
    timeout_ms = 1;
else
    timeout_ms = static_cast<int>(request->timeout_ts_us - nts) / 1000;

// init timeout timer only when necessary
{
    zauto_lock l(request->lock);
    if (request->timeout_timer == nullptr) {
        request->timeout_timer =
            tasking::enqueue(LPC_REPLICATION_CLIENT_REQUEST_TIMEOUT,
                             &_tracker,
                             [ =, req2 = request ]() mutable { on_timeout(std::move(req2)); },
                             0,
                             std::chrono::milliseconds(timeout_ms));
    }
}

E2019-02-28 12:07:52.541 (1551326872541616898 455c) mimic.default1.0101000000000001: temp.client: query config reply, gpid = -1.-1, err = ERR_TIMEOUT
E2019-02-28 12:07:52.541 (1551326872541685356 455c) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::handle_pending_requests ERR_OK
E2019-02-28 12:07:52.541 (1551326872541715626 455c) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::handle_pending_requests, ERR_OK, ERR_OBJECT_NOT_FOUND
E2019-02-28 12:07:52.541 (1551326872541738976 455c) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::call
E2019-02-28 12:07:52.541 (1551326872541763269 455c) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::call timeout immediately
E2019-02-28 12:07:52.541 (1551326872541786034 455c) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::end_request
E2019-02-28 12:07:52.541 (1551326872541811523 455c) mimic.default1.0101000000000001: wss: i am here in partition_resolver_simple::end_request, timer LPC_REPLICATION_CLIENT_REQUEST_TIMEOUT was canceled
E2019-02-28 12:07:52.541 (1551326872541855416 455c) mimic.default1.0100454000030001: wss: i am here in partition_resolver::call_task callback, gpid = 0, err = ERR_TIMEOUT
E2019-02-28 12:07:52.541 (1551326872541911690 455a) mimic.io-thrd.17754: wss: i am here in simple_timer_service::add_timer, enqueue task = LPC_REPLICATION_CLIENT_REQUEST_TIMEOUT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant