BE occasionally crashes in DataStreamSender #2466

Closed
gaodayue opened this issue Dec 16, 2019 · 7 comments

gaodayue (Contributor) commented Dec 16, 2019

Describe the bug
Recently we found that BE occasionally crashes inside brpc with the following stack trace.

#0  dec_ref (this=0x2000000000134) at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/butil/iobuf.cpp:248
#1  butil::IOBuf::clear (this=this@entry=0x7f0e2f176830) at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/butil/iobuf.cpp:723
#2  0x0000000002189662 in ~IOBuf (this=0x7f0e2f176830, __in_chrg=<optimized out>) at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/butil/iobuf.h:108
#3  brpc::Controller::IssueRPC (this=this@entry=0x4d70ba788, start_realtime_us=start_realtime_us@entry=1576230170582720)
    at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/controller.cpp:1096
#4  0x0000000002180977 in brpc::Channel::CallMethod (this=0x692ed700, method=0x66dca380, controller_base=0x4d70ba788, request=0x8571315a0, response=0x4d70ba9d8, done=0x4d70ba780)
    at /home/kangkaisen/palo/thirdparty/src/incubator-brpc-0.9.5/src/brpc/channel.cpp:529
#5  0x000000000115b31f in palo::PInternalService_Stub::transmit_data (this=<optimized out>, controller=0x4d70ba788, request=0x8571315a0, response=0x4d70ba9d8, done=0x4d70ba780)
    at /home/kangkaisen/palo/gensrc/build/gen_cpp/palo_internal_service.pb.cc:299
#6  0x00000000013a4651 in doris::DataStreamSender::Channel::send_batch (this=this@entry=0x8571314a0, batch=batch@entry=0x0, eos=eos@entry=true)
    at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:232
#7  0x00000000013a515c in doris::DataStreamSender::Channel::close_internal (this=0x8571314a0) at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:289
#8  0x00000000013a5309 in close (state=0x39763a600, this=<optimized out>) at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:296
#9  doris::DataStreamSender::close (this=0x7305669c0, state=0x39763a600, exec_status=...) at /home/kangkaisen/palo/be/src/runtime/data_stream_sender.cpp:607
#10 0x0000000000e944cc in doris::PlanFragmentExecutor::open_internal (this=this@entry=0x5e61ab670) at /home/kangkaisen/palo/be/src/runtime/plan_fragment_executor.cpp:348
#11 0x0000000000e946b4 in doris::PlanFragmentExecutor::open (this=this@entry=0x5e61ab670) at /home/kangkaisen/palo/be/src/runtime/plan_fragment_executor.cpp:281
#12 0x0000000000e23797 in doris::FragmentExecState::execute (this=0x5e61ab600) at /home/kangkaisen/palo/be/src/runtime/fragment_mgr.cpp:204
#13 0x0000000000e24f9c in doris::FragmentMgr::exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) (this=0x6ebc000, exec_state=..., cb=...)
    at /home/kangkaisen/palo/be/src/runtime/fragment_mgr.cpp:368
#14 0x0000000000e28364 in operator() (a2=<error reading variable: access outside bounds of object referenced via synthetic pointer>, a1=..., p=<optimized out>, this=<optimized out>)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/bind/mem_fn_template.hpp:280
#15 operator()<boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)> >, boost::_bi::list0> (a=<synthetic pointer>, f=...,
    this=<optimized out>) at /home/kangkaisen/palo/thirdparty/installed/include/boost/bind/bind.hpp:398
#16 operator() (this=<optimized out>) at /home/kangkaisen/palo/thirdparty/installed/include/boost/bind/bind.hpp:1294
#17 boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)> >, boost::_bi::list3<boost::_bi::value<doris::FragmentMgr*>, boost::_bi::value<std::shared_ptr<doris::FragmentExecState> >, boost::_bi::value<std::function<void (doris::PlanFragmentExecutor*)> > > >, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...) at /home/kangkaisen/palo/thirdparty/installed/include/boost/function/function_template.hpp:159
#18 0x0000000000ddf08a in operator() (this=0x7f0e2f177020) at /home/kangkaisen/palo/thirdparty/installed/include/boost/function/function_template.hpp:759
#19 doris::ThreadPool::work_thread (this=0x6ebc080, thread_id=<optimized out>) at /home/kangkaisen/palo/be/src/util/thread_pool.hpp:120
#20 0x00000000016f194d in thread_proxy ()
#21 0x00007f0e74130dd5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f0e73535ead in clone () from /lib64/libc.so.6

The content of the problematic butil::IOBuf is pasted below; both 0x2000000000134 and 0x30000000002a6 are invalid memory addresses, which caused the segmentation fault we saw.

{_bv = {magic = 0, start = 400949248, refs = 0x2000000000134, nref = 0, cap_mask = 2281504768,
      nbytes = 844424930132646}, _sv = {refs = {{offset = 0, length = 400949248, block = 0x2000000000134}, {offset = 0, length = 2281504768, block = 0x30000000002a6}}}}

In the context of DataStreamSender::Channel::send_batch, _packet_seq == 1, which means the DataStreamSender::Channel was closed right after init.

What's also strange is that _closure->_refs == 1. Does that mean the response to the RPC has already been received?

(gdb) p _closure->_refs
$13 = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 1}, <No data fields>}
(gdb) p *_closure
$14 = {<google::protobuf::Closure> = {<No data fields>}, cntl = {<google::protobuf::RpcController> = {<No data fields>}, static FLAGS_IGNORE_EOVERCROWDED = 1, static FLAGS_SECURITY_MODE = 2,
    static FLAGS_ADDED_CONCURRENCY = 4, static FLAGS_READ_PROGRESSIVELY = 8, static FLAGS_PROGRESSIVE_READER = 16, static FLAGS_BACKUP_REQUEST = 32, static FLAGS_DESTROY_CID_IN_DONE = 128,
    static FLAGS_CLOSE_CONNECTION = 256, static FLAGS_LOG_ID = 512, static FLAGS_REQUEST_CODE = 1024, static FLAGS_PB_BYTES_TO_BASE64 = 2048, static FLAGS_ALLOW_DONE_TO_RUN_IN_PLACE = 4096,
    static FLAGS_USED_BY_RPC = 8192, static FLAGS_REQUEST_WITH_AUTH = 32768, static FLAGS_PB_JSONIFY_EMPTY_ARRAY = 65536, static FLAGS_ENABLED_CIRCUIT_BREAKER = 131072, _span = 0x0, _flags = 10240,
    _error_code = 0, _error_text = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
        _M_p = 0x4d70ba7b0 ""}, _M_string_length = 0, {_M_local_buf = "\000\377\377\377\377\377\377\377\000\000\000\000\000\000\000", _M_allocated_capacity = 18446744073709551360}}, _remote_side = {ip = {
        s_addr = 713365002}, port = 8060}, _local_side = {ip = {s_addr = 361312522}, port = 54256}, _session_local_data = 0x0, _server = 0x0, _oncancel_id = {value = 0}, _auth_context = 0x0,
    _mongo_session_data = {px = 0x0}, _rpc_dump_meta = 0x0, _request_protocol = brpc::PROTOCOL_BAIDU_STD, _max_retry = 3, _retry_policy = 0x0, _correlation_id = {value = 501484676562425},
    _connection_type = brpc::CONNECTION_TYPE_SINGLE, _fail_limit = -123456789, _pipelined_count = 0, _timeout_ms = 20000, _connect_timeout_ms = 200, _backup_request_ms = -1, _abstime_us = 1576230190582720,
    _timeout_id = 0, _begin_time_us = 1576230170582720, _end_time_us = 1576230170584451, _tos = 0, _preferred_index = 1, _request_compress_type = brpc::COMPRESS_TYPE_NONE,
    _response_compress_type = brpc::COMPRESS_TYPE_NONE, _log_id = 0, _pchan_sub_count = 0, _response = 0x4d70ba9d8, _done = 0x4d70ba780, _sender = 0x0, _request_code = 0, _single_server_id = 640, _lb = {
      px = 0x0}, _tmp_completion_info = {id = {value = 0}, responded = false}, _current_call = {nretry = 0, need_feedback = false, enable_circuit_breaker = false, touched_by_stream_creator = false,
      peer_id = 640, begin_time_us = 1576230170582720, sending_sock = {_M_t = {
          _M_t = {<std::_Tuple_impl<0, brpc::Socket*, brpc::SocketDeleter>> = {<std::_Tuple_impl<1, brpc::SocketDeleter>> = {<std::_Head_base<1, brpc::SocketDeleter, true>> = {<brpc::SocketDeleter> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, brpc::Socket*, false>> = {_M_head_impl = 0x0}, <No data fields>}, <No data fields>}}}, stream_user_data = 0x0}, _unfinished_call = 0x0,
    _accessed = 0x0, _stream_creator = 0x0,
    _pack_request = 0x21b69c0 <brpc::policy::PackRpcRequest(butil::IOBuf*, brpc::SocketMessage**, unsigned long, google::protobuf::MethodDescriptor const*, brpc::Controller*, butil::IOBuf const&, brpc::Authenticator const*)>, _method = 0x66dca380, _auth = 0x0, _request_buf = {static DEFAULT_BLOCK_SIZE = 8192, static INITIAL_CAP = 32, static INVALID_AREA = 0, {_bv = {magic = 6100, start = 34, refs = 0x4dae30000,
          nref = 0, cap_mask = 0, nbytes = 0}, _sv = {refs = {{offset = 6100, length = 34, block = 0x4dae30000}, {offset = 0, length = 0, block = 0x0}}}}}, _idl_names = {request_name = 0x2f5d3f0 "req",
      response_name = 0x2aed848 "res"}, _idl_result = 12345678987654321, _http_request = 0x0, _http_response = 0x0, _request_attachment = {static DEFAULT_BLOCK_SIZE = 8192, static INITIAL_CAP = 32,
      static INVALID_AREA = 0, {_bv = {magic = 0, start = 0, refs = 0x0, nref = 0, cap_mask = 0, nbytes = 0}, _sv = {refs = {{offset = 0, length = 0, block = 0x0}, {offset = 0, length = 0, block = 0x0}}}}},
    _response_attachment = {static DEFAULT_BLOCK_SIZE = 8192, static INITIAL_CAP = 32, static INVALID_AREA = 0, {_bv = {magic = 0, start = 0, refs = 0x0, nref = 0, cap_mask = 0, nbytes = 0}, _sv = {refs = {{
              offset = 0, length = 0, block = 0x0}, {offset = 0, length = 0, block = 0x0}}}}}, _wpa = {px = 0x0}, _rpa = {px = 0x0}, _request_stream = 18446744073709551615,
    _response_stream = 18446744073709551615, _remote_stream_settings = 0x0, _thrift_method_name = {static npos = 18446744073709551615,
      _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x4d70ba9c8 ""}, _M_string_length = 0, {
        _M_local_buf = "\000\000\000\000\000\000\000\000`\005\000\000\000\000\000", _M_allocated_capacity = 0}}}, result = {<google::protobuf::Message> = {<No data fields>}, static kIndexInFileMessages = 1,
    static kStatusFieldNumber = 1,
    _internal_metadata_ = {<google::protobuf::internal::InternalMetadataWithArenaBase<google::protobuf::UnknownFieldSet, google::protobuf::internal::InternalMetadataWithArena>> = {ptr_ = 0x0,
        static kPtrTagMask = 1, static kPtrValueMask = -2}, <No data fields>}, _has_bits_ = {has_bits_ = {0}}, _cached_size_ = 0, status_ = 0x0}, _refs = {<std::__atomic_base<int>> = {
      static _S_alignment = 4, _M_i = 1}, <No data fields>}}
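
If the root cause turns out to be a lifetime race (for example, close_internal reusing or freeing the controller/request of a transmit_data RPC that is still in flight), one way to rule that out is a guard that waits for the previous RPC's done callback before any reuse. The following is only a rough sketch with hypothetical names (InFlightGuard, mark_in_flight, mark_done, wait_idle); it is not the actual code in data_stream_sender.cpp:

#include <condition_variable>
#include <mutex>

// Hypothetical guard: the channel must not free or reuse the brpc
// controller/request of an in-flight RPC until its done callback has run.
class InFlightGuard {
public:
    void mark_in_flight() {                  // call just before CallMethod()
        std::lock_guard<std::mutex> lk(_mu);
        _in_flight = true;
    }
    void mark_done() {                       // call from the RPC's done callback
        std::lock_guard<std::mutex> lk(_mu);
        _in_flight = false;
        _cv.notify_all();
    }
    void wait_idle() {                       // call before reuse or before close_internal()
        std::unique_lock<std::mutex> lk(_mu);
        _cv.wait(lk, [this] { return !_in_flight; });
    }

private:
    std::mutex _mu;
    std::condition_variable _cv;
    bool _in_flight = false;
};

If the crash disappears with such a wait in place, that would confirm the request buffer is being touched while the RPC is still being issued.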

To Reproduce
Right now I can't find a stable way to reproduce the bug. The version we use is 0.10 with #1830 patched.

hxianshun (Contributor) commented:

@gaodayue Hi, have you found where the problem is?

gaodayue (Contributor, Author) commented Apr 7, 2020

Not yet.

hxianshun (Contributor) commented:

Not yet.

@gaodayue Does this problem occur frequently in your cluster? It occurs about once a day in ours.

gaodayue (Contributor, Author) commented Apr 8, 2020

Not as often as yours; we run into this problem about once a week.

chaoyli (Contributor) commented Apr 8, 2020

I suspect that memory is being written dirty (overwritten). You can use the ASAN sanitizer to find the problem.
You could deploy an ASAN build on just one node, because ASAN consumes a lot of memory.
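
For reference, here is a tiny stand-alone example (not Doris code, just an illustration) of the kind of dirty write ASAN pinpoints, assuming the code is built with GCC or Clang and -fsanitize=address:

// build (assumption): g++ -g -fsanitize=address -fno-omit-frame-pointer uaf.cc -o uaf
#include <cstring>

int main() {
    char* block = new char[32];
    std::memcpy(block, "payload", 8);
    delete[] block;     // block is freed here
    return block[0];    // heap-use-after-free: ASAN reports the allocation, free and use stacks
}

The report shows the exact allocation and free sites, which is usually enough to find where the bad write happens.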

hxianshun (Contributor) commented:

Resources in our cluster are very tight, so I can't use ASAN to analyze this problem. If the memory in your cluster is ample, I hope you can help to find the problem. Thanks very much!

vagetablechicken (Contributor) commented:

We got brpc core dumps in DataStreamSender again today. In our case, the queried tables all use the UNIQUE KEY data model. How about in your cases?
