
rgw/notification: add exception handling for persistent notification thread #39521

Merged
merged 1 commit on Mar 1, 2021

Conversation

yuvalif
Contributor

@yuvalif yuvalif commented Feb 17, 2021

Fixes: https://tracker.ceph.com/issues/49322

Signed-off-by: Yuval Lifshitz ylifshit@redhat.com


Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@cbodley
Contributor

cbodley commented Feb 17, 2021

from the tracker:

observed an "std::bad_alloc" exception - not sure what the root cause of the exception is.

can we root cause this? in general, out-of-memory exceptions aren't something we can/should try to recover from. but in this case, it seems more likely that we're using uninitialized memory in a code path leading to an allocation

    break; // exited normally
  } catch (const std::exception& err) {
    ldpp_dout(this, 10) << "Notification worker failed with error: " << err.what() << dendl;
  }
Contributor

ignoring exceptions here does keep the background threads running, but it doesn't actually recover if the exception was thrown in process_queues(). i'd argue it's better to crash and restart on an unexpected exception

Contributor Author

this is the backtrace for the exception (when not caught):

#0  0x00007f3e444159d5 in raise () from /lib64/libc.so.6
#1  0x00007f3e443fe8a4 in abort () from /lib64/libc.so.6
#2  0x00007f3e447c6926 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#3  0x00007f3e447d21ac in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4  0x00007f3e447d1239 in __cxa_call_terminate () from /lib64/libstdc++.so.6
#5  0x00007f3e447d1b81 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6  0x00007f3e445d5c2f in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
#7  0x00007f3e445d665e in _Unwind_Resume () from /lib64/libgcc_s.so.1
#8  0x0000564497356f9b in boost::asio::detail::scheduler::run (this=0x564498b9d000, ec=...) at /root/projects/ceph/build/boost/include/boost/asio/detail/impl/scheduler.ipp:193
#9  0x00007f3e516a4ab8 in boost::asio::io_context::run (this=0x564498bc7f18) at /root/projects/ceph/build/boost/include/boost/asio/impl/io_context.ipp:63
#10 0x00007f3e521b50bd in rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}::o
perator()() const (__closure=0x564498ba6a98) at /root/projects/ceph/src/rgw/rgw_notify.cc:497
#11 0x00007f3e521c80de in std::__invoke_impl<void, rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRado
sStore*)::{lambda()#2}>(std::__invoke_other, rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore
*)::{lambda()#2}&&) (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#12 0x00007f3e521c7fb6 in std::__invoke<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{
lambda()#2}>(rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}&&) (__fn=...) a
t /usr/include/c++/10/bits/invoke.h:95
#13 0x00007f3e521c7e86 in std::thread::_Invoker<std::tuple<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal:
:RGWRadosStore*)::{lambda()#2}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x564498ba6a98) at /usr/include/c++/10/thread:264
#14 0x00007f3e521c7898 in std::thread::_Invoker<std::tuple<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal:
:RGWRadosStore*)::{lambda()#2}> >::operator()() (this=0x564498ba6a98) at /usr/include/c++/10/thread:271
#15 0x00007f3e521c7406 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int
, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}> > >::_M_run() (this=0x564498ba6a90) at /usr/include/c++/10/thread:215
#16 0x00007f3e447fe5f4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#17 0x00007f3e445ac3f9 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3e444d9903 in clone () from /lib64/libc.so.6

Contributor

since coroutines run on their own stack, there's no way for outside code to catch their exceptions. so the spawn library catches exceptions internally, and rethrows them after returning to the calling thread in io_context.run()
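
A hedged sketch of the mechanism described here, using a plain posted handler instead of the actual spawn-based coroutine code (names and structure are illustrative, not the rgw_notify.cc implementation): an exception that escapes work running on the io_context propagates out of the thread's io_context::run() call, so that call site is where it can be caught.

// minimal illustration: the exception thrown inside the queued work
// resurfaces in the thread that called io_context::run()
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <iostream>
#include <stdexcept>
#include <thread>

int main() {
  boost::asio::io_context io;
  boost::asio::post(io, [] {
    throw std::runtime_error("failure inside queued work");
  });
  std::thread worker([&io] {
    try {
      io.run();  // the handler's exception is rethrown here
    } catch (const std::exception& e) {
      std::cerr << "worker caught: " << e.what() << '\n';
    }
  });
  worker.join();
}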

Contributor Author

found the actual issue: the queues garbage collector was moved out of the loop to prevent a dangling reference inside the coroutine.
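
A hedged, hypothetical sketch of the lifetime fix being described (not the actual rgw_notify.cc change): any state captured by reference in work that runs later inside io_context::run() must outlive the loop that schedules it, so it is declared once outside the loop.

// illustrative only: 'shared_state' stands in for the queues garbage
// collector; hoisting it out of the loop keeps the references taken by
// the deferred work valid until io.run() has finished
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <iostream>
#include <string>

int main() {
  boost::asio::io_context io;
  std::string shared_state = "queues-gc";  // hoisted out of the loop (the fix)
  for (int i = 0; i < 3; ++i) {
    // declaring the state inside this loop and capturing it by reference
    // would leave the handlers dangling once the iteration ends
    boost::asio::post(io, [&shared_state, i] {
      std::cout << shared_state << " pass " << i << '\n';
    });
  }
  io.run();  // handlers run while shared_state is still alive
}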

Contributor

ok great!

good idea to log the exception - you can drop the while() and break parts though
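
As a rough sketch of what this suggestion amounts to, in the style of the snippet quoted above (not the exact code that was merged): keep the try/catch and the log line, drop the retry loop and the break.

// rough sketch only, assuming the same worker context as the snippet above
try {
  io_context.run();
} catch (const std::exception& err) {
  ldpp_dout(this, 10) << "Notification worker failed with error: " << err.what() << dendl;
}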

@yuvalif yuvalif force-pushed the wip-yuval-fix-49322 branch 2 times, most recently from bbcc755 to c1ac439 on February 17, 2021 20:02
@yuvalif
Contributor Author

yuvalif commented Feb 18, 2021

jenkins test make check

@yuvalif
Contributor Author

yuvalif commented Feb 18, 2021

the teuthology run has one crash without much info, but it looks suspicious:

   "crash_id": "2021-02-18T08:25:54.246674Z_b3857907-42a4-4ee4-92e6-7c6a3e59bae2",
    "timestamp": "2021-02-18T08:25:54.246674Z",
    "process_name": "radosgw",
    "entity_name": "client.0",
    "ceph_version": "17.0.0-828-gc1ac4395",
    "utsname_hostname": "gibba001",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-240.10.1.el8_3.x86_64",
    "utsname_version": "#1 SMP Mon Jan 18 17:05:51 UTC 2021",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8 (Core)",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12dc0) [0x7fc9fd899dc0]",
        "gsignal()",
        "abort()",
        "/lib64/libstdc++.so.6(+0x9006b) [0x7fc9fc89806b]",
        "/lib64/libstdc++.so.6(+0x9650c) [0x7fc9fc89e50c]",
        "/lib64/libstdc++.so.6(+0x95529) [0x7fc9fc89d529]",
        "__gxx_personality_v0()",
        "/lib64/libgcc_s.so.1(+0x10b13) [0x7fc9fc27eb13]",
        "_Unwind_Resume()",
        "(boost::asio::detail::scheduler::run(boost::system::error_code&)+0x744) [0x563add456374]",
        "/lib64/libradosgw.so.2(+0x4a2709) [0x7fca08626709]",
        "/lib64/libstdc++.so.6(+0xc2b73) [0x7fc9fc8cab73]",
        "/lib64/libpthread.so.0(+0x82de) [0x7fc9fd88f2de]",
        "clone()"
    ]   
}

@cbodley
Contributor

cbodley commented Feb 18, 2021

from the teuthology.log i see this before the backtrace:

2021-02-18T08:25:54.245 INFO:tasks.rgw.client.0.gibba001.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T08:25:54.246 INFO:tasks.rgw.client.0.gibba001.stdout:  what():  Bad allocation
2021-02-18T08:25:54.246 INFO:tasks.rgw.client.0.gibba001.stdout:*** Caught signal (Aborted) **
2021-02-18T08:25:54.246 INFO:tasks.rgw.client.0.gibba001.stdout: in thread 7fc95d8ca700 thread_name:radosgw
...

@yuvalif
Contributor Author

yuvalif commented Feb 18, 2021

this is strange; even if the problem is still there, it should have been caught. however, the ceph-ci build was not from the latest version of the PR...
built the latest and kicked off another teuthology run:
http://pulpito.front.sepia.ceph.com/yuvalif-2021-02-18_12:54:54-rgw:verify-wip-yuval-fix-49322-distro-basic-gibba/

@yuvalif
Contributor Author

yuvalif commented Feb 18, 2021

latest run does not have the crash - but lots of dead tests. will try again.

@cbodley
Contributor

cbodley commented Feb 18, 2021

@yuvalif what are you doing to reproduce this? i haven't ever seen it with vstart or in teuthology

@yuvalif
Contributor Author

yuvalif commented Feb 18, 2021

the main crash was happening in the persistent notifications tests (which are not run in teuthology).
i reproduced it manually while testing this PR.

the crash seen in teuthology must be different, since most of the persistent notification code is not exercised in teuthology.
the only notification-related code that runs in teuthology is from line 386 here to line 401.
could not see anything there that could have caused it.

@yuvalif
Contributor Author

yuvalif commented Feb 19, 2021

latest teuthology run is failing on:

suite > Object tests > ObjectTest.testUploadFileList FAILED
com.amazonaws.services.s3.model.AmazonS3Exception
  • crash that seems unrelated:
  "backtrace": [
        "/lib64/libpthread.so.0(+0x12dc0) [0x7efe4211fdc0]",
        "gsignal()",
        "abort()",
        "/lib64/libstdc++.so.6(+0x9006b) [0x7efe4111e06b]",
        "/lib64/libstdc++.so.6(+0x9650c) [0x7efe4112450c]",
        "/lib64/libstdc++.so.6(+0x96567) [0x7efe41124567]",
        "/lib64/libstdc++.so.6(+0x967c8) [0x7efe411247c8]",
        "/lib64/librados.so.2(+0x3abd6) [0x7efe4c697bd6]",
        "(ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7efe4c6eccb6]",
        "(ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7efe4c6ecd06]",
        "(ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7efe4c6ecd5e]",
        "(RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7efe4d358b2f]",
        "(RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7efe4d3843d2]",
        "(RGWPutObj::execute(optional_yield)+0xe21) [0x7efe4d2cbb31]",
        "(rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xbde) [0x7efe4cf619fe]",
        "(process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, int*)+0x27bd) [0x7efe4cf657ad]",
        "(RGWCivetWebFrontend::process(mg_connection*)+0x280) [0x7efe4ce9c2d0]",
        "/lib64/libradosgw.so.2(+0x621056) [0x7efe4d02b056]",
        "/lib64/libradosgw.so.2(+0x622cc7) [0x7efe4d02ccc7]",
        "/lib64/libradosgw.so.2(+0x623188) [0x7efe4d02d188]",
        "/lib64/libpthread.so.0(+0x82de) [0x7efe421152de]",
        "clone()"
  • crash that could be related:
 "backtrace": [
        "/lib64/libpthread.so.0(+0x12dc0) [0x7fea780e0dc0]",
        "gsignal()",
        "abort()",
        "/lib64/libstdc++.so.6(+0x9006b) [0x7fea770df06b]",
        "/lib64/libstdc++.so.6(+0x9650c) [0x7fea770e550c]",
        "/lib64/libstdc++.so.6(+0x95529) [0x7fea770e4529]",
        "__gxx_personality_v0()",
        "/lib64/libgcc_s.so.1(+0x10b13) [0x7fea76ac5b13]",
        "_Unwind_Resume()",
        "(boost::asio::detail::scheduler::run(boost::system::error_code&)+0x744) [0x56112803e374]",
        "/lib64/libradosgw.so.2(+0x4a2709) [0x7fea82e6d709]",
        "/lib64/libstdc++.so.6(+0xc2b73) [0x7fea77111b73]",
        "/lib64/libpthread.so.0(+0x82de) [0x7fea780d62de]",
        "clone()"

@cbodley
Contributor

cbodley commented Feb 19, 2021

even the unrelated crash shows bad_alloc:

2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:  what():  Bad allocation
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:*** Caught signal (Aborted) **
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout: in thread 7efe0aa21700 thread_name:civetweb-worker

i looked at one of @dang's zipper runs and see the same crash there: http://qa-proxy.ceph.com/teuthology/dang-2021-02-18_17:04:23-rgw-wip-dang-zipper-10-distro-basic-smithi/5892959/teuthology.log

@yuvalif
Contributor Author

yuvalif commented Feb 21, 2021

this is happening only in civetweb runs. could easily reproduce it locally with java_s3tests. backtrace:

#0  0x00007f5323ee29d5 in raise () from /lib64/libc.so.6
#1  0x00007f5323ecb8a4 in abort () from /lib64/libc.so.6
#2  0x00007f5324293926 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#3  0x00007f532429f1ac in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4  0x00007f532429f217 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007f532429f4c9 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x00007f532f3af85a in ceph::buffer::v15_2_0::raw_posix_aligned::raw_posix_aligned (this=0x5573e4d0dbc0, l=1179648, _align=8) at /root/projects/ceph/src/common/buffer.cc:171
#7  0x00007f532f3a5ae8 in ceph::buffer::v15_2_0::create_aligned_in_mempool (len=1179648, align=8, mempool=18) at /root/projects/ceph/src/common/buffer.cc:335
#8  0x00007f532f3a5b7f in ceph::buffer::v15_2_0::create_aligned (len=1179648, align=8) at /root/projects/ceph/src/common/buffer.cc:345
#9  0x00007f532f3a5718 in ceph::buffer::v15_2_0::create (len=1179648) at /root/projects/ceph/src/common/buffer.cc:287
#10 0x00007f532f3a5d98 in ceph::buffer::v15_2_0::ptr::ptr (this=0x7f52f74f94e0, l=1179648) at /root/projects/ceph/src/common/buffer.cc:369
#11 0x00007f53319365c7 in RGWPutObj_ObjStore::get_data (this=0x5573e4470c00, bl=...) at /root/projects/ceph/src/rgw/rgw_rest.cc:1049
#12 0x00007f5331988d15 in RGWPutObj_ObjStore_S3::get_data (this=0x5573e4470c00, bl=...) at /root/projects/ceph/src/rgw/rgw_rest_s3.cc:2464
#13 0x00007f53318397b4 in RGWPutObj::execute (this=0x5573e4470c00, y=...) at /root/projects/ceph/src/rgw/rgw_op.cc:3903
#14 0x00007f53312f3628 in rgw_process_authenticated (handler=0x5573e4d5d080, op=@0x7f52f74fb7f0: 0x5573e4470c00, req=0x7f52f74fc630, s=0x7f52f74fbc40, y=..., skip_retarget=false) at /root/projects/ceph/src/rgw/rgw_process.cc:167
#15 0x00007f53312f5ee4 in process_request (store=0x5573e340fc00, rest=0x7ffd5a54aa80, req=0x7f52f74fc630, frontend_prefix="", auth_registry=..., client_io=0x7f52f74fc680, olog=0x0, yield=..., scheduler=0x0, user=0x0, http_ret=0x7f52f74fc6
08) at /root/projects/ceph/src/rgw/rgw_process.cc:301
#16 0x00007f53311288b3 in RGWCivetWebFrontend::process (this=0x5573e336fdc0, conn=0x5573e43109d8) at /root/projects/ceph/src/rgw/rgw_civetweb_frontend.cc:70
#17 0x00007f5331128682 in civetweb_callback (conn=0x5573e43109d8) at /root/projects/ceph/src/rgw/rgw_civetweb_frontend.cc:51
#18 0x00007f5331438abd in handle_request (conn=0x5573e43109d8) at /root/projects/ceph/src/civetweb/src/civetweb.c:12514
#19 0x00007f533143e39d in process_new_connection (conn=0x5573e43109d8) at /root/projects/ceph/src/civetweb/src/civetweb.c:15924
#20 0x00007f533143eb9b in worker_thread_run (thread_args=0x5573e33445e0) at /root/projects/ceph/src/civetweb/src/civetweb.c:16253
#21 0x00007f533143ec5a in worker_thread (thread_func_param=0x5573e33445e0) at /root/projects/ceph/src/civetweb/src/civetweb.c:16296
#22 0x00007f53240793f9 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f5323fa6b53 in clone () from /lib64/libc.so.6

we throw the exception ourselves once we get a failure from posix_memalign().
not sure why we got it there; the call was with len=1179648 and align=8,
and the return value was 12 (=ENOMEM, out of memory).
also, the overall memory consumption of RGW was (pmap):

total          5997716K

so not sure why allocating another ~1MB makes any difference.
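
A hedged sketch of the allocation path being described, simplified from what the buffer code appears to do (not the exact src/common/buffer.cc implementation): posix_memalign() reports failure through its return value, and a nonzero result such as ENOMEM (12) is converted into a bad_alloc-style exception.

// simplified sketch: turn a posix_memalign() failure into an exception,
// the way the backtrace above shows bad_alloc thrown from raw_posix_aligned
#include <cstdlib>
#include <iostream>
#include <new>

void* create_aligned(std::size_t len, std::size_t align) {
  void* data = nullptr;
  int r = ::posix_memalign(&data, align, len);
  if (r != 0) {  // e.g. r == 12 (ENOMEM), as seen in this failure
    throw std::bad_alloc();
  }
  return data;
}

int main() {
  void* p = create_aligned(1179648, 8);  // len/align values from the backtrace
  std::cout << "allocated at " << p << '\n';
  std::free(p);
}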

either way, this seems unrelated to this PR.

@yuvalif
Contributor Author

yuvalif commented Feb 21, 2021

regarding the boost::asio::detail::scheduler::run(boost::system::error_code&) crash:
these are beast java_s3tests; could not reproduce it locally.
after looking at the log, it seems like this is not related to the persistent notification code:

  • the test does not cover bucket notifications
  • the code that periodically checks for the existence of new queues/topics should have kicked in at "17:31:56", while the crash happened at "17:31:41":
    -8883> 2021-02-18T17:31:25.937+0000 7fea471ef700 20 rgw notify: INFO: next queues processing will happen at: Thu Feb 18 17:31:56 2021
  • the persistent notification crash that was fixed was easily reproduced with the "kafka PR", and does not happen anymore

@cbodley
Contributor

cbodley commented Feb 22, 2021

i opened https://tracker.ceph.com/issues/49387 to track the radosgw crashes, and i see that other daemons are crashing this way in https://tracker.ceph.com/issues/49240

@yuvalif yuvalif requested a review from cbodley February 23, 2021 17:11
@yuvalif
Contributor Author

yuvalif commented Feb 28, 2021

@cbodley given the investigation results here: https://tracker.ceph.com/issues/49387
I think that issue is unrelated, and this PR could be merged separately.
