rgw/notification: add exception handling for persistent notification thread #39521
Conversation
from the tracker:
can we root cause this? in general, out-of-memory exceptions aren't something we can/should try to recover from. but in this case, it seems more likely that we're using uninitialized memory in a code path leading to an allocation
src/rgw/rgw_notify.cc
    break; // exited normally
  } catch (const std::exception& err) {
    ldpp_dout(this, 10) << "Notification worker failed with error: " << err.what() << dendl;
  }
ignoring exceptions here does keep the background threads running, but it doesn't actually recover if the exception was thrown in process_queues(). i'd argue it's better to crash and restart on an unexpected exception
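The trade-off behind that suggestion can be sketched in isolation. The names below are hypothetical, not the rgw code: `step` stands in for one round of queue processing, and the function demonstrates why catch-and-log only skips the failed work rather than recovering it.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical worker loop, not the rgw code: `step` stands in for one
// round of queue processing and may throw.
int run_worker(const std::function<void(int)>& step, int iterations) {
  int completed = 0;
  for (int i = 0; i < iterations; ++i) {
    try {
      step(i);
      ++completed;
    } catch (const std::exception& e) {
      // Catch-and-log keeps the thread alive, but the failed round is
      // simply skipped; nothing re-runs it or repairs its state.
      std::cerr << "worker iteration failed: " << e.what() << '\n';
    }
  }
  // The alternative policy is to not catch at all: the exception escapes
  // the thread, std::terminate() fires, and the daemon restarts cleanly.
  return completed;
}
```

Under the catch-and-log policy a throwing iteration is silently lost, which is why a crash-and-restart on a truly unexpected exception can be the safer default for a daemon managed by a supervisor.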
this is the backtrace for the exception (when not caught):
#0 0x00007f3e444159d5 in raise () from /lib64/libc.so.6
#1 0x00007f3e443fe8a4 in abort () from /lib64/libc.so.6
#2 0x00007f3e447c6926 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#3 0x00007f3e447d21ac in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4 0x00007f3e447d1239 in __cxa_call_terminate () from /lib64/libstdc++.so.6
#5 0x00007f3e447d1b81 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6 0x00007f3e445d5c2f in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
#7 0x00007f3e445d665e in _Unwind_Resume () from /lib64/libgcc_s.so.1
#8 0x0000564497356f9b in boost::asio::detail::scheduler::run (this=0x564498b9d000, ec=...) at /root/projects/ceph/build/boost/include/boost/asio/detail/impl/scheduler.ipp:193
#9 0x00007f3e516a4ab8 in boost::asio::io_context::run (this=0x564498bc7f18) at /root/projects/ceph/build/boost/include/boost/asio/impl/io_context.ipp:63
#10 0x00007f3e521b50bd in rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}::operator()() const (__closure=0x564498ba6a98) at /root/projects/ceph/src/rgw/rgw_notify.cc:497
#11 0x00007f3e521c80de in std::__invoke_impl<void, rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}>(std::__invoke_other, rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}&&) (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#12 0x00007f3e521c7fb6 in std::__invoke<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}>(rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}&&) (__fn=...) at /usr/include/c++/10/bits/invoke.h:95
#13 0x00007f3e521c7e86 in std::thread::_Invoker<std::tuple<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x564498ba6a98) at /usr/include/c++/10/thread:264
#14 0x00007f3e521c7898 in std::thread::_Invoker<std::tuple<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}> >::operator()() (this=0x564498ba6a98) at /usr/include/c++/10/thread:271
#15 0x00007f3e521c7406 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<rgw::notify::Manager::Manager(ceph::common::CephContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, rgw::sal::RGWRadosStore*)::{lambda()#2}> > >::_M_run() (this=0x564498ba6a90) at /usr/include/c++/10/thread:215
#16 0x00007f3e447fe5f4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#17 0x00007f3e445ac3f9 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3e444d9903 in clone () from /lib64/libc.so.6
since coroutines run on their own stack, there's no way for outside code to catch their exceptions. so the spawn library catches exceptions internally, and rethrows them after returning to the calling thread in io_context.run()
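The mechanism described above (catch inside the coroutine frame, rethrow on the thread that called io_context::run()) can be sketched with std::exception_ptr, the standard facility for moving an exception between stacks. The names here are illustrative stand-ins, not the spawn library's actual internals.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// The coroutine body runs on its own stack, so a try/catch in the caller
// never sees its exceptions; capture them as a std::exception_ptr instead.
std::exception_ptr run_on_coroutine_stack(const std::function<void()>& body) {
  try {
    body();
    return nullptr;                    // completed normally
  } catch (...) {
    return std::current_exception();   // carried back across stacks
  }
}

// What io_context::run() effectively does for spawned coroutines:
// rethrow the stored exception on the thread that called run().
std::string run_and_report(const std::function<void()>& body) {
  if (auto ep = run_on_coroutine_stack(body)) {
    try {
      std::rethrow_exception(ep);
    } catch (const std::exception& e) {
      return e.what();                 // now catchable by ordinary code
    }
  }
  return "ok";
}
```

This matches the backtrace above: the abort happens in the unwinder inside boost::asio::detail::scheduler::run, i.e. on the io_context thread, not inside the coroutine frame itself.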
found the actual issue: the queues garbage collector is moved out of the loop to prevent a dangling reference inside the coroutine.
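The bug class behind that fix can be illustrated generically. `Collector` and the deferred-callback vector below are hypothetical stand-ins, not the actual rgw_notify.cc types: callbacks that outlive a loop iteration (like spawned coroutines) capture the collector by reference, so the object must be declared outside the loop to outlive them.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for the queues garbage collector: a counter that
// deferred callbacks (like spawned coroutines) update by reference.
struct Collector { int collected = 0; };

// Buggy shape: if Collector lived *inside* the loop, each deferred
// callback would hold a reference to an object destroyed at the end of
// its iteration -- undefined behavior when the callback later runs.
// Fixed shape (as in this PR): hoist the object out of the loop so it
// outlives every callback that refers to it.
int schedule_and_run(int rounds) {
  Collector gc;  // declared once, outside the loop
  std::vector<std::function<void()>> deferred;
  for (int i = 0; i < rounds; ++i) {
    deferred.push_back([&gc] { ++gc.collected; });  // reference stays valid
  }
  for (auto& cb : deferred) cb();  // runs after all iterations finished
  return gc.collected;
}
```

With the object hoisted, every deferred callback sees a live Collector; the earlier std::bad_alloc was plausibly a symptom of reading through such a dangling reference.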
ok great!
good idea to log the exception - you can drop the while() and break parts though
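The simplified shape being suggested (no while(), no break) might look like the sketch below. This is illustrative only, not the merged patch: `process_queues` is passed in as a stand-in for the real processing entry point, and the real code would log through ldpp_dout() rather than std::cerr.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Simplified worker body: run once, log any exception on the way out.
std::string notification_worker(const std::function<void()>& process_queues) {
  try {
    process_queues();
    return "exited normally";
  } catch (const std::exception& e) {
    // In the real code this would go through ldpp_dout(); after logging,
    // the thread simply ends instead of retrying in a loop.
    std::cerr << "Notification worker failed with error: " << e.what() << '\n';
    return e.what();
  }
}
```

Since the dangling-reference root cause is fixed separately, the catch block here exists only to leave a useful log line, not to keep the thread spinning.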
force-pushed from bbcc755 to c1ac439:
…thread Fixes: https://tracker.ceph.com/issues/49322 Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
force-pushed from c1ac439 to 915963e
jenkins test make check
the teuthology run has one crash without too much info, but it looks suspicious:
from the teuthology.log i see this before the backtrace:
this is strange: even if the problem is still there, that should have been caught. however, the ceph-ci version was not the latest of the PR...
latest run does not have the crash - but lots of dead tests. will try again.
@yuvalif what are you doing to reproduce this? i haven't ever seen it with vstart or in teuthology
the main crash was happening in the persistent notifications tests (which are not in teuthology). the crash seen in teuthology must be different, since most of the persistent notification code is not exercised in teuthology.
latest teuthology run is failing on:
even the unrelated crash shows bad_alloc:
i looked at one of @dang's zipper runs and see the same crash there: http://qa-proxy.ceph.com/teuthology/dang-2021-02-18_17:04:23-rgw-wip-dang-zipper-10-distro-basic-smithi/5892959/teuthology.log
this is happening only in civetweb runs. could easily reproduce it locally with java_s3test. backtrace:
we throw the exception ourselves once we get failure from
so not sure why the allocation of another ~1MB makes any difference. either way, this seems unrelated to this PR.
regarding the
i opened https://tracker.ceph.com/issues/49387 to track the radosgw crashes, and i see that other daemons are crashing this way in https://tracker.ceph.com/issues/49240
@cbodley given the investigation results here: https://tracker.ceph.com/issues/49387
Fixes: https://tracker.ceph.com/issues/49322
Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>