rgw/notification: start/stop endpoint managers in notification manager #56979
Conversation
still not working:
```cpp
#endif
}

  throw configuration_error("unknown schema in: " + endpoint);
  return nullptr;
}

bool init_http_manager(CephContext* cct) {
  if (s_http_manager) return false;
  curl_global_init(CURL_GLOBAL_ALL);
```
i don't think we want this (`curl_global_init`) here. the process should only call it once, and `rgw::curl::setup_curl()` in `rgw::AppMain::init_http_clients()` should take care of that for us
src/rgw/driver/rados/rgw_notify.cc (Outdated)

```cpp
// unlock (lose ownership) queue
int unlock_queue(const std::string& queue_name, librados::IoCtx& ioctx) {
  const auto ret = rados::cls::lock::unlock(&ioctx, queue_name, queue_name+"_lock", lock_cookie);
```
nit: could you please use `int` for error codes to be consistent with the rest of the codebase? `auto` is great when it saves typing, but it does that at the cost of obscuring the actual types. i don't think this is a good tradeoff for integer types especially, because their type names are already short and the signed/unsigned distinction matters
i don't use `auto` to save on typing, i use it to avoid implicit/unintentional casting, and the need to look up the return value of a function unless i have to know it.
e.g. if `rados::cls::lock::unlock()` returns `unsigned` and i keep `auto` (which would be unsigned), i would get a warning in: `if (ret == -ENOENT)`.
but if i cast to `int`, i won't get this warning.
see: https://godbolt.org/z/fa3nj4rr3
```cpp
io_context.stop();
work_guard.reset();
```
if you want the `unlock_queue()` cleanup stuff to happen, you should avoid the call to `io_context.stop()`: stopping means that any suspended coroutines won't ever resume, so the worker threads would join immediately
@cbodley if i'm not stopping the io_context, i can get a crash if the async `unlock()` call completes after the manager is already destroyed.
the other option is to call the sync version of `unlock()` (as i did before), but if we have many queues, this would be an issue.
so, i guess i would risk that sometimes the unlock is called and sometimes it is not (i will eventually get the lock back, once the lock timer expires).
> @cbodley if i'm not stopping the io_context, i can get a crash if the async `unlock()` call completes after the manager is already destroyed.

if you don't call `io_context::stop()`, then any threads calling `io_context.run()` would continue to run until all of their coroutines finish. that means the calls to `thread::join()` just below will block until that happens. so i don't think there's a risk of racing with Manager's destruction
i tried commenting out this `io_context.stop()` line and the cleanup/shutdown mostly works:
```
2024-05-02T15:28:32.534-0400 7f34c54006c0 5 rgw notify: INFO: manager stopped. done processing for queue: RGW11111111111111111:Topic
2024-05-02T15:28:32.534-0400 7f34c54006c0 1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3088 2.0 2:0e9ddf52:notif::RGW11111111111111111%3aTopic:head [stat,call lock.unlock in=71b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13) -- 0x562f2d2eaa80 con 0x562f2ad66480
2024-05-02T15:28:32.536-0400 7f350aa006c0 1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3090 ==== osd_op_reply(3088 RGW11111111111111111:Topic [stat,call] v13'427 uv427 ondisk = 0) ==== 212+0+0 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:32.537-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic not locked (ownership can move)
2024-05-02T15:28:32.537-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic marked for removal
2024-05-02T15:28:33.406-0400 7f34c54006c0 1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3089 2.0 2:6fe6cc47:notif::queues_list_object:head [omap-get-keys in=12b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e13) -- 0x562f2d273500 con 0x562f2ad66480
2024-05-02T15:28:33.406-0400 7f350aa006c0 1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3091 ==== osd_op_reply(3089 queues_list_object [omap-get-keys out=35b] v0'0 uv411 ondisk = 0) ==== 162+0+35 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:33.406-0400 7f34c54006c0 1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3090 2.0 2:0e9ddf52:notif::RGW11111111111111111%3aTopic:head [stat,call lock.lock in=87b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13) -- 0x562f2d273880 con 0x562f2ad66480
2024-05-02T15:28:33.409-0400 7f350aa006c0 1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3092 ==== osd_op_reply(3090 RGW11111111111111111:Topic [stat,call] v13'428 uv428 ondisk = 0) ==== 212+0+0 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:33.409-0400 7f34c54006c0 20 rgw notify: INFO: queue: RGW11111111111111111:Topic ownership (lock) renewed
2024-05-02T15:28:33.409-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic was removed
2024-05-02T15:28:33.409-0400 7f34c54006c0 5 rgw notify: INFO: manager stopped. done processing for queues
```
the only strange part is that `process_queues()` raced to lock the queue again after `process_queue()` unlocked/returned
tried testing locally without `io_context.stop()` when more than one RGW exists:

```
RGW_PER_ZONE=2 ../src/test/rgw/test-rgw-multisite.sh 1
```

but the tests (`data_path_v2_test`) got stuck in "migrating" state. with the above call the tests are running fine.
can you tell which coroutine gets stuck on shutdown?
trying with these steps:

```
$ MON=1 OSD=1 RGW=0 MDS=0 MGR=0 RGW_PER_ZONE=2 ../src/test/rgw/test-rgw-multisite.sh 1
$ bin/radosgw-admin -c run/c1/ceph.conf user create --uid testid --access-key 0555b35654ad1656d804 --secret h7GhxuBLTrlhVUyxSPUKUV8r/2EI4ngqJxD7iBdBYLhwluN30JaT3Q== --display-name 'M. Tester' --email tester@ceph.com
$ BNTESTS_CONF=../work/test_bn.conf python -m nose -s ../src/test/rgw/bucket_notification/test_bn.py -v -a data_path_v2_test
```
i see both rgws crashing on Manager shutdown:
```
-5> 2024-05-03T13:55:17.483-0400 7fd99c4006c0 5 rgw notify: INFO: manager stopped. done processing for queue: :vxhzxs-1_topic
-4> 2024-05-03T13:55:17.483-0400 7fd99c4006c0 1 -- 192.168.245.130:0/2176396696 --> [v2:192.168.245.130:6800/2282501434,v1:192.168.245.130:6801/2282501434] -- osd_op(unknown.0.0:722 2.0 2:071c9d28:notif::%3avxhzxs-1_topic:head [stat,call lock.unlock in=60b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e19) -- 0x563fe8e8b880 con 0x563fe0b80900
-3> 2024-05-03T13:55:17.484-0400 7fd99c4006c0 5 rgw notify: INFO: manager stopped. done processing for queues
-2> 2024-05-03T13:55:17.485-0400 7fdb102006c0 1 -- 192.168.245.130:0/2176396696 <== osd.0 v2:192.168.245.130:6800/2282501434 731 ==== osd_op_reply(722 :vxhzxs-1_topic [stat,call] v19'1633 uv1633 ondisk = 0) ==== 201+0+0 (crc 0 0 0) 0x563fe592ed80 con 0x563fe0b80900
-1> 2024-05-03T13:55:17.485-0400 7fd99c4006c0 10 rgw notify: INFO: queue: :vxhzxs-1_topic not locked (ownership can move)
 0> 2024-05-03T13:55:17.485-0400 7fd99c4006c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7fd99c4006c0 thread_name:notif-worker0

 ceph version Development (no_version) squid (dev)
 1: /home/cbodley/ceph/build/bin/radosgw(+0x13bf562) [0x563fde456562]
 2: /lib64/libc.so.6(+0x3e9a0) [0x7fdb1905c9a0]
 3: pthread_mutex_lock()
 4: (std::mutex::lock()+0x9) [0x563fddc0eb7d]
 5: (rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}::operator()(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >) const+0x2ab) [0x563fde03c9fb]
 6: (spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}::operator()(boost::context::continuation&&) const+0xc9) [0x563fde03cd53]
 7: (boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#6}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}>::run(void*)+0x2e) [0x563fde03cef4]
 8: (void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x68) [0x563fde03cfd8]
 9: make_fcontext()
```
that's because `queue_gc_lock` (and `queue_gc`) are variables on the stack of `process_queues()`, which can return before the `process_queue()` coroutines return

on shutdown, maybe `process_queues()` should keep looping until `owned_queues` is empty?
if this gets too complicated, we might just revert the shutdown/unlock changes and leave `io_context.stop()` there for now
src/rgw/driver/rados/rgw_notify.cc (Outdated)

```cpp
// unlock (lose ownership) queue
int unlock_queue(const std::string& queue_name, librados::IoCtx& ioctx) {
  const auto ret = rados::cls::lock::unlock(&ioctx, queue_name, queue_name+"_lock", lock_cookie);
```
separately, this is making a synchronous call to librados. it should use the `ObjectWriteOperation` overload instead of the `IoCtx` one, and call `rgw_rados_operate()` which accepts our yield context
ok. will change that so we can unlock multiple queues in parallel.
note that the manager thread does not have anything else to do during its shutdown.
@yuvalif, sorry but i am still not able to understand what exactly is the issue?
before the change, the kafka, amqp and http threads continued regardless of the notification thread (their start/stop was part of the main function).
force-pushed from c4d2dcf to 9bfc669
lifecycle tests are failing because they seem to run from the
@yuvalif this was added in #39192. lifecycle only runs once a day by default, so for s3-tests, we don't have access to radosgw-admin so we rely on
@mattbenjamin, @cbodley
```diff
 event_keys = []
 events = http_server.get_and_reset_events()
-assert_equal(number_of_objects * 2, len(events))
+assert number_of_objects * 2 <= len(events)
```
why change this check to look for `<= 2 * events`? because the persistent notifications are not delivering all the events?
since `wait_for_queue_to_drain` is already verified before, all the events should have been delivered and we should receive `number_of_objects * 2` events
persistent notifications may deliver the same event more than once - the delivery guarantee for persistent notifications is "at least once".
since we have 2 events per object, we should receive at least twice the number of objects
jenkins test api

jenkins test make check

jenkins test make check
```cpp
delete s_manager;
s_manager = nullptr;
if (!s_manager) return;
RGWPubSubEndpoint::shutdown_all();
```
is this guaranteed to trigger cancellation of outstanding completions?
if the endpoints are all stopped, there is nothing to invoke the callbacks for the completions.
if outstanding completions aren't canceled, i would expect `process_entry()`'s calls to `push_endpoint->send()` to hang and prevent the io_context threads from joining

separately, what happens if we call `push_endpoint->send()` after `RGWPubSubEndpoint::shutdown_all()`?
> if outstanding completions aren't canceled, i would expect `process_entry()`'s calls to `push_endpoint->send()` to hang and prevent the io_context threads from joining

- in case of kafka, we call all completions with an error code: https://github.com/ceph/ceph/blob/main/src/rgw/rgw_kafka.cc#L101
- similar handling in case of amqp: https://github.com/ceph/ceph/blob/main/src/rgw/rgw_amqp.cc#L146
- for http, i call "stop()" and then the dtor: https://github.com/ceph/ceph/pull/56979/files#diff-a7d6265d00f7a11ffed731f280898bebf406a80b3ccdcecc50c13c48903d669eR416

this is similar to the global "rgw_http_client_cleanup()". however, i don't see any code there that cancels any pending work. @cbodley do you think that calling "RGWHTTPManager::manage_pending_requests()" would fix that?
> separately, what happens if we call `push_endpoint->send()` after `RGWPubSubEndpoint::shutdown_all()`?

the "manager" pointers should be deleted and set to null, so this should return an "-ESRCH" error. however, i do see this crash:
```
"/lib64/libc.so.6(+0x3e6f0) [0x7fb79c23e6f0]",
"__pthread_rwlock_wrlock()",
"(RGWHTTPManager::register_request(rgw_http_req_data*)+0x37) [0x55a1f56c76e7]",
"(RGWHTTPManager::add_request(RGWHTTPClient*)+0x168) [0x55a1f56c8c48]",
"radosgw(+0x80ab13) [0x55a1f58b3b13]",
"radosgw(+0x80f52b) [0x55a1f58b852b]",
"(void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor>, rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#7}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x55) [0x55a1f58b92f5]",
"make_fcontext()"
```
probably a race condition with the deletion of the pointer. will protect it with a shared mutex, so that the different senders don't block one another, and only the shutdown takes the exclusive lock
> however, i don't see any code there that cancels any pending work. @cbodley do you think that calling "RGWHTTPManager::manage_pending_requests()" would fix that?

it looks like `RGWHTTPManager::reqs_thread_entry()`, below the `while (!going_down)` loop, does loop over outstanding requests to call `_unlink_request()` -> `_finish_request(req_data, -ECANCELED);`
force-pushed from 2831b23 to 344b29a
teuthology is failing when sending http notifications after there were (intentional) http notification errors: the http server successfully started mid-test, and handled an empty test request sent to it from the test.
however, the RGW log shows retry failures until the test ends and deletes the topic.
and we can see that each notification is retried 155 times (the retries are 1 second apart per notification).
other teuthology runs show similar failures (no crashes).
will open a tracker for the http client issue. IMO, this PR can be merged as it fixes the reload crash

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
as well as in:
- multisite tests (used for notification v2 migration tests)
- the qa suites running notifications

enable lifecycle logs in notification tests: for the lc notification test cases. this is needed after: 4299679
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

Fixes: https://tracker.ceph.com/issues/65337
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

* tests were passing only because they were not performing their asserts
* tests are now separated with their own attribute
* their topics are now marked "persistent" to work around the issue in: https://tracker.ceph.com/issues/65645
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

for example, job: 7697397 in test: yuvalif-2024-05-08_09:55:02-rgw:notifications-wip-yuval-65337-distro-default-smithi
also reduce the size of the error log by sending fewer objects in the test_ps_s3_persistent_topic_stats test
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

in tests that require retries
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

fail test if not, to indicate this is a test issue and not a product bug
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
jenkins test api

@yuvalif can you please prioritize the squid backport for https://tracker.ceph.com/issues/65996?