rgw/notification: start/stop endpoint managers in notification manager #56979
Conversation
still not working:
```cpp
#endif
}

  throw configuration_error("unknown schema in: " + endpoint);
  return nullptr;
}

bool init_http_manager(CephContext* cct) {
  if (s_http_manager) return false;
  curl_global_init(CURL_GLOBAL_ALL);
```
i don't think we want this (`curl_global_init`) here. the process should only call it once, and `rgw::curl::setup_curl()` in `rgw::AppMain::init_http_clients()` should take care of that for us
src/rgw/driver/rados/rgw_notify.cc (Outdated)

```cpp
// unlock (lose ownership) queue
int unlock_queue(const std::string& queue_name, librados::IoCtx& ioctx) {
  const auto ret = rados::cls::lock::unlock(&ioctx, queue_name, queue_name+"_lock", lock_cookie);
```
nit: could you please use `int` for error codes to be consistent with the rest of the codebase? `auto` is great when it saves typing, but it does that at the cost of obscuring the actual types. i don't think this is a good tradeoff for integer types especially, because their type names are already short and the signed/unsigned distinction matters
i don't use `auto` to save on typing, i use it to avoid implicit/unintentional casting, and the need to look up the return value of a function unless i have to know it.
e.g. if `rados::cls::lock::unlock()` returns `unsigned` and i keep `auto` (which would be unsigned), i would get a warning in: `if (ret == -ENOENT)`.
but if i cast to `int`, i won't get this warning.
see: https://godbolt.org/z/fa3nj4rr3
```cpp
io_context.stop();
work_guard.reset();
```
if you want the `unlock_queue()` cleanup stuff to happen, you should avoid the call to `io_context.stop()`: stopping means that any suspended coroutines won't ever resume, so the worker threads would join immediately
@cbodley if i'm not stopping the io_context, i can get a crash if the async `unlock()` call completes after the manager is already destroyed.
the other option is to call the sync version of `unlock()` (as i did before), but if we have many queues, this would be an issue.
so, i guess i would risk that sometimes the unlock is called and sometimes it is not (i will eventually get the lock back, once the lock timer expires).
> @cbodley if i'm not stopping the io_context, i can get a crash if the async `unlock()` call completes after the manager is already destroyed.

if you don't call `io_context::stop()`, then any threads calling `io_context.run()` would continue to run until all of their coroutines finish. that means the calls to `thread::join()` just below will block until that happens. so i don't think there's a risk of racing with Manager's destruction
i tried commenting out this `io_context.stop()` line and the cleanup/shutdown mostly works:
```
2024-05-02T15:28:32.534-0400 7f34c54006c0 5 rgw notify: INFO: manager stopped. done processing for queue: RGW11111111111111111:Topic
2024-05-02T15:28:32.534-0400 7f34c54006c0 1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3088 2.0 2:0e9ddf52:notif::RGW11111111111111111%3aTopic:head [stat,call lock.unlock in=71b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13) -- 0x562f2d2eaa80 con 0x562f2ad66480
2024-05-02T15:28:32.536-0400 7f350aa006c0 1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3090 ==== osd_op_reply(3088 RGW11111111111111111:Topic [stat,call] v13'427 uv427 ondisk = 0) ==== 212+0+0 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:32.537-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic not locked (ownership can move)
2024-05-02T15:28:32.537-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic marked for removal
2024-05-02T15:28:33.406-0400 7f34c54006c0 1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3089 2.0 2:6fe6cc47:notif::queues_list_object:head [omap-get-keys in=12b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e13) -- 0x562f2d273500 con 0x562f2ad66480
2024-05-02T15:28:33.406-0400 7f350aa006c0 1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3091 ==== osd_op_reply(3089 queues_list_object [omap-get-keys out=35b] v0'0 uv411 ondisk = 0) ==== 162+0+35 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:33.406-0400 7f34c54006c0 1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3090 2.0 2:0e9ddf52:notif::RGW11111111111111111%3aTopic:head [stat,call lock.lock in=87b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13) -- 0x562f2d273880 con 0x562f2ad66480
2024-05-02T15:28:33.409-0400 7f350aa006c0 1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3092 ==== osd_op_reply(3090 RGW11111111111111111:Topic [stat,call] v13'428 uv428 ondisk = 0) ==== 212+0+0 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:33.409-0400 7f34c54006c0 20 rgw notify: INFO: queue: RGW11111111111111111:Topic ownership (lock) renewed
2024-05-02T15:28:33.409-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic was removed
2024-05-02T15:28:33.409-0400 7f34c54006c0 5 rgw notify: INFO: manager stopped. done processing for queues
```
the only strange part is that `process_queues()` raced to lock the queue again after `process_queue()` unlocked/returned
tried testing locally without `io_context.stop()` when more than one RGW exists:

```
RGW_PER_ZONE=2 ../src/test/rgw/test-rgw-multisite.sh 1
```

but the tests (`data_path_v2_test`) got stuck in "migrating" state. with the above call the tests are running fine.
can you tell which coroutine gets stuck on shutdown?
trying with these steps:

```
$ MON=1 OSD=1 RGW=0 MDS=0 MGR=0 RGW_PER_ZONE=2 ../src/test/rgw/test-rgw-multisite.sh 1
$ bin/radosgw-admin -c run/c1/ceph.conf user create --uid testid --access-key 0555b35654ad1656d804 --secret h7GhxuBLTrlhVUyxSPUKUV8r/2EI4ngqJxD7iBdBYLhwluN30JaT3Q== --display-name 'M. Tester' --email tester@ceph.com
$ BNTESTS_CONF=../work/test_bn.conf python -m nose -s ../src/test/rgw/bucket_notification/test_bn.py -v -a data_path_v2_test
```
i see both rgws crashing on Manager shutdown:
```
-5> 2024-05-03T13:55:17.483-0400 7fd99c4006c0 5 rgw notify: INFO: manager stopped. done processing for queue: :vxhzxs-1_topic
-4> 2024-05-03T13:55:17.483-0400 7fd99c4006c0 1 -- 192.168.245.130:0/2176396696 --> [v2:192.168.245.130:6800/2282501434,v1:192.168.245.130:6801/2282501434] -- osd_op(unknown.0.0:722 2.0 2:071c9d28:notif::%3avxhzxs-1_topic:head [stat,call lock.unlock in=60b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e19) -- 0x563fe8e8b880 con 0x563fe0b80900
-3> 2024-05-03T13:55:17.484-0400 7fd99c4006c0 5 rgw notify: INFO: manager stopped. done processing for queues
-2> 2024-05-03T13:55:17.485-0400 7fdb102006c0 1 -- 192.168.245.130:0/2176396696 <== osd.0 v2:192.168.245.130:6800/2282501434 731 ==== osd_op_reply(722 :vxhzxs-1_topic [stat,call] v19'1633 uv1633 ondisk = 0) ==== 201+0+0 (crc 0 0 0) 0x563fe592ed80 con 0x563fe0b80900
-1> 2024-05-03T13:55:17.485-0400 7fd99c4006c0 10 rgw notify: INFO: queue: :vxhzxs-1_topic not locked (ownership can move)
 0> 2024-05-03T13:55:17.485-0400 7fd99c4006c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7fd99c4006c0 thread_name:notif-worker0

 ceph version Development (no_version) squid (dev)
 1: /home/cbodley/ceph/build/bin/radosgw(+0x13bf562) [0x563fde456562]
 2: /lib64/libc.so.6(+0x3e9a0) [0x7fdb1905c9a0]
 3: pthread_mutex_lock()
 4: (std::mutex::lock()+0x9) [0x563fddc0eb7d]
 5: (rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}::operator()(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >) const+0x2ab) [0x563fde03c9fb]
 6: (spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}::operator()(boost::context::continuation&&) const+0xc9) [0x563fde03cd53]
 7: (boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#6}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}>::run(void*)+0x2e) [0x563fde03cef4]
 8: (void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x68) [0x563fde03cfd8]
 9: make_fcontext()
```
that's because `queue_gc_lock` (and `queue_gc`) are variables on the stack of `process_queues()`, which can return before the `process_queue()` coroutines return

on shutdown, maybe `process_queues()` should keep looping until `owned_queues` is empty?
if this gets too complicated, we might just revert the shutdown/unlock changes and leave `io_context.stop()` there for now
src/rgw/driver/rados/rgw_notify.cc (Outdated)

```cpp
// unlock (lose ownership) queue
int unlock_queue(const std::string& queue_name, librados::IoCtx& ioctx) {
  const auto ret = rados::cls::lock::unlock(&ioctx, queue_name, queue_name+"_lock", lock_cookie);
```
separately, this is making a synchronous call to librados. it should use the `ObjectWriteOperation` overload instead of the `IoCtx` one, and call `rgw_rados_operate()` which accepts our yield context
ok. will change that so we can unlock multiple queues in parallel.
note that the manager thread does not have anything else to do during its shutdown.
@yuvalif, sorry but i am still not able to understand what exactly is the issue?
before the change, the kafka, amqp and http threads continued regardless of the notification thread (their start/stop was part of the main function).
force-pushed from c4d2dcf to 9bfc669
lifecycle tests are failing because they seem to run from the
@yuvalif this was added in #39192. lifecycle only runs once a day by default, so for s3-tests, we don't have access to radosgw-admin so we rely on
@mattbenjamin, @cbodley
```diff
 event_keys = []
 events = http_server.get_and_reset_events()
-assert_equal(number_of_objects * 2, len(events))
+assert number_of_objects * 2 <= len(events)
```
why change this check to look for `<= 2 * events`? because the persistent notifications are not delivering all the events?
since `wait_for_queue_to_drain` is already verified before, all the events should have been delivered and we should receive `number_of_objects * 2` events
persistent notifications may deliver the same event more than once - the delivery guarantee for persistent notifications is "at least once".
since we have 2 events per object, we should receive at least twice the number of objects
jenkins test api

jenkins test make check

jenkins test make check
```cpp
delete s_manager;
s_manager = nullptr;
if (!s_manager) return;
RGWPubSubEndpoint::shutdown_all();
```
is this guaranteed to trigger cancellation of outstanding completions?
if the endpoints are all stopped, there is nothing to invoke the callbacks for the completions.
if outstanding completions aren't canceled, i would expect `process_entry()`'s calls to `push_endpoint->send()` to hang and prevent the io_context threads from joining

separately, what happens if we call `push_endpoint->send()` after `RGWPubSubEndpoint::shutdown_all()`?
> if outstanding completions aren't canceled, i would expect `process_entry()`'s calls to `push_endpoint->send()` to hang and prevent the io_context threads from joining

- in case of kafka, we call all completions with an error code: https://github.com/ceph/ceph/blob/main/src/rgw/rgw_kafka.cc#L101
- similar handling in case of amqp: https://github.com/ceph/ceph/blob/main/src/rgw/rgw_amqp.cc#L146
- for http, i call "stop()" and then the dtor: https://github.com/ceph/ceph/pull/56979/files#diff-a7d6265d00f7a11ffed731f280898bebf406a80b3ccdcecc50c13c48903d669eR416

this is similar to the global "rgw_http_client_cleanup()". however, i don't see any code there that cancels any pending work. @cbodley do you think that calling "RGWHTTPManager::manage_pending_requests()" would fix that?
> separately, what happens if we call `push_endpoint->send()` after `RGWPubSubEndpoint::shutdown_all()`?

the "manager" pointers should be deleted and set to null, so this should return an "-ESRCH" error. however, i do see this crash:
```
"/lib64/libc.so.6(+0x3e6f0) [0x7fb79c23e6f0]",
"__pthread_rwlock_wrlock()",
"(RGWHTTPManager::register_request(rgw_http_req_data*)+0x37) [0x55a1f56c76e7]",
"(RGWHTTPManager::add_request(RGWHTTPClient*)+0x168) [0x55a1f56c8c48]",
"radosgw(+0x80ab13) [0x55a1f58b3b13]",
"radosgw(+0x80f52b) [0x55a1f58b852b]",
"(void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor>, rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#7}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x55) [0x55a1f58b92f5]",
"make_fcontext()"
```
probably a race condition with the deletion of the pointer. will protect it with a shared mutex, so that the different senders don't block one another, and only the shutdown takes the exclusive lock
> however, i don't see any code there that cancels any pending work. @cbodley do you think that calling "RGWHTTPManager::manage_pending_requests()" would fix that?

it looks like `RGWHTTPManager::reqs_thread_entry()`, below the `while (!going_down)` loop, does loop over outstanding requests to call `_unlink_request()` -> `_finish_request(req_data, -ECANCELED);`
force-pushed from 2831b23 to 344b29a
teuthology is failing when sending http notifications after there were (intentional) http notification errors: the http server successfully started mid-test, and handled an empty test request sent to it from the test.
however, the RGW log shows retry failures until the test ends and deletes the topic.
and we can see that each notification is retried 155 times (the retries are 1 second apart per notification).
other teuthology runs show similar failures (no crashes).
will open a tracker for the http client issue. IMO, this PR can be merged as it fixes the reload crash

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
as well as in:
- multisite tests (used for notification v2 migration tests)
- the qa suites running notifications

enable lifecycle logs in notification tests: for the lc notification test cases. this is needed after: 4299679
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

Fixes: https://tracker.ceph.com/issues/65337
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

* tests were passing only because they were not performing their asserts
* tests are now separated with their own attribute
* their topics are now marked "persistent" to work around the issue in: https://tracker.ceph.com/issues/65645
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

for example, job: 7697397 in test: yuvalif-2024-05-08_09:55:02-rgw:notifications-wip-yuval-65337-distro-default-smithi
also reduce the size of the error log by sending fewer objects in the test_ps_s3_persistent_topic_stats test
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

in tests that require retries
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>

fail test if not, to indicate this is a test issue and not a product bug
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
jenkins test api

@yuvalif can you please prioritize the squid backport for https://tracker.ceph.com/issues/65996?