rgw/notification: start/stop endpoint managers in notification manager #56979

Merged
merged 7 commits into ceph:main May 13, 2024

Conversation

yuvalif
Contributor

@yuvalif yuvalif commented Apr 18, 2024


@yuvalif yuvalif requested a review from a team as a code owner April 18, 2024 09:42
@github-actions github-actions bot added the tests label Apr 18, 2024
@yuvalif
Contributor Author

yuvalif commented Apr 18, 2024

still not working:

  • immediate unlock of queues on shutdown
  • lifecycle tests are failing

#endif
}

throw configuration_error("unknown schema in: " + endpoint);
return nullptr;
}

bool init_http_manager(CephContext* cct) {
if (s_http_manager) return false;
curl_global_init(CURL_GLOBAL_ALL);
Contributor

@cbodley cbodley Apr 18, 2024

i don't think we want this (curl_global_init) here. the process should only call it once, and rgw::curl::setup_curl() in rgw::AppMain::init_http_clients() should take care of that for us
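For illustration, this is the usual shape of a process-wide, one-time libcurl init (a generic sketch, not the rgw::curl::setup_curl() implementation):

// Generic sketch of one-time libcurl initialization, guarded so it runs only
// once per process; in rgw this role is played by rgw::curl::setup_curl().
#include <mutex>
#include <curl/curl.h>

static std::once_flag curl_init_flag;

void setup_curl_once() {
  std::call_once(curl_init_flag, [] {
    curl_global_init(CURL_GLOBAL_ALL);  // must run exactly once, before any threads use curl
  });
}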


// unlock (lose ownership) queue
int unlock_queue(const std::string& queue_name, librados::IoCtx& ioctx) {
const auto ret = rados::cls::lock::unlock(&ioctx, queue_name, queue_name+"_lock", lock_cookie);
Contributor

nit: could you please use int for error codes to be consistent with the rest of the codebase?

auto is great when it saves typing, but it does that at the cost of obscuring the actual types. i don't think this is a good tradeoff for integer types especially, because their type names are already short and the signed/unsigned distinction matters

Contributor Author

i don't use auto to save on typing, i use it to avoid implicit/unintentional casting,
and to avoid having to look up the return type of a function unless i really need to know it.
e.g. if rados::cls::lock::unlock() returns unsigned and i keep auto (which would be unsigned), i would get a warning in: if (ret == -ENOENT).
but if i cast to int, i won't get this warning.
see: https://godbolt.org/z/fa3nj4rr3
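
a tiny self-contained illustration of that point, using a hypothetical function in place of rados::cls::lock::unlock():

// fake_unlock() is a hypothetical stand-in; assume it returned unsigned.
#include <cerrno>

unsigned fake_unlock() { return 0; }

void check() {
  const auto ret = fake_unlock();     // deduced as unsigned
  if (ret == -ENOENT) { /* ... */ }   // -Wsign-compare warning flags the mismatch
  const int ret2 = fake_unlock();     // silent unsigned -> int conversion
  if (ret2 == -ENOENT) { /* ... */ }  // compiles without a warning, mismatch is hidden
}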

io_context.stop();
work_guard.reset();
Contributor

if you want the unlock_queue() cleanup stuff to happen, you should avoid the call to io_context.stop(). that means that any suspended coroutines won't ever resume, so the worker threads would join immediately

Contributor Author

@cbodley if i'm not stopping the io_context, i can get a crash if the async unlock() call completes after the manager was already destroyed.
the other option is to call the sync version of unlock() (as i did before), but if we have many queues, this would be an issue.
so, i guess i would accept that sometimes the unlock is called and sometimes it isn't (the lock will eventually be released anyway, once the lock timer expires).

Contributor

@cbodley cbodley Apr 22, 2024

@cbodley if i'm not stopping the io_context, i can get a crash if the async unlock() call completes after the manager was already destroyed.

if you don't call io_context::stop(), then any threads calling io_context.run() would continue to run until all of their coroutines finish. that means the calls to thread::join() just below will block until that happens. so i don't think there's a risk of racing with Manager's destruction
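
a minimal sketch of that shutdown order, with illustrative member names rather than the actual Manager fields:

// Sketch: drop the work guard instead of calling io_context.stop(), so run()
// returns only after all outstanding handlers/coroutines have completed.
#include <boost/asio/io_context.hpp>
#include <boost/asio/executor_work_guard.hpp>
#include <thread>
#include <vector>

struct WorkerPool {
  boost::asio::io_context io_context;
  boost::asio::executor_work_guard<boost::asio::io_context::executor_type> work_guard =
      boost::asio::make_work_guard(io_context);
  std::vector<std::thread> workers;

  void shutdown() {
    work_guard.reset();      // no io_context.stop(): suspended coroutines may still resume
    for (auto& t : workers) {
      t.join();              // blocks until every worker's run() returns naturally
    }
  }
};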

Contributor

i tried commenting out this io_context.stop() line and the cleanup/shutdown mostly works:

2024-05-02T15:28:32.534-0400 7f34c54006c0  5 rgw notify: INFO: manager stopped. done processing for queue: RGW11111111111111111:Topic
2024-05-02T15:28:32.534-0400 7f34c54006c0  1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3088 2.0 2:0e9ddf52:notif::RGW11111111111111111%3aTopic:head [stat,call lock.unlock in=71b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13) -- 0x562f2d2eaa80 con 0x562f2ad66480
2024-05-02T15:28:32.536-0400 7f350aa006c0  1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3090 ==== osd_op_reply(3088 RGW11111111111111111:Topic [stat,call] v13'427 uv427 ondisk = 0) ==== 212+0+0 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:32.537-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic not locked (ownership can move)
2024-05-02T15:28:32.537-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic marked for removal
2024-05-02T15:28:33.406-0400 7f34c54006c0  1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3089 2.0 2:6fe6cc47:notif::queues_list_object:head [omap-get-keys in=12b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e13) -- 0x562f2d273500 con 0x562f2ad66480
2024-05-02T15:28:33.406-0400 7f350aa006c0  1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3091 ==== osd_op_reply(3089 queues_list_object [omap-get-keys out=35b] v0'0 uv411 ondisk = 0) ==== 162+0+35 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:33.406-0400 7f34c54006c0  1 -- 192.168.245.130:0/802693616 --> [v2:192.168.245.130:6800/3518936173,v1:192.168.245.130:6801/3518936173] -- osd_op(unknown.0.0:3090 2.0 2:0e9ddf52:notif::RGW11111111111111111%3aTopic:head [stat,call lock.lock in=87b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13) -- 0x562f2d273880 con 0x562f2ad66480
2024-05-02T15:28:33.409-0400 7f350aa006c0  1 -- 192.168.245.130:0/802693616 <== osd.0 v2:192.168.245.130:6800/3518936173 3092 ==== osd_op_reply(3090 RGW11111111111111111:Topic [stat,call] v13'428 uv428 ondisk = 0) ==== 212+0+0 (crc 0 0 0) 0x562f2ae58d80 con 0x562f2ad66480
2024-05-02T15:28:33.409-0400 7f34c54006c0 20 rgw notify: INFO: queue: RGW11111111111111111:Topic ownership (lock) renewed
2024-05-02T15:28:33.409-0400 7f34c54006c0 10 rgw notify: INFO: queue: RGW11111111111111111:Topic was removed
2024-05-02T15:28:33.409-0400 7f34c54006c0  5 rgw notify: INFO: manager stopped. done processing for queues

the only strange part is that process_queues() raced to lock the queue again after process_queue() unlocked/returned

Contributor Author

tried testing locally without io_context.stop() when more than one RGW exists:

RGW_PER_ZONE=2 ../src/test/rgw/test-rgw-multisite.sh 1

but the tests (data_path_v2_test) got stuck in "migrating" state.
with the io_context.stop() call in place, the tests run fine.

Contributor

can you tell which coroutine gets stuck on shutdown?

Contributor

trying with these steps:

$ MON=1 OSD=1 RGW=0 MDS=0 MGR=0 RGW_PER_ZONE=2 ../src/test/rgw/test-rgw-multisite.sh 1
$ bin/radosgw-admin -c run/c1/ceph.conf user create --uid testid --access-key 0555b35654ad1656d804 --secret h7GhxuBLTrlhVUyxSPUKUV8r/2EI4ngqJxD7iBdBYLhwluN30JaT3Q== --display-name 'M. Tester' --email tester@ceph.com
$ BNTESTS_CONF=../work/test_bn.conf python -m nose -s ../src/test/rgw/bucket_notification/test_bn.py -v -a data_path_v2_test

i see both rgws crashing on Manager shutdown:

    -5> 2024-05-03T13:55:17.483-0400 7fd99c4006c0  5 rgw notify: INFO: manager stopped. done processing for queue: :vxhzxs-1_topic
    -4> 2024-05-03T13:55:17.483-0400 7fd99c4006c0  1 -- 192.168.245.130:0/2176396696 --> [v2:192.168.245.130:6800/2282501434,v1:192.168.245.130:6801/2282501434] -- osd_op(unknown.0.0:722 2.0 2:071c9d28:notif::%3avxhzxs-1_topic:head [stat,call lock.unlock in=60b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e19) -- 0x563fe8e8b880 con 0x563fe0b80900
    -3> 2024-05-03T13:55:17.484-0400 7fd99c4006c0  5 rgw notify: INFO: manager stopped. done processing for queues
    -2> 2024-05-03T13:55:17.485-0400 7fdb102006c0  1 -- 192.168.245.130:0/2176396696 <== osd.0 v2:192.168.245.130:6800/2282501434 731 ==== osd_op_reply(722 :vxhzxs-1_topic [stat,call] v19'1633 uv1633 ondisk = 0) ==== 201+0+0 (crc 0 0 0) 0x563fe592ed80 con 0x563fe0b80900
    -1> 2024-05-03T13:55:17.485-0400 7fd99c4006c0 10 rgw notify: INFO: queue: :vxhzxs-1_topic not locked (ownership can move)
     0> 2024-05-03T13:55:17.485-0400 7fd99c4006c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7fd99c4006c0 thread_name:notif-worker0

 ceph version Development (no_version) squid (dev)
 1: /home/cbodley/ceph/build/bin/radosgw(+0x13bf562) [0x563fde456562]
 2: /lib64/libc.so.6(+0x3e9a0) [0x7fdb1905c9a0]
 3: pthread_mutex_lock()
 4: (std::mutex::lock()+0x9) [0x563fddc0eb7d]
 5: (rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}::operator()(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >) const+0x2ab) [0x563fde03c9fb]
 6: (spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}::operator()(boost::context::continuation&&) const+0xc9) [0x563fde03cd53]  
 7: (boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#6}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}>::run(void*)+0x2e) [0x563fde03cef4]
 8: (void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, rgw::notify::Manager::process_queues(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#1}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x68) [0x563fde03cfd8]
 9: make_fcontext()

Contributor

that's because queue_gc_lock (and queue_gc) are variables on the stack of process_queues(), which can return before the process_queue() coroutines return

on shutdown, maybe process_queues() should keep looping until owned_queues is empty?
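
as a generic illustration of "don't let the parent's stack die before the children are done" (plain threads and a condition variable here, not the asio coroutines in the manager; all names are hypothetical apart from owned_queues):

// Toy model: workers deregister from owned_queues when finished; the parent
// keeps its stack (and anything like queue_gc_lock) alive until the set is empty.
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

int main() {
  std::mutex m;
  std::condition_variable cv;
  std::set<std::string> owned_queues{"q1", "q2", "q3"};

  std::vector<std::thread> workers;
  const std::vector<std::string> names(owned_queues.begin(), owned_queues.end());
  for (const auto& q : names) {
    workers.emplace_back([&, q] {
      // ... process the queue, then deregister ...
      std::lock_guard lk(m);
      owned_queues.erase(q);
      cv.notify_one();
    });
  }

  std::unique_lock lk(m);
  cv.wait(lk, [&] { return owned_queues.empty(); });  // parent waits for all deregistrations
  lk.unlock();
  for (auto& t : workers) t.join();
}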

Contributor

if this gets too complicated, we might just revert the shutdown/unlock changes and leave io_context.stop() there for now


// unlock (lose ownership) queue
int unlock_queue(const std::string& queue_name, librados::IoCtx& ioctx) {
const auto ret = rados::cls::lock::unlock(&ioctx, queue_name, queue_name+"_lock", lock_cookie);
Contributor

separately, this is making a synchronous call to librados. it should use the ObjectWriteOperation overload instead of the IoCtx one, and call rgw_rados_operate() which accepts our yield context

Contributor Author

ok. will change that so we can unlock multiple queues in parallel.
note that the manager thread does not have anything else to do during its shutdown.
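
a rough sketch of that direction; the exact signatures of the ObjectWriteOperation overload of rados::cls::lock::unlock() and of rgw_rados_operate() are assumptions here and should be checked against the tree:

// Sketch only: unlock via an ObjectWriteOperation and a yielding rados call,
// so the coroutine suspends instead of blocking the thread and several queues
// can be unlocked concurrently during shutdown.
int unlock_queue(const DoutPrefixProvider* dpp, const std::string& queue_name,
                 librados::IoCtx& ioctx, optional_yield y) {
  librados::ObjectWriteOperation op;
  rados::cls::lock::unlock(&op, queue_name + "_lock", lock_cookie);
  const int ret = rgw_rados_operate(dpp, ioctx, queue_name, &op, y);
  // -ENOENT here just means the lock (or queue object) is already gone
  return ret;
}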

@kchheda3
Contributor

@yuvalif, sorry but i am still not able to understand what exactly is the issue?
how is manager:shutdown() causing the crash ? can you explain

@yuvalif
Contributor Author

yuvalif commented Apr 21, 2024

@yuvalif, sorry but i am still not able to understand what exactly is the issue? how is manager:shutdown() causing the crash ? can you explain

before the change, the kafka, amqp and http threads continued regardless of the notification thread (their start/stop was part of the main function).
so, after a reload, there could be a case where the kafka thread would invoke a callback that released a coroutine belonging to a notification manager that was already destroyed due to the reload.
after the change, all notification endpoints are started/stopped from within the init/stop of the notification manager.
e.g. we now stop the kafka thread before we stop the notification manager. this removes all pending callbacks, and only after that do we stop and destroy the notification manager.
note that while we wait for kafka to shut down, we may still try to send persistent notifications to kafka. this will result in an error (-ESRCH), but since the notification is persistent, it will just be retried and sent to kafka once the kafka manager is up and running again.
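
schematically, the ordering described above looks like this (all function names are illustrative stand-ins, not the actual rgw_notify entry points):

// Hypothetical stand-ins, declared only so the sketch is self-contained.
struct CephContext;
void start_http_manager(CephContext*);  void stop_http_manager();
void start_amqp_manager(CephContext*);  void stop_amqp_manager();
void start_kafka_manager(CephContext*); void stop_kafka_manager();
void start_queue_workers(CephContext*); void stop_queue_workers();

void notification_manager_init(CephContext* cct) {
  start_http_manager(cct);        // endpoint managers live inside the
  start_amqp_manager(cct);        // notification manager's lifetime
  start_kafka_manager(cct);
  start_queue_workers(cct);       // the manager's own io_context threads
}

void notification_manager_shutdown() {
  stop_http_manager();            // endpoint threads stop first, dropping any
  stop_amqp_manager();            // pending callbacks that could otherwise fire
  stop_kafka_manager();           // into an already-destroyed coroutine
  stop_queue_workers();           // only then stop/destroy the manager itself
  // while an endpoint is down, persistent sends fail with -ESRCH and are retried later
}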

@yuvalif
Contributor Author

yuvalif commented Apr 21, 2024

lifecycle tests are failing because they seem to run from the radosgw-admin ?!?
i get the following error logs from the test script (not the rgw):

2024-04-21T13:48:54.509+0000 7fae7d7fa640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.509+0000 7fae7d7fa640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo5', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.509+0000 7fae7cff9640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.509+0000 7fae7cff9640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo9', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.509+0000 7fae73fff640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.509+0000 7fae73fff640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo1', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.511+0000 7fae7d7fa640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.511+0000 7fae7d7fa640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo6', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.511+0000 7fae7cff9640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.511+0000 7fae73fff640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.511+0000 7fae7cff9640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo7', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.511+0000 7fae73fff640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo2', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.512+0000 7fae7d7fa640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.512+0000 7fae7d7fa640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo3', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.512+0000 7fae73fff640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.512+0000 7fae73fff640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo4', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.513+0000 7fae7cff9640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.513+0000 7fae7cff9640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo0', event type: 'ObjectLifecycle:Expiration:Current'
2024-04-21T13:48:54.514+0000 7fae7d7fa640  1 ERROR: send failed. http endpoint manager not running
2024-04-21T13:48:54.514+0000 7fae7d7fa640  1 lifecycle: ERROR: failed to push sync notification event with error: -3 for event with notification id: 'zuodtj-5_notif', topic: '', endpoint: '', bucket_owner: 'zone.user', bucket: 'zuodtj-5', object: 'ooo8', event type: 'ObjectLifecycle:Expiration:Current'
wait for 5sec for the messages...

@cbodley
Contributor

cbodley commented Apr 22, 2024

lifecycle tests are failing because they seem to run from the radosgw-admin ?!?

@yuvalif this was added in #39192. lifecycle only runs once a day by default, so radosgw-admin lc process is a good way for tests to process everything synchronously

for s3-tests, we don't have access to radosgw-admin so we rely on rgw_lc_debug_interval to run lifecycle processing far more often, but that makes test cases finicky with regards to timing

@yuvalif
Contributor Author

yuvalif commented Apr 24, 2024

lifecycle tests are failing because they seem to run from the radosgw-admin ?!?

@yuvalif this was added in #39192. lifecycle only runs once a day by default, so radosgw-admin lc process is a good way for tests to process everything synchronously

for s3-tests, we don't have access to radosgw-admin so we rely on rgw_lc_debug_interval to run lifecycle processing far more often, but that makes test cases finicky with regards to timing

@mattbenjamin, @cbodley
addressed the lc tests in: 9e8cc6f

event_keys = []
events = http_server.get_and_reset_events()
assert_equal(number_of_objects * 2, len(events))
assert number_of_objects * 2 <= len(events)
Contributor

why change this check to look for <= 2*number_of_objects events?
is it because the persistent notifications are not delivering all the events?
since wait_for_queue_to_drain is already verified before, all the events should have been delivered and we should receive exactly number_of_objects * 2 events

Contributor Author

persistent notifications may deliver the same event more than once - the delivery guarantee for persistent notifications is "at least once".
since we have 2 events per object, we should receive at least twice the number of objects

@yuvalif
Contributor Author

yuvalif commented May 1, 2024

jenkins test api

@yuvalif
Contributor Author

yuvalif commented May 1, 2024

jenkins test make check

@yuvalif
Contributor Author

yuvalif commented May 2, 2024

jenkins test make check

delete s_manager;
s_manager = nullptr;
if (!s_manager) return;
RGWPubSubEndpoint::shutdown_all();
Contributor

is this guaranteed to trigger cancellation of outstanding completions?

Contributor Author

if the endpoints are all stopped, there is nothing to invoke the callbacks for the completions.

Contributor

if outstanding completions aren't canceled, i would expect process_entry()'s calls to push_endpoint->send() to hang and prevent the io_context threads from joining

separately, what happens if we call push_endpoint->send() after RGWPubSubEndpoint::shutdown_all()?

Contributor Author

if outstanding completions aren't canceled, i would expect process_entry()'s calls to push_endpoint->send() to hang and prevent the io_context threads from joining

separately, what happens if we call push_endpoint->send() after RGWPubSubEndpoint::shutdown_all()?

the "manager" pointers should be deleted and set to null. this should return an "-ESRCH" error.
however, i do see this crash:

     "/lib64/libc.so.6(+0x3e6f0) [0x7fb79c23e6f0]",
        "__pthread_rwlock_wrlock()",
        "(RGWHTTPManager::register_request(rgw_http_req_data*)+0x37) [0x55a1f56c76e7]",
        "(RGWHTTPManager::add_request(RGWHTTPClient*)+0x168) [0x55a1f56c8c48]",
        "radosgw(+0x80ab13) [0x55a1f58b3b13]",
        "radosgw(+0x80f52b) [0x55a1f58b852b]",
        "(void boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits>, spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor>, rgw::notify::Manager::process_queue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::any_io_executor> >)#7}, boost::context::basic_protected_fixedsize_stack<boost::context::stack_traits> >::operator()()::{lambda(boost::context::continuation&&)#1}> >(boost::context::detail::transfer_t)+0x55) [0x55a1f58b92f5]",
        "make_fcontext()"

probably a race condition with the deletion of the pointer.
will protect it with a shared mutex, so that the different senders don't block one another, and only the shutdown takes the lock exclusively
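
a minimal sketch of that locking scheme with plain std::shared_mutex (generic code, not the actual endpoint manager):

// Senders take the mutex shared so they don't serialize each other; shutdown
// takes it exclusive, waits for in-flight sends, then deletes the manager.
#include <cerrno>
#include <memory>
#include <shared_mutex>

struct Endpoint { int send() { return 0; } };   // toy stand-in

std::shared_mutex manager_mutex;
std::unique_ptr<Endpoint> s_endpoint = std::make_unique<Endpoint>();

int send_via_endpoint() {
  std::shared_lock lock(manager_mutex);
  if (!s_endpoint) return -ESRCH;                // already shut down
  return s_endpoint->send();
}

void shutdown_endpoint() {
  std::unique_lock lock(manager_mutex);
  s_endpoint.reset();
}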

Contributor

however, i don't see any code there that cancels any pending work.
@cbodley do you think that calling "RGWHTTPManager::manage_pending_requests()" would fix that?

it looks like RGWHTTPManager::reqs_thread_entry(), below the while (!going_down) loop, does loop over outstanding requests to call _unlink_request() -> _finish_request(req_data, -ECANCELED);

@yuvalif
Contributor Author

yuvalif commented May 9, 2024

teuthology is failing when sending http notifications after there were (intentional) http notification errors:
e.g. http://qa-proxy.ceph.com/teuthology/yuvalif-2024-05-09_15:36:24-rgw:notifications-wip-yuval-65337-distro-default-smithi/7699703/teuthology.log

the http server successfully started mid-test, and handled an empty test request sent to it from the test:

2024-05-09T16:28:03.466 INFO:teuthology.orchestra.run.smithi039.stderr:bucket_notification.test_bn: INFO: http server created on ('localhost', 16243)
2024-05-09T16:28:03.466 INFO:teuthology.orchestra.run.smithi039.stderr:bucket_notification.test_bn: INFO: http server started on ('localhost', 16243)
2024-05-09T16:28:03.466 INFO:teuthology.orchestra.run.smithi039.stderr:urllib3.connectionpool: DEBUG: Starting new HTTP connection (1): localhost:16243
2024-05-09T16:28:03.466 INFO:teuthology.orchestra.run.smithi039.stderr:bucket_notification.test_bn: INFO: HTTP Server received iempty event
2024-05-09T16:28:03.466 INFO:teuthology.orchestra.run.smithi039.stderr:urllib3.connectionpool: DEBUG: http://localhost:16243 "POST / HTTP/1.1" 200 None

however, the RGW log is showing retry failures until the test ends and deletes the topic.
the test is polling the persistent queue for 155 seconds:

2024-05-09T16:28:03.470 INFO:teuthology.orchestra.run.smithi039.stderr:bucket_notification.test_bn: INFO: queue kcvqaq-23_topic has 20 entries after 155s
2024-05-09T16:28:03.470 INFO:teuthology.orchestra.run.smithi039.stderr:bucket_notification.test_bn: WARNING: queue kcvqaq-23_topic still has 20 entries after 155s

and we can see that each notification is retried 155 times (the retries are 1 second apart per notification):

2713975:2024-05-09T16:24:31.100+0000 7fc78653c640 20 rgw notify: Processing event entry with notification id: 'kcvqaq-23_notif', topic: 'kcvqaq-23_topic', endpoint: 'http://localhost:16243', bucket_owner: 'foo.client.0', bucket: 'kcvqaq-23', object: 'key-5', event type: 'ObjectCreated:Put' retry_number: 154 current time: 2024-05-09T16:24:31.100160+0000
2713981:2024-05-09T16:24:31.100+0000 7fc78653c640 20 rgw notify: Processing event entry with notification id: 'kcvqaq-23_notif', topic: 'kcvqaq-23_topic', endpoint: 'http://localhost:16243', bucket_owner: 'foo.client.0', bucket: 'kcvqaq-23', object: 'key-3', event type: 'ObjectCreated:Put' retry_number: 154 current time: 2024-05-09T16:24:31.100160+0000

other teuthology runs show similar failures (no crashes):

will open a tracker for the http client issue. IMO, this PR can be merged as it fixes the reload crash


github-actions bot commented May 9, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Yuval Lifshitz added 3 commits May 12, 2024 10:16
as well as in:
- multisite tests (used for notification v2 migration tests)
- the qa suites running notifications

enable lifecycle logs in notification tests: for the lc notification test cases

this is needed after: 4299679

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
Yuval Lifshitz added 4 commits May 12, 2024 10:39
* tests were passing only because they were not performing their asserts
* tests are now separated with their own attribute
* their topics are now marked "persistent" to workaround the issue in:
  https://tracker.ceph.com/issues/65645

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
for example, job: 7697397
in test: yuvalif-2024-05-08_09:55:02-rgw:notifications-wip-yuval-65337-distro-default-smithi

also reduce the size of the error log by sending fewer objects to the
test_ps_s3_persistent_topic_stats test

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
in tests that require retries

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
fail the test if not, to indicate this is a test issue
and not a product bug

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
@yuvalif
Contributor Author

yuvalif commented May 13, 2024

jenkins test api

@yuvalif yuvalif merged commit 5327afb into ceph:main May 13, 2024
11 checks passed
@cbodley
Contributor

cbodley commented May 14, 2024

@yuvalif can you please prioritize the squid backport for https://tracker.ceph.com/issues/65996?
