New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rbd-nbd: fix stuck with disable request #50593
Conversation
Local client logs:
After 5 seconds (notifier timeout) you can see Remote client logs:
Note:
ImageUpdateWatchers::flush() hangs as its added to the worker queue, while it is waiting on the features update, this will simply block until the notifier timeout to trigger which then fails the entire async request with a timeout. But since |
@idryomov I had run qa/workunits/rbd/rbd-nbd.sh with these changes, it also contains resize test and many more nbd related, it returned OK. Here are the relevant nbd_test.log Thanks! |
src/librbd/ImageState.cc
Outdated
m_work_queue->queue(ctx, 0); | ||
return; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trociny do you remember why the above check is added?
Please see the tracker its 100% reproducible . Always we hit "update features timed out notifying lock owner".
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we just remove this block the flush
will not make sense.
The flush
is called when we want to make sure all in-flight notifications are complete (see ImageWatcher::handle_payload(HeaderUpdatePayload)
where it is called). So if there are in-flight notification we queue the completion callback so it will be called after all previously queued notifications are processed.
So the question should actually be why it gets stuck there, and you should look at the rbd-nbd side, where it actually gets stuck. The rbd-nbd registers a watcher callback to detect image resize, see NBDWatchCtx::handle_notify
. I added some debug to this callback (when it enters and returns) and running rbd-nbd with debug_rbd=30
I see:
2023-04-17T18:46:03.056+0100 7f58537fe6c0 20 librbd::DisableFeaturesRequest: 0x7f5824010bc0 handle_set_features: r=0
2023-04-17T18:46:03.056+0100 7f58537fe6c0 20 librbd::DisableFeaturesRequest: 0x7f5824010bc0 send_notify_update
2023-04-17T18:46:03.056+0100 7f58537fe6c0 20 librbd::ImageState: 0x5561f94f4580 handle_update_notification: refresh_seq = 2, last_refresh = 1
2023-04-17T18:46:03.056+0100 7f58537fe6c0 20 librbd::ImageState: 0x5561f97028a0 ImageUpdateWatchers::notify
2023-04-17T18:46:03.056+0100 7f58537fe6c0 20 librbd::ImageState: 0x5561f97028a0 ImageUpdateWatchers::send_notify: handle=0, watcher=0x7ffd9a7c4900
2023-04-17T18:46:03.056+0100 7f58537fe6c0 10 librbd::ImageWatcher: 0x7f582c008b60: notify_header_update
2023-04-17T18:46:03.056+0100 7f58537fe6c0 20 librbd::watcher::Notifier: 0x7f582c008c20 notify: pending=1
2023-04-17T18:46:03.056+0100 7f580ffff6c0 20 librbd::ImageState: 0x5561f97028a0 ImageUpdateWatchers::handle_notify: handle=0, watcher=0x7ffd9a7c4900
2023-04-17T18:46:03.056+0100 7f580ffff6c0 5 rbd-nbd: XXXMG: handle_notify: enter
2023-04-17T18:46:03.056+0100 7f580ffff6c0 20 librbd: info 0x5561f96b28f0
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 5 librbd::Watcher: 0x7f582c008b60 notifications_blocked: blocked=0
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 10 librbd::Watcher::C_NotifyAck 0x7f582c019470 C_NotifyAck: id=1151051236169, handle=140016537847264
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 10 librbd::ImageWatcher: 0x7f582c008b60 image header updated
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 20 librbd::ImageState: 0x5561f94f4580 handle_update_notification: refresh_seq = 3, last_refresh = 1
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 20 librbd::ImageState: 0x5561f97028a0 ImageUpdateWatchers::notify
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 20 librbd::ImageState: 0x5561f97028a0 ImageUpdateWatchers::send_notify: handle=0, watcher=0x7ffd9a7c4900
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 20 librbd::ImageState: 0x5561f94f4580 flush_update_watchers
2023-04-17T18:46:03.056+0100 7f5852ffd6c0 20 librbd::ImageState: 0x5561f97028a0 ImageUpdateWatchers::flush
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::watcher::Notifier: 0x7f582c008c20 handle_notify: r=-110
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::watcher::Notifier: 0x7f582c008c20 handle_notify: pending=0
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::DisableFeaturesRequest: 0x7f5824010bc0 handle_notify_update: r=-110
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::DisableFeaturesRequest: 0x7f5824010bc0 handle_finish: r=-110
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::ExclusiveLock: 0x7f5824001d40 unblock_requests
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 5 librbd::io::WriteBlockImageDispatch: 0x5561f954f740 unblock_writes: 0x5561f96b28f0, num=0
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 10 librbd::ImageState: 0x5561f94f4580 handle_prepare_lock_complete
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 10 librbd::ImageState: 0x5561f94f4580 0x5561f94f4580 send_refresh_unlock
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 send_v2_get_mutable_metadata
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 10 librbd::Request: 0x7f5824010bc0 create_context_finisher
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 10 librbd::Request: 0x7f5824010bc0 finish: r=-110
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::Operations: finish_op: update features r=-110
2023-04-17T18:46:08.056+0100 7f58327fc6c0 20 librbd::ImageWatcher: 0x7f582c008b60 remote async request finished: [19699,139671329929968,1] = -110
2023-04-17T18:46:08.056+0100 7f58327fc6c0 20 librbd::watcher::Notifier: 0x7f582c008c20 notify: pending=1
2023-04-17T18:46:08.060+0100 7f58537fe6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 handle_v2_get_mutable_metadata: r=0
2023-04-17T18:46:08.060+0100 7f58537fe6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 send_v2_get_parent: legacy=0
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 5 librbd::Watcher: 0x7f582c008b60 notifications_blocked: blocked=0
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 10 librbd::Watcher::C_NotifyAck 0x7f581c0066b0 C_NotifyAck: id=1151051236170, handle=140016537847264
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 20 librbd::ImageWatcher: remove_async_request: [19699,139671329929968,1]
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 10 librbd::Watcher::C_NotifyAck 0x7f581c0066b0 finish: r=0
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 handle_v2_get_parent: r=0
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 send_v2_get_metadata
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 15 librbd::image::GetMetadataRequest: 0x7f584c00a950 metadata_list: start_key=conf_
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 20 librbd::watcher::Notifier: 0x7f582c008c20 handle_notify: r=0
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 20 librbd::watcher::Notifier: 0x7f582c008c20 handle_notify: pending=0
2023-04-17T18:46:08.060+0100 7f5852ffd6c0 20 librbd::ImageWatcher: 0x7f582c008b60 handle_async_complete: request=[19699,139671329929968,1], r=0
2023-04-17T18:46:08.060+0100 7f58537fe6c0 15 librbd::image::GetMetadataRequest: 0x7f584c00a950 handle_metadata_list: r=0
2023-04-17T18:46:08.060+0100 7f58537fe6c0 15 librbd::image::GetMetadataRequest: 0x7f584c00a950 finish: r=0
2023-04-17T18:46:08.060+0100 7f58537fe6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 handle_v2_get_metadata: r=0
2023-04-17T18:46:08.060+0100 7f58537fe6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 send_v2_get_pool_metadata
2023-04-17T18:46:08.060+0100 7f58537fe6c0 15 librbd::image::GetMetadataRequest: 0x7f584c00af40 metadata_list: start_key=conf_
2023-04-17T18:46:08.064+0100 7f5852ffd6c0 15 librbd::image::GetMetadataRequest: 0x7f584c00af40 handle_metadata_list: r=0
2023-04-17T18:46:08.064+0100 7f5852ffd6c0 15 librbd::image::GetMetadataRequest: 0x7f584c00af40 finish: r=0
2023-04-17T18:46:08.064+0100 7f5852ffd6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 handle_v2_get_pool_metadata: r=0
2023-04-17T18:46:08.064+0100 7f5852ffd6c0 20 librbd::ImageCtx: apply_metadata
2023-04-17T18:46:08.064+0100 7f5852ffd6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 send_v2_get_group
2023-04-17T18:46:08.064+0100 7f58537fe6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 handle_v2_get_group: r=0
2023-04-17T18:46:08.064+0100 7f58537fe6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 send_v2_apply
2023-04-17T18:46:08.064+0100 7f58537fe6c0 10 librbd::image::RefreshRequest: 0x7f5804003520 handle_v2_apply
2023-04-17T18:46:08.064+0100 7f58537fe6c0 20 librbd::image::RefreshRequest: 0x7f5804003520 apply
2023-04-17T18:46:08.064+0100 7f58537fe6c0 10 librbd::ImageState: 0x5561f94f4580 0x5561f94f4580 handle_refresh: r=0
2023-04-17T18:46:08.064+0100 7f580ffff6c0 5 rbd-nbd: XXXMG: handle_notify: finish
2023-04-17T18:46:08.064+0100 7f580ffff6c0 20 librbd::ImageState: 0x5561f97028a0 ImageUpdateWatchers::handle_notify: handle=0, watcher=0x7ffd9a7c4900
2023-04-17T18:46:08.064+0100 7f580ffff6c0 5 rbd-nbd: XXXMG: handle_notify: enter
Note, where rbd-nbd: XXXMG: handle_notify: enter
is printed and where rbd-nbd: XXXMG: handle_notify: finish
is printed. The handle_notify
calls image info
method, which calls refresh_if_required
and it got stuck there.
The problem seems that we fire ImageUpdateWatchers
notifications twice. Note, DisableFeaturesRequest
calls send_notify_update
, which actually calls ImageCtx::notify_update
, which is currently:
void ImageCtx::notify_update() {
state->handle_update_notification();
ImageWatcher<>::notify_header_update(md_ctx, header_oid);
}
The state->handle_update_notification()
will fire ImageUpdateWatchers
notifications (the rbd-nbd handle_notify
is called) but does not wait it for complete and then it calls ImageWatcher<>::notify_header_update
, which will send the "header update" notification for all image watchers, including itself, so it will receive it itself and will fire ImageUpdateWatchers
the second time interfering with the first call.
The question is do we really need state->handle_update_notification()
in ImageCtx::notify_update()
? If we are going to receive our own notification anyway, may be we may just drop it.
If we really need it by some reason (not evident me right now) then another approach would be to modify it not to just fire the ImageUpdateWatchers
notification but actually wait for it to complete before proceeding with ImageWatcher<>::notify_header_update
. This will complicate the code a little, as instead of void ImageCtx::notify_update()
we will have to use ImageCtx::notify_update(Context *on_finish)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the detailed answer and logs @trociny :-) That really helps.
The question is do we really need state->handle_update_notification() in ImageCtx::notify_update()? If we are going to receive our own notification anyway, may be we may just drop it.
This sounds really reasonable to drop it as we seem to be getting our own notification anyway.
@idryomov any comments ? can we go ahead and make the changes to remove it from both void ImageCtx::notify_update()
and ImageCtx::notify_update(Context *on_finish)
or do you like to keep it as it is and think we should wait for handle_update_notification() to complete (atleast this was not the case in other places where notify_header_update() is called).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trociny I tried both the solutions you suggested, and it didn't solve the infinite loop issue:
2023-04-18T21:18:45.693+0530 7f0a450da640 20 librbd::ImageState: 0x55a75257c6b0 ImageUpdateWatchers::send_notify: handle=0, watcher=0x7ffc3a442d90
2023-04-18T21:18:45.693+0530 7f0a450da640 20 librbd::ImageState: 0x55a7523839a0 flush_update_watchers
2023-04-18T21:18:45.693+0530 7f0a450da640 20 librbd::ImageState: 0x55a75257c6b0 ImageUpdateWatchers::flush
2023-04-18T21:18:45.694+0530 7f0a1dffb640 20 librbd::ImageState: 0x55a75257c6b0 ImageUpdateWatchers::handle_notify: handle=0, watcher=0x7ffc3a442d90
2023-04-18T21:18:45.694+0530 7f0a1dffb640 5 rbd-nbd: handle_notify Prasanna: enter
2023-04-18T21:18:45.694+0530 7f0a1dffb640 20 librbd: info 0x55a75252e2f0
2023-04-18T21:18:50.695+0530 7f0a448d9640 20 librbd::watcher::Notifier: 0x7f0a10007900 handle_notify: r=-110
2023-04-18T21:18:50.695+0530 7f0a448d9640 20 librbd::watcher::Notifier: 0x7f0a10007900 handle_notify: pending=0
2023-04-18T21:18:50.695+0530 7f0a448d9640 20 librbd::DisableFeaturesRequest: 0x7f0a340339d0 handle_notify_update: r=-110
2023-04-18T21:18:50.695+0530 7f0a448d9640 20 librbd::DisableFeaturesRequest: 0x7f0a340339d0 handle_finish: r=-110
2023-04-18T21:18:50.695+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 unblock_requests
2023-04-18T21:18:50.695+0530 7f0a448d9640 5 librbd::io::WriteBlockImageDispatch: 0x55a7523dd750 unblock_writes: 0x55a75252e2f0, num=0
2023-04-18T21:18:50.695+0530 7f0a448d9640 10 librbd::ImageState: 0x55a7523839a0 handle_prepare_lock_complete
2023-04-18T21:18:50.695+0530 7f0a448d9640 10 librbd::ImageState: 0x55a7523839a0 0x55a7523839a0 send_refresh_unlock
2023-04-18T21:18:50.695+0530 7f0a448d9640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 send_v2_get_mutable_metadata
2023-04-18T21:18:50.695+0530 7f0a448d9640 10 librbd::Request: 0x7f0a340339d0 create_context_finisher
2023-04-18T21:18:50.695+0530 7f0a448d9640 10 librbd::Request: 0x7f0a340339d0 finish: r=-110
2023-04-18T21:18:50.695+0530 7f0a448d9640 20 librbd::Operations: finish_op: update features r=-110
2023-04-18T21:18:50.695+0530 7f0a1ffff640 20 librbd::ImageWatcher: 0x7f0a10007840 remote async request finished: [4305,140018014309120,1] = -110
2023-04-18T21:18:50.695+0530 7f0a1ffff640 20 librbd::watcher::Notifier: 0x7f0a10007900 notify: pending=1
2023-04-18T21:18:50.696+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 handle_v2_get_mutable_metadata: r=0
2023-04-18T21:18:50.696+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 send_v2_get_parent: legacy=0
2023-04-18T21:18:50.696+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.696+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 C_NotifyAck: id=244813135878, handle=139681060649408
2023-04-18T21:18:50.696+0530 7f0a448d9640 20 librbd::ImageWatcher: remove_async_request: [4305,140018014309120,1]
2023-04-18T21:18:50.696+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 finish: r=0
2023-04-18T21:18:50.697+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 handle_v2_get_parent: r=0
2023-04-18T21:18:50.697+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 send_v2_get_metadata
2023-04-18T21:18:50.697+0530 7f0a450da640 15 librbd::image::GetMetadataRequest: 0x7f0a3801cf40 metadata_list: start_key=conf_
2023-04-18T21:18:50.697+0530 7f0a448d9640 20 librbd::watcher::Notifier: 0x7f0a10007900 handle_notify: r=0
2023-04-18T21:18:50.697+0530 7f0a448d9640 20 librbd::watcher::Notifier: 0x7f0a10007900 handle_notify: pending=0
2023-04-18T21:18:50.697+0530 7f0a448d9640 20 librbd::ImageWatcher: 0x7f0a10007840 handle_async_complete: request=[4305,140018014309120,1], r=0
2023-04-18T21:18:50.698+0530 7f0a450da640 15 librbd::image::GetMetadataRequest: 0x7f0a3801cf40 handle_metadata_list: r=0
2023-04-18T21:18:50.698+0530 7f0a450da640 15 librbd::image::GetMetadataRequest: 0x7f0a3801cf40 finish: r=0
2023-04-18T21:18:50.698+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 handle_v2_get_metadata: r=0
2023-04-18T21:18:50.698+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 send_v2_get_pool_metadata
2023-04-18T21:18:50.698+0530 7f0a450da640 15 librbd::image::GetMetadataRequest: 0x55a75258c050 metadata_list: start_key=conf_
2023-04-18T21:18:50.699+0530 7f0a448d9640 15 librbd::image::GetMetadataRequest: 0x55a75258c050 handle_metadata_list: r=0
2023-04-18T21:18:50.699+0530 7f0a448d9640 15 librbd::image::GetMetadataRequest: 0x55a75258c050 finish: r=0
2023-04-18T21:18:50.699+0530 7f0a448d9640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 handle_v2_get_pool_metadata: r=0
2023-04-18T21:18:50.699+0530 7f0a448d9640 20 librbd::ImageCtx: apply_metadata
2023-04-18T21:18:50.699+0530 7f0a448d9640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 send_v2_get_group
2023-04-18T21:18:50.700+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 handle_v2_get_group: r=0
2023-04-18T21:18:50.700+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 send_v2_apply
2023-04-18T21:18:50.700+0530 7f0a450da640 10 librbd::image::RefreshRequest: 0x7f0a1000dd60 handle_v2_apply
2023-04-18T21:18:50.700+0530 7f0a450da640 20 librbd::image::RefreshRequest: 0x7f0a1000dd60 apply
2023-04-18T21:18:50.700+0530 7f0a450da640 10 librbd::ImageState: 0x55a7523839a0 0x55a7523839a0 handle_refresh: r=0
2023-04-18T21:18:50.700+0530 7f0a1dffb640 5 rbd-nbd: handle_notify Prasanna: finish
2023-04-18T21:18:50.700+0530 7f0a1dffb640 20 librbd::ImageState: 0x55a75257c6b0 ImageUpdateWatchers::operator(): completing flush
2023-04-18T21:18:50.700+0530 7f0a1dffb640 10 librbd::ImageWatcher: 0x7f0a1000fae0 C_ResponseMessage: r=0
2023-04-18T21:18:50.700+0530 7f0a1dffb640 10 librbd::Watcher::C_NotifyAck 0x7f0a34001fc0 finish: r=0
2023-04-18T21:18:50.708+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.708+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 C_NotifyAck: id=244813135879, handle=139681060649408
2023-04-18T21:18:50.708+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.708+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.708+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 finish: r=0
2023-04-18T21:18:50.715+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.715+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 C_NotifyAck: id=244813135880, handle=139681060649408
2023-04-18T21:18:50.715+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.715+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.715+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 finish: r=0
2023-04-18T21:18:50.767+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.767+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 C_NotifyAck: id=244813135881, handle=139681060649408
2023-04-18T21:18:50.767+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.767+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.767+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 finish: r=0
2023-04-18T21:18:50.773+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.773+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 C_NotifyAck: id=244813135882, handle=139681060649408
2023-04-18T21:18:50.773+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.773+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.773+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 finish: r=0
2023-04-18T21:18:50.779+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.779+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 C_NotifyAck: id=244813135883, handle=139681060649408
2023-04-18T21:18:50.779+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.779+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.779+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 finish: r=0
2023-04-18T21:18:50.784+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.784+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 C_NotifyAck: id=244813135884, handle=139681060649408
2023-04-18T21:18:50.784+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.784+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.784+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 finish: r=0
2023-04-18T21:18:50.790+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.791+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 C_NotifyAck: id=244813135885, handle=139681060649408
2023-04-18T21:18:50.791+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.791+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.791+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 finish: r=0
2023-04-18T21:18:50.797+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.797+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 C_NotifyAck: id=244813135886, handle=139681060649408
2023-04-18T21:18:50.797+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.797+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.797+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 finish: r=0
2023-04-18T21:18:50.804+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.804+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 C_NotifyAck: id=244813135887, handle=139681060649408
2023-04-18T21:18:50.804+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.804+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.804+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 finish: r=0
2023-04-18T21:18:50.810+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.810+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 C_NotifyAck: id=244813135888, handle=139681060649408
2023-04-18T21:18:50.810+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.810+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.810+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 finish: r=0
2023-04-18T21:18:50.817+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.817+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 C_NotifyAck: id=244813135889, handle=139681060649408
2023-04-18T21:18:50.817+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.817+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.817+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 finish: r=0
2023-04-18T21:18:50.824+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.824+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 C_NotifyAck: id=244813135890, handle=139681060649408
2023-04-18T21:18:50.824+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.824+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.824+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 finish: r=0
2023-04-18T21:18:50.831+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.831+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 C_NotifyAck: id=244813135891, handle=139681060649408
2023-04-18T21:18:50.831+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.831+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.831+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 finish: r=0
2023-04-18T21:18:50.837+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.837+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 C_NotifyAck: id=244813135892, handle=139681060649408
2023-04-18T21:18:50.837+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.837+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.837+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 finish: r=0
2023-04-18T21:18:50.845+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.845+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 C_NotifyAck: id=244813135893, handle=139681060649408
2023-04-18T21:18:50.845+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.845+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.845+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 finish: r=0
2023-04-18T21:18:50.851+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.851+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 C_NotifyAck: id=244813135894, handle=139681060649408
2023-04-18T21:18:50.851+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.851+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.851+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 finish: r=0
2023-04-18T21:18:50.858+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.858+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 C_NotifyAck: id=244813135895, handle=139681060649408
2023-04-18T21:18:50.858+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.858+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.858+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 finish: r=0
2023-04-18T21:18:50.864+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.864+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 C_NotifyAck: id=244813135896, handle=139681060649408
2023-04-18T21:18:50.864+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.864+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.864+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f0a34002010 finish: r=0
2023-04-18T21:18:50.870+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.870+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 C_NotifyAck: id=244813135897, handle=139681060649408
2023-04-18T21:18:50.870+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.870+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.870+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a04001c70 finish: r=0
2023-04-18T21:18:50.876+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.876+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 C_NotifyAck: id=244813135898, handle=139681060649408
2023-04-18T21:18:50.876+0530 7f0a450da640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.876+0530 7f0a450da640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.876+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f000d390 finish: r=0
2023-04-18T21:18:50.904+0530 7f0a448d9640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.904+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 C_NotifyAck: id=244813135899, handle=139681060649408
2023-04-18T21:18:50.904+0530 7f0a448d9640 10 librbd::ImageWatcher: 0x7f0a10007840 remote update_features request: [4305,140018014309120,1] 64 disabled
2023-04-18T21:18:50.904+0530 7f0a448d9640 20 librbd::ExclusiveLock: 0x7f0a08000b60 accept_request=1 (request_type=0)
2023-04-18T21:18:50.904+0530 7f0a448d9640 10 librbd::Watcher::C_NotifyAck 0x7f0a34009b90 finish: r=0
2023-04-18T21:18:50.910+0530 7f0a450da640 5 librbd::Watcher: 0x7f0a10007840 notifications_blocked: blocked=0
2023-04-18T21:18:50.910+0530 7f0a450da640 10 librbd::Watcher::C_NotifyAck 0x7f09f4004270 C_NotifyAck: id=244813135900, handle=139681060649408
The rbd-nbd notify_handle() is not done in 5 seconds notify timeout.
[pkalever@localhost build]$ git diff ../src/librbd/ImageCtx.cc
diff --git a/src/librbd/ImageCtx.cc b/src/librbd/ImageCtx.cc
index b990d8adc76..c4d4fe45385 100644
--- a/src/librbd/ImageCtx.cc
+++ b/src/librbd/ImageCtx.cc
@@ -897,12 +897,10 @@ librados::IoCtx duplicate_io_ctx(librados::IoCtx& io_ctx) {
}
void ImageCtx::notify_update() {
- state->handle_update_notification();
ImageWatcher<>::notify_header_update(md_ctx, header_oid);
}
void ImageCtx::notify_update(Context *on_finish) {
- state->handle_update_notification();
image_watcher->notify_header_update(on_finish);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@idryomov any comments ? can we go ahead and make the changes to remove it from both
void ImageCtx::notify_update()
andImageCtx::notify_update(Context *on_finish)
I don't think that would help. The issue that @trociny shed some light on is not that there are two update notifications but that the update callback in rbd-nbd gets stuck. It appears to be getting stuck in ImageState::refresh_if_required()
and that is likely because DisableFeaturesRequest
issues update notifications while still holding onto the exclusive lock with everything that has to do with it blocked. Notice how refresh only starts executing after DisableFeaturesRequest
gets through handle_prepare_lock_complete()
in handle_finish
:
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::DisableFeaturesRequest: 0x7f5824010bc0 handle_finish: r=-110
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 20 librbd::ExclusiveLock: 0x7f5824001d40 unblock_requests
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 5 librbd::io::WriteBlockImageDispatch: 0x5561f954f740 unblock_writes: 0x5561f96b28f0, num=0
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 10 librbd::ImageState: 0x5561f94f4580 handle_prepare_lock_complete
2023-04-17T18:46:08.056+0100 7f5852ffd6c0 10 librbd::ImageState: 0x5561f94f4580 0x5561f94f4580 send_refresh_unlock
I think the same hang would happen even with a single "header update" notification because, as part of processing it, ImageWatcher
would still want to flush update callbacks:
template <typename I>
bool ImageWatcher<I>::handle_payload(const HeaderUpdatePayload &payload,
C_NotifyAck *ack_ctx) {
ldout(m_image_ctx.cct, 10) << this << " image header updated" << dendl;
m_image_ctx.state->handle_update_notification();
m_image_ctx.perfcounter->inc(l_librbd_notify);
if (ack_ctx != nullptr) {
m_image_ctx.state->flush_update_watchers(new C_ResponseMessage(ack_ctx));
return false;
}
return true;
}
So the fix for this "where ETIMEDOUT
error is coming from" issue would probably be to see if it's possible to change DisableFeaturesRequest
to lift all blocks that it puts in place before it issues any update notifications.
But there is a another, let's call it "where the busy loop on rbd feature disable
side is coming from", issue though. ETIMEDOUT
error is not that big of deal, it could occur naturally if the update callback just did something that took a while. We can't really control what the user executes there, and, in general, have to be resilient to various things timing out. In my view this second issue is much more serious.
I suspect that it has to do with the fact async request processing code "remembers" the error code and then instead of figuring out that the request is in fact done (feature is disabled) on a retry, rbd feature disable
gets ETIMEDOUT
again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- RPC timeout erros
- Actual operation errors (timeout or anything else)
- Notifiers timeout (say the actual operation is in the notifier pending queue and is timeout because the previous ones didn't finish on time)
How should we differentiate them?
I don't think (1) and (3) need to be distinguished. As for how, this is for you to propose after examining the on-wire protocol and the code on both sides. I suspect that the issue boils down to only one 32-bit integer getting communicated which causes the different errors to collide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DisableFeaturesRequest
calls ImageCtx::notify_update
, which first calls state->handle_update_notification
(where the nbd callback is fired the first time and gets stuck) and not waiting for the result it calls image_watcher->notify_header_update
(we receive our own notification, the the nbd callback is fired the second time and gets stuck), and the notification machinery returns ETIMEDOUT
here. I don't see how an additional (distinguishable) error code will be useful here (and where it should be set). I suppose what DisableFeaturesReques
could do is to check the return value of ImageCtx::notify_update
(in DisableFeaturesRequest<I>::handle_notify_update
) and if it is ETIMEDOUT
then reset it to 0, which would mean "the notification might have got stuck but I don't care much".
And I still think it would be useful to get rid off that dup callback, i.e. remove state->handle_update_notification()
form ImageCtx::notify_update
, so only one callback is fired when HeaderUpdatePayload
is received.
And then, if want to go further, we may add async_request_id
to HeaderUpdatePayload
(i.e. make header update notifications "async"), then it would not return ETIMEDOUT
after 5 sec watcher timeout but would wait for the notification to complete as other "async" requests, like "snap create".
I am not sure if we want to go so far though, i.e. should we care so much for notification to be answered by all potential watchers, which might be misbehaving. May be we just want to send the header update notification and proceed. In this case we may want to return to @pkalever's initial approach, i.e. don't call notification flush (but instead of modifying the flush
operation making it NOP, we could just remove m_image_ctx.state->flush_update_watchers
in ImageWatcher<I>::handle_payload(const HeaderUpdatePayload)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DisableFeaturesRequest
callsImageCtx::notify_update
, which first callsstate->handle_update_notification
(where the nbd callback is fired the first time and gets stuck) and not waiting for the result it callsimage_watcher->notify_header_update
(we receive our own notification, the the nbd callback is fired the second time and gets stuck), and the notification machinery returnsETIMEDOUT
here.
As I highlighted earlier and @pkalever confirmed by running a test, the issue is not that the callback gets fired twice. Yes, it's most likely redundant, but the whole thing would get stuck even if state->handle_update_notification
-initiated callback was suppressed. The issue is between a) the body of the callback, b) DisableFeaturesRequest
code and c) the fact that when we receive our own notification, we attempt to "flush" callbacks.
I don't see how an additional (distinguishable) error code will be useful here (and where it should be set).
There is a more general issue here. Consider an abstract operation handler: it's allowed to return an arbitrary error code. If it happens to return ETIMEDOUT
(not necessarily because of a notification-related issue but even on it's own) while being executed on behalf of some other client (i.e. a "remote" request), the client gets into a busy loop. We should either make it so that "genuine" ETIMEDOUT
errors aren't retried (i.e. distinguish them) or make sure that operation handlers never return ETIMEDOUT
.
I'm arguing that a "remote" request should behave exactly the same as a "local" request as far as error propagation goes. Today, if DisableFeaturesRequest
is executed locally and returns ETIMEDOUT
, the error is propagated to a user immediately. If the same DisableFeaturesRequest
is executed remotely and returns ETIMEDOUT
, we get into a busy loop. (Note that I have used DisableFeaturesRequest
just as an example, this applies to all operations that can be proxied.)
I suppose what
DisableFeaturesReques
could do is to check the return value ofImageCtx::notify_update
(inDisableFeaturesRequest<I>::handle_notify_update
) and if it isETIMEDOUT
then reset it to 0, which would mean "the notification might have got stuck but I don't care much".
I'd rather change it to call ImageCtx::notify_update
after unblocking the exclusive lock. But this needs to be investigated -- is DisableFeaturesRequest
unique in what it does, what other operation handlers that block the exclusive lock and/or issue notifications do, etc. We definitely need consistency here.
And I still think it would be useful to get rid off that dup callback, i.e. remove
state->handle_update_notification()
formImageCtx::notify_update
, so only one callback is fired whenHeaderUpdatePayload
is received.
No argument here but this is a separate, third, issue. When considered against 1) DisableFeaturesRequest
generating ETIMEDOUT
error and 2) ETIMEDOUT
error getting mishandled in the "remote" request case, I'd say it's of lowest priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and if it is
ETIMEDOUT
then reset it to 0, which would mean "the notification might have got stuck but I don't care much".I'd rather change it to call
ImageCtx::notify_update
after unblocking the exclusive lock.
To elaborate on this a bit: I really dislike how librbd does this kind of error code munging in various (often non-obvious!) places. One recent example that we ran into while working on client-side encryption that took a while to debug was rbd_flatten
just swallowing EINVAL
errors. I'm guessing that it's done on the assumption that EINVAL
is generated only if there is no parent but this is not the case for encrypted flatten. There are many other examples, some seemingly for the sake of idempotency, some for no good reason at all.
This is super fragile because it's very easy to "introduce" an error that is being swallowed as part of changing some low-level API and not notice. I think we should avoid adding to that pile at all costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were discussing only ETIMEDOUT
issue here, leaving apart the issue with the nbd callback got stuck. So my last comment was about that issue. And sure removing the redundant notification will not fix it but midht simplify our consideration about the fix. And as a possible fix I actually proposed either of two:
-
Ignore
ETIMEDOUT
error returned byImageCtx::notify_update
inDisableFeaturesReques
. -
Make
HeaderUpdate
notification not returnETIMEDOUT
error but resend the request until either success or some other error is returned. I.e. addasync_request_id
to theHeaderUpdatePayload
and deal with it the same way we deal with other async requests (like snap create) that may execute long.
50b395d
to
32f9ca3
Compare
@idryomov @trociny Thanks for the great discussion :-) It was really helpful. In the PR I had a chance to fix couple of the issues discussed above which are w.r.t to https://tracker.ceph.com/issues/58740 , PTAL @idryomov it will be nice if you could please help with a tracker that has details of the left out issues which were part of our discussion and you think are good to have fixes. Here are the logs with this PR:
|
https://tracker.ceph.com/issues/59563 |
32f9ca3
to
0d782df
Compare
Thank you @idryomov ! |
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
889d124
to
597bf2e
Compare
re-basing as there is a merge conflict. |
597bf2e
to
b837602
Compare
b837602
to
881befe
Compare
Honestly, I'm not sure what that But, disregarding that, having to open an additional image context isn't nice indeed. |
I poked around and I see that Lines 119 to 125 in be88249
This goes back to Jewel (2016) and means that waiting for rbd_resize() to return, checking for success and proceeding with something like check_size() on the client isn't reliable even without this change.
I'm leaning towards patching the resize handler in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to patch
the resize handler in fsx to make it wait for the device to actually get resized.
@idryomov I thought you asked me to do something like:
But not sure why still the check_size() is failing...
Note the |
This is not what we discussed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the title of the new commit to "test/librbd/fsx: wait for resize to propagate in krbd_resize()".
89ed73f
to
2c6a79a
Compare
2c6a79a
to
5f9aead
Compare
5f9aead
to
15e343d
Compare
15e343d
to
f0a9241
Compare
With this changes resize request will not be blocked until the resize is completed. Because of this the fsx test fails as it assumes that the request to resize immediately implies changes on the device size. Hence we have to add a wait in resize handler of fsx for the device to actually get resized. Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
f0a9241
to
6f3d0f5
Compare
jenkins test api |
Problem:
Trying to disable any feature on an rbd image mapped with nbd leads to stuck
in rbd-nbd.
The rbd-nbd registers a watcher callback to detect image resize in
NBDWatchCtx::handle_notify(). The handle_notify calls image info method, which
calls refresh_if_required and it got stuck there.
It is getting stuck in ImageState::refresh_if_required() because
DisableFeaturesRequest issues update notifications while still holding onto
the exclusive lock with everything that has to do with it blocked.
Solution:
Set only notify flag as part of NBDWatchCtx::handle_notify() and handle
the resize detection part as part of a different thread.
Fixes: https://tracker.ceph.com/issues/58740
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows