New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hammer:librbd: The qemu VMs hang occasionally after a snapshot is created. #17045
Conversation
…o_flush's callback to eliminate potential deadlock for newly created snapshot. Fixes: http://tracker.ceph.com/issues/21009 Signed-off-by: yupeng chen <chenyupeng-it@360.cn>
@chenyupeng360 hammer is very close to End-of-Life and it's getting less and less probable that there will be another hammer point release. Can you test if the bug is present in luminous or jewel? |
@smithfarm OK, I will try it in Jewel. Since our several large clusters had been running hammer for quite a long time, online upgrading may not be an option. We might choose to deploy Jewel on new clusters in the future. |
src/librbd/ImageCtx.cc
Outdated
@@ -114,7 +114,7 @@ struct C_AsyncCallback : public Context { | |||
: image_ctx(image_ctx), on_finish(on_finish) { | |||
} | |||
virtual void finish(int r) { | |||
image_ctx->op_work_queue->queue(on_finish, r); | |||
image_ctx->writeback_handler->queue(on_finish, r); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the in-memory cache is disabled? Then writeback_handler
will be null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman You're right. Then is it necessary to fork a dedicated thread for callbacks or can we just borrow another thread for that purpose?
Or we can process it conditionally here in C_AsyncCallback.finish() because currently C_AsyncCallback is used solely for asynchronous flush callback:
if object_cacher is enabled:
image_ctx->writeback_handler->queue(on_finish, r);
else:
on_finish.complete(r);
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenyupeng360 The flush_aysnc_operations()
and flush_async_operations(Context*)
methods could be modified to be like [1] where it uses a helper Context
w/ a reentrant_safe
boolean flag. The synchronous (reentrant safe) version can just directly complete.
[1] https://github.com/ceph/ceph/blob/master/src/librbd/ImageCtx.cc#L123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman OK,I will modify the code as above.
…dedicated thread. If a snapshot is created out-of-band, the refreshing of the image caused by the snapshot may block an aio_flush() issued before the creation of the snapshot occasionally. Fixes: http://tracker.ceph.com/issues/21009 Signed-off-by: yupeng chen <chenyupeng-it@360.cn>
@smithfarm This issue can not be reproduced in Jewel. |
@chenyupeng360 Upon closer inspection of the hammer branch, my suggestion won't work. I think the only true fix would be to add another thread (or just upgrade your clients to jewel since it's already been fixed and Hammer will be EoL next week most likely). |
@dillaman As mentioned in the previous post, instead of upgrading from hammer to jewel for existing clusters, we would rather deploy the jewel release from scratch on newly added clusters. So for the existing clusters, we may still have to fix the bug in hammer release ourselves even if it would have reached EoL. As to the second fix, I think it's OK to use the writeback handler for calling flush_cache() whenever possible(i.e. when rbd cache is enabled),could you please tell me what the drawback is or what else I missed. Thanks. |
If a snapshot is created out-of-band, the refreshing of the image caused by the snapshot may block an aio_flush() issued before the creation of the snapshot occasionally. Fixes: http://tracker.ceph.com/issues/21009 Signed-off-by: yupeng chen <chenyupeng-it@360.cn>
@dillaman I pushed my modification, please have a look at it. We will use the latest code to fix the issue in our release. |
src/librbd/ImageCtx.cc
Outdated
@@ -860,7 +864,7 @@ void _flush_async_operations(ImageCtx *ictx, Context *on_finish) { | |||
|
|||
void ImageCtx::flush_async_operations() { | |||
C_SaferCond ctx; | |||
_flush_async_operations(this, &ctx); | |||
_flush_async_operations(this, new C_AsyncCallback(this, &ctx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that is not necessary. I thought that all flush-related callbacks be processed by newly added Finisher, with one potential drawback that synchronous flush callbacks might be delayed since all flush callbacks are pushed onto the Finisher queue and processed sequentially by the Finisher thread. So we can leave it unchanged here.
If a snapshot is created out-of-band, the refreshing of the image caused by the snapshot may block an aio_flush() issued before the creation of the snapshot occasionally. Leave synchronous flush callbacks processed in the ThreadPool's Context. Fixes: http://tracker.ceph.com/issues/21009 Signed-off-by: yupeng chen <chenyupeng-it@360.cn>
Hammer is EOL. |
Hi,guys. Please help to check if my fix is appropriate, thanks.
Fixes: http://tracker.ceph.com/issues/21009
Signed-off-by: yupeng chen chenyupeng-it@360.cn