New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rgw: RGWCoroutine::set_sleeping() checks for null stack #46007
Conversation
|
@ofriedma i saw this commit in #45958, and it looks like a fix for https://tracker.ceph.com/issues/49302 which we're seeing downstream too. i updated the commit and message, and added your Signed-off-by. is that ok? |
|
@cbodley would it make sense to check for null stack in all the places that look plausible? |
cc @yehudasa probably, yeah - if in the past, we worked around issues like this by requiring the sync code to hold references to both the coroutine and its stack (some examples in RGWDataSyncShardCR and RGWMetaSyncCR) i don't think there's a good way to track down all of the other places we could crash like this, so adding null checks to |
|
@cbodley |
src/rgw/rgw_coroutine.cc
Outdated
| } | ||
|
|
||
| int RGWCoroutine::io_block(int ret, int64_t io_id) { | ||
| return io_block(ret, rgw_io_id{io_id, -1}); | ||
| } | ||
|
|
||
| int RGWCoroutine::io_block(int ret, const rgw_io_id& io_id) { | ||
| if (!stack) { | ||
| return 0; // XXX: what does 0 mean? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the returned value matters here anyway? (we should change the comment)
Maybe we should add more checks as you've mentioned to the code where the stack is being called.
Other than those, LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the returned value matters here anyway?
i see that several coroutines in rgw_sync.cc and rgw_data_sync.cc are returning the result with return io_block(0);
if coroutines return an error without calling set_cr_done() or set_cr_error(), it looks like we'd trigger this assertion in RGWCoroutinesStack::operate():
/* should r ever be negative at this point? */
ceph_assert(r >= 0);but if we're in RGWCoroutinesStack::operate(), then RGWCoroutine::stack probably isn't null, so maybe it doesn't matter in this specific case
(we should change the comment)
i removed the comment, thanks
Maybe we should add more checks as you've mentioned to the code where the stack is being called
the crashes do seem to be specific to RGWOmapAppend, so we could resolve https://tracker.ceph.com/issues/49302 by changing its callers to hold an extra reference to its stack. but who knows where else this is broken?
i'm hoping that this change in RGWCoroutine fixes this entire class of bugs, without needing to make the calling code more complicated. but this coroutine stuff itself is complicated, so we may just be breaking other things
users of the RGWOmapAppend coroutine don't manage the lifetime of its underlying coroutine stack, so end up making calls on RGWOmapAppend after its stack goes away. this null check is a band-aid, and there are still several other calls in RGWCoroutine that don't check for null stack Fixes: https://tracker.ceph.com/issues/49302 Signed-off-by: Or Friedmann <ofriedma@redhat.com> Signed-off-by: Casey Bodley <cbodley@redhat.com>
|
this 'passed' qa in https://pulpito.ceph.com/cbodley-2022-04-26_01:02:21-rgw-wip-cbodley-testing-distro-default-smithi/ even though the multisite jobs failed. at least i didn't see any crashes |
|
jenkins test make check |
in local testing, the only multisite failure i saw was |
users of the RGWOmapAppend coroutine don't manage the lifetime of its underlying coroutine stack, so end up making calls on RGWOmapAppend after its stack goes away. this null check is a band-aid, and there are still several other calls in RGWCoroutine that don't check for null stack
Fixes: https://tracker.ceph.com/issues/49302
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windows