Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: RGWCoroutine::set_sleeping() checks for null stack #46007

Merged
merged 1 commit into from Apr 26, 2022

Conversation

cbodley
Copy link
Contributor

@cbodley cbodley commented Apr 22, 2022

users of the RGWOmapAppend coroutine don't manage the lifetime of its underlying coroutine stack, so end up making calls on RGWOmapAppend after its stack goes away. this null check is a band-aid, and there are still several other calls in RGWCoroutine that don't check for null stack

Fixes: https://tracker.ceph.com/issues/49302

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@cbodley
Copy link
Contributor Author

cbodley commented Apr 22, 2022

@ofriedma i saw this commit in #45958, and it looks like a fix for https://tracker.ceph.com/issues/49302 which we're seeing downstream too. i updated the commit and message, and added your Signed-off-by. is that ok?

@mattbenjamin
Copy link
Contributor

@cbodley would it make sense to check for null stack in all the places that look plausible?

@cbodley
Copy link
Contributor Author

cbodley commented Apr 22, 2022

@cbodley would it make sense to check for null stack in all the places that look plausible?

cc @yehudasa

probably, yeah - if RGWCoroutine::stack is null, i assume that means the stack finished running all of its coroutines, so it's too late for calls on RGWCoroutine to have any effect

in the past, we worked around issues like this by requiring the sync code to hold references to both the coroutine and its stack (some examples in RGWDataSyncShardCR and RGWMetaSyncCR)

i don't think there's a good way to track down all of the other places we could crash like this, so adding null checks to RGWCoroutine seems like the only way to rule them out

@ofriedma
Copy link
Contributor

@cbodley
yes, it's fine I didn't manage to check if it happens on upstream as well (but saw that sometimes the stack could be null) , I added it as workaround for the issues I saw on the sync fairness, so not sure it it's the solution right here.

}

int RGWCoroutine::io_block(int ret, int64_t io_id) {
return io_block(ret, rgw_io_id{io_id, -1});
}

int RGWCoroutine::io_block(int ret, const rgw_io_id& io_id) {
if (!stack) {
return 0; // XXX: what does 0 mean?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the returned value matters here anyway? (we should change the comment)
Maybe we should add more checks as you've mentioned to the code where the stack is being called.
Other than those, LGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the returned value matters here anyway?

i see that several coroutines in rgw_sync.cc and rgw_data_sync.cc are returning the result with return io_block(0);

if coroutines return an error without calling set_cr_done() or set_cr_error(), it looks like we'd trigger this assertion in RGWCoroutinesStack::operate():

/* should r ever be negative at this point? */
ceph_assert(r >= 0);

but if we're in RGWCoroutinesStack::operate(), then RGWCoroutine::stack probably isn't null, so maybe it doesn't matter in this specific case

(we should change the comment)

i removed the comment, thanks

Maybe we should add more checks as you've mentioned to the code where the stack is being called

the crashes do seem to be specific to RGWOmapAppend, so we could resolve https://tracker.ceph.com/issues/49302 by changing its callers to hold an extra reference to its stack. but who knows where else this is broken?

i'm hoping that this change in RGWCoroutine fixes this entire class of bugs, without needing to make the calling code more complicated. but this coroutine stuff itself is complicated, so we may just be breaking other things

users of the RGWOmapAppend coroutine don't manage the lifetime of its
underlying coroutine stack, so end up making calls on RGWOmapAppend
after its stack goes away. this null check is a band-aid, and there are
still several other calls in RGWCoroutine that don't check for null
stack

Fixes: https://tracker.ceph.com/issues/49302

Signed-off-by: Or Friedmann <ofriedma@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley
Copy link
Contributor Author

cbodley commented Apr 26, 2022

this 'passed' qa in https://pulpito.ceph.com/cbodley-2022-04-26_01:02:21-rgw-wip-cbodley-testing-distro-default-smithi/ even though the multisite jobs failed. at least i didn't see any crashes

@cbodley cbodley removed the needs-qa label Apr 26, 2022
@cbodley
Copy link
Contributor Author

cbodley commented Apr 26, 2022

jenkins test make check

@cbodley
Copy link
Contributor Author

cbodley commented Apr 26, 2022

this 'passed' qa in https://pulpito.ceph.com/cbodley-2022-04-26_01:02:21-rgw-wip-cbodley-testing-distro-default-smithi/ even though the multisite jobs failed. at least i didn't see any crashes

in local testing, the only multisite failure i saw was test_multi.test_version_suspended_incremental_sync from https://tracker.ceph.com/issues/55179

@cbodley cbodley merged commit 197e9ea into ceph:master Apr 26, 2022
11 checks passed
@cbodley cbodley deleted the wip-49302 branch April 26, 2022 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants