osd: don't crash on empty snapset #21058
Conversation
@liewegas For one of our customers we observed OSD crashes due to an unexpectedly empty snapset returned for some objects in the cache tier pool, and the proposed patch helped to make their cluster functional again. This looks related to https://tracker.ceph.com/issues/21557, and although the root cause is unknown, wouldn't it be a good idea to have these guards upstream so this type of inconsistency does not bring the cluster down? Or do you have other suggestions for how it could be addressed?
I think the best approach would be to assert if the debug config option is true, so that we can catch it in QA. Although we have failed to do that so far.. it's unclear what the root cause is :( See #20040. Ideally we'd make some attempt to fix the inconsistency during scrub...
Force-pushed from 0a073c3 to 61c546a.
Thank you! Updated. I cherry-picked your 618f549 from #20040 to make it build. I will rebase when #20040 is merged.
Interesting. I think it could be done as a separate PR. Do you have an idea what info could be used to check/restore a snapset?
Force-pushed from 61c546a to 9f0ccc1.
Rebased now that #20040 is merged.
-    assert(!out->snaps.empty());
+    if (out->snaps.empty()) {
+      dout(1) << __func__ << " " << oid << " empty snapset" << dendl;
+      assert(!cct->_conf->osd_debug_verify_snaps);
no return -ENOENT here?
Yes, it was intentional: we don't want SnapMapper::get_snaps
to fail here, just return an empty snapset.
> Interesting. I think it could be done as a separate PR. Do you have an idea what info could be used to check/restore a snapset?
It might not require anything special, actually: perhaps an empty SnapSet
will result in any/all clones getting removed (as stray clones) and we'd
be done with it. I would give it a try by injecting the corruption
(perhaps removing it directly via the fuse mountpoint or via
ceph-objectstore-tool) to verify that the I/O path now behaves and that
scrub does too.
Force-pushed from e137211 to 8e18306.
@liewegas I have temporarily added a DNM commit to this PR that shows how I tested this. The test contains a small tool to clear the snapset in an object's snaps blob. It is used together with ceph-objectstore-tool to inject the empty snapset inconsistency. The objects belong to an rbd image with snapshots. I tested that after injecting the corruption I have failed to trigger "empty snapset" in
@liewegas ping
Looks good to me; let's drop the DNM commit and then queue for testing?
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Force-pushed from 8e18306 to 3996c0a.
@liewegas thanks! Rebased.
retest this please.
Fixes: http://tracker.ceph.com/issues/23851
Signed-off-by: Igor Fedotov <ifedotov@suse.com>