Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: don't crash on empty snapset #21058

Merged
merged 1 commit into from Apr 23, 2018
Merged

Conversation

trociny
Copy link
Contributor

@trociny trociny commented Mar 27, 2018

Fixes: http://tracker.ceph.com/issues/23851

Signed-off-by: Igor Fedotov ifedotov@suse.com

@trociny
Copy link
Contributor Author

trociny commented Mar 27, 2018

@liewegas For one of our customers we observed osd crashes due to unexpectedly empty snapset returned for some objects in the cache tire pool. And the proposed patch helped to make their cluster functional again.

This looks like related to https://tracker.ceph.com/issues/21557, and although the root cause is unknown, wouldn't it be a good idea to have these guards upstream so this type of inconsistency would not make the cluster down? Or do you have other suggestions how it could be addressed?

@liewegas
Copy link
Member

I think the best would be to assert if the debug config option is true so that we can catch it qa. Although we have failed to do that.. unclear what the root cause is :(

See #20040. Ideally we'd make some attempt to fix the inconsistency during scrub...

@trociny
Copy link
Contributor Author

trociny commented Mar 28, 2018

@liewegas

I think the best would be to assert if the debug config option is true so that we can catch it qa.

Thank you! Updated. I cherry-picked your 618f549 from #20040 to make it build. I will rebase when #20040 is merged.

Ideally we'd make some attempt to fix the inconsistency during scrub...

Interesting. I think it could done as a separate PR. Do you have an idea what info could be used to check/restore a snapset?

@trociny
Copy link
Contributor Author

trociny commented Mar 29, 2018

Rebased after #20040 is merged.

assert(!out->snaps.empty());
if (out->snaps.empty()) {
dout(1) << __func__ << " " << oid << " empty snapset" << dendl;
assert(!cct->_conf->osd_debug_verify_snaps);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no return -ENOENT here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was intentional: we don't want SnapMapper::get_snaps to fail here, just return an empty snapset.

@liewegas
Copy link
Member

liewegas commented Apr 2, 2018 via email

@trociny trociny force-pushed the wip-osd-empty-snapset branch 2 times, most recently from e137211 to 8e18306 Compare April 6, 2018 17:37
@trociny
Copy link
Contributor Author

trociny commented Apr 6, 2018

@liewegas I have temporary added a DNM commit to this PR that shows how I tested this.

The test contains a small tool to clear the snapset in an object snaps blob. It is used together with ceph-objectstore-tool to inject the empty snapset inconsistency. The objects belong to an rbd image with snapshots. I tested that after injecting the corruption rbd export still returns the same result. The "empty snapset" message (or osd_debug_verify_snaps assert) is triggered on scrub in SnapMapper::get_snaps.

I have failed to trigger "empty snapset" in PrimaryLogPG::find_object_context() because I don't know how to reproduce this. Though we saw this in the field [1].

[1] https://tracker.ceph.com/issues/22672

@trociny
Copy link
Contributor Author

trociny commented Apr 16, 2018

@liewegas ping

@liewegas
Copy link
Member

Looks good to me; let's drop teh DNM commit and then queue for testing?

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
@trociny
Copy link
Contributor Author

trociny commented Apr 16, 2018

@liewegas thanks! Rebased.

@tchaikov
Copy link
Contributor

retest this please.

@tchaikov
Copy link
Contributor

@jdurgin @liewegas shall we include this change in mimic?

@tchaikov tchaikov self-assigned this Apr 23, 2018
@liewegas liewegas added this to the mimic milestone Apr 23, 2018
@tchaikov tchaikov merged commit 9a48ccb into ceph:master Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants