Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: don't crash on empty snapset #21058

Merged
merged 1 commit into from Apr 23, 2018

Conversation

Projects
None yet
4 participants
@trociny
Copy link
Contributor

commented Mar 27, 2018

Fixes: http://tracker.ceph.com/issues/23851

Signed-off-by: Igor Fedotov ifedotov@suse.com

@trociny trociny requested a review from liewegas Mar 27, 2018

@trociny

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2018

@liewegas For one of our customers we observed osd crashes due to unexpectedly empty snapset returned for some objects in the cache tire pool. And the proposed patch helped to make their cluster functional again.

This looks like related to https://tracker.ceph.com/issues/21557, and although the root cause is unknown, wouldn't it be a good idea to have these guards upstream so this type of inconsistency would not make the cluster down? Or do you have other suggestions how it could be addressed?

@liewegas

This comment has been minimized.

Copy link
Member

commented Mar 27, 2018

I think the best would be to assert if the debug config option is true so that we can catch it qa. Although we have failed to do that.. unclear what the root cause is :(

See #20040. Ideally we'd make some attempt to fix the inconsistency during scrub...

@trociny trociny force-pushed the trociny:wip-osd-empty-snapset branch from 0a073c3 to 61c546a Mar 28, 2018

@trociny

This comment has been minimized.

Copy link
Contributor Author

commented Mar 28, 2018

@liewegas

I think the best would be to assert if the debug config option is true so that we can catch it qa.

Thank you! Updated. I cherry-picked your 618f549 from #20040 to make it build. I will rebase when #20040 is merged.

Ideally we'd make some attempt to fix the inconsistency during scrub...

Interesting. I think it could done as a separate PR. Do you have an idea what info could be used to check/restore a snapset?

@trociny trociny force-pushed the trociny:wip-osd-empty-snapset branch from 61c546a to 9f0ccc1 Mar 29, 2018

@trociny

This comment has been minimized.

Copy link
Contributor Author

commented Mar 29, 2018

Rebased after #20040 is merged.

assert(!out->snaps.empty());
if (out->snaps.empty()) {
dout(1) << __func__ << " " << oid << " empty snapset" << dendl;
assert(!cct->_conf->osd_debug_verify_snaps);

This comment has been minimized.

Copy link
@smithfarm

smithfarm Mar 29, 2018

Contributor

no return -ENOENT here?

This comment has been minimized.

Copy link
@trociny

trociny Mar 29, 2018

Author Contributor

Yes, it was intentional: we don't want SnapMapper::get_snaps to fail here, just return an empty snapset.

@liewegas

This comment has been minimized.

Copy link
Member

commented Apr 2, 2018

@trociny trociny force-pushed the trociny:wip-osd-empty-snapset branch 2 times, most recently from e137211 to 8e18306 Apr 6, 2018

@trociny

This comment has been minimized.

Copy link
Contributor Author

commented Apr 6, 2018

@liewegas I have temporary added a DNM commit to this PR that shows how I tested this.

The test contains a small tool to clear the snapset in an object snaps blob. It is used together with ceph-objectstore-tool to inject the empty snapset inconsistency. The objects belong to an rbd image with snapshots. I tested that after injecting the corruption rbd export still returns the same result. The "empty snapset" message (or osd_debug_verify_snaps assert) is triggered on scrub in SnapMapper::get_snaps.

I have failed to trigger "empty snapset" in PrimaryLogPG::find_object_context() because I don't know how to reproduce this. Though we saw this in the field [1].

[1] https://tracker.ceph.com/issues/22672

@trociny

This comment has been minimized.

Copy link
Contributor Author

commented Apr 16, 2018

@liewegas ping

@liewegas

This comment has been minimized.

Copy link
Member

commented Apr 16, 2018

Looks good to me; let's drop teh DNM commit and then queue for testing?

osd: don't crash on empty snapset
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>

@trociny trociny force-pushed the trociny:wip-osd-empty-snapset branch from 8e18306 to 3996c0a Apr 16, 2018

@trociny

This comment has been minimized.

Copy link
Contributor Author

commented Apr 16, 2018

@liewegas thanks! Rebased.

@tchaikov

This comment has been minimized.

@tchaikov

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2018

retest this please.

@tchaikov

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2018

@jdurgin @liewegas shall we include this change in mimic?

@tchaikov tchaikov self-assigned this Apr 23, 2018

@liewegas liewegas added this to the mimic milestone Apr 23, 2018

@tchaikov tchaikov merged commit 9a48ccb into ceph:master Apr 23, 2018

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.