-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removal of snapshot with corrupt replica crashes osd #22476
Conversation
@liewegas @jdurgin @tchaikov @neha-ojha I created this pull request to start the discussion on taking some asserts out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OSD code change looks good to me, but I'm not clear on the failure which is still in the ticket.
@@ -90,57 +90,58 @@ function create_scenario() { | |||
# Don't need to use ceph_objectstore_tool() function because osd stopped | |||
|
|||
JSON="$(ceph-objectstore-tool --data-path $dir/${osd} --head --op list obj1)" | |||
ceph-objectstore-tool --data-path $dir/${osd} "$JSON" --force remove | |||
ceph-objectstore-tool --data-path $dir/${osd} "$JSON" --force remove || return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to tolerate an error in every one of these? Do we expect stale data we might need to clean up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this change causes the script to not tolerate errors. Before I just assumed that these commands can not fail, so didn't bother to check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh dur.
👍
@gregsfortytwo Even with this change a replica can still crash when removing snapshots with certain corruptions. It isn't clear that we shouldn't consider some corruptions as fatal anyway.
|
will review this PR early tomorrow. |
src/tools/ceph_objectstore_tool.cc
Outdated
@@ -1981,25 +1981,36 @@ int do_meta(ObjectStore *store, string object, Formatter *formatter, bool debug, | |||
return 0; | |||
} | |||
|
|||
enum rmtype { | |||
STANDARD, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so
- STANDARD for removing both snap from snapmapper and the object
- SNAPMAP for removing the snap from snapmapper only
- NOSNAPMAP for remove the object only
i'd suggest rename STANDARD
to BOTH
, NOSNAPMAP
to OBJECT
, like
enum class Remove {
BOTH,
SNAPMAP,
OBJECT
};
and update remove_object()
like,
if (type == Remove::BOTH || type == Remove::SNAPMAP) {
// remove snap from snapmap
}
if (type == Remove::BOTH || type == Remove::OBJECT) {
// remove the object
}
easier to digest this way, IMHO.
@tchaikov I rebased this pull request and made the change you requested. |
retest this please |
1 similar comment
retest this please |
@dzafman I think you need to rebase this on master again to get in the WITH_SEASTAR fix. |
@gregsfortytwo jenkins always tries to rebase a PR to be tested against the latest master at that moment. so what we need is just: retest this please =) |
@dzafman is this ready to go after a rebase? |
@dzafman, I gather we should go through and re-review this now? Note that the final FIXUP commit is missing a signoff statement. |
@gregsfortytwo Yes, I'll squash, rebase and make sure test passes. |
My using nop() instead of setattr() we avoid crashing a replica that is missing the object for some reason. Although this is a corruption, it would be getting "fix" by removal of the snapshot. In the other cases we accept completely missing keys (ENOENT) that happens when an object is removed somehow or missed by recovery/backfill. Signed-off-by: David Zafman <dzafman@redhat.com>
…ting Signed-off-by: David Zafman <dzafman@redhat.com>
…e_scenario() Signed-off-by: David Zafman <dzafman@redhat.com>
…re-tool Use --rmtype snapmap with new obj16 to remove snapmap only, check for repair message Use --rmtype nosnapmap to remove obj5 while leaving snapmap behind Signed-off-by: David Zafman <dzafman@redhat.com>
… complete Due to deliberate corruptions snaptrim_error means snaptrim is done Signed-off-by: David Zafman <dzafman@redhat.com>
retest this please |
Rados test suite passed with unrelated failures: |
@jdurgin @gregsfortytwo @tchaikov This is ready for a final review and merge. I rebased, made changes suggested by Kefu and passed the rados suite. |
This is only a partial fix for http://tracker.ceph.com/issues/23875