Removal of snapshot with corrupt replica crashes osd #22476

Merged
merged 5 commits into from
Nov 9, 2018

dzafman
Contributor

@dzafman dzafman commented Jun 8, 2018

This is only a partial fix for http://tracker.ceph.com/issues/23875

@dzafman
Contributor Author

dzafman commented Jun 9, 2018

@liewegas @jdurgin @tchaikov @neha-ojha I created this pull request to start the discussion on taking some asserts out.

Member

@gregsfortytwo gregsfortytwo left a comment

The OSD code change looks good to me, but I'm not clear on the failure which is still in the ticket.

@@ -90,57 +90,58 @@ function create_scenario() {
# Don't need to use ceph_objectstore_tool() function because osd stopped

JSON="$(ceph-objectstore-tool --data-path $dir/${osd} --head --op list obj1)"
-ceph-objectstore-tool --data-path $dir/${osd} "$JSON" --force remove
+ceph-objectstore-tool --data-path $dir/${osd} "$JSON" --force remove || return 1

Member

Why do we want to tolerate an error in every one of these? Do we expect stale data we might need to clean up?

Contributor Author

Actually, this change causes the script to not tolerate errors. Before, I just assumed that these commands cannot fail, so I didn't bother to check.

Member

Oh dur.
👍

@dzafman
Contributor Author

dzafman commented Jul 11, 2018

@gregsfortytwo Even with this change a replica can still crash when removing snapshots with certain corruptions. It isn't clear that we shouldn't consider some corruptions as fatal anyway.

    # When removing snapshots with a corrupt replica, it crashes.
    # See http://tracker.ceph.com/issues/23875
    if [ $which = "primary" ];
    then
        for i in `seq 1 7`
        do
            rados -p $poolname rmsnap snap$i
        done
    fi

@dzafman dzafman changed the title from "DNM: Removal of snapshot with corrupt replica crashes osd" to "Removal of snapshot with corrupt replica crashes osd" Jul 13, 2018
@dzafman dzafman requested review from tchaikov and jdurgin July 13, 2018 18:15
@tchaikov
Contributor

will review this PR early tomorrow.

@@ -1981,25 +1981,36 @@ int do_meta(ObjectStore *store, string object, Formatter *formatter, bool debug,
return 0;
}

enum rmtype {
STANDARD,

Contributor

@tchaikov tchaikov Jul 17, 2018

so

  • STANDARD for removing both snap from snapmapper and the object
  • SNAPMAP for removing the snap from snapmapper only
  • NOSNAPMAP for removing the object only

I'd suggest renaming STANDARD to BOTH and NOSNAPMAP to OBJECT, like

enum class Remove {
  BOTH,
  SNAPMAP,
  OBJECT
};

and update remove_object() like,

if (type == Remove::BOTH || type == Remove::SNAPMAP) {
  // remove snap from snapmap
}
if (type == Remove::BOTH || type == Remove::OBJECT) {
  // remove the object
}

easier to digest this way, IMHO.
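
A minimal, self-contained sketch of the shape suggested above may help when reading the review. It is only an illustration under assumptions: `FakeStore`, its two sets, and this `remove_object()` signature are hypothetical stand-ins and are not the ceph-objectstore-tool code.

```cpp
#include <set>
#include <string>

// Which parts of an object a removal should touch (names as suggested above).
enum class Remove {
  BOTH,     // snap mapping and the object
  SNAPMAP,  // snap mapping only
  OBJECT    // the object only, leaving the snap mapping behind
};

// Hypothetical in-memory stand-in for the object store and the snap mapper.
struct FakeStore {
  std::set<std::string> snapmap;  // objects that still have a snap mapping
  std::set<std::string> objects;  // objects present in the store
};

// Hypothetical helper: removes only the parts selected by `type`.
void remove_object(FakeStore& store, const std::string& oid, Remove type) {
  if (type == Remove::BOTH || type == Remove::SNAPMAP) {
    store.snapmap.erase(oid);  // remove snap from snapmap
  }
  if (type == Remove::BOTH || type == Remove::OBJECT) {
    store.objects.erase(oid);  // remove the object
  }
}
```

With two independent checks, BOTH is simply the union of the other two cases, so each branch can be read on its own.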

@dzafman
Contributor Author

dzafman commented Jul 17, 2018

@tchaikov I rebased this pull request and made the change you requested.

@dzafman
Contributor Author

dzafman commented Jul 17, 2018

retest this please

@dzafman
Contributor Author

dzafman commented Jul 18, 2018

retest this please

@gregsfortytwo
Member

@dzafman I think you need to rebase this on master again to get in the WITH_SEASTAR fix.

@tchaikov
Contributor

tchaikov commented Jul 21, 2018

@gregsfortytwo jenkins always tries to rebase a PR to be tested against the latest master at that moment. so what we need is just:

retest this please

=)

@jdurgin
Member

jdurgin commented Oct 29, 2018

@dzafman is this ready to go after a rebase?

@gregsfortytwo
Member

@dzafman, I gather we should go through and re-review this now?

Note that the final FIXUP commit is missing a signoff statement.

@dzafman
Contributor Author

dzafman commented Oct 29, 2018

@gregsfortytwo Yes, I'll squash, rebase and make sure the test passes.

By using nop() instead of setattr() we avoid crashing
a replica that is missing the object for some reason.
Although this is a corruption, it would get "fixed"
by removal of the snapshot.

In the other cases we accept completely missing keys (ENOENT),
which happen when an object is removed somehow or missed
by recovery/backfill.

Signed-off-by: David Zafman <dzafman@redhat.com>
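
The idea of accepting ENOENT instead of asserting can be sketched roughly as follows. This is not the OSD code: `SnapIndex`, `get_snaps()` and `trim_snap()` are hypothetical names over an in-memory map, and the nop()/setattr() transaction detail is not modeled at all.

```cpp
#include <cerrno>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for a key/value index from object name to its snaps.
using SnapIndex = std::map<std::string, std::set<int>>;

// Returns 0 and fills `out`, or -ENOENT when the key is absent.
int get_snaps(const SnapIndex& index, const std::string& oid,
              std::set<int>* out) {
  auto it = index.find(oid);
  if (it == index.end()) {
    return -ENOENT;
  }
  *out = it->second;
  return 0;
}

// Hypothetical trim step: drops `snap` from the mapping of `oid`.
// A missing key is tolerated (nothing left to trim); other errors propagate.
int trim_snap(SnapIndex& index, const std::string& oid, int snap) {
  std::set<int> snaps;
  int r = get_snaps(index, oid, &snaps);
  if (r == -ENOENT) {
    return 0;  // key already gone: treat the snap as trimmed, do not crash
  }
  if (r < 0) {
    return r;
  }
  snaps.erase(snap);
  if (snaps.empty()) {
    index.erase(oid);  // last snap removed, drop the key entirely
  } else {
    index[oid] = snaps;
  }
  return 0;
}
```

The design choice shown is narrow: only the "key not found" case is swallowed, so any other error code still reaches the caller.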
…ting

Signed-off-by: David Zafman <dzafman@redhat.com>
…e_scenario()

Signed-off-by: David Zafman <dzafman@redhat.com>
…re-tool

Use --rmtype snapmap with new obj16 to remove snapmap only, check for repair message
Use --rmtype nosnapmap to remove obj5 while leaving snapmap behind

Signed-off-by: David Zafman <dzafman@redhat.com>
… complete

Due to deliberate corruptions snaptrim_error means snaptrim is done

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman
Contributor Author

dzafman commented Nov 9, 2018

retest this please

@dzafman
Contributor Author

dzafman commented Nov 9, 2018

Rados test suite passed with unrelated failures:
dzafman-2018-11-08_20:13:57-rados-wip-zafman-testing-22476-distro-basic-smithi
3238374 http://tracker.ceph.com/issues/20798. [ FAILED ] LibRadosLockECPP.LockExclusiveDurPP
3238479 /usr/include/rados/buffer.h:657:61: error: expected string-literal before ‘)’ token
3238552 scrub_test: assert len(objs['inconsistents']) == 1
3238596 rados/test_librados_build.sh: /usr/bin/x86_64-linux-gnu-ld: cannot find -lradospp failed

@dzafman
Contributor Author

dzafman commented Nov 9, 2018

@jdurgin @gregsfortytwo @tchaikov This is ready for a final review and merge. I rebased, made changes suggested by Kefu and passed the rados suite.

@jdurgin jdurgin merged commit fd2a4c5 into ceph:master Nov 9, 2018
@dzafman dzafman deleted the wip-23875 branch September 3, 2019 20:55