
osd: eliminate snapdir objects and move clone snaps vector into SnapSet #13610

Merged
merged 41 commits from liewegas:wip-snapset into ceph:master on May 12, 2017

Conversation

7 participants
@liewegas
Member

liewegas commented Feb 23, 2017

The basic approach:

  • once require_luminous is set, generate new-style SnapSets for new objects.
  • once require_luminous is set, convert legacy SnapSets and snapdir objects to the new
    style during scrub, using the same mechanism we use to update digests.
  • maintain an upper bound on the number of legacy SnapSets that may be present, so that we can prevent upgrading past luminous until they've all been scrubbed away.
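As a rough illustration of the third bullet, the bound only ever needs to overcount, never undercount, so upgrade gating stays safe. The names below are hypothetical, not the actual Ceph fields:

```cpp
#include <cassert>

// Toy sketch of the pessimistic "upper bound" bookkeeping described above.
struct LegacySnapSetTracker {
  long upper_bound = 0;  // may overcount legacy SnapSets, never undercounts

  // Any SnapSet written while !require_luminous is assumed net-new legacy.
  void on_pre_luminous_write() { ++upper_bound; }

  // Scrub establishes an exact count and replaces the pessimistic estimate.
  void on_scrub_complete(long exact_count) { upper_bound = exact_count; }

  // Upgrading past luminous is allowed only once the bound hits zero.
  bool upgrade_allowed() const { return upper_bound == 0; }
};
```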

Tests

  • rados
  • rbd
  • cephfs

@liewegas liewegas changed the title from osd: eliminate snapdir objects and move clone snaps vector into SnapSet to DNM: osd: eliminate snapdir objects and move clone snaps vector into SnapSet Feb 23, 2017

@liewegas liewegas requested review from jdurgin and athanatos Feb 23, 2017

@gregsfortytwo

Member

gregsfortytwo commented Feb 23, 2017

I didn't look at most of this code, but I do wonder if we actually want clone whiteouts and HEAD-delete whiteouts to be the same thing.
Pro: they use the same code.
Con: you can remove cache whiteouts if they're not dirty; you can't remove clone ones.

Are there other issues?

@liewegas

Member

liewegas commented Feb 23, 2017

@gregsfortytwo

Member

gregsfortytwo commented Feb 23, 2017

Especially if it's getting counted as an object in the pool stats and pool listing, I think it should get removed if it's clean.
Which suggests to me that the two whiteouts should be different, since having a dirty HEAD delete and trimming the last snapshot can collide on a cache pool.

@liewegas

Member

liewegas commented Feb 27, 2017

Updated; see the comment on the last patch:

    // NOTE: this arguably constitutes minor interference with the
    // tiering agent if this is a cache tier since a snap trim event
    // is effectively evicting a whiteout we might otherwise want to
    // keep around.

I'm not sure I understand your last comment, "Which suggests to me that the two whiteouts should be different, since having a dirty HEAD delete and trimming the last snapshot can collide on a cache pool." What do you mean by collide? Just that we can't tell whether the cache meant to have a whiteout here or not?

AFAICS this is the only place where there is any distinction, and the difference is tiny, so I lean toward simple over complex.

@gregsfortytwo

Member

gregsfortytwo commented Feb 27, 2017

We need to be able to distinguish between cases:

  1. we had snapshots and deleted them all, so we are deleting the entire object record
  2. we deleted the HEAD object in the cache tier and need to flush that delete to the base.

I think case 2 is usually indicated by having the object info marked as dirty, but reliably distinguishing them without having separate flags sounds like a nightmare to me.

@liewegas

Member

liewegas commented Feb 27, 2017

@jdurgin

Member

jdurgin commented Mar 15, 2017

looks like the osd-scrub-repair.sh make check failure is real

@dzafman

Member

dzafman commented Apr 5, 2017

@liewegas Is this planned for Luminous?

@liewegas

Member

liewegas commented Apr 5, 2017

@liewegas liewegas removed the request for review from athanatos Apr 21, 2017

@@ -133,7 +133,6 @@ TEST(cls_rgw, test_put_snap) {
ASSERT_EQ(0, ioctx.snap_create("snapbar"));
librados::ObjectWriteOperation *op = new_op();
op->create(false);

@jdurgin

jdurgin May 3, 2017

Member

do we only get ENODATA if the op touches the object first? seems like we'd want runs through rgw/rbd/fs suites as well to make sure this doesn't affect them

@jdurgin

jdurgin May 3, 2017

Member

in other words, where is the ENODATA coming from in this case?

@liewegas

liewegas May 3, 2017

Member

If you getxattr on a whiteout, ObjectStore returns ENODATA instead of ENOENT, and the cls code isn't differentiating between the two (for annoyingly convoluted reasons).

but yeah, there is a general problem with whiteouts: cls getxattr returns ENODATA instead of ENOENT. I started a patch that wrapped the do_osd_ops read calls in objclass.cc with a check for exists || is_whiteout() (the osd does this in do_op or something else above do_osd_op; strangely, a few osd ops do it explicitly too) but I guess I didn't include it here. :(

either way, it's not a new problem.. it's a general whiteout issue. I'll open a separate bug for it and run this through rgw/rbd in the meantime?
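The check described in the comment above can be sketched roughly as follows; the type and function names are hypothetical, not the actual objclass.cc code:

```cpp
#include <cassert>
#include <cerrno>

// Sketch of the proposed guard: before letting objclass read ops run,
// check that the object logically exists and is not a whiteout, so
// callers see ENOENT rather than the ENODATA that a getxattr on a
// whiteout would otherwise surface from the ObjectStore.
struct ObjState {
  bool exists;
  bool whiteout;
};

int cls_read_precheck(const ObjState& obs) {
  if (!obs.exists || obs.whiteout)
    return -ENOENT;  // a whiteout is logically absent
  return 0;          // proceed with the read op
}
```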

@jdurgin

jdurgin May 4, 2017

Member

sounds good

@jdurgin

looks good overall, just one question about the ENODATA/ENOENT change.

liewegas added some commits Feb 20, 2017

osd/PrimaryLogPG: keep snapset on whiteout head if luminous
If REQUIRE_LUMINOUS is set on the OSDMap, put the SnapSet on the head
and make it a whiteout.  This is simpler and will eventually (once all
the old snapdir objects have been fixed up) let us remove a ton of
snapdir-related headaches.

This is surprisingly simple.  Everywhere else that we work with snapdir, we are
already set up to handle the snapset on the head.  The only difference is
that we are preventing ourselves from moving from the snapset-on-head
state to the snapset-on-snapdir-with-no-head state.  The only time this
happens is when the object is logically deleted in _delete_oid.

Signed-off-by: Sage Weil <sage@redhat.com>
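The _delete_oid behavior this commit describes amounts to a small decision table. This is an illustrative sketch under hypothetical names, not the actual PrimaryLogPG code:

```cpp
#include <cassert>

// Pre-luminous, a deleted head with live clones becomes a snapdir object
// carrying the SnapSet; post-luminous, the head stays as a whiteout
// carrying the SnapSet instead.
enum class DeleteResult { RemoveHead, SnapdirObject, WhiteoutHead };

DeleteResult on_delete_head(bool require_luminous, bool have_clones) {
  if (!have_clones)
    return DeleteResult::RemoveHead;   // no SnapSet worth keeping around
  return require_luminous
      ? DeleteResult::WhiteoutHead     // new style: snapset stays on head
      : DeleteResult::SnapdirObject;   // old style: snapset moves to snapdir
}
```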
osd/PrimaryLogPG: filter snapc.snaps up front
No reason to wait for make_writeable(); ensure we have a valid snapc
from the start.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: move SnapSetContext to osd_internal_types
Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: remove unused snapcolls from ScrubMap::object
Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: drop ScrubMap::object::nlink; drop some decode compat cruft

We moved to v7 encoding in ca81563, which
was pre-jewel, so that's the oldest version we need to decode.

Drop the nlink field, which has not been used in a long time.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: debug obs.exists from get_object_context
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: improve make_writeable debug output
Signed-off-by: Sage Weil <sage@redhat.com>
osd/SnapMapper: add some debug output
Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: add clone_snaps to SnapSet
Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: rename object_info_t::snaps -> legacy_snaps
This will make it easier to identify users of this field that need to
conditionally use either the old legacy_snaps or the new
SnapSet::clone_snaps vector.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: store clone list in SnapSet
Store the per-clone snaps list in SnapSet if (1) REQUIRE_LUMINOUS is set
in the OSDMap and (2) the SnapSet isn't a 'legacy' SnapSet that hasn't
been converted yet by scrub.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: remove head if snapset is empty and head is whiteout
We include a check to make sure we do not delete a dirty whiteout if this
is a tier pool and the object is dirty.

Signed-off-by: Sage Weil <sage@redhat.com>
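The tier-pool check this commit mentions can be written as a tiny predicate; a toy version, with illustrative names:

```cpp
#include <cassert>

// A whiteout head whose snapset has become empty can be removed, unless
// it is a dirty object on a cache tier: in that case the delete still
// has to be flushed to the base tier, so the whiteout must stay.
bool can_remove_whiteout_head(bool cache_tier, bool dirty) {
  if (cache_tier && dirty)
    return false;  // keep the whiteout until the delete is flushed
  return true;
}
```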
osd/PrimaryLogPG: maintain pessimistic estimate of num_legacy_snapsets
- Assume that any snapset we update, if not require_luminous, is a net-new
legacy SnapSet.  (It might be an existing one, which would be net-0, but
that is harder to tell.)

Then, during scrub,

- Any unreadable oi is assumed to include a legacy snapset
- Any snapset we encounter if !require_luminous is legacy
- Any object that should have a snapset but doesn't (corrupt or missing)
is assumed to be legacy.
- If we're trying to update a legacy SnapSet but have to abort, then it is
still legacy.

We could assume that a missing/broken snapset is not legacy since it has
to be repaired anyway (and therefore shouldn't block upgrade), but I'm
not sure.  For now, we'll take the conservative approach of blocking the
upgrade if the snapset metadata is missing/corrupt.

Signed-off-by: Sage Weil <sage@redhat.com>

# Conflicts:
#	src/osd/PrimaryLogPG.cc

liewegas added some commits Mar 30, 2017

ceph-objectstore-tool: handle both new and legacy SnapSet format
If we encounter legacy snaps, add to snapmapper then; otherwise, use the
clone_snaps field in SnapSet to add all clones to snapmapper when we
process the head object.

Signed-off-by: Sage Weil <sage@redhat.com>
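The branch described in this commit can be sketched with illustrative types (not the actual Ceph structures): legacy clones carry their snaps on the clone's object_info_t, while new-style objects keep a clone_snaps map on the head's SnapSet.

```cpp
#include <cassert>
#include <map>
#include <vector>

using snapid = unsigned;

struct ToySnapSet {
  bool legacy;
  std::map<snapid, std::vector<snapid>> clone_snaps;  // new style only
};

// Pick the snap list to feed the snapmapper for a given clone.
std::vector<snapid> snaps_for_clone(const ToySnapSet& ss, snapid clone,
                                    const std::vector<snapid>& legacy_oi_snaps) {
  if (ss.legacy)
    return legacy_oi_snaps;  // old format: snaps live on the clone itself
  auto it = ss.clone_snaps.find(clone);
  return it == ss.clone_snaps.end() ? std::vector<snapid>{} : it->second;
}
```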
ceph_test_cls_refcount: fix cls_rgw.test_put_snap
Do not touch the object before the put if we expect to see ENOENT.
We would get it anyway with older code, but with the snapset
changes we are on a whiteout and get ENODATA instead.

And really we shouldn't have been getting ENOENT after a touch
anyway!

Signed-off-by: Sage Weil <sage@redhat.com>
ceph_test_rados_api_list: update check_list
We no longer see dup entries on split.

Signed-off-by: Sage Weil <sage@redhat.com>
ceph_test_rados_api_*: do not clean up objects (usually)
First, this is pointless--each test runs in a namespace so they don't
step on each other.  Second, leaving objects in place is an opportunity
for scrub to notice any issues we created.  Third, the cleanup asserts
that delete succeeds but if clones exist pgls will show whiteouts and then
delete will return ENOENT.  We could disable the assert, but why bother
even attempting a sloppy cleanup?

We need to preserve cleanup behavior for a few tests (notably the object
listing ones).

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: add mon_fake_pool_delete option
Instead of deleting a pool, add a .NNN.DELETED suffix to the end.  This
keeps the data around long enough for it to be scrubbed later (in the
case of a teuthology job cleanup).

If you really want to delete the pool, then instead of the usual force
flag option you can pass --yes-i-really-really-mean-it-not-faking.  :)

Signed-off-by: Sage Weil <sage@redhat.com>
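The rename described above amounts to tagging the pool name rather than destroying its data. The commit says ".NNN.DELETED"; purely for illustration, the sketch below assumes NNN is the pool id:

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration of the fake delete: rename the pool so later
// scrubs can still examine its data.
std::string fake_deleted_name(const std::string& pool, long pool_id) {
  return pool + "." + std::to_string(pool_id) + ".DELETED";
}
```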
ceph_test_rados_api_list: make LibRadosListNP.ListObjectsError delete pool even if faking

Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMonitor: do not warn about pgp_num for fake "deleted" pools
This breaks the upgrade test from jewel.  We can probably revert it later.

Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados/thrash: extra cluster create
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados/basic: factor out cluster start
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados/thrash-erasure-code: factor out cluster create
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados/monthrash: simplify
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados/verify: refactor thrash and cluster create
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados: switch require-luminous facet to use full_sequential_finally

This lets us run multiple cleanup steps right before ceph
teardown.

Note that we drop the facet from multimon/ because it
doesn't factor out cluster creation before this step
properly.  That's fine because the require_luminous
cleanup shouldn't be related to the multimon tests.

Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados: at end, scrub pgs, verify no legacy snapsets
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: allow creation of obc for new head object
We may be creating the head object to migrate the SnapSet to.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: debug scrub range
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas changed the title from DNM: osd: eliminate snapdir objects and move clone snaps vector into SnapSet to osd: eliminate snapdir objects and move clone snaps vector into SnapSet May 8, 2017

@liewegas

Member

liewegas commented May 8, 2017

retest this please

@batrick

@liewegas

Member

liewegas commented May 12, 2017

@batrick can you confirm those are known cephfs issues? They don't look related to me, but there are a bunch of them :)

@batrick

Member

batrick commented May 12, 2017

Ya, they're all cephfs issues.

@liewegas liewegas merged commit d0a73ec into ceph:master May 12, 2017

3 checks passed

  • Signed-off-by: all commits in this PR are signed
  • Unmodified Submodules: submodules for project are unmodified
  • default: Build finished.

@liewegas liewegas deleted the liewegas:wip-snapset branch May 12, 2017

@@ -831,17 +834,47 @@ int get_attrs(ObjectStore *store, coll_t coll, ghobject_t hoid,
// This could have been handled in the caller if we didn't need to
// support exports that didn't include object_info_t in object_begin.
if (hoid.hobj.snap < CEPH_MAXSNAP && hoid.generation == ghobject_t::NO_GEN) {

@Liuchang0812

Liuchang0812 May 12, 2017

Contributor

Hi @liewegas, I'm not sure whether this PR breaks set_size here. The set_size function depends on snap_dir.
