
mds: #11950: Persistent purge queue #12786

Merged
merged 33 commits into from Mar 8, 2017

Conversation

3 participants
@jcsp (Contributor) commented Jan 4, 2017

I am sure there are things I've forgotten here, but I think it's ready to get some input from others.

@jcsp jcsp requested a review from gregsfortytwo Jan 4, 2017

@jcsp (Author, Contributor) commented Jan 5, 2017

(This currently has the testing commits from #12800 in it for convenience)


// Don't use the MDSDaemon's Finisher and Timer, because this class
// operates outside of MDSDaemon::mds_lock
Finisher finisher;

@ukernel (Member) Jan 11, 2017

sorry, I missed this

@gregsfortytwo (Member) left a comment

I've got several questions and comments. Overall it looks good, but I am concerned about some of the TODOs you seem to have given up on; namely, this is a lot of extra IOs for each file we delete. Deletion is already kind of slow, and as far as I can tell we're now doing, for each file we delete:

  1. Write to purge queue
  2. Update journal head for new data
  3. Read purge queue
  4. (Mark entry trimmable and) update journal head

That's a lot of IO for a process which is already a bit slow, and a lot of it is because we're flushing on each op. I don't think tweaking things to avoid that is going to be difficult, and we definitely need to.
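The cost concern above can be sketched with a toy model (plain Python, purely illustrative; these function names and the batching scheme are not Ceph code) showing why flushing and updating the journal head per file is expensive, and how amortizing the head updates over batches would cut the IO count:

```python
import math

# Toy cost model of per-file purge-queue IO (illustrative, not Ceph code).
# Each deletion costs: queue write, head update, queue read, head update.
IOS_PER_FILE_UNBATCHED = 4

def ios_unbatched(num_files):
    """Flush and update the journal head on every single operation."""
    return num_files * IOS_PER_FILE_UNBATCHED

def ios_batched(num_files, batch_size):
    """Amortize head updates over batches: one write-side and one
    read-side head update per batch instead of per file."""
    batches = math.ceil(num_files / batch_size)
    # One queue write + one queue read per file, plus 2 head updates per batch.
    return num_files * 2 + batches * 2

print(ios_unbatched(1000))     # 4000 IOs
print(ios_batched(1000, 128))  # 2016 IOs
```

The model only counts operations; it says nothing about latency, but it illustrates why "flushing on each op" roughly doubles the IO volume versus a batched scheme.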
@@ -992,8 +992,9 @@ void MDLog::_recovery_thread(MDSInternalContextBase *completion)
}

/* Read the header from the front journal */
Journaler *front_journal = new Journaler(jp.front, mds->mdsmap->get_metadata_pool(),
CEPH_FS_ONDISK_MAGIC, mds->objecter, logger, l_mdl_jlat, &mds->timer, mds->finisher);
Journaler *front_journal = new Journaler("mdlog", jp.front,

@gregsfortytwo (Member) Jan 12, 2017

Since these represent different Journalers in the MDLog, how about "mdlog.front" and "mdlog.back" for distinguishing them?

@jcsp (Author, Contributor) Feb 10, 2017

These Journaler instances are (depending on whether we're doing a reformat) later assigned to the main MDLog::journaler, so special names would no longer make sense at that point.

@@ -970,7 +970,6 @@ void Journaler::_issue_read(uint64_t len)
if (flush_pos == safe_pos) {
_flush(NULL);
}
assert(flush_pos > safe_pos);

@gregsfortytwo (Member) Jan 12, 2017

I'm looking at Journaler::_flush() and not understanding why this line contradicts anything to do with prezeroing — it looks like it won't flush if there's no data to write, otherwise it will.

More broadly, what makes us issue reads at an offset which hasn't yet been written? This sort of looks like it's covering another bug...

@jcsp (Author, Contributor) Feb 10, 2017

flush_pos only advances in _do_flush if it's not also busy prezeroing, so in that case the assertion would fail. It's okay though: the waitfor_safe callback will still happen, because in _finish_prezero we check waiting_for_zero and call _do_flush again.

We issue reads at an offset that hasn't been written in the case that PurgeQueue has no backlog: in push() we call _consume(). In a busy system, this _consume will not touch Journaler because PurgeQueue::in_flight will already be at its limit. Otherwise, _consume will call Journaler::is_readable(), which will bring us through this _issue_read path.

This could all be more efficient if PurgeQueue had a special case for when it's not busy, which would pass the item straight through to be executed right after it's been written to the journaler, without issuing a read on the journaler.
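The control flow described above can be sketched as a toy model (plain Python; the class and field names here are illustrative, not the actual C++ PurgeQueue): push() writes to the journal and immediately tries to consume, the throttle short-circuits _consume on a busy system, and otherwise an item is read back out of the journal even though it was only just written, which is the extra read the proposed fast path would avoid.

```python
# Minimal sketch of the push/_consume path (hypothetical names, not Ceph code).
class PurgeQueueModel:
    def __init__(self, in_flight_limit):
        self.in_flight_limit = in_flight_limit
        self.in_flight = 0        # purges currently executing
        self.journal = []         # items durably written, awaiting read-back
        self.reads_issued = 0

    def push(self, item):
        self.journal.append(item)  # write the item to the journaler
        self._consume()            # then immediately try to consume

    def _consume(self):
        # Busy system: throttle saturated, don't touch the journal at all.
        if self.in_flight >= self.in_flight_limit:
            return
        # Otherwise read the next item back out and start executing it.
        # This read is what a not-busy fast path could skip by handing
        # the item straight through after it is written.
        if self.journal:
            self.journal.pop(0)
            self.reads_issued += 1
            self.in_flight += 1

q = PurgeQueueModel(in_flight_limit=2)
q.push("a"); q.push("b"); q.push("c")
print(q.in_flight, q.reads_issued, len(q.journal))  # 2 2 1
```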

@@ -946,7 +946,9 @@ void Journaler::_assimilate_prefetch()

if ((got_any && !was_readable && readable) || read_pos == write_pos) {
// readable!
ldout(cct, 10) << "_finish_read now readable (or at journal end)" << dendl;
ldout(cct, 10) << "_finish_read now readable (or at journal end) readable="
<< readable << " read=" << read_pos << " write="

@gregsfortytwo (Member) Jan 12, 2017

We'll all be a lot happier in future if these say read_pos/write_pos instead of just read and write? ;)

@jcsp (Author, Contributor) Feb 10, 2017

Updated.

item.size = to;
item.layout = pi->layout;
item.old_pools = pi->old_pools;
item.snapc = *snapc;

@gregsfortytwo (Member) Jan 12, 2017

Just a note: I was worried about using a static SnapContext here at first glance. But by the time it's purged there shouldn't be any more writes to the objects; we only have the SnapContext to begin with to prove that we are a new enough writer for the OSD to accept it as a proper delete of HEAD and to make sure we aren't deleting any data which is live in a snapshot but hasn't had HEAD modified since.

dout(20) << " decoding entry" << dendl;
PurgeItem item;
bufferlist::iterator q = bl.begin();
::decode(item, q);

@gregsfortytwo (Member) Jan 12, 2017

We need to wrap this to avoid decode errors aborting the MDS.

@jcsp (Author, Contributor) Feb 11, 2017

Done

max_required_mds))
remote.run([os.path.join(SRC_PREFIX, "stop.sh")], check_status=False)
remote.run(["rm", "-rf", "./out"])
remote.run(["rm", "-rf", "./dev"])

@gregsfortytwo (Member) Jan 12, 2017

The vstart.sh -n option takes care of both of these steps...?

@@ -506,6 +509,8 @@ def __init__(self):
# certain teuthology tests want to run tasks in parallel
self.lock = threading.RLock()

self.log = lambda x: log.info(x)

@gregsfortytwo (Member) Jan 12, 2017

I must not be interpreting this syntax properly. I'd expect to alias log("foo") to log.info("foo") but I don't even see it being referenced anywhere. What's happening?
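For reference, the line in question does exactly the aliasing Greg expects: `self.log("foo")` forwards to `log.info("foo")` on the module-level logger (whether it is actually referenced elsewhere isn't visible in this diff). A standalone demonstration of the idiom:

```python
import logging

log = logging.getLogger("demo")
log.setLevel(logging.INFO)
captured = []

class ListHandler(logging.Handler):
    """Capture formatted messages into a list so we can inspect them."""
    def emit(self, record):
        captured.append(record.getMessage())

log.addHandler(ListHandler())

class Ctx:
    def __init__(self):
        # Same idiom as the diff: alias self.log(...) to log.info(...)
        self.log = lambda x: log.info(x)

Ctx().log("hello from alias")
print(captured)  # ['hello from alias']
```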

<< dendl;
// This takes unbounded time, so we must indicate progress
// to the administrator
// TODO include progress in message

@gregsfortytwo (Member) Jan 12, 2017

Remaining TODO

// how many OSD ops we're allowed to emit at a time to
// race through the queue as fast as we can.
// TODO: if Objecter has any slow requests, take that as a hint and
// slow down our rate of purging (keep accepting pushes though)

@gregsfortytwo (Member) Jan 12, 2017

Two remaining TODOs?

// This was the lowest journal position in flight, so we can now
// safely expire the journal up to here.
journaler.set_expire_pos(expire_to);
journaler.trim(); // TODO: don't call trim so often?

@gregsfortytwo (Member) Jan 12, 2017

And another TODO.

@jcsp jcsp force-pushed the jcsp:wip-11950 branch 3 times, most recently from b4ba350 to 6a09fe7 Feb 10, 2017

@ukernel (Member) commented Feb 27, 2017

LGTM

@jcsp jcsp force-pushed the jcsp:wip-11950 branch 6 times, most recently from 41ba7de to bd80dd7 Mar 1, 2017

John Spray added some commits Dec 1, 2016

John Spray
compact_set: add #includes for dependencies
This was previously working by side effect; I happened
to include it somewhere that its dependencies weren't
already included.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc/Journaler: assign a name for logging
Now that we have an MDLog journaler and a PurgeQueue journaler,
this is needed to avoid confusion.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc/Journaler: remove incorrect assertion
This asserted that flush_pos would be ahead of
safe_pos after calling _flush.  However, this
is not guaranteed to be the case because
prezeroing might prevent us from flushing
right now.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc/Filer: const fix for passed layouts
...so that const references can be passed into
purge calls.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc/Journaler: add have_waiter()
Allows users of wait_for_readable to conveniently
see if there is already a waiter.  Yes, they could
do this themselves, but I'd rather peek at an existing
variable than add a new one caller-side.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
common/lockdep: clearer log messages
Previously these were contextless "using id..." messages with
no indication of what subsystem the message came from.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc/Journaler: wrap recover() completion in finisher
Otherwise, the callback will deadlock if it in turn
calls into any Journaler functions.  Don't care
about performance because we do this once at startup.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: const methods on SnapRealm
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: const snaprealm getters on CInode
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: use a persistent queue for purging deleted files
To avoid creating stray directories of unbounded size
and all the associated pain, use a more appropriate
datastructure to store a FIFO of inodes that need
purging.

Fixes: http://tracker.ceph.com/issues/11950
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: move PurgeQueue up to MDSRank
To better reflect its lifecycle: it has a part to play
in create/open and has an init/shutdown, unlike StrayManager.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: move throttling code out of StrayManager
This will belong in PurgeQueue from now on.  We assume
that there is no need to throttle the rate of insertions
into purge queue as it is an efficient sequentially-written
journal.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: move dir purge and truncate into purgequeue
Signed-off-by: John Spray <john.spray@redhat.com>

John Spray added some commits Dec 21, 2016

John Spray
mds: add stats to PurgeQueue
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: implement PurgeQueue throttling
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
qa: update test_strays for purgequeue
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: wait for purgequeue on rank shutdown
Also, move shutdown_pass call from dispatch
to tick, so that it doesn't rely on incoming
messages to make progress.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
qa/cephfs: add TestStrays.test_purge_on_shutdown
...and change test_migration_on_shutdown to
specifically target non-purgeable strays (i.e.
hardlink-ish things).

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: create purge queue if it's not found
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: update PurgeQueue for single-ack OSD change
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: expose progress during PurgeQueue drain
We don't track an item count, but we do have
a number of bytes left in the Journaler, so
can use that to give an indication of progress
while the MDS rank shutdown is waiting for
the PurgeQueue to do its thing.

Also lift the ops limit on the PurgeQueue
when it goes into the drain phase.

Signed-off-by: John Spray <john.spray@redhat.com>
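The byte-based progress indication this commit describes might look roughly like the following (illustrative Python; the real code is C++, and this function and its exact formula are assumptions, though expire_pos/read_pos/write_pos are the Journaler's own position names):

```python
def drain_progress(expire_pos, read_pos, write_pos):
    """Estimate drain progress from journal byte positions: we don't
    track an item count, but bytes consumed versus bytes written gives
    a usable ratio for reporting rank-shutdown progress."""
    total = write_pos - expire_pos
    if total <= 0:
        return 1.0  # nothing left in the queue
    done = read_pos - expire_pos
    return min(1.0, done / total)

print(drain_progress(0, 750, 1000))  # 0.75
```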
John Spray
mds: add error handling in PurgeQueue
For decode errors, and for Journaler errors.
Both are considered damage to the MDS rank, as
with other per-rank data structures.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc: remove Journaler "journaler_batch_*" settings
This was an unused code path.  If anyone set a nonzero
value here the MDS would crash because the Timer implementation
has changed since this code was written, and now requires
add_event_after callers to hold the right lock.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc: less aggressive prefetch in read/write Journaler
Previously, if doing a write/is_readable/write/is_readable sequence,
you'd end up doing a flush after every write, even though there
was already a flush in flight that would advance the readable-ness
of the journal.

Because this flush-during-read path is only active when using
a read/write journal such as in PurgeQueue, tweak the behaviour
to suit this case.

Signed-off-by: John Spray <john.spray@redhat.com>
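The flush-during-read tweak described in this commit can be illustrated with a toy model (plain Python with made-up names; the real logic lives in the C++ Journaler, and flush completion isn't modeled here): the old behaviour issues a flush on every write of a write/is_readable sequence, while the tweaked behaviour skips issuing a new flush while one is already in flight that will advance readability.

```python
# Toy model of aggressive vs. lazy flushing (illustrative, not Ceph code).
class JournalModel:
    def __init__(self):
        self.flushes = 0
        self.flush_in_flight = False  # completion not modeled

    def write(self, aggressive):
        # Aggressive: flush after every write.
        # Lazy: only flush if no flush is already in flight.
        if aggressive or not self.flush_in_flight:
            self.flushes += 1
            self.flush_in_flight = True

old, new = JournalModel(), JournalModel()
for _ in range(10):
    old.write(aggressive=True)
    new.write(aggressive=False)
print(old.flushes, new.flushes)  # 10 1
```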
John Spray
mds: update for removing Timer from Journaler
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: remove unnecessary flush() from PurgeQueue
We can drive all flushing from the read side.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
osdc: expose Journaler::write_head_needed
So that callers on the read side can optionally
do their own write_head calls according to
the same condition that Journaler uses
internally for its write_head during _flush().

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: write_head when reading in PurgeQueue
Previously write_head calls were only generated
on the write side, so if you had a big queue
and were just working through consuming it, you
wouldn't record your progress, and on a daemon
restart would end up repeating a load of work.

Signed-off-by: John Spray <john.spray@redhat.com>
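The consumer-side write_head behaviour this commit describes can be sketched as follows (illustrative Python; the class and the byte-interval policy are assumptions, not the actual C++): persist the read position periodically while consuming, so a daemon restart resumes near where it left off instead of replaying the whole backlog.

```python
# Toy model of persisting read progress while consuming (not Ceph code).
class HeadModel:
    def __init__(self, interval):
        self.interval = interval         # bytes between head writes
        self.persisted_read_pos = 0      # last position written to the head
        self.read_pos = 0
        self.head_writes = 0

    def write_head_needed(self):
        return self.read_pos - self.persisted_read_pos >= self.interval

    def consume(self, nbytes):
        self.read_pos += nbytes
        if self.write_head_needed():
            self.persisted_read_pos = self.read_pos
            self.head_writes += 1

h = HeadModel(interval=100)
for _ in range(25):
    h.consume(10)  # 250 bytes consumed in total
print(h.head_writes, h.persisted_read_pos)  # 2 200
```

On restart, work would resume from persisted_read_pos, so at most one interval's worth of items is repeated rather than the entire queue.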
John Spray
qa: add TestStrays.test_purge_queue_op_rate
For ensuring that the PurgeQueue code is not generating
too many extra IOs.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: flush PQ even when not consuming
In normal operation we generate flushes from
_consume when we read from the journaler.  However,
we should also have a fallback flush mechanism for
situations where can_consume() is false for a long time.

This comes up in testing when we set throttle to zero to
prevent progress, but would also come up in real life if
we were busy purging a few very large files, or if purging
was stuck due to bad PGs in the data pool -- we don't want
that to stop us completing appends to the PQ.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
qa: update TestFlush for changed stray perf counters
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
qa: update TestDamage for PurgeQueue
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
mds: handle Journaler::recover errors in PurgeQueue
Signed-off-by: John Spray <john.spray@redhat.com>

@jcsp jcsp force-pushed the jcsp:wip-11950 branch from bd80dd7 to 1777333 Mar 8, 2017

@jcsp jcsp merged commit 82a0f3e into ceph:master Mar 8, 2017

2 of 3 checks passed

default: Build started. sha1 is merged.
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified

@jcsp jcsp deleted the jcsp:wip-11950 branch Mar 8, 2017

@theanalyst theanalyst changed the title #11950: Persistent purge queue mds: #11950: Persistent purge queue Mar 23, 2017
