
osd: put pg removal in op_wq #19433

Merged
merged 25 commits into ceph:master from liewegas:wip-pg-removal on Jan 16, 2018

@liewegas (Member) commented Dec 11, 2017

A lot of awkward complexity is implemented in OSD to handle PGs that aren't in
pg_map and are in the process of being deleted. This is hard because if the
PG is recreated (or split, or whatever) then we need to stop the deletion and
create a fresh PG referencing the same data.

Instead, leave deleting PGs registered and Stray, with a new peering state
Stray/Deleting. Let them continue to process OSDMaps, splits, peering intervals,
and so on. If they are not fully deleted, they'll go back to Reset -> Stray and
so on and the new primary will get the notify and decide what to do with them
(usually instruct them to delete again).

This (1) streamlines and cleans up the code structure, and also (2) gets rid of
the special purpose RemoveWQ and moves the delete work into the main op_wq
where it can get throttled and so on.

This eliminates one of the roadblocks I hit while working on fast dispatch of peering
messages (resurrect_pg and its weird locking semantics didn't fit at all with the streamlined
handlers).
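The flow described above can be sketched as a tiny model. This is an illustrative sketch only: `FakePG`, `delete_some`, and the chunk loop are hypothetical stand-ins for the real PG, the DeleteSome event, and requeueing on the op_wq.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Sketch: a deleting PG stays registered and in a Stray/Deleting state,
// so an interval change can stop the deletion and hand the PG back to
// the new primary instead of requiring ad-hoc resurrection code.
enum class PGState { Stray, Deleting, Deleted };

struct FakePG {
  PGState state = PGState::Stray;
  std::deque<std::string> objects;  // objects still to remove
  bool interval_changed = false;    // set when a newer map changes the interval

  // One DeleteSome pass: remove a bounded chunk, then requeue.
  // Returns true while more work remains (i.e. requeue on the op_wq).
  bool delete_some(size_t max_chunk) {
    if (interval_changed) {
      // Back to Reset -> Stray; the new primary decides what to do.
      state = PGState::Stray;
      return false;
    }
    state = PGState::Deleting;
    for (size_t i = 0; i < max_chunk && !objects.empty(); ++i)
      objects.pop_front();
    if (objects.empty()) {
      state = PGState::Deleted;
      return false;
    }
    return true;
  }
};
```

Because each pass is a small, bounded unit of work queued like any other op, the deletion competes with (and can be throttled against) client I/O instead of running on a dedicated RemoveWQ.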

TODO

  • revise/improve throttling behavior
  • audit split code and make sure it squares with the new deletion behavior
  • check default tuning values vs pg delete speed w/ old vs new code
  • verify perfcounter is accurate and working
  • delete reservation

@liewegas liewegas requested review from gregsfortytwo and jdurgin Dec 11, 2017

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch from e221b84 to b62f3af Dec 11, 2017

@liewegas (Member, Author) commented Dec 11, 2017

Rebased on #18196 because I needed the OpSequencer shard change (with the new pg removal code the OpSequencer outlives the upper-layer Sequencer).

@gregsfortytwo (Member) left a comment:

Looks pretty good but a few quibbles.

// clear log
PGLogEntryHandler rollbacker{this, t};
pg_log.roll_forward(&rollbacker);

write_if_dirty(*t);

@gregsfortytwo (Member) Dec 12, 2017:

I haven't looked at everything here yet, but this seems odd to me — I think this is the only way we ever write the transaction down?

@liewegas (Member, Author) Dec 12, 2017:

See PG::do_peering_event(), where all of the state machine events come through. Previously on_removal() was special because it was called by the ad-hoc PG deletion code in OSD.cc instead of via the state machine; now it is called via the state machine (which handles this for us).

@@ -327,7 +330,7 @@ class PG : public DoutPrefixProvider {
}
bool pg_has_reset_since(epoch_t e) {
assert(is_locked());
-  return deleting || e < get_last_peering_reset();
+  return deleted || e < get_last_peering_reset();

@gregsfortytwo (Member) Dec 12, 2017:

Why don't we have to check deleting here any more?

@liewegas (Member, Author) Dec 12, 2017:

If there is an interval change we want to detect that and stop deleting. The new primary then gets a chance to use/keep us or restart deleting.

context< RecoveryMachine >().log_enter(state_name);
PG *pg = context< RecoveryMachine >().pg;
pg->deleting = true;
ObjectStore::Transaction* t = context<RecoveryMachine>().get_cur_transaction();

@gregsfortytwo (Member) Dec 12, 2017:

Do we normally have a transaction when entering this state? We do in most of recovery because we're usually doing something to the disk, but if we transition straight here I don't see anything setting it up?

@liewegas (Member, Author) Dec 12, 2017:

Generally speaking we always have a transaction (via the RecoveryContext) when inside the state machine. (That's why we don't have to worry about write_if_dirty() in that other spot; the generic state machine event submitter function handles it.)

@gregsfortytwo (Member) Dec 12, 2017:

Ah yep, I didn't poke enough — we have start_handle() and end_handle() functions that are invoked as part of handle_event().
And as you elsewhere mentioned, do_peering_event() takes care of persisting them. Cool!

// this ensures we get a valid result. it *also* serves to throttle
// us a bit (on filestore) because we won't delete more until the
// previous deletions are applied.
osr->flush();

@gregsfortytwo (Member) Dec 12, 2017:

Okay, so obviously we need a valid result. But that's only an issue on the first invocation, when we might have pending ops against the PG, right?
I don't think sticking a blocking operation in-line here is the right way to do flushing or throttling, especially since we're now eating up one of the main op workqueue threads. If we need to wait, that should be another state that allows other operations to proceed.

@liewegas (Member, Author) Dec 12, 2017:

hmm, that's a good point. i'm not sure what a better way to throttle would be, though. options are:

  • wait for commit to queue next delete. this will be pretty slow if the chunk size is small; large chunk sizes mean more interference with client io.
  • wait for onreadable -- this won't work with bluestore, which is immediately readable (or at least claims to be).
  • ???

Maybe we want to do something clever here, like make each op accumulate some read work (to keep it small and fast), and after many such DeleteSome passes queue up the write transaction and wait for commit?

@liewegas (Member, Author) Dec 12, 2017:

added todo

@gregsfortytwo (Member) Dec 12, 2017:

We can spin up callbacks when the deletes commit, right? Maybe just make a max outstanding number and block (via a DeletingWait state) doing a new chunk until they're done?

@liewegas (Member, Author) Dec 14, 2017:

Hrm, we are triggering the next delete step once onreadable fires. We can't change that to oncommit or else it will break filestore (unless we keep the flush). It also means that the flush() is a no-op for filestore on subsequent passes because we waited for onreadable.

For bluestore, onreadable is triggered at queue time, which means that without the flush we go really fast without much (any?) throttling. Keeping the flush throttles it.

I think the fix is to condition the next step on both onreadable and oncommit. That's what the latest patch does.
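The fix described here amounts to a completion gated on both callbacks. A minimal sketch follows; `GatedCompletion` is a hypothetical name, and the real code uses Ceph's Context machinery rather than this standalone struct:

```cpp
#include <cassert>

// Sketch: complete() will be called exactly `count` times (here, once by
// the onreadable callback and once by the oncommit callback); only the
// last invocation actually advances the deletion.
struct GatedCompletion {
  int count;           // callbacks still outstanding
  bool fired = false;  // did the final invocation run?

  explicit GatedCompletion(int n) : count(n) {}

  void complete(int r) {
    assert(r == 0);  // these callbacks always report success here
    if (--count)
      return;        // earlier invocations are no-ops
    fired = true;    // last caller queues the next DeleteSome chunk
  }
};
```

Constructed with count == 2 and handed to both callbacks, neither onreadable (which bluestore fires at queue time) nor oncommit alone advances the deletion; only once both have fired does the next chunk get queued, which throttles bluestore without holding an op thread in flush().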

@@ -3809,8 +3834,6 @@ void OSD::add_newly_split_pg(PG *pg, PG::RecoveryCtx *rctx)
}
peering_wait_for_split.erase(to_wake);
}
-  if (!service.get_osdmap()->have_pg_pool(pg->pg_id.pool()))
-    _remove_pg(pg);

@gregsfortytwo (Member) Dec 12, 2017:

No replacement?

@liewegas (Member, Author) Dec 12, 2017:

i need to review all of the split vs pg_map code; haven't done that yet.

@liewegas (Member, Author) Dec 12, 2017:

added todo

@liewegas (Member, Author) Dec 12, 2017:

ok, it turns out the current split tracking code is fine. This can go away along with some cleanup helpers because post-split pg deletion is now handled when the new/child pg consumes the osdmap with the pool deletion (i.e., pg removal is less of a special case).

@gregsfortytwo (Member) Dec 12, 2017:

Ah yeah, I figured that out in one of the other code paths...much nicer!

up, up_primary, acting, acting_primary);
if (valid_history &&
history.same_interval_since <= m->get_epoch()) {
assert(pg->get_primary().osd == m->get_source().num());

@gregsfortytwo (Member) Dec 12, 2017:

We don't seem to have preserved these checks that the OSD in question is the valid primary on the interval we're in? We can do it in the PG instead of in this dispatch thread, but I didn't notice them popping up there either.

@liewegas (Member, Author) Dec 12, 2017:

the peering event is tagged with an epoch so the generic checks there catch the stale events.
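The generic epoch check referred to here can be illustrated with a minimal sketch. The names below are hypothetical; the real check is the pg_has_reset_since()-style test on the queued event's epoch:

```cpp
#include <cassert>

// Sketch: each peering event records the epoch at which it was generated.
// A generic validity check drops events that predate the PG's last
// peering reset (stale interval) or that target a fully deleted PG.
struct PeeringEvent {
  unsigned epoch_sent;  // epoch at which the event was created
};

bool should_discard(const PeeringEvent& evt,
                    unsigned last_peering_reset,
                    bool deleted) {
  return deleted || evt.epoch_sent < last_peering_reset;
}
```

This is why the explicit per-message primary/interval assertions can go away: a message from a stale primary carries a stale epoch, and the generic check rejects it before the PG acts on it.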

@gregsfortytwo (Member) commented Dec 12, 2017

Looks good except for that flushing thing you said you had a TODO for

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch from d88f007 to c09b89d Dec 14, 2017

// complete will be called exactly count times; only the last time will actually
// complete.
if (--count) {
return;

@gregsfortytwo (Member) Dec 14, 2017:

Should we be checking r before returning, in case we get an error the first time through?

Otherwise, 👍!

@liewegas (Member, Author) Dec 14, 2017:

these are always 0.. i'll add an assert!

@gregsfortytwo (Member) commented Dec 14, 2017

Note the commit S-O-B complaint. Didn't walk through all the patches to try and identify new stuff, so let me know if more needs checking.

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch from defc7f5 to aff0386 Dec 14, 2017

@liewegas (Member, Author) commented Dec 14, 2017

Squashed them out. Thanks @gregsfortytwo ! Running a finalish run now, although this is blocked on #1819 merging first (and rebasing on the final version of that).

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch 3 times, most recently from 928827c to 539be3a Dec 15, 2017

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch 3 times, most recently from 0d479fd to 4ab3352 Jan 4, 2018

liewegas added some commits Dec 9, 2017

osd: leave PG registered (and stray) during delete; reimplement pg deletion

A lot of awkward complexity is implemented in OSD to handle PGs that aren't in
pg_map and are in the process of being deleted.  This is hard because if the
PG is recreated (or split, or whatever) then we need to stop the deletion and
create a fresh PG referencing the same data.

Instead, leave deleting PGs registered and Stray, with a new peering state
Stray/Deleting.  Let them continue to process OSDMaps, splits, peering intervals,
and so on.  If they are not fully deleted, they'll go back to Reset -> Stray and
so on and the new primary will get the notify and decide what to do with them
(usually instruct them to delete again).

This (1) streamlines and cleans up the code structure, and also (2) gets rid of
the special purpose RemoveWQ and moves the delete work into the main op_wq
where it can get throttled and so on.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: tolerate pool deletion
- the epoch the pool is deleted is an interval change
- no changes are possible after that

Also, use a pg_pool_t pointer to avoid the repeat lookups.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: tolerate missing pool
These assertions will soon no longer be valid.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: remove old pg removal infrastructure
Another queue bites the dust! \o/

Signed-off-by: Sage Weil <sage@redhat.com>
osd: enqueue peering evt with pgid, not PG*
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: carry pg ref for final pg delete txn
FileStore breaks if the Sequencer is destroyed while the txn is in flight.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: queue peering events directly, without pg lookup
We do not need to look up the PG in order to queue a peering event now
that the queue is based on spg_t and handles the validity checks (based on
the epoch) for us.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: do not publish_stats_to_osd in state exit() methods
exit() can happen due to AdvMap and a peering interval change, but it
runs before we have updated any of our internal state about whether we
are the primary, whether our pool is deleted and the pg no longer exists,
and so on.  The publish depends on (1) being primary, and (2) will crash
if the pool is gone from the OSDMap.

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas added some commits Dec 12, 2017

osd: drop advance_pg() split_pgs arg
We can do the same work inside the function and simplify the code.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: document split tracking structures a bit
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: wait for commit *and* readable before deleting more of PG
For filestore, waiting for onreadable ensures that (1) the backend has done
(all) of the deletion work (we are throttled) and (2) that the flush() will
not block.  So, all good.

For bluestore, onreadable happens at queue time, so the flush() was needed
to throttle progress.  However, we don't want to block the op thread on
flush.  And waiting for commit isn't sufficient because that would not
capture the filestore apply work.

Fix by waiting for both commit and readable before doing more deletion
work.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: start delete after initial delete start info is applied
This allows us to eliminate the flush in _delete_some().

Signed-off-by: Sage Weil <sage@redhat.com>
osd: do not clean up split state on deleted pools
We want the PGs to get created and then process the newer osdmap(s) to
delete themselves.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: always call complete_split()
We need to complete the split regardless of whether the pg will get
removed after.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: clean up pending splits when removing a pg
Say we get an osdmap indicating a pg will split, but the pg is deleting and
finishes its delete before the pg consumes that map.  We need to clean up
the pending split state.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: identify splits even if pool is eventually deleted
Previously we wouldn't bother splitting if the pool was going away; now
we walk the pg forward and will process the split even if the pool is
later deleted.  Adjust the loop to terminate gracefully if that happens.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: use local_reserver to schedule delete
Use the reserver so that delete competes for the same slot(s) as recovery
and such.

The priority is below recovery normally, unless the OSD is getting fullish, in
which case we set a very high priority.  We have to be careful here because
backfill will back off when the OSD gets full(ish) but log recovery does
not.

Signed-off-by: Sage Weil <sage@redhat.com>
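The policy in this commit message reduces to a small priority choice. The following sketch uses made-up constants and a hypothetical helper name; the actual Ceph priority values and reserver API differ:

```cpp
#include <cassert>

// Hypothetical priority values for illustration only.
const unsigned PRIORITY_RECOVERY  = 180;
const unsigned PRIORITY_PG_DELETE = 179;  // just below recovery
const unsigned PRIORITY_URGENT    = 255;  // reclaim space ASAP

// Sketch: choose the local_reserver priority for a PG delete.  Normally
// it sits just below recovery so recovery wins the reservation slots;
// once the OSD crosses the nearfull threshold, deletes jump to a very
// high priority, since log recovery (unlike backfill) does not back off
// on its own when the OSD fills up.
unsigned delete_priority(float used_ratio, float nearfull_ratio) {
  return used_ratio >= nearfull_ratio ? PRIORITY_URGENT
                                      : PRIORITY_PG_DELETE;
}
```

Routing deletes through the same reserver as recovery means a deleting PG occupies a reservation slot like any other background work, rather than running unthrottled alongside it.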
osd/PG: clear requeued peering events
Broken by fffcc8a

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 3f07137)
osd: tolerate startup for pg with no pool in osdmap
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: drop PGPool auid member
It's at info.auid.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: simplify snapmapper init
Use PGPool info; don't look at map (pool may have been deleted).

Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch from f799eda to a51ddc6 Jan 11, 2018

osd: record final pg_pool_t when a pool is deleted
Also, prevent OSD start if we have a PG whose pool is deleted and no
stored pool info.  (User should downgrade, let PG deletion complete, then
upgrade.)

Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch from a51ddc6 to cd0b2b5 Jan 11, 2018

liewegas added some commits Jan 9, 2018

osd: track deleted pools' pg_nums for calculating split
This is needed to determine (quickly) whether PGs have split.  Calling
get_pg_num() on the latest map does not work when the pool has been
deleted from that map.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: do on_shutdown on removal
This cleans up our reservations and misc other state in OSDService that
needs to be cleaned up.

Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-pg-removal branch from 43d01db to d37d966 Jan 16, 2018

@liewegas liewegas merged commit d079916 into ceph:master Jan 16, 2018

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@liewegas liewegas deleted the liewegas:wip-pg-removal branch Jan 16, 2018
