
osd/PG: make prioritized recovery possible #13723

Merged
merged 8 commits into ceph:master Jul 24, 2017

@branch-predictor
Member

branch-predictor commented Mar 1, 2017

In a large, live cluster, it may be desirable to have particular PGs recovered before others. An actual example is recovery after a rack failure, where a lot of PGs must be recovered and some of them host data for live VMs with a higher SLA than other VMs; in that case we'd like the high-SLA VMs to be restored to full health and performance as fast as possible. This PR adds four new commands ("ceph pg force-recovery", "ceph pg force-backfill", "ceph pg cancel-force-recovery", "ceph pg cancel-force-backfill") which mark one or more specified PGs as "forced", maximizing their recovery or backfill priority. This PR also alters the default recovery priorities (reducing the maximum priority to 254), so no other PG gets in the way. Users can restore the default priorities with the "cancel-force-*" commands at any time.
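For illustration, using the new commands looks like this (the PG IDs below are made up):

ceph pg force-recovery 2.15 2.3f
ceph pg force-backfill 2.40 2.41
ceph pg cancel-force-recovery 2.15 2.3f
ceph pg cancel-force-backfill 2.40 2.41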

Signed-off-by: Piotr Dałek piotr.dalek@corp.ovh.com

@liewegas liewegas changed the title from PG: make prioritized recovery possible to osd/PG: make prioritized recovery possible Mar 2, 2017

@liewegas


Member

liewegas commented Mar 2, 2017

Related question: since this is motivated by an SLA, would it make sense to have a pool property that automatically translates to a higher priority for PGs in those pools? Then you wouldn't need to manually poke at PGs.

@jdurgin


Member

jdurgin commented Mar 2, 2017

@liewegas there already is a pool property for recovery priority - the difference with this is that it also changes the way recovery and backfill are queued.

@branch-predictor could you explain a bit about what motivated that queuing change? E.g. do you have measurements of time to recover and client i/o impact with that approach vs just using the pool priorities?

@branch-predictor


Member

branch-predictor commented Mar 3, 2017

@liewegas @jdurgin I don't have exact numbers atm (@byo might have more detailed data), but it can take up to a few days for recovery to finish. We have quite large clusters with a lot of PGs (thousands), and even if it takes only a few seconds to recover a single PG, it quickly adds up - to the point that we have to halt the recovery process during daytime so customers can actually use their VMs, and restart it on the night shift. With prioritized recovery, we could at least recover each VM one by one and not worry about them anymore. That also reduces the area of damage (in terms of the number of images affected) when cascading failures occur.

@branch-predictor


Member

branch-predictor commented Mar 3, 2017

@jdurgin We could use a pool per VM, but that would be inconvenient, and using two pools (high-SLA and low-SLA) would also be inconvenient and prone to human error (including management decisions which may suddenly require all images to be high-SLA). Finally, high-SLA VMs would be recovered in a random order of PGs, with the possibility that a single VM is only restored after a lot of other, unrelated PGs are done.

@byo


Contributor

byo commented Mar 7, 2017

@liewegas @jdurgin this patch is the result of one of our biggest outages, where cluster recovery was incorrectly prioritized, leaving the cluster blocked for a few hours. The major causes of that outage were already fixed in #12389 (the cluster preferred recovering active PGs while there were inactive ones) and #12342 (messages directed to inactive PGs saturated message throttlers, blocking OSDs).
This PR would give us the perfect tool to react quickly if there are still anomalies during recovery. Sometimes we try to do this by manipulating osd max backfills across the cluster, but that isn't perfect.
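(For context, the "osd max backfills" workaround mentioned above is usually applied with something like the following; this is only an illustration of the existing knob, not part of this PR:

ceph tell osd.* injectargs '--osd-max-backfills 1'

It throttles concurrent backfills on every OSD instead of prioritizing specific PGs.)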

@liewegas

This comment has been minimized.

Member

liewegas commented Mar 7, 2017

@jdurgin


Member

jdurgin commented Mar 7, 2017

Yeah, this seems like a good backup - hence the lack of persistence, and the pg state flags to let you know when it's still set. Ideally, long term there would be a way to do this per-VM rather than per-PG, but this makes sense for now. It could even be used as a building block for such functionality in ceph-mgr.

The code looks good, and I'm happy to see you've updated the docs too.

Could you add some randomized testing of this to ceph_manager.py and thrasher.py, similar to what is done here: https://github.com/ceph/ceph/pull/13759/files

That way we can be sure it won't cause hangs or crashes at least.

ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
This will cause Ceph to perform recovery or backfill on specified placement
groups first, before other placement groups. This does not break currently


@jharbott

jharbott Jun 21, 2017

Contributor

s/break/interrupt/ would sound more fitting to me, or maybe "terminate"


@branch-predictor

branch-predictor Jun 26, 2017

Member

OK, I'll change it.

backfill or recovery), execute the following::
ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]


@jharbott

jharbott Jun 21, 2017

Contributor

Is there a situation where doing only one of these is really useful? In my experience, I would mostly want to do both, so having a single command for that would simplify things. Maybe it would then even be enough to only have this single command.


@branch-predictor

branch-predictor Jun 26, 2017

Member

"recovery" and "backfill" are different states, and you may run into situation where you want to boost priority of backfilled pgs, disregarding recovery alone.

// get pgids to process and prune duplicates
cmd_getval(g_ceph_context, cmdmap, "pgid", pgidstr);
for (auto& pstr : pgidstr) {


@chriswue

chriswue Jun 22, 2017

you should be able to simply do:

set<string> pgidstr_nodup(pgidstr.begin(), pgidstr.end())

This comment has been minimized.

@branch-predictor
auto it = pgidstr_nodup.begin();
for (size_t i = 0 ; i < pgidstr_nodup.size(); i++) {
pgidstr[i] = std::move(*it++);
}


@chriswue

chriswue Jun 22, 2017

this could be simplified to:

pgidstr.clear();
pgidstr.insert(pgidstr.end(), make_move_iterator(pgidstr_nodup.begin()), make_move_iterator(pgidstr_nodup.end()));

Technically you should be able to create a new vector from the set (with the elements moved) and then move the new vector into pgidstr but not sure how much that would save you. Don't think there is much point in re-sizing the original vector.


@branch-predictor

branch-predictor Jun 26, 2017

Member

With clearing and re-expanding, we're freeing a block of memory and then requesting a new one, while downsizing can be done in-place.


@chriswue

chriswue Jul 1, 2017

clear() removes all elements from the container but retains the reserved capacity, so there is no re-allocation going on.
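Pulling the thread together, a minimal self-contained sketch of the dedup-and-shrink approach being discussed (names mirror the snippets above, but this is only an illustration, not the actual patch):

#include <set>
#include <string>
#include <vector>

// Prune duplicate PG ID strings in place: collect them into a set, copy the
// unique values back into the front of the vector, then shrink the vector.
void prune_duplicates(std::vector<std::string>& pgidstr) {
  std::set<std::string> pgidstr_nodup(pgidstr.begin(), pgidstr.end());
  auto it = pgidstr_nodup.begin();
  for (size_t i = 0; i < pgidstr_nodup.size(); i++) {
    pgidstr[i] = *it++;
  }
  // Shrinking keeps the existing capacity, so no re-allocation happens here.
  pgidstr.resize(pgidstr_nodup.size());
}

An equally idiomatic alternative would be std::sort followed by std::unique/erase, which avoids the temporary set entirely.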

}
for (auto i : workpg.acting) {
osds.insert(i);
}


@chriswue

chriswue Jun 22, 2017

the two loops above can be shortened to:

set<int32_t> osds(workpg.up.begin(), workpg.up.end());
osds.insert(workpg.acting.begin(), workpg.acting.end());

This comment has been minimized.

@branch-predictor
@jdurgin

tests look good! let's see how this fares in the suite

@jdurgin jdurgin added the needs-qa label Jun 28, 2017

@tchaikov


Contributor

tchaikov commented Jun 30, 2017

@branch-predictor needs rebase

@branch-predictor


Member

branch-predictor commented Jun 30, 2017

@tchaikov done.

@liewegas


Member

liewegas commented Jul 5, 2017

this crashed and burned on the teuthology change... see http://pulpito.ceph.com/sage-2017-07-03_15:45:54-rados-wip-sage-testing2-distro-basic-smithi/ failures

@branch-predictor branch-predictor changed the title from osd/PG: make prioritized recovery possible to [DNM] osd/PG: make prioritized recovery possible Jul 18, 2017

@branch-predictor


Member

branch-predictor commented Jul 18, 2017

For some odd reason, under teuthology testing the mons sometimes process the pg force-* command and return EINVAL as a result. DNM until I figure out why.

branch-predictor added some commits Feb 8, 2017

PG: introduce forced recovery/backfill
Reduce the maximum automatically calculated recovery/backfill priority to 254
and reserve 255 for forced backfill/recovery, so recovery/backfill on
user-designated PGs can be requested ahead of other PGs currently being
backfilled and/or recovered. Clear PG_STATE_FORCED_BACKFILL and
PG_STATE_FORCED_RECOVERY once recovery/backfill is done.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
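As a rough illustration of the priority scheme this commit describes (the constant and function names here are simplified stand-ins, not the actual Ceph identifiers):

#include <algorithm>

// Automatically computed priorities are capped at 254; 255 is reserved so
// that user-forced PGs always sort ahead of everything else.
constexpr unsigned MAX_AUTO_PRIORITY = 254;
constexpr unsigned FORCED_PRIORITY = 255;

unsigned effective_priority(unsigned computed, bool forced) {
  return forced ? FORCED_PRIORITY : std::min(computed, MAX_AUTO_PRIORITY);
}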
messages: introduce MOSDForceRecovery
Introduce a new message type (MOSDForceRecovery) that will be used to
force (or cancel forcing of) PG recovery/backfill.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
Mgr: implement force-recovery/backfill commands
Implement the commands "pg force-recovery", "pg force-backfill", "pg
cancel-force-recovery" and "pg cancel-force-backfill", which accept
one or more PG IDs and cause those PGs to be recovered or
backfilled first. The "cancel-*" commands can be used to revert the
effect of the "pg force-*" commands.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
OSD, PG: remove optional arg from queue_for_recovery
The optional arg ("front") was meant to control whether the PG was
put in the front or the back (default) of awaiting_throttle. For some
reason, this wasn't used at all, so this commit removes it and
replaces it with logic that checks whether the PG has forced
backfill or recovery set, and puts it at the front only in that
case.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
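A minimal sketch of that queueing rule, assuming a plain deque of PG pointers and a single forced flag (not the actual OSD data structures):

#include <deque>

struct PG {
  bool forced_recovery_or_backfill = false;  // set by the pg force-* commands
};

std::deque<PG*> awaiting_throttle;

void queue_for_recovery(PG* pg) {
  if (pg->forced_recovery_or_backfill)
    awaiting_throttle.push_front(pg);  // forced PGs jump ahead in the queue
  else
    awaiting_throttle.push_back(pg);   // everything else waits its turn
}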
osd, pg: implement force_recovery
This commit implements the MOSDForceRecovery handler along with
all the code required to have PGs processed in the desired order
(PGs with force_recovery/force_backfill first).
Obviously it's not going to work cluster-wide: OSDs that are
not affected - but may affect affected OSDs - can cut into the PG
recovery queue and cause PGs with force_* flags to get recovered
or backfilled later than expected, though still far earlier than
without it.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
doc: document new force-recovery/force-backfill commands
Documentation for the new pg force-recovery, pg force-backfill,
pg cancel-force-recovery and pg cancel-force-backfill commands.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>

branch-predictor added some commits Jun 28, 2017

qa: add force/cancel recovery/backfill to QA testing
This randomly issues pg force-recovery/force-backfill and
pg cancel-force-recovery/cancel-force-backfill during QA
testing. Disabled for upgrades from hammer, jewel and kraken.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
PGMonitor: return -EPERM on pg force-* commands during upgrade
Return -EPERM on pg force-* commands during upgrade.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
@branch-predictor


Member

branch-predictor commented Jul 21, 2017

@liewegas @jcsp @jdurgin Moved the commands to the mgr, rebased to include the mgr fixes, worked around the upgrade-from-pre-luminous issue, tested at http://149.202.190.108:8081/ubuntu-2017-07-20_07:54:37-rados-bp-forced-recovery---basic-openstack/ and re-tested at http://149.202.190.108:8081/ubuntu-2017-07-20_19:58:28-rados-bp-forced-recovery---basic-openstack/. Failures are either environmental noise or unrelated to forced backfill/recovery. Please re-review, and if it looks OK I'm going to squash ec7094c with de0566f.

@branch-predictor branch-predictor changed the title from [DNM] osd/PG: make prioritized recovery possible to osd/PG: make prioritized recovery possible Jul 21, 2017

@liewegas liewegas added this to the luminous milestone Jul 21, 2017

@liewegas

This comment has been minimized.

Member

liewegas commented Jul 22, 2017

@liewegas liewegas merged commit 29549e6 into ceph:master Jul 24, 2017

2 of 4 checks passed
make check: failed
make check (arm64): failed
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
@gregsfortytwo


Member

gregsfortytwo commented Jul 31, 2017

This appears to be a bit broken: the recovery_lock is supposed to be very narrow, but it's being held while grabbing a pg lock here. See http://tracker.ceph.com/issues/20854
