osd: async recovery #19811

Merged
merged 27 commits into ceph:master on Mar 15, 2018

Conversation

@neha-ojha
Member

neha-ojha commented Jan 5, 2018

This PR adds asynchronous recovery capability, by performing recovery in the background on an OSD out of the acting set, similar to backfill, but still using the PG log to determine what needs recovery. We call such OSDs, async_recovery_targets. The idea is to not block writes to objects awaiting recovery in async_recovery_targets.

Signed-off-by: Neha Ojha nojha@redhat.com
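At a high level the selection works roughly like the sketch below (illustrative only; the Candidate type, approx_missing_objects field, and threshold name are placeholders, not the exact code in this PR): while computing the acting set, peers that would need a large amount of log-based recovery are left out of the acting set and tracked as async recovery targets instead, so they can recover in the background without blocking client writes.

  // Illustrative sketch only, not the exact code in this PR.
  void choose_targets_sketch(const std::vector<Candidate> &candidates,   // hypothetical type
                             uint64_t async_recovery_threshold,          // hypothetical tunable
                             size_t pool_min_size,
                             std::vector<int> *want,
                             std::set<pg_shard_t> *async_recovery_targets)
  {
    for (const Candidate &cand : candidates) {
      if (cand.approx_missing_objects > async_recovery_threshold &&
          want->size() >= pool_min_size) {
        async_recovery_targets->insert(cand.shard);   // recover in the background
      } else {
        want->push_back(cand.osd);                    // keep in the acting set
      }
    }
  }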

@neha-ojha

Member

neha-ojha commented Jan 6, 2018

retest this please.

@neha-ojha neha-ojha requested a review from jdurgin Jan 6, 2018

@@ -1397,6 +1397,131 @@ void PG::calc_replicated_acting(
  }
}

bool PG::recoverable_and_ge_min_size(const vector<int> &want) const

@Liuchang0812

Liuchang0812 Jan 6, 2018

Contributor

recoverable_and_get_min_size?

@neha-ojha

neha-ojha Jan 6, 2018

Member

@Liuchang0812 that implies greater than or equal to.

@ZVampirEM77

ZVampirEM77 Jan 8, 2018

Contributor

It is very hard to understand what the function does from this name. Could we use a more descriptive one?

@neha-ojha

neha-ojha Jan 9, 2018

Member

The comments within the function try to explain what the function does - check if recoverable. Any suggestions for a better name based on those comments?
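For context, the check roughly amounts to: count the non-empty entries in want, refuse a want set smaller than the pool's min_size, then ask the backend's recoverability predicate whether those shards are enough to recover from. A minimal sketch (illustrative, assuming the existing get_is_recoverable_predicate() backend hook; not the PR's exact code):

  // Rough sketch of the shape of such a check; illustrative only.
  bool recoverable_and_ge_min_size_sketch(const vector<int> &want) const {
    unsigned num_want_acting = 0;
    set<pg_shard_t> have;
    for (int i = 0; i < (int)want.size(); ++i) {
      if (want[i] != CRUSH_ITEM_NONE) {
        ++num_want_acting;
        have.insert(pg_shard_t(want[i],
            pool.info.is_erasure() ? shard_id_t(i) : shard_id_t::NO_SHARD));
      }
    }
    if (num_want_acting < pool.info.min_size)
      return false;                        // the "ge_min_size" part: want is too small
    std::unique_ptr<IsPGRecoverablePredicate> ok_to_recover(
      get_pgbackend()->get_is_recoverable_predicate());
    return (*ok_to_recover)(have);         // the "recoverable" part: backend decides
  }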

-      get_parent()->get_actingbackfill_shards().begin();
-      i != get_parent()->get_actingbackfill_shards().end();
+      get_parent()->get_acting_recovery_backfill_shards().begin();
+      i != get_parent()->get_acting_recovery_backfill_shards().end();

@ZVampirEM77

ZVampirEM77 Jan 8, 2018

Contributor

I think the repeated call to get_parent()->get_acting_recovery_backfill_shards() is inefficient.
Maybe

const set<pg_shard_t> &acting_recovery_backfill_shards = get_parent()->get_acting_recovery_backfill_shards();
for (set<pg_shard_t>::const_iterator i = acting_recovery_backfill_shards.begin();
     i != acting_recovery_backfill_shards.end(); ++i)

is better. What do you think?

-      get_parent()->get_actingbackfill_shards().begin();
-      i != get_parent()->get_actingbackfill_shards().end();
+      get_parent()->get_acting_recovery_backfill_shards().begin();
+      i != get_parent()->get_acting_recovery_backfill_shards().end();

@ZVampirEM77

ZVampirEM77 Jan 8, 2018

Contributor

same

}

@ZVampirEM77

ZVampirEM77 Jan 8, 2018

Contributor

I think this kind of cleanup will pollute the commit history, making it harder to use git show and git blame to dig through it, right? Maybe we should not introduce it.

@@ -584,6 +585,8 @@ bool PrimaryLogPG::should_send_op(
  }
  if (async_recovery_targets.count(peer) && peer_missing[peer].is_missing(hoid)) {
    should_send = false;
    missing_loc.add_location(hoid, peer);

@jdurgin

jdurgin Jan 16, 2018

Member

missing_loc holds the locations of valid copies of missing objects - we should add the acting-set shards on which the object is not missing, not the peer that is missing it
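In other words, something along these lines (a sketch of the intent, not the final code; actingset and peer_missing are the existing PG members): for an object that an async recovery target is missing, record the shards that still hold a valid copy as its locations.

  // Sketch of the intent: record the shards that hold a valid copy of hoid,
  // not the async recovery target that is missing it.
  missing_loc.add_missing(hoid, v, eversion_t());
  for (const pg_shard_t &shard : actingset) {
    if (shard == peer)
      continue;                                   // peer is the one missing it
    auto pm = peer_missing.find(shard);
    if (pm == peer_missing.end() || !pm->second.is_missing(hoid))
      missing_loc.add_location(hoid, shard);      // this shard has a valid copy
  }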

@@ -584,6 +585,8 @@ bool PrimaryLogPG::should_send_op(
  }
  if (async_recovery_targets.count(peer) && peer_missing[peer].is_missing(hoid)) {
    should_send = false;
    missing_loc.add_location(hoid, peer);
    missing_loc.add_missing(hoid, v, eversion_t());

@jdurgin

jdurgin Jan 16, 2018

Member

putting these missing_loc updates in issue_repop() along with the peer_missing updates, before the submit_transaction() call there, would make them easier to find - should_send_op() doesn't sound like it should be modifying state
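For illustration, the suggested placement would look roughly like this (variable names mirror the snippet above; a sketch, not the final code):

  // In issue_repop(), next to the existing peer_missing updates and before
  // the submit_transaction() call; sketch only.
  for (const pg_shard_t &peer : async_recovery_targets) {
    if (peer_missing[peer].is_missing(hoid)) {
      missing_loc.add_missing(hoid, v, eversion_t());
      missing_loc.add_location(hoid, pg_whoami);   // the primary holds a valid copy
    }
  }
  // ... then build the repop and call submit_transaction() ...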

@neha-ojha

Member

neha-ojha commented Jan 22, 2018

retest this please.

@dzafman

This pull request and #20220 will probably have conflicts that will need to be resolved.

@@ -190,14 +190,14 @@ function TEST_recovery_sizeup() {
    ceph osd out osd.$primary osd.$otherosd
    ceph osd pool set test size 4
    ceph osd unset norecover
    ceph tell osd.$(get_primary $poolname obj1) debug kick_recovery_wq 0
    # Get new primary
    primary=$(get_primary $poolname obj1)

@dzafman

dzafman Mar 14, 2018

Member

This change works in master even without async recovery. Looking at this now, I'd say the most important thing is that after the reconfiguration the new primary needs to be selected and returned at this point. If the primary really did change between the kick_recovery_wq and the wait_for_clean(), then we probably kicked the wrong osd, and the test will fail if the check() below uses the wrong osd's log.

@neha-ojha

neha-ojha Mar 14, 2018

Member

Here is some background I shared with Josh earlier on this:

  • before async recovery, the new osds in play would be [2,3,4,5] and all of them would be in the acting set. This would mean that the primary, once chosen, would remain the same. So the primary used to kick recovery would be the same as the one found when we "Get new primary". In this case it would be osd.2.

  • after async recovery, the acting set becomes [4,5] in the choose_acting function, with [2,3] as async recovery targets. So osd.4 is used to kick recovery. However, when recovery completes, all the osds get back into the acting set and the primary is now osd.2. So there is a race condition where the new primary found may differ from the primary on which recovery was kicked. The value of the primary is then used in the check function to grep for "degraded"; this will fail since it greps on osd.2 while the logs exist on osd.4.

@dzafman

dzafman Mar 14, 2018

Member

So that means a recovery that used to look like [2,3,4,5]p2 now looks like [2,3]/[4,5]p4 up/acting?

@neha-ojha

neha-ojha Mar 14, 2018

Member

Something like this: [2,4,3,5]/[4,5]

@jdurgin

jdurgin Mar 14, 2018

Member

maybe we should add async_recovery_targets and backfill_targets to the pg dout() prefix to make this clearer when peering logs aren't available
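For example (sketch only; the actual change landed as the "PG: add async recovery and backfill to pg dout() prefix" commit below, and this assumes the usual ostream operators for sets of shards), the per-PG log prefix could append the two target sets:

  // Sketch: append the target sets to the per-PG dout() prefix so every log
  // line shows them even when peering logs aren't available.
  if (!async_recovery_targets.empty())
    out << " async=[" << async_recovery_targets << "]";
  if (!backfill_targets.empty())
    out << " backfill=[" << backfill_targets << "]";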

         r != peers.end();
         ++r) {
      pg_shard_t peer(*r);
      missing_loc.remove_location(soid, peer);

@jdurgin

jdurgin Mar 14, 2018

Member

this should only contain peers, so it'd be a little more efficient to add a MissingLoc::clear_location(soid) call that just does clear() on the locations instead of looping

neha-ojha added some commits Nov 28, 2017

PG: initial changes required for async recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
osd: introduce acting_recovery_backfill shards
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: cleanup and fixes for async recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: modify should_send_op to handle async recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: do not block writes on async_recovery_targets
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: update missing_loc for async_recovery_targets
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: set delete flag in needs_recovery_map for deletes
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: mark recovery started when fail to get recovery_read_lock
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: call choose_acting when there are async_recovery_targets
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: reset async_recovery_targets after recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: update pg_log missing for async_recovery_targets
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: bug fix in is_degraded_or_backfilling_object()
Signed-off-by: Neha Ojha <nojha@redhat.com>
  void clear_location(const hobject_t &hoid) {
    auto p = missing_loc.find(hoid);
    if (p != missing_loc.end()) {
      missing_loc.erase(p);

@dzafman

dzafman Mar 15, 2018

Member

You need a _dec_count(p->second) here to update the parallel map missing_by_count.
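With that folded in, the helper would look roughly like this (a sketch, assuming _dec_count() is the existing counterpart used by remove_location() to maintain missing_by_count):

  void clear_location(const hobject_t &hoid) {
    auto p = missing_loc.find(hoid);
    if (p != missing_loc.end()) {
      _dec_count(p->second);    // keep the parallel missing_by_count map in sync
      missing_loc.erase(p);
    }
  }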

PG: skip primary in peer_missing in rebuild()
Signed-off-by: Neha Ojha <nojha@redhat.com>

neha-ojha added some commits Jan 31, 2018

ECBackend: update pg_log missing for async_recovery_targets
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: handle degraded object in _rollback_to()
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: clear out old missing_loc in issue_repop
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: add location to missing_loc only during async_recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: clear missing_loc only during async_recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: want set size should not be below min_size
Signed-off-by: Neha Ojha <nojha@redhat.com>
qa: modify TEST_recovery_sizeup() to handle async recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
ECBackend: send correct stats to async_recovery_targets
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: handle async targets with missing clone in _rollback_to()
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: use get_recovery_read to prevent race during recovery deletes
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: use clear_location() to clear out missing_loc
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: add async recovery and backfill to pg dout() prefix
Signed-off-by: Neha Ojha <nojha@redhat.com>
PG: dout cleanup
Signed-off-by: Neha Ojha <nojha@redhat.com>
@neha-ojha

Member

neha-ojha commented Mar 15, 2018

retest this please

@dzafman

Looks good. I haven't reviewed the entire pull request.

@jdurgin

great work!

@jdurgin jdurgin changed the title from [DNM]osd: async recovery to osd: async recovery Mar 15, 2018

@jdurgin jdurgin merged commit 9b72808 into ceph:master Mar 15, 2018

4 of 5 checks passed

make check (arm64): make check failed
Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
@neha-ojha

Member

neha-ojha commented Mar 15, 2018

thanks @jdurgin @dzafman !
