
osd/PG: fix objects degraded higher than 100% #18297

Merged
merged 5 commits into ceph:master from dzafman:wip-21803 on Oct 24, 2017

@dzafman
Member

commented Oct 13, 2017

osd: Fix object_copies calculation so degraded is accurate

@dzafman dzafman force-pushed the dzafman:wip-21803 branch from 4ba221a to cea303e Oct 13, 2017

@dzafman dzafman requested a review from gregsfortytwo Oct 13, 2017

@dzafman dzafman force-pushed the dzafman:wip-21803 branch from cea303e to e92e8f9 Oct 16, 2017

@gregsfortytwo
Member

left a comment

This looks feasible to me, but I'm not quite sure.

And you need to add a Signed-off-by on the second commit. :)

}
missing += osd_missing;

// Safety check so we don't go negative anywhere

@gregsfortytwo
Member

commented Oct 16, 2017
Hmm, can this actually happen? Do we want to stop processing if they have a bunch of objects which the primary doesn't? It seems like that could get us into the opposite problem — we undercount recovery we need to do, instead of overcounting it.

osd: Re-organize _update_calc_stats() to make the code clearer
Signed-off-by: David Zafman <dzafman@redhat.com>

@dzafman dzafman force-pushed the dzafman:wip-21803 branch from e92e8f9 to 568819f Oct 18, 2017

@gregsfortytwo
Member

left a comment

Reviewed-by: Greg Farnum <gfarnum@redhat.com>

@gregsfortytwo
Member

commented Oct 18, 2017

@dzafman do you have a plan to add tests? Not sure what the best path forward is here but it looks good to me.

}
missing += osd_missing;

osd_objects = MAX(0, num_objects - osd_missing);

@tchaikov
Contributor

commented Oct 19, 2017
Please note: num_objects will be promoted to unsigned when evaluating this expression, since it involves an unsigned operand. So MAX(0, num_objects - osd_missing) always evaluates to num_objects - osd_missing, and the latter could underflow as an unsigned integer.

@dzafman
Author Member

commented Oct 19, 2017
@tchaikov I tested this, and the result of (int64_t)num_objects - (unsigned)osd_missing is signed (int64_t). So the MAX comparison with 0 works, and the final result is then converted back to unsigned.

@tchaikov
Contributor

commented Oct 22, 2017
@dzafman ahh, right! i see. i didn't realize that num_objects is an int64_t. the size of int64_t is greater than that of unsigned, hence has higher integer conversion rank. so (int64_t)num_objects - (unsigned)osd_missing is a signed (int64_t).

sorry for the noise =(
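
For anyone following this exchange, here is a minimal standalone C++ sketch of the promotion behavior being discussed (not the Ceph code itself: MAX is replaced by std::max, and the int64_t/unsigned types are assumed to mirror num_objects and osd_missing):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <type_traits>

int main() {
  int64_t num_objects = 5;    // assumed to mirror the int64_t stat field
  unsigned osd_missing = 8;   // assumed 32-bit unsigned, as in the discussion

  // On the usual 64-bit platforms int64_t has a higher conversion rank than
  // unsigned int, so the unsigned operand converts to int64_t and the
  // subtraction stays signed; it can legitimately be negative.
  auto diff = num_objects - osd_missing;
  static_assert(std::is_same<decltype(diff), int64_t>::value,
                "difference stays signed");

  // Clamping at zero therefore behaves as intended.
  int64_t osd_objects = std::max<int64_t>(0, diff);
  std::cout << osd_objects << std::endl;  // prints 0, not a huge wrapped value
}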

@liewegas liewegas changed the title Fix objects degraded higher than 100% osd/PG: fix objects degraded higher than 100% Oct 22, 2017

@liewegas
Member

commented Oct 22, 2017

Just saw this.. was working on the same bug on #18462. Will rebase on top of this since there is a second bug I found

@liewegas
Member

commented Oct 22, 2017

See #18463

@liewegas
Member

commented Oct 23, 2017

FWIW I tested this (+ my patch) out on a lab cluster and it works beautifully.

@yuriw
Contributor

commented Oct 23, 2017

osd/PG: _update_calc_stats: backfill targets do not always affect degraded

Our progress on a backfill target should only count toward degraded if
the target is needed in order for us to reach the target pool size.  If
we have remapped other complete copies, we can backfill and not be degraded
at all.  And we may be some combination of the two if there are multiple
backfill targets; use the target(s) with the most objects.

Signed-off-by: Sage Weil <sage@redhat.com>

osd: Adjust misplaced calculation
When there are more backfill nodes than misplaced nodes, the
object calculation would cause misplaced to drop too quickly.
Assuming that all backfill nodes are at approximately the same
place, we compute backfill per node and only adjust by
the number of misplaced nodes.

Fixes: http://tracker.ceph.com/issues/21898

Signed-off-by: David Zafman <dzafman@redhat.com>
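
A rough illustration of the intent behind these two commits (the numbers are invented): suppose a PG has 100 objects, one complete copy remains in the acting set, and two backfill targets have each copied about 30 objects so far. Only one copy's worth of data is actually out of place, so subtracting both targets' progress (60) from the misplaced count would make misplaced fall twice as fast as it should. Instead, the backfill progress per target (about 30) is scaled by the number of misplaced copies (one), leaving 100 - 30 = 70 objects misplaced. And because complete copies remain actively maintained, that remaining work is reported as misplaced rather than degraded.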

@dzafman dzafman force-pushed the dzafman:wip-21803 branch from 8ef21fd to 3818e8e Oct 24, 2017

@dzafman
Member Author

commented Oct 24, 2017

@liewegas @gregsfortytwo The code as of this time does not address the customer complaint about degraded when backfill is NOT involved. In the example below we took osd.1 out, but during recovery there is no knowledge that the out OSD contains all the objects:

UP/ACTING [1, 0]
ceph osd out osd.1
UP/ACTING [0, 2]
state: active+recovering+degraded
Degraded data redundancy: 62/200 objects degraded (31.000%), 1 pg degraded

@liewegas
Member

commented Oct 24, 2017

I'm not following the scenario... can you include the steps to reproduce?

osd: Handle objects on non-actingbackfill nodes during recovery
With this fix we see more misplaced and no degraded objects when
doing recovery to a new up/acting set with node(s) containing
objects that are no longer part of up/acting.

Signed-off-by: David Zafman <dzafman@redhat.com>

@liewegas
Member

commented Oct 24, 2017

Hmm, the misplaced calculation still confuses me... I think because the definition of what misplaced means feels fuzzy. Here's my attempt at a concrete definition... does this make sense?

  • The up set tells us where each object copy/shard should be. A clean object is the right object (shard) in its right place.
  • A misplaced object is an object (shard) that is not in its right place, but there is another copy of that object (shard) in a temporary place (i.e., the acting set) that is being actively maintained.
  • A degraded object is an object (shard) that is not in its right place, and there is not another copy (of that shard) somewhere else (in the acting set) that is actively maintained to compensate.

If that's the case, then we should be able to consider the up and acting osds, and then look at every location in the 'up' set and decide whether the object is clean, degraded, or misplaced. And in the end num_object_copies == clean + misplaced + degraded.

  • for each up + acting osd: clean = num_objects - missing, degraded = missing.
  • for each acting + !up osd: misplaced = num_objects - missing, degraded = missing
  • for each backfill target < size: osd_clean = peer_stat.num_objects, degraded = num_objects - osd_clean.
  • for each backfill target >= size: osd_clean = peer_stat.num_objects, misplaced = num_objects - osd_clean

?
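
A minimal sketch of that proposed accounting, written as standalone C++ rather than as the actual PG::_update_calc_stats() logic; the PeerView struct, classify() helper, and needed_for_size flag are invented for illustration, and the already-backfilled objects on a target are counted as clean in both backfill branches:

#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one peer OSD, just for this sketch.
struct PeerView {
  bool in_up = false;            // member of the up set
  bool in_acting = false;        // member of the acting set
  bool backfill_target = false;  // currently being backfilled
  bool needed_for_size = false;  // needed to reach the target pool size
  int64_t missing = 0;           // objects this peer is missing
  int64_t objects = 0;           // peer_info ... num_objects (backfill targets)
};

struct Counts { int64_t clean = 0, misplaced = 0, degraded = 0; };

// A transcription of the rules proposed above.
Counts classify(const std::vector<PeerView>& peers, int64_t num_objects) {
  Counts c;
  for (const auto& p : peers) {
    if (p.in_up && p.in_acting) {
      c.clean    += num_objects - p.missing;   // right shard, right place
      c.degraded += p.missing;
    } else if (p.in_acting && !p.in_up) {
      c.misplaced += num_objects - p.missing;  // maintained, temporary place
      c.degraded  += p.missing;
    } else if (p.backfill_target) {
      if (p.needed_for_size) {
        c.clean    += p.objects;
        c.degraded += num_objects - p.objects;  // copies still needed for redundancy
      } else {
        c.clean     += p.objects;
        c.misplaced += num_objects - p.objects; // redundancy intact elsewhere
      }
    }
  }
  // If the definition holds, num_object_copies == clean + misplaced + degraded.
  return c;
}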

@dzafman
Member Author

commented Oct 24, 2017

(Not related to your most recent comment)

My latest commit counts the 100 objects on osd.1 which isn't part of UP/ACTING during recovery but
is in peer_info[]...num_objects.

Scenario
ceph osd pool create test 1 1
ceph osd pool set test size 2
Create 100 objects
UP/ACTING [1, 0]
ceph osd out osd.1
UP/ACTING [0, 2]
state: active+recovering+degraded
Degraded data redundancy: 62/200 objects degraded (31.000%), 1 pg degraded

@dzafman
Member Author

commented Oct 24, 2017

@liewegas

Should misplaced be larger than num_object_copies? When I did the following:
ceph osd pool create test 1 1
ceph osd pool set test size 4
Create 200 objects
UP/ACTING [0, 1, 2, 4]
ceph osd out osd.0 osd.1 osd.2 osd.4
ceph osd pool set test size 2
UP/ACTING [3, 5]
state: active+recovering+degraded
800/400 objects misplaced (200.000%)

All the misplaced were the 200 objects on each of 0, 1, 2, 4.

Even though degraded state is set, we don't end up counting any objects as degraded because they are still on 0, 1, 2, 4 which aren't down. The 400 missing objects on 3, 5 aren't degraded because you won't lose data if there is a crash.

Also, note that with the current code the misplaced doesn't change as the recovery proceeds.

@liewegas
Member

commented Oct 24, 2017

Using my definition above, going from [0,1,2,4] -> [3,5] should result in 100% misplaced, but not > 100%.

The scenario where it falls apart is when you are deleting objects. Say you go from 100->10 objects and are backfilling... are the 90 'missing' deletes degraded or misplaced? Not according to the above, but that would mean we don't reflect them at all in the degraded or misplaced report, even though the PG has work to do. Not sure the get_num_missing() tells you deletes vs updates, though. It'll probably be close enough..

@yuriw yuriw merged commit 64bd835 into ceph:master Oct 24, 2017

5 checks passed

Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded
// degraded and account as misplaced.
for (auto i = peer_info.begin() ; i != peer_info.end() ; ++i) {
if (actingbackfill.find(i->first) == actingbackfill.end())
object_copies += i->second.stats.stats.sum.num_objects;

@tchaikov
Contributor

commented Oct 25, 2017
missing a pair of parentheses =) fix posted at #18528
