
osd/PeeringState: do not exclude up from acting_recovery_backfill #31703

Merged (2 commits into ceph:master, Nov 24, 2019)

Conversation

xiexingguo (Member) commented on Nov 18, 2019

If we choose a primary that does not belong to the current up set,
and all up peers are still recoverable, then we might end up excluding
some up peers from the acting_recovery_backfill set as well, due to the
"want size <= pool size" constraint (introduced in #24035).
As a result, not all up peers get recovered in one go.

Fix this by letting any oversized want set fall through to async recovery,
which is able to handle it cleanly.

Fixes: https://tracker.ceph.com/issues/42577
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
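
To make the failure mode and the fix concrete, here is a minimal, self-contained sketch (not the actual Ceph code: the types, the `pool_size`/`up` values, and the demotion rule are simplified stand-ins for what `PeeringState::calc_replicated_acting()` and the async-recovery selection actually do; real Ceph picks async-recovery candidates by cost heuristics, not by shard id):

```cpp
// Hedged sketch of the fixed selection flow, with simplified types.
#include <cstddef>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

int main() {
  const std::size_t pool_size = 3;        // replicated pool size
  const std::vector<int> up = {1, 2, 3};  // current up set
  const int primary = 0;                  // chosen primary, NOT in `up`

  std::set<int> want = {primary};
  std::set<int> acting_backfill = {primary};

  // After the fix: every recoverable up peer is always added to both sets.
  // (Before the fix, the loop stopped once want.size() >= pool_size, so with
  // an out-of-up primary one up peer never entered acting_backfill.)
  for (int osd : up) {
    want.insert(osd);
    acting_backfill.insert(osd);
  }

  // Oversized want set falls through to async recovery: demote extra shards
  // from `want`, but keep them in `acting_backfill` so they are still
  // recovered in this round.
  while (want.size() > pool_size) {
    auto it = std::prev(want.end());  // arbitrary pick, for illustration only
    std::cout << "osd." << *it << " demoted to async recovery\n";
    want.erase(it);
  }

  std::cout << "want: " << want.size()
            << " shards, acting_backfill: " << acting_backfill.size()
            << " shards\n";  // 3 vs. 4: no up peer was excluded
}
```

With these values the sketch demotes osd.3 to async recovery while acting_backfill still holds all four shards, which is exactly the property the old early break violated.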

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@@ -1703,9 +1703,6 @@ void PeeringState::calc_replicated_acting(
       acting_backfill->insert(up_cand);
       ss << " osd." << i << " (up) accepted " << cur_info << std::endl;
     }
-    if (want->size() >= size) {
xiexingguo (Member Author) commented on the removed check:
@liewegas it's wrong to break here, as we still want all up peers to go into the acting_backfill set.
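
A hedged illustration of this point (simplified, hypothetical types and values; the real check sat in `PeeringState::calc_replicated_acting()`):

```cpp
// Why breaking out of the up-peer loop early is wrong when the chosen
// primary is not in the up set: one up peer never reaches acting_backfill.
#include <cassert>
#include <cstddef>
#include <set>

int main() {
  const std::size_t size = 3;     // pool size
  const std::set<int> up = {1, 2, 3};
  std::set<int> want = {7};       // primary osd.7 is outside `up`
  std::set<int> acting_backfill = {7};

  for (int osd : up) {
    if (want.size() >= size)  // the removed check, simplified
      break;                  // stops before visiting osd.3
    want.insert(osd);
    acting_backfill.insert(osd);
  }

  // osd.3 was excluded, so it cannot be recovered in this round.
  assert(acting_backfill.count(3) == 0);
  return 0;
}
```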

xiexingguo (Member Author):

retest this please

dzafman (Contributor) commented on Nov 18, 2019

It might be clearer to show this as a revert of c3e2990 and a commit with the alternate fix.

xiexingguo (Member Author):

> this as a revert of c3e2990

I tried. The conflicts from reverting are huge (the code moved from pg.cc to peeringstate.cc). :-(

dzafman (Contributor) commented on Nov 19, 2019

> this as a revert of c3e2990
>
> I tried. The conflicts from reverting are huge (the code moved from pg.cc to peeringstate.cc). :-(

The code moved to src/osd/PeeringState.cc. You could do the following:

git revert c3e2990
git reset HEAD src/osd/PG.cc
git checkout -- src/osd/PG.cc
(remove the 3 lines from PeeringState::calc_replicated_acting() in src/osd/PeeringState.cc)
git add src/osd/PeeringState.cc
git revert --continue

The reset/checkout pair discards the revert's staged changes to the old src/osd/PG.cc location, so the removal is instead applied by hand in the moved file before the revert is finalized.

The branch was updated with two commits:

1. The revert:

   This reverts commit c3e2990.

   Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

2. The reworked fix, carrying the same commit message as the PR description above (with #24035 referenced as ceph#24035).

   Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
xiexingguo (Member Author):

@dzafman Thanks!
@neha-ojha @liewegas Mind taking a look?

neha-ojha (Member) left a comment:

looks good to me and also addresses https://tracker.ceph.com/issues/35924

neha-ojha (Member):

retest this please

tchaikov merged commit 819ccfd into ceph:master on Nov 24, 2019
xiexingguo (Member Author):

@tchaikov Thanks!
