
osd/PeeringState: do not exclude up from acting_recovery_backfill #31703

Merged (2 commits into ceph:master, Nov 24, 2019)

Conversation

xiexingguo (Member) commented on Nov 18, 2019

If we choose a primary that does not belong to the current up set,
and all up peers are still recoverable, then we might end up excluding
some up peers from the acting_recovery_backfill set as well, due to the
"want size <= pool size" constraint (introduced in #24035).
As a result, not all up peers get recovered in one go.

Fix this by letting any oversized want set fall through to async recovery,
which is able to handle it cleanly.

Fixes: https://tracker.ceph.com/issues/42577
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
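
To make the failure mode and the fix concrete, here is a minimal, self-contained sketch (not the actual Ceph code: the types, the `pool_size`/`up` values, and the demotion rule are simplified stand-ins for what `PeeringState::calc_replicated_acting()` and the async-recovery selection actually do; real Ceph picks async-recovery candidates by cost heuristics, not by shard id):

```cpp
// Hedged sketch of the fixed selection flow, with simplified types.
#include <cstddef>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

int main() {
  const std::size_t pool_size = 3;        // replicated pool size
  const std::vector<int> up = {1, 2, 3};  // current up set
  const int primary = 0;                  // chosen primary, NOT in `up`

  std::set<int> want = {primary};
  std::set<int> acting_backfill = {primary};

  // After the fix: every recoverable up peer is always added to both sets.
  // (Before the fix, the loop stopped once want.size() >= pool_size, so with
  // an out-of-up primary one up peer never entered acting_backfill.)
  for (int osd : up) {
    want.insert(osd);
    acting_backfill.insert(osd);
  }

  // Oversized want set falls through to async recovery: demote extra shards
  // from `want`, but keep them in `acting_backfill` so they are still
  // recovered in this round.
  while (want.size() > pool_size) {
    auto it = std::prev(want.end());  // arbitrary pick, for illustration only
    std::cout << "osd." << *it << " demoted to async recovery\n";
    want.erase(it);
  }

  std::cout << "want: " << want.size()
            << " shards, acting_backfill: " << acting_backfill.size()
            << " shards\n";  // 3 vs. 4: no up peer was excluded
}
```

With these values the sketch demotes osd.3 to async recovery while acting_backfill still holds all four shards, which is exactly the property the old early break violated.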

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@@ -1703,9 +1703,6 @@ void PeeringState::calc_replicated_acting(
       acting_backfill->insert(up_cand);
       ss << " osd." << i << " (up) accepted " << cur_info << std::endl;
     }
-    if (want->size() >= size) {
xiexingguo (Member Author) commented on the removed check:
@liewegas it's wrong to break here, as we still want all up peers to go into the acting_backfill set.
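
A hedged illustration of this point (simplified, hypothetical types and values; the real check sat in `PeeringState::calc_replicated_acting()`):

```cpp
// Why breaking out of the up-peer loop early is wrong when the chosen
// primary is not in the up set: one up peer never reaches acting_backfill.
#include <cassert>
#include <cstddef>
#include <set>

int main() {
  const std::size_t size = 3;     // pool size
  const std::set<int> up = {1, 2, 3};
  std::set<int> want = {7};       // primary osd.7 is outside `up`
  std::set<int> acting_backfill = {7};

  for (int osd : up) {
    if (want.size() >= size)  // the removed check, simplified
      break;                  // stops before visiting osd.3
    want.insert(osd);
    acting_backfill.insert(osd);
  }

  // osd.3 was excluded, so it cannot be recovered in this round.
  assert(acting_backfill.count(3) == 0);
  return 0;
}
```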

xiexingguo (Member Author):

retest this please

dzafman (Contributor) commented on Nov 18, 2019

It might be clearer to show this as a revert of c3e2990 and a commit with the alternate fix.

xiexingguo (Member Author):

> this as a revert of c3e2990

I tried. The conflicts from reverting are huge (the code moved from pg.cc to peeringstate.cc). :-(

dzafman (Contributor) commented on Nov 19, 2019

> this as a revert of c3e2990
>
> I tried. The conflicts from reverting are huge (the code moved from pg.cc to peeringstate.cc). :-(

The code moved to src/osd/PeeringState.cc. You could do the following:

git revert c3e2990
git reset HEAD src/osd/PG.cc
git checkout -- src/osd/PG.cc
(remove the 3 lines from PeeringState::calc_replicated_acting() in src/osd/PeeringState.cc)
git add src/osd/PeeringState.cc
git revert --continue

The reset/checkout pair discards the revert's staged changes to the old src/osd/PG.cc location, so the removal is instead applied by hand in the moved file before the revert is finalized.

The branch was updated with two commits:

1. The revert:

   This reverts commit c3e2990.

   Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

2. The reworked fix, carrying the same commit message as the PR description above (with #24035 referenced as ceph#24035).

   Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
xiexingguo (Member Author):

@dzafman Thanks!
@neha-ojha @liewegas Mind taking a look?

neha-ojha (Member) left a comment:

looks good to me and also addresses https://tracker.ceph.com/issues/35924

neha-ojha (Member):

retest this please

tchaikov merged commit 819ccfd into ceph:master on Nov 24, 2019
xiexingguo (Member Author):

@tchaikov Thanks!
