osd: PeeringState: fix selection order in calc_replicated_acting_stretch #44518

Merged
merged 1 commit into from Jan 14, 2022

gregsfortytwo
Member

@gregsfortytwo gregsfortytwo commented Jan 11, 2022

We were previously mis-ordering these to deprioritize the existing acting set. That is bad!

We generate OSD candidates from the acting set and strays, and push
them into the candidates list as a tuple of <<!in_acting,pg_info.last_update>,osd_id>.

Then we sort the list. Then we go through the list from front to back and
push_back entries into the appropriate ancestor lists.

And then we pop_back() off the lists to select the acting set.
Which of course turns our nice careful order backwards! So don't do that.
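
A minimal sketch of the ordering problem described above, not the actual Ceph code. It assumes a plain integer stands in for pg_info.last_update and a single flat vector stands in for one ancestor list; the names Candidate, demo_bucket, select_pop_back and select_front are hypothetical. Consuming from the front is just one way to keep the sorted order and is not necessarily what the patch does.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for <<!in_acting, last_update>, osd_id>.
// std::pair compares lexicographically and false < true, so after an
// ascending sort the OSDs that are in the acting set come first.
using Candidate = std::pair<std::pair<bool, std::uint64_t>, int>;

// Sample ancestor list: OSDs 1 and 2 are in the acting set, 3 and 4 are strays.
std::vector<Candidate> demo_bucket() {
  std::vector<Candidate> bucket = {
      {{true, 10}, 4}, {{false, 10}, 1}, {{true, 10}, 3}, {{false, 10}, 2}};
  std::sort(bucket.begin(), bucket.end());  // order is now 1, 2, 3, 4
  return bucket;
}

// Mis-ordered selection: popping from the back walks the sorted list in
// reverse, so the strays are picked before the acting-set OSDs.
std::vector<int> select_pop_back(std::vector<Candidate> bucket, std::size_t n) {
  std::vector<int> chosen;
  while (chosen.size() < n && !bucket.empty()) {
    chosen.push_back(bucket.back().second);
    bucket.pop_back();
  }
  return chosen;
}

// Order-preserving selection: take entries from the front of the sorted list.
std::vector<int> select_front(const std::vector<Candidate>& bucket, std::size_t n) {
  std::vector<int> chosen;
  for (const auto& candidate : bucket) {
    if (chosen.size() == n) break;
    chosen.push_back(candidate.second);
  }
  return chosen;
}
```

With the sample data, selecting two OSDs via select_pop_back yields the strays 4 and 3, while select_front yields the acting-set OSDs 1 and 2.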

Fixes: https://tracker.ceph.com/issues/53824

Signed-off-by: Greg Farnum <gfarnum@redhat.com>


@gregsfortytwo
Member Author

Our lack of upstream tests around stretch mode is a problem here: I don't really see how to properly test this except giving it to the folks who have seen it (Red Hat QE; thanks guys!). So I think we just do our due diligence that it doesn't break the suite and then merge it in.

@gregsfortytwo
Member Author

No patch changes, just rebased to latest master since the doc and make check both failed on not-my-code?

@neha-ojha
Member

jenkins test api

@neha-ojha
Member

> Our lack of upstream tests around stretch mode is a problem here: I don't really see how to properly test this except giving it to the folks who have seen it (Red Hat QE; thanks guys!). So I think we just do our due diligence that it doesn't break the suite and then merge it in.

What kind of test scenario uncovered this bug?

@yuriw
Contributor

yuriw commented Jan 12, 2022

jenkins test api

@yuriw
Contributor

yuriw commented Jan 13, 2022

jenkins test api

@neha-ojha
Member

jenkins test api

@ljflores
Contributor

The failed api test log showed the line "Building remotely on 172.21.2.12+braggi12", indicating that it ran on braggi12. There was an issue with braggi12 in the past, but David Galloway fixed it, so I'm not entirely sure this is the same problem.

In any case, we'll see if it passes now.
