mgr: don't update pending service map epoch on receiving map from mon #36964
Conversation
It may still be an older service map.
Fixes: https://tracker.ceph.com/issues/47275
Signed-off-by: Mykola Golub <mgolub@suse.com>
Failures are tracked, and I tested locally using ../qa/run-standalone.sh osd-backfill-prio.sh; it passed.
@dillaman Hi Jason, do you want to take a final look before this PR gets merged?
} else {
  // we are already active and therefore must have persisted it,
  // which means ours is the same or newer.
  dout(10) << "got updated map e" << service_map.epoch << dendl;
  ceph_assert(pending_service_map.epoch > service_map.epoch);
Nit: per the comment, shouldn't this be >=? I wonder if your initial issue was just due to the pending_service_map_dirty variable not getting updated? i.e. if the current pending_service_map_dirty >= pending_service_map.epoch (dirty), reset pending_service_map_dirty to the new epoch in this method.
I think it should always be pending_service_map.epoch > service_map.epoch here.
I see only two places where pending_service_map.epoch is set/updated:
- in got_service_map (just two lines above), if we got the initial map, and then we set the pending epoch to the current epoch + 1
- in send_report, when we advertise a new map (send the pending map to the mon) and increase the pending epoch.
The comment looks correct to me if by "ours" one means the committed map, not the pending map. But I am open to suggestions on how the comment could be improved.
I am not sure I understand how you propose to fix this alternatively. I assume you saw my comment in the tracker ticket [1] describing the problem scenario? I think it was wrong that we could bump pending_service_map.epoch (and then pending_service_map_dirty) to a smaller value. They should only increase.
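To make the two update sites above concrete, here is a minimal, self-contained sketch of the epoch bookkeeping under discussion. It is not the real DaemonServer code: the MgrModel type and the main() driver are invented for illustration, pending_service_map_dirty and locking are omitted, and send_report only shows the epoch bump.

// Simplified model of the pending service map epoch handling (illustration only).
#include <cassert>
#include <cstdint>
#include <iostream>

struct ServiceMap {
  uint64_t epoch = 0;
};

struct MgrModel {
  ServiceMap pending_service_map;   // the map we will publish next

  // Called when a service map arrives from the monitors.
  void got_service_map(const ServiceMap& service_map) {
    if (pending_service_map.epoch == 0) {
      // First map after startup: seed the pending map one epoch ahead.
      pending_service_map = service_map;
      pending_service_map.epoch = service_map.epoch + 1;
    } else {
      // We have already persisted (or are about to persist) a newer epoch,
      // so a map from the mon can only be older; never move the pending
      // epoch backwards. This is the invariant the PR asserts.
      assert(pending_service_map.epoch > service_map.epoch);
    }
  }

  // Called when we advertise the pending map to the monitors.
  void send_report() {
    // ... send pending_service_map ...
    ++pending_service_map.epoch;    // the only other place the epoch grows
  }
};

int main() {
  MgrModel mgr;

  ServiceMap initial;
  initial.epoch = 5;
  mgr.got_service_map(initial);     // initial map: pending epoch becomes 6

  mgr.send_report();                // pending epoch becomes 7

  ServiceMap older;
  older.epoch = 6;
  mgr.got_service_map(older);       // older map from the mon: assertion holds

  std::cout << "pending epoch = " << mgr.pending_service_map.epoch << std::endl;
  return 0;
}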
Do standby MGRs not start the daemon server? I'm fine w/ ignoring the update if it's in the past, I just feel like the assertion might be a bit extreme w/o knowing all the corner cases.
I don't think the standby starts the daemon server.
I don't expect corner cases, and if they really can happen I would like to know about them as soon as possible, because it may mean some other assumptions are also wrong. The assertion will be very helpful in that case. I would prefer an mgr crash with a backtrace reported instead of undefined/weird behavior.
Fair enough -- we have plenty of time for this to soak test in teuthology.
We've been hitting this assert intermittently in testing, and most recently in the LRC when it was upgraded to Pacific: https://tracker.ceph.com/issues/48022
lgtm
@tchaikov I will run this through the RBD mirror tests to verify.
@dillaman Additional tests certainly will not hurt, but just FYI I already ran it through the RBD mirror tests a couple of times just to make sure the issue had been fixed (and no mgr crashes were observed).
Even better ... and I presume it fixed the issue w/ the instances being missing from the test runs?
Yes. At least I was not able to reproduce it.
It may still be an older service map.
Fixes: https://tracker.ceph.com/issues/47275
Signed-off-by: Mykola Golub <mgolub@suse.com>
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox