
mgr: don't update pending service map epoch on receiving map from mon #36964

Merged (1 commit) on Sep 9, 2020

Conversation

@trociny (Contributor) commented Sep 3, 2020

It may still be an older service map.

Fixes: https://tracker.ceph.com/issues/47275
Signed-off-by: Mykola Golub mgolub@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@tchaikov (Contributor) commented Sep 9, 2020

@dillaman hi Jason, do you want to take a final look before this PR gets merged?

    } else {
      // we were already active and therefore must have persisted it,
      // which means ours is the same or newer.
      dout(10) << "got updated map e" << service_map.epoch << dendl;
      ceph_assert(pending_service_map.epoch > service_map.epoch);
Review comment:
Nit: per the comment, shouldn't this be >=? I wonder if your initial issue was just due to the pending_service_map_dirty variable not getting updated? i.e. if the current pending_service_map_dirty >= pending_service_map.epoch (dirty), reset pending_service_map_dirty to the new epoch in this method.

trociny (author) replied:

I think it should always be pending_service_map.epoch > service_map.epoch here.
I see only two places where pending_service_map.epoch is set/updated:

  1. in got_service_map (just two lines above), if we got initial map, and then we set pending epoch to current epoch + 1
  2. in send_report, when we advertise a new map (send the pending map to mon) and increase the pending epoch.

The comment looks correct to me if by "ours" one means the committed map, not the pending map. But I am open to suggestions for how the comment could be improved.

I am not sure I understand how you propose to fix this alternatively. I assume you saw my comment in the tracker ticket [1] describing the problem scenario? I think it was wrong that we could bump pending_service_map.epoch (and then pending_service_map_dirty) to a smaller value. They should only increase.

[1] https://tracker.ceph.com/issues/47275#note-1
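The two update sites and the resulting invariant can be sketched as a small standalone simplification. This is not the real Ceph mgr code; the names (`got_service_map`, `send_report`, `pending_service_map`, `ServiceMap`) are borrowed from the discussion above purely for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Ceph's ServiceMap; only the epoch matters here.
struct ServiceMap {
  uint64_t epoch = 0;
};

struct MgrSketch {
  ServiceMap pending_service_map;
  bool have_initial_map = false;

  // Site 1: on receiving a map from the mon. Only the *initial* map seeds
  // the pending epoch (current epoch + 1). Later maps from the mon may be
  // older than what this mgr already advertised, so the pending epoch must
  // never be rewound -- hence the strict inequality in the assertion.
  void got_service_map(const ServiceMap& service_map) {
    if (!have_initial_map) {
      pending_service_map.epoch = service_map.epoch + 1;
      have_initial_map = true;
    } else {
      // we were already active and must have persisted our map,
      // so the pending epoch is strictly ahead of anything the mon sends
      assert(pending_service_map.epoch > service_map.epoch);
    }
  }

  // Site 2: advertising the pending map to the mon bumps the pending epoch.
  void send_report() {
    pending_service_map.epoch++;
  }
};
```

Under this model the pending epoch only ever increases, which is why receiving an older map from the mon must not be allowed to overwrite it.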

Review comment:

Do standby MGRs not start the daemon server? I'm fine w/ ignoring the update if it's in the past, I just feel like the assertion might be a bit extreme w/o knowing all the corner cases.

trociny (author) replied:

I don't think the standby starts the daemon server.

I don't expect corner cases here, and if they really can happen I would like to know about them as soon as possible, because it may mean some other assumptions are also wrong. The assertion will be very helpful in that case: I would prefer an mgr crash with a reported backtrace to undefined/weird behavior.

Review comment:

Fair enough -- we have plenty of time for this to soak test in teuthology.

A project member commented:

we've been hitting this assert intermittently in testing and most recently in the LRC when it was upgraded to pacific https://tracker.ceph.com/issues/48022

@dillaman left a comment:

lgtm

@dillaman commented Sep 9, 2020

@tchaikov I will run this through the RBD mirror tests to verify

@trociny (author) commented Sep 9, 2020

@dillaman Additional tests certainly will not hurt, but FYI I already ran it through the RBD mirror tests a couple of times to make sure the issue had been fixed (and no mgr crashes were observed).

@dillaman commented Sep 9, 2020

> @dillaman Additional tests certainly will not hurt, but FYI I already ran it through the RBD mirror tests a couple of times to make sure the issue had been fixed (and no mgr crashes were observed).

Even better ... and I presume it fixed the issue w/ the instances being missing from the test runs?

@trociny (author) commented Sep 9, 2020

> and I presume it fixed the issue w/ the instances being missing from the test runs?

Yes. At least I was not able to reproduce it.
