
mgr: don't update pending service map epoch on receiving map from mon #36964

Merged (1 commit) on Sep 9, 2020

Conversation

@trociny (Contributor) commented Sep 3, 2020

It may still be an older service map.

Fixes: https://tracker.ceph.com/issues/47275
Signed-off-by: Mykola Golub mgolub@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@tchaikov (Contributor) commented Sep 9, 2020

@dillaman hi Jason, do you want to take a final look before this PR gets merged?

    } else {
      // we were already active and therefore must have persisted it,
      // which means ours is the same or newer.
      dout(10) << "got updated map e" << service_map.epoch << dendl;
      ceph_assert(pending_service_map.epoch > service_map.epoch);
Review comment:
Nit: per the comment, shouldn't this be >=? I wonder if your initial issue was just due to the pending_service_map_dirty variable not getting updated? i.e. if the current pending_service_map_dirty >= pending_service_map.epoch (dirty), reset pending_service_map_dirty to the new epoch in this method.

trociny (author) replied:

I think it should always be pending_service_map.epoch > service_map.epoch here.
I see only two places where pending_service_map.epoch is set/updated:

  1. in got_service_map (just two lines above), if we got initial map, and then we set pending epoch to current epoch + 1
  2. in send_report, when we advertise a new map (send the pending map to mon) and increase the pending epoch.

The comment looks correct to me if by "ours" one means the committed map, not the pending map. But I am open to suggestions for how the comment could be improved.

I am not sure I understand how you propose to fix this alternatively. I assume you saw my comment in the tracker ticket [1] describing the problem scenario? I think it was wrong that we could bump pending_service_map.epoch (and then pending_service_map_dirty) to a smaller value. They should only increase.

[1] https://tracker.ceph.com/issues/47275#note-1
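The two update sites and the resulting invariant can be sketched as a small standalone simplification. This is not the real Ceph mgr code; the names (`got_service_map`, `send_report`, `pending_service_map`, `ServiceMap`) are borrowed from the discussion above purely for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Ceph's ServiceMap; only the epoch matters here.
struct ServiceMap {
  uint64_t epoch = 0;
};

struct MgrSketch {
  ServiceMap pending_service_map;
  bool have_initial_map = false;

  // Site 1: on receiving a map from the mon. Only the *initial* map seeds
  // the pending epoch (current epoch + 1). Later maps from the mon may be
  // older than what this mgr already advertised, so the pending epoch must
  // never be rewound -- hence the strict inequality in the assertion.
  void got_service_map(const ServiceMap& service_map) {
    if (!have_initial_map) {
      pending_service_map.epoch = service_map.epoch + 1;
      have_initial_map = true;
    } else {
      // we were already active and must have persisted our map,
      // so the pending epoch is strictly ahead of anything the mon sends
      assert(pending_service_map.epoch > service_map.epoch);
    }
  }

  // Site 2: advertising the pending map to the mon bumps the pending epoch.
  void send_report() {
    pending_service_map.epoch++;
  }
};
```

Under this model the pending epoch only ever increases, which is why receiving an older map from the mon must not be allowed to overwrite it.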

Review comment:

Do standby MGRs not start the daemon server? I'm fine w/ ignoring the update if it's in the past, I just feel like the assertion might be a bit extreme w/o knowing all the corner cases.

trociny (author) replied:

I don't think the standby starts the daemon server.

I don't expect corner cases here, and if they really can happen I would like to know about them as soon as possible, because it may mean some other assumptions are also wrong. The assertion will be very helpful in that case: I would prefer an mgr crash with a reported backtrace to undefined/weird behavior.

Review comment:

Fair enough -- we have plenty of time for this to soak test in teuthology.

A project member commented:

we've been hitting this assert intermittently in testing and most recently in the LRC when it was upgraded to pacific https://tracker.ceph.com/issues/48022

@dillaman left a comment:

lgtm

@dillaman commented Sep 9, 2020

@tchaikov I will run this through the RBD mirror tests to verify

@trociny (author) commented Sep 9, 2020

@dillaman Additional tests certainly will not hurt, but FYI I already ran it through the RBD mirror tests a couple of times to make sure the issue had been fixed (and no mgr crashes were observed).

@dillaman commented Sep 9, 2020

> @dillaman Additional tests certainly will not hurt, but FYI I already ran it through the RBD mirror tests a couple of times to make sure the issue had been fixed (and no mgr crashes were observed).

Even better ... and I presume it fixed the issue w/ the instances being missing from the test runs?

@trociny (author) commented Sep 9, 2020

> and I presume it fixed the issue w/ the instances being missing from the test runs?

Yes. At least I was not able to reproduce it.
