Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nautilus: mds: do not defer incoming mgrmap when mds is laggy #36168

Merged
merged 2 commits into from
Oct 5, 2020

Conversation

smithfarm
Copy link
Contributor

@smithfarm smithfarm commented Jul 18, 2020

backport trackers: https://tracker.ceph.com/issues/46151


backport of #34024

parent trackers: https://tracker.ceph.com/issues/44638

this backport was staged using ceph-backport.sh version 15.1.1.389
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh

@smithfarm smithfarm added this to the nautilus milestone Jul 18, 2020
@smithfarm smithfarm added the cephfs Ceph File System label Jul 18, 2020
@smithfarm
Copy link
Contributor Author

/home/jenkins-build/build/workspace/ceph-pull-requests/src/mds/MDSRank.cc: In member function 'bool MDSRank::handle_message(const const_ref&)':
/home/jenkins-build/build/workspace/ceph-pull-requests/src/mds/MDSRank.cc:1196:7: error: return-statement with no value, in function returning 'bool' [-fpermissive]
 1196 |       return;                                                           \
      |       ^~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/mds/MDSRank.cc:1196:7: note: in definition of macro 'ALLOW_MESSAGES_FROM'
 1196 |       return;                                                           \
      |       ^~~~~~

@smithfarm smithfarm added the DNM label Jul 18, 2020
@smithfarm
Copy link
Contributor Author

smithfarm commented Jul 18, 2020

DNM until #36176 is merged and SHA1 double-checked

Nevermind. #36176 was a misguided attempt to fix a problem that turned out to be caused by an oversight in this backport.

@smithfarm smithfarm force-pushed the wip-46151-nautilus branch 2 times, most recently from f2b5eb2 to c660323 Compare July 19, 2020 06:44
vshankar and others added 2 commits July 19, 2020 10:25
When the mds is laggy, the incoming mgrmap is queued to be processed
at a later stage. But, the mds does not handle mgrmap message directly.
So, later when the mds is not laggy anymore, the mgrmap message is not
handled and is dropped. But, when the mgrmap message was queued up, the
mds acknowledges that it has handled the message. This causes the mgr
client instance to never process the mgrmap and never connecting to the
manager (the receipt of mgrmap drives the connection to the manager).

The fix is to not acknowledge messages that the mds cannot handle. In
normal cases, the mds does not ack the message but when it's laggy, it
just blindly queues up the message -- so, check if the message can be
handled (later) even when the mds is laggy.

Also, a minor change in a function name -- handle_deferrable_message()
is kind of a misnomer since the function is called to process messages
that are not deferred. That's changed to handle_message() now.

Fixes: http://tracker.ceph.com/issues/44638
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit beb12fa)

Conflicts:
	src/mds/MDSDaemon.cc
	src/mds/MDSRank.cc
	src/mds/MDSRank.h
- nautilus has "Message::const_ref" instead of "cref_t<Message>"
Fixes: beb12fa
Signed-off-by: Nathan Cutler <ncutler@suse.com>
(cherry picked from commit 6e96081)
@batrick
Copy link
Member

batrick commented Aug 7, 2020

@smithfarm is this still DNM?

@smithfarm smithfarm removed the DNM label Sep 23, 2020
@smithfarm
Copy link
Contributor Author

test this please

@yuriw
Copy link
Contributor

yuriw commented Sep 23, 2020

@yuriw yuriw merged commit ce64bd1 into ceph:nautilus Oct 5, 2020
@smithfarm smithfarm deleted the wip-46151-nautilus branch October 5, 2020 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cephfs Ceph File System nautilus-batch-1 nautilus point releases needs-qa wip-yuri6-testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants