Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mds: register with mgr only after added to FSMap #31400

Merged
merged 7 commits into from
Nov 8, 2019
Merged

Conversation

batrick
Copy link
Member

@batrick batrick commented Nov 5, 2019

Fixes: https://tracker.ceph.com/issues/41538
Fixes: https://tracker.ceph.com/issues/42635
Signed-off-by: Patrick Donnelly pdonnell@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@batrick
Copy link
Member Author

batrick commented Nov 6, 2019

@liewegas I had to make several changes. Please have another look.

@batrick batrick force-pushed the i41538 branch 2 times, most recently from 4ea8ff0 to e66e40c Compare November 6, 2019 08:30
src/mds/MDSDaemon.cc Outdated Show resolved Hide resolved
src/mds/MDSDaemon.cc Outdated Show resolved Hide resolved
Copy link
Member

@liewegas liewegas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming the mds always starts in standby before going to creating or replay or whatever, then this looks good!

@batrick batrick force-pushed the i41538 branch 4 times, most recently from 54d8d04 to baa5175 Compare November 7, 2019 07:17
@callithea
Copy link
Member

callithea commented Nov 8, 2019

I tested the changes on my local system (because of https://tracker.ceph.com/issues/42011) and got a slightly different error when running the dashboard backend API tests:

2019-11-07 16:15:49,323.323 INFO:__main__:FAILED (failures=1)
2019-11-07 16:15:49,323.323 INFO:__main__:
2019-11-07 16:15:49,324.324 INFO:__main__:======================================================================
2019-11-07 16:15:49,324.324 INFO:__main__:FAIL: test_perf_counters_not_found (tasks.mgr.dashboard.test_perf_counters.PerfCountersControllerTest)
2019-11-07 16:15:49,324.324 INFO:__main__:----------------------------------------------------------------------
2019-11-07 16:15:49,324.324 INFO:__main__:Traceback (most recent call last):
2019-11-07 16:15:49,325.325 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/test_perf_counters.py", line 73, in test_perf_counters_not_found
2019-11-07 16:15:49,325.325 INFO:__main__:    self.assertSchemaBody(schema)
2019-11-07 16:15:49,325.325 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 340, in assertSchemaBody
2019-11-07 16:15:49,325.325 INFO:__main__:    self.assertSchema(self.jsonBody(), schema)
2019-11-07 16:15:49,325.325 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 337, in assertSchema
2019-11-07 16:15:49,326.326 INFO:__main__:    self.assertEqual(data, str(e))
2019-11-07 16:15:49,326.326 INFO:__main__:AssertionError: {u'status': u'404 Not Found', u'detail': u"'osd.4' not found", u'request_id': u'5bf96290-cae7-447d-857e-69856b965a9b'} != "In `input`: missing keys: set(['version', 'traceback'])"

I guess this is out of scope of this PR and I'll create another tracker issue for that one.
JFYI: https://tracker.ceph.com/issues/42708

@s0nea
Copy link
Member

s0nea commented Nov 8, 2019

I tested the changes on my local system (because of https://tracker.ceph.com/issues/42011) and got a slightly different error when running the dashboard backend API tests:

2019-11-07 16:15:49,323.323 INFO:__main__:FAILED (failures=1)
2019-11-07 16:15:49,323.323 INFO:__main__:
2019-11-07 16:15:49,324.324 INFO:__main__:======================================================================
2019-11-07 16:15:49,324.324 INFO:__main__:FAIL: test_perf_counters_not_found (tasks.mgr.dashboard.test_perf_counters.PerfCountersControllerTest)
2019-11-07 16:15:49,324.324 INFO:__main__:----------------------------------------------------------------------
2019-11-07 16:15:49,324.324 INFO:__main__:Traceback (most recent call last):
2019-11-07 16:15:49,325.325 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/test_perf_counters.py", line 73, in test_perf_counters_not_found
2019-11-07 16:15:49,325.325 INFO:__main__:    self.assertSchemaBody(schema)
2019-11-07 16:15:49,325.325 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 340, in assertSchemaBody
2019-11-07 16:15:49,325.325 INFO:__main__:    self.assertSchema(self.jsonBody(), schema)
2019-11-07 16:15:49,325.325 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 337, in assertSchema
2019-11-07 16:15:49,326.326 INFO:__main__:    self.assertEqual(data, str(e))
2019-11-07 16:15:49,326.326 INFO:__main__:AssertionError: {u'status': u'404 Not Found', u'detail': u"'osd.4' not found", u'request_id': u'5bf96290-cae7-447d-857e-69856b965a9b'} != "In `input`: missing keys: set(['version', 'traceback'])"

I guess this is out of scope of this PR and I'll create another tracker issue for that one.

Found the same issue here: https://jenkins.ceph.com/job/ceph-dashboard-pr-backend/456/

This also cleans up the output to be more readable/useful in debug
output.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This is a trivial refactor.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Few things here:

- Make explicit the check for getting removed from the MDSMap. This was
  only done before by checking if MDS held a rank which does not check the
  case where a standby is removed from the FSMap.

- Use mds_info_t::dump to simplify various debug output.

- Add a few sanity asserts for invalid state transitions.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The Monitors send an empty MDSMap to an MDS it is removing. The MDS
can't diagnose the cause. Instead suggest looking at the cluster/monitor
logs.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This commit undoes the service daemon registration for the MDS. It doesn't look
absolutely necessary and it causes the MDS to be listed twice in the `ceph
versions` output:

    $ ceph versions
        ...
        "mds": {
            "ceph version v15.0.0-6915-g0143b904676 (0143b9046763ea1801efa8358a0c033ec862cea9) octopus (dev)": 3
        },
        "mds": {
            "unknown": 3
        },
        "overall": {
            "ceph version v15.0.0-6915-g0143b904676 (0143b9046763ea1801efa8358a0c033ec862cea9) octopus (dev)": 10,
            "unknown": 3
        }

Fixing that requires looking for duplicates or ignoring MDSs in the
service daemons when the mon processes `ceph versions`. I have a feeling
that it wasn't actually designed to be used by the MDS this way however.
Additionally, the reason for "unknown" version is because the metadata
sent to the mgr does not include "ceph_version".

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Note that we now sub to the mgrmap after init because the MgrClient
connection to the mgr is driven by receipt of the MgrMap.

This is important so that the MDS does not have metadata with the mgr
when the mons are ignoring the MDS otherwise due to CompatSet
incompatibilities.

Fixes: https://tracker.ceph.com/issues/41538
Fixes: https://tracker.ceph.com/issues/42635
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
batrick added a commit that referenced this pull request Nov 8, 2019
* refs/pull/31400/head:
	mds: establish session with mgr only after added to FSMap
	mds: do not register as a service daemon
	mds: do not try to diagnose cause of MDSMap removal
	mds: fix handling of initial MDS states
	mds: remove unnecessary const qualifier
	mds: cleanup type decl and map iteration
	mds: define stream operator for mds_info_t

Reviewed-by: Sage Weil <sage@redhat.com>
@batrick batrick merged commit e765f2d into ceph:master Nov 8, 2019
@batrick batrick deleted the i41538 branch November 8, 2019 18:07
Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. sorry for being late. was traveling last week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants