mgr: block MgrClient::start_command until mgrmap #21698
Conversation
One implication here is that if you use a new (mimic) client against a pre-luminous cluster, we'll block indefinitely instead of getting EACCES. Maybe we can do a feature check?
@liewegas you mean the combination of: oh, I see, it's a mimic ceph CLI + mimic librados + a pre-luminous cluster.
luminous CLI + librados, jewel cluster: we won't get a mgrmap, and 'mgr tell' would hang. From 3015f30:

> mgr/MgrClient: assume missing MgrMap means no access to mgr at all
>
> If we get as far as authenticating and have no MgrMap, that implies the mon didn't provide us one (despite our request) and we have no access to the mgr at all.
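To make the two behaviours under discussion concrete, here is a minimal standalone sketch, not the actual Ceph code (the real `MgrClient` lives in `src/mgr/MgrClient.{h,cc}`); names like `have_mgrmap` and `pending_commands` are illustrative only. It contrasts the old assume-EACCES behaviour with the queue-until-mgrmap behaviour this PR moves toward:

```cpp
#include <cerrno>
#include <deque>
#include <mutex>
#include <string>

struct MgrClientSketch {
  std::mutex lock;
  bool have_mgrmap = false;             // set when the first MMgrMap arrives
  std::deque<std::string> pending_commands;

  // Old behaviour: a missing MgrMap is taken to mean "no access to the
  // mgr at all", so the command fails immediately.
  int start_command_old(const std::string& cmd) {
    std::lock_guard<std::mutex> l(lock);
    if (!have_mgrmap)
      return -EACCES;
    send(cmd);
    return 0;
  }

  // New behaviour: queue the command until the first MMgrMap arrives.
  // Against a pre-luminous cluster the map never comes, hence the hang
  // discussed above.
  int start_command_new(const std::string& cmd) {
    std::lock_guard<std::mutex> l(lock);
    if (!have_mgrmap) {
      pending_commands.push_back(cmd);
      return 0;
    }
    send(cmd);
    return 0;
  }

  void handle_mgr_map() {
    std::lock_guard<std::mutex> l(lock);
    have_mgrmap = true;
    while (!pending_commands.empty()) {  // flush everything that queued up
      send(pending_commands.front());
      pending_commands.pop_front();
    }
  }

  void send(const std::string&) { /* hand off to the messenger */ }
};
```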
retest this please.
This is for use when talking to pre-luminous clusters, where we should not block waiting for MgrMap because it might never come.

Fixes: https://tracker.ceph.com/issues/23627
Signed-off-by: John Spray <john.spray@redhat.com>
This wasn't taking the MonClient lock: we should use with_monmap to protect access to MonClient::monmap.

Signed-off-by: John Spray <john.spray@redhat.com>
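A minimal sketch of the locking pattern that commit describes, with illustrative stand-ins for `MonClient` and `MonMap` (this is not the real class): the map is only read via a helper that holds the client's lock, never through the bare member:

```cpp
#include <mutex>

struct MonMap {
  unsigned epoch = 0;
};

// Illustrative stand-in for MonClient; the point is that monmap is only
// touched while monc_lock is held, via with_monmap().
class MonClientSketch {
  std::mutex monc_lock;
  MonMap monmap;

public:
  template <typename Callback>
  auto with_monmap(Callback&& cb) {
    std::lock_guard<std::mutex> l(monc_lock);
    return cb(monmap);  // the map cannot change while the callback runs
  }
};

// Usage: read the epoch without racing against monmap updates.
unsigned current_epoch(MonClientSketch& monc) {
  return monc.with_monmap([](const MonMap& m) { return m.epoch; });
}
```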
Updated to have the behaviour depend on the luminous feature bit.
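For reference, a standalone sketch of the kind of gate that feature bit enables, using illustrative types rather than the real `mon_feature_t`/`MonMap` headers (the bit value below is made up): block waiting for a MgrMap only when the monitors advertise the luminous feature, since pre-luminous mons will never send one:

```cpp
#include <cstdint>

// Illustrative stand-ins for Ceph's mon feature machinery.
struct mon_feature_t {
  uint64_t bits;
  bool contains_all(mon_feature_t other) const {
    return (bits & other.bits) == other.bits;
  }
};
constexpr mon_feature_t FEATURE_LUMINOUS{1ull << 1};  // value is illustrative

// Decide whether start_command may block waiting for a MgrMap.  For a
// pre-luminous cluster the map never comes, so blocking would hang
// forever and failing fast is the right answer.
bool should_wait_for_mgrmap(mon_feature_t required_mon_features) {
  return required_mon_features.contains_all(FEATURE_LUMINOUS);
}
```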
This master failure may be related, BTW: a 'ceph pg dump' command hangs indefinitely. See:
I'm pretty sure this is causing the caps.sh failures (stuck on pg dump). See http://pulpito.ceph.com/sage-2018-05-01_02:09:39-rados-wip-sage3-testing-2018-04-30-1610-distro-basic-smithi/
```
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ expected_ret=13
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ echo ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ eval ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stdout:ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
```

... then times out:

```
2018-05-01T04:51:12.987 ERROR:tasks.mon_thrash:Saw exception while triggering scrub
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage3-testing-2018-04-30-1610/qa/tasks/mon_thrash.py", line 301, in do_thrash
    self.manager.raw_cluster_cmd('scrub')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage3-testing-2018-04-30-1610/qa/tasks/ceph_manager.py", line 1134, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 423, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 155, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 177, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi200 with status 16: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph scrub'
```

/a/sage-2018-05-01_02:09:39-rados-wip-sage3-testing-2018-04-30-1610-distro-basic-smithi/2459042
```
client.foo
    key: AQDXRetar1sFKRAA2K6L/TAYRxvKCYj9lGTMrw==
    caps: [mon] allow command "auth ls", allow command mon_status
```

and `pg dump` works as client.admin, but as client.foo, ...

```
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 <== mon.2 172.21.15.136:6790/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (254546305 0 0) 0x7fd440000ee0 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700 10 monclient(hunting): my global_id is 504200
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 --> 172.21.15.136:6790/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fd434003a10 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 <== mon.2 172.21.15.136:6790/0 3 ==== auth_reply(proto 2 -22 (22) Invalid argument) v1 ==== 24+0+0 (3542374562 0 0) 0x7fd440000ee0 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 mark_down 0x7fd4480532e0 -- 0x7fd4480851c0
```

and it hangs.
The mon side dropped the "mon_subscribe({mgrmap=0+})" on the floor due to "not enough caps for", so the mgrclient was waiting for the mgrmap in vain.
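A hedged sketch of the mon-side gate being described here; the real check goes through Ceph's MonCap machinery, and `caps_allow_read` below is an illustrative stand-in. The key point is that a subscribe request from a session without sufficient caps is dropped without any reply, which is exactly what leaves a blocking client waiting forever:

```cpp
#include <iostream>
#include <string>

// Illustrative session state; client.foo's caps above grant only specific
// commands, not the read access a map subscription needs.
struct Session {
  bool caps_allow_read;
};

bool handle_subscribe(const Session& s, const std::string& what) {
  if (!s.caps_allow_read) {
    // Dropped on the floor: no reply and no error reaches the client,
    // so an MgrClient blocking on the mgrmap never wakes up.
    std::cerr << "not enough caps for mon_subscribe(" << what << ")\n";
    return false;
  }
  // ... register the subscription and send the current map ...
  return true;
}
```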
Closing in favour of #21811.
This EACCES case was motivated by auth caps issues
when the mgr was first being introduced, but now it's
just causing problems in the race condition where
start_command gets called before the first MMgrMap
arrives from the mon.
Fixes: https://tracker.ceph.com/issues/23627
Signed-off-by: John Spray <john.spray@redhat.com>