
mgr: block MgrClient::start_command until mgrmap #21698

Closed · wants to merge 3 commits

Conversation


@jcsp jcsp commented Apr 27, 2018

This EACCES case was motivated by auth caps issues
when mgr was first being introduced, but now it's
just causing problems in the race condition where
start_command gets called before first MMgrMap from mon.

Fixes: https://tracker.ceph.com/issues/23627
Signed-off-by: John Spray john.spray@redhat.com

@tchaikov tchaikov added this to the mimic milestone Apr 27, 2018
@liewegas (Member)

One implication here is if you use a new (mimic) client against a pre-luminous cluster, we'll block indefinitely instead of getting EACCES.

Maybe we can do a feature check?


tchaikov commented Apr 27, 2018

@liewegas you mean the combination of: a mimic ceph cli + pre-luminous librados

oic, it's a mimic ceph cli + mimic librados + pre-luminous cluster.

@liewegas (Member)

luminous cli + librados, jewel cluster. we won't get a mgrmap and 'mgr tell' would hang. from 3015f30

    mgr/MgrClient: assume missing MgrMap means no access to mgr at all
    
    If we get as far as authenticating and have no MgrMap that implies the
    mon didn't provide us one (despite our request) and we have no access to
    the mgr at all.
 

@tchaikov (Contributor)

retest this please.

John Spray added 3 commits April 30, 2018 11:36

This is for use when talking to pre-luminous
clusters, where we should not block waiting
for MgrMap because it might never come.

Fixes: https://tracker.ceph.com/issues/23627
Signed-off-by: John Spray <john.spray@redhat.com>

Signed-off-by: John Spray <john.spray@redhat.com>

This wasn't taking the MonClient lock: should use
with_monmap to protect access to MonClient::monmap.

Signed-off-by: John Spray <john.spray@redhat.com>

jcsp commented Apr 30, 2018

Updated to have behaviour depend on the luminous feature bit

@liewegas (Member)

This master failure may be related, BTW: a 'ceph pg dump' command hangs indefinitely. See
/a/sage-2018-04-30_00:12:46-rados-wip-sage3-testing-2018-04-29-1658-distro-basic-smithi/2453718


liewegas commented May 1, 2018


yuriw commented May 2, 2018


liewegas commented May 2, 2018

2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ expected_ret=13
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ echo ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ eval ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stdout:ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
...

then times out,

2018-05-01T04:51:12.987 ERROR:tasks.mon_thrash:Saw exception while triggering scrub
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage3-testing-2018-04-30-1610/qa/tasks/mon_thrash.py", line 301, in do_thrash
    self.manager.raw_cluster_cmd('scrub')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage3-testing-2018-04-30-1610/qa/tasks/ceph_manager.py", line 1134, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 423, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 155, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 177, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi200 with status 16: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph scrub'

/a/sage-2018-05-01_02:09:39-rados-wip-sage3-testing-2018-04-30-1610-distro-basic-smithi/2459042
(saw the same thing yesterday)

@liewegas liewegas changed the base branch from master to mimic May 3, 2018 18:17

liewegas commented May 3, 2018

client.foo
        key: AQDXRetar1sFKRAA2K6L/TAYRxvKCYj9lGTMrw==
        caps: [mon] allow command "auth ls", allow command mon_status

and pg dump works as client.admin, but as client.foo,

...
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 <== mon.2 172.21.15.136:6790/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (254546305 0 0) 0x7fd440000ee0 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700 10 monclient(hunting): my global_id is 504200
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 --> 172.21.15.136:6790/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fd434003a10 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 <== mon.2 172.21.15.136:6790/0 3 ==== auth_reply(proto 2 -22 (22) Invalid argument) v1 ==== 24+0+0 (3542374562 0 0) 0x7fd440000ee0 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 mark_down 0x7fd4480532e0 -- 0x7fd4480851c0

and then it hangs.


tchaikov commented May 4, 2018

2018-05-04 05:20:56.498 7f8346856700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({mgrmap=0+}) v3
2018-05-04 05:20:56.498 7f8346856700 20 is_capable service=mon command= read on cap allow command "auth ls", allow command mon_status
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command "auth ls"
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command mon_status
2018-05-04 05:20:56.498 7f8346856700  5 mon.b@1(peon) e1 handle_subscribe client.64135 172.21.6.106:0/896299522 not enough caps for mon_subscribe({mgrmap=0+}) v3 -- dropping
2018-05-04 05:20:56.498 7f8346856700  1 -- 172.21.6.106:6789/0 <== client.64135 172.21.6.106:0/896299522 6 ==== mon_subscribe({osdmap=0}) v3 ==== 27+0+0 (2355785087 0 0) 0x55ed4e08fb00 con 0x55ed4dd52c60
2018-05-04 05:20:56.498 7f8346856700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x55ed4ea2cfc0 for client.? 172.21.6.106:0/896299522
2018-05-04 05:20:56.498 7f8346856700 20 mon.b@1(peon) e1  caps allow command "auth ls", allow command mon_status
2018-05-04 05:20:56.498 7f8346856700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({osdmap=0}) v3
2018-05-04 05:20:56.498 7f8346856700 20 is_capable service=mon command= read on cap allow command "auth ls", allow command mon_status
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command "auth ls"
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command mon_status
2018-05-04 05:20:56.498 7f8346856700  5 mon.b@1(peon) e1 handle_subscribe client.64135 172.21.6.106:0/896299522 not enough caps for mon_subscribe({osdmap=0}) v3 -- dropping

The mon side dropped the mon_subscribe({mgrmap=0+}) on the floor due to "not enough caps", so the MgrClient was waiting for the mgrmap in vain.


jcsp commented May 4, 2018

Closing in favour of #21811

@jcsp jcsp closed this May 4, 2018
@jcsp jcsp deleted the wip-23627 branch May 4, 2018 09:14