
mgr: block MgrClient::start_command until mgrmap #21698

Closed · wants to merge 3 commits

Conversation


@jcsp jcsp commented Apr 27, 2018

This EACCES case was motivated by auth caps issues
when mgr was first being introduced, but now it's
just causing problems in the race condition where
start_command gets called before first MMgrMap from mon.

Fixes: https://tracker.ceph.com/issues/23627
Signed-off-by: John Spray john.spray@redhat.com

@tchaikov tchaikov added this to the mimic milestone Apr 27, 2018
@liewegas (Member)

One implication here is if you use a new (mimic) client against a pre-luminous cluster, we'll block indefinitely instead of getting EACCES.

Maybe we can do a feature check?


tchaikov commented Apr 27, 2018

@liewegas you mean the combination of: a mimic ceph cli + pre-luminous librados

oic, it's a mimic ceph cli + mimic librados + pre-luminous cluster.

@liewegas (Member)

luminous cli + librados, jewel cluster. we won't get a mgrmap and 'mgr tell' would hang. from 3015f30

    mgr/MgrClient: assume missing MgrMap means no access to mgr at all
    
    If we get as far as authenticating and have no MgrMap that implies the
    mon didn't provide us one (despite our request) and we have no access to
    the mgr at all.
 

@tchaikov (Contributor)

retest this please.

John Spray added 3 commits April 30, 2018 11:36

This is for use when talking to pre-luminous
clusters, where we should not block waiting
for MgrMap because it might never come.

Fixes: https://tracker.ceph.com/issues/23627
Signed-off-by: John Spray <john.spray@redhat.com>

Signed-off-by: John Spray <john.spray@redhat.com>

This wasn't taking the MonClient lock: should use
with_monmap to protect access to MonClient::monmap.

Signed-off-by: John Spray <john.spray@redhat.com>

jcsp commented Apr 30, 2018

Updated to have behaviour depend on the luminous feature bit

@liewegas (Member)

This master failure may be related, BTW: a 'ceph pg dump' command hangs indefinitely. See
/a/sage-2018-04-30_00:12:46-rados-wip-sage3-testing-2018-04-29-1658-distro-basic-smithi/2453718


liewegas commented May 1, 2018


yuriw commented May 2, 2018


liewegas commented May 2, 2018

2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ expected_ret=13
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ echo ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stderr:+ eval ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2018-05-01T04:12:02.761 INFO:tasks.workunit.client.0.smithi134.stdout:ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
...

then times out,

2018-05-01T04:51:12.987 ERROR:tasks.mon_thrash:Saw exception while triggering scrub
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage3-testing-2018-04-30-1610/qa/tasks/mon_thrash.py", line 301, in do_thrash
    self.manager.raw_cluster_cmd('scrub')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage3-testing-2018-04-30-1610/qa/tasks/ceph_manager.py", line 1134, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 423, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 155, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 177, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi200 with status 16: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph scrub'

/a/sage-2018-05-01_02:09:39-rados-wip-sage3-testing-2018-04-30-1610-distro-basic-smithi/2459042
(saw the same thing yesterday)

@liewegas liewegas changed the base branch from master to mimic May 3, 2018 18:17

liewegas commented May 3, 2018

client.foo
        key: AQDXRetar1sFKRAA2K6L/TAYRxvKCYj9lGTMrw==
        caps: [mon] allow command "auth ls", allow command mon_status

and pg dump works as client.admin, but as client.foo,

...
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 <== mon.2 172.21.15.136:6790/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (254546305 0 0) 0x7fd440000ee0 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700 10 monclient(hunting): my global_id is 504200
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 --> 172.21.15.136:6790/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fd434003a10 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 <== mon.2 172.21.15.136:6790/0 3 ==== auth_reply(proto 2 -22 (22) Invalid argument) v1 ==== 24+0+0 (3542374562 0 0) 0x7fd440000ee0 con 0x7fd4480532e0
2018-05-03 18:59:33.577 7fd44f861700  1 -- 172.21.15.72:0/1114096433 mark_down 0x7fd4480532e0 -- 0x7fd4480851c0

and then it hangs.


tchaikov commented May 4, 2018

2018-05-04 05:20:56.498 7f8346856700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({mgrmap=0+}) v3
2018-05-04 05:20:56.498 7f8346856700 20 is_capable service=mon command= read on cap allow command "auth ls", allow command mon_status
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command "auth ls"
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command mon_status
2018-05-04 05:20:56.498 7f8346856700  5 mon.b@1(peon) e1 handle_subscribe client.64135 172.21.6.106:0/896299522 not enough caps for mon_subscribe({mgrmap=0+}) v3 -- dropping
2018-05-04 05:20:56.498 7f8346856700  1 -- 172.21.6.106:6789/0 <== client.64135 172.21.6.106:0/896299522 6 ==== mon_subscribe({osdmap=0}) v3 ==== 27+0+0 (2355785087 0 0) 0x55ed4e08fb00 con 0x55ed4dd52c60
2018-05-04 05:20:56.498 7f8346856700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x55ed4ea2cfc0 for client.? 172.21.6.106:0/896299522
2018-05-04 05:20:56.498 7f8346856700 20 mon.b@1(peon) e1  caps allow command "auth ls", allow command mon_status
2018-05-04 05:20:56.498 7f8346856700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({osdmap=0}) v3
2018-05-04 05:20:56.498 7f8346856700 20 is_capable service=mon command= read on cap allow command "auth ls", allow command mon_status
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command "auth ls"
2018-05-04 05:20:56.498 7f8346856700 20  allow so far , doing grant allow command mon_status
2018-05-04 05:20:56.498 7f8346856700  5 mon.b@1(peon) e1 handle_subscribe client.64135 172.21.6.106:0/896299522 not enough caps for mon_subscribe({osdmap=0}) v3 -- dropping

The mon side dropped the mon_subscribe({mgrmap=0+}) on the floor due to "not enough caps", so the MgrClient was waiting for the mgrmap in vain.


jcsp commented May 4, 2018

Closing in favour of #21811

@jcsp jcsp closed this May 4, 2018
@jcsp jcsp deleted the wip-23627 branch May 4, 2018 09:14