
mgr/cephadm: Remove gateway.conf from iscsi pool when service is removed #40313

Merged (1 commit) on May 4, 2021

Conversation

@jmolmo (Member) commented Mar 22, 2021:

fixes: https://tracker.ceph.com/issues/48930

Signed-off-by: Juan Miguel Olmo Martínez jolmomar@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@jmolmo jmolmo requested a review from a team March 22, 2021 15:15
@liewegas liewegas requested a review from dillaman March 22, 2021 15:21
@dillaman left a comment:
lgtm

@mgfritch (Contributor) left a comment:

also remove the iscsi-gateway.key and iscsi-gateway.crt config keys during purge?

@dillaman commented Mar 23, 2021:

> also remove the iscsi-gateway.key and iscsi-gateway.crt config keys during purge?

I thought those are pulled from the MGR config-key store now? or are you referring to pulling them out of the MGR config-key store?

with self.mgr.rados.open_ioctx(spec.pool) as ioctx:
    ioctx.remove_object("gateway.conf")
    logger.debug(f'<gateway.conf> removed from {spec.pool}')
except rados.ObjectNotFound:
A Contributor commented:

I think we should also allow other exceptions here, like access forbidden or something. I don't see a point in retrying purge until it succeeds.

Suggested change
-except rados.ObjectNotFound:
+except rados.Error:
+    logger.exception(f'failed to purge {service_name}')

@mgfritch (Contributor) replied:

> also remove the iscsi-gateway.key and iscsi-gateway.crt config keys during purge?
>
> I thought those are pulled from the MGR config-key store now? or are you referring to pulling them out of the MGR config-key store?

yeah, remove them from the mgr config-key store.

similar to this:

def purge(self, service_name: str) -> None:
    self.mgr.check_mon_command({
        'prefix': 'config rm',
        'who': utils.name_to_config_section(service_name),
        'name': 'rgw_realm',
    })
    self.mgr.check_mon_command({
        'prefix': 'config rm',
        'who': utils.name_to_config_section(service_name),
        'name': 'rgw_zone',
    })
    self.mgr.check_mon_command({
        'prefix': 'config-key rm',
        'key': f'rgw/cert/{service_name}',
    })

spec = cast(IscsiServiceSpec, self.mgr.spec_store[service_name].spec)
try:
    # remove service configuration from the pool
    with self.mgr.rados.open_ioctx(spec.pool) as ioctx:
A Contributor commented:

Next issue: this might hang indefinitely if the cluster is in trouble. That doesn't sound like a problem, but users need to be able to fix their cluster using cephadm, and this is going to fail if we're hanging indefinitely here.

wdyt of testing for HEALTH_OK here? @adk3798

A Member commented:

I think we should avoid HEALTH_OK tests; better to have a timeout. librados can do timeouts but it's not super well supported, and this is a shared client instance; it might be better to shell out to the rados CLI to do this.
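The shell-out-with-timeout idea above can be sketched roughly as follows. This is a minimal illustration, not cephadm's actual implementation; the helper name is invented and the 5-second default is an arbitrary choice:

```python
import subprocess
from typing import List, Tuple


def run_cli_with_timeout(cmd: List[str], timeout: float = 5.0) -> Tuple[bool, str]:
    """Run an external CLI command (e.g. the rados tool) with a hard
    timeout, so a troubled cluster cannot hang the caller indefinitely.

    Returns (success, stderr text); a timeout counts as failure.
    """
    try:
        result = subprocess.run(cmd,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, f'timed out after {timeout}s'
    return result.returncode == 0, result.stderr.decode(errors='replace')
```

With something like this, the purge path could call `run_cli_with_timeout(['rados', '-p', pool, 'rm', 'gateway.conf'], timeout=10)` and log a warning on failure instead of blocking the mgr.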

@jmolmo jmolmo requested review from liewegas and sebastian-philipp and removed request for dillaman April 7, 2021 12:46
spec = cast(IscsiServiceSpec, self.mgr.spec_store[service_name].spec)
try:
    # remove service configuration from the pool
    rm_pool_thread = Thread(target=remove_pool, args=(self.mgr, spec.pool))
A Contributor commented:

I think we need to shell out; just a thread might not help us.

@jmolmo (author) replied:

How do you suggest we shell out? The multiprocessing module would add another mgr instance to the container and could cause problems.

@jmolmo (author):

I think "shelling out" the rados command can only be achieved through "_run_cephadm" (or by making it possible to use ceph commands inside the mgr container). If we try to execute the "rados" command directly in the mgr container we get this:

[ceph: root@cephlab2-node-00 /]# rados  -p myscsi ls
2021-04-12T14:45:42.483+0000 7f90c006cd00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-04-12T14:45:42.483+0000 7f90c006cd00 -1 AuthRegistry(0x55fd4904f440) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
2021-04-12T14:45:42.484+0000 7f90c006cd00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-04-12T14:45:42.484+0000 7f90c006cd00 -1 AuthRegistry(0x7ffd0e55ad00) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
2021-04-12T14:45:42.485+0000 7f90b821d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2021-04-12T14:45:42.486+0000 7f90b8a1e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2021-04-12T14:45:42.486+0000 7f90b921f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2021-04-12T14:45:42.486+0000 7f90c006cd00 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
failed to fetch mon config (--no-mon-config to skip)
 
 
 
[ceph: root@cephlab2-node-00 /]# rados -c /etc/ceph/ceph.conf -p myscsi ls  
2021-04-12T14:47:23.129+0000 7f5b8cd5bd00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-04-12T14:47:23.129+0000 7f5b8cd5bd00 -1 AuthRegistry(0x562c24dfb440) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
2021-04-12T14:47:23.130+0000 7f5b8cd5bd00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-04-12T14:47:23.130+0000 7f5b8cd5bd00 -1 AuthRegistry(0x7fff123a85e0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
2021-04-12T14:47:23.130+0000 7f5b85f0e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2021-04-12T14:47:23.131+0000 7f5b84f0c700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2021-04-12T14:47:23.131+0000 7f5b8570d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2021-04-12T14:47:23.131+0000 7f5b8cd5bd00 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
failed to fetch mon config (--no-mon-config to skip) 

@jmolmo (author):

What is the preferred implementation: _run_cephadm, or making the mgr container able to execute ceph/rados commands?

A Member replied:

-            cmd = ['radosgw-admin',
-                   '--key=%s' % keyring,
-                   '--user', 'rgw.%s' % rgw_id,
-                   'realm', 'list',
-                   '--format=json']
-            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-            out = result.stdout
-            if not out:
-                return []
-            try:
-                j = json.loads(out)
-                return j.get('realms', [])
-            except Exception:
-                raise OrchestratorError('failed to parse realm info')

For the keyring, pass the keyring file for the mgr itself, which should be rados -k /var/lib/ceph/mgr/ceph-{{mgr_id}}/keyring -i {{mgr_id}} ...
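Following that suggestion, the command line could be assembled along these lines. The helper is invented for illustration, and the `-n mgr.<id>` spelling of the auth-name flag is an assumption (the comment above uses `-i {{mgr_id}}`); only the keyring path is taken directly from the discussion:

```python
from typing import List


def build_rados_rm_cmd(mgr_id: str, pool: str, obj: str) -> List[str]:
    """Build a rados CLI invocation that authenticates with the mgr's own
    keyring, as suggested above.

    The '-n mgr.<id>' auth-name flag is an assumption about the exact
    option spelling; check it against the rados CLI help before relying
    on it.
    """
    keyring = f'/var/lib/ceph/mgr/ceph-{mgr_id}/keyring'
    return ['rados',
            '-k', keyring,          # authenticate with the mgr keyring
            '-n', f'mgr.{mgr_id}',  # assumed auth-name flag
            '-p', pool,
            'rm', obj]
```

The resulting list can then be handed to subprocess.run with a timeout, keeping command construction separate from execution.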

    hosts=self.mgr._hosts_with_daemon_inventory(),
    daemons=[],
    networks=self.mgr.cache.networks)
_, slots, _ = ha.place()
A Member commented:

Shouldn't we just look at the existing daemons, i.e. self.mgr.cache.find_daemons_by_service(...)?

@jmolmo (author) replied:

When we reach this part, all the iscsi daemons have already been removed. The only thing left from the deleted iscsi service is the spec file. Recalculating the host assignment is the only way I found to remove the config settings that reference the hosts where the iscsi daemons were deployed.

A Member replied:

Can you put it in the post_remove() method instead then? That takes a DaemonDescription for the just-removed daemon.
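A rough sketch of that alternative: per-daemon cleanup in a post_remove() hook, using the removed daemon's own hostname instead of re-running placement. The class name, config-key layout, and the `hostname` field are all illustrative assumptions, not the real cephadm code:

```python
class IscsiServiceSketch:
    """Illustration only: clean up host-specific config keys in a
    post_remove()-style hook that receives the just-removed daemon."""

    def __init__(self, mgr) -> None:
        self.mgr = mgr

    def post_remove(self, daemon) -> None:
        # 'daemon' is assumed to carry the hostname of the just-removed
        # iscsi daemon; the config-key naming below is invented.
        for name in ('iscsi-gateway.crt', 'iscsi-gateway.key'):
            self.mgr.check_mon_command({
                'prefix': 'config-key rm',
                'key': f'iscsi/{name}/{daemon.hostname}',
            })
```

The advantage over recomputing placement is that each removed daemon identifies exactly which host-scoped keys it owned.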

@jmolmo (author) commented Apr 26, 2021:

jenkins test make check

# remove service configuration from the pool
try:
    mgr_id = self.mgr.get_mgr_id()
    keyring = f'/var/lib/ceph/mgr/ceph-{mgr_id}/keyring'
A Member commented:

I just ended up doing this for the HA NFS PR; see 5a028ae#diff-b13a6b9b90f10d54533fcaa8b7d9db7f8d24ae39bb52cd611c28c9c39cb429ecR142

  • fetch the config option so it works with vstart
  • use the list form of subprocess.run instead of a string


# remove iscsi cert and key from ceph config
for iscsi_key, value in iscsi_config_dict.items():
    if daemon.hostname in iscsi_key:
A Member commented:

Shouldn't this check for a / or something? Otherwise, if the host is, say, isc, every key containing that substring will be removed.
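The substring pitfall can be shown directly, along with a delimiter-aware check that avoids it. The `<prefix>/<name>/<hostname>` key layout is assumed for illustration:

```python
def key_matches_host(key: str, hostname: str) -> bool:
    """Match a per-host config key by its trailing '/'-delimited segment,
    so that a host named 'isc' does not also match keys belonging to a
    host named 'iscsi-gw-1'."""
    return key.split('/')[-1] == hostname


# plain substring matching over-matches:
assert 'isc' in 'iscsi/iscsi-gateway.key/iscsi-gw-1'
# the delimiter-aware check does not:
assert not key_matches_host('iscsi/iscsi-gateway.key/iscsi-gw-1', 'isc')
assert key_matches_host('iscsi/iscsi-gateway.key/isc', 'isc')
```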

Remove gateway.conf from iscsi pool when service is removed
Remove iscsi ceph config keys
Remove iscsi dashboard gateways config from dashboard

fixes: https://tracker.ceph.com/issues/48930

Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@redhat.com>
@jmolmo (author) commented May 4, 2021:

Errors in:
https://pulpito.ceph.com/kchai-2021-04-30_10:06:45-rados-wip-kefu-testing-2021-04-30-1335-distro-basic-smithi/

See:
https://pulpito.ceph.com/kchai-2021-04-30_10:06:45-rados-wip-kefu-testing-2021-04-30-1335-distro-basic-smithi/6086127/
https://pulpito.ceph.com/kchai-2021-04-30_10:06:45-rados-wip-kefu-testing-2021-04-30-1335-distro-basic-smithi/6086137/

Details:

2021-04-30T04:21:05.860 DEBUG:teuthology.orchestra.run.smithi141:> sudo /home/ubuntu/cephtest/cephadm --image docker.io/ceph/daemon-base:latest-pacific -v bootstrap --fsid 7ac5e406-a96b-11eb-821c-001a4aab830c --config /home/ubuntu/cephtest/seed.ceph.conf --output-config /etc/ceph/ceph.conf --output-keyring /etc/ceph/ceph.client.admin.keyring --output-pub-ssh-key /home/ubuntu/cephtest/ceph.pub --mon-id a --mgr-id y --orphan-initial-daemons --skip-monitoring-stack --mon-ip 172.21.15.141 --skip-admin-label && sudo chmod +r /etc/ceph/ceph.client.admin.keyring
2021-04-30T04:21:05.885 INFO:journalctl@ceph.mgr.y.smithi141.stdout:-- Logs begin at Fri 2021-04-30 04:13:11 UTC. --
2021-04-30T04:21:06.118 INFO:teuthology.orchestra.run.smithi141.stderr:usage: cephadm [-h] [--image IMAGE] [--docker] [--data-dir DATA_DIR]
2021-04-30T04:21:06.118 INFO:teuthology.orchestra.run.smithi141.stderr:               [--log-dir LOG_DIR] [--logrotate-dir LOGROTATE_DIR]
2021-04-30T04:21:06.118 INFO:teuthology.orchestra.run.smithi141.stderr:               [--unit-dir UNIT_DIR] [--verbose] [--timeout TIMEOUT]
2021-04-30T04:21:06.118 INFO:teuthology.orchestra.run.smithi141.stderr:               [--retry RETRY] [--env ENV] [--no-container-init]
2021-04-30T04:21:06.119 INFO:teuthology.orchestra.run.smithi141.stderr:               {version,pull,inspect-image,ls,list-networks,adopt,rm-daemon,rm-cluster,run,shell,enter,ceph-volume,unit,logs,bootstrap,deploy,check-host,prepare-host,add-repo,rm-repo,install,registry-login,gather-facts,exporter,host-maintenance,verify-prereqs}
2021-04-30T04:21:06.119 INFO:teuthology.orchestra.run.smithi141.stderr:               ...
2021-04-30T04:21:06.119 INFO:teuthology.orchestra.run.smithi141.stderr:cephadm: error: unrecognized arguments: --skip-admin-label

It seems that the installed cephadm binary is old and does not have the new "--skip-admin-label" parameter introduced in #40941.

2021-04-30T10:27:36.607 INFO:teuthology.orchestra.run.smithi002.stdout:================================================================================
2021-04-30T10:27:36.607 INFO:teuthology.orchestra.run.smithi002.stdout: Package     Arch       Version                           Repository       Size
2021-04-30T10:27:36.607 INFO:teuthology.orchestra.run.smithi002.stdout:================================================================================
2021-04-30T10:27:36.607 INFO:teuthology.orchestra.run.smithi002.stdout:Installing:
2021-04-30T10:27:36.607 INFO:teuthology.orchestra.run.smithi002.stdout: cephadm     noarch     2:16.2.1-257.g717ce59b.el8        ceph-noarch      71 k

@sebastian-philipp (Contributor):

When backporting, please include 41181
