Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ceph-salt can no longer bootstrap pacific as of ceph commit 5c9227bbdf987365ae90433ac4e501e1a88b7aca #463

Closed
smithfarm opened this issue Nov 8, 2021 · 8 comments · Fixed by #465
Assignees
Labels
bug Something isn't working regression

Comments

@smithfarm
Copy link
Contributor

As of ceph/ceph@5c9227bbdf98 ceph-salt can no longer bootstrap pacific clusters:

    master: [2021-11-08 09:44:23.138464] [master.pacific-m] Finished with failures
    master: Failure in minion: master.pacific-mini.test
    master: __id__: add host to ceph orch
    master: __run_num__: 104
    master: __sls__: ceph-salt.apply.cephorch
    master: changes: {}
    master: comment: "Warning: Permanently added 'master.pacific-mini.test' (ECDSA) to the list\
    master:   \ of known hosts.\r\nError EINVAL: Can not automatically resolve ip address of host\
    master:   \ where active mgr is running. Please explicitly provide the address."
    master: duration: 748.424
    master: name: add host to ceph orch
    master: result: false
    master: start_time: '09:44:22.328683'
    master: state: ceph_orch_|-add host to ceph orch_|-add host to ceph orch_|-add_host
    master: Finished execution of ceph-salt formula

ANALYSIS:

The relevant yaml is:

{{ macros.begin_stage('Add host to ceph orchestrator') }}
add host to ceph orch:
  ceph_orch.add_host:
    - host: {{ my_hostname }}
    - is_admin: {{ is_admin }}
    - failhard: True
{{ macros.end_stage('Add host to ceph orchestrator') }}

which triggers the following Salt State code:

and my_hostname is getting populated by this line:

{% set my_hostname = salt['ceph_salt.hostname']() %}

which in turns triggers the following function:

def hostname():
    return socket.gethostname()

All this seems to mean that the hostname returned by socket.gethostname() is passed to the following function, add_host:

def add_host(name, host, is_admin=False):
    """ 
    Requires the following grains to be set:
      - ceph-salt:execution:admin_host
    """
    ret = {'name': name, 'changes': {}, 'comment': '', 'result': False}
    admin_host = __salt__['grains.get']('ceph-salt:execution:admin_host')
    cmd_ret = __salt__['ceph_salt.ssh'](
                       admin_host,
                       "sudo ceph orch host add {}{}".format(
                           host,
                           ' --labels _admin' if is_admin else '', 
                        ),  
                       attempts=10)
    if cmd_ret['retcode'] == 0:
        ret['result'] = True
    else:
        ret['comment'] = cmd_ret.get('stderr')
    return ret 

Now, when I try socket.gethostname() in the sesdev environment, it returns the short hostname (master). And it seems that cephadm is expecting to get the short hostname here. But I couldn't find any log message to confirm that ceph-salt is really sending the short hostname.

@smithfarm
Copy link
Contributor Author

The commit ceph/ceph@5c9227bbdf98 comes originally from here - ceph/ceph#42772 - which was added to fix https://tracker.ceph.com/issues/51667

The description of that PR says:

force users to use set-addr or label add/rm to modify host
this prevents labels being removed if a host is readded
or addr being set to 127.0.... if readding host active mgr is on

which is rather cryptic, but I did confirm that ceph-salt is not issuing any ceph orch host set-addr command. Maybe it should be?

What's weird is that the error is being triggered when the host is first being added. The PR description seems to indicate it was intended to fix a bug in subsequent host modification.

@smithfarm smithfarm added bug Something isn't working regression labels Nov 8, 2021
@sebastian-philipp
Copy link
Contributor

cc @Daniel-Pivonka

@adk3798
Copy link

adk3798 commented Nov 8, 2021

From what I can see, the linked PR includes a check that requires explicitly providing an address when calling "ceph orch host add" for the host the active mgr is running on. I think this was to prevent issues where the active mgr, when trying to resolve its own ip, would resolve to a loopback address rather than a proper ip. During cephadm bootstrap this typically isn't an issue because we explicitly pass the --mon-ip arg as the address for the bootstrap host (where the active mgr is at the time). If ceph-salt is doing things differently and just calling "ceph orch host add" without an explicit ip addr for the bootstrap host then it would trigger this. Perhaps that check is a bit too severe and we should only throw errors if the address actually resolves to a loopback address?

@smithfarm
Copy link
Contributor Author

@sebastian-philipp This is happening post-bootstrap, when ceph-salt tries to add the cluster hosts to the orchestrator.

@smithfarm
Copy link
Contributor Author

smithfarm commented Nov 8, 2021

a check that requires explicitly providing an address when calling "ceph orch host add" for the host the active mgr is running on

How would this be done, syntactically speaking?

UPDATE: nvm

orch host add <hostname> [<addr>] [<labels>...] [-- Add a host
 maintenance]

All clear now. But we'll need to patch ceph-salt for this.

@sebastian-philipp
Copy link
Contributor

Hm, this thing is: we should not make this a requirement, as this is often just not necessary, cause we can deduce the IP of many hosts properly anyway.

@smithfarm
Copy link
Contributor Author

smithfarm commented Nov 8, 2021

Yeah, I think many will be surprised when they hit this error, especially since the error message:

  1. does not make clear why mgr/cephadm thinks the IP address cannot be determined
  2. does not say what the user should do to make the command succeed

At the very least it could say something like: "re-run the command providing the host's IP address as a second argument".

smithfarm added a commit to smithfarm/ceph-salt that referenced this issue Nov 8, 2021
The upstream cephadm developers recently added a commit
- 0facfac91fd8f71e5a8b869d818e7c2b07b93516 - which was backported to pacific
and was found to be causing our "ceph orch host add" command to fail because we
weren't specifying the IP address explicitly.

While the developers are now discussing whether the command really should fail
in this way, it seems prudent to give our IP address here explicitly, since
mgr/cephadm is going to try to resolve the hostname, anyway.

Fixes: ceph#463

Signed-off-by: Nathan Cutler <ncutler@suse.com>
@Daniel-Pivonka
Copy link

This error message will only appear when trying to readd the host that the active mgr is running on. we do not want to have the active mgr try to resolve its own ip as there is some weirdness with the /etc/hosts file inside containers and can result in the ip resolving to 12.7.0.0.1 which is not correct.

Im not sure we should be explaining the weirdness of the container /etc/host file in this error message as that would be a long wordy error message.

and the error does say the user needs to provide an ip address to make this command succeed.
the error message needs to be generic enough that its applicable to both adding hosts via cli or yml spec so it cant say explicitly say add the ip to the second argument.

if ceph-salt is readding the host the active mgr is running it need to explicitly provide the ip address

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants