
ceph-ansible: "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring" #1380

Closed · gator1 opened this issue Mar 16, 2017 · 27 comments


gator1 commented Mar 16, 2017

Hi,

I set up seven KVM CentOS-7-x86_64-Minimal-1611 VMs.
They use a bridged network on the host with static IP addresses.

I set up a user cephuser in the wheel group, but ceph-ansible doesn't like that user.

Now I use root instead. I tried to set up three monitors (mon-node1-3) and three OSDs (osd-node1-3).

I run ansible-playbook site.yml -u root with the attached all file, and with site.yml copied unchanged from the sample.

I checked /etc/ceph on mon-node1 and there is no keyring file.

Do I need a certain utility for this to work? ssh-keygen works.

Please help.

all.txt


TASK [ceph-mon : wait for ceph.client.admin.keyring exists] ********************
fatal: [mon-node1]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [mon-node3]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [mon-node2]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
to retry, use: --limit @/root/ceph-ansible/site.retry

PLAY RECAP *********************************************************************
mon-node1 : ok=41 changed=1 unreachable=0 failed=1
mon-node2 : ok=39 changed=1 unreachable=0 failed=1
mon-node3 : ok=39 changed=1 unreachable=0 failed=1
osd-node1 : ok=2 changed=0 unreachable=0 failed=0
osd-node2 : ok=2 changed=0 unreachable=0 failed=0
osd-node3 : ok=2 changed=0 unreachable=0 failed=0


gator1 commented Mar 17, 2017

The monitors kind of started, but Ceph needs the keys on the monitors.

There must be some simple issue. From the deploy node (called mgmt) I can ssh root@mon-node1 without a password. Does that mean I did everything correctly regarding the keys?

[root@mon-node1 ceph]# ceph -s
2017-03-16 17:37:42.112015 7f0f40feb700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2017-03-16 17:37:42.113360 7f0f387f8700 0 -- :/3462189818 >> 10.145.82.102:6789/0 pipe(0x7f0f3c05dbf0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f3c05eeb0).fault
2017-03-16 17:37:45.113535 7f0f386f7700 0 -- :/3462189818 >> 10.145.82.103:6789/0 pipe(0x7f0f24000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f24001f90).fault
2017-03-16 17:37:48.113716 7f0f387f8700 0 -- :/3462189818 >> 10.145.82.102:6789/0 pipe(0x7f0f240052b0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f24006570).fault
2017-03-16 17:37:54.114394 7f0f385f6700 0 -- 10.145.82.101:0/3462189818 >> 10.145.82.102:6789/0 pipe(0x7f0f240052b0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f24004620).fault
2017-03-16 17:37:57.114570 7f0f387f8700 0 -- 10.145.82.101:0/3462189818 >> 10.145.82.103:6789/0 pipe(0x7f0f24000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f24002350).fault
2017-03-16 17:38:03.114993 7f0f386f7700 0 -- 10.145.82.101:0/3462189818 >> 10.145.82.103:6789/0 pipe(0x7f0f24000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f24007c40).fault
2017-03-16 17:38:09.115629 7f0f385f6700 0 -- 10.145.82.101:0/3462189818 >> 10.145.82.102:6789/0 pipe(0x7f0f24000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f24007e10).fault
2017-03-16 17:38:12.115703 7f0f386f7700 0 -- 10.145.82.101:0/3462189818 >> 10.145.82.103:6789/0 pipe(0x7f0f240052b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f2400aa70).fault
2017-03-16 17:38:15.115913 7f0f385f6700 0 -- 10.145.82.101:0/3462189818 >> 10.145.82.102:6789/0 pipe(0x7f0f24000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f2400b340).fault
2017-03-16 17:38:21.118907 7f0f385f6700 0 -- 10.145.82.101:0/3462189818 >> 10.145.82.102:6789/0 pipe(0x7f0f24000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0f2400b340).fault

WingkaiHo commented:

Check the Monitor options in group_vars/all.yml:

Set either monitor_interface or monitor_address.
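
For example, the relevant lines in group_vars/all.yml might look like this (a minimal sketch; the interface name and address below are placeholders for your own environment):

# group_vars/all.yml (excerpt)
monitor_interface: eth0            # NIC whose address the monitors bind to
# or, alternatively:
# monitor_address: 10.145.82.101   # explicit address of this monitor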


gator1 commented Mar 22, 2017

I used eth0 for the monitor interface.

WingkaiHo commented:

Which version did you select? In 10.2.5 the admin key is created before ceph-mon starts; the monitor service file looks like this:
...
After=network-online.target local-fs.target time-sync.target ceph-create-keys@%i.service
Wants=network-online.target local-fs.target time-sync.target ceph-create-keys@%i.service
...

In version 11.x it is created by the ansible playbook instead, in roles/ceph-mon/tasks/ceph_keys.yml, by the task "collect admin and bootstrap keys (for or after kraken release)".


gator1 commented Apr 4, 2017

I tried 9 and 10, not 11. I am not sure what "In 10.2.5 the admin key is created before ceph-mon starts; the monitor service file looks like this..." means. Does it mean 10 and 9 won't work? I can try 11.


gator1 commented Apr 4, 2017

I tried 11 and got pretty much the same results. It can't generate the keyring in /etc/ceph on the monitor nodes.

TASK [ceph-mon : wait for ceph.client.admin.keyring exists] ********************
fatal: [mon-node2]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [mon-node3]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [mon-node1]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}


gator1 commented Apr 4, 2017

More output:

TASK [ceph-mon : start the monitor service] ************************************
ok: [mon-node2]
ok: [mon-node3]
ok: [mon-node1]

TASK [ceph-mon : include] ******************************************************
included: /root/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml for mon-node1, mon-node2, mon-node3

TASK [ceph-mon : collect admin and bootstrap keys (for or after kraken release)] ***
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
ok: [mon-node1]
ok: [mon-node2]
ok: [mon-node3]

TASK [ceph-mon : wait for ceph.client.admin.keyring exists] ********************
fatal: [mon-node2]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [mon-node3]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [mon-node1]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}

RUNNING HANDLER [ceph.ceph-common : restart ceph mons] *************************

RUNNING HANDLER [ceph.ceph-common : restart ceph osds] *************************

RUNNING HANDLER [ceph.ceph-common : restart ceph mdss] *************************

RUNNING HANDLER [ceph.ceph-common : restart ceph rgws] *************************

RUNNING HANDLER [ceph.ceph-common : restart ceph nfss] *************************
to retry, use: --limit @/root/ceph-ansible/site.retry

PLAY RECAP *********************************************************************
mon-node1 : ok=42 changed=11 unreachable=0 failed=1
mon-node2 : ok=40 changed=11 unreachable=0 failed=1
mon-node3 : ok=40 changed=11 unreachable=0 failed=1
osd-node1 : ok=2 changed=0 unreachable=0 failed=0
osd-node2 : ok=2 changed=0 unreachable=0 failed=0
osd-node3 : ok=2 changed=0 unreachable=0 failed=0


leseb commented Apr 4, 2017

@gator1 please check whether your ceph-mon daemons are running. Look at the system logs for ceph-mon as well.
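
For example (assuming systemd units named after the monitor host, as ceph-ansible sets them up on CentOS 7):

systemctl status ceph-mon@mon-node1
journalctl -u ceph-mon@mon-node1 --no-pager | tail -n 50
tail -n 50 /var/log/ceph/ceph-mon.mon-node1.log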



gator1 commented Apr 4, 2017

on a monitor node:

[root@mon-node2 ~]# ceph -s

2017-04-04 09:51:13.217859 7f16a0b18700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory


gator1 commented Apr 4, 2017

ceph-mon is running. Does ceph-ansible create the ceph user?

[root@mon-node2 ceph]# ps -aux | grep ceph
root 2551 0.0 0.0 112648 956 pts/0 S+ 10:02 0:00 grep --color=auto ceph
ceph 31453 0.0 1.7 372568 66504 ? Ssl 01:46 0:06 /usr/bin/ceph-mon -f --cluster ceph --id mon-node2 --setuser ceph --setgroup ceph
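
For what it's worth, the ceph user and group are normally created by the Ceph RPM packages at install time rather than by ceph-ansible itself. A quick way to confirm, using standard commands:

getent passwd ceph
id ceph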


gator1 commented Apr 4, 2017

The ceph-mon log in /var/log doesn't seem to have any error other than:

2017-04-04 01:46:46.474482 7f448d3497c0 0 set uid:gid to 167:167 (ceph:ceph)
2017-04-04 01:46:46.474516 7f448d3497c0 0 ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7), process ceph-mon, pid 31453
2017-04-04 01:46:46.474585 7f448d3497c0 0 pidfile_write: ignore empty --pid-file
2017-04-04 01:46:46.515899 7f448d3497c0 0 load: jerasure load: lrc load: isa
2017-04-04 01:46:46.516709 7f448d3497c0 1 leveldb: Recovering log #6
2017-04-04 01:46:46.518279 7f448d3497c0 1 leveldb: Delete type=0 #6

2017-04-04 01:46:46.518318 7f448d3497c0 1 leveldb: Delete type=3 #4

2017-04-04 01:46:46.519499 7f448d3497c0 0 starting mon.mon-node2 rank 1 at 10.145.82.102:6789/0 mon_data /var/lib/ceph/mon/ceph-mon-node2 fsid ba09e207-d873-4bca-ad02-ee76d76e335d
2017-04-04 01:46:46.521027 7f448d3497c0 1 mon.mon-node2@-1(probing) e0 preinit fsid ba09e207-d873-4bca-ad02-ee76d76e335d
2017-04-04 01:46:46.521119 7f448d3497c0 1 mon.mon-node2@-1(probing) e0 initial_members mon-node1,mon-node2,mon-node3, filtering seed monmap
2017-04-04 01:46:46.521212 7f448d3497c0 1 mon.mon-node2@-1(probing).mds e0 Unable to load 'last_metadata'
2017-04-04 01:46:46.521954 7f448d3497c0 0 mon.mon-node2@-1(probing) e0 my rank is now 0 (was -1)
2017-04-04 01:47:46.521511 7f4484e37700 0 mon.mon-node2@0(probing).data_health(0) update_stats avail 73% total 6334 MB, used 1666 MB, avail 4667 MB

I attach the entire log here
ceph-mon.mon-node2.txt


leseb commented Apr 5, 2017

@gator1 try to run ceph-create-keys --cluster ceph --id <monitor hostname> on one of the monitors.

OrFriedmann commented:

Usually this error occurs when you use the wrong public address, the wrong cluster address, or maybe the wrong monitor interface.
The public address and the cluster address are CIDRs.
You can check the generated ceph.conf to see whether the CIDR is correct.


gator1 commented Apr 5, 2017

Running ceph-create-keys --cluster ceph --id mon-node1 and
ceph-create-keys --cluster ceph --id mon-node2 on each node keeps printing:
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
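
A monitor stuck in u'probing' has not yet found its peers, so ceph-create-keys cannot collect any keys. One way to inspect what each monitor actually sees (assuming the default admin socket path) is to run, on the mon node:

ceph daemon mon.mon-node1 mon_status

or, equivalently:

ceph --admin-daemon /var/run/ceph/ceph-mon.mon-node1.asok mon_status

The "quorum" and "outside_quorum" fields in the output show which peers the monitor has heard from.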


gator1 commented Apr 5, 2017

@OrFriedmann:

Here is the ceph.conf that was created.
I only have eth0 on each node, so the monitor interface is eth0 and the public and cluster networks are the same, 10.145.82.0/24.

This is a network without DHCP; each node is assigned a static IP.

The setup is on a physical server (36 cores, 128 GB RAM, 20 TB total SSD) running Ubuntu 16.04.
All the Ceph nodes are VMs running CentOS 7, with the network bridged on the host.


# Please do not change this file directly since it is managed by Ansible and will be overwritten

[global]
fsid = ba09e207-d873-4bca-ad02-ee76d76e335d
max open files = 131072
mon initial members = mon-node1,mon-node2,mon-node3
mon host = 10.145.82.101,10.145.82.102,10.145.82.103
public network = 10.145.82.0/24
cluster network = 10.145.82.0/24

[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor

[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = noatime,largeio,inode64,swalloc
osd journal size = 4096
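
The conf itself looks consistent; one quick way to rule out a connectivity or firewall problem between the monitors (a minimal check using bash's built-in /dev/tcp, with the addresses from the conf above) is, from mon-node1:

timeout 2 bash -c '</dev/tcp/10.145.82.102/6789' && echo "mon-node2:6789 reachable"
timeout 2 bash -c '</dev/tcp/10.145.82.103/6789' && echo "mon-node3:6789 reachable"

If either check hangs or fails, the mons cannot probe each other, which matches the u'probing' state above.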


piwi3910 commented Apr 8, 2017

Did you get it fixed? I seem to have the same issue. I tried it first on 4 VMs and it worked perfectly; now the same config on hardware gives the same issue you have.


piwi3910 commented Apr 8, 2017

I got it fixed by using the FQDN for the mons and making sure the mon interface is the one that maps to the FQDN.


gator1 commented Apr 10, 2017

I don't have FQDNs; there is no DNS server on my network. I finally bypassed this by following this link:
http://www.virtualtothecore.com/en/adventures-ceph-storage-part-4-deploy-the-nodes-in-the-lab/

I did everything in that link about the system; I am not sure which step did it, maybe SELinux?

Now ceph -s reports HEALTH_WARN. It seems it will be another long battle to fix that.


leseb commented Jul 3, 2017

No activity. Closing this, feel free to re-open.


Masber commented Aug 22, 2017

I am having the same issue.

The ceph-ansible deployment fails:

TASK [ceph-mon : collect admin and bootstrap keys] *****************************
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
[DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
This feature will be removed in version 2.4. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
ok: [ceph-4-3]
ok: [ceph-4-2]

TASK [ceph-mon : wait for ceph.client.admin.keyring exists] ********************
fatal: [ceph-4-2]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}
fatal: [ceph-4-3]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for file /etc/ceph/ceph.client.admin.keyring"}

And I can't create keyrings:

[root@ceph-4-2 ~]# ceph-create-keys --cluster ceph --id ceph-4-2
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'

ansible hosts file:

[root@kolla ceph-ansible]# cat /etc/ansible/hosts
[mons]
ceph-4-3
ceph-4-2

[osds]
ceph-4-3
ceph-4-2

[rgws]
ceph-4-2

[mgr]
ceph-4-2

this is the ceph.conf file:

# Please do not change this file directly since it is managed by Ansible and will be overwritten

[global]
fsid = 1439d592-555b-4ebc-89f7-4d93df2ae4b0
max open files = 131072
mon initial members = ceph-4-2
mon host = 10.1.0.42

public network = 10.1.0.0/16
cluster network = 10.1.0.0/16

[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor

[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = noatime,largeio,inode64,swalloc
osd journal size = 5120


[client.rgw.ceph-4-2]
host = ceph-4-2
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-4-2/keyring
rgw socket path = /tmp/radosgw-ceph-4-2.sock
log file = /var/log/ceph/ceph-rgw-ceph-4-2.log
rgw data = /var/lib/ceph/radosgw/ceph-rgw.ceph-4-2
rgw frontends = civetweb port=192.168.20.42:8080 num_threads=100

Any ideas?


Masber commented Aug 22, 2017

I fixed my issue by changing the number of monitors from 2 to 1.
I think it would be nice if the ansible playbook could warn the user about this kind of error.


leseb commented Aug 22, 2017

@Masber normally we advise going with an odd number of monitors (quorum requires a strict majority, so an even count adds no failure tolerance over the odd count below it). So please stick with this recommendation.


ist0ne commented Oct 9, 2017

I got it fixed by running:

systemctl disable firewalld
systemctl stop firewalld

sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

reboot

acomisario commented:

@ist0ne that was the answer!

aizuddin85 commented:

Use these rules to allow Ceph to form the cluster while keeping the firewall running. The rules cover the OSD, MON, RADOSGW, and MDS ports (the 7480 rule needs --permanent too, otherwise the reload drops it):

firewall-cmd --zone=public --add-port=6789/tcp --permanent
firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
firewall-cmd --zone=public --add-port=7480/tcp --permanent
firewall-cmd --permanent --zone=public --add-port=443/tcp
firewall-cmd --permanent --zone=public --add-port=80/tcp
firewall-cmd --reload
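
After the reload, the open ports can be verified with:

firewall-cmd --zone=public --list-ports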

mflannery commented:

I ran into the same issue. Executing the firewall commands above fixed it; the SELinux changes further up did not. Try the firewall first.

Cheers!
Mike
