
Libvirt and Ceph: libvirtd tries to open random RBD images #8211

Closed · zap51 opened this issue Nov 9, 2023 · 13 comments

zap51 (Contributor) commented Nov 9, 2023

ISSUE TYPE
  • Bug Report
COMPONENT NAME
Storage, Ceph, RBD
CLOUDSTACK VERSION
4.16.1.0
CONFIGURATION

Advanced Networking

OS / ENVIRONMENT

Ubuntu Server 20.04 LTS on both Management and Hypervisors
Hypervisor: KVM

SUMMARY

libvirtd throws error messages like: Oct 25 13:38:11 hv-01 libvirtd[9464]: failed to open the RBD image '087bb114-448a-41d2-9f5d-6865b62eed15': No such file or directory

STEPS TO REPRODUCE
N/A
EXPECTED RESULTS
N/A
ACTUAL RESULTS
It appears that one of our clusters, consisting of 8 hosts, is having this issue. We run HCI on these 8 hosts and there are approximately 700+ VMs running. Strangely enough, logs like the ones below keep appearing on the hosts.


Oct 25 13:38:11 hv-01 libvirtd[9464]: failed to open the RBD image '087bb114-448a-41d2-9f5d-6865b62eed15': No such file or directory
Oct 25 20:35:22 hv-01 libvirtd[9464]: failed to open the RBD image 'ccc1168a-5ffa-4b6d-a953-8e0ac788ebc5': No such file or directory
Oct 26 09:48:33 hv-01 libvirtd[9464]: failed to open the RBD image 'a3fe82f8-afc9-4604-b55e-91b676514a18': No such file or directory
Oct 26 10:38:17 hv-01 libvirtd[9464]: End of file while reading data: Input/output error


We've got DNS servers with an `A` record resolving to the IPv4 addresses of all 8 monitors, and there have not been any issues with DNS resolution. But the "failed to open the RBD image 'ccc1168a-5ffa-4b6d-a953-8e0ac788ebc5': No such file or directory" error gets even weirder, because the VM that is using such an RBD image (say "087bb114-448a-41d2-9f5d-6865b62eed15") is running on an altogether different host, "hv-06". On further inspection of that specific virtual machine, it has been running on "hv-06" for more than 4 months (going by the "Last updated" field). Fortunately, the virtual machine has had no issues and has kept running since then.

We're noticing the same "failed to open the RBD image" errors on all the hosts in that cluster. No network issues or host-level problems have been observed. I was thinking of putting this host into maintenance, but the same errors show up on every host in the cluster. I've not yet gotten a chance to look at the management server logs, so I'm seeking assistance here.
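
In case it helps others debugging the same messages, here is a minimal sketch of how to check the state of a logged image from the Ceph side. The pool name `cloudstack` is an assumption; substitute your primary storage pool and the image UUID from the log.

```bash
# Is the image still listed in the pool? (pool name is an assumption)
rbd ls -p cloudstack | grep 087bb114-448a-41d2-9f5d-6865b62eed15

# Can the image actually be opened? This mirrors the operation that fails in the libvirtd log.
rbd info cloudstack/087bb114-448a-41d2-9f5d-6865b62eed15

# If it opens, the watchers listed here show which host currently has the image in use.
rbd status cloudstack/087bb114-448a-41d2-9f5d-6865b62eed15
```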

Thanks,
Jayanth Reddy

weizhouapache (Member) commented Nov 9, 2023

I suspect it is caused by the refresh of the Ceph pool.

@zap51
Can you get the UUID of the Ceph storage pool with "virsh pool-list", run "virsh pool-refresh <pool-uuid>", and then check whether similar errors appear?
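
For reference, a rough sequence for this check might look like the following sketch (the pool UUID placeholder and the `--since` window are just examples):

```bash
# List libvirt storage pools; the CloudStack RBD pool shows up under its UUID
virsh pool-list

# Refresh the RBD pool (substitute the UUID from the listing)
virsh pool-refresh <pool-uuid>

# Check whether the refresh produced new "failed to open the RBD image" messages
journalctl -u libvirtd --since "10 minutes ago" | grep "failed to open the RBD image"
```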

zap51 (Contributor, Author) commented Nov 10, 2023

@weizhouapache Thanks for the response.

The information is as follows:

# date
Fri Nov 10 05:27:42 UTC 2023

# journalctl -f -u libvirtd
-- Logs begin at Mon 2023-11-06 06:22:37 UTC. --
Nov 08 06:44:16 hv-01 libvirtd[9510]: failed to open the RBD image '1d1077d6-20c1-4a47-9134-94ecee3cc5fe': No such file or directory
Nov 08 08:04:20 hv-01 libvirtd[9510]: Domain id=80 name='i-13716-10743-VM' uuid=b7f4e563-c7f0-4e68-890b-d37521dd6396 is tainted: high-privileges
Nov 08 15:30:39 hv-01 libvirtd[9510]: failed to open the RBD image '7104880a-a752-4afc-9cb5-3c3a4f5418f5': No such file or directory
Nov 09 08:58:19 hv-01 libvirtd[9510]: failed to open the RBD image 'c2f96307-e957-473c-9959-d728752e9d96': No such file or directory
Nov 09 12:09:47 hv-01 libvirtd[9510]: failed to open the RBD image '099024bc-f798-4a84-801d-a0ee12bff6b0': No such file or directory

# virsh pool-list
 Name                                   State    Autostart
------------------------------------------------------------
 59fe6865-f6ad-451c-8a2e-7a0ce11df5f6   active   no
 c15508c7-5c2c-317f-aa2e-29f307771415   active   no
# virsh pool-info c15508c7-5c2c-317f-aa2e-29f307771415
Name:           c15508c7-5c2c-317f-aa2e-29f307771415
UUID:           c15508c7-5c2c-317f-aa2e-29f307771415
State:          running
Persistent:     no
Autostart:      no
Capacity:       1.25 PiB
Allocation:     489.52 TiB
Available:      787.36 TiB

I refreshed the pool, and it took around 20 to 30 seconds.

# virsh pool-refresh c15508c7-5c2c-317f-aa2e-29f307771415

Pool c15508c7-5c2c-317f-aa2e-29f307771415 refreshed

After running # journalctl -f -u libvirtd again, the output is the same as before. I'll keep inspecting the logs to see if the errors reappear.

Additional info:

# kvm --version
QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.27)
Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
# libvirtd --version
libvirtd (libvirt) 6.0.0

zap51 (Contributor, Author) commented Nov 13, 2023

Hi @weizhouapache,
It appears that we're still experiencing the same errors.

# journalctl -f -u libvirtd
-- Logs begin at Thu 2023-11-09 06:18:26 UTC. --
Nov 10 15:00:39 hv-01 libvirtd[9510]: failed to open the RBD image '510eaecc-ba97-4ed1-864d-2e75af8ae534': No such file or directory
Nov 11 01:52:27 hv-01 libvirtd[9510]: nl_recv returned with error: No buffer space available
Nov 11 10:18:07 hv-01 libvirtd[9510]: failed to open the RBD image '4c069ecc-ff81-4180-a504-56f55fdd0cd8': No such file or directory
Nov 11 12:31:47 hv-01 libvirtd[9510]: failed to open the RBD image 'b7dfe76f-3a69-4587-9870-bf5c48593485': No such file or directory
Nov 11 20:48:13 hv-01 libvirtd[9510]: failed to open the RBD image '16647b94-7f80-4599-9ae6-10dbe7387ce0': No such file or directory
Nov 12 14:21:32 hv-01 libvirtd[9510]: failed to open the RBD image 'a5a16123-d692-49bb-acb6-00c484006a03': No such file or directory
Nov 12 17:28:46 hv-01 libvirtd[9510]: internal error: End of file from qemu monitor
Nov 12 20:09:32 hv-01 libvirtd[9510]: failed to open the RBD image '830f41ba-d0f2-4e03-b7df-0407709b22b5': No such file or directory
Nov 13 05:28:14 hv-01 libvirtd[9510]: failed to open the RBD image '1409189f-1206-406a-a8f0-c135e1ec7d41': No such file or directory
Nov 13 05:33:55 hv-01 libvirtd[9510]: Domain id=83 name='i-15177-10871-VM' uuid=4464fbf5-b848-4f85-b758-6d79f5b9e5e5 is tainted: high-privileges

Thanks,
Jayanth Reddy

weizhouapache (Member) commented:

> Nov 13 05:28:14 hv-01 libvirtd[9510]: failed to open the RBD image '1409189f-1206-406a-a8f0-c135e1ec7d41': No such file or directory

@zap51
I have seen similar issues for many years.
I would say it is a libvirt error (definitely not CloudStack), and there is no impact on user VMs as far as I know.
Are the errors causing you any issues?

zap51 (Contributor, Author) commented Nov 13, 2023

Hi @weizhouapache,

Right. There is no impact on user VMs.

Thanks

zap51 (Contributor, Author) commented Nov 17, 2023

@weizhouapache, shall I raise this issue on the libvirt forums or mailing list instead? I've also been looking for similar reported issues but cannot find any.

Thanks

weizhouapache (Member) commented:

@zap51
Yes, you can ask the libvirt community.

zap51 (Contributor, Author) commented Nov 20, 2023

zap51 (Contributor, Author) commented Dec 2, 2023

DaanHoogland (Contributor) commented:

any updates @zap51 ?

zap51 (Contributor, Author) commented Jan 15, 2024

Hi @DaanHoogland @weizhouapache,
We seem to have figured out the issue. When a large block device exists, say 1 TiB, it usually takes some time to delete because a large number of RADOS objects are assigned to it. This is usually observed in large clusters with millions of objects. I was able to reproduce it this way (a command-level sketch follows after the list):

  1. Create a block device of size 20 TiB; Ceph allocates a few million RADOS objects to it. The block device ID is <pool_name>/370ffe6b-a536-401a-978f-14cb2f79b10f.
  2. The info command # rbd info 370ffe6b-a536-401a-978f-14cb2f79b10f works while the image is active and present.
  3. Now delete the image in CloudStack and run the same command again. While # rbd ls still shows the image, # rbd info gives:
# rbd info 370ffe6b-a536-401a-978f-14cb2f79b10f
rbd: error opening image 370ffe6b-a536-401a-978f-14cb2f79b10f: (2) No such file or directory
  4. The log entries appear in libvirtd because the CloudStack agent frequently asks libvirtd to refresh the pool. As per the libvirt forums, libvirt tries to open each RBD image to query its size (similar to rbd info), but for an image that is still being deleted it ends up with (2) No such file or directory.
  5. This is normal in large clusters and in clusters where slow_deletion of objects is configured.

These warnings can be safely ignored. Thanks to Libvirt, Ceph & CloudStack communities.
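
To see where the errors come from on the Ceph side, a minimal sketch (assuming a pool named `cloudstack`): iterate over the listed images and try to open each one, which is roughly what the pool refresh ends up doing when image sizes are queried. Images that are mid-deletion are still listed but can no longer be opened.

```bash
# Roughly mimic a pool refresh: list the images, then try to open each one.
# Images that are still being deleted remain in the listing but fail to open with
# "(2) No such file or directory", the same error libvirtd logs.
for img in $(rbd ls -p cloudstack); do
  rbd info "cloudstack/${img}" >/dev/null 2>&1 \
    || echo "cannot open ${img} (likely mid-deletion)"
done
```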

DaanHoogland (Contributor) commented:

ok, @zap51 closing this. please reopen or create a new issue when needed.

weizhouapache (Member) commented:

thanks @zap51 for the update. it explains the issue very clearly
