migrateVolume does not copy the template #5759

Open
GutoVeronezi opened this issue Dec 8, 2021 · 20 comments

@GutoVeronezi
Contributor

When a volume is copied to another storage pool, if the template does not exist in the destination storage pool yet, it should be copied. The API migrateVirtualMachineWithVolume already does it; however, when migrating the volume with migrateVolume (VM stopped), the template is not copied, which causes inconsistencies in the database and primary storage systems.
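
For reference, the two code paths can be compared directly through the APIs; a minimal sketch, assuming a CloudMonkey (cmk) session and placeholder UUIDs (not values from an actual environment):

# Path that does NOT seed the template on the destination pool (VM stopped):
cmk migrateVolume volumeid=<volume-uuid> storageid=<destination-pool-uuid>

# Path that does seed the template, for comparison:
cmk migrateVirtualMachineWithVolume virtualmachineid=<vm-uuid> hostid=<destination-host-uuid>

# Afterwards, compare the destination primary storage: with migrateVolume the
# template file is missing even though the volume was moved there.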

@DaanHoogland
Contributor

Is this what you are fixing in #5758, @GutoVeronezi? Or do you mean that more improvement is needed after that?

@GutoVeronezi
Contributor Author

@DaanHoogland it will need more improvement after that.

@nvazquez
Contributor

nvazquez commented Feb 4, 2022

Hi @GutoVeronezi are you working on this issue or planning to?

@GutoVeronezi
Contributor Author

Hi @nvazquez, I have other priorities at the moment; this issue is not in my plans for now.

@nvazquez
Contributor

nvazquez commented Feb 4, 2022

Thanks for clarifying @GutoVeronezi

@nvazquez nvazquez added this to the 4.18.0.0 milestone Mar 6, 2022
@nvazquez nvazquez modified the milestones: 4.18.0.0, 4.17.1.0 Apr 20, 2022
@shwstppr
Contributor

Moving this to the next milestone.

@rohityadavcloud
Member

I've hit this, @GutoVeronezi, for local storage (but not shared storage). Did you reproduce this only with KVM + local storage, or also with other hypervisor/storage types?

@GutoVeronezi
Contributor Author

@rohityadavcloud, at the time I created this issue, I was running ACS 4.15 or 4.16 (I do not remember exactly). I will test it again and verify whether the situation still happens with the current version.

@DaanHoogland DaanHoogland modified the milestones: 4.18.0.0, 4.19.0.0 Jan 9, 2023
@weizhouapache weizhouapache self-assigned this Jan 26, 2023
@weizhouapache weizhouapache modified the milestones: 4.19.0.0, 4.18.1.0 Jan 26, 2023
@weizhouapache
Member

@rohityadavcloud @GutoVeronezi
I have tested with local storage as well as shared storage; both work.

migrateVolume will first copy the image to secondary storage (full clone) and then copy it to the other primary storage.
It should not be a bug, therefore I have removed the label 'type:bug'.

The whole process can of course be improved.
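
As context for why the backing file disappears: a full clone through secondary storage behaves like a plain qemu-img convert, which always writes a standalone image. A rough sketch with placeholder paths (not the exact commands the agent runs):

# Converting without specifying a backing file flattens the qcow2 chain:
qemu-img convert -O qcow2 /primary1/<volume-uuid> /secondary/<volume-uuid>
qemu-img convert -O qcow2 /secondary/<volume-uuid> /primary2/<volume-uuid>

# The copy on the destination pool then has no "backing file:" line:
qemu-img info /primary2/<volume-uuid>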

@weizhouapache weizhouapache modified the milestones: 4.18.1.0, 4.19.0.0 Jan 30, 2023
@weizhouapache weizhouapache removed their assignment Jan 30, 2023
@GutoVeronezi
Contributor Author


Thanks for the tests, @weizhouapache.

@weizhouapache
Member


@GutoVeronezi
please feel free to improve it.

@rohityadavcloud
Member

@weizhouapache there are some issues; cc @DaanHoogland @GutoVeronezi
While live migration of a running VM with storage works between KVM hosts with local storage, try this:

  1. Deploy a new VM; we can see the root disk has a backing file:

root@kvm1:/var/lib/libvirt/images# qemu-img info f9a311bb-509c-4f01-ad55-29eb8ff6166a
image: f9a311bb-509c-4f01-ad55-29eb8ff6166a
file format: qcow2
virtual size: 25 GiB (26843545600 bytes)
disk size: 1.51 GiB
cluster_size: 65536
backing file: /var/lib/libvirt/images/748e7f9d-7257-48dd-9efa-971396fee88f
backing file format: qcow2
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false

  2. Stop the VM and migrate the disk from kvm1 to kvm2 (local storage); now, on kvm2's local storage, the VM's disk has no backing file:

root@kvm2:/var/lib/libvirt/images# qemu-img info 6b4add3f-9014-4c76-ae45-cb19f830c3be
image: 6b4add3f-9014-4c76-ae45-cb19f830c3be
file format: qcow2
virtual size: 25 GiB (26843545600 bytes)
disk size: 1.5 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false

  3. Start the VM and try to live migrate it with storage back to the kvm1 host; it works or sometimes fails. When it works, I can see there is no backing file on the migrated disk (I migrated and shut down the VM so qemu-img info would work):

root@kvm1:/var/lib/libvirt/images# qemu-img info f919f5e7-eee8-4a64-ac6f-a4361dfcaa1b
image: f919f5e7-eee8-4a64-ac6f-a4361dfcaa1b
file format: qcow2
virtual size: 25 GiB (26843545600 bytes)
disk size: 1.51 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false

  4. For the failure case: I think if you try to migrate to the kvm3 host, which doesn't have the backing file (template seeded), the running VM crashes as the domain on the destination fails to run and we get the error:

2023-07-10 19:17:42,074 WARN [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:22f93a95) Caught:
com.cloud.utils.exception.CloudRuntimeException: Could not fetch storage pool 8aa9768c-cbcf-4e8e-8875-f94a7f9445b6 from libvirt due to org.libvirt.LibvirtException: Storage pool not found: no storage pool with matching uuid '8aa9768c-cbcf-4e8e-8875-f94a7f9445b6'
at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.getStoragePool(KVMStoragePoolManager.java:277)
at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.getStoragePool(KVMStoragePoolManager.java:263)

I think in step 4, when/if it fails, it's because it is somehow trying to use/find the local storage pool on the destination host, which does not exist there.
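
A quick way to confirm that on the destination host is to ask libvirt about the pool UUID from the error above (a sketch; the UUID is the one from the log):

virsh pool-list --all
virsh pool-info 8aa9768c-cbcf-4e8e-8875-f94a7f9445b6

# If the second command reports "Storage pool not found", the destination host
# really has no libvirt pool with that UUID, which matches the agent exception.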

@rohityadavcloud
Member

I suppose the primary issue is that, during offline migration, the backing file information is lost: the KVM local storage root disk(s) are consolidated in a way that no longer needs the template. In edge cases, the migration can crash the running VM when, for whatever reason, the destination host finds the domain isn't running. We perhaps need a round of investigation/reproduction and, if possible, should (a) avoid using secondary storage for offline local storage migration (or document the behaviour) and (b) ensure the backing file context isn't lost. /cc @weizhouapache @DaanHoogland
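
As a conceptual sketch of option (b), with placeholder paths (the real copy goes through secondary storage, so this only illustrates preserving the chain instead of flattening it):

# Flattening -- what the offline migration effectively produces today:
qemu-img convert -O qcow2 /src-pool/<volume-uuid> /dst-pool/<volume-uuid>

# Preserving the backing-file context instead: seed the template on the
# destination pool first, then record only the differences on top of it:
qemu-img convert -O qcow2 -B /dst-pool/<template-uuid> /src-pool/<volume-uuid> /dst-pool/<volume-uuid>

# Verify the chain survived:
qemu-img info --backing-chain /dst-pool/<volume-uuid>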

@weizhouapache
Member

this might be related to #7615

@weizhouapache
Member

@GutoVeronezi
Would you like to work on this?

@GutoVeronezi
Contributor Author

@rohityadavcloud @weizhouapache Considering that the migration consolidates the volume and template (at least when using KVM), I now think it makes sense not to copy the template when migrating the volume, as the migrated volume will already contain the template data. However, we should run some tests to pinpoint use cases that could be affected by this behavior or are not working as expected.


Unfortunately, I will not be able to dedicate my efforts to this right now. @gpordeus, as you are already working on #7615, could you take a look at this?
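
For whoever picks this up, one simple database-side check for the use-case table (a sketch only; it assumes MySQL, the default cloud database, and the template_spool_ref table that tracks templates seeded on primary storage pools -- names may vary between versions):

# List the primary storage pools the database believes hold the template, then
# compare with the files actually present on each pool after migrateVolume vs
# migrateVirtualMachineWithVolume:
mysql -u cloud -p cloud -e "SELECT pool_id, template_id, download_state, install_path FROM template_spool_ref WHERE template_id = <template-internal-id>;"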

@gpordeus
Collaborator

@GutoVeronezi Yes, I'm now available. I'll assemble a use case table, test it and share the results as soon as they are ready.

@weizhouapache
Member

moved to 4.18.2.0

@rohityadavcloud
Member

Is this fixed?

@DaanHoogland
Contributor

@gpordeus (cc @GutoVeronezi) is work on this still ongoing?

@DaanHoogland DaanHoogland modified the milestones: 4.18.3, 4.19.1.0 May 31, 2024