Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

virtio-devices: save pci configuration capability state in snapshot #6326

Merged
merged 1 commit into from
Mar 24, 2024

Conversation

alex-matei
Copy link
Contributor

When restoring a VM, the VirtioPciCfgCapInfo struct is not properly initialized. All fields are 0, including the offset where the capabibility should start. Hence, when you read a PCI configuration register in the range [0..length(VirtioPciCfgCap)] you get the value 0 instead of the actual register contents.

Linux rescans the whole PCI bus when adding a new device. It reads the values vendor_id and device_id for every device. Because these are stored at offset 0 in pci configuration space, their value is 0 for existing devices. As such, Linux considers that the devices have been unplugged and it removes them from the system.

Fixes: #6265

@alex-matei alex-matei requested a review from a team as a code owner March 24, 2024 07:44
@alex-matei alex-matei force-pushed the fix/save_pci_cap_cfg branch 3 times, most recently from eccfaee to 3a5e2c2 Compare March 24, 2024 08:27
When restoring a VM, the VirtioPciCfgCapInfo struct is not properly
initialized. All fields are 0, including the offset where the
capabibility starts. Hence, when you read a PCI configuration register
in the range [0..length(VirtioPciCfgCap)] you get the value 0 instead of
the actual register contents.

Linux rescans the whole PCI bus when adding a new device. It reads the
values vendor_id and device_id for every device. Because these are
stored at offset 0 in pci configuration space, their value is 0 for
existing devices.  As such, Linux considers that the devices have been
unplugged and it removes them from the system.

Fixes: cloud-hypervisor#6265

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
@alex-matei alex-matei changed the title virtio-device: save pci configuration capability state in snapshots virtio-devices: save pci configuration capability state in snapshot Mar 24, 2024
Copy link
Member

@rbradford rbradford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fab - good work. I tested this with live migration which is how I bisected the first bad commit in #6265

@rbradford rbradford added this pull request to the merge queue Mar 24, 2024
Merged via the queue into cloud-hypervisor:main with commit c3f1c3e Mar 24, 2024
30 of 31 checks passed
@likebreath
Copy link
Member

I want to point out that this is a breaking change for live upgrade, and I don't think we can (easily) workaround it (like what we did for xsave_state, etc).

Also, I can see this is an very important bug fix, given any migrated/restored VM won't be functional if there is a device hotplug without this fix. We will need to see how we want to backport it.

@liuw
Copy link
Member

liuw commented Mar 25, 2024

I want to point out that this is a breaking change for live upgrade, and I don't think we can (easily) workaround it (like what we did for xsave_state, etc).

Also, I can see this is an very important bug fix, given any migrated/restored VM won't be functional if there is a device hotplug without this fix. We will need to see how we want to backport it.

Is the concern here the new CH will reject an old stream because the old stream does not have the PCI state field? In that case can we generate a state structure that's all zeros to preserve the buggy behaviour?

@likebreath
Copy link
Member

The concern is that we won't be able to backport this fix without breaking the live-upgrade, particularly for point releases of LTS version, where we intended to ensure they are live upgradable.

@rbradford
Copy link
Member

I think we may be able to fix the issue in the backport in a different way - to me it looks like that the PCI capability data for the virtio devices is static - afaict the only place the offset and capability data is set is here:

        let configuration_cap = VirtioPciCfgCap::new();
        self.cap_pci_cfg_info.offset = self
            .configuration
            .add_capability(&configuration_cap)
            .map_err(PciDeviceError::CapabilitiesSetup)?
            + VIRTIO_PCI_CAP_OFFSET;
        self.cap_pci_cfg_info.cap = configuration_cap;

Perhaps we can just setup the capability and offset after the device migration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

Disk hot-add after vm.restore causes I/O errors in guest
4 participants