pvmemcontrol: control guest physical memory properties #6467

Open
wants to merge 3 commits into
base: main

Conversation

yuanchu-xie

I'm working on memory passthrough for lightweight VMs. We've come up with an approach that's guest-driven and tries to keep the VM slim proactively. Pvmemcontrol is the name of the device/driver that communicates between the guest and the VMM to control the host backing of guest memory.

Yuanchu Xie yuanchu@google.com
Pasha Tatashin pasha.tatashin@soleen.com @soleen


Pvmemcontrol provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can indicate to the host where parts of a hugepage
may be unbacked, or that sensitive data should not be swapped out, etc.

Pvmemcontrol allows the guest to manipulate its gPTE entries in the SLAT,
and also some other properties of the memory map of the backing host memory.
This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
capability is available, the changes in the backing of the memory region
on the host are automatically reflected into the guest. For example, an
mmap() or madvise() that affects the region will be made visible
immediately.
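
For illustration, with this capability the host-side effect of, say, a
DONTNEED-style request can be as simple as an madvise() on the VMM's own
mapping of the affected guest range. A minimal sketch (assuming the libc
crate; the helper name is a placeholder and the lookup of the host
address is omitted):

// Minimal sketch (assumes the libc crate): with KVM_CAP_SYNC_MMU, an
// madvise() on the VMM's userspace mapping of a guest range is reflected
// into the guest without any explicit resync of the SLAT.
fn dontneed_guest_range(host_va: *mut libc::c_void, len: usize) -> std::io::Result<()> {
    // host_va stands in for the VMM virtual address backing the guest
    // physical range; how it is looked up is not shown here.
    let rc = unsafe { libc::madvise(host_va, len, libc::MADV_DONTNEED) };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}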

There are two components of the implementation: the guest Linux driver
and Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers; the VMM
device assigns a unique command for each per-cpu buffer. The guest
writes its pvmemcontrol request in the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the pvmemcontrol request.

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.
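
As a rough sketch of that handshake (the names, register semantics, and
bookkeeping below are illustrative assumptions, not the actual device
layout):

// Illustrative sketch only: the guest registers a per-cpu buffer and receives
// a unique command id; writing that id to the command register later asks the
// VMM device to process the request currently sitting in that buffer.
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Default)]
struct PvmemcontrolSketchState {
    // command id -> guest physical address of the per-cpu shared buffer
    buffers: HashMap<u32, u64>,
    next_command: u32,
}

#[derive(Default)]
struct PvmemcontrolSketch {
    state: RwLock<PvmemcontrolSketchState>,
}

impl PvmemcontrolSketch {
    // MMIO write of a buffer address: hand back a unique command id.
    fn register_buffer(&self, buffer_gpa: u64) -> u32 {
        let mut state = self.state.write().unwrap();
        let command = state.next_command;
        state.next_command += 1;
        state.buffers.insert(command, buffer_gpa);
        command
    }

    // MMIO write of a command id: look up the per-cpu buffer, read the request
    // from guest memory, act on the host mapping, and write the response back
    // synchronously (the guest-memory access itself is omitted here).
    fn handle_command(&self, command: u32) {
        let state = self.state.read().unwrap();
        if let Some(_buffer_gpa) = state.buffers.get(&command) {
            // process the request found at *_buffer_gpa
        }
    }
}

The RwLock in the sketch reflects the intent, discussed further down in
the review, of letting multiple vCPUs issue commands concurrently.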

User API
From userland, the pvmemcontrol guest driver is controlled via an
ioctl(2) call. It requires CAP_SYS_ADMIN.

ioctl(fd, PVMEMCONTROL_IOCTL, struct pvmemcontrol_buf *buf);

Guest userland applications can tag VMAs and guest hugepages, or advise
the host on how to handle sensitive guest pages.

Supported function codes and their use cases:
PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT: for the guest, one can reduce
the struct page and page table lookup overhead by using hugepages backed
by smaller pages on the host. These pvmemcontrol commands allow for
partial freeing of private guest hugepages to save memory. They also
allow kernel memory, such as kernel stacks and task_structs, to be
paravirtualized if we expose kernel APIs.

PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not
want to share its backing pages.
The same goes for PVMEMCONTROL_DONTDUMP, which keeps sensitive pages out
of a dump.
MLOCK/UNLOCK can advise the host that sensitive information should not
be swapped out on the host.

PVMEMCONTROL_MPROTECT_NONE/R/W/RW: for guest stacks backed by hugepages,
stack guard pages can be handled in the host and memory can be saved in
the hugepage.

PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging
how guest memory is being mapped on the host.
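
To sketch the shape of such a call from the guest (the device path,
ioctl number, and struct layout below are placeholders, not the real
pvmemcontrol UAPI; the sample program linked just below shows real
usage):

// Hypothetical guest userland sketch (assumes the libc crate); constants,
// struct layout, and the "/dev/pvmemcontrol" path are placeholders only.
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

const PVMEMCONTROL_IOCTL: libc::c_ulong = 0; // placeholder, not the real ioctl number
const PVMEMCONTROL_DONTNEED: u64 = 0; // placeholder, not the real function code

#[repr(C)]
struct PvmemcontrolBuf {
    func_code: u64, // which PVMEMCONTROL_* operation to perform
    addr: u64,      // start of the guest virtual range to operate on
    length: u64,    // length of the range in bytes
    ret_value: u64, // filled in by the driver on return
}

fn pvmemcontrol_dontneed(addr: u64, length: u64) -> std::io::Result<u64> {
    // Requires CAP_SYS_ADMIN in the guest, as noted above.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .open("/dev/pvmemcontrol")?;

    let mut buf = PvmemcontrolBuf {
        func_code: PVMEMCONTROL_DONTNEED,
        addr,
        length,
        ret_value: 0,
    };
    let rc = unsafe { libc::ioctl(file.as_raw_fd(), PVMEMCONTROL_IOCTL, &mut buf as *mut PvmemcontrolBuf) };
    if rc < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(buf.ret_value)
}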

Sample program making use of PVMEMCONTROL_DONTNEED:
https://github.com/Dummyc0m/pvmemcontrol-user

Previously posted RFC to cloud-hypervisor:
#6318

LKML posting of Linux guest driver:
https://lore.kernel.org/lkml/20240518072422.771698-1-yuanchu@google.com/

@yuanchu-xie yuanchu-xie requested a review from a team as a code owner May 18, 2024 07:37
@up2wing
Contributor

up2wing commented May 20, 2024

If I understand correctly, the guest can change/operate the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.

@@ -939,6 +943,9 @@ pub struct DeviceManager {
// GPIO device for AArch64
gpio_device: Option<Arc<Mutex<devices::legacy::Gpio>>>,

pvmemcontrol_bus_device: Option<Arc<devices::pvmemcontrol::PvmemcontrolBusDevice>>,
pvmemcontrol_pci_device: Option<Arc<Mutex<devices::pvmemcontrol::PvmemcontrolPciDevice>>>,
Contributor

use devices::pvmemcontrol::{PvmemcontrolBusDevice, PvmemcontrolPciDevice};

can make this simpler.

id: String,
configuration: PciConfiguration,
bar_regions: Vec<PciBarConfiguration>,
}
Contributor

Would you like to explain why you need two structs to represent the device? In my opinion, one
struct, maybe PvmemcontrolDevice, seems like enough.

Author

Right. My observation was that both BusDevice and PciDevice handle device writes/reads, but only the BusDevice impl actually received the writes/reads. I want the device to handle requests on multiple CPUs at the same time, so I made the BusDeviceSync trait, similar to crosvm: it is just the BusDevice trait without the exclusive-reference requirement on the read and write trait methods, so the impl can handle its own locking and multiple read locks can be taken at the same time.

I left the PciDevice trait in place, so I need two structs, because the PciDevice gets wrapped in an Arc<Mutex<>> when I want a RwLock. On second thought, maybe I should instead refactor the Pci/BusDevice traits such that PciDevice also handles its own locking? That would be more consistent, but would also inflate the PR into a tree-wide change.
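
Roughly, the difference looks like this (simplified signatures for
illustration, not the exact traits in the tree):

// Simplified sketch: BusDevice needs exclusive access per dispatch, while
// BusDeviceSync takes a shared reference and lets the implementation choose
// its own synchronization (e.g. an internal RwLock).
use std::sync::Mutex;

pub trait BusDevice: Send {
    // &mut self: the Bus must take an exclusive lock before dispatching.
    fn read(&mut self, offset: u64, data: &mut [u8]);
    fn write(&mut self, offset: u64, data: &[u8]);
}

pub trait BusDeviceSync: Send + Sync {
    // &self: multiple vCPUs can be served concurrently if the device allows it.
    fn read(&self, offset: u64, data: &mut [u8]);
    fn write(&self, offset: u64, data: &[u8]);
}

// Existing Mutex-wrapped devices keep working: lock around each access.
impl<T: BusDevice> BusDeviceSync for Mutex<T> {
    fn read(&self, offset: u64, data: &mut [u8]) {
        self.lock().unwrap().read(offset, data)
    }
    fn write(&self, offset: u64, data: &[u8]) {
        self.lock().unwrap().write(offset, data)
    }
}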

@Dummyc0m

Dummyc0m commented Jun 7, 2024

If I understand correctly, the guest can change/operate the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.

By default the device is not enabled, and I would say this is roughly in the same ballpark as virtio-balloon reporting free pages for the host to madvise away. Would you say that the device should be feature gated?

@Dummyc0m Dummyc0m force-pushed the memctl-pci branch 2 times, most recently from 6b4c78f to d56ecdf on June 7, 2024 23:28
@Dummyc0m

refreshed kernel patches to resolve sparse warnings https://lore.kernel.org/linux-mm/20240612021207.3314369-1-yuanchu@google.com/

@liuw
Member

liuw commented Jun 19, 2024

A few comments:

  1. I think this should be gated by a flag and be disabled by default, because the kernel code is not yet upstreamed.
  2. I think you should remove the reference to the prototype in your commit message.
  3. The device is really simple, and the code is self-contained, so I don't worry about it being overly buggy or anything. I can only speak for myself, but I'm happy to merge experimental code like this to nurture innovation.

I know there is a chicken-and-egg problem. Kernel wants to have some users before merging new code, while user space programs are hesitant to take in new code because kernel code can still change. Having the feature merged but disabled by default seems like a good way forward.

Lastly, I know it is not possible to test this right now, but if we merge this, please plan to add a test case when the kernel changes are merged.

@Dummyc0m

Thanks Liu Wei, I agree on all three remarks, plus testing when the kernel changes are merged. Let me make the changes.

BusDevice trait functions currently hold a mutable reference to self,
and exclusive access is guaranteed by taking a Mutex when dispatched by
the Bus object. However, this prevents individual devices from serving
accesses that do not require a mutable reference or are better served
by different synchronization primitives. We switch Bus to dispatch via
BusDeviceSync, which holds a shared reference, and delegate locking to
the BusDeviceSync trait implementation for Mutex<BusDevice>.

Other changes are made to make use of the dyn BusDeviceSync
trait object.

Signed-off-by: Yuanchu Xie <yuanchu@google.com>
The BusDevice requirement is not needed; only Send is required.

Signed-off-by: Yuanchu Xie <yuanchu@google.com>
@Dummyc0m

Seems like I missed a few things. Let me actually add the pre-commit hooks to my local setup and not forget to run the checks every time.

Pvmemcontrol provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can indicate to the host where parts of a hugepage
may be unbacked, or that sensitive data should not be swapped out, etc.

Pvmemcontrol allows the guest to manipulate its gPTE entries in the SLAT,
and also some other properties of the memory map of the backing host memory.
This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
capability is available, the changes in the backing of the memory region
on the host are automatically reflected into the guest. For example, an
mmap() or madvise() that affects the region will be made visible
immediately.

There are two components of the implementation: the guest Linux driver
and Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers; the VMM
device assigns a unique command for each per-cpu buffer. The guest
writes its pvmemcontrol request in the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the pvmemcontrol request.

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.

The Cloud Hypervisor component can be enabled with --pvmemcontrol.

Signed-off-by: Yuanchu Xie <yuanchu@google.com>