pvmemcontrol: control guest physical memory properties #6467
base: main
Conversation
If I understand correctly, the guest can change the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.
vmm/src/device_manager.rs
@@ -939,6 +943,9 @@ pub struct DeviceManager {
// GPIO device for AArch64
gpio_device: Option<Arc<Mutex<devices::legacy::Gpio>>>,

pvmemcontrol_bus_device: Option<Arc<devices::pvmemcontrol::PvmemcontrolBusDevice>>,
pvmemcontrol_pci_device: Option<Arc<Mutex<devices::pvmemcontrol::PvmemcontrolPciDevice>>>,
use devices::pvmemcontrol::{PvmemcontrolBusDevice, PvmemcontrolPciDevice};
can make this simpler.
id: String,
configuration: PciConfiguration,
bar_regions: Vec<PciBarConfiguration>,
}
Could you explain why you need two structs to represent the device? In my opinion, one struct, maybe PvmemcontrolDevice, seems like enough.
Right. My observation was that both the BusDevice and PciDevice impls handle device writes/reads, but only the BusDevice impl actually receives them. I want the device to handle requests on multiple CPUs at the same time, so I made the BusDeviceSync trait similar to crosvm's: it is the BusDevice trait without the exclusive-reference requirement on the read and write trait methods, so the impl can handle its own locking and multiple read locks can be taken at the same time.
I left the PciDevice trait in place, so I need two structs, because the PciDevice gets wrapped in an Arc<Mutex<>> where I want an RwLock. On second thought, maybe I should instead refactor the PciDevice/BusDevice traits so that PciDevice also handles its own locking? That would be more consistent, but it would also inflate the PR into a tree-wide change.
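To make the trade-off concrete, the refactor being discussed can be sketched roughly as below. The trait and method signatures are simplified for illustration (the real Cloud Hypervisor traits take base/offset addresses and byte slices); only the locking shape matters here:

```rust
use std::sync::{Mutex, RwLock};

// Old style: exclusive access. The Bus must hold a lock around &mut self
// for every dispatch, serializing all vCPUs on one device.
trait BusDevice {
    fn read(&mut self, offset: u64) -> u8;
}

// New style (crosvm-like): shared access. The implementation does its
// own locking, so e.g. an RwLock lets many CPUs read concurrently.
trait BusDeviceSync: Send + Sync {
    fn read(&self, offset: u64) -> u8;
}

// Blanket bridge: any existing Mutex<BusDevice> still works on the new
// bus, with locking delegated to this impl.
impl<T: BusDevice + Send> BusDeviceSync for Mutex<T> {
    fn read(&self, offset: u64) -> u8 {
        self.lock().unwrap().read(offset)
    }
}

// A device that prefers reader/writer locking over a Mutex.
struct Pvmemcontrol {
    state: RwLock<u8>,
}

impl BusDeviceSync for Pvmemcontrol {
    fn read(&self, _offset: u64) -> u8 {
        *self.state.read().unwrap() // multiple read locks at once
    }
}

// A legacy device kept on the old trait, bridged via the blanket impl.
struct Legacy(u8);

impl BusDevice for Legacy {
    fn read(&mut self, _offset: u64) -> u8 {
        self.0
    }
}
```

The blanket impl is what keeps the change incremental: existing Mutex-wrapped devices keep working unchanged, while new devices can opt into their own synchronization.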
By default the device is not enabled, and I would say this is roughly in the same ballpark as virtio-balloon reporting free pages for the host to madvise away. Would you say the device should be feature-gated?
Force-pushed from 6b4c78f to d56ecdf.
Refreshed kernel patches to resolve sparse warnings: https://lore.kernel.org/linux-mm/20240612021207.3314369-1-yuanchu@google.com/
A few comments:
I know there is a chicken-and-egg problem: the kernel wants to have some users before merging new code, while user-space programs are hesitant to take in new code because the kernel code can still change. Having the feature merged but disabled by default seems like a good way forward. Lastly, I know it is not possible to test this right now, but if we merge this, please plan to add a test case when the kernel changes are merged.
Thanks Liu Wei, I agree on all three remarks, plus testing when the kernel changes are merged. Let me make the changes.
BusDevice trait functions currently hold a mutable reference to self, and exclusive access is guaranteed by taking a Mutex when dispatched by the Bus object. However, this prevents individual devices from serving accesses that do not require a mutable reference, or that are better served with different synchronization primitives. We switch Bus to dispatch via BusDeviceSync, which holds a shared reference, and delegate locking to the BusDeviceSync trait implementation for Mutex<BusDevice>. Other changes are made to use the dyn BusDeviceSync trait object. Signed-off-by: Yuanchu Xie <yuanchu@google.com>
The BusDevice requirement is not needed; only Send is required. Signed-off-by: Yuanchu Xie <yuanchu@google.com>
Seems like I missed a few things. Let me add the pre-commit hooks to my local setup so I don't forget to run the checks every time.
Pvmemcontrol provides a way for the guest to control its physical memory properties, and enables optimizations and security features. For example, the guest can provide information to the host that parts of a hugepage may be unbacked, or that sensitive data should not be swapped out, etc.

Pvmemcontrol allows guests to manipulate their gPTE entries in the SLAT, and also some other properties of the memory mapping that backs guest memory on the host. This is achieved by using the KVM_CAP_SYNC_MMU capability. When this capability is available, changes in the backing of the memory region on the host are automatically reflected into the guest. For example, an mmap() or madvise() that affects the region will be made visible immediately.

There are two components of the implementation: the guest Linux driver and the Virtual Machine Monitor (VMM) device. A guest-allocated shared buffer is negotiated per CPU through a few PCI MMIO registers, and the VMM device assigns a unique command for each per-cpu buffer. The guest writes its pvmemcontrol request into the per-cpu buffer, then writes the corresponding command into the command register, calling into the VMM device to perform the pvmemcontrol request.

The synchronous per-cpu shared buffer approach avoids the kick and busy waiting that the guest would have to do with virtio virtqueue transport.

The Cloud Hypervisor component can be enabled with --pvmemcontrol.

Signed-off-by: Yuanchu Xie <yuanchu@google.com>
I'm working on memory passthrough for lightweight VMs. We've come up with an approach that is guest-driven and tries to keep the VM slim proactively. Pvmemcontrol is the name of the device/driver that communicates between the guest and the VMM to control the host backing of guest memory.
Yuanchu Xie yuanchu@google.com
Pasha Tatashin pasha.tatashin@soleen.com @soleen
Pvmemcontrol provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can provide information to the host where parts of a
hugepage may be unbacked, or sensitive data may not be swapped out, etc.
Pvmemcontrol allows guests to manipulate their gPTE entries in the SLAT,
and also some other properties of the memory mapping that backs guest
memory on the host.
This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
capability is available, the changes in the backing of the memory region
on the host are automatically reflected into the guest. For example, an
mmap() or madvise() that affects the region will be made visible
immediately.
There are two components of the implementation: the guest Linux driver
and Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers, the VMM
device assigns a unique command for each per-cpu buffer. The guest
writes its pvmemcontrol request in the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the pvmemcontrol request.
The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.
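The register handshake described above can be modeled as a small sketch. Note that the command values, request fields, and register layout below are illustrative assumptions for exposition, not the actual device ABI:

```rust
use std::collections::HashMap;

// Hypothetical request: what a vCPU places in its shared buffer.
// Field names are placeholders, not the real UAPI.
#[derive(Clone, Copy)]
struct Request {
    func_code: u32, // e.g. a DONTNEED-style operation (illustrative)
    addr: u64,      // start of the guest range
    len: u64,       // length of the range in bytes
}

// Sketch of the VMM side: one shared buffer per vCPU, each bound to a
// unique command value negotiated at init time through MMIO registers.
struct PvmemcontrolDeviceSketch {
    // command value -> (vcpu id, that vCPU's shared buffer slot)
    buffers: HashMap<u32, (usize, Option<Request>)>,
    next_command: u32,
}

impl PvmemcontrolDeviceSketch {
    fn new() -> Self {
        Self { buffers: HashMap::new(), next_command: 1 }
    }

    // Guest registers a per-cpu buffer; the device hands back a unique
    // command value for that buffer.
    fn register_buffer(&mut self, vcpu: usize) -> u32 {
        let cmd = self.next_command;
        self.next_command += 1;
        self.buffers.insert(cmd, (vcpu, None));
        cmd
    }

    // Guest fills its per-cpu buffer with a request.
    fn write_buffer(&mut self, cmd: u32, req: Request) {
        if let Some(slot) = self.buffers.get_mut(&cmd) {
            slot.1 = Some(req);
        }
    }

    // Writing the command register synchronously performs the request
    // (no kick, no busy-wait). The real device would here act on the
    // host mapping backing [addr, addr + len), e.g. via madvise().
    fn write_command_register(&mut self, cmd: u32) -> Option<(usize, u32)> {
        let (vcpu, req) = self.buffers.get(&cmd)?;
        let req = (*req)?;
        Some((*vcpu, req.func_code))
    }
}
```

Because each vCPU owns its own buffer and command value, requests from different vCPUs never contend on a shared ring, which is the property the synchronous design trades the virtqueue for.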
User API
From the userland, the pvmemcontrol guest driver is controlled via
ioctl(2) call. It requires CAP_SYS_ADMIN.
ioctl(fd, PVMEMCONTROL_IOCTL, struct pvmemcontrol_buf *buf);
Guest userland applications can tag VMAs and guest hugepages, or advise
the host on how to handle sensitive guest pages.
Supported function codes and their use cases:
PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT: for the guest, one can reduce
the struct page and page table lookup overhead by using hugepages backed
by smaller pages on the host. These pvmemcontrol commands allow partial
freeing of private guest hugepages to save memory. They also allow
kernel memory, such as kernel stacks and task_structs, to be
paravirtualized if we expose kernel APIs.
PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not
want to share its backing pages.
The same goes for PVMEMCONTROL_DONTDUMP, so that sensitive pages are not
included in a dump.
MLOCK/UNLOCK can advise the host that sensitive information should not
be swapped out on the host.
PVMEMCONTROL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages,
stack guard pages can be handled in the host and memory can be saved in
the hugepage.
PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging
how guest memory is being mapped on the host.
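As a rough illustration of how a guest userland program might assemble a request for the ioctl above. The struct layout, field names, and values here are placeholders, not the real kernel UAPI; the sample program linked below shows the actual usage:

```rust
// Placeholder layout: the real `struct pvmemcontrol_buf` is defined by
// the kernel UAPI header; field names, sizes, and codes here are
// assumptions for illustration only.
#[repr(C)]
#[derive(Debug, PartialEq)]
struct PvmemcontrolBuf {
    func_code: u64, // e.g. a DONTNEED-style code (placeholder)
    addr: u64,      // guest virtual address of the target range
    length: u64,    // length of the range in bytes
    ret_errno: u64, // result written back after the call
}

// A guest program would open the device node, fill the buffer, and
// issue ioctl(fd, PVMEMCONTROL_IOCTL, &buf) with CAP_SYS_ADMIN.
// Sketched without the raw syscall so the flow stays self-contained.
fn build_request(func_code: u64, addr: u64, length: u64) -> PvmemcontrolBuf {
    PvmemcontrolBuf { func_code, addr, length, ret_errno: 0 }
}
```

The `#[repr(C)]` layout matters because the buffer is shared with the host; the kernel header, not this sketch, is the source of truth for the exact fields.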
Sample program making use of PVMEMCONTROL_DONTNEED:
https://github.com/Dummyc0m/pvmemcontrol-user
Previously posted RFC to cloud-hypervisor:
#6318
LKML posting of Linux guest driver:
https://lore.kernel.org/lkml/20240518072422.771698-1-yuanchu@google.com/