vmm, hypervisor: kvm: Add KVM SEV-SNP to Cloud Hypervisor #7942
rbradford merged 17 commits into cloud-hypervisor:main
Conversation
This PR description says it builds on #7787 and #7316, but #7787 is still open and #7316 was closed. Also, the current title makes this look very close in scope to #7787, so at the moment it is not clear to me whether this is:
I'm marking this as draft for now until that relationship is clarified. Please expand the PR description and explain:
If this PR requires #7787 first, then I think this should be closed for now and reopened once all prerequisites are merged. This keeps review focused. If I'm misunderstanding the intent here, feel free to reach out.
@phip1611 My colleague @rhakobyan opened this PR, derived from the code in the original PR, but he has the ability to build and test this PR on the hardware, something that the author of #7787 said they couldn't do. I think this PR should supersede that one. We identified that there was some dead code in the original PR related to register programming that was superfluous when using Project Oak stage0, which is a requirement of both that version and this one. It would be possible to add that back when we can come up with a solution that does not require IGVM.
Thanks for the clarification @rbradford - good to know! It would be great if we could make this clear in the PR description. Feel free to undraft!
rbradford
left a comment
Thank you @rhakobyan - some more stuff for you to look at. Please also add some documentation like you did in the PR body showing how to run it.
It also looks like, right now, we should make the sev_snp feature depend on the igvm and fw_cfg features.
}
memfd = if kvm_guest_memfd_supported {
    // TODO: Refactor to create memfd when the memory region is created so
    // that the size is appropriate.
I don't get this comment. Is it that a new guest_memfd should be created for each region (e.g. for the below-3 GiB and above-4 GiB regions)?
Yes, I think this goes back to the earlier comment. Right now, we create one memfd spanning the whole virtual address space, but should instead do this per-region.
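A minimal sketch of the per-region split being discussed, assuming a 3 GiB-to-4 GiB MMIO hole; the `ram_regions` helper and the hole boundaries are illustrative assumptions, not code from this PR:

```rust
// Derive per-region (start, size) pairs so that each RAM region could get
// its own appropriately sized guest_memfd, instead of one memfd spanning
// the whole guest address space. Hole boundaries are assumptions here.
const HOLE_START: u64 = 3 << 30; // 3 GiB
const HOLE_END: u64 = 4 << 30; // 4 GiB

fn ram_regions(total_ram: u64) -> Vec<(u64, u64)> {
    if total_ram <= HOLE_START {
        // Everything fits below the hole: a single region.
        vec![(0, total_ram)]
    } else {
        // Split: low region up to the hole, remainder above 4 GiB.
        vec![(0, HOLE_START), (HOLE_END, total_ram - HOLE_START)]
    }
}

fn main() {
    // 2 GiB guest: one region below the hole.
    assert_eq!(ram_regions(2 << 30), vec![(0, 2 << 30)]);
    // 6 GiB guest: 3 GiB low region plus 3 GiB starting at 4 GiB.
    assert_eq!(ram_regions(6 << 30), vec![(0, 3 << 30), (4 << 30, 3 << 30)]);
    println!("ok");
}
```

Each `(start, size)` pair would then back one guest_memfd of exactly that size, rather than one memfd covering the hole as well.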
// (e.g. MSHV) handle this through their own import path.
#[cfg(all(feature = "kvm", feature = "sev_snp"))]
if cpu_manager.lock().unwrap().sev_snp_enabled() {
    Self::reserve_region_for_stage0(&memory_manager)?;
Why do we have to do this differently on MSHV?
(force-pushed 109b408 to 4745bd2)
#[cfg(feature = "igvm")]
pub const STAGE0_START_ADDRESS: GuestAddress = GuestAddress(0xffe0_0000);
#[cfg(feature = "igvm")]
pub const STAGE0_SIZE: usize = 0x20_0000;
IDK if we want to use stage0's specific reset vector here, or if we would prefer that it's a bit more generic. The industry uses EDK2, which we want to be able to swap for either a non-traditional bootloader like stage0 or one that implements UEFI.
Thanks, I have made this a bit more generic, though ideally I think we'd want to parse IGVM directives to check the VMSA, etc. I've added a TODO.
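For reference, the constants in the quoted diff place the firmware region flush against the 4 GiB boundary, which is what makes the x86 reset vector reachable; a quick sanity check of that arithmetic:

```rust
// Constants from the quoted diff; RESET_VECTOR is the architectural x86
// reset vector, 16 bytes below 4 GiB.
const STAGE0_START_ADDRESS: u64 = 0xffe0_0000;
const STAGE0_SIZE: u64 = 0x20_0000; // 2 MiB
const RESET_VECTOR: u64 = 0xffff_fff0;

fn main() {
    let end = STAGE0_START_ADDRESS + STAGE0_SIZE;
    // The region ends exactly at the 4 GiB boundary...
    assert_eq!(end, 1u64 << 32);
    // ...so the reset vector falls inside the mapped firmware region.
    assert!(RESET_VECTOR >= STAGE0_START_ADDRESS && RESET_VECTOR < end);
    println!("firmware region covers the reset vector");
}
```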
    }
}

pub(crate) fn init2(&self, vm: &VmFd) -> Result<()> {
I think we need to make sure that we pass in the SEV_FEATURES from the VMSA here before calling init2; at least it seems like QEMU does this:
https://gitlab.com/qemu-project/qemu/-/blob/master/target/i386/sev.c#L1895-1914
This makes sense to me, and it looks like we're just providing defaults. This isn't actually what we want to do: SEV_FEATURES is something that should be passed in from the user.
Yeah, we currently go with defaults, but will need to parse SEV_FEATURES from IGVM I think. I'll check how QEMU handles this.
Ok, I updated this — it's now parsing IGVM for SEV features.
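A hedged sketch of where that parsed value ends up: the SEV_FEATURES taken from the IGVM VMSA is placed into the `vmsa_features` field of the KVM_SEV_INIT2 argument. The struct below mirrors `struct kvm_sev_init` from linux/arch/x86/include/uapi/asm/kvm.h (verify against your kernel headers); the DebugSwap bit position follows the VMSA SEV_FEATURES definition in the AMD APM.

```rust
// Bit 5 of the VMSA SEV_FEATURES field = DebugSwap (AMD APM).
const SEV_FEATURE_DEBUG_SWAP: u64 = 1 << 5;

// Mirrors struct kvm_sev_init (KVM_SEV_INIT2 argument); layout must match
// the kernel's expectation exactly.
#[repr(C)]
struct KvmSevInit {
    vmsa_features: u64,
    flags: u32,
    ghcb_version: u16,
    pad1: u16,
    pad2: [u32; 8],
}

// Build the INIT2 argument from features parsed out of the IGVM VMSA.
fn init2_args(vmsa_features: u64) -> KvmSevInit {
    KvmSevInit { vmsa_features, flags: 0, ghcb_version: 0, pad1: 0, pad2: [0; 8] }
}

fn main() {
    // 8 + 4 + 2 + 2 + 32 bytes, no implicit padding with repr(C).
    assert_eq!(std::mem::size_of::<KvmSevInit>(), 48);
    let args = init2_args(SEV_FEATURE_DEBUG_SWAP);
    assert_eq!(args.vmsa_features & SEV_FEATURE_DEBUG_SWAP, SEV_FEATURE_DEBUG_SWAP);
    println!("ok");
}
```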
(force-pushed 4745bd2 to 1ba0cfe)
Thanks @phip1611, I have updated the PR to be more explicit in what it's doing.
We are adding support for stage0 in Cloud Hypervisor and found that we were busy polling for a PCI device on the bridge... Eventually we discovered that Cloud Hypervisor actually has its own `PciConfiguration` here: https://github.com/cloud-hypervisor/cloud-hypervisor/blob/8248650e799457fee3068b0bf3b54ffcade4bce1/pci/src/bus.rs#L66-L82

Teach stage0 about this so that we can init the machine with this new type.

# Changes

- Moved the `mmio64_hole()` method to a helper method which introduces an `addr_size` parameter, solely so that we can customize what argument is passed
- Added the `IntelVirtualPcieHost` machine type

# Tests

```
$ nix develop
$ bazel test //stage0
...
INFO: Found 3 targets and 2 test targets...
INFO: Elapsed time: 183.706s, Critical Path: 172.55s
INFO: 1966 processes: 850 internal, 1116 linux-sandbox.
INFO: Build completed successfully, 1966 total actions
//stage0:stage0_test PASSED in 0.0s
//stage0/testing:stage0_kernel_test PASSED in 0.9s
Executed 2 out of 2 tests: 2 tests pass.
```

I also launched a SEV-SNP VM using Cloud Hypervisor. We're in the midst of adding this support, so please look at the steps in cloud-hypervisor/cloud-hypervisor#7942 (comment) for reproducing the binaries used for testing. I also used the `buildigvm` binary from https://gitlab.com/qemu-project/buildigvm, which was checked in recently.

[stage0_pcie.txt](https://github.com/user-attachments/files/26471385/stage0_pcie.txt)

Change-Id: Id3ee62393df631fcaa7cc9144aa7fc416a6a6964
use crate::igvm::{BootPageAcceptance, HV_PAGE_SIZE, IgvmLoadedInfo, StartupMemoryType};
use crate::memory_manager::{Error as MemoryManagerError, MemoryManager};

#[cfg(feature = "sev_snp")]
Oh, we decided to put the kernel hash logic in here huh. I was thinking we could leave it out to keep the focus solely on enabling the launch, but we can leave this in here...
Good point, I'll move this out to a separate PR. I'll address the comments there directly.
#[cfg(feature = "sev_snp")]
const SNP_CPUID_LIMIT: u32 = 64;
#[cfg(all(feature = "sev_snp", target_arch = "x86_64"))]
const SEV_HASH_BLOCK_ADDRESS: u64 = 0x10c00;
This all comes from my pending PR in stage0 (https://github.com/project-oak/oak/pull/5073/changes), which I need to actually get checked in...
We could leave links as to where these are all found (in EDK2). For instance, the SEV_HASH_BLOCK_ADDRESS comes from https://github.com/tianocore/edk2/blob/f98662c5e35b6ab60f46ee4350fa0e6eab0497cf/OvmfPkg/Include/Fdf/MemFd.fdf.inc#L89-L93
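For context, a hedged sketch of the EDK2-style SEV hash table layout those constants refer to: a GUIDed header followed by one entry per measured blob (kernel, initrd, cmdline), each entry being a 16-byte GUID, a 2-byte length, and a SHA-256 digest. The sizes below are derived from that description, not copied from the PR.

```rust
const GUID_LEN: usize = 16;
const LEN_FIELD: usize = 2;
const SHA256_LEN: usize = 32;

// Total size of a table with `entries` measured blobs: GUIDed header
// (GUID + length) followed by fixed-size entries.
fn hash_table_size(entries: usize) -> usize {
    let header = GUID_LEN + LEN_FIELD;
    let entry = GUID_LEN + LEN_FIELD + SHA256_LEN;
    header + entries * entry
}

// Round up to an alignment boundary, as a padded table wrapper would.
fn padded_to(n: usize, align: usize) -> usize {
    (n + align - 1) / align * align
}

fn main() {
    // kernel + initrd + cmdline
    let raw = hash_table_size(3);
    assert_eq!(raw, 168); // 18-byte header + 3 * 50-byte entries
    // Padded to a 16-byte boundary.
    assert_eq!(padded_to(raw, 16), 176);
    println!("table: {raw} bytes, padded: {} bytes", padded_to(raw, 16));
}
```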
} else {
    None
};
vm.sev_snp_init(Self::get_default_sev_snp_guest_policy())
I love the default policy, although we'll need to make sure this is eventually injectable from the guest :) There are certain policies we need to enable from the guest side to make sure that they're enabled in the future (enable/disable SMT, make sure debug is off, etc.). Doesn't have to be this patch, but we'll need to keep in mind that this will be an actively changed parameter.
Thanks, I have noted this; I will get the policy from IGVM in a subsequent PR to keep this one simpler.
if let Some(acpi_hash) = _fw_cfg_acpi_hash.as_deref() {
    let mut config = config.lock().unwrap();
    if let Some(payload) = config.payload.as_mut() {
        Self::ensure_cmdline_param(payload, "acpi_hash", acpi_hash);
Curious as to why we're adding acpi_hash to the cmdline (or even checking for an ACPI hash at all)? I know this is enabled in stage0 (we add it there as an additional cmdline param after we've launched), but in the measured-boot use case involving kernel hashes this is actually undesirable: CH would launch with a kernel cmdline digest that differs from what was measured previously via igvmmeasure and what not.
IMO, it isn't a CH/VMM responsibility to inject the acpi hash into the VM. CH/our VMM should just generate the required ACPI tables and then launch; let the CVM care about the hashes and what not.
}

#[cfg(all(feature = "sev_snp", target_arch = "x86_64"))]
fn import_isolated_pages(
Tangent, but I was a bit confused by the import_isolated_pages naming until I realized this was more or less just a wrapper over SNP_LAUNCH_UPDATE (import data into the context for the launch digest) and SNP_LAUNCH_FINISH (complete the digest and store it as the measurement for launch); perhaps this naming is more relevant for the MSHV use case.
Yeah, you're definitely right, import_isolated_pages is an MSHV-specific ioctl (https://docs.rs/mshv-ioctls/latest/src/mshv_ioctls/ioctls/vm.rs.html#157-168) so the naming here is not great. We should probably think about naming this something more generic.
use std::fmt::Write;
write!(&mut hex_digest, "{byte:02x}").unwrap();
}
hex_digest
I guess it's okay to have the ACPI table hashes and to print them out, although I don't think we should be injecting them into the kernel cmdline, which will cause attestation issues
// loader populates vCPU registers from the VMSA via set_sev_control_register
// (currently KVM-specific; MSHV handles this through its own import path).
// igvm_enabled is kept as an explicit flag rather than derived from sev_snp
// state because IGVM could theoretically be used independently of SEV-SNP.
I think we have a few non-SEV-SNP VMs that try to emulate the flow of an SEV-SNP CVM (mainly for dev/testing purposes), so this is fairly valid. We'd like to keep the parameters mostly the same if trying to launch basically the same VM except without SEV-SNP technology
// Inject acpi_hash into cmdline if needed (for boot() path where
// init_sev_snp's ensure_cmdline_param may not have run).
I think it's okay for us to do a check from the VMM to see that the ACPI hash matches but it should not be injected into the actual cmdline
// Stage0 region is only needed for KVM SEV-SNP with IGVM.
#[cfg(all(feature = "kvm", feature = "sev_snp"))]
fn reserve_region_for_stage0(memory_manager: &Arc<Mutex<MemoryManager>>) -> Result<()> {
Same note as above: let's not tailor specifically for stage0 and its quirks, but instead prime for generic bootloader entry paths.
Fixed the name, thanks.
let mut loader = Loader::new(memory);

let mut parameter_areas: HashMap<u32, ParameterAreaState> = HashMap::new();
#[cfg(all(feature = "sev_snp", target_arch = "x86_64"))]
Should we make the measured boot an option that can be passed in? This is what QEMU does.
In the SEV SNP object in QEMU, we toggle kernel-hashes=on to enforce the measured boot hash, otherwise I suppose it would be whatever pages were enforced via SNP_LAUNCH_UPDATE
}

#[cfg(all(feature = "sev_snp", target_arch = "x86_64"))]
impl MeasuredBootInfo {
IMO MeasuredBootInfo and all of the above SEV-SNP logic should all be delegated to sev.rs, since it's not really an IGVM concept.
let table = PaddedSevHashTable {
    table: SevHashTable {
        guid: uuid::uuid!("9438d606-4f22-4cc9-b479-a793d411fd21").to_bytes_le(),
These UUID strings should be attributed to their provenance here: https://github.com/tianocore/edk2/blob/76406ef119625ee3185d4ae3bf7701147dba50ac/OvmfPkg/AmdSev/BlobVerifierLibSevHashes/BlobVerifierSevHashes.c#L23-L30
Otherwise, it'll look really confusing when reading at a glance
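For readers puzzled by the `to_bytes_le()` call in the quoted diff: GUIDs in this context use the mixed-endian encoding EDK2 expects, where the first three UUID fields are stored little-endian and the rest stay in order. A small sketch of that conversion (hand-rolled here to stay dependency-free; the real code uses the uuid crate):

```rust
// Convert a UUID's big-endian bytes into mixed-endian GUID byte order:
// time_low (4 bytes), time_mid (2), time_hi_and_version (2) are swapped,
// the final 8 bytes are unchanged.
fn guid_bytes_le(b: [u8; 16]) -> [u8; 16] {
    [
        b[3], b[2], b[1], b[0], // time_low, byte-swapped
        b[5], b[4],             // time_mid
        b[7], b[6],             // time_hi_and_version
        b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15],
    ]
}

fn main() {
    // 9438d606-4f22-4cc9-b479-a793d411fd21 in big-endian byte order:
    let be = [
        0x94, 0x38, 0xd6, 0x06, 0x4f, 0x22, 0x4c, 0xc9,
        0xb4, 0x79, 0xa7, 0x93, 0xd4, 0x11, 0xfd, 0x21,
    ];
    let le = guid_bytes_le(be);
    assert_eq!(&le[..4], &[0x06, 0xd6, 0x38, 0x94]);
    assert_eq!(&le[4..6], &[0x22, 0x4f]);
    println!("{le:02x?}");
}
```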
(force-pushed ac74021 to 59d73c9)
Add support for guest_memfd (available in Linux kernel v6.8+), which enables private memory for confidential VMs. Key changes: - Introduce UserMemoryRegion abstraction with guest_memfd fields - Add From impls between kvm_userspace_memory_region2 and UserMemoryRegion - Convert all KVM memory region operations from kvm_userspace_memory_region to kvm_userspace_memory_region2, with automatic fallback to v1 when guest_memfd is not supported - Add set_user_memory_region() wrapper that dispatches to v1/v2 based on kvm_guest_memfd_supported capability - Create guest_memfd via KVM_CREATE_GUEST_MEMFD ioctl when supported - Extend KvmDirtyLogSlot to preserve region2 fields across dirty log start/stop cycles This is prerequisite infrastructure for KVM-based confidential computing that requires private guest memory backed by guest_memfd. Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
Previously, the payload validation rejected an IGVM file combined with a kernel or firmware. Relax this constraint to allow an IGVM carrying a firmware (e.g. Oak stage0) to be paired with a separate kernel image. This enables fw_cfg-style boot where stage0 loads a kernel provided through fw_cfg rather than embedded in the IGVM file itself. Signed-off-by: Ruben Hakobyan <hruben@meta.com>
When we use igvm + kvm, we set up the regs and sregs using the CPUID page. We still need to set up the FPU in configure_vcpu. Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
The SNP guest policy (AMD SEV-SNP ABI bits controlling SMT, migration, debug, etc.) was previously hardcoded inside the MSHV implementation. Widen Vm::sev_snp_init() to accept an SnpPolicy parameter so each hypervisor backend receives the policy at init time. Add get_default_sev_snp_guest_policy() in the VMM to construct the default policy. Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
Change configure_system to take an Option<GuestAddress> since rsdp is wrapped into an option anyways (we use configure system to setup the mptables). Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
The load_payload and load_payload_async functions previously received a sev_snp_enabled flag to decide whether to call load_igvm with or without the host_data parameter. Replace this with a single code path that always passes host_data behind a cfg(feature = "sev_snp") gate, removing the runtime branch and the extra parameter threaded through three call sites. Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
Move IGVM file parsing from load_igvm() into a dedicated parse_igvm() helper in igvm/mod.rs, and parse the file upfront in Vm::new() so the resulting IgvmFile struct is available throughout VM initialization. This is a prerequisite for extracting VMSA SEV features from the parsed IGVM before issuing KVM_SEV_INIT2, which needs sev_features. Signed-off-by: Ruben Hakobyan <hruben@meta.com>
Introduce the SevFd abstraction that wraps /dev/sev and implements the KVM_SEV_INIT2 and KVM_SEV_SNP_LAUNCH_START ioctls for SEV-SNP VM initialization on KVM. Key changes: - Add sev.rs with KvmSevInit and KvmSevSnpLaunchStart ioctl structs matching the kernel layout (linux/arch/x86/include/uapi/asm/kvm.h) - Implement KVM_SEV_INIT2 and KVM_SEV_SNP_LAUNCH_START ioctls - Set KVM_MEMORY_ATTRIBUTE_PRIVATE on newly created memory regions when guest_memfd is supported - Widen SevSnpPageAccessProxy cfg gates from mshv-only to all sev_snp-enabled builds - Make sev_snp_init a required trait method (remove default impl) - Include KVM_SEV_SNP_LAUNCH_START in the seccomp allowlist - Parse VMSA SEV features from IGVM and include them in the KVM_SEV_INIT2 ioctl Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Co-authored-by: Rob Bradford <rbradford@meta.com> Signed-off-by: Rob Bradford <rbradford@meta.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
Implement the KVM_SEV_SNP_LAUNCH_UPDATE ioctl. Extend Vm::import_isolated_pages() with a uaddrs parameter carrying host virtual addresses, which KVM needs, unlike MSHV. Compute uaddrs from guest memory mappings in the IGVM loader. Add KVM_SEV_SNP_LAUNCH_UPDATE to the seccomp allowlist. Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
Add the KVM_SEV_SNP_LAUNCH_FINISH ioctl, which finalizes the SNP launch sequence and transitions the VM into a runnable encrypted state. Additionally, add KVM_SEV_SNP_LAUNCH_FINISH to the seccomp allowlist. Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
SEV-SNP guests will issue this hypercall to signal a change in the page encryption status to the hypervisor. Handle VcpuExit::Hypercall in the KVM vCPU run loop: decode the GPA, page count, and private/shared attribute from the hypercall arguments, then call KVM_SET_MEMORY_ATTRIBUTES to update the page state. Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
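A hedged sketch of the decode step this commit describes. The constants follow linux/include/uapi/linux/kvm_para.h (KVM_HC_MAP_GPA_RANGE = 12, bit 4 of the attributes argument = encrypted/private); double-check them against your kernel headers, and note the `PageStateChange` struct is illustrative, not from the PR.

```rust
const KVM_HC_MAP_GPA_RANGE: u64 = 12;
const KVM_MAP_GPA_RANGE_ENCRYPTED: u64 = 1 << 4;

// What the VMM needs in order to call KVM_SET_MEMORY_ATTRIBUTES.
struct PageStateChange {
    gpa: u64,
    size: u64,
    private: bool,
}

// Decode the hypercall number and its three arguments (GPA, page count,
// attributes) into a page-state-change request; ignore other hypercalls.
fn decode(nr: u64, args: [u64; 3]) -> Option<PageStateChange> {
    if nr != KVM_HC_MAP_GPA_RANGE {
        return None;
    }
    let [gpa, npages, attrs] = args;
    Some(PageStateChange {
        gpa,
        size: npages * 4096,
        private: attrs & KVM_MAP_GPA_RANGE_ENCRYPTED != 0,
    })
}

fn main() {
    let psc = decode(12, [0x10_0000, 4, KVM_MAP_GPA_RANGE_ENCRYPTED]).unwrap();
    assert_eq!(psc.gpa, 0x10_0000);
    assert_eq!(psc.size, 16384);
    assert!(psc.private);
    // Unknown hypercalls are not treated as page-state changes.
    assert!(decode(11, [0, 1, 0]).is_none());
    println!("ok");
}
```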
During SNP boot all guest RAM is initially marked KVM_MEMORY_ATTRIBUTE_PRIVATE. Pages imported via SNP_LAUNCH_UPDATE are properly accepted by the guest, but generic RAM pages (e.g. the AP trampoline at GPA 0xD000) are not. When stage0 on the BSP starts secondary vCPUs via x2APIC, the APs try to execute from the trampoline page through the shared mapping while KVM still has it marked private, causing a KVM_EXIT_MEMORY_FAULT (flags=KVM_MEMORY_EXIT_FLAG_PRIVATE) that previously fell through to the catch-all error, killing the VM. Handle VcpuExit::MemoryFault by toggling the page's memory attribute between private and shared based on the exit flags, allowing the vCPU to retry the access. Signed-off-by: Ruben Hakobyan <hruben@meta.com>
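The toggle the commit describes can be sketched as follows; the flag and attribute values follow linux/include/uapi/linux/kvm.h (both happen to be bit 3), so treat them as assumptions to verify against your headers.

```rust
const KVM_MEMORY_EXIT_FLAG_PRIVATE: u64 = 1 << 3;
const KVM_MEMORY_ATTRIBUTE_PRIVATE: u64 = 1 << 3;

// On a KVM_EXIT_MEMORY_FAULT, pick the attribute that matches what the
// guest access wanted, so the vCPU can retry instead of dying.
fn attributes_for_fault(exit_flags: u64) -> u64 {
    if exit_flags & KVM_MEMORY_EXIT_FLAG_PRIVATE != 0 {
        KVM_MEMORY_ATTRIBUTE_PRIVATE // guest faulted on a private access
    } else {
        0 // guest faulted on a shared access
    }
}

fn main() {
    assert_eq!(
        attributes_for_fault(KVM_MEMORY_EXIT_FLAG_PRIVATE),
        KVM_MEMORY_ATTRIBUTE_PRIVATE
    );
    assert_eq!(attributes_for_fault(0), 0);
    println!("ok");
}
```

The returned attribute value would then be applied to the faulting page via KVM_SET_MEMORY_ATTRIBUTES before resuming the vCPU.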
Adapt the IGVM loader to work with both MSHV and KVM backends, which differ in page type constants, CPUID page layout, and VMSA handling. Abstract page types into a PageTypeConfig struct populated at runtime from the detected hypervisor, replacing hardcoded mshv_bindings constants. Apply the VMSA register state to each vCPU via setup_sev_snp_regs(), translating SevSelector attributes to KVM segment format using a bitfield decoder. KVM's SNP launch path sanitizes certain CPUID bits that could lead to an insecure guest. If the VMM sets these bits, KVM rejects the CPUID page import on the first attempt, requiring a retry with the firmware-corrected values. Pre-clear the known problematic bits before import to avoid the reject-and-retry cycle: - Leaf 0x1, ECX bit 24: TSC_DEADLINE (filtered by KVM) - Leaf 0x7, EBX bit 1: SGX (filtered by KVM) - Leaf 0x7, EDX: clear entirely (contains speculative features) - Leaf 0x80000008, EBX bit 25: filtered by KVM - Leaf 0x80000021, ECX: clear entirely This keeps the CPUID page stable across launch updates and avoids noisy error logs from the retry path. Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Co-authored-by: Dylan Reid <dgreid@fb.com> Signed-off-by: Dylan Reid <dgreid@fb.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
A bootloader/firmware (e.g. stage0) and the VMSA page require dedicated memory regions at fixed GPAs. Add reserve_region_for_stage0() to allocate these regions before IGVM loading begins: - Stage0 at GPA 0xffc0_0000 (4 MB) - VMSA page at GPA 0xffff_ffff_f000 (4 KB) These reservations are KVM-only; MSHV handles stage0/VMSA placement through its own isolated import path. Also add fw_cfg device creation and SYS_statx to the vCPU seccomp allowlist (needed by stage0's file access pattern). Co-authored-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Keith Adler <kadler@cloudflare.com> Co-authored-by: Alex Orozco <aorozco@google.com> Signed-off-by: Alex Orozco <aorozco@google.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
The Linux x86 boot protocol defines the setup area as (setup_sects + 1) * 512 bytes. Previously we exported only the boot_params buffer (4096 bytes), which is wrong for kernels with setup_sects >= 8 where the actual setup area exceeds boot_params. Truncate or extend the existing buffer to the correct setup_sects- derived length, reading any extra bytes directly from the kernel file. This avoids an extra allocation in the common case (setup_sects <= 7) and matches QEMU's fw_cfg_add_kernel() behavior in hw/i386/x86-common.c. Signed-off-by: Dylan Reid <dgreid@fb.com>
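The size rule from the commit message, as defined by the Linux x86 boot protocol (a `setup_sects` of 0 means 4 for historical reasons), worked through in a short sketch:

```rust
// Setup area length per the Linux x86 boot protocol:
// (setup_sects + 1) * 512 bytes, with 0 meaning the historical default 4.
fn setup_area_len(setup_sects: u8) -> usize {
    let sects = if setup_sects == 0 { 4 } else { setup_sects as usize };
    (sects + 1) * 512
}

fn main() {
    // setup_sects <= 7 fits inside a single 4096-byte boot_params buffer...
    assert!(setup_area_len(7) <= 4096);
    // ...but setup_sects >= 8 does not, which is the bug being fixed.
    assert_eq!(setup_area_len(8), 4608);
    // Historical default of 4 sectors when the field is zero.
    assert_eq!(setup_area_len(0), 2560);
    println!("ok");
}
```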
Boot-time block devices on PCI segment 0 use 32-bit BARs so early firmware can access them without additional identity mapping in the firmware page tables. However, hot-plugged block devices are only ever seen by the OS kernel which handles 64-bit BARs natively. Switch hot-plugged block devices to 64-bit BARs to avoid exhausting the scarce 32-bit MMIO window (typically 2-3 GB between RAM and 4 GB) when many devices are hot-plugged. Extract the BAR sizing decision into use_64bit_bar_for_virtio_device() and thread an is_hotplug flag through add_virtio_pci_device(). Add unit tests covering all relevant combinations. Signed-off-by: Dylan Reid <dgreid@fb.com>
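The BAR-width decision described above can be sketched as a single predicate. The real helper is `use_64bit_bar_for_virtio_device()`; the signature and the non-zero-segment behavior below are assumptions for illustration.

```rust
// Boot-time devices on PCI segment 0 keep 32-bit BARs so early firmware
// can reach them without extra identity mapping; hot-plugged devices
// (and, assumed here, devices on other segments) get 64-bit BARs to
// spare the scarce 32-bit MMIO window.
fn use_64bit_bar(pci_segment: u16, is_hotplug: bool) -> bool {
    is_hotplug || pci_segment != 0
}

fn main() {
    assert!(!use_64bit_bar(0, false)); // boot-time, segment 0: 32-bit BAR
    assert!(use_64bit_bar(0, true)); // hot-plugged: 64-bit BAR
    assert!(use_64bit_bar(1, false)); // other segments: 64-bit BAR (assumed)
    println!("ok");
}
```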
Add build and clippy jobs for kvm+sev_snp+igvm+fw_cfg feature combination. Signed-off-by: Keith Adler <kadler@cloudflare.com> Signed-off-by: Ruben Hakobyan <hruben@meta.com>
(force-pushed 59d73c9 to 8ab8332)
rbradford
left a comment
Let's land this - we can always iterate further later.
rbradford merged commit ea2df94 into cloud-hypervisor:main.
This PR adds support to Cloud Hypervisor to boot a guest on KVM with AMD SEV-SNP support and an Oak stage0 firmware built into an IGVM file.
The work here is derived from #7787 and #7316, with some changes, improvements and additions to have it boot a SEV-SNP guest on real hardware.
Key Changes
KVM_SEV_INIT2, KVM_SEV_SNP_LAUNCH_START, KVM_SEV_SNP_LAUNCH_UPDATE, KVM_SEV_SNP_LAUNCH_FINISH

Fixes #6653
Testing