[RFC] virtio-blk storage via mmio using virtio-drivers (Qemu) #343

Draft
wants to merge 3 commits into main

Conversation

@osteffenrh (Contributor) commented May 14, 2024

Add support for virtio-blk storage devices using the virtio-mmio transport.
MMIO accesses are done via explicit vmgexits. Interrupts are not used (-> polling).

The virtio-drivers crate from the rcore-os project is used. It required some modifications to support custom MMIO access functions.
Crate License: MIT.

This PR requires a patched version of Qemu, which adds virtio-mmio slots to the Q35 machine model.
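To illustrate the MMIO-via-vmgexit path described above, the virtio-drivers Hal implementation ends up looking roughly like this (condensed from the code discussed in the review below; the physical-address computation shown is the initial version, which later changes to a page-table lookup, and the error handling is only a placeholder):

unsafe impl virtio_drivers::Hal for SvsmHal {
    // ... dma_alloc / dma_dealloc / share / unshare elided ...

    unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T {
        // The register "reference" only encodes the MMIO address; the access
        // itself is routed through the GHCB, i.e. an explicit vmgexit.
        let paddr = PhysAddr::from((src as *const T) as u64);
        cpu::ghcb::current_ghcb()
            .mmio_read::<T>(paddr)
            .expect("GHCB MMIO read failed") // placeholder error handling
    }

    unsafe fn mmio_write<T: Sized + Copy>(dst: &mut T, v: T) {
        let paddr = PhysAddr::from((dst as *mut T) as u64);
        cpu::ghcb::current_ghcb()
            .mmio_write::<T>(paddr, &v)
            .expect("GHCB MMIO write failed") // placeholder error handling
    }
}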

ToDo

For persistent storage to be usable we still need to:

  • Copy virtio-drivers into the repo
  • Set up some basic Fuzzing
  • Improve memory management:
    • Allow buffers > one page
    • Allocate buffers in a smarter/more compact way?
  • Add a block device layer to Coconut and integrate virtio-blk into it
  • Add encryption layer
  • Add a simple file system on top of the block layer. Choices?
  • Block access to the MMIO ranges for VMPL!=0
  • Supply MMIO base address via IGVM file

Build & Run

Building & Preparations

Build the patched Qemu as usual with IGVM support.

git clone https://github.com/osteffenrh/qemu.git --branch virtio-pr --single-branch --depth=1
cd qemu
./configure --prefix=$HOME/bin/qemu-svsm/ --target-list=x86_64-softmmu --enable-igvm
ninja -C build/
make install

Build Coconut as usual.

Create an empty disk image for Coconut to use

truncate -s16M mmio.raw

Launching Qemu

Add

-machine x-svsm-virtio-mmio=on

and

-global virtio-mmio.force-legacy=false \
-drive file="mmio.raw",format=raw,if=none,id=mmio \
-device virtio-blk-device,drive=mmio

to your Qemu command line.

Full example:

$HOME/bin/qemu-svsm/bin/qemu-system-x86_64 \
  -enable-kvm \
  -cpu EPYC-v4 \
  -smp 4 \
  -machine q35,confidential-guest-support=sev0,memory-backend=mem0 \
  -machine x-svsm-virtio-mmio=on \
  -object memory-backend-memfd,id=mem0,size=4G,share=true,prealloc=false,reserve=false \
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,init-flags=1,igvm-file="bin/coconut-qemu.igvm" \
  -no-reboot \
  -drive file="disk.qcow2",if=none,id=disk0,format=qcow2,snapshot=on \
  -device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true \
  -device scsi-hd,drive=disk0 \
  -nographic \
  -serial stdio \
  -monitor none \
  -global virtio-mmio.force-legacy=false \
  -drive file="mmio.raw",format=raw,if=none,id=mmio \
  -device virtio-blk-device,drive=mmio

A short write-read test will be run on the virtio-blk device, followed by a
simple write speed benchmark. This requires external time measurement.

Example output (timestamps added externally):

   0.0011s   [SVSM] Virtio test trace
   0.0004s   [SVSM] Detected virtio MMIO device with vendor id 0x554D4551, device type Block, version Modern
   0.0006s   [SVSM] config: 0xfef03100
   0.0007s   [SVSM] found a block device of size 131072KB
   0.0173s   [SVSM] Write+Read Test Start
   0.0002s   [SVSM] Write+Read Test End
   0.0005s   [SVSM] Write Benchmark Start
  65.2324s   [SVSM] virtio-blk start write. Block size = 512
   0.0005s   [SVSM] virtio-blk end write. Block size = 512, Total bytes = 134217728
   8.1706s   [SVSM] virtio-blk start write. Block size = 4096
   0.0003s   [SVSM] virtio-blk end write. Block size = 4096, Total bytes = 134217728
   0.0008s   [SVSM] Write Benchmark End
   0.0005s   [SVSM] Virtio test end

Write Speed

Write speed depends on the block size used.
From the output above (134217728 bytes ≈ 128 MiB written in 65.23 s and 8.17 s, respectively):

  • 512-byte writes: ≈ 1.96 MiB/s
  • 4096-byte writes: ≈ 15.7 MiB/s

@osteffenrh (Contributor Author)

I'll rebase and retest everything, just wanted to put this out here already.

@joergroedel (Member)

Wow, cool! We might also think about using IRQs once the COCONUT kernel has the infrastructure for it.

@00xc (Member) commented May 15, 2024

> Add support for virtio-blk storage devices using the virtio-mmio transport. MMIO accesses are done via explicit vmgexits. Interrupts are not used (-> polling).
>
> The virtio-drivers crate from the rcore-os project is used. It required some modifications to support custom MMIO access functions. Crate License: MIT.

I haven't gone too deep in the codebase, but can't this be done by implementing Transport for a new MmioTransport-like type?

@osteffenrh (Contributor Author)

> > Add support for virtio-blk storage devices using the virtio-mmio transport. MMIO accesses are done via explicit vmgexits. Interrupts are not used (-> polling).
> > The virtio-drivers crate from the rcore-os project is used. It required some modifications to support custom MMIO access functions. Crate License: MIT.
>
> I haven't gone too deep in the codebase, but can't this be done by implementing Transport for a new MmioTransport-like type?

Unfortunately not, because the devices (virtio-blk for example) use mmio accesses directly too (to the configuration space), without going through the transport.

@osteffenrh (Contributor Author)

To be clear, there is an MMIO transport in the virtio-drivers crate, but it simply uses volatile reads and writes by default. My change here was to add MMIO access functions to the hardware abstraction trait, to be able to use explicit vmgexits instead of "plain" volatile accesses.
Since we don't have a #VC handler yet that supports MMIO, that seemed like the way to go.
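As a sketch, the extended hardware abstraction trait looks something like this; the method signatures match the ones used elsewhere in this PR, but the default bodies (plain volatile accesses, as upstream does) are an assumption and not the fork's literal code:

pub unsafe trait Hal {
    // ... existing methods (dma_alloc, dma_dealloc, share, unshare, ...) ...

    /// Read a value of type T from an MMIO register.
    /// Default: a plain volatile read; the SVSM implementation overrides this
    /// with a GHCB-based vmgexit.
    unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T {
        core::ptr::read_volatile(src)
    }

    /// Write a value of type T to an MMIO register.
    unsafe fn mmio_write<T: Sized + Copy>(dst: &mut T, v: T) {
        core::ptr::write_volatile(dst, v);
    }
}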

@osteffenrh force-pushed the virtio-pr branch 4 times, most recently from 3983c9d to 89212e9 on May 20, 2024 19:31
@kraxel commented May 21, 2024

> Wow, cool! We might also think about using IRQs once the COCONUT kernel has the infrastructure for it.

Not convinced this makes sense, so my qemu patch doesn't assign interrupts to these virtio-mmio transport slots.
Do we want to run virtio devices asynchronously? Is the complexity worth the performance benefit?

When using interrupts: what would that look like? Would SVSM (partly) emulate the lapic and ioapic then, to make sure the guest OS can't tamper with the IRQ lines owned by SVSM?

@joergroedel (Member)

> > Wow, cool! We might also think about using IRQs once the COCONUT kernel has the infrastructure for it.
>
> Not convinced this makes sense, so my qemu patch doesn't assign interrupts to these virtio-mmio transport slots. Do we want to run virtio devices asynchronously? Is the complexity worth the performance benefit?
>
> When using interrupts: what would that look like? Would SVSM (partly) emulate the lapic and ioapic then, to make sure the guest OS can't tamper with the IRQ lines owned by SVSM?

Yes, the SVSM would own the HV-emulated APIC (or the vAPIC in the TDX case) and emulate another X(2)APIC for the guest OS. The IOAPIC emulation remains on the host side. APIC support in the SVSM is needed anyway for TDX to implement IPIs for TLB flushes. On the SNP side, IRQ support is needed to mitigate the AHOI group of attacks.

Comment on lines 389 to 409
fn read_buffer_as<T>(&mut self, offset: isize) -> Result<T, GhcbError>
where
    T: Sized + Copy,
{
    let size: isize = mem::size_of::<T>() as isize;

    if offset < 0 || offset + size > (GHCB_BUFFER_SIZE as isize) {
        return Err(GhcbError::InvalidOffset);
    }

    unsafe {
        let src = self
            .buffer
            .as_mut_ptr()
            .cast::<u8>()
            .offset(offset)
            .cast::<T>();
        Ok(*src)
    }
}
@00xc (Member) commented May 24, 2024

We do not need to use isize here for the offset. I understand this was perhaps done to mirror write_buffer() below, but we can do the right thing here (and in the caller), i.e., use usize, and thus add() instead of offset().

Also, there are some safety issues:

  • We need to check that offset + size does not overflow (otherwise add() / offset() might cause UB).
  • We need to check that src has the alignment requirements for T. We should either add a check or use read_unaligned(). Otherwise this might cause UB.

Something like this:

Suggested change (replace the current implementation):

fn read_buffer_as<T>(&mut self, offset: isize) -> Result<T, GhcbError>
where
    T: Sized + Copy,
{
    let size: isize = mem::size_of::<T>() as isize;

    if offset < 0 || offset + size > (GHCB_BUFFER_SIZE as isize) {
        return Err(GhcbError::InvalidOffset);
    }

    unsafe {
        let src = self
            .buffer
            .as_mut_ptr()
            .cast::<u8>()
            .offset(offset)
            .cast::<T>();
        Ok(*src)
    }
}

with:

fn read_buffer_as<T>(&mut self, offset: usize) -> Result<T, GhcbError>
where
    T: Sized + Copy,
{
    offset
        .checked_add(mem::size_of::<T>())
        .filter(|end| *end <= GHCB_BUFFER_SIZE)
        .ok_or(GhcbError::InvalidOffset)?;
    // SAFETY: we have verified that offset is within bounds and does not
    // overflow
    let src = unsafe { self.buffer.as_ptr().add(offset) };
    if src.align_offset(mem::align_of::<T>()) != 0 {
        return Err(GhcbError::InvalidOffset);
    }
    // SAFETY: we have verified the pointer is aligned, as well as within
    // bounds.
    unsafe { Ok(src.cast::<T>().read()) }
}

I believe these additional checks should not cause a noticeable performance hit. As a last resort, we could make this function (as well as write_buffer()) unsafe and have the caller ensure the safety requirements are met.
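For completeness, the "unsafe variant" mentioned as a last resort could look something like this (sketch only, not part of the suggestion above):

/// # Safety
///
/// The caller must ensure that `offset + size_of::<T>()` does not overflow,
/// stays within `GHCB_BUFFER_SIZE`, and that `buffer + offset` is suitably
/// aligned for `T`.
unsafe fn read_buffer_as<T>(&mut self, offset: usize) -> T
where
    T: Copy,
{
    // All bounds and alignment checks are the caller's responsibility here.
    unsafe { self.buffer.as_ptr().add(offset).cast::<T>().read() }
}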

@osteffenrh (Contributor Author)

Applied the suggestion, with one change:

  unsafe { Ok(src.cast::<T>().read()) }

does not work. Coconut hangs.

This works:

   unsafe { Ok(*(src.cast::<T>())) }

Member

This sounds like UB. Can you try with read_volatile() instead? The compiler probably cannot infer that the buffer changes after a VMGEXIT.
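That is, for the line in question, presumably something along the lines of:

   unsafe { Ok(src.cast::<T>().read_volatile()) }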

@osteffenrh (Contributor Author)

Tried it. Still the same: hangs.

Also very strange: The last thing I see is "Validating memory", which happens way before
the virtio code (which is the only place calling this) runs.

Member

One last try: could you update write_buffer so that it uses write_volatile() as well? E.g. something like:

diff --git a/kernel/src/sev/ghcb.rs b/kernel/src/sev/ghcb.rs
index 9a5fe44..682d2cf 100644
--- a/kernel/src/sev/ghcb.rs
+++ b/kernel/src/sev/ghcb.rs
@@ -386,7 +386,7 @@ impl GHCB {
 
     fn write_buffer<T>(&mut self, data: &T, offset: isize) -> Result<(), GhcbError>
     where
-        T: Sized,
+        T: Copy,
     {
         let size: isize = mem::size_of::<T>() as isize;
 
@@ -401,9 +401,8 @@ impl GHCB {
                 .cast::<u8>()
                 .offset(offset)
                 .cast::<T>();
-            let src = data as *const T;
 
-            ptr::copy_nonoverlapping(src, dst, 1);
+            dst.write_volatile(*data);
         }
 
         Ok(())

Perhaps with both of them being volatile the compiler can make the right assumptions.

@osteffenrh (Contributor Author)

This is strange. I retried the approaches mentioned above (read, read_volatile, read_volatile + write_volatile, and the plain "cast" line) and now they all work (the virtio-blk test code runs).
Before, Coconut would hang reliably on all approaches except the one I mentioned. 🤷‍♂️

So, shall we use read_volatile and write_volatile to be safe? Or what would you recommend?

Member

Have you tried both debug and release modes? Anyhow, yes, we should probably use the volatile versions to be safe.

@osteffenrh (Contributor Author)

I tried both, debug and release.

Using volatile accesses now.

}

unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T {
    let paddr = PhysAddr::from((src as *const T) as u64);
Member

Not sure I understand this, why are we not going through virt_to_phys()?

@osteffenrh (Contributor Author)

Since we can't access mmio locations directly and have to go through the ghcb mmio functions anyway, the mmio area is not mapped.

Instead, the physical MMIO base address is cast to the VirtIOHeader struct. This only works because the struct members are never actually accessed directly. It only serves as a fancy way to calculate the correct offsets of the individual registers. (This is just how the crate does it).

It should be possible to map the mmio range, cast that as VirtIOHeader and then translate back to the paddr when doing the mmio calls.

@osteffenrh (Contributor Author)

Trying this now... virt_to_phys fails with "Invalid physical address" (Copy&paste error btw, it is printing the vaddr there...)

What is the correct way to create the mapping, btw?

Member

For a temporary mapping use PerCPUPageMappingGuard. For a permanent mapping you would probably need to call into this_cpu().get_pgtable().map_region(..).

@osteffenrh (Contributor Author)

Mapping the MMIO range also works now:

diff --git a/kernel/src/virtio/mod.rs b/kernel/src/virtio/mod.rs
index 0f9d8b8..63da985 100644
--- a/kernel/src/virtio/mod.rs
+++ b/kernel/src/virtio/mod.rs
@@ -4,7 +4,7 @@
 //
 // Author: Oliver Steffen <osteffen@redhat.com>
 
-use core::ptr::NonNull;
+use core::ptr::{addr_of, NonNull};
 
 use virtio_drivers::{
     device::blk::{VirtIOBlk, SECTOR_SIZE},
@@ -15,9 +15,7 @@ use virtio_drivers::{
 };
 
 use crate::{
-    address::PhysAddr,
-    cpu,
-    mm::{alloc::*, page_visibility::*, *},
+    address::{PhysAddr, VirtAddr}, cpu::{self, percpu::this_cpu}, mm::{alloc::*, page_visibility::*, *}
 };
 
 struct SvsmHal;
@@ -107,7 +105,7 @@ unsafe impl virtio_drivers::Hal for SvsmHal {
     }
 
     unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T {
-        let paddr = PhysAddr::from((src as *const T) as u64);
+        let paddr = this_cpu().get_pgtable().phys_addr(VirtAddr::from(addr_of!(*src))).unwrap();
 
         cpu::ghcb::current_ghcb()
             .mmio_read::<T>(paddr)
@@ -115,7 +113,7 @@ unsafe impl virtio_drivers::Hal for SvsmHal {
     }
 
     unsafe fn mmio_write<T: Sized + Copy>(dst: &mut T, v: T) {
-        let paddr = PhysAddr::from((dst as *mut T) as u64);
+        let paddr = this_cpu().get_pgtable().phys_addr(VirtAddr::from(addr_of!(*dst))).unwrap();
 
         cpu::ghcb::current_ghcb()
             .mmio_write::<T>(paddr, &v)
@@ -127,8 +125,12 @@ unsafe impl virtio_drivers::Hal for SvsmHal {
 pub fn test_mmio() {
     static MMIO_BASE: u64 = 0xfef03000; // Hard-coded in Qemu
 
+    let paddr = PhysAddr::from(MMIO_BASE);
+    let mem = PerCPUPageMappingGuard::create_4k(paddr).expect("Error mapping MMIO region");
+
+    log::info!("mapped MMIO range {:016x} to vaddr {:016x}", MMIO_BASE, mem.virt_addr());
     // Test code below taken from virtio-drivers aarch64 example.
-    let header = NonNull::new(MMIO_BASE as *mut VirtIOHeader).unwrap();
+    let header = NonNull::new(mem.virt_addr().as_mut_ptr() as *mut VirtIOHeader).unwrap();
     match unsafe { MmioTransport::<SvsmHal>::new(header) } {
         Err(e) => log::warn!(
             "Error creating VirtIO MMIO transport at {:016x}: {}",

}

unsafe fn mmio_write<T: Sized>(dst: &mut T, v: T) {
    let paddr = PhysAddr::from((dst as *mut T) as u64);
Member

Same here, why are we converting a regular reference to a PhysAddr directly?

osteffenrh added a commit to osteffenrh/coconut-svsm that referenced this pull request May 28, 2024
see coconut-svsm#343 (review)

Signed-off-by: Oliver Steffen <osteffen@redhat.com>
osteffenrh added a commit to osteffenrh/coconut-svsm that referenced this pull request Jun 1, 2024
see coconut-svsm#343 (review)

Signed-off-by: Oliver Steffen <osteffen@redhat.com>

@osteffenrh force-pushed the virtio-pr branch 2 times, most recently from 94f788b to 135aba9 on June 1, 2024 09:06
Use volatile access to ensure writes are not elided or reordered
and that data actually reaches the buffer.

Signed-off-by: Oliver Steffen <osteffen@redhat.com>
Add GHCB::mmio_read() and GHCB::mmio_write() to perform MMIO
accesses via vmgexit.

Signed-off-by: Oliver Steffen <osteffen@redhat.com>
Demonstrate the use of the virtio-drivers crate, accessing a
virtio-blk device via the virtio-mmio transport.

Signed-off-by: Oliver Steffen <osteffen@redhat.com>
@@ -384,9 +386,31 @@ impl GHCB {
Ok(())
}

fn read_buffer_as<T>(&mut self, offset: usize) -> Result<T, GhcbError>
where
T: Sized + Copy,
Member

Sized is opt-out, so it is always implied unless something like ?Sized is used.
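A quick illustration of the point (not from this PR):

// These two signatures are equivalent: the `Sized` bound is implied.
fn takes_sized<T>(_x: &T) {}
fn takes_sized_explicit<T: Sized>(_x: &T) {}

// Spelling the bound out only matters when opting out of it:
fn takes_maybe_unsized<T: ?Sized>(_x: &T) {}

fn main() {
    takes_sized(&42u32);
    takes_maybe_unsized("str is an unsized type"); // T = str here
}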

// TODO: allow more than one page.
// This currently works, because in "modern" virtio mode the crate only allocates
// one page at a time.
assert!(pages == 1);
Member

I think just using allocate_pages() with some extra logic to zero them and make them all shared should work, right?
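A rough sketch of what that could look like in the Hal implementation; allocate_pages(), get_order(), virt_to_phys(), and make_page_shared() are assumed helper names based on this discussion and may not match the actual COCONUT API:

// Hypothetical multi-page dma_alloc, replacing the assert!(pages == 1).
// Helper names and conversions are assumptions; error handling is elided.
fn dma_alloc(pages: usize, _direction: BufferDirection) -> (virtio_drivers::PhysAddr, NonNull<u8>) {
    let size = pages * PAGE_SIZE;

    // Allocate a contiguous, page-aligned region large enough for all pages.
    let vaddr = allocate_pages(get_order(size)).expect("virtio DMA allocation failed");

    // The crate expects zeroed buffers, and the host must be able to access
    // them, so zero the region and flip every page to shared.
    unsafe { vaddr.as_mut_ptr::<u8>().write_bytes(0, size) };
    for offset in (0..size).step_by(PAGE_SIZE) {
        make_page_shared(vaddr + offset);
    }

    // Translate back to a physical address for the crate (its PhysAddr is a usize).
    let paddr = virt_to_phys(vaddr);
    (paddr.into(), NonNull::new(vaddr.as_mut_ptr::<u8>()).unwrap())
}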
