Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance virtio subsystem for better comformance to virtio spec #809

Merged
merged 9 commits into from Jan 8, 2019

Conversation

jiangliu
Copy link
Contributor

@jiangliu jiangliu commented Jan 3, 2019

Issue #, if available:

Description of changes:
There are some bugs in virtio subsystem which may cause firecracker hypervisor panicking, so enhance the virtio subsystem for better Spec conformance and avoid panic.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The VirtIO Spec 1.0, sec 2.4 defines the alignment constraints for
VirtIO queue's descriptor table, available ring and used ring.
So enforce the alignment constraints in Queue::is_valid().

Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
@acatangiu
Copy link
Contributor

nit: change the commit summary of bbe0b4c from "virtio: fix a potential attack channel through MMIO transport layer" to something less sensational like: "virtio: harden the MMIO transport spec enforcements" or "virtio: enhance for better comformance to virtio spec".

@acatangiu
Copy link
Contributor

Rust style test failed, please run cargo fmt --all and amend the offending commit.

@jiangliu
Copy link
Contributor Author

jiangliu commented Jan 3, 2019

thanks for review, I have updated the changeset according to the review comments:)

@raduweiss raduweiss added Feature: IO Virtualization Type: Bug Indicates an unexpected problem or unintended behavior labels Jan 3, 2019
@dianpopa dianpopa self-requested a review January 4, 2019 11:27
@acatangiu acatangiu self-requested a review January 4, 2019 11:28
LittleEndian::write_u32(&mut buf[..], 0x8f);
d.write(0x70, &buf[..]);
assert_eq!(d.driver_status, 0x8f);
// TODO: assert!(d.device_activated);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why assert!(d.device_activated) again? Is there any path to de-activate an activated device?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a reminder that there still two cases unhandled yet:

  1. virtio device driver writes FALIED flags to driver_status field;
  2. virtio device driver writes 0 to driver_status fields to reset device;

Still thinking about how to handle above two state transitions which may affect the device_activated flag. The block issue is that VirtioDevice::reset() is defined as optional and the net/block virtio backend driver doesn't support reset yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create issues for outstanding TODOs and reference them in the codebase like:

// TODO (#xxx): issue-xxx-summary

Having said that, please be aware that we've yet found no customer need to support device reset.


LittleEndian::write_u32(&mut buf[..], 0x0);
d.write(0x70, &buf[..]);
// TODO: assert_eq!(d.driver_status, 0x0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a valid check, uncomment it.

@@ -322,9 +322,6 @@ impl Queue {

/// A consuming iterator over all available descriptor chain heads offered by the driver.
pub fn iter<'a, 'b>(&'b mut self, mem: &'a GuestMemory) -> AvailIter<'a, 'b> {
if !self.is_valid(mem) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure about this. Your logic is right, but we need to be very confident there is no way to alter any of these queue params after device activation, otherwise removing this opens up a security hole.

Other reviewers, please think hard on this part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed all accesses of fields validated by this function and they are indeed not modified anywhere after VirtDevice::activate() 👍

where
F: FnMut(&MemoryMapping, usize, usize) -> Result<()>,
{
// Fast path: content doesn't cross MemoryRegion boundary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fast path is the same code as do_in_region(). Why not call that here directly and in case of Err move to the slow-path.

Would help de-duplicate code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do_in_region_full() has been removed.

== (DEVICE_ACKNOWLEDGE | DEVICE_DRIVER | DEVICE_FEATURES_OK) =>
{
self.driver_status = v;
if !self.device_activated && self.are_queues_valid() {
Copy link
Contributor

@acatangiu acatangiu Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a comment here:

// If the driver incorrectly sets up the queues, the following check will fail and take the device into an unusable state.

I think it's important to call out this scenario since it's something we can't currently recover from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -424,6 +451,52 @@ impl GuestMemory {
}
Err(Error::InvalidGuestAddress(guest_addr))
}

/// Read the whole content from one or more MemoryRegions
fn do_in_region_full<F>(&self, guest_addr: GuestAddress, size: usize, mut cb: F) -> Result<()>
Copy link
Contributor

@acatangiu acatangiu Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see the point to this (big) function. It only adds value when trying to read/write an object which spans across multiple adjacent regions.

We never configure adjacent memory regions. There is either:

  1. A single contiguous region.
  2. A region followed by a hole followed by a second (non-adjacent) region.

I would drop do_in_region_full().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. It only matters when enabling memory hot-adding. Will remove do_in_region_full().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi acatangiu,
I just sent out an RFC to complete the virtio device state machine, which may help to review related patches. Will address other review comments later.
Thank!

Copy link
Contributor

@dhrgit dhrgit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting an object across memory regions only makes sense if the regions are adjacent. In our case, they aren't. If they were, we could simply merge them into a single region and remove some implementation complexity.

There are at most two memory regions, separated by the 32-bit PCI gap. Check out vmm/src/lib.rs::init_guest_memory and x86_64/src/lib.rs::arch_memory_regions for the mem setup code.

if self.check_driver_status(DEVICE_FEATURES_OK, DEVICE_DRIVER_OK | DEVICE_FAILED) {
self.with_queue_mut(f);
} else {
#[cfg(not(release))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn t we want to log this warning on release versions? The MmioDevice::write() and MmioDevice::read() have similar log messages and we do log them.

Copy link
Contributor Author

@jiangliu jiangliu Jan 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to avoid injecting to many log messages to the log system by a malicious guest kernel. And it's ok to remove the conditional compilation.

fn update_driver_status(&mut self, v: u32) {
// match changed bits
match !self.driver_status & v {
DEVICE_ACKNOWLEDGE if self.driver_status == 0 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you say we declare a constant for the 0 value (i.e the value that is used to initialize the driver_status field).

Ok(MmioDevice {
            device: device,
            device_activated: false,
            features_select: 0,
            acked_features_select: 0,
            queue_select: 0,
            interrupt_status: Arc::new(AtomicUsize::new(0)),
            interrupt_evt: Some(EventFd::new()?),
            driver_status: 0,
            config_generation: 0,
            queues: queues,
            queue_evts: queue_evts,
            mem: Some(mem),
        })

Even if the probability is small, If we were to change this value to something else in the init code, the state machine algorithm would fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if self
.check_driver_status(DEVICE_DRIVER, DEVICE_FEATURES_OK | DEVICE_FAILED)
{
self.device.ack_features(self.acked_features_select, v);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if after setting the DEVICE_DRIVER bit during initialization, a reset is issued by the guest os?
Should we still acknowledge features?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A RESET command will cause the device goes into INIT state, and the guest OS need to reinitialize the device from start. So we don't acknowledge the features after a RESET until DEVICE_DRIVER is set again.

_ => {
warn!("unknown virtio mmio register write: 0x{:x}", offset);
return;
}
}
}
0x100...0xfff => return self.device.write_config(offset - 0x100, data),
0x100...0xfff => {
if self.check_driver_status(DEVICE_DRIVER, DEVICE_FAILED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the spec I understand that the config space can only be written after FEATURES_OK has been set:
Step 4 from 3.1.1 section:
4. Read device feature bits, and write the subset of feature bits understood by the OS and driver to the
device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration
fields to check that it can support the device before accepting it.

Step 7 from 3.1.1
7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup,
reading and possibly writing the device’s virtio configuration space, and population of virtqueues

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, so we only the device state on MmioDevice::write() and let it go in MmioDevice::read().

LittleEndian::write_u32(&mut buf[..], 0x8f);
d.write(0x70, &buf[..]);
assert_eq!(d.driver_status, 0x8f);
// TODO: assert!(d.device_activated);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a TODO? Same for the other following TODOs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These TODOs have been covered by the last RFC patch.

}

/// Returns the address plus the offset if it is in range.
pub fn checked_offset(&self, addr: GuestAddress, offset: usize) -> Option<GuestAddress> {
addr.checked_add(offset)
.and_then(|a| if a < self.end_addr() { Some(a) } else { None })
if let Some(tgt) = addr.checked_add(offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it would be nice to rename tgt to something more intuitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed addr as base and tgt as addr, so it becomes addr = base + offset.

@dianpopa
Copy link
Contributor

dianpopa commented Jan 4, 2019

By using the above command you can check the coverage of the whole project:

cd tests
sudo env "PATH=$PATH" ./testrun.sh -p apt-get -i /home/local/ANT/dpopa/firecracker/util/s3-images-oct/ -r integration_tests/build/test_coverage.py 

Unfortunately, we still do not have a dynamic process in place for making sure that the coverage does not decrease from a commit to another.
However, I noticed that before your commits the coverage was around 82.2% and after them it dropped to 81.8%. Can you please make sure that newly added functions/code branches are covered by unit tests?

@jiangliu
Copy link
Contributor Author

jiangliu commented Jan 4, 2019

Great, will handle the coverage issue in next round.

@@ -183,6 +183,22 @@ impl MmioDevice {
}
}

fn reset(&mut self) {
assert!(!self.device_activated);
Copy link
Contributor

@acatangiu acatangiu Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't do assert in release code. I suggest logging an error and moving on (in this case return-ing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the leftover assert! too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for my mistake, fixed:)

self.interrupt_status.store(0, Ordering::SeqCst);
self.driver_status = 0;
// . Keep interrupt_evt and queue_evts as is. There may be pending
// notificaitons in those eventfds, but nothing will happen other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo in notifications

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@acatangiu acatangiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me (including the final RFC commit handling device reset). Please address outstanding comments and I will approve the PR.

@@ -322,9 +322,6 @@ impl Queue {

/// A consuming iterator over all available descriptor chain heads offered by the driver.
pub fn iter<'a, 'b>(&'b mut self, mem: &'a GuestMemory) -> AvailIter<'a, 'b> {
if !self.is_valid(mem) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed all accesses of fields validated by this function and they are indeed not modified anywhere after VirtDevice::activate() 👍

The MMIO based VirtIO devices support the VirtIO 1.0 capability,
but the code is a little hard to be understood. Improve code
readability by adding a comment.

Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
The VirtIO queue code in queue.rs assumes all data structures, including
the descriptor table, available ring and used ring, are consistent and
legal once Queue::is_valid() returns true for an queue. If this assumption
is broken, it may cause panic.

On the other hand, most fields in a working queue object, including size,
ready, desc_table, avail_ring and used_ring, could be modified by the
MmioDevice::write() method. That means an untrusted guest kernel could
easily break the assumption made by queue.rs, and then cause the firecracker
hypervisor panic.

So enhance the VirtIO MMIO transport driver to strictly follow the state
machine defined in the VirtIO Spec IO. It closes the way for an untrusted
guest kernel to modify firecracker internal data structures.

Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
The lifecycle of VirtIO queue object is as below:
1) MMIO transport driver creates queues
2) MMIO transport validates all queues are valid by calling Queue::is_valid()
3) Passing queues to VirtDevice by calling VirtDevice::activate()
4) VirtDevice device driver creates iterator to access available descritpors
5) MMIO transport reset the VirtDevice by calling VirtDevice::reset()
6) MMIO transport changes/destroys queues

The MMIO transport driver won't touch those queues between stage 3 and stage 5,
so Queue::is_valid will always return true during stage 4. Get rid of the
redundant invocation of Queue::is_valid() in Queue::iter().

Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Fetch the entire VirtIO descriptor all at once instead of fetching each
field one by one.

Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Split the memory copy operations into to cases:
1) Full copy by do_in_region() which must copy all requested memory content
   or fails. Pass an additional parameter 'size' into function do_in_region(),
   and validate memory region before copying memory content. By this way the
   destination memory will be either fully copied or untouched.
2) Partial copy by do_in_region_patial() which copies memory as much as
   possible and returns size of copied memory content.


Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
GuestMemory::address_in_range() and GuestMemory::checked_offset() assumes
memory address is continuous. It's a dangerous assumption because there
is a black hole just below 4G when guest memory is about 4G or more.
Then it may cause invalid memory access in VirtIO subsystem.

Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Complete the virtio device state machine by:
1) Handle writing FAILED flag into "driver status" control field.
2) Handle writing 0 into "driver status" control field to reset the device.
   If the the backend driver supports reset operation, the device will be
   reset. Otherwise the device will be put into FAILED state.

With these changes, the virtio mmio transport layer implements a full
state machine according to Virtio Spec. The backend driver is still optional
to support reset operation.

Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
@acatangiu
Copy link
Contributor

@jiangliu I see your comments on our review, but I think you forgot to push the changes as well 😄
I still see the original version of the code.

@jiangliu
Copy link
Contributor Author

jiangliu commented Jan 7, 2019

@jiangliu I see your comments on our review, but I think you forgot to push the changes as well 😄
I still see the original version of the code.
I refined the code according to review comments, but fails to run CI code coverage test due to env issues, so delayed the push. Still trying to setup the CI env:)

@jiangliu
Copy link
Contributor Author

jiangliu commented Jan 7, 2019

Hi,
Any suggestion about this failure? I think my changes shouldn't affect the boot time too much, and don't know how to fix this failure too.
=================================== FAILURES ===================================
__________________________ test_startup_time[ubuntu] ___________________________
integration_tests/performance/test_process_startup_time.py:49: in test_startup_time
assert cpu_startup_time_us <= MAX_STARTUP_TIME_CPU_US
E assert 8616 <= 8000

@acatangiu
Copy link
Contributor

acatangiu commented Jan 7, 2019

@jiangliu test_startup_time measures the time taken from the fork()/clone() into the jailer to the first instructions run inside firecracker. This test has some variability depending on the machine its being run on. It sometimes fails for my local setup too. Don't worry about it, it should work fine on the AWS i3.metal instance.

@andreeaflorescu
Copy link
Member

@acatangiu the test is failing as part of our CI. We should take a look.

@acatangiu
Copy link
Contributor

Oooopsie, now I see that it actually failed in our C.I.

@acatangiu
Copy link
Contributor

@jiangliu I think it was just a system load hiccup. Please rebase your changes against latest upstream master git pull git@github.com:firecracker-microvm/firecracker.git master --rebase and force push to update the PR. That should retry the C.I. and I'm expecting it to go through nicely.

acatangiu
acatangiu previously approved these changes Jan 7, 2019
Copy link
Contributor

@acatangiu acatangiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the RFC from the last commit summary. Code changes look good 👍 !

Copy link
Contributor

@dhrgit dhrgit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @jiangliu - thank you!

@andreeaflorescu andreeaflorescu merged commit 34bc630 into firecracker-microvm:master Jan 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants