
Virtio block flush (VIRTIO_BLK_T_FLUSH) not syncing data to host #2172

Closed
pclesr opened this issue Oct 12, 2020 · 18 comments · Fixed by #2447
Labels: Priority: High, Type: Bug

Comments


pclesr commented Oct 12, 2020

Describe the bug

It appears that a sync on an ext4 file in the guest does not result in the data on the host also being sync'd out to disk.

To Reproduce

Using the fio file I/O tester with the following job file:

[global]
rw=write
ioengine=sync
bs=32k
direct=0
size=128m
numjobs=1
fsync_on_close=1
loops=10

When running this on the host to establish a baseline, with the Linux perf utility recording call graphs, one sees many calls to sync on an ancient (Android) 4.14 kernel:

    5.50%     0.00%  fio              [kernel.kallsyms]     [k] sys_fsync
            |
            ---sys_fsync
               do_fsync
               vfs_fsync_range
               ext4_sync_file
               file_write_and_wait_range

perf is run with:
perf record -a --call-graph dwarf,16384 -F 200 -o /tmp/perf.data /tmp/fio /tmp/fio.job
The results are then examined with perf report and perf script.

When running the exact same fio and config in the VM, the bandwidth reported by fio is 4x that of the host baseline, indicating that the data is never actually sync'd to the medium. The page cache on the host is functioning as a huge disk cache. Running perf on the host during the execution of the fio test in the VM shows no calls to sync.

The test is run against a virtio disk (/dev/vdb) that is formatted as an ext4 disk on the host, and then specified in the list of devices in the VMs config.json and mounted in the guest.

If perf is run in the VM (with a 5.7 kernel), then perf shows calls both to sync and the ext4 write call in the VM kernel:

   20.56%     0.03%  fio              [kernel.kallsyms]  [k] new_sync_write
            |
             --20.54%--new_sync_write
                       ext4_file_write_iter

so I do not believe it is a problem with fio.

It does not seem to matter what the VM config is. Firecracker is master as of 31 August 2020.

Expected behaviour

I would have expected synchronous I/O in the VM to be slower than on the host, not faster.

Environment

Firecracker v0.21.0-473-gf2f6d8f8
Host: Android 4.14
Guest: 5.7 kernel.org
rootfs: Alpine
Arch: arm64
perf version 4.14.133.g2e3484a
fio-3.16-32-g8c302

Running on an Android system with a 2 core A57 with a 14gig eMMC.

[    0.490202] mmc0: new HS400 MMC card at address 0001
[    0.491261] mmcblk0: mmc0:0001 S0J56X 14.8 GiB

Additional context

This is an embedded application, and because of power loss issues, data that is synchronously written in the guest needs to be resident on media.

I am running baseline performance numbers for both disk and network so that my masters can see how good Firecracker is, as opposed to all the other VM alternatives. 👍

I am a C programmer and hardly know Rust. However, I wonder whether there is an issue in src/devices/src/virtio/block/request.rs, because the write is issued against a memory object, yet the flush is issued against a disk object?

In the guest:

 cat /proc/partitions
major minor  #blocks  name

 254        0      32768 vda
 254       16     614400 vdb
 254       32       9736 vdc


/ # mount /dev/vdb /bar
[   32.101662] EXT4-fs (vdb): recovery complete
[   32.104363] EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: (null)
/ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/root                27633     26978         0 100% /
devtmpfs                117700         0    117700   0% /dev
/dev/vdb                588352    177272    392648  31% /bar

Checks

  • [x] Have you searched the Firecracker Issues database for similar problems?
  • [x] Have you read the existing relevant Firecracker documentation?
  • [x] Are you certain the bug being reported is a Firecracker issue?
pclesr added the Type: Bug label on Oct 12, 2020
@sandreim (Contributor) commented:

Thanks for logging this @pclesr. We'll take a deeper look and let you know our findings.

sandreim added the Priority: High label on Oct 12, 2020

pclesr commented Oct 12, 2020

Metrics showing the flush requests being received:

  "block": {
    "activate_fails": 0,
    "cfg_fails": 0,
    "no_avail_buffer": 348,
    "event_fails": 0,
    "execute_fails": 0,
    "invalid_reqs_count": 0,
    "flush_count": 55,
    "queue_event_count": 4819,
    "rate_limiter_event_count": 0,
    "update_count": 0,
    "update_fails": 0,
    "read_bytes": 8795136,
    "write_bytes": 1342458880,
    "read_count": 1180,
    "write_count": 63419,
    "rate_limiter_throttled_events": 0
  },


bzld commented Oct 12, 2020

It seems calling file.flush() does not provide the expected guarantees, while file.sync_all() would tell the host system to really flush its disk caches; see the analogous ticket here:
cloud-hypervisor/cloud-hypervisor#399

@sandreim (Contributor) commented:

Thanks @bzld. I am not sure we need to call fsync() at all in the current block device emulation: since the file is opened with O_SYNC, each write returns only after the data has been sent to the underlying hardware, just like a write() + fsync().


bzld commented Oct 13, 2020

@sandreim Yes, O_SYNC would do the same trick - but is the disk file really opened that way? I can't see it here:

let mut disk_image = OpenOptions::new()
    .read(true)
    .write(!is_disk_read_only)
    .open(PathBuf::from(&disk_image_path))?;

(Also, using fsync() might be faster than O_SYNC because the guest VM can better control which writes must be flushed to disk and which need not. And fdatasync() (rust: sync_data()) might even be slightly faster than fsync())
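
To make the two approaches concrete, here is a minimal sketch (not Firecracker code; paths and error handling are simplified, and it assumes the libc crate) of opening the backing file with O_SYNC versus keeping the buffered open and syncing on demand:

    use std::fs::{File, OpenOptions};
    use std::os::unix::fs::OpenOptionsExt;

    // Option 1: O_SYNC makes every write(2) return only after the data has
    // reached the underlying device.
    fn open_o_sync(path: &str, read_only: bool) -> std::io::Result<File> {
        OpenOptions::new()
            .read(true)
            .write(!read_only)
            .custom_flags(libc::O_SYNC)
            .open(path)
    }

    // Option 2: keep the ordinary buffered open and sync only when the guest
    // issues VIRTIO_BLK_T_FLUSH.
    fn flush_on_demand(disk_image: &File) -> std::io::Result<()> {
        disk_image.sync_all() // fsync(2): flush data and metadata
        // disk_image.sync_data() would map to fdatasync(2): data only, slightly cheaper
    }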


pclesr commented Oct 14, 2020

Using the 'big hammer' approach does force the data out -- using fsync instead of flush, I can now see the sync calls being executed on the host, and the bandwidth reported by fio in the VM is much more in line with the underlying hardware.

diff --git a/src/devices/src/virtio/block/request.rs b/src/devices/src/virtio/block/request.rs
index 826eb97a..a546e472 100644
--- a/src/devices/src/virtio/block/request.rs
+++ b/src/devices/src/virtio/block/request.rs
@@ -6,7 +6,7 @@
 // found in the THIRD-PARTY file.

 use std::convert::From;
-use std::io::{self, Seek, SeekFrom, Write};
+use std::io::{self, Seek, SeekFrom};
 use std::mem;
 use std::result;

@@ -227,7 +227,7 @@ impl Request {
                 METRICS.block.write_bytes.add(self.data_len as usize);
                 METRICS.block.write_count.inc();
             }
-            RequestType::Flush => match diskfile.flush() {
+            RequestType::Flush => match diskfile.sync_all() {
                 Ok(_) => {
                     METRICS.block.flush_count.inc();
                     return Ok(0);

diff --git a/src/vmm/src/default_syscalls/filters.rs b/src/vmm/src/default_syscalls/filters.rs
index 1f654bb2..263263c7 100644
--- a/src/vmm/src/default_syscalls/filters.rs
+++ b/src/vmm/src/default_syscalls/filters.rs
@@ -109,6 +109,7 @@ pub fn default_filter() -> Result<SeccompFilter, Error> {
             allow_syscall(libc::SYS_timerfd_settime),
             allow_syscall(libc::SYS_write),
             allow_syscall(libc::SYS_writev),
+            allow_syscall(libc::SYS_fsync),
         ]
         .into_iter()
         .collect(),

Using perf on the host shows the following when the fio test is running in the VM. Before the change, I saw no sync calls on the host.

     2.22%     0.00%  firecracker  [kernel.kallsyms]  [k] sys_fsync
            |
            ---sys_fsync
               do_fsync
               vfs_fsync_range
               ext4_sync_file
               file_write_and_wait_range

@alindima (Contributor) commented:

Thanks everyone for following up on this.
After giving the issue a little bit more thought and investigating it a little further, here are the conclusions:

  • As @bzld pointed out, the backing file is indeed not opened in O_SYNC mode. Enabling this would work, but would likely have a serious performance impact, essentially bypassing all buffering mechanisms.
  • The supposedly "big hammer" approach explored by @pclesr actually seems to be the right solution. Calling the sync_all method on the File instance issues an fsync syscall, triggering the right behaviour with no additional impact.

@pclesr, would you be interested in submitting a PR to mainline with the solution?


pclesr commented Oct 14, 2020

Hello @alindima, yes, I can submit a PR. I am working on a slightly cleaner implementation that would keep the existing flush in place so that any buffered data would be sent to the page cache before the sync.

I don't know if that is really necessary, though. My knowledge of how the Rust library buffers data is non-existent.

I will have something ready in a day or two. Thank you very much for looking at this !

@acatangiu (Contributor) commented:

@pclesr I'm confident that your proposed patch is the complete and correct solution (in terms of the VIRTIO_BLK_T_FLUSH operation).

> I am working on a slightly cleaner implementation that would keep the existing flush in place so that any buffered data would be sent to the page cache before the sync

The data is already being sent to the page cache as part of the previous write()s. The .flush() currently present in our code is actually a no-op in the Rust std::fs::File implementation on Linux.

The correct function to call is, as you mentioned, sync_all().

> I will have something ready in a day or two. Thank you very much for looking at this !

We would prefer adding the fix mentioned above sooner rather than later; could you prioritize this, or should we open a PR with it?


pclesr commented Oct 14, 2020

@acatangiu -- let me run through the steps for contributing in tests/README and I will put together the pull request.

The reason it takes time is that I have to compile/test for my arm target in a QEMU cross environment (an arm64 VM running on x86), because my current employer doesn't provide me access to a spiffy AWS Annapurna VM. :-(

@acatangiu (Contributor) commented:

I don't see how/why this change would be in any way arch-specific. IMHO it would be fine to just post the PR and take advantage of our automated CI that also runs on aarch64.

You can add an integration test (see our tests dir) that fails without the fix and passes with it; you can develop it on x86_64, and that test will also run on aarch64 in our CI.

@acatangiu (Contributor) commented:

Also, thanks for the find and the fix! Good catch!


pclesr commented Oct 15, 2020

Created pull request #2185


pclesr commented Oct 15, 2020

The virtio device advertises VIRTIO_BLK_F_FLUSH, which tells the guest that it supports flushing the cache to the physical medium. Honoring those flushes slows down boot time, since the guest issues a number of them during boot (6 on a 5.7 arm64 guest), so the conclusion in PR #2185 was not to accept the update.

Removing VIRTIO_BLK_F_FLUSH from the features advertised by the device does in fact cause the guest not to submit any flushes (as determined by looking at the metrics while running the fio test). The cache type is still advertised as write through.

/sys/devices/platform/40000000.virtio_mmio/virtio0/block/vda # cat cache_type
write through
/sys/devices/platform/40000000.virtio_mmio/virtio0/block/vda # cat device/features
0000000000000000000000000000000010000000000000000000000000000000

To be strictly correct, the WCE bit should be enabled (since the data is cached on the underlying host), but then the guest has the option of changing the cache type (which is not supported). In addition, supporting the WCE bit requires supporting more than the first 8 bytes of the virtio_blk_config (see section 5.2.4 in https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html)

As @sandreim suggested in the PR, I would propose removing VIRTIO_BLK_F_FLUSH from the feature set; it is still a bit of a lie, since the cache shows up as write through, but at least guests won't be sending flush commands.

I would then propose offering a separate mode (maybe via a config option) that would set the VIRTIO_BLK_F_FLUSH bit for environments with more I/O, where data integrity between guest and host devices is more important.
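
For illustration only, a minimal sketch (the function name and feature layout are assumptions, not Firecracker's actual code) of how such a per-drive option could gate the feature bit:

    // Feature bit numbers as defined by the virtio spec.
    const VIRTIO_F_VERSION_1: u64 = 32;
    const VIRTIO_BLK_F_FLUSH: u64 = 9;

    // Build the advertised feature set for one block device; `want_flush`
    // mirrors the per-drive option proposed above.
    fn block_avail_features(want_flush: bool) -> u64 {
        let mut features = 1u64 << VIRTIO_F_VERSION_1;
        if want_flush {
            // Only environments that need guest-visible durability pay the
            // cost of honoring VIRTIO_BLK_T_FLUSH with an fsync(2).
            features |= 1u64 << VIRTIO_BLK_F_FLUSH;
        }
        features
    }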


bzld commented Oct 19, 2020

Removing the VIRTIO_BLK_F_FLUSH flag in the current implementation might be correct according to the virtio spec section 5.2.6.2 (https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-2520002):

> Note: An implementation that does not offer VIRTIO_BLK_F_FLUSH and does not commit completed writes will not be resilient to data loss in case of crashes. Not offering VIRTIO_BLK_F_FLUSH is an absolute requirement for implementations that do not wish to be safe against such data losses.

Also, offering VIRTIO_BLK_F_CONFIG_WCE is disallowed if VIRTIO_BLK_F_FLUSH cannot be offered (section 5.2.5.2):

> Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it if they offer VIRTIO_BLK_F_CONFIG_WCE.

So the simple, correct fix would be to just not offer VIRTIO_BLK_F_FLUSH.

The more flexible fix would be to have a configurable option to enable VIRTIO_BLK_F_FLUSH, together with the PR code that performs an fsync when requested.
(There also seem to be more complicated async implementations of fsync in Rust, which might make it feasible to perform the fsync in a way that blocks only the affected virtual block device rather than all other virtual devices on the emulation thread.)

@alindima (Contributor) commented:

I dug a little more to understand why the flush() method is a no-op.
This method is implemented as part of Rust's Write trait, which has two mandatory methods: write and flush.
The write method of the trait is not guaranteed to block waiting for data to be written. This means that there could theoretically be an additional layer of buffering on the type implementing Write.

Now, because the write() implementation of the File actually executes a write() syscall, the data is guaranteed to have reached the OS. Therefore, the guarantee implied by flush is always satisfied. This is why the flush implementation is empty.

The sync_all and sync_data methods are not part of the Write trait, and are guaranteed to block until the data reaches the physical device.

References:
rust-lang/rfcs#739 (comment)
https://doc.rust-lang.org/std/fs/struct.File.html#method.sync_all
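
To make the distinction concrete, a minimal standalone sketch (the path is just a placeholder) of which syscall each call maps to for std::fs::File on Linux:

    use std::fs::OpenOptions;
    use std::io::Write;

    fn main() -> std::io::Result<()> {
        let mut f = OpenOptions::new()
            .write(true)
            .create(true)
            .open("/tmp/flush_demo")?;
        f.write_all(b"hello")?; // write(2): data lands in the kernel page cache
        f.flush()?;             // no-op for File: there is no userspace buffer to drain
        f.sync_all()?;          // fsync(2): data and metadata forced to the storage device
        f.sync_data()?;         // fdatasync(2): data only; may skip some metadata updates
        Ok(())
    }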


pclesr commented Oct 27, 2020

I have created a pull request #2223 that keeps the existing behaviour (no flush) and adds a new drive option 'want_flush' for drives that need the coherency and are willing to take the performance hit.

@georgepisaltu (Contributor) commented:

Hi @pclesr,

We decided to implement two caching mechanisms: unsafe (no flushing mechanism) and writeback (on-demand flushing), which is very similar to your PR.

The unsafe option will still advertise the flush feature, but the VIRTIO_BLK_T_FLUSH request will remain a no-op, which is the same behavior as now.

The writeback option will perform the actual flush through file.sync_all (or a similar method that uses fsync) instead of the no-op from the unsafe option.

This should be configurable at the API level when attaching a block device, through a cache-type option defaulting to the unsafe method.
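
A rough sketch of how the two modes could be modelled (the CacheType name, its variants, and handle_flush are illustrative, not necessarily the final Firecracker API):

    use std::fs::File;
    use std::io;

    // Hypothetical modelling of the two cache modes described above.
    #[derive(Debug, Clone, Copy, PartialEq)]
    enum CacheType {
        // VIRTIO_BLK_T_FLUSH is accepted but stays a no-op (today's behavior).
        Unsafe,
        // VIRTIO_BLK_T_FLUSH triggers fsync(2) on the backing file via File::sync_all().
        Writeback,
    }

    impl Default for CacheType {
        fn default() -> Self {
            CacheType::Unsafe
        }
    }

    fn handle_flush(disk_image: &File, cache_type: CacheType) -> io::Result<()> {
        match cache_type {
            CacheType::Unsafe => Ok(()),                   // keep the no-op semantics
            CacheType::Writeback => disk_image.sync_all(), // force the data out to the device
        }
    }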
