Major performance change with vm-memory bump #1258

Closed
rbradford opened this issue Jun 1, 2020 · 14 comments · Fixed by #1297 or rust-vmm/vm-memory#117

@rbradford
Member

For debug builds (after/before): 8% of previous throughput
For release builds (after/before): 56% of previous throughput

Before (debug):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44178 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.33 GBytes  11.4 Gbits/sec    0   3.14 MBytes       
[  5]   1.00-2.00   sec  1.36 GBytes  11.7 Gbits/sec    0   3.14 MBytes       
[  5]   2.00-3.00   sec  1.36 GBytes  11.7 Gbits/sec    0   3.14 MBytes       
[  5]   3.00-4.00   sec  1.31 GBytes  11.2 Gbits/sec    0   3.14 MBytes       
[  5]   4.00-5.00   sec  1.05 GBytes  9.05 Gbits/sec    0   3.14 MBytes       
[  5]   5.00-6.00   sec  1.29 GBytes  11.0 Gbits/sec    0   3.14 MBytes       
[  5]   6.00-7.00   sec  1.33 GBytes  11.4 Gbits/sec    0   3.14 MBytes       
[  5]   7.00-8.00   sec  1.25 GBytes  10.8 Gbits/sec    0   3.14 MBytes       
[  5]   8.00-9.00   sec  1.16 GBytes  9.94 Gbits/sec    0   3.14 MBytes       
[  5]   9.00-10.00  sec  1.16 GBytes  9.94 Gbits/sec    0   3.14 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.6 GBytes  10.8 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  12.6 GBytes  10.8 Gbits/sec                  receiver

iperf Done.

After (debug):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44192 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   105 MBytes   878 Mbits/sec    0   3.16 MBytes       
[  5]   1.00-2.00   sec   100 MBytes   839 Mbits/sec    0   3.16 MBytes       
[  5]   2.00-3.00   sec   101 MBytes   849 Mbits/sec    0   3.16 MBytes       
[  5]   3.00-4.00   sec   104 MBytes   870 Mbits/sec    0   3.16 MBytes       
[  5]   4.00-5.00   sec   102 MBytes   860 Mbits/sec    0   3.16 MBytes       
[  5]   5.00-6.00   sec   105 MBytes   881 Mbits/sec    0   3.16 MBytes       
[  5]   6.00-7.00   sec   102 MBytes   860 Mbits/sec    0   3.16 MBytes       
[  5]   7.00-8.00   sec   104 MBytes   870 Mbits/sec    0   3.16 MBytes       
[  5]   8.00-9.00   sec   102 MBytes   860 Mbits/sec    0   3.16 MBytes       
[  5]   9.00-10.00  sec   104 MBytes   870 Mbits/sec    0   3.16 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.01 GBytes   864 Mbits/sec    0             sender
[  5]   0.00-10.02  sec  1.00 GBytes   861 Mbits/sec                  receiver

iperf Done.

Before (release):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44224 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  6.18 GBytes  53.1 Gbits/sec    0   3.11 MBytes       
[  5]   1.00-2.00   sec  6.14 GBytes  52.7 Gbits/sec    1   3.11 MBytes       
[  5]   2.00-3.00   sec  6.35 GBytes  54.5 Gbits/sec    0   3.11 MBytes       
[  5]   3.00-4.00   sec  6.36 GBytes  54.6 Gbits/sec    0   3.11 MBytes       
[  5]   4.00-5.00   sec  6.16 GBytes  52.9 Gbits/sec    0   3.11 MBytes       
[  5]   5.00-6.00   sec  4.92 GBytes  42.3 Gbits/sec    0   3.11 MBytes       
[  5]   6.00-7.00   sec  5.31 GBytes  45.6 Gbits/sec    0   3.11 MBytes       
[  5]   7.00-8.00   sec  6.27 GBytes  53.9 Gbits/sec    0   3.11 MBytes       
[  5]   8.00-9.00   sec  6.22 GBytes  53.4 Gbits/sec    0   3.11 MBytes       
[  5]   9.00-10.00  sec  6.33 GBytes  54.4 Gbits/sec    0   3.11 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  60.2 GBytes  51.7 Gbits/sec    1             sender
[  5]   0.00-10.00  sec  60.2 GBytes  51.7 Gbits/sec                  receiver


After (release):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44208 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.26 GBytes  28.0 Gbits/sec    0   3.16 MBytes       
[  5]   1.00-2.00   sec  3.34 GBytes  28.7 Gbits/sec    0   3.16 MBytes       
[  5]   2.00-3.00   sec  3.39 GBytes  29.2 Gbits/sec    0   3.16 MBytes       
[  5]   3.00-4.00   sec  3.36 GBytes  28.9 Gbits/sec    0   3.16 MBytes       
[  5]   4.00-5.00   sec  3.41 GBytes  29.3 Gbits/sec    1   3.16 MBytes       
[  5]   5.00-6.00   sec  3.31 GBytes  28.4 Gbits/sec    0   3.16 MBytes       
[  5]   6.00-7.00   sec  3.45 GBytes  29.7 Gbits/sec    0   3.16 MBytes       
[  5]   7.00-8.00   sec  3.43 GBytes  29.5 Gbits/sec    0   3.16 MBytes       
[  5]   8.00-9.00   sec  3.43 GBytes  29.5 Gbits/sec    0   3.16 MBytes       
[  5]   9.00-10.00  sec  3.44 GBytes  29.5 Gbits/sec    0   3.16 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  33.8 GBytes  29.1 Gbits/sec    1             sender
[  5]   0.00-10.00  sec  33.8 GBytes  29.1 Gbits/sec                  receiver

iperf Done.
@andreeaflorescu

@rbradford can you give a bit more detail on how you're running these tests? What do you think about creating a new issue in the rust-vmm repository as well?

@rbradford
Member Author

@andreeaflorescu it's iperf3 with default settings and default virtio-net settings

@sboeuf
Member

sboeuf commented Jun 2, 2020

@rbradford are you sure this is related to vm-memory? Have you identified the commit which introduced the regression?

@rbradford
Member Author

Yes, the "before" is the parent of the vm-memory bump commit and the "after" is the bump commit itself.

@rbradford
Member Author

@sboeuf this change (rust-vmm/vm-memory#94) in vm-memory moved away from using the SSE2/AVX2-optimised glibc memcpy() routine, so this regression was not unexpected; what we need to do is work out how to mitigate it.
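
(For context, a rough sketch of the two copy strategies. The function names here are made up and the real vm-memory change copies in aligned chunks rather than byte by byte, so this is illustrative only.)

use std::ptr::{copy_nonoverlapping, read_volatile, write_volatile};

// Old path: one bulk copy that the compiler lowers to memcpy, which glibc
// implements with SSE2/AVX2-optimised routines.
unsafe fn copy_via_memcpy(dst: *mut u8, src: *const u8, len: usize) {
    copy_nonoverlapping(src, dst, len);
}

// New path: element-wise volatile accesses that the compiler will not
// vectorise, trading throughput for well-defined access widths.
unsafe fn copy_via_volatile(dst: *mut u8, src: *const u8, len: usize) {
    for i in 0..len {
        write_volatile(dst.add(i), read_volatile(src.add(i)));
    }
}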

@sboeuf
Member

sboeuf commented Jun 2, 2020

@andreeaflorescu @bonzini do you think you can gate the commit rust-vmm/vm-memory@d0aaccc on musl only? That looks like the only appropriate way of handling this issue.
I understand we need to fix the coherency for musl, otherwise there's a real bug, but glibc is fine without this change, and we can't afford such a performance drop across all our virtio devices.
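
(A minimal sketch of that suggestion, not the actual vm-memory patch; copy_slice_gated is a hypothetical name.)

#[cfg(target_env = "musl")]
unsafe fn copy_slice_gated(dst: *mut u8, src: *const u8, len: usize) {
    // musl: keep the volatile, element-wise copy that fixes the coherency bug.
    for i in 0..len {
        std::ptr::write_volatile(dst.add(i), std::ptr::read_volatile(src.add(i)));
    }
}

#[cfg(not(target_env = "musl"))]
unsafe fn copy_slice_gated(dst: *mut u8, src: *const u8, len: usize) {
    // glibc and other targets: keep the optimised memcpy path.
    std::ptr::copy_nonoverlapping(src, dst, len);
}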

@bonzini

bonzini commented Jun 2, 2020

Are you doing memcpy into the buffers, instead of doing direct read/write into memory?

You could also try adding SSE support to vm-memory.

@andreeaflorescu

Should we discuss possible fixes in vm-memory?
I opened an issue and we can continue the discussion there to give everyone visibility into next steps/problems/stuff in general :))

rust-vmm/vm-memory#100

@sboeuf
Member

sboeuf commented Jun 3, 2020

@bonzini
I'm genuinely curious how SSE can be supported at the vm-memory level. How can we make sure we actually use SSE instructions?

@bonzini

bonzini commented Jun 3, 2020

@sboeuf There is std::arch::x86_64::__m128i; I haven't checked whether ptr::read and ptr::write on it compile down to single SSE loads and stores.
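
(A minimal sketch of that idea; copy_16 is a hypothetical helper, the codegen is unverified, and both pointers must be 16-byte aligned.)

#[cfg(target_arch = "x86_64")]
unsafe fn copy_16(dst: *mut u8, src: *const u8) {
    use std::arch::x86_64::__m128i;
    use std::ptr::{read_volatile, write_volatile};

    // One 16-byte load followed by one 16-byte store; whether this lowers
    // to a single movdqa/movdqu pair is up to the compiler.
    let v = read_volatile(src as *const __m128i);
    write_volatile(dst as *mut __m128i, v);
}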

@bonzini

bonzini commented Jun 3, 2020

Can you try benchmarking something like this:

diff --git a/src/volatile_memory.rs b/src/volatile_memory.rs
index 7c8aa1a..68518fe 100644
--- a/src/volatile_memory.rs
+++ b/src/volatile_memory.rs
@@ -477,6 +477,12 @@ fn alignment(addr: usize) -> usize {
 //   we're only using integer primitives.
 unsafe fn copy_single(align: usize, src_addr: usize, dst_addr: usize) {
     match align {
+        16 => {
+            #[cfg(target_arch = "x86_64")] {
+                type Vec128 = std::arch::x86_64::__m256i;
+                write_volatile(dst_addr as *mut Vec128, read_volatile(src_addr as *const Vec128));
+            }
+        }
         8 => write_volatile(dst_addr as *mut u64, read_volatile(src_addr as *const u64)),
         4 => write_volatile(dst_addr as *mut u32, read_volatile(src_addr as *const u32)),
         2 => write_volatile(dst_addr as *mut u16, read_volatile(src_addr as *const u16)),
@@ -504,6 +510,8 @@ fn copy_slice(dst: &mut [u8], src: &[u8]) -> usize {
         }
     };
 
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(16);
     if size_of::<usize>() > 4 {
         copy_aligned_slice(8);
     }

@rbradford
Member Author

rbradford commented Jun 4, 2020

Can you try benchmarking something like this:

diff --git a/src/volatile_memory.rs b/src/volatile_memory.rs
index 7c8aa1a..68518fe 100644
--- a/src/volatile_memory.rs
+++ b/src/volatile_memory.rs
@@ -477,6 +477,12 @@ fn alignment(addr: usize) -> usize {
 //   we're only using integer primitives.
 unsafe fn copy_single(align: usize, src_addr: usize, dst_addr: usize) {
     match align {
+        16 => {
+            #[cfg(target_arch = "x86_64")] {
+                type Vec128 = std::arch::x86_64::__m256i;
+                write_volatile(dst_addr as *mut Vec128, read_volatile(src_addr as *const Vec128));
+            }
+        }
         8 => write_volatile(dst_addr as *mut u64, read_volatile(src_addr as *const u64)),
         4 => write_volatile(dst_addr as *mut u32, read_volatile(src_addr as *const u32)),
         2 => write_volatile(dst_addr as *mut u16, read_volatile(src_addr as *const u16)),
@@ -504,6 +510,8 @@ fn copy_slice(dst: &mut [u8], src: &[u8]) -> usize {
         }
     };
 
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(16);
     if size_of::<usize>() > 4 {
         copy_aligned_slice(8);
     }

This caused a segfault as-is, but I made some changes and added "support" for AVX2 (no evidence that Rust is actually emitting AVX2 or SSE2 instructions here):

diff --git a/src/volatile_memory.rs b/src/volatile_memory.rs
index 9e9901f..8129e91 100644
--- a/src/volatile_memory.rs
+++ b/src/volatile_memory.rs
@@ -477,6 +477,21 @@ fn alignment(addr: usize) -> usize {
//   we're only using integer primitives.
 unsafe fn copy_single(align: usize, src_addr: usize, dst_addr: usize) {
     match align {
+        #[cfg(target_arch = "x86_64")]
+        32 => {
+            type Vec256 = std::arch::x86_64::__m256i;
+            write_volatile(
+                dst_addr as *mut Vec256,
+                read_volatile(src_addr as *const Vec256),
+            );
+        }
+        16 => {
+            type Vec128 = std::arch::x86_64::__m128i;
+            write_volatile(
+                dst_addr as *mut Vec128,
+                read_volatile(src_addr as *const Vec128),
+            );
+        }
         8 => write_volatile(dst_addr as *mut u64, read_volatile(src_addr as *const u64)),
         4 => write_volatile(dst_addr as *mut u32, read_volatile(src_addr as *const u32)),
         2 => write_volatile(dst_addr as *mut u16, read_volatile(src_addr as *const u16)),
@@ -504,6 +519,10 @@ fn copy_slice(dst: &mut [u8], src: &[u8]) -> usize {
         }
     };
 
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(32);
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(16);
     if size_of::<usize>() > 4 {
         copy_aligned_slice(8);
     }

The difference was very small:

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 60124 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.47 GBytes  29.8 Gbits/sec    0   3.03 MBytes       
[  5]   1.00-2.00   sec  3.55 GBytes  30.5 Gbits/sec    0   3.03 MBytes       
[  5]   2.00-3.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   3.00-4.00   sec  3.53 GBytes  30.3 Gbits/sec    0   3.03 MBytes       
[  5]   4.00-5.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   5.00-6.00   sec  3.50 GBytes  30.0 Gbits/sec    0   3.03 MBytes       
[  5]   6.00-7.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   7.00-8.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   8.00-9.00   sec  3.53 GBytes  30.3 Gbits/sec    0   3.03 MBytes       
[  5]   9.00-10.00  sec  3.53 GBytes  30.3 Gbits/sec    0   3.03 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  35.3 GBytes  30.3 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  35.3 GBytes  30.3 Gbits/sec                  receiver

iperf Done.

@bonzini

bonzini commented Jun 4, 2020

How does the profile compare before the vm-memory changes vs. now?

@bonzini

bonzini commented Jun 4, 2020

Another useful optimization could be loop unrolling, possibly only for the 16- or 32-byte version.
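
(A sketch of what that unrolling could look like; hypothetical, x86_64 only, assumes 16-byte-aligned pointers, and the sub-32-byte tail would fall through to the narrower copy_single cases.)

#[cfg(target_arch = "x86_64")]
unsafe fn copy_unrolled(mut dst: *mut u8, mut src: *const u8, mut len: usize) {
    use std::arch::x86_64::__m128i;
    use std::ptr::{read_volatile, write_volatile};

    // Two 16-byte lanes per iteration so loop overhead is amortised over 32 bytes.
    while len >= 32 {
        let a = read_volatile(src as *const __m128i);
        let b = read_volatile(src.add(16) as *const __m128i);
        write_volatile(dst as *mut __m128i, a);
        write_volatile(dst.add(16) as *mut __m128i, b);
        src = src.add(32);
        dst = dst.add(32);
        len -= 32;
    }
    // Remaining len < 32 bytes: handled by the existing 16/8/4/2/1 cases.
}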

@rbradford rbradford added this to To do in Release 0.8.0 via automation Jun 8, 2020
rbradford added a commit to rbradford/vm-memory that referenced this issue Jun 9, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
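
(For reference, a rough sketch of the approach this commit message describes, not the exact patch; alignment checks are elided and copy_slice_sketch is a hypothetical name.)

use std::mem::size_of;
use std::ptr::{copy_nonoverlapping, read_volatile, write_volatile};

unsafe fn copy_slice_sketch(dst: *mut u8, src: *const u8, len: usize) {
    if len <= size_of::<usize>() && len.is_power_of_two() {
        // Small, naturally sized objects: single volatile accesses so that
        // virtio structures see atomic writes.
        match len {
            8 => write_volatile(dst as *mut u64, read_volatile(src as *const u64)),
            4 => write_volatile(dst as *mut u32, read_volatile(src as *const u32)),
            2 => write_volatile(dst as *mut u16, read_volatile(src as *const u16)),
            _ => write_volatile(dst, read_volatile(src)),
        }
    } else {
        // Bulk data: fall back to the ordinary (fast) memcpy path.
        copy_nonoverlapping(src, dst, len);
    }
}
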
rbradford added a commit to rbradford/cloud-hypervisor that referenced this issue Jun 9, 2020
Currently released vm-memory uses aligned and volatile copying for all
data. The version in the fork only uses the assured (and slower) path
for data up to the natural data width.

Fixes: cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Release 0.8.0 automation moved this from To do to Done Jun 9, 2020
sboeuf pushed a commit that referenced this issue Jun 9, 2020
Currently released vm-memory uses aligned and volatile copying for all
data. The version in the fork only uses the assured (and slower) path
for data up to the natural data width.

Fixes: #1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to cloud-hypervisor/vm-memory that referenced this issue Aug 6, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/vm-memory that referenced this issue Oct 2, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/vm-memory that referenced this issue Oct 2, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258
Fixes: rust-vmm#100

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/cloud-hypervisor that referenced this issue Oct 2, 2020
Update the version number used to point to the latest version but
continue to use our patched version due to the fix for cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/cloud-hypervisor that referenced this issue Oct 2, 2020
A new version of vm-memory was released upstream which resulted in some
components pulling in that new version. Update the version number used
to point to the latest version but continue to use our patched version
due to the fix for cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/vm-memory that referenced this issue Oct 2, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258
Fixes: rust-vmm#100

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit that referenced this issue Oct 2, 2020
A new version of vm-memory was released upstream which resulted in some
components pulling in that new version. Update the version number used
to point to the latest version but continue to use our patched version
due to the fix for #1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
alexandruag pushed a commit to rust-vmm/vm-memory that referenced this issue Nov 12, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258
Fixes: #100

Signed-off-by: Rob Bradford <robert.bradford@intel.com>