Major performance change with vm-memory bump #1258

Closed
rbradford opened this issue Jun 1, 2020 · 14 comments · Fixed by #1297 or rust-vmm/vm-memory#117

@rbradford
Member

For debug builds (after/before): 8% of previous throughput
For release builds (after/before): 56% of previous throughput

Before (debug):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44178 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.33 GBytes  11.4 Gbits/sec    0   3.14 MBytes       
[  5]   1.00-2.00   sec  1.36 GBytes  11.7 Gbits/sec    0   3.14 MBytes       
[  5]   2.00-3.00   sec  1.36 GBytes  11.7 Gbits/sec    0   3.14 MBytes       
[  5]   3.00-4.00   sec  1.31 GBytes  11.2 Gbits/sec    0   3.14 MBytes       
[  5]   4.00-5.00   sec  1.05 GBytes  9.05 Gbits/sec    0   3.14 MBytes       
[  5]   5.00-6.00   sec  1.29 GBytes  11.0 Gbits/sec    0   3.14 MBytes       
[  5]   6.00-7.00   sec  1.33 GBytes  11.4 Gbits/sec    0   3.14 MBytes       
[  5]   7.00-8.00   sec  1.25 GBytes  10.8 Gbits/sec    0   3.14 MBytes       
[  5]   8.00-9.00   sec  1.16 GBytes  9.94 Gbits/sec    0   3.14 MBytes       
[  5]   9.00-10.00  sec  1.16 GBytes  9.94 Gbits/sec    0   3.14 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.6 GBytes  10.8 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  12.6 GBytes  10.8 Gbits/sec                  receiver

iperf Done.

After (debug):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44192 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   105 MBytes   878 Mbits/sec    0   3.16 MBytes       
[  5]   1.00-2.00   sec   100 MBytes   839 Mbits/sec    0   3.16 MBytes       
[  5]   2.00-3.00   sec   101 MBytes   849 Mbits/sec    0   3.16 MBytes       
[  5]   3.00-4.00   sec   104 MBytes   870 Mbits/sec    0   3.16 MBytes       
[  5]   4.00-5.00   sec   102 MBytes   860 Mbits/sec    0   3.16 MBytes       
[  5]   5.00-6.00   sec   105 MBytes   881 Mbits/sec    0   3.16 MBytes       
[  5]   6.00-7.00   sec   102 MBytes   860 Mbits/sec    0   3.16 MBytes       
[  5]   7.00-8.00   sec   104 MBytes   870 Mbits/sec    0   3.16 MBytes       
[  5]   8.00-9.00   sec   102 MBytes   860 Mbits/sec    0   3.16 MBytes       
[  5]   9.00-10.00  sec   104 MBytes   870 Mbits/sec    0   3.16 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.01 GBytes   864 Mbits/sec    0             sender
[  5]   0.00-10.02  sec  1.00 GBytes   861 Mbits/sec                  receiver

iperf Done.

Before (release):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44224 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  6.18 GBytes  53.1 Gbits/sec    0   3.11 MBytes       
[  5]   1.00-2.00   sec  6.14 GBytes  52.7 Gbits/sec    1   3.11 MBytes       
[  5]   2.00-3.00   sec  6.35 GBytes  54.5 Gbits/sec    0   3.11 MBytes       
[  5]   3.00-4.00   sec  6.36 GBytes  54.6 Gbits/sec    0   3.11 MBytes       
[  5]   4.00-5.00   sec  6.16 GBytes  52.9 Gbits/sec    0   3.11 MBytes       
[  5]   5.00-6.00   sec  4.92 GBytes  42.3 Gbits/sec    0   3.11 MBytes       
[  5]   6.00-7.00   sec  5.31 GBytes  45.6 Gbits/sec    0   3.11 MBytes       
[  5]   7.00-8.00   sec  6.27 GBytes  53.9 Gbits/sec    0   3.11 MBytes       
[  5]   8.00-9.00   sec  6.22 GBytes  53.4 Gbits/sec    0   3.11 MBytes       
[  5]   9.00-10.00  sec  6.33 GBytes  54.4 Gbits/sec    0   3.11 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  60.2 GBytes  51.7 Gbits/sec    1             sender
[  5]   0.00-10.00  sec  60.2 GBytes  51.7 Gbits/sec                  receiver


After (release):

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 44208 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.26 GBytes  28.0 Gbits/sec    0   3.16 MBytes       
[  5]   1.00-2.00   sec  3.34 GBytes  28.7 Gbits/sec    0   3.16 MBytes       
[  5]   2.00-3.00   sec  3.39 GBytes  29.2 Gbits/sec    0   3.16 MBytes       
[  5]   3.00-4.00   sec  3.36 GBytes  28.9 Gbits/sec    0   3.16 MBytes       
[  5]   4.00-5.00   sec  3.41 GBytes  29.3 Gbits/sec    1   3.16 MBytes       
[  5]   5.00-6.00   sec  3.31 GBytes  28.4 Gbits/sec    0   3.16 MBytes       
[  5]   6.00-7.00   sec  3.45 GBytes  29.7 Gbits/sec    0   3.16 MBytes       
[  5]   7.00-8.00   sec  3.43 GBytes  29.5 Gbits/sec    0   3.16 MBytes       
[  5]   8.00-9.00   sec  3.43 GBytes  29.5 Gbits/sec    0   3.16 MBytes       
[  5]   9.00-10.00  sec  3.44 GBytes  29.5 Gbits/sec    0   3.16 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  33.8 GBytes  29.1 Gbits/sec    1             sender
[  5]   0.00-10.00  sec  33.8 GBytes  29.1 Gbits/sec                  receiver

iperf Done.
@andreeaflorescu

@rbradford can you give a bit more detail on how you're running these tests? What do you think about creating a new issue in the rust-vmm repository as well?

@rbradford
Member Author

@andreeaflorescu it's iperf3 with default settings and default virtio-net settings

@sboeuf
Member

sboeuf commented Jun 2, 2020

@rbradford are you sure this is related to vm-memory? Have you identified the commit which introduced the regression?

@rbradford
Member Author

Yes, the "before" is the parent of the vm-memory bump commit and the "after" is the bump commit itself.

@rbradford
Member Author

@sboeuf this change (rust-vmm/vm-memory#94) in vm-memory moved away from using the SSE2/AVX2-optimised glibc memcpy() routine, so this regression was not unexpected; what we need to do is work out how to mitigate it.
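
(For context, a rough sketch of the two copy strategies. The function names here are made up and the real vm-memory change copies in aligned chunks rather than byte by byte, so this is illustrative only.)

use std::ptr::{copy_nonoverlapping, read_volatile, write_volatile};

// Old path: one bulk copy that the compiler lowers to memcpy, which glibc
// implements with SSE2/AVX2-optimised routines.
unsafe fn copy_via_memcpy(dst: *mut u8, src: *const u8, len: usize) {
    copy_nonoverlapping(src, dst, len);
}

// New path: element-wise volatile accesses that the compiler will not
// vectorise, trading throughput for well-defined access widths.
unsafe fn copy_via_volatile(dst: *mut u8, src: *const u8, len: usize) {
    for i in 0..len {
        write_volatile(dst.add(i), read_volatile(src.add(i)));
    }
}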

@sboeuf
Member

sboeuf commented Jun 2, 2020

@andreeaflorescu @bonzini do you think you can gate the commit rust-vmm/vm-memory@d0aaccc on musl only? That looks like the only appropriate way of handling this issue.
I understand we need to fix the coherency for musl, otherwise there's a real bug, but glibc is fine without this change, and we can't afford such a performance drop across all our virtio devices.
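
(A minimal sketch of that suggestion, not the actual vm-memory patch; copy_slice_gated is a hypothetical name.)

#[cfg(target_env = "musl")]
unsafe fn copy_slice_gated(dst: *mut u8, src: *const u8, len: usize) {
    // musl: keep the volatile, element-wise copy that fixes the coherency bug.
    for i in 0..len {
        std::ptr::write_volatile(dst.add(i), std::ptr::read_volatile(src.add(i)));
    }
}

#[cfg(not(target_env = "musl"))]
unsafe fn copy_slice_gated(dst: *mut u8, src: *const u8, len: usize) {
    // glibc and other targets: keep the optimised memcpy path.
    std::ptr::copy_nonoverlapping(src, dst, len);
}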

@bonzini

bonzini commented Jun 2, 2020

Are you doing memcpy into the buffers, instead of doing direct read/write into memory?

You could also try adding SSE support to vm-memory.

@andreeaflorescu

Should we discuss possible fixes in vm-memory?
I opened an issue and we can continue the discussion there to give everyone visibility into next steps/problems/stuff in general :))

rust-vmm/vm-memory#100

@sboeuf
Member

sboeuf commented Jun 3, 2020

@bonzini
I'm genuinely curious how SSE can be supported at the vm-memory level. How can we make sure we actually use SSE instructions?

@bonzini

bonzini commented Jun 3, 2020

@sboeuf There is std::arch::x86_64::__m128i; I haven't checked whether ptr::read and ptr::write on it compile down to single SSE loads and stores.
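
(A minimal sketch of that idea; copy_16 is a hypothetical helper, the codegen is unverified, and both pointers must be 16-byte aligned.)

#[cfg(target_arch = "x86_64")]
unsafe fn copy_16(dst: *mut u8, src: *const u8) {
    use std::arch::x86_64::__m128i;
    use std::ptr::{read_volatile, write_volatile};

    // One 16-byte load followed by one 16-byte store; whether this lowers
    // to a single movdqa/movdqu pair is up to the compiler.
    let v = read_volatile(src as *const __m128i);
    write_volatile(dst as *mut __m128i, v);
}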

@bonzini

bonzini commented Jun 3, 2020

Can you try benchmarking something like this:

diff --git a/src/volatile_memory.rs b/src/volatile_memory.rs
index 7c8aa1a..68518fe 100644
--- a/src/volatile_memory.rs
+++ b/src/volatile_memory.rs
@@ -477,6 +477,12 @@ fn alignment(addr: usize) -> usize {
 //   we're only using integer primitives.
 unsafe fn copy_single(align: usize, src_addr: usize, dst_addr: usize) {
     match align {
+        16 => {
+            #[cfg(target_arch = "x86_64")] {
+                type Vec128 = std::arch::x86_64::__m256i;
+                write_volatile(dst_addr as *mut Vec128, read_volatile(src_addr as *const Vec128));
+            }
+        }
         8 => write_volatile(dst_addr as *mut u64, read_volatile(src_addr as *const u64)),
         4 => write_volatile(dst_addr as *mut u32, read_volatile(src_addr as *const u32)),
         2 => write_volatile(dst_addr as *mut u16, read_volatile(src_addr as *const u16)),
@@ -504,6 +510,8 @@ fn copy_slice(dst: &mut [u8], src: &[u8]) -> usize {
         }
     };
 
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(16);
     if size_of::<usize>() > 4 {
         copy_aligned_slice(8);
     }

@rbradford
Member Author

rbradford commented Jun 4, 2020

Can you try benchmarking something like this:

diff --git a/src/volatile_memory.rs b/src/volatile_memory.rs
index 7c8aa1a..68518fe 100644
--- a/src/volatile_memory.rs
+++ b/src/volatile_memory.rs
@@ -477,6 +477,12 @@ fn alignment(addr: usize) -> usize {
 //   we're only using integer primitives.
 unsafe fn copy_single(align: usize, src_addr: usize, dst_addr: usize) {
     match align {
+        16 => {
+            #[cfg(target_arch = "x86_64")] {
+                type Vec128 = std::arch::x86_64::__m256i;
+                write_volatile(dst_addr as *mut Vec128, read_volatile(src_addr as *const Vec128));
+            }
+        }
         8 => write_volatile(dst_addr as *mut u64, read_volatile(src_addr as *const u64)),
         4 => write_volatile(dst_addr as *mut u32, read_volatile(src_addr as *const u32)),
         2 => write_volatile(dst_addr as *mut u16, read_volatile(src_addr as *const u16)),
@@ -504,6 +510,8 @@ fn copy_slice(dst: &mut [u8], src: &[u8]) -> usize {
         }
     };
 
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(16);
     if size_of::<usize>() > 4 {
         copy_aligned_slice(8);
     }

This caused a segfault as-is, but I made some changes and added "support" for AVX2 (no evidence that Rust is actually emitting AVX2 or SSE2 instructions here):

diff --git a/src/volatile_memory.rs b/src/volatile_memory.rs
index 9e9901f..8129e91 100644
--- a/src/volatile_memory.rs
+++ b/src/volatile_memory.rs
@@ -477,6 +477,21 @@ fn alignment(addr: usize) -> usize {
//   we're only using integer primitives.
 unsafe fn copy_single(align: usize, src_addr: usize, dst_addr: usize) {
     match align {
+        #[cfg(target_arch = "x86_64")]
+        32 => {
+            type Vec256 = std::arch::x86_64::__m256i;
+            write_volatile(
+                dst_addr as *mut Vec256,
+                read_volatile(src_addr as *const Vec256),
+            );
+        }
+        16 => {
+            type Vec128 = std::arch::x86_64::__m128i;
+            write_volatile(
+                dst_addr as *mut Vec128,
+                read_volatile(src_addr as *const Vec128),
+            );
+        }
         8 => write_volatile(dst_addr as *mut u64, read_volatile(src_addr as *const u64)),
         4 => write_volatile(dst_addr as *mut u32, read_volatile(src_addr as *const u32)),
         2 => write_volatile(dst_addr as *mut u16, read_volatile(src_addr as *const u16)),
@@ -504,6 +519,10 @@ fn copy_slice(dst: &mut [u8], src: &[u8]) -> usize {
         }
     };
 
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(32);
+    #[cfg(target_arch = "x86_64")]
+    copy_aligned_slice(16);
     if size_of::<usize>() > 4 {
         copy_aligned_slice(8);
     }

The difference was very small:

Connecting to host 192.168.249.2, port 5201
[  5] local 192.168.249.1 port 60124 connected to 192.168.249.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.47 GBytes  29.8 Gbits/sec    0   3.03 MBytes       
[  5]   1.00-2.00   sec  3.55 GBytes  30.5 Gbits/sec    0   3.03 MBytes       
[  5]   2.00-3.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   3.00-4.00   sec  3.53 GBytes  30.3 Gbits/sec    0   3.03 MBytes       
[  5]   4.00-5.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   5.00-6.00   sec  3.50 GBytes  30.0 Gbits/sec    0   3.03 MBytes       
[  5]   6.00-7.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   7.00-8.00   sec  3.54 GBytes  30.4 Gbits/sec    0   3.03 MBytes       
[  5]   8.00-9.00   sec  3.53 GBytes  30.3 Gbits/sec    0   3.03 MBytes       
[  5]   9.00-10.00  sec  3.53 GBytes  30.3 Gbits/sec    0   3.03 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  35.3 GBytes  30.3 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  35.3 GBytes  30.3 Gbits/sec                  receiver

iperf Done.

@bonzini

bonzini commented Jun 4, 2020

How does the profile compare before the vm-memory changes vs. now?

@bonzini

bonzini commented Jun 4, 2020

Another useful optimization could be loop unrolling, possibly only for the 16- or 32-byte version.
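
(A sketch of what that unrolling could look like; hypothetical, x86_64 only, assumes 16-byte-aligned pointers, and the sub-32-byte tail would fall through to the narrower copy_single cases.)

#[cfg(target_arch = "x86_64")]
unsafe fn copy_unrolled(mut dst: *mut u8, mut src: *const u8, mut len: usize) {
    use std::arch::x86_64::__m128i;
    use std::ptr::{read_volatile, write_volatile};

    // Two 16-byte lanes per iteration so loop overhead is amortised over 32 bytes.
    while len >= 32 {
        let a = read_volatile(src as *const __m128i);
        let b = read_volatile(src.add(16) as *const __m128i);
        write_volatile(dst as *mut __m128i, a);
        write_volatile(dst.add(16) as *mut __m128i, b);
        src = src.add(32);
        dst = dst.add(32);
        len -= 32;
    }
    // Remaining len < 32 bytes: handled by the existing 16/8/4/2/1 cases.
}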

@rbradford rbradford added this to To do in Release 0.8.0 via automation Jun 8, 2020
rbradford added a commit to rbradford/vm-memory that referenced this issue Jun 9, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
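
(For reference, a rough sketch of the approach this commit message describes, not the exact patch; alignment checks are elided and copy_slice_sketch is a hypothetical name.)

use std::mem::size_of;
use std::ptr::{copy_nonoverlapping, read_volatile, write_volatile};

unsafe fn copy_slice_sketch(dst: *mut u8, src: *const u8, len: usize) {
    if len <= size_of::<usize>() && len.is_power_of_two() {
        // Small, naturally sized objects: single volatile accesses so that
        // virtio structures see atomic writes.
        match len {
            8 => write_volatile(dst as *mut u64, read_volatile(src as *const u64)),
            4 => write_volatile(dst as *mut u32, read_volatile(src as *const u32)),
            2 => write_volatile(dst as *mut u16, read_volatile(src as *const u16)),
            _ => write_volatile(dst, read_volatile(src)),
        }
    } else {
        // Bulk data: fall back to the ordinary (fast) memcpy path.
        copy_nonoverlapping(src, dst, len);
    }
}
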
rbradford added a commit to rbradford/cloud-hypervisor that referenced this issue Jun 9, 2020
Currently released vm-memory uses aligned and volatile copying for all
data. The version in the fork only uses the assured (and slower) path
for data up to the natural data width.

Fixes: cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Release 0.8.0 automation moved this from To do to Done Jun 9, 2020
sboeuf pushed a commit that referenced this issue Jun 9, 2020
Currently released vm-memory uses aligned and volatile copying for all
data. The version in the fork only uses the assured (and slower) path
for data up to the natural data width.

Fixes: #1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to cloud-hypervisor/vm-memory that referenced this issue Aug 6, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/vm-memory that referenced this issue Oct 2, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/vm-memory that referenced this issue Oct 2, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258
Fixes: rust-vmm#100

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/cloud-hypervisor that referenced this issue Oct 2, 2020
Update the version number used to point to the latest version but
continue to use our patched version due to the fix for cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/cloud-hypervisor that referenced this issue Oct 2, 2020
A new version of vm-memory was released upstream which resulted in some
components pulling in that new version. Update the version number used
to point to the latest version but continue to use our patched version
due to the fix for cloud-hypervisor#1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit to rbradford/vm-memory that referenced this issue Oct 2, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258
Fixes: rust-vmm#100

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
rbradford added a commit that referenced this issue Oct 2, 2020
A new version of vm-memory was released upstream which resulted in some
components pulling in that new version. Update the version number used
to point to the latest version but continue to use our patched version
due to the fix for #1258

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
alexandruag pushed a commit to rust-vmm/vm-memory that referenced this issue Nov 12, 2020
Where small objects are those objects that are less than the native data
width for the platform. This ensures that volatile and alignment-safe
read/writes are used when updating structures that are sensitive to this,
such as virtio devices where the spec requires writes to be atomic.

Fixes: cloud-hypervisor/cloud-hypervisor#1258
Fixes: #100

Signed-off-by: Rob Bradford <robert.bradford@intel.com>