
high packet loss on bridge between VMs on debian KVM hypervisor #1

Closed
appliedprivacy opened this issue Oct 18, 2021 · 1 comment

appliedprivacy commented Oct 18, 2021

We see high packet loss (>20%) and high latency (>10 ms) between VMs (Debian 10 and 11) running on the same KVM hypervisor host (Debian 11), on a bridge that connects only virtual interfaces, with no physical interface involved. The packet loss occurs only on the red connectivity shown in the diagram below. All other links, which involve physical interfaces, are not affected.
The physical interfaces A and B are attached to the bgp-VM via passthrough. Interface C is a bridge (br0) attached to a physical interface.
Interface D is purely virtual (virbr1) with no physical interfaces attached.

[diagram bgp1: network topology with interfaces A, B, C (br0) and D (virbr1)]
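An isolated libvirt network like this is typically defined without a <forward> element. A minimal sketch of what the bgp1-to-vms definition might look like (the exact XML is an assumption; the network and bridge names are taken from the outputs below):

virsh net-dumpxml bgp1-to-vms
<network>
  <name>bgp1-to-vms</name>
  <bridge name='virbr1' stp='on'/>
</network>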

ping from VM1 to bgp-VM:

ping  109.70.100.65
PING 109.70.100.65 (109.70.100.65): 56 data bytes
64 bytes from 109.70.100.65: icmp_seq=0 ttl=64 time=0.478 ms
64 bytes from 109.70.100.65: icmp_seq=2 ttl=64 time=19.331 ms
64 bytes from 109.70.100.65: icmp_seq=3 ttl=64 time=6.489 ms
[...]
64 bytes from 109.70.100.65: icmp_seq=8 ttl=64 time=19.009 ms
64 bytes from 109.70.100.65: icmp_seq=9 ttl=64 time=0.412 ms
64 bytes from 109.70.100.65: icmp_seq=10 ttl=64 time=2.670 ms
^C
--- 109.70.100.65 ping statistics ---
12 packets transmitted, 9 packets received, 25.0% packet loss
round-trip min/avg/max/stddev = 0.412/8.830/21.736/8.391 ms

Same issue when pinging VM1 <-> VM2.

domifstat shows the drops on RX (not TX):

virsh domifstat bgp-VM vnet1
vnet1 rx_bytes 20491744935664
vnet1 rx_packets 33410748178
vnet1 rx_errs 0
vnet1 rx_drop 133910310
vnet1 tx_bytes 25131685398653
vnet1 tx_packets 40646152117
vnet1 tx_errs 0
vnet1 tx_drop 0
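For scale, the cumulative drop ratio on that interface is only about 0.4%, which suggests the >20% loss seen in the pings happens in bursts under load rather than constantly. A quick check:

echo "scale=4; 133910310 / 33410748178 * 100" | bc
.4000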

KVM host:

Linux clamps 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux

The hypervisor has 32 CPU cores. Usage is at 75%.
[screenshot: hypervisor CPU usage]

Besides the qemu processes, there are two processes on the KVM host named vhost-<number> that each take up ~50% of a single CPU core.
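A quick way to spot them (output is a sketch; the vhost worker kernel threads are named vhost-<pid> after the owning qemu process):

ps -eLo pid,comm,%cpu --sort=-%cpu | grep vhost- | head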

virsh net-list
 Name          State    Autostart   Persistent
------------------------------------------------
 bgp1-to-vms   active   yes         yes
grep -A 5 "interface type"  VM1.xml 
    <interface type='network'>
      <mac address='52:54:00:76:a6:41'/>
      <source network='bgp1-to-vms'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
grep -A 5 "interface type" bgp-VM.xml 
    <interface type='bridge'>
      <mac address='52:54:00:81:95:e3'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <interface type='network'>
      <mac address='52:54:00:5f:19:a5'/>
      <source network='bgp1-to-vms'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </interface>
virsh domiflist bgp-VM
 Interface   Type      Source        Model    MAC
-----------------------------------------------------------------
 vnet0       bridge    br0           virtio   52:54:00:81:95:e3
 vnet1       network   bgp1-to-vms   virtio   52:54:00:5f:19:a5
brctl show
bridge name	bridge id		STP enabled	interfaces
br0		8000.9aaec4c52743	no		eno1
							vnet0
virbr1		8000.52540087124b	yes		vnet1
							vnet2
							vnet3
							vnet4
							vnet5
							vnet6

ping from VM2 to bgp-VM:

ping 109.70.100.65
PING 109.70.100.65 (109.70.100.65) 56(84) bytes of data.
64 bytes from 109.70.100.65: icmp_seq=1 ttl=64 time=111 ms
64 bytes from 109.70.100.65: icmp_seq=2 ttl=64 time=20.0 ms
64 bytes from 109.70.100.65: icmp_seq=3 ttl=64 time=20.0 ms
[...]
64 bytes from 109.70.100.65: icmp_seq=14 ttl=64 time=40.0 ms
64 bytes from 109.70.100.65: icmp_seq=15 ttl=64 time=40.1 ms
64 bytes from 109.70.100.65: icmp_seq=17 ttl=64 time=42.9 ms
^C
--- 109.70.100.65 ping statistics ---
18 packets transmitted, 13 received, 27.7778% packet loss, time 17082ms
rtt min/avg/max/mdev = 19.983/36.399/110.561/23.196 ms
grep vnet6 /proc/net/dev; sleep 1;  grep vnet6 /proc/net/dev
 vnet6: 22540596855380 36729757829    0    0    0     0          0         0 18765581100141 29759595348    0 1211300822    0     0       0          0
 vnet6: 22540646924720 36729841541    0    0    0     0          0         0 18765625454759 29759660007    0 1211319749    0     0       0          0
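From these two one-second samples the deltas are roughly 64.7k packets/s transmitted towards the guest and ~18.9k drops/s on the same tap device, i.e. a loss in the same >20% range as the pings above. A small helper to compute the deltas directly (field positions assume the standard /proc/net/dev column order, where $11 is tx_packets and $13 is tx_drop):

t0=$(awk '/vnet6:/ {print $11, $13}' /proc/net/dev); sleep 1
t1=$(awk '/vnet6:/ {print $11, $13}' /proc/net/dev)
echo "$t0 $t1" | awk '{printf "tx_pps=%d drop_pps=%d\n", $3-$1, $4-$2}'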

these metrics are also included in node_exporter output:

node_network_transmit_drop_total{device="virbr1"} 
node_network_transmit_drop_total{device="vnet0"} 
node_network_transmit_drop_total{device="vnet1"} 
node_network_transmit_drop_total{device="vnet..."} 
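(From the host's point of view the drops show up as transmit drops on the tap devices, matching the guest-side rx_drop reported by domifstat.) A hedged PromQL example for graphing or alerting on the drop rate, the 5m window being an arbitrary choice:

rate(node_network_transmit_drop_total{device=~"vnet.*"}[5m])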

The following graph is from the hypervisor and shows network traffic (in pps) and dropped packets per second as traffic ramps up.
Drops start at a network load of around 20k pps.

[graph: packets per second and dropped packets per second on the hypervisor]

A few log entries from the KVM host (/var/log/messages); there are no relevant entries in /var/log/syslog:

kernel: [1025010.317588] br0: port 2(vnet9) entered blocking state
kernel: [1025010.317595] br0: port 2(vnet9) entered disabled state
kernel: [1025010.317784] device vnet9 entered promiscuous mode
kernel: [1025010.318392] br0: port 2(vnet9) entered blocking state
kernel: [1025010.318397] br0: port 2(vnet9) entered forwarding state
kernel: [1025010.532324] audit: type=1400 audit(1635007733.797:79): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-56ec6397-5928-483d-a9f1-5c61246ece9e" pid=2021336 comm="apparmor_parser"
kernel: [1025010.541096] virbr1: port 1(vnet10) entered blocking state
kernel: [1025010.541100] virbr1: port 1(vnet10) entered disabled state
kernel: [1025010.541279] device vnet10 entered promiscuous mode
kernel: [1025010.541778] virbr1: port 1(vnet10) entered blocking state
kernel: [1025010.541783] virbr1: port 1(vnet10) entered listening state
kernel: [1025010.762247] audit: type=1400 audit(1635007734.029:80): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-56ec6397-5928-483d-a9f1-5c61246ece9e" pid=2021348 comm="apparmor_parser"
kernel: [1025010.826756] audit: type=1400 audit(1635007734.093:81): apparmor="DENIED" operation="capable" profile="libvirtd" pid=973 comm="rpc-worker" capability=38  capname="perfmon"
kernel: [1025012.389140] vfio-pci 0000:06:00.0: Masking broken INTx support
kernel: [1025012.389230] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
kernel: [1025012.517115] vfio-pci 0000:06:00.1: Masking broken INTx support
kernel: [1025012.560950] virbr1: port 1(vnet10) entered learning state
kernel: [1025014.546677] vfio-pci 0000:06:00.0: vfio_bar_restore: reset recovery - restoring BARs
kernel: [1025014.576772] virbr1: port 1(vnet10) entered forwarding state
kernel: [1025014.576784] virbr1: topology change detected, propagating
kernel: [1025014.780404] vfio-pci 0000:06:00.1: vfio_bar_restore: reset recovery - restoring BARs
kernel: [1027370.083357] vfio-pci 0000:06:00.0: vfio_bar_restore: reset recovery - restoring BARs
kernel: [1027370.116192] vfio-pci 0000:06:00.1: vfio_bar_restore: reset recovery - restoring BARs

a few web search results on this topic


appliedprivacy commented Oct 27, 2021

Solved via multi-queue virtio-net.

The following was added to the bgp-VM and the high-bandwidth VM interfaces pointing to the bridge (at the cost of increased CPU load):

<driver name='vhost' queues='8'/>

https://cloudblog.switch.ch/2016/09/06/tuning-virtualized-network-node-multi-queue-virtio-net/
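For completeness, a sketch of where the line goes and of the guest-side step (the interface name eth0 inside the guest is an assumption; the queue count is typically matched to the VM's vCPU count):

virsh edit bgp-VM
    <interface type='network'>
      <mac address='52:54:00:5f:19:a5'/>
      <source network='bgp1-to-vms'/>
      <model type='virtio'/>
      <driver name='vhost' queues='8'/>
    </interface>

inside the guest, after restarting the VM:
ethtool -L eth0 combined 8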
