CoreOS VMs on Azure drop network when overlay vxlan is under load #1156
We've set up a three-node CoreOS cluster on Azure with a Flannel vxlan overlay network. Everything works fine initially when pinging containers over Flannel; however, as soon as network traffic increases, the VMs lose network connectivity.
From what we can see, the VMs are still running, but you can no longer SSH into them, and the etcd logs fill up with connection errors to other cluster members.
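To pin down the moment a VM drops off the network, the symptoms above (SSH unreachable, etcd peers failing to connect) can be watched from another machine in the same vnet. This is only a rough sketch and not part of the original repro: the function names are made up for illustration, and ports 2379/2380 are assumed to be the etcd client and peer ports, so adjust them if your cluster listens elsewhere.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(peers, ports=(22, 2379, 2380)):
    """Map each (host, port) pair to its current reachability.

    Port 22 is SSH; 2379/2380 are assumed etcd client/peer ports.
    """
    return {(host, port): port_open(host, port)
            for host in peers
            for port in ports}
```

Calling `probe()` with the private IPs of the cluster members in a loop while the load test runs should show roughly when SSH and etcd stop answering on the affected VM.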
Here's how we're able to reproduce the issue:
The first container's qperf eventually prints
The same steps run without a hitch on AWS; this only seems to be a problem for us on Azure.
I then tried Weave's vxlan (fast datapath), and that caused the same problem as Flannel vxlan, which I would think rules out Flannel.
Out of curiosity, I started two CentOS 7.1 VMs in the same Azure vnet, installed Docker (1.10) and Weave (1.4.5), and ran the same qperf test that caused the previous network issues. This, however, worked without an issue, and the qperf test completed successfully.
Since it seems to work on CentOS I assume that the issue is somewhere in CoreOS or its compatibility with Azure. Any thoughts?
Thanks for the clear bug report and simple repro case. I was able to reproduce this with CoreOS 835.13.0 (current Stable) and 899.9.0 (current Beta), but not 976.0.0 (current Alpha).
@nordbergm would you mind using Alpha in the meantime while we promote the 4.4 kernel through Beta and Stable?