This repository has been archived by the owner. It is now read-only.

CoreOS VMs on Azure drop network when overlay vxlan is under load #1156

Closed
nordbergm opened this issue Mar 6, 2016 · 7 comments

Comments

@nordbergm nordbergm commented Mar 6, 2016

We've set up a 3-node CoreOS cluster on Azure with a Flannel vxlan overlay network. Everything works fine initially when pinging containers over flannel. However, as soon as network traffic increases, the VMs lose network connectivity.

From what we can see, the VMs are still running, but we can no longer SSH into them, and the etcd logs fill up with connection errors to other cluster members.

Here's how we're able to reproduce the issue:

  1. Start a 3-node CoreOS Stable (835.13.0) cluster in an Azure virtual network. In our case, this was the West US region with vnet subnet 172.16.0.0/16 and the following flannel config: { "Network": "10.0.0.0/16", "SubnetLen": 25, "Backend": { "Type": "vxlan" } }
  2. Start a qperf container on two separate VMs: docker run -ti tobwiens/ubuntu-qperf
  3. In the first container, run: qperf
  4. In the second container, run: qperf -t 10 <first container ip> tcp_bw udp_lat
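As a quick sanity check on the addressing in step 1, here is a minimal Python sketch that parses the flannel config and compares it with the vnet range. The helper name `summarize` is hypothetical, and the subnet count is only an upper bound, since flannel may hold some ranges back from allocation.

```python
import ipaddress
import json

# Flannel config and Azure vnet subnet, copied from step 1 of the report.
FLANNEL_CONFIG = '{ "Network": "10.0.0.0/16", "SubnetLen": 25, "Backend": { "Type": "vxlan" } }'
AZURE_VNET = "172.16.0.0/16"


def summarize(config_json, vnet_cidr):
    """Return basic facts about the overlay addressing (hypothetical helper)."""
    config = json.loads(config_json)
    overlay = ipaddress.ip_network(config["Network"])
    vnet = ipaddress.ip_network(vnet_cidr)
    subnet_len = config["SubnetLen"]
    return {
        "backend": config["Backend"]["Type"],
        # True would mean the overlay range collides with the vnet range.
        "overlaps_vnet": overlay.overlaps(vnet),
        # Upper bound on the number of per-host blocks in the overlay network.
        "max_host_subnets": 2 ** (subnet_len - overlay.prefixlen),
        # Total addresses in each per-host block.
        "addresses_per_subnet": 2 ** (overlay.max_prefixlen - subnet_len),
    }


if __name__ == "__main__":
    print(summarize(FLANNEL_CONFIG, AZURE_VNET))
```

With the values from the report, the overlay and vnet ranges do not overlap, so an address collision is unlikely to explain the connectivity loss.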

The first container's qperf eventually prints "qperf failed to receive synchronization after test: timed out", and the second container's VM becomes unresponsive.
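To timestamp the moment a VM stops responding during the test, a small TCP probe against the SSH port can be left running from another machine. This is only a sketch: the function names, the default port 22, and the polling parameters are assumptions and not part of the original setup.

```python
import socket
import time


def tcp_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch(host, port=22, interval=1.0, checks=10):
    """Probe host repeatedly; return a list of (timestamp, reachable) samples."""
    samples = []
    for _ in range(checks):
        samples.append((time.time(), tcp_reachable(host, port)))
        time.sleep(interval)
    return samples
```

Running `watch` against the second VM while qperf is active should show the samples flip from reachable to unreachable around the time the test stalls.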

The same steps run without a hitch on AWS; this only seems to be a problem for us on Azure.
I wasn't sure if the problem was with Azure or Flannel, so I tried Flannel's default UDP backend, and that worked fine.

I then tried Weave's vxlan (fast datapath), and that caused the same problem as Flannel vxlan, which I think rules out Flannel as the cause.

Out of curiosity, I started two CentOS 7.1 VMs in the same Azure vnet, installed Docker (1.10) and Weave (1.4.5), and ran the same qperf test that caused the previous network issues. This, however, completed successfully without any issues.

Since it seems to work on CentOS, I assume that the issue is somewhere in CoreOS or its compatibility with Azure. Any thoughts?

Member

@crawford crawford commented Mar 8, 2016

Thanks for the clear bug report and simple repro case. I was able to reproduce this with CoreOS 835.13.0 (current Stable) and 899.9.0 (current Beta), but not 976.0.0 (current Alpha).

@nordbergm would you mind using Alpha in the meantime while we promote the 4.4 kernel through Beta and Stable?

Member

@crawford crawford commented Mar 8, 2016

Heh, go figure. As soon as I sent that, the 976.0.0 machine failed as well.

Member

@crawford crawford commented Mar 8, 2016

I'll loop in the Azure folks and see if we can't figure out what's going on here.

Author

@nordbergm nordbergm commented Mar 8, 2016

@crawford Thanks for looking into this so quickly. Let me know if there's anything I can do to help test and if you need more information.

Member

@crawford crawford commented Mar 24, 2016

It looks like the relevant patches (coreos/linux@757647e, coreos/linux@a060679, and coreos/linux@c85e492) made it into Linux 4.5. I think we can pull that in next week.
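For anyone checking a node by hand, a rough heuristic is to compare the running kernel release against 4.5. The sketch below is an assumption-laden illustration: the helper names and the sample release strings in the comments are hypothetical, and a distribution may backport the patches to an older kernel, so a lower version number does not prove the fix is missing.

```python
import platform
import re

# Per the comment above, the relevant patches made it into Linux 4.5.
FIXED_IN = (4, 5)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.4.6-coreos'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=FIXED_IN):
    """Return True if the release's (major, minor) is at least minimum."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    print(release, kernel_at_least(release))
```

The tuple comparison avoids the string-ordering trap where "4.10" would sort before "4.5".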

Author

@nordbergm nordbergm commented Apr 5, 2016

Thanks, that's great. Look forward to trying it out in the alpha channel.

Member

@crawford crawford commented Apr 5, 2016

Should be fixed by coreos/coreos-overlay#1867.

@crawford crawford closed this Apr 5, 2016
@crawford crawford added this to the CoreOS 1011.0.0 milestone Apr 5, 2016