CoreOS VMs on Azure drop network when overlay vxlan is under load #1156

Closed
nordbergm opened this Issue Mar 6, 2016 · 7 comments


nordbergm commented Mar 6, 2016

We've set up a 3-node CoreOS cluster on Azure with a Flannel vxlan overlay network. Everything works fine initially (we can ping containers across the flannel network), but as soon as network traffic increases, the VMs lose network connectivity.

From what we can see, the VMs are still running, but you can no longer SSH into them, and the etcd logs fill up with connection errors to the other cluster members.
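
In case it helps others debug, these are generic checks one could run from a node that is still reachable while the test is going (a sketch: `flannel.1` is flannel's default vxlan interface name, and `eth0` is assumed to be the primary NIC):

```sh
# Inspect the vxlan device that flannel creates (flannel.1 by default)
ip -d link show flannel.1

# Check the NIC's offload settings; vxlan problems on virtualized NICs
# often interact with checksum/segmentation offloads (an assumption to verify)
ethtool -k eth0

# Watch etcd's view of the cluster while the load test runs
etcdctl cluster-health
```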

Here's how we're able to reproduce the issue (a consolidated command sequence follows the list):

  1. Start a 3-node CoreOS Stable (835.13.0) cluster in an Azure virtual network. In our case: the West US region, a vnet subnet of 172.16.0.0/16, and the following flannel config: `{ "Network": "10.0.0.0/16", "SubnetLen": 25, "Backend": { "Type": "vxlan" } }`
  2. Start a qperf container on two separate VMs: `docker run -ti tobwiens/ubuntu-qperf`
  3. In the first container, run `qperf`
  4. In the second container, run `qperf -t 10 <first container ip> tcp_bw udp_lat`
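
Put together, the sequence looks like this (`<first container ip>` is the first container's address from the 10.0.0.0/16 flannel range):

```sh
# On VM 1: start the qperf server container
docker run -ti tobwiens/ubuntu-qperf
# ...then, inside the container:
qperf

# On VM 2: start a second container and point qperf at the first one
docker run -ti tobwiens/ubuntu-qperf
# ...then, inside the container:
qperf -t 10 <first container ip> tcp_bw udp_lat
```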

The first container's qperf eventually fails with `failed to receive synchronization after test: timed out`, and the VM hosting the second container becomes unresponsive.

The same steps run without a hitch on AWS; this only seems to be a problem for us on Azure.
I wasn't sure if the problem was with Azure or with Flannel, so I tried Flannel's default UDP backend, and that worked fine.
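
For reference, the only change for the UDP test was the backend type. With flannel reading its config from etcd, that's roughly the following (a sketch assuming flannel's default config key, `/coreos.com/network/config`):

```sh
# Switch the flannel backend from vxlan to udp, then restart flannel
# (assumes the default etcd config key; adjust if yours differs)
etcdctl set /coreos.com/network/config \
  '{ "Network": "10.0.0.0/16", "SubnetLen": 25, "Backend": { "Type": "udp" } }'
```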

I then tried Weave's vxlan implementation (fast datapath), and it caused the same problem as Flannel's vxlan, which I would think rules out Flannel itself.

Out of curiosity, I started two CentOS 7.1 VMs in the same Azure vnet, installed Docker (1.10) and Weave (1.4.5), and ran the same qperf test that caused the previous network issues. This time it worked without an issue, and the qperf test completed successfully. (A sketch of the Weave setup is below.)
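
The Weave side of the test looked roughly like this (a sketch, not our exact commands; `WEAVE_NO_FASTDP=1` is Weave's documented switch for falling back from fast datapath to the slower sleeve transport):

```sh
# With fast datapath (vxlan via the kernel's Open vSwitch datapath):
weave launch
eval $(weave env)            # route docker commands through the Weave proxy
docker run -ti tobwiens/ubuntu-qperf

# Control case: disable fast datapath and use the sleeve transport instead
WEAVE_NO_FASTDP=1 weave launch
```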

Since it works on CentOS, I assume the issue is somewhere in CoreOS or in its compatibility with Azure. Any thoughts?

crawford commented Mar 8, 2016

Thanks for the clear bug report and simple repro case. I was able to reproduce this with CoreOS 835.13.0 (current Stable) and 899.9.0 (current Beta), but not 976.0.0 (current Alpha).

@nordbergm would you mind using Alpha in the meantime while we promote the 4.4 kernel through Beta and Stable?

crawford commented Mar 8, 2016

Heh, go figure. As soon as I sent that, the 976.0.0 machine failed as well.

crawford commented Mar 8, 2016

I'll loop in the Azure folks and see if we can't figure out what's going on here.

nordbergm commented Mar 8, 2016

@crawford Thanks for looking into this so quickly. Let me know if there's anything I can do to help test, or if you need more information.

crawford commented Mar 24, 2016

It looks like the relevant patches (coreos/linux@757647e, coreos/linux@a060679, and coreos/linux@c85e492) made it into Linux 4.5. I think we can pull that in next week.
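
Once that lands, a quick way to confirm what a given node is running:

```sh
uname -r              # kernel version (the fixes above shipped in mainline 4.5)
cat /etc/os-release   # CoreOS version and build
```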

nordbergm commented Apr 5, 2016

Thanks, that's great. Looking forward to trying it out in the alpha channel.

crawford commented Apr 5, 2016

Should be fixed by coreos/coreos-overlay#1867.

@crawford crawford closed this Apr 5, 2016

@crawford crawford added this to the CoreOS 1011.0.0 milestone Apr 5, 2016
