Fix Debian base boxes networking issue #153
Comments
This is due to dhclient checking the checksum of the UDP packets it receives, but checksums are not computed if the packets don't leave through a physical interface. You can add an iptables rule to force adding checksums (replace the network range with your own if needed) : |
@fcharlier tks for the input :) TBH this is pretty weird to me as it used to work in the past and I thought I had moved all the custom stuff from the "old" templates into the new script. looks like I might need to look at the git logs again to find out if I missed out on something. |
Seems like an LXC issue to me. Here is a patch for the lxc-net init script on Ubuntu hosts:
|
Agreed on the patch above. Had to do the same thing. |
Thanks guys! I'll give it a try later on and will add an entry to our troubleshooting section of the Wiki if it works for me as well :) |
Oh folks, seems like this should get more attention after a long debugging session over twitter with @rcarmo today, as he experienced that with an Ubuntu guest as well if I'm not wrong. I'm willing to look into fixing this but I have no idea how to detect that issue prior to booting the container so we can either fix things automatically or let the users know that things are misconfigured and let them choose whether they want to make the changes or not. Does anyone have any thoughts? (/cc @leorochael as you seem to be interested in this as well :) |
I'd say you should check for the existence of an iptables rule that does the checksumming and add it if missing. Period :) On Oct 16, 2013, at 03:50 , Fabio Rehm notifications@github.com wrote:
|
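The "check for the rule and add it if missing" idea could be sketched roughly like this. It is only a sketch: the function name `ensure_checksum_rule` and the overridable `IPTABLES` variable are made up for illustration, the rule spec is the one quoted elsewhere in this thread, and it assumes an iptables version that supports the `-C` (check) option and a bridge named `lxcbr0`.

```shell
#!/bin/sh
# Sketch only: add the CHECKSUM rule if an identical one is not there yet.
# IPTABLES is overridable (hypothetical knob) so the logic can be tried
# without root; the bridge name defaults to lxcbr0, which is an assumption.
IPTABLES="${IPTABLES:-iptables}"

ensure_checksum_rule() {
  bridge="${1:-lxcbr0}"

  # Rule spec as quoted in this thread, with the bridge as a parameter.
  set -- POSTROUTING -o "$bridge" -p udp --dport bootpc -j CHECKSUM --checksum-fill

  # -C exits 0 when an identical rule already exists in the chain.
  if "$IPTABLES" -t mangle -C "$@" 2>/dev/null; then
    echo "rule already present on $bridge"
  else
    "$IPTABLES" -t mangle -A "$@" && echo "rule added on $bridge"
  fi
}

# Usage on the host (needs root):
#   ensure_checksum_rule lxcbr0
```

Running it twice should leave a single rule in place, which is the main point of checking before appending.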
Also, I've just upgraded my 13.04/lxc 0.9.0 VM to saucy/lxc 1.0.0. Adding the checksumming made things work as well. |
I have hit the same issue and found another solution on the Xen wiki: http://wiki.xen.org/wiki/Xen_Networking Instead of having iptables calculate the checksum, the interface can be configured to disable checksum offload and thus make the OS network stack do all the work without relying on hardware (which isn't there, because it's a bridge).
Here I present the packet capture before and after the configuration change: Notice the problematic This works just like the iptables rule, but I find it cleaner. |
Yeah. I tested this with Tiago and it works if you add this just after
This can also be put inside vagrant-lxc, although you'll need to have ethtool installed (which it isn't by default, at least not on my barebones 13.04 and 13.10 VMs) R. On Oct 16, 2013, at 18:49 , Tiago Teresa Teodósio notifications@github.com wrote:
|
Tks guys! Just to confirm, will this have any sort of side effect if we end up doing it automatically? Should we ask for confirmation prior to running the command? I just want to make sure we don't mess up users' networks by doing this automatically :P |
As I said on other issues, my networking skillz are still close to none at this moment so how can we identify the issue in advance prior to running the command? I mean, which commands do I need to run in order to identify the problem? |
I can't see what sort of side effects it would bring, since I have Ubuntu and Debian containers running with this set now. What this does is disable the off-load of packet checksums to a hardware interface, which seems to have become the default ever since Ethernet cards became smart enough to checksum packets on their own. This was done to save CPU time, but since we're not using a hardware interface in lxcbr0, there's no hardware to checksum the packets, and hence the host kernel seems to be sending them without checksums at all (they're supposed to be filled in inline by the hardware). But the guests consider those invalid packets. I'm at a loss to explain if this is part of an LXC tweak, a kernel tweak, or some mis-placed default, but that's what's happening now. All that will happen is that SSHing to a container will take an extra, negligible amount of CPU (heck, we're encrypting the connection inside the machine already...). R. |
Tks for the information everyone but I'm still not sure how / when I'll get to fix this. Tonight I'll try to sum everything up into a single issue along with the proposed fixes and will add a BIG note to the README pointing users over there. |
While we have two suggested solutions now, I still wonder how this problem got introduced. Is it caused by recent kernel changes? LXC changes? |
Our money's actually on the kernel. The guy who wrote ethtool seems to be related to the bridging module. But it might also be a change in LXC init somewhere. |
My analysis was not entirely correct. Please do not run Running |
Quick update: I was doing some testing on a clean Ubuntu 13.04 installation on DigitalOcean and after rebuilding the base wheezy box I was able to connect to it without issues. |
@fgrehm The problem is only during creation of containers. As soon as possible I'll try the patch that you have shown me. |
@fgrehm |
@fgrehm this weekend I'll try again with the latest stable release of lxc from the Ubuntu team. |
I had the occasion to set up another 13.10 container box. using |
The actual bug is in the ISC DHCP client, which is supposed to honor a flag (TP_STATUS_CSUMNOTREADY) indicating that the UDP checksum was never computed and should be ignored. Various distributions are adding a patch to the DHCP client package that fixes this. For example: https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/930962 Unfortunately the ethtool or iptables workaround will be needed for any guests with the buggy DHCP client. |
Guests that previously worked fine started misbehaving after upgrading the host, so that's not the only factor. On 29 Oct 2013, at 23:09 , Ed Swierk notifications@github.com wrote:
|
Correct, the ISC DHCP client bug is not the only factor. Until recently, the veth driver didn't claim to support checksum offload and other features. Now that it does (http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8093315a91340bca52549044975d8c7f673b28a1), a packet traverses the entire path from LXC host to guest, through bridges and veths, without the checksum getting computed. Any client that doesn't notice this, and assumes the checksum is valid, would break. |
Thanks for all the information folks. I still didn't have the time to dig into this but I ran into it while doing some testing with other stuff. So I'd love to have vagrant-lxc erroring out / warning users in case we are able to detect the issue on the host. I personally don't like the idea of automatically running arbitrary commands without need (like on my laptop that doesn't have a need for it) or without letting the user know what's going on. If someone could please point me to the buggy package version and / or a command that we can use to detect if the problem is in place I promise I'll implement it on the plugin pretty quick :D |
Maybe checking for a specific kernel version is feasible. @eswierk, can you track that commit you pointed to into a release tag? I know Ubuntu tweaks their stuff, but... |
The minimal workaround is to disable TX checksum offload on the LXC's veth device by running ethtool -K vethXXX tx off. That command is effective only if the veth device advertises TX checksum offload support, which is the exact condition that triggers the ISC DHCP bug. So you might as well just do it; there's little value in going to the trouble of checking the kernel version first. It would be more useful to check the version of the ISC DHCP client in the container, and apply the workaround only if it's an old version that doesn't deal with the missing checksum situation. But that check would depend on the exact distribution you've installed in the container. For Ubuntu you could use the bug fix release details from https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/930962. |
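For anyone wanting to detect the condition on the host before changing anything, one possible approach is to look at what ethtool reports for the device. ethtool's lowercase `-k` option lists the offload settings (uppercase `-K` changes them). The helper below is a rough sketch with a made-up name, `tx_checksum_offload_on`; it reads that listing on stdin and assumes the usual `tx-checksumming: on` / `off` line is present.

```shell
#!/bin/sh
# Sketch only: exit 0 if the `ethtool -k <dev>` listing read from stdin
# reports TX checksum offload as enabled, non-zero otherwise.
# The function name is hypothetical.
tx_checksum_offload_on() {
  grep -Eq '^[[:space:]]*tx-checksumming:[[:space:]]*on'
}

# Usage on the host (device name is an example):
#   if ethtool -k lxcbr0 | tx_checksum_offload_on; then
#     echo "TX checksum offload is on; the workaround may be needed"
#   fi
```

This only tells you the host-side half of the story. Whether a guest actually breaks still depends on its DHCP client, as described above.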
On (K)Ubuntu 13.04 host with lxc 0.9.0-0ubuntu3.7, disabling checksum offloading seems to solve all networking issues on all boxes I tried. On a 13.10 host with lxc 1.0.0 |
Just wanted to add two cents, been stuck on this for 24 hours. Just reiterating. Seems solved now, though I am only testing with basic lxc commands, not vagrant-lxc. Can we agree that using base lxc makes more sense for debugging?
root@my-macbook-air:~# lsb_release -a
Only this worked for me:
sudo iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp --dport bootpc -j CHECKSUM --checksum-fill
I tried ethtool solution but did not work. Once the iptables rule is added to host system -- note, people did not say above where to add the rule clearly -- the client system was able to fetch an IP from dnsmasq. Otherwise, I saw "bad udp cksum" messages in output of "tcpdump -vvv -i lxcbr0". |
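The "bad udp cksum" symptom mentioned above can also be turned into a quick check. The snippet below is a sketch under assumptions: `count_bad_udp_cksum` is a made-up helper name, and it simply counts lines containing the "bad udp cksum" text in whatever tcpdump output is piped into it.

```shell
#!/bin/sh
# Sketch only: count lines on stdin that carry tcpdump's "bad udp cksum"
# marker. The function name is hypothetical.
count_bad_udp_cksum() {
  # grep -c prints the number of matching lines; it exits 1 when that
  # number is 0, so the exit status is neutralised here.
  grep -c 'bad udp cksum' || true
}

# Usage on the host while a container is requesting a lease, stopping
# the capture with Ctrl-C (needs root):
#   tcpdump -l -n -vvv -i lxcbr0 udp port 67 or udp port 68 | count_bad_udp_cksum
```

A non-zero count on the bridge while a container is trying to get an address suggests the host is hitting this issue.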
So... it has been 2 months since this issue had any activity and I wasn't able to reproduce it on Raring or on Saucy. Besides that, I did some testing and the new Debian boxes built with the changes on GH-245 are working fine for me as well. We have had a note on the readme pointing users to this issue for 4 months, so I'd expect more reports if it was an issue affecting lots of people. If someone is able to hook me up with a VBox machine that consistently reproduces the bug or is able to put up a PR that detects and / or fixes the problems I'll be more than happy to review / bring it in, but for now I'm just going to close the issue :-) |
Folks, for whatever reason I ended up hitting this issue again on a Ubuntu Saucy VBox VM with a Debian Sid box. It's pretty weird that it doesn't happen on my laptop using Saucy as well o_O Anyways, the easiest way I could work around it without changing anything on the plugin was to make use of the Basically I created
#!/bin/bash
/sbin/ethtool -K lxcbr0 tx off
And added the following to my
Vagrant.configure("2") do |config|
  config.vm.provider :lxc do |lxc|
    lxc.customize 'network.script.up', '/sbin/ethtool-fix'
  end
end
If you want that behavior for all of your vagrant-lxc containers, you can add that line above to your |
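Since ethtool isn't installed by default on some hosts (as noted earlier in the thread), a slightly more defensive variant of the hook script might look like the sketch below. The `ETHTOOL` and `BRIDGE` variables and the `disable_tx_offload` function are illustrative names, not something vagrant-lxc or LXC provides.

```shell
#!/bin/sh
# Sketch only: a more defensive take on the ethtool hook script.
# ETHTOOL and BRIDGE are hypothetical overridable settings.
ETHTOOL="${ETHTOOL:-/sbin/ethtool}"
BRIDGE="${BRIDGE:-lxcbr0}"

disable_tx_offload() {
  # Skip quietly (exit status 0) when ethtool is missing, so a missing
  # package does not stop the container from starting.
  if ! command -v "$ETHTOOL" >/dev/null 2>&1; then
    echo "ethtool not found at $ETHTOOL, skipping" >&2
    return 0
  fi
  "$ETHTOOL" -K "$BRIDGE" tx off
}

# A real hook script would end by calling:
#   disable_tx_offload
```

Returning success when ethtool is absent is a design choice. Failing loudly instead may be preferable if you would rather know that the workaround was not applied.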
Just built the boxes using the new scripts and while trying them on a Raring host I noticed that the containers boot but do not get an IP. An attempt to bring a wheezy container up by hand with
sudo lxc-start -n $(cat .vagrant/machines/default/lxc/id)
yields:
Dunno what to do in order to fix it right now.