containers in docker 1.11 does not get same MTU as host #22297

Closed
jxstanford opened this Issue Apr 25, 2016 · 16 comments

@jxstanford
jxstanford commented Apr 25, 2016 edited

This issue is the same/similar to the issues documented in #22028 and #12565, and is being opened under a new issue at the request of @thaJeztah.


BUG REPORT INFORMATION

Output of docker version:

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:40:36 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:40:36 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 8
 Running: 8
 Paused: 0
 Stopped: 0
Images: 8
Server Version: 1.11.0
Storage Driver: devicemapper
 Pool Name: docker-253:1-260061060-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 2.418 GB
 Data Space Total: 107.4 GB
 Data Space Available: 79.62 GB
 Metadata Space Used: 6.398 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.141 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.64 GiB
Name: summit-training-gse.novalocal
ID: JOSZ:IGGB:P4V5:ADQH:HSZM:XFFE:TSNR:LUWM:FQAC:N6FC:TZOM:GNKC
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.):

Docker running in a CentOS 7.2 VM on RedHat RDO (Liberty) OpenStack cloud.

Steps to reproduce the issue:

  1. Create a docker container
  2. Compare the MTU inside the container with the MTU on the host (e.g. with the commands sketched below)
  3. Try a command such as apt-get update that typically involves packets large enough to require fragmentation
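
For example, a rough sketch (interface and image names are just for illustration):

# On the host (reports mtu 1400 in this environment):
ip link show eth0 | grep -o 'mtu [0-9]*'
# In a fresh container (reports mtu 1500 instead):
docker run --rm debian:jessie sh -c "ip link show eth0 | grep -o 'mtu [0-9]*'"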

Describe the results you received:
Host interface info:
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> **mtu 1400** qdisc pfifo_fast state UP qlen 1000

Container interface info:
eth0@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> **mtu 1500** qdisc noqueue state UP group default

Requests originating from the container with packets larger than 1400 bytes are dropped.

Describe the results you expected:

I would expect functionality on par with pre-1.10 Docker, where networking worked without user intervention in the form of setting the MTU on the daemon (since there is no sysconfig or other environment-based configuration mechanism, this literally means editing the service script), editing the Docker-related iptables rules, or adjusting the MTU on the container.

Additional information you deem important (e.g. issue happens only occasionally):

The other tickets referenced identify a couple of workarounds, including setting the --mtu flag on the docker daemon. That didn't work for us: after dropping the container and image, adjusting the daemon args, and starting the container again, the MTU in the container remained at 1500 while the host was at 1400.

Our workaround involves inserting an iptables rule to mangle the packets in transit between the host and container:

iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
(see: https://www.frozentux.net/iptables-tutorial/chunkyhtml/x4721.html)
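
If you want to scope the clamp to container traffic only, a variant matching on the bridge should also work (docker0 is the default bridge name, adjust if yours differs), and you can confirm the rule is active afterwards:

iptables -I FORWARD -i docker0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# confirm the rule is present and matching packets
iptables -L FORWARD -n -v | grep TCPMSS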

I wouldn't consider any of the workarounds referenced to be a fix for this issue. In my opinion, the fix is to have the container MTU match the host interface MTU at container creation, without user intervention.

@thaJeztah
Member

After dropping the container and image, adjusting daemon args, and starting the container again, the MTU in the container remained at 1500 while host was 1400.

Hm, that wasn't clear from your previous comment; how did you set the --mtu for the daemon, and are you sure this setting was picked up by the daemon? Did you use a drop-in file for systemd? Have you run systemctl daemon-reload after making those changes?
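
For reference, a drop-in would look roughly like this (the path and ExecStart line are an example for a 1.11-era package and may differ on your system):

# /etc/systemd/system/docker.service.d/mtu.conf
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// --mtu=1400

# then reload and restart
systemctl daemon-reload
systemctl restart docker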

Adding --mtu 300 to my daemon options and running a container, this looks to work for me:

docker run --rm debian:jessie sh -c "ip a | grep mtu"
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
3: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN group default qlen 1
4: ip6gre0@NONE: <NOARP> mtu 1448 qdisc noop state DOWN group default qlen 1
11: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 300 qdisc noqueue state UP group default
@jxstanford

I just confirmed proper functioning of the --mtu flag in our lab, so that was a red herring.

@phemmer
Contributor
phemmer commented Apr 26, 2016 edited

In my opinion, the fix for the issue is to have the container MTU match host interface MTU upon container creation without user intervention.

Setting the container MTU to the host MTU is a hack. We shouldn't be implementing hacks to solve network issues.
What if your box has a second route with an MTU smaller than the default route's? Your containers will still have issues using that alternate route.
The proper solution is to figure out why PMTU discovery is apparently not working.

Edited: s/bigger/smaller/

@davebiffuk

I'm also seeing this issue in a Docker-on-OpenStack environment. The host instance has MTU 1400, Docker brings up docker0 with MTU 1500, network performance suffers.

Setting the container MTU to the host MTU is a hack. We shouldn't be implementing hacks to solve network issues.

Agreed, but reverting commit fd9d7c0 would allow many/most common use cases (container host with a single network interface) to work again without user intervention to set the MTU.

@phemmer
Contributor
phemmer commented Jul 22, 2016

network performance suffers

Do you have any evidence to support this? Unless you've turned PMTU discovery off, once the kernel has found the MTU, it should persist it, and effectively your MTU becomes 1400.
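
A quick way to check whether the kernel actually learned the lower path MTU, from inside an affected container after attempting a large transfer (the destination address is just an example):

ip route get 8.8.8.8
# a learned path MTU shows up as a per-destination entry, e.g. "... cache expires 598sec mtu 1400"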

@davebiffuk

Yes, I have a system demonstrating this available to me now.

The hypervisor host has MTU 1500. Instance traffic is VXLAN-encapsulated, which means the instances get a lower MTU, i.e. the 1400 shown here. (This is similar to the IPSEC encapsulation scenario at the start of this issue.)

root@trusty-instance:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:77:15:ec brd ff:ff:ff:ff:ff:ff
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:49:1d:61:c3 brd ff:ff:ff:ff:ff:ff
9: veth9d6a511: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default 
    link/ether 2a:dc:b6:b3:b1:c0 brd ff:ff:ff:ff:ff:ff

Docker containers set the MTU to 1500:

root@8a9c9641a12e:/# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
8: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:c0:a8:03:01 brd ff:ff:ff:ff:ff:ff

and large transfers hang after the initial handshake:

root@8a9c9641a12e:/# apt-get update
Ign http://archive.ubuntu.com trusty InRelease
12% [Waiting for headers]

tcpdump shows no ICMP fragmentation-needed packets on any of the interfaces.
net.ipv4.ip_no_pmtu_disc = 0 on the instance and the hypervisor host.
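
For reference, a capture filter along these lines should catch them (ICMP type 3, code 4 is fragmentation-needed; the interface is just an example):

tcpdump -ni docker0 'icmp[0] == 3 and icmp[1] == 4'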

If I set --mtu=1400 or do docker run --net=host then the problem does not appear. It's due to Docker using a Linux bridge with a too-large MTU. Looks like this issue too: #12565 (comment)

Let me know if there's anything else I can provide to help.

@phemmer
Contributor
phemmer commented Jul 22, 2016

Hanging isn't the same as degradation. Hanging is almost always indicative of something blocking ICMP packets. I've tested this scenario and was never able to replicate any issues (on a vanilla box, with no tweaks). How customized is the host? Your prompt looks like you're running Ubuntu Trusty. Is this stock Trusty or a modified distribution, where did it come from, etc.?

@jterstriep

I've seen the exact same situation with Docker on OpenStack. Performance suffers; building an image with apt-get upgrade can take hours. We were setting the --mtu flag manually and restarting the Docker daemon, but with Docker 1.12 and docker-machine it has become problematic.

@einarf
einarf commented Aug 27, 2016 edited

docker run --rm debian:jessie sh -c "ip a | grep mtu" does give me the expected MTU value as configured with the --mtu flag in docker 1.11.x and 1.12.x ...

... but when doing a run or up through a compose setup (using the default network_mode and no custom network config, docker-compose==1.8.0, format 2), the networks created (br-xxxxxxx) will use an MTU of 1500 regardless of the --mtu flag. docker-compose run --rm myservice sh -c "ip a | grep mtu" always returns 1500.

If I run everything on the host network, everything works perfectly fine, but that is painful.

I'm not sure whether docker-compose or dockerd itself is doing something wrong, or whether we are supposed to configure these additional networks ourselves in detail.

EDIT: This is explained in the post below

(Running in OpenStack, where the MTU is 1450)

@einarf
einarf commented Aug 27, 2016 edited

Simple workaround for OpenStack and compose 2 (I will use an MTU of 1450 in this example):
(Works in docker 1.12.x and probably also 1.11.x)

Make sure to pass the correct --mtu=1450 to the daemon so the host network (docker0) gets the right MTU. The bridge networks created by compose will still get the default MTU of 1500.

EDIT: When using compose 2, the docker0 network will remain at an MTU of 1500 (unlike when using compose 1). This is perfectly fine; I'm guessing that since no container is attached directly to that network, the MTU change does not apply.

We can fix the additional networks by overriding the default network in the compose file (version 2)

networks:
  default:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: 1450

You probably have to manually delete the old network created by compose (docker network rm <network_name>), as you will probably see a message like ERROR: Network "<network_name>" needs to be recreated - options have changed. This needs to be done every time you change the network. (Optionally, as a quick fix if you have a lot of hosts to work with, you can get compose to use a different network name so you are not obstructed by this issue; use com.docker.network.bridge.name for this.)

This is equivalent to doing a docker network create -o "com.docker.network.driver.mtu"="1450" <network_name> (which creates a bridge unless you specify otherwise), so if you prefer to manage your networks manually, this is what you do.
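
Putting it together, a minimal version 2 compose file using this approach might look like the following (the service name, image and command are placeholders):

version: '2'
services:
  app:
    image: debian:jessie
    command: sh -c "ip a | grep mtu"
networks:
  default:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: 1450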

Just overriding the default network to point at your manually created external network is also easy:

networks:
  default:
    external:
      name: <network_name>

When creating bridge networks manually, do not get confused by the initial mtu set on the interface. It will report an mtu of 1500, but as soon as you run containers, the values will adjust.

The more confusing part was that the engine reference docs list the wrong parameter name for specifying the bridge MTU (found this related issue: #24921):
https://docs.docker.com/engine/reference/commandline/network_create/#/bridge-driver-options

It took a fair amount of digging to finally get this working. I'm sure this can be translated to other ways of configuring networks; I just used the default bridge network for simplicity. As long as you find the right values for the driver you are using, you should be fine.

NOTE: This test was done on Ubuntu Trusty. There might be some underlying issues related to the network configuration that need to be solved. All I know is that the instance gets its MTU of 1450 through DHCP, and that's about it.

@JohannesRudolph

The same issue as described by @einarf above persists in Docker 1.12 (should thus add the 1.12 label as well)!

@aboch aboch self-assigned this Sep 26, 2016
@aboch
Contributor
aboch commented Sep 26, 2016

From @jxstanford's comment, if I understood correctly, the original problem reported by this issue turned out not to be a real problem:

I just confirmed proper functioning of the --mtu flag in our lab, so that was a red herring.

Therefore I believe this issue should be closed.

@einarf Regarding your unexpected MTU value, please be aware that the docker daemon --mtu flag only affects the default bridge network.

We do not have a first-class --mtu flag for the docker network create command yet, so the way users can set the MTU for a custom network is via options:

docker network create --opt com.docker.network.driver.mtu=<value> <nw name>
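
To double-check that the option was applied, inspecting the network should show it, e.g.:

docker network inspect --format '{{ .Options }}' <nw name>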

The same should be possible in your compose file via the driver_opts field.
If not, please open a dedicated issue for that.

@einarf
einarf commented Sep 26, 2016

@aboch Yep, that is all covered in my last post :)

@mavenugo
Contributor

Thanks @aboch @einarf for the confirmation. Let's close this issue.

@mavenugo mavenugo closed this Sep 26, 2016
@dsmiley
dsmiley commented Sep 27, 2016

Maybe I'm missing something, but is the title of this issue, "containers in docker 1.11 does not get same MTU as host", actually addressed? Yes, --mtu works, but that's a workaround for Docker not determining the correct MTU on its own.

@aboch
Contributor
aboch commented Sep 27, 2016 edited

@dsmiley
In another PR (#18108) it was decided to drop the behavior where the MTU for containers running on the default bridge network would be inherited from the host (there were valid concerns with that approach, and IIRC @phemmer led the discussion).
Based on that, we would not restore the original behavior, which is what this issue was initially asking for.

As usually happens, other issues piled onto this one. In this case they were due to the incorrect assumption that the --mtu option was not working, or that it would affect user-defined networks.

Given that the original request won't be addressed and the other two sub-issues were addressed, we decided to close this issue.

Besides this, there does in fact seem to be an outstanding issue with path MTU discovery, but there are already more specific issues open for that.
