docker 1.10, 1.11 do not infer MTU from eth0; docker 1.9 does #22028

Closed
sprohaska opened this Issue Apr 14, 2016 · 12 comments

@sprohaska

Docker 1.9 takes the MTU from eth0 and uses it for veth:

$ ip addr
eth0: ... mtu 1454, ...
vethc4...: mtu 1454, ...

With Docker 1.10 and 1.11, I need to explicitly pass --mtu to the daemon. Without it, docker uses 1500 for the veth and I observe network problems.
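A minimal sketch of that workaround, assuming a Linux host where sysfs exposes the interface MTU. The helper name `read_mtu`, the `IFACE` variable, and the fallback value of 1500 are illustrative choices, not part of Docker. The script only prints the daemon invocation so it can be reviewed before use.

```shell
# Read an interface's MTU from sysfs (Linux only). Falling back to 1500 when
# the interface does not exist is an assumption made for this sketch.
read_mtu() {
    cat "/sys/class/net/$1/mtu" 2>/dev/null || echo 1500
}

# Print the daemon invocation instead of running it.
iface="${IFACE:-eth0}"
echo "docker daemon --mtu $(read_mtu "$iface")"
```

On the host described in this issue, where eth0 has MTU 1454, this would print `docker daemon --mtu 1454`.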

@sprohaska

Commit fd9d7c0 'don't try to use default route MTU as bridge MTU' might be the reason. The commit doesn't say why the default for the MTU is no longer determined from the default route.

@thaJeztah
Member

More information can be found in the pull request, #18108, and in the issue it resolves, #7796.

@sprohaska

The issue is resolved for me. I've decided to consider the --mtu flag mandatory for me.

I read through the comments in the referenced PR and issue. I tend to disagree with the reasoning: If I understand correctly, the new behavior is to always use 1500 as the default. The old behavior was to use the MTU of the default route. In my understanding the old behavior was strictly better: If the setup is non-standard, the MTU should probably be manually configured, and the default does not matter. If one relies on the default, the MTU of the default route seems to be a better choice than an unspecific default of 1500. The default route is likely to be relevant, and if its MTU is smaller than 1500, docker should be configured accordingly.
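For readers who want to reproduce the old default by hand, the interface behind the default route can be looked up from `ip route` output. This is only a sketch of the idea, not Docker's implementation; the function name `default_route_dev` and the sample gateway address are hypothetical.

```shell
# Extract the interface name that follows "dev" on the default-route line
# of `ip route` output read from stdin.
default_route_dev() {
    awk '$1 == "default" { for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Example with a hypothetical route line; on a real host, pipe in
# the output of `ip route show default` instead.
echo "default via 172.17.0.1 dev eth0" | default_route_dev   # prints: eth0
```

The MTU of the resulting interface can then be read from `/sys/class/net/<interface>/mtu` or from `ip addr` and passed to the daemon with --mtu.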

@phemmer
Contributor
phemmer commented Apr 14, 2016

@sprohaska What is the problem you are having? I'd be happy to help, but I need to know what your symptoms are.
If you're having issues with traffic sometimes not getting from the container through your default route, this is almost certainly an issue with misconfiguration of your host. In this case, the output of iptables -nvL and sysctl -a 2>/dev/null | grep pmtu would be very useful. Also, are you adjusting any network settings inside the containers, or running any special network settings in docker (network driver, custom bridge, etc)?

@sprohaska

@phemmer Thanks for offering help.

With the default MTU 1500, apt-get update seems stuck on 'Waiting for headers', which led me to http://askubuntu.com/a/532663, which led me to MTU.

I'm running docker on a VM in OpenStack, using a recent Ubuntu 15.10 cloud image. I haven't tweaked the firewall on the VM. The VM doesn't have a public IP. It relies on OpenStack SNAT to access the Internet.

$ iptables -nvL
Chain INPUT (policy ACCEPT 941K packets, 428M bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 ACCEPT     tcp  --  lxcbr0 *       0.0.0.0/0            0.0.0.0/0            tcp dpt:53
    0     0 ACCEPT     udp  --  lxcbr0 *       0.0.0.0/0            0.0.0.0/0            udp dpt:53
    0     0 ACCEPT     tcp  --  lxcbr0 *       0.0.0.0/0            0.0.0.0/0            tcp dpt:67
    0     0 ACCEPT     udp  --  lxcbr0 *       0.0.0.0/0            0.0.0.0/0            udp dpt:67

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DOCKER-ISOLATION  all  --  *      *       0.0.0.0/0            0.0.0.0/0
32795   46M DOCKER     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0
32795   46M ACCEPT     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
15617  878K ACCEPT     all  --  docker0 !docker0  0.0.0.0/0            0.0.0.0/0
    0     0 ACCEPT     all  --  docker0 docker0  0.0.0.0/0            0.0.0.0/0
    0     0 ACCEPT     all  --  *      lxcbr0  0.0.0.0/0            0.0.0.0/0
    0     0 ACCEPT     all  --  lxcbr0 *       0.0.0.0/0            0.0.0.0/0

Chain OUTPUT (policy ACCEPT 953K packets, 300M bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain DOCKER (1 references)
 pkts bytes target     prot opt in     out     source               destination

Chain DOCKER-ISOLATION (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0
$ sysctl -a 2>/dev/null | grep pmtu
net.ipv4.ip_forward_use_pmtu = 0
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.route.min_pmtu = 552

Docker 1.9.1 just worked. Docker 1.10 and 1.11 also work when I explicitly configure --mtu 1454. With --mtu 1454 I have the device config below. The MTUs of eth0, docker0, and the veth match, and the network looks good. Without --mtu, docker0 and the veth have MTU 1500, and apt-get inside a container seems stuck.

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:07:0a:a2 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.122/24 brd 172.17.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe07:aa2/64 scope link
       valid_lft forever preferred_lft forever
3: lxcbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 1e:96:6c:b7:26:8d brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.1/24 scope global lxcbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::1c96:6cff:feb7:268d/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue state UP group default
    link/ether 02:42:28:b0:2f:d9 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:28ff:feb0:2fd9/64 scope link
       valid_lft forever preferred_lft forever
24: veth46a284d@if23: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue master docker0 state UP group default
    link/ether ca:ac:6a:5c:37:e0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c8ac:6aff:fe5c:37e0/64 scope link
       valid_lft forever preferred_lft forever

@phemmer
Contributor
phemmer commented Apr 14, 2016

Hrm, unable to reproduce on my normal system. Will try to get cloudstack up and running, maybe they're doing something fishy.
However I wouldn't expect apt-get update to trigger the issue, even if there was one. What is happening is that somewhere along the line, an ICMP-FRAGMENTATION-NEEDED packet is getting dropped. I expected to see something in the iptables output which would result in this.
As mentioned, I'll try to get cloudstack up and running. But if you have any desire to troubleshoot this on your own, I'd do some tcpdump -i $ifname icmp on the various interfaces on your system (the veth46a..., docker0, etc), and see if the packet goes missing anywhere. You should see it coming from 172.18.0.1 (the docker0 ip addr).
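One way to narrow the capture suggested above is to filter for ICMP "destination unreachable" messages with code 4 (fragmentation needed) only. The sketch below prints one tcpdump command per interface rather than running them, since capturing requires root. The function name `frag_capture_cmds` is illustrative, and the interface list is an assumption; the veth name differs per container.

```shell
# pcap filter for ICMP destination-unreachable with code 4
# ("fragmentation needed and DF set").
FRAG_FILTER='icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'

# Print one tcpdump command per interface given as an argument.
frag_capture_cmds() {
    for ifname in "$@"; do
        printf "tcpdump -n -i %s '%s'\n" "$ifname" "$FRAG_FILTER"
    done
}

frag_capture_cmds eth0 docker0
```

Running the printed commands in separate terminals while reproducing the stall shows on which interface, if any, the ICMP message appears.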

@sprohaska

I tried to tcpdump the issue. ping -M do -s ... works for various sizes as expected. But HTTP connections get stuck. I don't see ICMP on docker0 (see tcpdump output below).
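For reference, the largest `ping -M do -s` payload that fits a given MTU can be worked out as below. This is a sketch assuming IPv4 without IP options (20-byte IP header plus 8-byte ICMP header); the function name `ping_payload_for_mtu` and the ping target are illustrative.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

size="$(ping_payload_for_mtu 1454)"   # 1426
# Print the probe command; -M do sets the DF bit so oversized packets are not fragmented.
echo "ping -M do -s ${size} -c 3 google.com"
```

A payload one byte larger than this value should fail on a link whose MTU is 1454.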

I'm wondering whether any of the following discussions is relevant:

#3596
#12565
http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge#What_can_be_bridged.3F

I'm going to use an explicit --mtu and would stop here, considering the issue solved for me.

@phemmer, I could gather more details if you want to investigate further.

docker daemon with default mtu 1500. curl -L http://google.com gets stuck:

# tcpdump -nvv -i docker0
...
19:50:24.451507 IP (tos 0x0, ttl 64, id 45819, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.0.2.45750 > 74.125.136.94.80: Flags [S], cksum 0x7f1e (incorrect -> 0x535e), seq 2597794260, win 29200, options [mss 1460,sackOK,TS val 12616495 ecr 0,nop,wscale 7], length 0
19:50:24.465708 IP (tos 0x0, ttl 47, id 21200, offset 0, flags [none], proto TCP (6), length 60)
    74.125.136.94.80 > 172.18.0.2.45750: Flags [S.], cksum 0xeabe (correct), seq 656656527, ack 2597794261, win 42540, options [mss 1430,sackOK,TS val 1859835394 ecr 12616495,nop,wscale 7], length 0
19:50:24.465769 IP (tos 0x0, ttl 64, id 45820, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.45750 > 74.125.136.94.80: Flags [.], cksum 0x7f16 (incorrect -> 0xbeb0), seq 1, ack 1, win 229, options [nop,nop,TS val 12616499 ecr 1859835394], length 0
19:50:24.465955 IP (tos 0x0, ttl 64, id 45821, offset 0, flags [DF], proto TCP (6), length 165)
    172.18.0.2.45750 > 74.125.136.94.80: Flags [P.], cksum 0x7f87 (incorrect -> 0xbdb2), seq 1:114, ack 1, win 229, options [nop,nop,TS val 12616499 ecr 1859835394], length 113: HTTP, length: 113
    GET /?gfe_rd=cr&ei=gPQPV9_WGoHv-gbTzp2oBA HTTP/1.1
...

19:50:24.480113 IP (tos 0x0, ttl 47, id 21212, offset 0, flags [none], proto TCP (6), length 52)
    74.125.136.94.80 > 172.18.0.2.45750: Flags [.], cksum 0xbdc8 (correct), seq 1, ack 114, win 333, options [nop,nop,TS val 1859835409 ecr 12616499], length 0
19:50:24.668545 IP (tos 0x0, ttl 47, id 21279, offset 0, flags [none], proto TCP (6), length 1379)
    74.125.136.94.80 > 172.18.0.2.45750: Flags [P.], cksum 0x35f6 (correct), seq 9927:11254, ack 114, win 333, options [nop,nop,TS val 1859835597 ecr 12616499], length 1327: HTTP
19:50:24.668599 IP (tos 0x0, ttl 64, id 45822, offset 0, flags [DF], proto TCP (6), length 64)
    172.18.0.2.45750 > 74.125.136.94.80: Flags [.], cksum 0x7f22 (incorrect -> 0x55ad), seq 114, ack 1, win 251, options [nop,nop,TS val 12616550 ecr 1859835409,nop,nop,sack 1 {9927:11254}], length 0

docker daemon --mtu 1454 .... curl works:

# tcpdump -nvv -i docker0
...
19:52:21.100104 IP (tos 0x0, ttl 64, id 42790, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [S], cksum 0x7f1e (incorrect -> 0x814f), seq 3679081020, win 28280, options [mss 1414,sackOK,TS val 12645658 ecr 0,nop,wscale 7], length 0
19:52:21.114716 IP (tos 0x0, ttl 47, id 10265, offset 0, flags [none], proto TCP (6), length 60)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [S.], cksum 0xc380 (correct), seq 1743802859, ack 3679081021, win 42540, options [mss 1430,sackOK,TS val 1621126013 ecr 12645658,nop,wscale 7], length 0
19:52:21.114763 IP (tos 0x0, ttl 64, id 42791, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [.], cksum 0x7f16 (incorrect -> 0x977b), seq 1, ack 1, win 221, options [nop,nop,TS val 12645661 ecr 1621126013], length 0
19:52:21.114875 IP (tos 0x0, ttl 64, id 42792, offset 0, flags [DF], proto TCP (6), length 165)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [P.], cksum 0x7f87 (incorrect -> 0x13b9), seq 1:114, ack 1, win 221, options [nop,nop,TS val 12645661 ecr 1621126013], length 113: HTTP, length: 113
    GET /?gfe_rd=cr&ei=9fQPV6HCBYHZ8AfQu77QBw HTTP/1.1
...

19:52:21.129320 IP (tos 0x0, ttl 47, id 10274, offset 0, flags [none], proto TCP (6), length 52)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0x968b (correct), seq 1, ack 114, win 333, options [nop,nop,TS val 1621126028 ecr 12645661], length 0
19:52:21.157952 IP (tos 0x0, ttl 47, id 10283, offset 0, flags [none], proto TCP (6), length 1454)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0x8ee0 (correct), seq 1:1403, ack 114, win 333, options [nop,nop,TS val 1621126056 ecr 12645661], length 1402: HTTP, length: 1402
    HTTP/1.1 200 OK
...

19:52:21.157977 IP (tos 0x0, ttl 47, id 10284, offset 0, flags [none], proto TCP (6), length 1454)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0xb62d (correct), seq 1403:2805, ack 114, win 333, options [nop,nop,TS val 1621126056 ecr 12645661], length 1402: HTTP
19:52:21.157984 IP (tos 0x0, ttl 47, id 10285, offset 0, flags [none], proto TCP (6), length 1454)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0x5dd7 (correct), seq 2805:4207, ack 114, win 333, options [nop,nop,TS val 1621126056 ecr 12645661], length 1402: HTTP
19:52:21.157992 IP (tos 0x0, ttl 47, id 10286, offset 0, flags [none], proto TCP (6), length 1454)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0x3990 (correct), seq 4207:5609, ack 114, win 333, options [nop,nop,TS val 1621126056 ecr 12645661], length 1402: HTTP
19:52:21.157999 IP (tos 0x0, ttl 47, id 10287, offset 0, flags [none], proto TCP (6), length 1454)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0x7bc5 (correct), seq 5609:7011, ack 114, win 333, options [nop,nop,TS val 1621126056 ecr 12645661], length 1402: HTTP
19:52:21.158006 IP (tos 0x0, ttl 47, id 10288, offset 0, flags [none], proto TCP (6), length 1454)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0x34bc (correct), seq 7011:8413, ack 114, win 333, options [nop,nop,TS val 1621126056 ecr 12645661], length 1402: HTTP
19:52:21.158012 IP (tos 0x0, ttl 47, id 10289, offset 0, flags [none], proto TCP (6), length 1454)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [.], cksum 0xea70 (correct), seq 8413:9815, ack 114, win 333, options [nop,nop,TS val 1621126056 ecr 12645661], length 1402: HTTP
19:52:21.158062 IP (tos 0x0, ttl 64, id 42794, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [.], cksum 0x7f16 (incorrect -> 0x8bb4), seq 114, ack 2805, win 265, options [nop,nop,TS val 12645672 ecr 1621126056], length 0
19:52:21.158080 IP (tos 0x0, ttl 64, id 42797, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [.], cksum 0x7f16 (incorrect -> 0x7b04), seq 114, ack 7011, win 331, options [nop,nop,TS val 12645672 ecr 1621126056], length 0
19:52:21.158100 IP (tos 0x0, ttl 64, id 42800, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [.], cksum 0x7f16 (incorrect -> 0x6a54), seq 114, ack 11217, win 397, options [nop,nop,TS val 12645672 ecr 1621126056], length 0
19:52:21.158383 IP (tos 0x0, ttl 64, id 42802, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [F.], cksum 0x7f16 (incorrect -> 0x6a0f), seq 114, ack 11285, win 397, options [nop,nop,TS val 12645672 ecr 1621126056], length 0
19:52:21.158422 IP (tos 0x0, ttl 64, id 31003, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.36920 > 216.58.213.238.80: Flags [F.], cksum 0x5a64 (incorrect -> 0x1972), seq 75, ack 472, win 230, options [nop,nop,TS val 12645672 ecr 333127129], length 0
19:52:21.166644 IP (tos 0x0, ttl 55, id 9837, offset 0, flags [none], proto TCP (6), length 52)
    216.58.213.238.80 > 172.18.0.2.36920: Flags [F.], cksum 0x18c3 (correct), seq 472, ack 76, win 333, options [nop,nop,TS val 333127200 ecr 12645672], length 0
19:52:21.166686 IP (tos 0x0, ttl 64, id 31004, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.36920 > 216.58.213.238.80: Flags [.], cksum 0x5a64 (incorrect -> 0x1928), seq 76, ack 473, win 230, options [nop,nop,TS val 12645674 ecr 333127200], length 0
19:52:21.173076 IP (tos 0x0, ttl 47, id 10300, offset 0, flags [none], proto TCP (6), length 52)
    74.125.136.94.80 > 172.18.0.2.46532: Flags [F.], cksum 0x6a3f (correct), seq 11285, ack 115, win 333, options [nop,nop,TS val 1621126071 ecr 12645672], length 0
19:52:21.173118 IP (tos 0x0, ttl 64, id 42803, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.2.46532 > 74.125.136.94.80: Flags [.], cksum 0x7f16 (incorrect -> 0x69fb), seq 115, ack 11286, win 397, options [nop,nop,TS val 12645676 ecr 1621126071], length 0
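
The MSS values in the two traces follow from the bridge MTU, which the arithmetic below reproduces. It is a sketch assuming IPv4 and TCP headers of 20 bytes each; the function names are illustrative. One plausible reading of the traces, not confirmed in this thread, is that with MTU 1500 the server may send packets of up to 1470 bytes, which do not fit a 1454-byte link, while with MTU 1454 its packets are capped at 1454 bytes.

```shell
# MSS a host advertises for a given MTU: MTU minus the 20-byte IPv4 header
# and the 20-byte TCP header.
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Largest IPv4 packet a peer may send: the smaller of the two advertised
# MSS values plus the 40 bytes of headers.
max_packet() {
    if [ "$1" -lt "$2" ]; then
        echo $(( $1 + 40 ))
    else
        echo $(( $2 + 40 ))
    fi
}

mss_for_mtu 1500                        # 1460, as in the stuck trace
mss_for_mtu 1454                        # 1414, as in the working trace
max_packet "$(mss_for_mtu 1500)" 1430   # 1470, larger than the eth0 MTU of 1454
max_packet "$(mss_for_mtu 1454)" 1430   # 1454, the packet length seen in the working trace
```

Here 1430 is the MSS the server advertised in both traces.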

@justincormack
Member

I think force setting the MTU with --mtu is the only sensible way, and the best outcome, as it leaves it up to the admin, and makes it independent of ordering as in the issue.

@thaJeztah
Member

I'll close this issue, because this was a deliberate change, and specifying the MTU is the solution for this. Feel free to comment here after I've closed it.

@thaJeztah thaJeztah closed this Apr 18, 2016
@jxstanford

We're running docker on OpenStack VMs, and ran into this issue. Setting --mtu to match eth0 on the host did not help in our case (docker-engine 1.11.0). We ended up inserting a rule at the beginning of the iptables FORWARD chain. The command we used is:

iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
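
As a variation on the rule above, the TCPMSS target also accepts a fixed value through `--set-mss`. The sketch below builds such a rule from an MTU, assuming IPv4 and TCP headers of 20 bytes each; the function name `fixed_mss_rule` is illustrative, and whether a fixed value helps depends on the actual path MTU, which this thread does not establish.

```shell
# Build the arguments for a TCPMSS rule with a fixed MSS (MTU minus 40 bytes
# of IPv4 and TCP headers), as an alternative to --clamp-mss-to-pmtu.
fixed_mss_rule() {
    echo "FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $(( $1 - 40 ))"
}

# Print the command for review; it has to be run as root to take effect.
echo "iptables -I $(fixed_mss_rule 1454)"
```

For an MTU of 1454 this prints a rule that sets the MSS to 1414.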

I would implore you to reconsider closing this issue. It's esoteric enough that it will haunt people and possibly drive them away. I'm fortunate enough to be surrounded by some very good network folks, and I suspect that many would not be able to resolve this on their own.

@thaJeztah
Member

@jxstanford could you open a new issue? If manually changing the MTU does not resolve the issue, then automatically setting the MTU likely won't resolve it either, so I'd prefer to handle it as a separate issue.
