don't try to use default route MTU as container MTU #18108

Merged 1 commit into docker:master on Dec 7, 2015


@phemmer
Contributor
phemmer commented Nov 20, 2015

Trying to use the default route's MTU as the container (bridge) MTU is a bad idea:

  • The box can have multiple default routes with different MTUs.
  • The default route and MTU can change after docker has started.
  • Traffic from the container might not even go over the default route (alternate routes, virtual networks, & inter-container communication).

Aside from the difficulty of determining which MTU to use, doing so is also unnecessary. The kernel performs path MTU discovery to resolve this exact situation. So, this PR lets the kernel do its job.
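To illustrate the first point, here is a hypothetical host (addresses and MTUs invented for the example) where "the" default route's MTU is simply ambiguous:

# Two default routes on links with different MTUs -- which one should docker0 copy?
$ ip route show default
default via 10.0.0.1 dev eth0 metric 100
default via 192.168.1.1 dev wlan0 metric 200
$ ip link show eth0 | grep -o 'mtu [0-9]*'
mtu 9000
$ ip link show wlan0 | grep -o 'mtu [0-9]*'
mtu 1500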

It might even be a good idea to raise the default MTU to 9000. But this PR just fixes the bad behavior. We can improve things in another PR.

closes #7796

@LK4D4
Contributor
LK4D4 commented Nov 20, 2015
@calavera
Contributor

LGTM, but I'll wait until someone on the networking team gives their 👍

@LK4D4
Contributor
LK4D4 commented Nov 23, 2015
@aboch
Contributor
aboch commented Nov 23, 2015

@phemmer

This logic was already removed in #13060

Then it was added back (slightly modified) in #13953 in order to fix #13952

Could you please check that your change is not going to reopen issue #13952?
(Also pinging @ibuildthecloud, who opened #13952.)

@phemmer
Contributor
phemmer commented Nov 25, 2015

Yes, this will introduce the behavior described in #13952. However, #13952 does not mention any actual problems caused by that behavior.

@mountkin
Contributor
@aboch
Contributor
aboch commented Nov 25, 2015

If manually setting the --mtu daemon flag is an acceptable solution for #13952, then this PR should be merged.
It would also help with #18249.
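For reference, a minimal sketch of that workaround, with a 1454-byte MTU as a purely illustrative value (and /etc/docker/daemon.json assumed as the config-file path):

# set the bridge/container MTU explicitly instead of relying on auto-detection
$ docker daemon --mtu=1454

# or persistently, in /etc/docker/daemon.json:
{
  "mtu": 1454
}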

@phemmer
Contributor
phemmer commented Nov 25, 2015

I'm going to take a look at what mountkin mentioned.
My first thought is that something was misconfigured on the system, as PMTUD is supposed to prevent such issues. Though I'll see if I can get my hands on a GCE box and try to replicate the issue.

@phemmer
Contributor
phemmer commented Nov 26, 2015

Yeah, I'm not able to reproduce any issues on GCE. Everything behaves exactly as expected.

I launched a GCE box with a host MTU of 1460, started docker with --mtu 1500, ran tcpdump inside the container, and started a large stream to a remote host.
The tcpdump output shows PMTUD working exactly as it's supposed to:

02:24:39.387865 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [S], seq 3106098261, win 29200, options [mss 1460,sackOK,TS val 909093 ecr 0,nop,wscale 7], length 0
02:24:39.429699 IP 66.229.123.231.3000 > 172.17.0.2.48770: Flags [S.], seq 3695655258, ack 3106098262, win 5792, options [mss 1460,sackOK,TS val 3753644923 ecr 909093,nop,wscale 2], length 0
02:24:39.429725 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], ack 1, win 229, options [nop,nop,TS val 909103 ecr 3753644923], length 0
02:24:39.429854 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 1:1449, ack 1, win 229, options [nop,nop,TS val 909103 ecr 3753644923], length 1448
02:24:39.429868 IP 172.17.0.1 > 172.17.0.2: ICMP 66.229.123.231 unreachable - need to frag (mtu 1460), length 556
02:24:39.429877 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 1449:2897, ack 1, win 229, options [nop,nop,TS val 909103 ecr 3753644923], length 1448
02:24:39.429881 IP 172.17.0.1 > 172.17.0.2: ICMP 66.229.123.231 unreachable - need to frag (mtu 1460), length 556
02:24:39.429886 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [P.], seq 2897:4345, ack 1, win 229, options [nop,nop,TS val 909103 ecr 3753644923], length 1448
02:24:39.469948 IP 66.229.123.231.3000 > 172.17.0.2.48770: Flags [.], ack 1409, win 2152, options [nop,nop,TS val 3753644927 ecr 909103], length 0
02:24:39.469962 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 5793:7201, ack 1, win 229, options [nop,nop,TS val 909113 ecr 3753644927], length 1408
02:24:39.469964 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 7201:7241, ack 1, win 229, options [nop,nop,TS val 909113 ecr 3753644927], length 40
02:24:39.469965 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 7241:8649, ack 1, win 229, options [nop,nop,TS val 909113 ecr 3753644927], length 1408
02:24:39.474975 IP 66.229.123.231.3000 > 172.17.0.2.48770: Flags [.], ack 1449, win 2152, options [nop,nop,TS val 3753644927 ecr 909103], length 0
02:24:39.474995 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 8649:10057, ack 1, win 229, options [nop,nop,TS val 909115 ecr 3753644927], length 1408
02:24:39.474978 IP 66.229.123.231.3000 > 172.17.0.2.48770: Flags [.], ack 2857, win 2856, options [nop,nop,TS val 3753644927 ecr 909103], length 0
02:24:39.474999 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [P.], seq 10057:11465, ack 1, win 229, options [nop,nop,TS val 909115 ecr 3753644927], length 1408
02:24:39.475001 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 11465:12873, ack 1, win 229, options [nop,nop,TS val 909115 ecr 3753644927], length 1408
02:24:39.475003 IP 172.17.0.2.48770 > 66.229.123.231.3000: Flags [.], seq 12873:14281, ack 1, win 229, options [nop,nop,TS val 909115 ecr 3753644927], length 1408
02:24:39.474982 IP 66.229.123.231.3000 > 172.17.0.2.48770: Flags [.], ack 5753, win 4264, options [nop,nop,TS val 3753644928 ecr 909103], length 0
02:24:39.509035 IP 66.229.123.231.3000 > 172.17.0.2.48770: Flags [.], ack 7201, win 4968, options [nop,nop,TS val 3753644931 ecr 909113], length 0

Notice the ICMP "need to frag" messages indicating an MTU of 1460, and notice that they come from the host itself (172.17.0.1), not from a hop outside the host: the packets can't even leave the box because the default route's MTU is too small.
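A quick way to double-check the same behavior without tcpdump (a sketch, assuming iputils ping inside the container, the 1460-byte host MTU above, and an arbitrary remote host) is to send an oversized probe with fragmentation prohibited:

# 1472 bytes of ICMP payload + 28 bytes of headers = 1500 bytes, which exceeds the 1460-byte uplink
$ ping -M do -s 1472 -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
From 172.17.0.1 icmp_seq=1 Frag needed and DF set (mtu = 1460)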

@aboch
Contributor
aboch commented Nov 26, 2015

Thanks @phemmer for the test.
LGTM

@tiborvass
Contributor

Thanks @phemmer! Needs a rebase.

@phemmer
Contributor
phemmer commented Dec 7, 2015

rebased

@tiborvass
Contributor

LGTM

@thaJeztah
Member

Thanks @phemmer! I searched the docs, and it looks like https://github.com/docker/docker/blob/43077f9b6406e3d5e401a361b4c9742c00be528b/docs/userguide/networking/default_network/custom-docker0.md mentions setting the default based on the host's interface, so that may need some changes.

I don't think other sections of the documentation mention the default value currently.

@thaJeztah thaJeztah added this to the 1.10 milestone Dec 7, 2015
@phemmer fd9d7c0: don't try to use default route MTU as bridge MTU
Signed-off-by: Patrick Hemmer <patrick.hemmer@gmail.com>
@phemmer
Contributor
phemmer commented Dec 7, 2015

Documentation adjusted.

@thaJeztah
Member

Thanks @phemmer!

LGTM

@thaJeztah thaJeztah merged commit b36b492 into docker:master Dec 7, 2015

6 checks passed

docker/dco-signed: All commits signed
documentation: success, 2 tests run, 0 skipped, 0 failed
experimental: Jenkins build Docker-PRs-experimental 11947 has succeeded
janky: Jenkins build Docker-PRs 20713 has succeeded
userns: Jenkins build Docker-PRs-userns 3189 has succeeded
windows: Jenkins build Windows-PRs 18364 has succeeded
@cha87de

This breaks docker running in cloud setups with e.g. VXLAN tunnels and limited MTUs on the virtualised networking. In such a setup the virtual machine's operating system gets the MTU, e.g. via cloud-init, from the cloud middleware. That information is now no longer passed through to the docker containers.

Owner

It's not supposed to be passed through. If your containers can't reach the remote network because they have an MTU bigger than the interface on that network, something is wrong with your setup. If you provide a way to reproduce, I can take a look at it.
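For what it's worth, a setup like that can still pass the cloud-provided MTU on explicitly; a sketch assuming a 1450-byte VXLAN payload limit, using either the daemon flag or the bridge driver's per-network option:

$ docker daemon --mtu=1450
# or, per network:
$ docker network create -o com.docker.network.driver.mtu=1450 mynet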

"something is wrong with your setup"

Obviously ... VXLAN is broken. Thanks for the support.
