
Interface MTU change on AWS instances supporting Jumbo packets #2443

Closed
dghubble opened this issue May 27, 2018 · 9 comments

Comments

@dghubble (Member) commented May 27, 2018

Issue Report

Bug

Between Container Linux releases, the default interface MTU on AWS instances that support Jumbo packets has regressed from 9001 to 1500. This causes connectivity problems between pods across node boundaries, since CNI providers don't expect the MTU to change over time, node by node. It also means Jumbo packets aren't being used. I received pages shortly after the OS auto-update occurred.
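As a stopgap, the MTU could be pinned on the node itself with a systemd-networkd unit so it no longer depends on what DHCP or the driver reports. A minimal sketch, assuming eth0 and the 9001 value AWS advertises (the file name is hypothetical and would need to sort ahead of the stock networkd configuration to win the match):

# /etc/systemd/network/10-eth0-jumbo.network (hypothetical)
[Match]
Name=eth0

[Network]
DHCP=yes

[Link]
MTUBytes=9001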

Container Linux Version

Stable 1745.4.0
Stable 1745.3.1

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.4.0
VERSION_ID=1745.4.0
BUILD_ID=2018-05-24-2146
PRETTY_NAME="Container Linux by CoreOS 1745.4.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

AWS

Reproduction Steps

Launch an EC2 instance that supports Jumbo packets. A t2.small will suffice. With stable 1688.5.3, the interface comes up with MTU 9001.

$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000                                                                                                
    link/ether 06:06:9f:dc:08:10 brd ff:ff:ff:ff:ff:ff

Now try with stable 1745.3.1 or 1745.4.0. The MTU has changed to 1500, as though Jumbo frame support isn't available.

$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:6d:6b:43:05:ec brd ff:ff:ff:ff:ff:ff
@lucab (Member) commented May 27, 2018

I'm not too familiar with these network details, but my guess is that AWS pushes the MTU parameter via DHCP and that UseMTU= should be picking it up.

Can you please:

  • check for relevant log entries with journalctl -u systemd-networkd
  • post the output of networkctl status -a
  • diff /run/systemd/netif/leases/* and /run/systemd/netif/links/* between the current and previous stable for relevant changes.
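For reference, a minimal sketch of a .network unit that takes the MTU from the DHCP lease via UseMTU= (the file name and Match are assumptions; the stock Container Linux unit may differ):

# /etc/systemd/network/20-dhcp.network (hypothetical)
[Match]
Name=eth*

[Network]
DHCP=yes

[DHCP]
UseMTU=true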
@dghubble (Member, Author) commented May 27, 2018

Seems like a good guess. It looks like it's attempting to set the MTU and now failing. On 1688.5.3,

$ journalctl -u systemd-networkd
May 27 21:35:00 localhost systemd-networkd[678]: Enumeration completed
May 27 21:35:00 localhost systemd[1]: Started Network Service.
May 27 21:35:00 localhost systemd-networkd[678]: lo: Configured
May 27 21:35:00 localhost systemd-networkd[678]: eth0: IPv6 successfully enabled
May 27 21:35:00 localhost systemd-networkd[678]: eth0: Gained carrier
May 27 21:35:00 localhost systemd-networkd[678]: eth0: DHCPv4 address 10.0.2.95/20 via 10.0.0.1
May 27 21:35:00 localhost systemd-networkd[678]: Not connected to system bus, not setting hostname.
May 27 21:35:01 localhost systemd-networkd[678]: eth0: Gained IPv6LL
May 27 21:35:01 localhost systemd-networkd[678]: eth0: Configured

and on 1745.4.0,

May 27 21:05:56 localhost systemd-networkd[675]: Enumeration completed
May 27 21:05:56 localhost systemd[1]: Started Network Service.
May 27 21:05:56 localhost systemd-networkd[675]: lo: Configured
May 27 21:05:56 localhost systemd-networkd[675]: eth0: IPv6 successfully enabled
May 27 21:05:56 localhost systemd-networkd[675]: eth0: Gained carrier
May 27 21:05:56 localhost systemd-networkd[675]: eth0: DHCPv4 address 10.0.44.198/20 via 10.0.32.1
May 27 21:05:56 localhost systemd-networkd[675]: Not connected to system bus, not setting hostname.
May 27 21:05:56 localhost systemd-networkd[675]: eth0: Could not set MTU: Invalid argument
May 27 21:05:58 localhost systemd-networkd[675]: eth0: Gained IPv6LL
May 27 21:06:02 localhost systemd-networkd[675]: eth0: Configured

I haven't found anything notable or different in those /run locations.
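One additional cross-check, assuming eth0, is to set the MTU by hand and see whether the kernel itself rejects it, independent of networkd:

$ sudo ip link set dev eth0 mtu 9001
$ ip link show eth0 | grep -o 'mtu [0-9]*'

On 1688.5.3 the set should succeed; on an affected build it would be expected to fail with the same "Invalid argument" (EINVAL) that the networkd log shows.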

@lucab (Member) commented May 28, 2018

Thanks. I'm observing this on latest alpha too, and it seems to have started with the systemd-v238 bump (unrelated to kernel version). I forwarded this to systemd/systemd#9102 with some additional details, but I failed to locate where the real issue is. Waiting for some more insights from upstream.

@bgilbert (Member) commented May 30, 2018

Per systemd/systemd#9102 (comment), this looks like a kernel regression. RHBZ.

@ajeddeloh commented May 30, 2018

Interestingly, this doesn't impact all instance types: m4.large does not exhibit it, but t2.small does. I cannot repro it with QEMU.
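If the difference tracks the network driver each instance type exposes, that can be checked with something like (assuming eth0):

$ ethtool -i eth0 | grep ^driver
$ basename "$(readlink /sys/class/net/eth0/device/driver)"

t2-class instances typically sit behind the Xen netfront driver, while m4 instances with enhanced networking use an SR-IOV driver such as ixgbevf, which could explain why only some instance types (and not QEMU) reproduce the problem.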

@bgilbert (Member) commented May 31, 2018

Reverting coreos/linux@f599c64 fixes the problem.

This should be fixed in alpha 1786.2.0, beta 1772.2.0, and stable 1745.5.0, due shortly. Thanks for the report, @dghubble!
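Once a node picks up one of those releases, a quick confirmation (assuming eth0) is:

$ grep ^VERSION= /etc/os-release
$ ip link show eth0 | grep -o 'mtu [0-9]*'

which should report one of the fixed versions and mtu 9001 again.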

@bgilbert closed this May 31, 2018

@dghubble (Member, Author) commented May 31, 2018

👋 Hooray, thanks Ben and company!

@whereisaaron commented Jun 4, 2018

Thanks for fixing this. I've been chasing load balancer PMTU problems and tracked them back to this. I saw this issue when updating to CoreOS-stable-1745.3.1-hvm: suddenly all my T2 instances had an MTU of 1500, whereas other instance types like M* and R* instances had an MTU of 9001. This caused PMTU problems for external applications that were load-balanced across both types of instance.

I am guessing that while T2 and other instance types support jumbo frames, T2 instances perhaps default to 1500 whereas the others default to 9001 (before DHCP overrides that).
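A don't-fragment ping sized for jumbo frames is one way to confirm a path-MTU mismatch like this; the target address below is a placeholder, and 8973 bytes is 9001 minus the 28 bytes of IP and ICMP headers:

$ ping -M do -s 8973 -c 3 10.0.2.95

Between two 9001-MTU nodes this should succeed; from or through a 1500-MTU hop it should fail or complain that the message is too long.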

@bgilbert (Member) commented Jun 16, 2018

@dghubble mentioned this issue Jul 4, 2018