Kernel 4.12 breaks non-zero updelay in network bonding driver #2065

Closed
bgilbert opened this issue Jul 20, 2017 · 4 comments

bgilbert (Member) commented Jul 20, 2017

Issue Report

Bug

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1478.0.0
VERSION_ID=1478.0.0
BUILD_ID=2017-07-19-0038
PRETTY_NAME="Container Linux by CoreOS 1478.0.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Packet

Expected Behavior

Working network.

Actual Behavior

Networking is unreliable. The kernel log gets a message every 100 ms:

link status up for interface eth1, enabling it in 200 ms

Reproduction Steps

  1. Boot a Container Linux release with a 4.12 kernel (1465 or above) on a Packet type 1 instance.
  2. If dmesg is not already filling with log messages, do this:
echo -enp1s0f1 | sudo tee /sys/devices/virtual/net/bond0/bonding/slaves
echo +enp1s0f1 | sudo tee /sys/devices/virtual/net/bond0/bonding/slaves
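
To confirm the symptom, the repeating message can be watched directly (a minimal check, not part of the original report; the grep pattern matches the log line quoted above):

dmesg --follow | grep 'link status up for interface'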

Other Information

Problem bisected to torvalds/linux@de77ecd.

Workaround:

echo 0 | sudo tee /sys/devices/virtual/net/bond0/bonding/updelay

This causes:

bond0: Setting up delay to 0
bond0: link status definitely up for interface eth1, 1000 Mbps full duplex
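
Until a kernel with the fix ships, the sysfs workaround above does not survive a reboot. Below is a minimal persistent sketch, assuming the bond is managed by systemd-networkd (option names per systemd.netdev(5)); the file name, bond mode, and monitoring interval are illustrative and not taken from this report. If a .netdev file for bond0 already exists, UpDelaySec=0 belongs in that file instead:

sudo tee /etc/systemd/network/25-bond0.netdev <<'EOF'
[NetDev]
Name=bond0
Kind=bond

[Bond]
# assumption: keep whatever mode the existing bond uses
Mode=802.3ad
# 100 ms MII monitoring, matching the interval seen in the log above
MIIMonitorSec=100ms
# same effect as echoing 0 into /sys/devices/virtual/net/bond0/bonding/updelay
UpDelaySec=0
EOF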

bgilbert (Member, Author) commented Jul 21, 2017

Posted to netdev; no response yet.

bgilbert (Member, Author) commented Jul 29, 2017

Fixed by coreos/linux#74, which will likely be included in alpha 1492.0.0 and beta 1465.3.0.
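
A quick way to check whether a given node already has the fix (a simple sketch using the fields shown in the os-release output above and the releases expected in this comment):

source /etc/os-release && echo "$PRETTY_NAME"   # the fix is expected in alpha 1492.0.0 / beta 1465.3.0
uname -r                                        # running kernel version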

bgilbert closed this Jul 29, 2017

f0 commented Oct 13, 2017

@bgilbert
we have exactly this problem with 1520.6 (coming from 1465.8.0).

internet and bonding are the names of the bonding interfaces.
The system is not stable: it comes up and then breaks.

[  127.468853] 8021q: adding VLAN 0 to HW filter on device mv-internet
[  129.976625] 8021q: adding VLAN 0 to HW filter on device mv-internet
[  130.274563] 8021q: adding VLAN 0 to HW filter on device mv-internet
[  131.479939] internet: link status down for interface eno4, disabling it in 200 ms
[  131.488827] internet: link status down for interface eno3, disabling it in 200 ms
[  131.499935] internet: link status down for interface eno4, disabling it in 200 ms
[  131.508829] internet: link status down for interface eno3, disabling it in 200 ms
[  131.518936] internet: link status down for interface eno4, disabling it in 200 ms
[  131.527811] internet: link status down for interface eno3, disabling it in 200 ms
[  131.551937] bonding: link status down for interface eno2, disabling it in 200 ms
[  131.561933] bonding: link status down for interface eno2, disabling it in 200 ms
[  131.571936] bonding: link status down for interface eno2, disabling it in 200 ms
[  131.687960] bonding: link status down for interface eno1, disabling it in 200 ms
[  131.697958] bonding: link status down for interface eno1, disabling it in 200 ms
[  131.707958] bonding: link status down for interface eno1, disabling it in 200 ms
[  131.717957] bonding: link status down for interface eno1, disabling it in 200 ms
[  131.727959] bonding: link status down for interface eno1, disabling it in 200 ms
[  131.737955] bonding: link status down for interface eno1, disabling it in 200 ms
[  131.740719] igb 0000:01:00.3 eno4: speed changed to 0 for port eno4
[  131.740721] 8021q: adding VLAN 0 to HW filter on device eno4
[  131.746854] internet: link status definitely down for interface eno4, disabling it
[  131.746858] internet: now running without any active interface!
[  131.746873] internet: link status definitely down for interface eno3, disabling it
[  131.785720] bonding: link status definitely down for interface eno2, disabling it
[  131.794622] bonding: now running without any active interface!
[  131.905489] 8021q: adding VLAN 0 to HW filter on device eno3
[  132.022874] 8021q: adding VLAN 0 to HW filter on device eno2
[  132.029961] bonding: link status definitely down for interface eno1, disabling it
[  132.143833] 8021q: adding VLAN 0 to HW filter on device eno1
[  136.179303] igb 0000:01:00.3 eno4: igb: eno4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  136.206285] igb 0000:01:00.1 eno2: igb: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  136.216911] internet: link status up for interface eno4, enabling it in 0 ms
[  136.216917] internet: link status definitely up for interface eno4, 1000 Mbps full duplex
[  136.216937] internet: first active interface up!
[  136.303962] bonding: link status up for interface eno2, enabling it in 0 ms
[  136.312070] bonding: link status definitely up for interface eno2, 1000 Mbps full duplex
[  136.321627] bonding: first active interface up!
[  136.654290] igb 0000:01:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  136.743938] bonding: link status up for interface eno1, enabling it in 200 ms
[  136.878283] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  136.943979] internet: link status up for interface eno3, enabling it in 200 ms
[  136.959993] bonding: link status definitely up for interface eno1, 1000 Mbps full duplex
[  137.159969] internet: link status definitely up for interface eno3, 1000 Mbps full duplex
[  137.731647] 8021q: adding VLAN 0 to HW filter on device mv-internet
[  139.569075] igb 0000:01:00.0 eno1: igb: eno1 NIC Link is Down
[  139.575956] igb 0000:01:00.0 eno1: speed changed to 0 for port eno1
[  139.671972] bonding: link status down for interface eno1, disabling it in 200 ms
[  139.888010] bonding: link status definitely down for interface eno1, disabling it
[  140.420073] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Down
[  140.503969] internet: link status down for interface eno3, disabling it in 200 ms
[  140.576099] igb 0000:01:00.2 eno3: speed changed to 0 for port eno3
[  140.720052] internet: link status definitely down for interface eno3, disabling it
[  142.737319] igb 0000:01:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  142.807955] bonding: link status up for interface eno1, enabling it in 200 ms
[  143.023980] bonding: link status definitely up for interface eno1, 1000 Mbps full duplex
[  143.561310] igb 0000:01:00.2 eno3: igb: eno3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  143.639966] internet: link status up for interface eno3, enabling it in 200 ms
[  143.855952] internet: link status definitely up for interface eno3, 1000 Mbps full duplex
[  154.438086] 8021q: adding VLAN 0 to HW filter on device mv-internet
[  154.669631] 8021q: adding VLAN 0 to HW filter on device mv-internet

euank reopened this Oct 13, 2017
euank closed this Oct 13, 2017
euank reopened this Oct 13, 2017

bgilbert (Member, Author) commented Oct 13, 2017

@f0 This looks like a different problem. In the original bug, the link status up message repeated indefinitely, without any actual link status change. In the log you posted, the underlying interfaces are going down and coming back up. Could you open a new issue for this? Please include the output of lspci.

bgilbert closed this Oct 13, 2017
pothos added a commit to flatcar-linux/mantle that referenced this issue Mar 19, 2020
After coreos/bugs#2065, a test for "excessive bonding link status messages" was introduced, which is also good to keep for coreos/bugs#2374.

However, having this message printed 10 times does not directly indicate an error. The test should check whether something like 'bond0: Gained carrier' or 'link status definitely up for interface enp0s20f0' appears at the end, and then continue. For now, just increase the threshold.

pothos added a commit to flatcar-linux/mantle that referenced this issue Mar 19, 2020
After coreos/bugs#2065, a test for "excessive bonding link status messages" was introduced, which is also good to keep for coreos/bugs#2374.

However, having this message printed 10 times does not directly indicate an error. The test should check whether something like 'bond0: Gained carrier' or 'link status definitely up for interface enp0s20f0' appears at the end, and then continue. Add a second match for these messages that skips the test. Also lower the threshold to see if the logic works well.