Ignition timeout when PXE booting with LACP connected to Cisco Nexus 5k. #2527

Open
CyrusTheVirusG opened this Issue Nov 30, 2018 · 16 comments

CyrusTheVirusG commented Nov 30, 2018

Issue Report

Ignition fails to wait for network to become ready.

This is problematic for an operator with hosts connected to Cisco Nexus 5k ports configured for LACP, because the ports must negotiate. Even with fast timers and port fast enabled, it takes around 70 seconds for the interfaces to receive an IP address.

The problem is that the server does not attempt to negotiate LACP until it can be configured to do so, so you must wait for LACP to time out.

Remove LACP and everything works as expected, kind of. The PXE boot guide leaves a lot to be desired, but locksmith errors aside, it works.

Bug

Ignition has already given up and rebooted the system in less than 62 seconds, which is the fastest I've seen an IP address acquired in a bonded configuration. Each reboot is a painful 5-minute wait while troubleshooting this issue, btw.

Container Linux Version

Container Linux 1911.4.0.

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
...
BUG_REPORT_URL="https://issues.coreos.com"

Environment

What hardware/cloud provider/hypervisor is being used to run Container Linux?
Bare metal: Supermicro server with Intel NICs.

Expected Behavior

If it waited the 90 seconds that is supposedly the default (as I saw in another issue that I could not find again), or waited until the host has an IP address, there would be no issue.

Actual Behavior

Complete craziness: knowing that it has no IP address and no network connectivity, it repeatedly attempts to download a file which in fact requires a network connection and gives up very shortly before it is possible, like 5-10 seconds or so.

Reproduction Steps

  1. Connect two interfaces to a Cisco 5k with channel-group mode active configured.
    int eth1/1
    lacp rate fast
    channel-group 100 mode active

    interface port-channel100
    no lacp graceful-convergence
    spanning-tree port type edge

You could try messing with the switch config, but no combination of settings makes it work; lacp rate fast and no lacp graceful-convergence with spanning tree set to edge buys you 10 seconds or so.

  2. Set up a PXE server and ensure it works.
  3. Wait for the issue to occur after attempting to boot.

Other Information

I also attempted to set the IP address in the kernel boot options, but the system halted because "the Transport endpoint is not connected".
Probably because of LACP negotiation, but who knows. This may not be an issue in your "development" environment, but in real-world bare-metal cases it is definitely a problem.

CyrusTheVirusG commented Nov 30, 2018

A possible fix would be to add a kernel option to specify interfaces that should be bonded, which would allow for proper LACP negotiation on startup and avoid the LACP timeout problem.

CyrusTheVirusG commented Nov 30, 2018

Some more details: I think Ignition believes that the interface gained IPv6 connectivity when it did not.

[ 2.373696] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
It's around this time that the download first tries to take place.

CyrusTheVirusG commented Nov 30, 2018

I suppose a custom OEM would probably also solve this; I should have scrolled to the bottom before spending hours troubleshooting this.

CyrusTheVirusG commented Nov 30, 2018

Custom OEM is the way to go if you're having this issue as well. This should probably still be addressed and/or the documentation should be updated.

ajeddeloh commented Nov 30, 2018

If it's fetching things over http, you can adjust the timeouts in the ignition.timeouts section. Would that fix your problem or is it timing out fetching the initial config?

CyrusTheVirusG commented Nov 30, 2018

It is timing out fetching the initial config.

ajeddeloh commented Dec 1, 2018

As an (ugly) workaround you could use a data url for the initial config (assuming you're using coreos.config.url on the kernel command line) to just be a stub config that sets the timeouts and fetches the real config using the ignition.config.replace section.
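
A minimal sketch of how such a stub could be generated, assuming the Ignition 2.x config layout (`ignition.timeouts.httpResponseHeaders`, `ignition.timeouts.httpTotal`, `ignition.config.replace.source`). The URL, the 30-second value, the spec version string, and the helper names are illustrative assumptions, not values from this thread; whether the stub's timeouts govern the fetch of the replacement config is as described in the comment above and has not been verified here.

```python
import base64
import json

# Hypothetical values: adjust the URL and timeout for your environment.
REAL_CONFIG_URL = "http://pxe.example.com/ignition/real-config.ign"
HEADER_TIMEOUT_SECONDS = 30


def build_stub_config(real_config_url, header_timeout):
    """Return a stub Ignition config (assumed spec 2.x layout) as a dict."""
    return {
        "ignition": {
            "version": "2.2.0",  # assumed spec version
            "timeouts": {
                # Seconds to wait for response headers on each request.
                "httpResponseHeaders": header_timeout,
                # 0 means no overall limit across retries.
                "httpTotal": 0,
            },
            # Replace this stub with the real config fetched over HTTP.
            "config": {"replace": {"source": real_config_url}},
        }
    }


def to_data_url(config):
    """Encode the config as a base64 data URL for the kernel command line."""
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return "data:;base64," + base64.b64encode(raw).decode("ascii")


stub = build_stub_config(REAL_CONFIG_URL, HEADER_TIMEOUT_SECONDS)
print("coreos.config.url=" + to_data_url(stub))
```

Base64 is used here only to keep JSON quotes and braces off the kernel command line; the printed argument would go into the PXE boot entry in place of the original `coreos.config.url` value.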

Ignition on CL depends on network.target. What the network being up means is somewhat nebulous (see https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/). This is why we retry instead of trying to detect if network is up.

It's hard to determine what a "reasonable" timeout is. I'd agree that ~60s (what it currently is) is perhaps a little short. Do you think increasing it to ~120s would fix your problem?

CyrusTheVirusG commented Dec 1, 2018

90 seconds would be sufficient, 120 would definitely fix it.

It takes 30 seconds for LACP to time out; spanning tree hello and forwarding take around 17 seconds without port fast enabled, and then the DHCP client has to do its thing.

I'm unsure, but I believe spanning tree has to do its cycle once more after the LACP timeout, since the port falls back into an individual state. I could be wrong, though; I haven't gone that far into it.

CyrusTheVirusG commented Dec 1, 2018

I'll time the Intel PXE boot agent to see what it uses; they probably have it figured out pretty well.

CyrusTheVirusG commented Dec 1, 2018

They are using ~73 seconds, probably 75.

CyrusTheVirusG commented Dec 1, 2018

With port-fast I get an IP at ~62 seconds and without it at ~72 seconds. The Intel agent always works, though, so maybe 75 is the magic number.
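
As a rough check of the numbers in this thread, the sketch below compares the observed address-acquisition times (~62 s and ~72 s) against the timeouts that have been mentioned (~60 s current, 75 s, 90 s, 120 s). The function name is hypothetical, and the measurements are approximate and from a single environment.

```python
# Approximate seconds until an IP address was acquired, as reported above.
observed = {"with port-fast": 62, "without port-fast": 72}

# Timeouts discussed in this thread, in seconds.
candidates = [60, 75, 90, 120]


def covering_timeouts(observed_seconds, candidate_timeouts):
    """Return the candidate timeouts that exceed the slowest observed case."""
    worst = max(observed_seconds.values())
    return [t for t in candidate_timeouts if t > worst]


print(covering_timeouts(observed, candidates))  # prints [75, 90, 120]
```

75 seconds leaves only about 3 seconds of margin over the slower case, which is one reason the larger values may be safer.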

Member

lucab commented Dec 9, 2018

it repeatedly attempts to download a file which in fact requires a network connection and gives up very shortly before it is possible, like 5-10 seconds or so.

@CyrusTheVirusG can you please attach the specific log entries showing this (and the whole boot logs if possible)? This doesn't match my expectations and I'm worried there may be something else going on.

@ajeddeloh which 60s timeout are you thinking about?

DirkTheDaring commented Dec 9, 2018

This is the journal file from before I tampered with the boot. Basically, I added nsswitch.conf/resolv.conf to resolve the name servers, because it initially failed with not finding a nameserver. I was completely unaware that the interfaces did not come up. So the "error" at this point is not finding the nameserver, but the root cause is still that the interfaces are not up and the system gives up too early. Journal.txt attached.

journal.txt

DirkTheDaring commented Dec 9, 2018

One additional point: I also mount a CA certificate. That is the only thing I have always mounted, as I need my self-signed CA in order to make SSL work. This used to work about 8 months ago.
The exact process is: create /etc/ssl/certs/mycert.pem and put it into a "cert.cpio.gz", which gets mounted during boot. This used to work on my machine (TM) :)

Member

lucab commented Dec 9, 2018

Ah, never mind, I was mixing up the client HTTP total timeout (which is unlimited) with the config fetching timeout (which is one minute).

lucab added a commit to lucab/ignition that referenced this issue Dec 10, 2018

internal/exec: increase default config fetch timeout
This bumps default config timeout to 2 minutes in order to be more
resilient in environments where network stabilization may take a long
time.

Ref: coreos/bugs#2527
Ref: coreos/bugs#2532

ajeddeloh commented Dec 10, 2018

Err, it's actually not one minute, it's 10.1 + 10.2 + 10.4 + 10.8 + 11.6 + 13.2 = 66.3 seconds (although the last request would start at t=53.1s). Each HTTP attempt times out after 10 seconds, then there is a backoff between attempts that increases exponentially. This really ought to be simplified, since the exponential backoff is dwarfed by the HTTP timeout, but for now we could probably increase the max backoff to 30 sec or something similar.
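
The schedule described above can be tallied with a short sketch. It assumes six attempts that each hit the 10-second HTTP timeout, followed by a backoff that starts at 0.1 s and doubles each time, which is the pattern implied by the terms in the sum. The function and constant names are hypothetical and this is not Ignition's actual code. Summing the six terms as listed gives about 66.3 s, with the sixth attempt starting near t=53.1 s.

```python
HTTP_TIMEOUT = 10.0    # seconds per attempt, per the comment above
INITIAL_BACKOFF = 0.1  # seconds; first backoff implied by the 10.1 term
ATTEMPTS = 6           # number of terms in the sum above


def retry_schedule(attempts, http_timeout, initial_backoff):
    """Return (start_times, total_elapsed) assuming every attempt times out."""
    starts = []
    elapsed = 0.0
    backoff = initial_backoff
    for _ in range(attempts):
        starts.append(elapsed)
        # Each failed attempt costs the full HTTP timeout plus its backoff.
        elapsed += http_timeout + backoff
        backoff *= 2  # exponential backoff: 0.1, 0.2, 0.4, ...
    return starts, elapsed


starts, total = retry_schedule(ATTEMPTS, HTTP_TIMEOUT, INITIAL_BACKOFF)
print([round(s, 1) for s in starts])  # prints [0.0, 10.1, 20.3, 30.7, 41.5, 53.1]
print(round(total, 1))                # prints 66.3
```

The backoff terms add up to only 6.3 s of the total, which illustrates the point that the HTTP timeout dominates the schedule.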
