Ignition timeout when PXE booting with LACP connected to Cisco Nexus 5k. #2527
Ignition fails to wait for network to become ready.
This is a problem for operators whose hosts connect to Cisco Nexus 5k ports configured for LACP, because the ports must negotiate first. Even with fast timers and port fast enabled, it takes around 70 seconds for the interfaces to receive an IP address.
The underlying problem is that the server does not attempt to negotiate LACP until it has been configured to do so, so you must wait for LACP to time out.
Remove LACP and everything works as expected, more or less. The PXE boot guide leaves a lot to be desired, but locksmith errors aside, it works.
In less than 62 seconds, the fastest I've seen an IP address acquired in a bonded configuration, Ignition has already given up and rebooted the system, which means a painful 5 minute wait on every attempt while troubleshooting this issue.
Container Linux Version
Container Linux 1911.4.0.
What hardware/cloud provider/hypervisor is being used to run Container Linux?
If Ignition waited the 90 seconds that is supposedly the default (as I saw in another issue that I cannot find again), or waited until the host has an IP address, there would be no problem.
It makes no sense: knowing that it has no IP address and no network connectivity, Ignition repeatedly attempts to download a file that requires a network connection, then gives up roughly 5-10 seconds before the network becomes available.
You could try tweaking the switch config, but no combination of settings makes it work; lacp rate fast and no graceful-convergence with spanning tree set to edge buys you 10 seconds or so.
I also attempted to set the IP address in the kernel boot options but the system halted because "the Transport endpoint is not connected".
As an (ugly) workaround you could use a data url for the initial config (assuming you're using
Ignition on CL depends on
It's hard to determine what a "reasonable" timeout is. I'd agree that ~60s (what it currently is) is perhaps a little short. Do you think increasing it to ~120s would fix your problem?
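The data URL workaround mentioned in this comment could be sketched roughly as follows. This is only an illustration of embedding a config inline (RFC 2397 style) so that no network fetch is needed for the initial config; the helper name `to_data_url` and the minimal config with spec version 2.2.0 are assumptions, not something taken from this thread.

```python
import base64
import json


def to_data_url(config: dict) -> str:
    """Encode a config dict as a base64 data URL (RFC 2397).

    Hypothetical helper for illustration; not part of Ignition.
    """
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return "data:;base64," + base64.b64encode(raw).decode("ascii")


# Assumption: a minimal config using spec version 2.2.0.
minimal_config = {"ignition": {"version": "2.2.0"}}
data_url = to_data_url(minimal_config)
print(data_url)
```

The resulting string can be used wherever a config URL is accepted, so the first config is read without touching the network; anything that config then fetches remotely is still subject to the same timeout.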
90 seconds would be sufficient, 120 would definitely fix it.
It takes 30 seconds for LACP to time out, spanning tree hello and forwarding take around 17 seconds without port fast enabled, and then the DHCP client has to do its thing.
I'm unsure, but I believe spanning tree has to run its cycle once more after the LACP timeout, since the port falls back into an individual state. I could be wrong though; I haven't gone that far into it.
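As a rough sanity check, the figures quoted above can be added up and compared against the candidate timeouts. This is a back-of-the-envelope sketch, not a measurement: the DHCP allowance of 10 seconds is an assumption (no DHCP figure is given in this thread), and the second spanning tree cycle is the unconfirmed guess from the comment above.

```python
# Figures quoted in this thread.
LACP_TIMEOUT_S = 30      # time for LACP to time out
STP_CONVERGENCE_S = 17   # spanning tree hello + forwarding, no port fast

# Assumption: allowance for the DHCP exchange, not measured here.
DHCP_ALLOWANCE_S = 10


def time_to_address(stp_cycles: int = 1) -> int:
    """Estimated seconds until the host has an IP address."""
    return LACP_TIMEOUT_S + stp_cycles * STP_CONVERGENCE_S + DHCP_ALLOWANCE_S


def fits(timeout_s: int, stp_cycles: int = 1) -> bool:
    """True if the estimate fits within the given timeout."""
    return time_to_address(stp_cycles) <= timeout_s


for cycles in (1, 2):
    estimate = time_to_address(cycles)
    verdict = {t: fits(t, cycles) for t in (60, 90, 120)}
    print(f"{cycles} STP cycle(s): ~{estimate}s, fits: {verdict}")
```

Under these assumptions the estimate lands between roughly 57 and 74 seconds, which brackets the 62 to 70 seconds observed and is consistent with ~60s being too tight while 90s or 120s leaves headroom.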
referenced this issue
Dec 9, 2018
@CyrusTheVirusG can you please attach the specific log entries showing this (and the whole boot logs if possible)? This doesn't match my expectations and I'm worried there may be something else going on.
@ajeddeloh which 60s timeout are you thinking about?
This is the journal file from before I tampered with the boot. Basically, I added nsswitch.conf/resolv.conf to resolve the name servers, because it initially failed to find a nameserver. I was completely unaware that the interfaces did not come up. So the "error" at this point is that the nameserver is not found, but the root cause is still that the interfaces are not up and the system gives up too early. Journal.txt attached.
One additional point: I also mount a CA certificate. That is the only thing I have always mounted, since I need my self-signed CA to make SSL work. This used to work about 8 months ago.
added a commit
Dec 10, 2018
referenced this issue
Dec 10, 2018
Err, it's actually not one minute, it's