Regression in v0.15.1: installation of cluster does not work due to CAPI nodeStartupTimeout #5637
To reproduce, one could simply stop the server's boot after the reboot in the workflow and wait until the 10-minute timeout fires in the capi-controller-manager. At that point, the server's successful Tinkerbell workflow gets deleted and a new one is created in a pending state... As a result, the installation never finishes and the create cluster command reaches its timeout as well.
Hi @Cajga, thanks for opening the issue.
I don't think we changed that logic. Also, can you provide a few more details about what you were trying to do?
Ideally, when performing cluster operations like create or upgrade, the Machine Health Checks should be paused. I can look into whether that's actually happening.
Thanks for looking into this. I am using Ubuntu with sda, but it does reboot. To be honest, I have been using the same hardware with the same config since the closed beta of EKS Anywhere Bare Metal. The relevant code seems to be this: And the behavior (that you mentioned) was changed with the following commit/diff, two months ago (possibly this got into release v0.15.0?). With this change, the code always uses reboot, regardless of whether the host has sda or nvme: To answer your questions:
Thanks for the information @Cajga! A few more questions:
Please also note that, in the meantime, I opened an enterprise support ticket regarding this issue as well.
Hi @mitalipaygude, thanks for pointing that flag out. We were looking for it in the help output of the command but not in the code :) Indeed, using
@Cajga There are some changes going into upstream projects that will help mitigate the root issue. Unfortunately, I don't have a timeline on them making it into EKS-A.
Hi @chrisdoherty4, thanks for the links/info. I believe the root cause of this issue (NOTE: this is not the "BMC timeout" issue!) is the really slow boot process of our old hosts. Even when I used an IPMI simulator (which answers rufio basically instantly, so the BMC task succeeds immediately), I ran into the NodeStartupTimeout when it is set to 10m (the default)... In any case, I can test when they make it into EKS-A.
Closing this issue out @Cajga. |
@mitalipaygude, thanks a lot for the prompt support. I've seen the help message as well. |
What happened:
It seems that from v0.15.1, the node gets rebooted in hook instead of using kexec by default. Because of this, our old (and big) servers hit the default 10-minute nodeStartupTimeout in CAPI during installation, and the CAPI manager deletes the machine. This results in a new Tinkerbell workflow being created while the server is still booting up (after a successful Tinkerbell workflow!).
Error message in the capi-controller-manager in the capi-system namespace:
Please increase the CAPI nodeStartupTimeout to at least 20m, or disable it as recommended by the CAPI documentation.
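For illustration, a longer nodeStartupTimeout is configured on the CAPI MachineHealthCheck object. The fragment below is a minimal sketch; the names (my-cluster, eksa-system) and the unhealthy-condition timeouts are placeholder assumptions, not values taken from an actual EKS-A cluster:

```yaml
# Sketch of a CAPI MachineHealthCheck with a raised nodeStartupTimeout.
# Object/namespace names are placeholders.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: my-cluster-worker-unhealthy   # placeholder name
  namespace: eksa-system              # assumed namespace
spec:
  clusterName: my-cluster
  nodeStartupTimeout: 20m             # raised from the 10m default discussed above
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: my-cluster
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

Per the CAPI documentation, setting nodeStartupTimeout to 0 disables the startup check entirely, which avoids the machine deletion on slow-booting hardware at the cost of never remediating nodes that genuinely fail to join.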
What you expected to happen:
Bare Metal installation works on nodes where startup takes a long time, without needing to switch to kexec in the tinkerbelltemplate.
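For context, switching the final action of an EKS Anywhere TinkerbellTemplateConfig from a reboot to kexec looks roughly like the fragment below. This is a sketch only: the action image reference, environment keys, and device path are illustrative assumptions, not values confirmed in this thread; check your distribution's action bundle for the real ones.

```yaml
# Sketch: final action of a TinkerbellTemplateConfig using kexec
# instead of reboot. Image reference and env keys are assumptions.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellTemplateConfig
metadata:
  name: my-machine-template          # placeholder name
spec:
  template:
    tasks:
      - name: my-task
        worker: "{{.device_1}}"
        actions:
          # ... earlier image-streaming and write-file actions elided ...
          - name: kexec-into-installed-os
            image: kexec:latest      # placeholder image reference
            timeout: 90
            pid: host
            environment:
              BLOCK_DEVICE: /dev/sda2   # assumed root partition
              FS_TYPE: ext4             # assumed filesystem
```

The trade-off discussed in this issue: kexec boots the installed OS without going back through firmware POST and PXE, so slow-booting hardware can register with the cluster well inside the CAPI nodeStartupTimeout.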
How to reproduce it (as minimally and precisely as possible):
You need a server that is slow to boot (many network cards to PXE through, lots of hardware so POST is slow, etc.) and then try to install EKS-A Bare Metal on it.
Anything else we need to know?:
Environment: