
Regression in v0.15.1: installation of cluster does not work due to CAPI nodeStartupTimeout #5637

Closed
Cajga opened this issue Apr 14, 2023 · 12 comments
Labels: area/providers/tinkerbell, external, team/providers

Comments

@Cajga
Contributor

Cajga commented Apr 14, 2023

What happened:
It seems that since v0.15.1 the node gets rebooted in Hook by default instead of using kexec. Because of this, our old (and large) servers exceed the default 10-minute nodeStartupTimeout in CAPI during installation, and the CAPI manager deletes the machine, which results in a new Tinkerbell workflow being created while the server is still booting up (after a successful Tinkerbell workflow!).

Error message in the capi-controller-manager in the capi-system namespace:

I0414 13:43:15.004013       1 machinehealthcheck_controller.go:431] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="eksa-system/eks-a-bm-poc-worker01-worker-unhealthy" namespace="eksa-system" name="eks-a-bm-poc-worker01-worker-unhealthy" reconcileID=3289aa16-1020-4ff2-8394-25bfb3e23866 Cluster="eksa-system/eks-a-bm-poc" target="eksa-system/eks-a-bm-poc-worker01-worker-unhealthy/eks-a-bm-poc-worker01-5fc6799fcc-78vq9/" reason="NodeStartupTimeout" message="Node failed to report startup in 10m0s"

Please increase the CAPI nodeStartupTimeout to at least 20m or disable it as recommended by the CAPI documentation.
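For reference, this timeout is the nodeStartupTimeout field on the CAPI MachineHealthCheck object. A rough sketch for inspecting and bumping it by hand on a running cluster (resource name and namespace taken from the log above; whether a manual patch survives reconciliation by the EKS-A controllers is an assumption on my part):

# Show the current timeout on the generated MachineHealthCheck
kubectl -n eksa-system get machinehealthcheck eks-a-bm-poc-worker01-worker-unhealthy -o jsonpath='{.spec.nodeStartupTimeout}'

# Raise it to 20m with a merge patch (may be reverted by the EKS-A controllers)
kubectl -n eksa-system patch machinehealthcheck eks-a-bm-poc-worker01-worker-unhealthy --type merge -p '{"spec":{"nodeStartupTimeout":"20m"}}'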

What you expected to happen:
Bare Metal installation works on nodes where startup takes a long time, without having to switch to kexec in the Tinkerbell template.

How to reproduce it (as minimally and precisely as possible):
You need a server that is slow to boot up (many network cards to PXE, lots of hardware so POST is slow, etc.) and try to install EKS-A Bare Metal on it.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.15.1
  • EKS Distro Release: 1.25
@Cajga
Contributor Author

Cajga commented Apr 16, 2023

To reproduce, one could simply stop the server's boot after the reboot in the workflow and wait until the 10-minute timeout occurs in the capi-controller-manager. At that point, the server's successful Tinkerbell workflow gets deleted and a new one gets created in a pending state...

As such, the installation never finishes and the create cluster command reaches its timeout as well.

@abhinavmpandey08
Member

Hi @Cajga, thanks for opening the issue.

Seems from v0.15.1 the node gets rebooted in hook instead of using kexec by default

I don't think we changed that logic in v0.15.1. If you are using Ubuntu with sda, it should use kexec. If using nvme or Bottlerocket, it will default to reboot.

Also, can you provide a few more details about what you were trying to do?

  1. Did you see this error during create or upgrade?
  2. Were you trying this from the CLI (eksctl anywhere create/upgrade command) or the controller?
  3. Is this for management or workload cluster?

Ideally, when performing cluster operations like create or upgrade, the Machine Health Checks should be paused. I can look into whether that's actually happening.
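(A quick way to check that from the cluster side; just a sketch, with resource names taken from the log earlier in this issue. CAPI skips MachineHealthCheck remediation when the owning Cluster is paused or when the object carries the cluster.x-k8s.io/paused annotation.)

# Is the CAPI Cluster paused?
kubectl -n eksa-system get clusters.cluster.x-k8s.io eks-a-bm-poc -o jsonpath='{.spec.paused}'

# Does the MachineHealthCheck carry the paused annotation?
kubectl -n eksa-system get machinehealthcheck eks-a-bm-poc-worker01-worker-unhealthy -o jsonpath='{.metadata.annotations.cluster\.x-k8s\.io/paused}'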

@abhinavmpandey08 abhinavmpandey08 added area/providers/tinkerbell Tinkerbell provider related tasks and issues external An issue, bug or feature request filed from outside the AWS org team/providers labels Apr 17, 2023
@Cajga
Contributor Author

Cajga commented Apr 17, 2023

Hi @abhinavmpandey08,

Thanks for looking into this.

I am using Ubuntu with sda, but it does reboot. To be honest, I have been using the same HW with the same config since the closed beta of EKS Anywhere Bare Metal.

The relevant code seems to be this:
https://github.com/aws/eks-anywhere/blob/main/pkg/api/v1alpha1/tinkerbelltemplateconfig_defaults.go#L96

And the behavior (that you mentioned) was changed with the following commit/diff two months ago (possibly this got into release v0.15.0?). With this change the code always uses reboot, regardless of whether the host has sda or nvme:
71d5e1a#diff-c53922cd61c7c7232821214164c88b935da003c028dd8597a7a07b69d0feea67L67-L80
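One way to see what the rendered template actually contains on a given cluster (a sketch; the Tinkerbell CRD name and namespace are assumptions based on the upstream Tinkerbell stack that EKS-A deploys):

# Look for a kexec or reboot action in the rendered Tinkerbell template
kubectl -n eksa-system get templates.tinkerbell.org -o yaml | grep -iE 'kexec|reboot'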

To answer your questions:

  1. I was trying to re-create my cluster
  2. I was using eksctl anywhere create
  3. We are going with the multiple standalone clusters approach (managed by GitOps from a single repo), so you can call it a management cluster, I believe

@abhinavmpandey08
Member

Thanks for the information @Cajga!
I see we updated the workflow to always reboot now instead of kexec. You can find more details about it in the PR description here: #4882

A few more questions:

  • How long does your hardware generally take to fully provision?
  • Are you using BMC with Rufio to provision the servers?
  • Can you provide the full CLI logs (ideally with -v6)?

@Cajga
Contributor Author

Cajga commented Apr 19, 2023

Hi @abhinavmpandey08,

  1. I think, with the reboot, this server would need around 15 minutes to join the cluster. I have another type in use as well which barely makes it within 10 minutes (sometimes it works, sometimes it does not)
  2. This is a bit complex, but in short: I am using BMC with Rufio. In long: I have an open issue with AWS Enterprise Support, as it seems Rufio times out on the "stop task" when my node is used as a worker node (this also started with v0.15.1; it worked well up to v0.14.6). To work around this problem, I am using ipmi_sim (to make Rufio happy) and I handle the BMC manually when a Tinkerbell task happens.
  3. I am attending KubeCon this week from HO. I will be able to provide the output next week when I am back in the office.

Please also note that in the meantime, I opened an enterprise support ticket regarding this issue as well.

@mitalipaygude
Member

Hello @Cajga,

We have a flag here for overriding the node startup timeout. We are updating our documentation and the usage for the create and upgrade commands. Let me know if this flag helps, and then I can close out this issue.
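For example, usage along these lines (a sketch; the 20m value and the -f config-file argument are illustrative):

# Override the CAPI node startup timeout when creating or upgrading a cluster
eksctl anywhere create cluster -f cluster.yaml --node-startup-timeout 20m
eksctl anywhere upgrade cluster -f cluster.yaml --node-startup-timeout 20m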

@Cajga
Contributor Author

Cajga commented May 2, 2023

Hi @mitalipaygude,

Thanks for pointing that flag out. We were looking for it in the command's help output but not in the code :)

Indeed, using eksctl anywhere create cluster --node-startup-timeout 20m ... works well. We are also happy to see that the default is going to change to 20m so we would not need this: #5755.

@chrisdoherty4
Member

chrisdoherty4 commented May 2, 2023

@Cajga There are some changes going into upstream projects that will help mitigate the root issue. Unfortunately I don't have a timeline on them making it to EKS-A.

@Cajga
Contributor Author

Cajga commented May 2, 2023

Hi @chrisdoherty4 ,

Thanks for the links/info.

I believe that the root cause of this issue (NOTE: this is not the "BMC timeout" issue!) is the really slow boot process of our old hosts. Even when I used the IPMI simulator (which basically answers Rufio instantly, so the BMC task succeeds immediately), I ran into the NodeStartupTimeout when it is set to 10m (the default)...

In any case, I can test when they make it to EKS-A.

@mitalipaygude
Member

Hello @Cajga, here's the PR that adds the timeout option to the command help; we also called it out in the docs.

#5763

@mitalipaygude
Member

Closing this issue out @Cajga.

@Cajga
Contributor Author

Cajga commented May 15, 2023

@mitalipaygude, thanks a lot for the prompt support. I've seen the help message as well.
