
Regression in v0.15.1: installation of cluster does not work due to CAPI nodeStartupTimeout #5637

Closed
Cajga opened this issue Apr 14, 2023 · 12 comments
Labels: area/providers/tinkerbell, external, team/providers

Comments

@Cajga
Contributor

Cajga commented Apr 14, 2023

What happened:
It seems that since v0.15.1 the node gets rebooted in Hook by default instead of using kexec. Because of this, our old (and large) servers exceed the default 10-minute nodeStartupTimeout in CAPI during installation, and the CAPI manager deletes the machine, which results in a new Tinkerbell workflow being created while the server is still booting up (after a successful Tinkerbell workflow!).

Error message in the capi-controller-manager in the capi-system namespace:

I0414 13:43:15.004013       1 machinehealthcheck_controller.go:431] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="eksa-system/eks-a-bm-poc-worker01-worker-unhealthy" namespace="eksa-system" name="eks-a-bm-poc-worker01-worker-unhealthy" reconcileID=3289aa16-1020-4ff2-8394-25bfb3e23866 Cluster="eksa-system/eks-a-bm-poc" target="eksa-system/eks-a-bm-poc-worker01-worker-unhealthy/eks-a-bm-poc-worker01-5fc6799fcc-78vq9/" reason="NodeStartupTimeout" message="Node failed to report startup in 10m0s"

Please increase the CAPI nodeStartupTimeout to at least 20m or disable it as recommended by the CAPI documentation.
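For reference, this timeout is the nodeStartupTimeout field on the CAPI MachineHealthCheck object. A rough sketch for inspecting and bumping it by hand on a running cluster (resource name and namespace taken from the log above; whether a manual patch survives reconciliation by the EKS-A controllers is an assumption on my part):

# Show the current timeout on the generated MachineHealthCheck
kubectl -n eksa-system get machinehealthcheck eks-a-bm-poc-worker01-worker-unhealthy -o jsonpath='{.spec.nodeStartupTimeout}'

# Raise it to 20m with a merge patch (may be reverted by the EKS-A controllers)
kubectl -n eksa-system patch machinehealthcheck eks-a-bm-poc-worker01-worker-unhealthy --type merge -p '{"spec":{"nodeStartupTimeout":"20m"}}'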

What you expected to happen:
Bare Metal installation works on nodes where startup takes a long time, without having to switch to kexec in the Tinkerbell template.

How to reproduce it (as minimally and precisely as possible):
You need a server that is slow to boot up (many network cards to PXE, lots of hardware so POST is slow, etc.) and try to install EKS-A Bare Metal on it.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.15.1
  • EKS Distro Release: 1.25
@Cajga
Contributor Author

Cajga commented Apr 16, 2023

To reproduce, one could simply stop the server's boot after the reboot in the workflow and wait until the 10-minute timeout occurs in the capi-controller-manager. At that point, the server's successful Tinkerbell workflow gets deleted and a new one gets created in a pending state...

As such, the installation never finishes and the create cluster command reaches its timeout as well.

@abhinavmpandey08
Member

Hi @Cajga, thanks for opening the issue.

Seems from v0.15.1 the node gets rebooted in hook instead of using kexec by default

I don't think we changed that logic in v0.15.1. If you are using Ubuntu with sda, it should use kexec. If using nvme or Bottlerocket, it will default to reboot.

Also, can you provide a few more details about what you were trying to do?

  1. Did you see this error during create or upgrade?
  2. Were you trying this from the CLI (eksctl anywhere create/upgrade command) or the controller?
  3. Is this for management or workload cluster?

Ideally, when performing cluster operations like create or upgrade, the Machine Health Checks should be paused. I can look into whether that's actually happening.
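(A quick way to check that from the cluster side; just a sketch, with resource names taken from the log earlier in this issue. CAPI skips MachineHealthCheck remediation when the owning Cluster is paused or when the object carries the cluster.x-k8s.io/paused annotation.)

# Is the CAPI Cluster paused?
kubectl -n eksa-system get clusters.cluster.x-k8s.io eks-a-bm-poc -o jsonpath='{.spec.paused}'

# Does the MachineHealthCheck carry the paused annotation?
kubectl -n eksa-system get machinehealthcheck eks-a-bm-poc-worker01-worker-unhealthy -o jsonpath='{.metadata.annotations.cluster\.x-k8s\.io/paused}'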

@abhinavmpandey08 abhinavmpandey08 added area/providers/tinkerbell Tinkerbell provider related tasks and issues external An issue, bug or feature request filed from outside the AWS org team/providers labels Apr 17, 2023
@Cajga
Contributor Author

Cajga commented Apr 17, 2023

Hi @abhinavmpandey08,

Thanks for looking into this.

I am using Ubuntu with sda, but it does reboot. To be honest, I have been using the same HW with the same config since the closed beta of EKS Anywhere Bare Metal.

The relevant code seems to be this:
https://github.com/aws/eks-anywhere/blob/main/pkg/api/v1alpha1/tinkerbelltemplateconfig_defaults.go#L96

And the behavior (that you mentioned) was changed with the following commit/diff two months ago (possibly this got into release v0.15.0?). With this change the code always uses reboot, regardless of whether the host has sda or nvme:
71d5e1a#diff-c53922cd61c7c7232821214164c88b935da003c028dd8597a7a07b69d0feea67L67-L80
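One way to see what the rendered template actually contains on a given cluster (a sketch; the Tinkerbell CRD name and namespace are assumptions based on the upstream Tinkerbell stack that EKS-A deploys):

# Look for a kexec or reboot action in the rendered Tinkerbell template
kubectl -n eksa-system get templates.tinkerbell.org -o yaml | grep -iE 'kexec|reboot'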

To answer your questions:

  1. I was trying to re-create my cluster
  2. I was using eksctl anywhere create
  3. We are going with the multiple standalone clusters approach (managed by GitOps from a single repo), so you can call it a management cluster, I believe

@abhinavmpandey08
Member

Thanks for the information @Cajga!
I see we updated the workflow to always reboot now instead of kexec. You can find more details about it in the PR description here: #4882

A few more questions:

  • How long does your hardware generally take to fully provision?
  • Are you using BMC with Rufio to provision the servers?
  • Can you provide the full CLI logs (ideally with -v6)?

@Cajga
Contributor Author

Cajga commented Apr 19, 2023

Hi @abhinavmpandey08,

  1. I think, with the reboot, this server would need around 15 minutes to join the cluster. I have another type in use as well which barely makes it within 10 minutes (sometimes it works, sometimes it does not)
  2. This is a bit complex, but in short: I am using BMC with Rufio. In long: I have an open issue with AWS Enterprise Support, as it seems Rufio times out on the "stop task" when my node is used as a worker node (this also started with v0.15.1; it worked well up to v0.14.6). To work around this problem, I am using ipmi_sim (to make Rufio happy) and I handle the BMC manually when a Tinkerbell task happens.
  3. I am attending KubeCon this week from HO. I will be able to provide the output next week when I am back in the office.

Please also note that in the meantime, I opened an enterprise support ticket regarding this issue as well.

@mitalipaygude
Member

Hello @Cajga,

We have a flag here for overriding the node startup timeout. We are updating our documentation and the usage for the create and upgrade commands. Let me know if this flag helps, and then I can close out this issue.
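For example, usage along these lines (a sketch; the 20m value and the -f config-file argument are illustrative):

# Override the CAPI node startup timeout when creating or upgrading a cluster
eksctl anywhere create cluster -f cluster.yaml --node-startup-timeout 20m
eksctl anywhere upgrade cluster -f cluster.yaml --node-startup-timeout 20m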

@Cajga
Contributor Author

Cajga commented May 2, 2023

Hi @mitalipaygude,

Thanks for pointing that flag out. We were looking for it in the command's help output but not in the code :)

Indeed, using eksctl anywhere create cluster --node-startup-timeout 20m ... works well. We are also happy to see that the default is going to change to 20m so we would not need this: #5755.

@chrisdoherty4
Member

chrisdoherty4 commented May 2, 2023

@Cajga There are some changes going into upstream projects that will help mitigate the root issue. Unfortunately I don't have a timeline on them making it to EKS-A.

@Cajga
Contributor Author

Cajga commented May 2, 2023

Hi @chrisdoherty4 ,

Thanks for the links/info.

I believe that the root cause of this issue (NOTE: this is not the "BMC timeout" issue!) is the really slow boot process of our old hosts. Even when I used the IPMI simulator (which basically answers Rufio instantly, so the BMC task succeeds immediately), I ran into the NodeStartupTimeout when it is set to 10m (the default)...

In any case, I can test when they make it to EKS-A.

@mitalipaygude
Member

Hello @Cajga, here's the PR that adds the timeout option to the command help; we also called it out in the docs.

#5763

@mitalipaygude
Member

Closing this issue out @Cajga.

@Cajga
Contributor Author

Cajga commented May 15, 2023

@mitalipaygude, thanks a lot for the prompt support. I've seen the help message as well.
