BottleRocket NVIDIA EKS Node group wont join EKS Cluster #3959

asluborski · 2024-05-14T20:54:31Z

Platform I'm building on:
BOTTLEROCKET_x86_64_NVIDIA w/ p2.xlarge through AWS console
Kubernetes 1.29

What I expected to happen:
When I create a node group from the Compute tab of the EKS cluster console using BOTTLEROCKET_x86_64_NVIDIA with p2.xlarge, the nodes should join the EKS cluster once creation finishes. I have not changed anything else, and the other OS images, such as AL2 with GPU, join the cluster fine.

What actually happened:

Got the message:
NodeCreationFailure | Instances failed to join the kubernetes cluster

How to reproduce the problem:
I am doing this through the AWS console under the EKS cluster compute tab. I am using this security group role for the nodes

These are my policies for the cluster node group role

I also added IAM permissions for my user that include ssm:*, ssmmessages:*, and ec2messages:*

These are my policies for the cluster role itself

I am not sure what I am missing. I am using a VPC with public/private subnets, and everything was working fine until I tried using Bottlerocket.

vigh-m · 2024-05-14T22:36:51Z

Hi @asluborski! Thanks for reaching out with this issue.

Have you referred to our Quickstart guide when launching nodes for EKS?
What user-data are you passing to the Bottlerocket instance when you try to launch it?
You can check the console logs via the Get System Logs option under Actions -> Monitor and Troubleshoot from the Instance. This will help figure out if the node is booting fully. If not, it can help diagnose this further.
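For readers unsure what the user-data question refers to: Bottlerocket reads its settings as TOML, and the Kubernetes variants take the cluster connection details under `[settings.kubernetes]`. The sketch below only renders that fragment as a string. The function name and the example values are hypothetical, and with an EKS managed node group this data is normally supplied for you, so treat it as an illustration of the format rather than something this issue required.

```python
def render_bottlerocket_user_data(api_server: str,
                                  cluster_name: str,
                                  cluster_certificate: str) -> str:
    """Render a minimal Bottlerocket user-data fragment (TOML) as a string.

    The three keys are the cluster connection settings under
    [settings.kubernetes]; the values passed in are placeholders.
    """
    settings = {
        "api-server": api_server,
        "cluster-name": cluster_name,
        "cluster-certificate": cluster_certificate,
    }
    lines = ["[settings.kubernetes]"]
    for key, value in settings.items():
        # Escape backslashes and double quotes so the TOML string stays valid.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key} = "{escaped}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; substitute the real cluster details.
    print(render_bottlerocket_user_data(
        "https://example.invalid",
        "demo-cluster",
        "BASE64-ENCODED-CA-CERT",
    ))
```

If the node never appears in the cluster, comparing the user data the instance actually received against this shape is one quick check before digging into the console logs.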

asluborski · 2024-05-16T17:14:48Z

I was able to find the issue. The conflict was between the Kubernetes version I was using for the cluster and the NVIDIA driver being installed for the P2 instances. P2 instances require the legacy NVIDIA 470.x drivers, but newer NVIDIA drivers (500.x) were being installed, so the nodes failed. I had to downgrade to Kubernetes 1.23 for Bottlerocket to install the legacy NVIDIA drivers.
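The compatibility rule described above can be written as a small pre-flight check. This is a sketch under stated assumptions: the only rule encoded is the one from this thread (P2 needs the legacy 470.x driver branch, so newer branches are rejected), the function and constant names are hypothetical, and the driver version strings in the usage lines are made-up examples. It says nothing authoritative about other instance families.

```python
# Assumption taken from this thread: P2 instances need the legacy 470.x
# NVIDIA driver branch, so anything newer is treated as incompatible.
LEGACY_DRIVER_MAJOR = 470
LEGACY_ONLY_FAMILIES = {"p2"}


def instance_family(instance_type: str) -> str:
    """Return the family part of an EC2 instance type, e.g. 'p2' for 'p2.xlarge'."""
    return instance_type.split(".", 1)[0].lower()


def driver_supports_instance(instance_type: str, driver_version: str) -> bool:
    """Check a driver version against the legacy-driver rule from the thread.

    Families not listed in LEGACY_ONLY_FAMILIES are not checked at all and
    simply return True.
    """
    major = int(driver_version.split(".", 1)[0])
    if instance_family(instance_type) in LEGACY_ONLY_FAMILIES:
        return major <= LEGACY_DRIVER_MAJOR
    return True


if __name__ == "__main__":
    # Example version strings are illustrative, not taken from the issue.
    print(driver_supports_instance("p2.xlarge", "470.0"))
    print(driver_supports_instance("p2.xlarge", "515.0"))
```

Running a check like this before creating the node group would flag the mismatch earlier than waiting for a NodeCreationFailure, provided you know which driver branch the chosen variant ships.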

asluborski added the status/needs-triage and type/bug labels on May 14, 2024

vigh-m added the status/needs-info and type/support labels and removed the type/bug and status/needs-triage labels on May 14, 2024

asluborski closed this as completed May 16, 2024
