BottleRocket NVIDIA EKS Node group wont join EKS Cluster #3959

Closed
asluborski opened this issue May 14, 2024 · 2 comments
Labels
status/needs-info Further information is requested type/support User support related issues.

Comments

asluborski commented May 14, 2024

Platform I'm building on:
BOTTLEROCKET_x86_64_NVIDIA w/ p2.xlarge through AWS console
Kubernetes 1.29

What I expected to happen:
After creating a node group for BOTTLEROCKET_x86_64_NVIDIA with p2.xlarge from the Compute tab of the EKS cluster in the AWS console, I expected the nodes to join the EKS cluster. I have not changed anything else, and the other OS images, such as AL2 with GPU, join the cluster fine.

What actually happened:

I got this message:
NodeCreationFailure | Instances failed to join the kubernetes cluster

How to reproduce the problem:
I am doing this through the AWS console, under the Compute tab of the EKS cluster. I am using this security group role for the nodes:

[screenshot: config]

These are my policies for the cluster node group role:

[screenshot: policies]

I also added IAM permissions for my user that include ssm:*, ssmmessages:*, and ec2messages:*

These are my policies for the cluster role itself:

[screenshot: cluster]

I am not sure what I am missing. I am using a VPC with public and private subnets, and everything was working fine until I tried using Bottlerocket.

@asluborski asluborski added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels May 14, 2024
@vigh-m
Contributor

vigh-m commented May 14, 2024

Hi @asluborski! Thanks for reaching out with this issue.

  1. Have you referred to our Quickstart guide when launching nodes for EKS?
  2. What user-data are you passing to the Bottlerocket instance when you try to launch it?
  3. You can check the console logs via the Get System Logs option under Actions -> Monitor and Troubleshoot from the Instance. This will help figure out if the node is booting fully. If not, it can help diagnose this further.
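For anyone who prefers to script step 3 rather than click through the console, here is a minimal sketch. It assumes boto3 is installed and AWS credentials are configured; `fetch_console_log`, `find_suspect_lines`, and the keyword list are hypothetical helpers made up for illustration, not part of Bottlerocket or the AWS tooling. `get_console_output` is the standard EC2 API call that backs the console's system log view.

```python
from typing import Any, Iterable, List


def fetch_console_log(ec2_client: Any, instance_id: str) -> str:
    """Return the console output EC2 currently holds for one instance."""
    # ec2_client is expected to be boto3.client("ec2") or a compatible object.
    response = ec2_client.get_console_output(InstanceId=instance_id)
    # The "Output" key can be missing if the instance has not logged anything yet.
    return response.get("Output", "")


def find_suspect_lines(
    log_text: str,
    keywords: Iterable[str] = ("nvidia", "kubelet", "failed", "error"),
) -> List[str]:
    """Return log lines that mention any keyword (case-insensitive).

    The default keywords are an assumption about what is worth reading first.
    """
    lowered = [k.lower() for k in keywords]
    return [
        line
        for line in log_text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]


# Example usage (the instance ID is a placeholder):
#   import boto3
#   log = fetch_console_log(boto3.client("ec2"), "i-0123456789abcdef0")
#   for line in find_suspect_lines(log):
#       print(line)
```

The client is passed in rather than created inside the function so the filtering logic can be tried against a saved log without touching AWS.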

@vigh-m vigh-m added status/needs-info Further information is requested type/support User support related issues. and removed type/bug Something isn't working status/needs-triage Pending triage or re-evaluation labels May 14, 2024
@asluborski
Author

I was able to find the issue. The conflict was between the Kubernetes version I was using for the cluster and the NVIDIA driver being installed for the p2 instances. P2 instances require the legacy NVIDIA 470.x drivers, but newer NVIDIA drivers (500.x) were being installed, so the nodes failed. I had to downgrade to Kubernetes 1.23 for Bottlerocket to install the legacy NVIDIA drivers.
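The mismatch described above can be expressed as a small pre-flight check. This is only a sketch: the `MAX_DRIVER_BRANCH` table and both function names are hypothetical, and the single entry (p2 needing the 470.x branch or older) comes from this thread, not from an authoritative compatibility matrix.

```python
from typing import Optional

# Assumption: newest NVIDIA driver branch known to work per instance family.
# Only the p2 entry is taken from this thread; extend the table at your own risk.
MAX_DRIVER_BRANCH = {"p2": 470}


def instance_family(instance_type: str) -> str:
    """Return the family part of an instance type, e.g. 'p2' for 'p2.xlarge'."""
    return instance_type.split(".", 1)[0]


def driver_is_compatible(instance_type: str, driver_version: str) -> Optional[bool]:
    """Compare a driver version's major branch against the family's ceiling.

    Returns None when the family is not in the table, since nothing is known.
    """
    limit = MAX_DRIVER_BRANCH.get(instance_family(instance_type))
    if limit is None:
        return None
    major = int(driver_version.split(".", 1)[0])
    return major <= limit
```

For example, `driver_is_compatible("p2.xlarge", "470.82.01")` is true, while any 5xx driver on p2.xlarge is flagged as incompatible.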
