[GPU AMI] Incompatibility of EFA, NVIDIA drivers with latest Linux kernels #1494
Comments
My team builds EKS GPU images and I think we're facing this issue. We're using
Any advice? |
Today's release, v20240227, includes changes for Kubernetes 1.29 that address this issue. These changes will be backported to earlier Kubernetes versions in upcoming releases. There are a few things to note with this change:
Please reach out here or to AWS Support if these changes cause issues with your workload. This is a significant change and we expect some wrinkles will need ironing out. 😄 |
I'm running the latest version of AMI Here's the error message that I have in my app container:
Both the dcgm-exporter and feature-discovery pods failed to start with the message:
|
I had to add |
With the latest Amazon Linux 2 kernels, customers running EC2 P4d, P4de and P5 instances may be unable to use the GPUDirect RDMA feature, which allows for faster communication between the NVIDIA driver and the EC2 Elastic Fabric Adapter (EFA).
This issue is caused by a change accepted by the Linux kernel community which introduced an incompatibility between the NVIDIA driver and the EFA driver. This change prevents the proprietary NVIDIA driver from dynamically linking to open source ones, such as EFA. We are currently working towards a solution to allow the use of the GPUDirect RDMA feature with the affected kernels.
Linux kernels at or above the following versions are affected:
The EKS-Optimized Accelerated AMI does not contain the affected kernel versions. By default, these AMIs lock the kernel version and are not affected unless the lock is manually removed. Until we have determined a solution, we recommend that customers using custom AMIs lock their kernel to a version lower than those listed above to prevent any impact on their workloads. The kernel version can be locked with the following command:
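For anyone maintaining a custom AMI, a rough way to check whether a node's running kernel sits at or above an affected version might look like the sketch below. The threshold string is a made-up placeholder, not one of the affected versions from this issue, and the helper names are hypothetical. It also assumes that a threshold only applies to kernels in the same `major.minor` series, which should be verified against the actual list.

```python
import platform
import re

# PLACEHOLDER threshold for illustration only; substitute a real affected
# version from the list in this issue before relying on the result.
HYPOTHETICAL_MIN_AFFECTED = "5.10.999-0.0"


def parse_kernel_release(release: str) -> tuple:
    """Parse the numeric prefix of a kernel release string.

    '5.10.192-183.736.amzn2.x86_64' -> (5, 10, 192, 183, 736)
    Missing build fields are padded with zeros so tuples compare cleanly.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+)(?:\.(\d+))?)?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def is_at_or_above(release: str, threshold: str) -> bool:
    """Return True if `release` is in the same major.minor series as
    `threshold` and is at or above it (assumption: thresholds are per-series)."""
    running = parse_kernel_release(release)
    minimum = parse_kernel_release(threshold)
    if running[:2] != minimum[:2]:
        return False
    return running >= minimum


if __name__ == "__main__":
    current = platform.release()  # kernel release string on Linux, as in `uname -r`
    try:
        affected = is_at_or_above(current, HYPOTHETICAL_MIN_AFFECTED)
        print(f"{current}: at or above placeholder threshold = {affected}")
    except ValueError as err:
        print(f"could not parse kernel release: {err}")
```

Running this on a node before and after an OS update gives a quick signal that a kernel lock was removed or bypassed, though it is no substitute for the lock itself.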