[GPU AMI] Incompatibility of EFA, NVIDIA drivers with latest Linux kernels #1494

Open
cartermckinnon opened this issue Oct 26, 2023 · 4 comments

Comments

@cartermckinnon
Member

cartermckinnon commented Oct 26, 2023

With the latest Amazon Linux 2 kernels, customers running EC2 P4d, P4de and P5 instances may be unable to use the GPUDirect RDMA feature, which allows for faster communication between the NVIDIA driver and the EC2 Elastic Fabric Adapter (EFA).

This issue is caused by a change accepted by the Linux kernel community which introduced an incompatibility between the NVIDIA driver and the EFA driver. This change prevents the proprietary NVIDIA driver from dynamically linking to open source ones, such as EFA. We are currently working towards a solution to allow the use of the GPUDirect RDMA feature with the affected kernels.

Linux kernel versions equal to or above the following are affected:

  • 4.14.326
  • 5.4.257
  • 5.10.195
  • 5.15.131
  • 6.1.52

The EKS-Optimized Accelerated AMI does not contain the affected kernel versions. These AMIs lock the kernel version by default and are not affected unless the lock is manually removed. We recommend that customers using custom AMIs lock their kernel to a version lower than those listed above to prevent any impact on their workloads until we have determined a solution. The kernel version can be locked with the following command:

sudo yum versionlock kernel*
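
A minimal sketch of applying the lock on a custom AMI, assuming the yum-plugin-versionlock plugin is not already installed:

    # The versionlock subcommand is provided by the yum-plugin-versionlock plugin,
    # which may need to be installed first on a custom AMI.
    sudo yum install -y yum-plugin-versionlock
    sudo yum versionlock 'kernel*'
    # Verify the lock and the currently running kernel version
    yum versionlock list
    uname -r
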
cartermckinnon changed the title from "[GPU] Incompatibility of EFA, NVIDIA drivers with latest Linux kernels" to "[GPU AMI] Incompatibility of EFA, NVIDIA drivers with latest Linux kernels" on Oct 26, 2023
@pfuntner

My team builds EKS GPU images and I think we're facing this issue. We're using amazon-eks-gpu-node-1.28-v20231201 (ami-0a2b1b38a4684df6a in the us-east-1 region), and I believe the kernel packages come pinned initially.

yum upgrade output: eks gpu upgrade errors.txt

Any advice?

@cartermckinnon
Member Author

Today's release, v20240227, includes changes for Kubernetes 1.29 that address this issue. These changes will be backported to earlier Kubernetes versions in upcoming releases.

There are a few things to note with this change:

  1. The open-source NVIDIA kernel module will be used on supported instance types. This is necessary for EFA to function.
  2. The proprietary NVIDIA kernel module will be used on instance types that are not supported by the open-source module.
  3. We've migrated from the legacy nvidia-docker2 package to the nvidia-container-toolkit.
  4. The latest version of the 535-series NVIDIA driver is used, 535.161.07.
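
A quick way to confirm which module flavor a given node ended up with (a sketch, not an official procedure; the open-source kernel module should report a "Dual MIT/GPL" license, while the proprietary one reports "NVIDIA"):

    # Which NVIDIA kernel module flavor is loaded on this node?
    modinfo nvidia | grep -i license
    # Confirm the driver version matches the 535-series release noted above
    nvidia-smi --query-gpu=driver_version --format=csv,noheader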

Please reach out here or to AWS Support if these changes cause issues with your workload. This is a significant change and we expect some wrinkles will need ironing out. 😄

@farioas

farioas commented Mar 3, 2024

I'm running the latest version of AMI v1.29.0-eks-5e0fdde and nvidia-gpu-operator v23.9.1 on g5.48xlarge

Here's the error message that I have in my app container:

    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
      Exit Code:    128

While both the dcgm-exporter and feature-discovery pods failed to start with the message:

    Last State:    Terminated
      Reason:      StartError
      Message:     failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown
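
For reference, a rough way to see the mismatch on the node itself (a sketch; the library path is taken from the error above, and objdump requires binutils to be installed):

    # Compare the node's glibc with the symbol version the operator-installed library wants.
    rpm -q glibc   # Amazon Linux 2 ships glibc 2.26, older than the required GLIBC_2.27
    objdump -T /usr/local/nvidia/toolkit/libnvidia-container.so.1 | grep GLIBC_2.27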

@IshwarChandra

(Quoting @farioas's report and error messages above.)

I had to add privileged: true to the securityContext in the DaemonSet's definition file, but I am not sure whether this is recommended at all.
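
For illustration, the change amounts to something like the following (a sketch only, and not an endorsement of running these pods privileged; the DaemonSet name and namespace are hypothetical and depend on how the gpu-operator was installed):

    # Hypothetical example: mark the DaemonSet's first container as privileged.
    kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter --type=json \
      -p='[{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value":{"privileged":true}}]'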
