
Problem with NVIDIA GSP and g4dn, g5, and g5g instances #1523

Closed
chiragjn opened this issue Nov 18, 2023 · 32 comments

Comments

@chiragjn

chiragjn commented Nov 18, 2023

What happened:

We provisioned a g5.* instance and it booted with the latest AMI release, v20231116.
When we try to run any GPU workloads, the container toolkit (CLI) fails to communicate with the GPU devices. When we shell into the node and run nvidia-smi -q, it really struggles to produce output and a bunch of values come back as Unknown Error.

Attaching lscpu and nvidia-smi logs:
lscpu+nvidia-smi.log.txt

Workload runc error:

Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=15676 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/reranker-a10/rootfs] nvidia-container-cli: initialization error: driver error: timed out: unknown

I am reporting this because we have seen similar issues in the last few days with A100 + driver 535 + AMD EPYC configurations elsewhere.

How to reproduce it (as minimally and precisely as possible):
Provision a g5 instance with the latest AMI and run nvidia-smi -q on the host.

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): g5.8xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.7
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.27 (v1.27.7-eks-4f4795d)
  • AMI Version: amazon-eks-gpu-node-1.27-v20231116
  • AMI ID: ami-04358af1a6af90875
  • Kernel (e.g. uname -a): Linux ip-10-2-53-244.eu-west-1.compute.internal 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0fe9073bb890001f8"
BUILD_TIME="Thu Nov 16 03:14:20 UTC 2023"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"
@chiragjn chiragjn changed the title Potential problems with Nvidia Drivers 535 and g5 instances (AMD EPYC) Potential problems with Nvidia Drivers 535 and g5 instances on AMI v20231116 (AMD EPYC) Nov 18, 2023
@chiragjn chiragjn changed the title Potential problems with Nvidia Drivers 535 and g5 instances on AMI v20231116 (AMD EPYC) Potential problems with Nvidia Drivers 535 and g5 instances (AMD EPYC) on AMI v20231116 Nov 18, 2023
@dims
Member

dims commented Nov 18, 2023

@chiragjn can you please open a service ticket? (Since you are using the stock, unmodified AMI that the EKS team ships!)

@chiragjn
Author

Ah okay, I have created one now. Just curious: are the stock AMIs not built from this codebase? The changelog seems to indicate that they are.

@dims
Member

dims commented Nov 18, 2023

It's a layer above what's here in this repo, @chiragjn (cough! check the license of things cough!)

@bhavitsharma

Hey, we're experiencing this problem with K80 GPU EC2 instances like p2.xlarge. It works perfectly with A10G/V100 GPUs (we're also using the stock AMIs).

@cartermckinnon
Member

@bhavitsharma this is expected, as newer versions of the NVIDIA driver have dropped support for the chipsets used in p2: #1448 (comment)

@bhavitsharma

@cartermckinnon, as far as I understand, this is only for Kubernetes 1.28. We're running 1.27.

@bryantbiggs
Contributor

@bhavitsharma the GPU cards on P2s do not support 5xx series drivers. The 1.28 GPU AMI has always provided the 535 driver, but starting with release v20231116, 1.25+ GPU AMIs are all shipping with the 535 driver as well. Therefore, P2s will not work with these AMIs

@chiragjn
Author

chiragjn commented Nov 19, 2023

I am still waiting to hear back from the support team; just noting here that the issue is not consistently reproducible. We got another g5 node and things are working fine 🙃
This inconsistency is also similar to the other A100 + driver 535 + AMD EPYC setup we have elsewhere.
Will report back if we find the root cause.

@cartermckinnon
Member

@chiragjn I haven't been able to reproduce this, and we haven't received any other reports of weirdness on g5 instances. Have you narrowed down a reproduction?

@chiragjn
Author

chiragjn commented Nov 22, 2023

@cartermckinnon We are also not able to reproduce this consistently; we hit it again just a few hours ago. So far it's roughly 4 out of 20 attempts.
I would recommend provisioning a large number of instances, deploying a GPU workload that tries to use CUDA, and seeing how many of them succeed (see the sketch below).
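
Something along these lines could work as a sweep (a sketch only; the instance-type selector, pod naming, and toleration are placeholders, and the image/env mirror the minimal example further down this thread):

# Sketch only: schedule one nvidia-smi pod per matching GPU node, then count results.
i=0
for node in $(kubectl get nodes -l node.kubernetes.io/instance-type=g5.8xlarge -o name | cut -d/ -f2); do
  i=$((i + 1))
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-$i
  labels:
    app: gpu-smoke
spec:
  restartPolicy: Never
  nodeName: $node
  containers:
    - name: smi
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi']
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
done

# Once the pods finish, tally Completed vs Error/CreateContainerError:
kubectl get pods -l app=gpu-smoke --no-headers | awk '{print $3}' | sort | uniq -c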

@dims
Member

dims commented Nov 22, 2023

@chiragjn this sounds like something that needs to be reported through AWS support. Can you please open a case? Thanks!

@chiragjn
Author

I have reported it; I am guessing they too are having trouble reproducing this. We are doing some tests of our own, and I'll post updates on our results.

@dmegyesi

I can confirm we have the same problem on g5.4xlarge with the new AMI; the GPU is totally dead:

Kubelet:

Pod Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
$ nvidia-smi -a

=============NVSMI LOG==============

Timestamp                                 : Fri Nov 24 11:07:21 2023
Driver Version                            : 535.54.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Unknown Error
        Pending                           : Unknown Error
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322321021225
    GPU UUID                              : Unknown Error
    Minor Number                          : 0
    VBIOS Version                         : Unknown Error
    MultiGPU Board                        : No
    Board ID                              : 0x1e
    Board Part Number                     : 900-2G133-A840-000
    GPU Part Number                       : Unknown Error
    FRU Part Number                       : N/A
    Module ID                             : Unknown Error
[...]

@cartermckinnon
Member

@dmegyesi do you see this on other g5 sizes as well? Is the issue consistent on a given instance or does the nvidia-smi query succeed sometimes?

@chiragjn
Author

chiragjn commented Dec 1, 2023

@cartermckinnon
I think I found a workaround. It seems to be an NVIDIA GSP-related issue.

I was able to get hold of a faulty g5.12xlarge node and check dmesg via nsenter on the node.

And the logs led me to NVIDIA/open-gpu-kernel-modules#446

It reports a few different issues:

  1. Multiple GPUs connected via NVLink but assigned to different VMs (Timeout waiting for RPC from GSP! NVIDIA/open-gpu-kernel-modules#446 (comment))
  2. GPUs connected via different kinds of NVLink assigned to the same VM (Timeout waiting for RPC from GSP! NVIDIA/open-gpu-kernel-modules#446 (comment))
  3. GSP itself causing problems

Items 1 and 2 do not apply to g5 instances, so based on 3, I tried disabling GSP.


First, I checked:

cat /proc/driver/nvidia/params | grep EnableGpuFirmware

gives

EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2

Then I tried disabling it:

echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
dracut -f 
reboot

Checked again

cat /proc/driver/nvidia/params | grep EnableGpuFirmware

gives

EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2

It did not work.


Funnily enough, AWS's documentation on EC2 NVIDIA driver installation mentions this issue, but under the GRID and gaming drivers:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

If you are using NVIDIA driver version 510.x or greater on the G4dn, G5, or G5g instances, disable GSP with the following commands. For more information on why this is required, visit NVIDIA's documentation.
[ec2-user ~]$ sudo touch /etc/modprobe.d/nvidia.conf
[ec2-user ~]$ echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee --append /etc/modprobe.d/nvidia.conf
Reboot the instance.
[ec2-user ~]$ sudo reboot

And it points to https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#disabling-gsp for the reason

Some GPUs include a GPU System Processor (GSP), which may be used to offload GPU initialization and management tasks. In GPU pass through and bare-metal deployments on Linux, GSP is supported only for vCS. If you are using any other product in a GPU pass through or bare-metal deployment on Linux, you must disable the GSP firmware.

Great, let's try this:

touch /etc/modprobe.d/nvidia.conf
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee --append /etc/modprobe.d/nvidia.conf
dracut -f 
reboot

Checked again

cat /proc/driver/nvidia/params | grep EnableGpuFirmware

gives

EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2

It did not work either.


At this point I went for the nuclear option: deleting the GSP firmware.

rm -f /lib/firmware/nvidia/535.54.03/gsp_*.bin
reboot

And it works! dmesg complains, but it works!

[   35.745003] nvidia 0000:00:1b.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2
[   36.450155] nvidia 0000:00:1c.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2
[   37.140379] nvidia 0000:00:1d.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2
[   37.844255] nvidia 0000:00:1e.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2

And now my workload runs on this node
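
For what it's worth, a quick way to sanity-check after the reboot (assuming the 535 driver series exposes the GSP Firmware Version field in nvidia-smi -q; it should read N/A once GSP is no longer in use):

nvidia-smi -q | grep -i "GSP Firmware Version"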

@cartermckinnon
Member

@chiragjn thanks! That certainly looks like the smoking gun. Requiring a reboot puts us in a tough position, and I'm not sure we can do anything at runtime before systemd-modules-load runs; that happens very early in the boot process. I'll see what I can come up with 👍

@cartermckinnon cartermckinnon changed the title Potential problems with Nvidia Drivers 535 and g5 instances (AMD EPYC) on AMI v20231116 Problem with NVIDIA GSP and g4dn, g5, and g5g instances Dec 1, 2023
@dmegyesi

dmegyesi commented Dec 2, 2023

@dmegyesi do you see this on other g5 sizes as well? Is the issue consistent on a given instance or does the nvidia-smi query succeed sometimes?

Apologies for not answering earlier; I was attending re:Invent and couldn't follow up with our customers before now. Yes, we have seen this on various sizes of g5 instances. I'd say roughly 1 out of 5 machines actually worked, seemingly at random; we can't see a pattern even with the same workload. We run the NVIDIA DCGM exporter on the nodes, which also touches the GPUs; not sure if that is relevant.
(A bit off topic: meanwhile we have a support case open with AWS support, but it's been super unhelpful. Is there a recommended way to get these kinds of investigations assigned to the right people over there who are more familiar with the lower-level details?)

@chiragjn
Author

chiragjn commented Dec 3, 2023

We run the Nvidia DCGM exporter on the nodes

We also run the DCGM exporter and can confirm that not running it at least reduces the failure rate. But like you said, some nodes still run fine, so we don't think it is strictly a DCGM problem.
A similar issue was reported on DCGM Exporter (NVIDIA/dcgm-exporter#148), and it also points to a GSP-related issue.

@chiragjn
Author

chiragjn commented Dec 8, 2023

@cartermckinnon Any luck figuring out a solution? 😅 Any insight into the best place in the node lifecycle to apply a fix would also be great.

@suket22
Member

suket22 commented Dec 21, 2023

I came across NVIDIA/gpu-operator#634 (comment), which also points to GSP as a possible source of this issue.

@chiragjn
Author

chiragjn commented Jan 3, 2024

@cartermckinnon I have a working but quite hacky solution to disable GSP:
truefoundry/infra-charts@b809b5b#diff-af2ed6a75711a35412327054490193f39e21039911b95abfd6324cbe5cc31e2dR230-R240
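
The gist of it is roughly the following (a sketch of the approach rather than the exact contents of that commit; the marker file is just an illustrative guard against a reboot loop):

#!/bin/bash
# Sketch: disable GSP once per node, then reboot so the nvidia kmod reloads without it.
# The marker-file path is arbitrary; adjust the firmware glob to the installed driver version.
if [ ! -f /var/lib/nvidia-gsp-disabled ]; then
  echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
  # Also remove the GSP firmware blobs, which is what actually worked above.
  rm -f /lib/firmware/nvidia/*/gsp_*.bin
  dracut -f
  touch /var/lib/nvidia-gsp-disabled
  reboot
fi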

Is it possible for the AMI team to configure the kernel params and disable GSP while building the kernel modules?

@chiragjn
Author

Bumping this again. I am not sure how or where I can get support for this.
Is the EKS team considering disabling GSP or figuring out any alternate solution?

@cartermckinnon
Member

Sorry for the delay; we're doing a rework of our NVIDIA setup to address #1494, which has taken priority.

Is the EKS team considering disabling GSP

Yes, I expect to get a fix out for this in the next few weeks.

@sidewinder12s

sidewinder12s commented Jan 19, 2024

We also appear to have hit this. So far, a surefire way to trigger it has been to run a pod that just runs some nvidia-smi dump/debug commands to confirm a functional GPU. Then recreate the pod on the same node; the pod ends up bricked/unable to be created when it restarts, with:

  Warning  Failed     57s                  kubelet            Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=11.2 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 --pid=30895 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/herd-monitor/rootfs]
nvidia-container-cli: initialization error: driver error: timed out: unknown

Before recreating the pod we run these commands without issue:

nvidia-smi (just to check for 0 exit code)

Then these:
/usr/bin/nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader
/usr/bin/nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader

This was on a g5.48xlarge with the Kubernetes 1.26 AMI, version v20240110.
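
For completeness, the probe amounts to something like the following (the wrapper is illustrative; the commands are the ones listed above):

#!/bin/bash
# Illustrative health probe: any non-zero exit marks the GPU as unhealthy.
set -euo pipefail
nvidia-smi > /dev/null
/usr/bin/nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader
/usr/bin/nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader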

@bryantbiggs
Contributor

Hey @sidewinder12s, I am trying to reproduce the issue based on your comments above and just wanted to make sure I'm following the process you were using. If I have the following pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi']
      # args: ['nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader']
      # args: ['nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader']
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      resources:
        requests:
          nvidia.com/gpu: 4
        limits:
          nvidia.com/gpu: 4
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'

Based on your comments (and correct me if I am wrong), you are saying that if I deploy it, then remove it, then re-deploy it, that's when you see the issue? Something like:

  1. kubectl apply -f pod.yaml
  2. Pod comes up ok, shows SMI output per usual
  3. kubectl delete -f pod.yaml
  4. kubectl apply -f pod.yaml
  5. The pod fails to deploy and the error above is shown in the events/logs

Is that correct?

@sidewinder12s

Roughly, yes, though we're only ever assigning 1 GPU, and I am not sure if we're setting those env vars (it's not in our pod spec, but maybe we set them within the container image).

The image is a health-checking image that constantly performs a bunch of health checks, including those nvidia-smi commands (largely to discover GPU issues that have bitten us in the past). It's been a few weeks, but I'm also not sure whether it was triggered by a pod restarting, the DaemonSet pod being recreated/updated, or both.

@bryantbiggs
Contributor

If you are using an image that's built on top of an NVIDIA base image, those environment variables will already be set. Since I'm just using a minimal AL2023 image here for simplicity, I have to add them to get the full SMI output details.

But thank you for sharing; I'm going to keep digging to try to reproduce the issue.
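
A quick way to check whether a given image already bakes those in, in case it helps (<your-image> is a placeholder):

docker inspect --format '{{json .Config.Env}}' <your-image> | tr ',' '\n' | grep NVIDIA_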

@chiragjn
Author

Sorry for the delay; we're doing a rework of our NVIDIA setup to address #1494 which has taken priority.

Is the EKS team considering disabling GSP

Yes, I expect to get a fix out for this in the next few weeks.

@cartermckinnon Apologies for the ping, but was any decision made here? 😅

@cartermckinnon
Member

The latest release (which will complete today) addresses this issue in Kubernetes 1.29 for g4dn instances by disabling GSP automatically. We're still working on the right solution for g5 instance types. These instances support EFA and as a result require the open-source NVIDIA kmod, but the GSP feature cannot be disabled on the open-source kmod. We're following up with EC2 and NVIDIA regarding this issue.
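
For the curious, the general shape of that kind of instance-type gating at boot is roughly the following; this is only a sketch using IMDSv2, not our actual implementation:

#!/bin/bash
# Sketch only: toggle GSP based on the instance type reported by IMDSv2.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_TYPE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-type)
case "$INSTANCE_TYPE" in
  g4dn.*)
    echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
    ;;
esac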

@chiragjn
Author

chiragjn commented Mar 22, 2024

the GSP feature cannot be disabled on the open-source kmod

Is there any documentation that mentions this?
I'm not able to understand why this is a limitation, because users on NVIDIA/open-gpu-kernel-modules#446 mention some success in disabling it.

My userdata script that tries to disable GSP is sadly now broken by the new release rolling out with the open-source kmod.

@cartermckinnon
Member

In my tests, the kmod param EnableGpuFirmware just doesn't have any effect with the open kmod. It doesn't seem to be used anywhere in the code, but it's 100% possible I'm misreading things: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/3bf16b890caa8fd6b5db08b5c2437b51c758ac9d/kernel-open/nvidia/nv.c#L131-L133

We intend to load the proprietary kmod on g5 types for the time being so that EnableGpuFirmware has the intended effect. That will ship in the next AMI release (the one after v20240315).
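
If you want to verify which flavour a node ended up with: as far as I know, the open module reports a Dual MIT/GPL license while the proprietary one reports NVIDIA.

modinfo -F license nvidia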

@cartermckinnon
Member

We intend to load the proprietary kmod on g5 types for the time being so that EnableGpuFirmware has the intended effect. That will ship in the next AMI release (the one after v20240315).

We've completed our rollout of this across all active k8s versions, so I'm going to close this issue. If you continue seeing this problem, please mention me here or open a case with AWS support.
