
Problem with NVIDIA GSP and g4dn, g5, and g5g instances #1523

Closed
chiragjn opened this issue Nov 18, 2023 · 32 comments

Comments

@chiragjn

chiragjn commented Nov 18, 2023

What happened:

We provisioned a g5.* instance and it booted with the latest AMI release, v20231116.
When we try to run any GPU workloads, the container toolkit (CLI) fails to communicate with the GPU devices. When we shell into the node and run nvidia-smi -q, it really struggles to produce output and a bunch of values come back as Unknown Error.

Attaching lscpu and nvidia-smi logs:
lscpu+nvidia-smi.log.txt

Workload runc error:

Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=15676 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/reranker-a10/rootfs] nvidia-container-cli: initialization error: driver error: timed out: unknown

I am reporting this because we have seen similar issues in the last few days with A100 + driver 535 + AMD EPYC configurations elsewhere.

How to reproduce it (as minimally and precisely as possible):
Provision a g5 instance with the latest AMI and run nvidia-smi -q on the host.

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): g5.8xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.7
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.27 (v1.27.7-eks-4f4795d)
  • AMI Version: amazon-eks-gpu-node-1.27-v20231116
  • AMI ID: ami-04358af1a6af90875
  • Kernel (e.g. uname -a): Linux ip-10-2-53-244.eu-west-1.compute.internal 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0fe9073bb890001f8"
BUILD_TIME="Thu Nov 16 03:14:20 UTC 2023"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"
@chiragjn chiragjn changed the title Potential problems with Nvidia Drivers 535 and g5 instances (AMD EPYC) Potential problems with Nvidia Drivers 535 and g5 instances on AMI v20231116 (AMD EPYC) Nov 18, 2023
@chiragjn chiragjn changed the title Potential problems with Nvidia Drivers 535 and g5 instances on AMI v20231116 (AMD EPYC) Potential problems with Nvidia Drivers 535 and g5 instances (AMD EPYC) on AMI v20231116 Nov 18, 2023
@dims
Member

dims commented Nov 18, 2023

@chiragjn can you please open a service ticket? (Since you are using the stock, unmodified AMI that the EKS team ships!)

@chiragjn
Author

Ah okay, I have created one now. Just curious: are the stock AMIs not built from this codebase? The changelog seems to indicate that they are.

@dims
Member

dims commented Nov 18, 2023

It's a layer above what's here in this repo, @chiragjn (cough! check the license of things cough!)

@bhavitsharma

Hey, we're experiencing this problem with K80 GPU EC2 instances like p2.xlarge. It works perfectly with A10G/V100 GPUs (we're also using the stock AMIs).

@cartermckinnon
Member

@bhavitsharma this is expected, as newer versions of the NVIDIA driver have dropped support for the chipsets used in p2: #1448 (comment)

@bhavitsharma

@cartermckinnon, as far as I understand, this is only for Kubernetes 1.28. We're running 1.27.

@bryantbiggs
Contributor

@bhavitsharma the GPU cards on P2s do not support 5xx series drivers. The 1.28 GPU AMI has always provided the 535 driver, but starting with release v20231116, 1.25+ GPU AMIs are all shipping with the 535 driver as well. Therefore, P2s will not work with these AMIs

@chiragjn
Author

chiragjn commented Nov 19, 2023

I am still waiting to hear back from the support team; just noting here that the issue is not consistently reproducible. We got another g5 node and things are working fine 🙃
This inconsistency is also similar to the other A100 + driver 535 + AMD EPYC setup we have elsewhere.
Will report back if we find the root cause.

@cartermckinnon
Member

@chiragjn I haven't been able to reproduce this, and we haven't received any other reports of weirdness on g5 instances. Have you narrowed down a reproduction?

@chiragjn
Author

chiragjn commented Nov 22, 2023

@cartermckinnon We are also not able to reproduce this consistently; we hit it again just a few hours ago. So far it's roughly 4 out of 20 attempts.
I would recommend provisioning a large number of instances, deploying a GPU workload that tries to use CUDA, and seeing how many of them succeed (see the sketch below).
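
Something along these lines could work as a sweep (a sketch only; the instance-type selector, pod naming, and toleration are placeholders, and the image/env mirror the minimal example further down this thread):

# Sketch only: schedule one nvidia-smi pod per matching GPU node, then count results.
i=0
for node in $(kubectl get nodes -l node.kubernetes.io/instance-type=g5.8xlarge -o name | cut -d/ -f2); do
  i=$((i + 1))
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-$i
  labels:
    app: gpu-smoke
spec:
  restartPolicy: Never
  nodeName: $node
  containers:
    - name: smi
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi']
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
done

# Once the pods finish, tally Completed vs Error/CreateContainerError:
kubectl get pods -l app=gpu-smoke --no-headers | awk '{print $3}' | sort | uniq -c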

@dims
Member

dims commented Nov 22, 2023

@chiragjn this sounds like something that needs to be reported through AWS support. Can you please open a case? Thanks!

@chiragjn
Author

I have reported it; I am guessing they too are having trouble reproducing this. We are doing some tests of our own, and I'll post updates on our results.

@dmegyesi

I can confirm we have the same problem on g5.4xlarge with the new AMI; the GPU is totally dead:

Kubelet:

Pod Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
$ nvidia-smi -a

=============NVSMI LOG==============

Timestamp                                 : Fri Nov 24 11:07:21 2023
Driver Version                            : 535.54.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Unknown Error
        Pending                           : Unknown Error
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322321021225
    GPU UUID                              : Unknown Error
    Minor Number                          : 0
    VBIOS Version                         : Unknown Error
    MultiGPU Board                        : No
    Board ID                              : 0x1e
    Board Part Number                     : 900-2G133-A840-000
    GPU Part Number                       : Unknown Error
    FRU Part Number                       : N/A
    Module ID                             : Unknown Error
[...]

@cartermckinnon
Member

@dmegyesi do you see this on other g5 sizes as well? Is the issue consistent on a given instance or does the nvidia-smi query succeed sometimes?

@chiragjn
Author

chiragjn commented Dec 1, 2023

@cartermckinnon
I think I found a workaround. It seems to be an NVIDIA GSP-related issue.

I was able to get hold of a faulty g5.12xlarge node and check dmesg via nsenter on the node.

And the logs led me to NVIDIA/open-gpu-kernel-modules#446

It reports a few different issues:

  1. Multiple GPUs connected via NVLink but assigned to different VMs (Timeout waiting for RPC from GSP! NVIDIA/open-gpu-kernel-modules#446 (comment))
  2. GPUs connected via different kinds of NVLink assigned to the same VM (Timeout waiting for RPC from GSP! NVIDIA/open-gpu-kernel-modules#446 (comment))
  3. GSP itself causing problems

Items 1 and 2 do not apply to g5 instances, so based on 3, I tried disabling GSP.


First, I checked:

cat /proc/driver/nvidia/params | grep EnableGpuFirmware

gives

EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2

Then I tried disabling it:

echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
dracut -f 
reboot

Checked again

cat /proc/driver/nvidia/params | grep EnableGpuFirmware

gives

EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2

It did not work.


Funnily enough, AWS's documentation on EC2 NVIDIA driver installation mentions this issue, but under the GRID and gaming drivers:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

If you are using NVIDIA driver version 510.x or greater on the G4dn, G5, or G5g instances, disable GSP with the following commands. For more information on why this is required, visit NVIDIA's documentation.
[ec2-user ~]$ sudo touch /etc/modprobe.d/nvidia.conf
[ec2-user ~]$ echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee --append /etc/modprobe.d/nvidia.conf
Reboot the instance.
[ec2-user ~]$ sudo reboot

And it points to https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#disabling-gsp for the reason

Some GPUs include a GPU System Processor (GSP), which may be used to offload GPU initialization and management tasks. In GPU pass through and bare-metal deployments on Linux, GSP is supported only for vCS. If you are using any other product in a GPU pass through or bare-metal deployment on Linux, you must disable the GSP firmware.

Great, let's try this:

touch /etc/modprobe.d/nvidia.conf
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee --append /etc/modprobe.d/nvidia.conf
dracut -f 
reboot

Checked again

cat /proc/driver/nvidia/params | grep EnableGpuFirmware

gives

EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2

It did not work either.


At this point I went for the nuclear option: deleting the GSP firmware.

rm -f /lib/firmware/nvidia/535.54.03/gsp_*.bin
reboot

And it works! dmesg complains, but it works!

[   35.745003] nvidia 0000:00:1b.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2
[   36.450155] nvidia 0000:00:1c.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2
[   37.140379] nvidia 0000:00:1d.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2
[   37.844255] nvidia 0000:00:1e.0: Direct firmware load for nvidia/535.54.03/gsp_ga10x.bin failed with error -2

And now my workload runs on this node
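
For what it's worth, a quick way to sanity-check after the reboot (assuming the 535 driver series exposes the GSP Firmware Version field in nvidia-smi -q; it should read N/A once GSP is no longer in use):

nvidia-smi -q | grep -i "GSP Firmware Version"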

@cartermckinnon
Member

@chiragjn thanks! That certainly looks like the smoking gun. Requiring a reboot puts us in a tough position, and I'm not sure we can do anything at runtime before systemd-modules-load runs; that happens very early in the boot process. I'll see what I can come up with 👍

@cartermckinnon cartermckinnon changed the title Potential problems with Nvidia Drivers 535 and g5 instances (AMD EPYC) on AMI v20231116 Problem with NVIDIA GSP and g4dn, g5, and g5g instances Dec 1, 2023
@dmegyesi

dmegyesi commented Dec 2, 2023

@dmegyesi do you see this on other g5 sizes as well? Is the issue consistent on a given instance or does the nvidia-smi query succeed sometimes?

Apologies for not answering earlier; I was attending re:Invent and couldn't follow up with our customers before now. Yes, we have seen this on various sizes of g5 instances. I'd say roughly 1 out of 5 machines actually worked, seemingly at random; we can't see a pattern even with the same workload. We run the NVIDIA DCGM exporter on the nodes, which also touches the GPUs; not sure if that is relevant.
(A bit off topic: meanwhile we have a support case open with AWS support, but it's been super unhelpful. Is there a recommended way to get these kinds of investigations assigned to the right people over there who are more familiar with the lower-level details?)

@chiragjn
Author

chiragjn commented Dec 3, 2023

We run the Nvidia DCGM exporter on the nodes

We also run the DCGM exporter and can confirm that not running it at least reduces the failure rate. But like you said, some nodes still run fine, so we don't think it is strictly a DCGM problem.
A similar issue was reported on DCGM Exporter (NVIDIA/dcgm-exporter#148), and it also points to a GSP-related issue.

@chiragjn
Author

chiragjn commented Dec 8, 2023

@cartermckinnon Any luck figuring out a solution? 😅 Any insight into the best place in the node lifecycle to apply a fix would also be great.

@suket22
Member

suket22 commented Dec 21, 2023

I came across NVIDIA/gpu-operator#634 (comment), which also points to GSP as a possible source of this issue.

@chiragjn
Author

chiragjn commented Jan 3, 2024

@cartermckinnon I have a working but quite hacky solution to disable GSP:
truefoundry/infra-charts@b809b5b#diff-af2ed6a75711a35412327054490193f39e21039911b95abfd6324cbe5cc31e2dR230-R240
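
The gist of it is roughly the following (a sketch of the approach rather than the exact contents of that commit; the marker file is just an illustrative guard against a reboot loop):

#!/bin/bash
# Sketch: disable GSP once per node, then reboot so the nvidia kmod reloads without it.
# The marker-file path is arbitrary; adjust the firmware glob to the installed driver version.
if [ ! -f /var/lib/nvidia-gsp-disabled ]; then
  echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
  # Also remove the GSP firmware blobs, which is what actually worked above.
  rm -f /lib/firmware/nvidia/*/gsp_*.bin
  dracut -f
  touch /var/lib/nvidia-gsp-disabled
  reboot
fi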

Is it possible for the AMI team to configure the kernel params and disable GSP while building the kernel modules?

@chiragjn
Author

Bumping this again. I am not sure how or where I can get support for this.
Is the EKS team considering disabling GSP or figuring out any alternate solution?

@cartermckinnon
Member

Sorry for the delay; we're doing a rework of our NVIDIA setup to address #1494, which has taken priority.

Is the EKS team considering disabling GSP

Yes, I expect to get a fix out for this in the next few weeks.

@sidewinder12s

sidewinder12s commented Jan 19, 2024

We also appear to have hit this. So far, a surefire way to trigger it has been to run a pod that just runs some nvidia-smi dump/debug commands to confirm a functional GPU. Then recreate the pod on the same node; the pod ends up bricked/unable to be created when it restarts, with:

  Warning  Failed     57s                  kubelet            Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=11.2 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 --pid=30895 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/herd-monitor/rootfs]
nvidia-container-cli: initialization error: driver error: timed out: unknown

Before recreating the pod we run these commands without issue:

nvidia-smi (just to check for 0 exit code)

Then these:
/usr/bin/nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader
/usr/bin/nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader

This was on a g5.48xlarge with the Kubernetes 1.26 AMI, version v20240110.
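
For completeness, the probe amounts to something like the following (the wrapper is illustrative; the commands are the ones listed above):

#!/bin/bash
# Illustrative health probe: any non-zero exit marks the GPU as unhealthy.
set -euo pipefail
nvidia-smi > /dev/null
/usr/bin/nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader
/usr/bin/nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader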

@bryantbiggs
Contributor

Hey @sidewinder12s, I am trying to reproduce the issue based on your comments above and just wanted to make sure I'm following the process you were using. If I have the following pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi']
      # args: ['nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader']
      # args: ['nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader']
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      resources:
        requests:
          nvidia.com/gpu: 4
        limits:
          nvidia.com/gpu: 4
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'

Based on your comments (and correct me if I am wrong), you are saying that if I deploy it, then remove it, then re-deploy it, that's when you see the issue? Something like:

  1. kubectl apply -f pod.yaml
  2. Pod comes up ok, shows SMI output per usual
  3. kubectl delete -f pod.yaml
  4. kubectl apply -f pod.yaml
  5. The pod fails to deploy and the error above is shown in the events/logs

Is that correct?

@sidewinder12s

Roughly, yes, though we're only ever assigning 1 GPU, and I am not sure if we're setting those env vars (it's not in our pod spec, but maybe we set them within the container image).

The image is a health-checking image that constantly performs a bunch of health checks, including those nvidia-smi commands (largely to discover GPU issues that have bitten us in the past). It's been a few weeks, but I'm also not sure whether it was triggered by a pod restarting, the DaemonSet pod being recreated/updated, or both.

@bryantbiggs
Contributor

If you are using an image that's built on top of an NVIDIA base image, those environment variables will already be set. Since I'm just using a minimal AL2023 image here for simplicity, I have to add them to get the full SMI output details.

But thank you for sharing; I'm going to keep digging to try to reproduce the issue.
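
A quick way to check whether a given image already bakes those in, in case it helps (<your-image> is a placeholder):

docker inspect --format '{{json .Config.Env}}' <your-image> | tr ',' '\n' | grep NVIDIA_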

@chiragjn
Author

Sorry for the delay; we're doing a rework of our NVIDIA setup to address #1494 which has taken priority.

Is the EKS team considering disabling GSP

Yes, I expect to get a fix out for this in the next few weeks.

@cartermckinnon Apologies for the ping, but was any decision made here? 😅

@cartermckinnon
Member

The latest release (which will complete today) addresses this issue in Kubernetes 1.29 for g4dn instances by disabling GSP automatically. We're still working on the right solution for g5 instance types. These instances support EFA and as a result require the open-source NVIDIA kmod, but the GSP feature cannot be disabled on the open-source kmod. We're following up with EC2 and NVIDIA regarding this issue.
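
For the curious, the general shape of that kind of instance-type gating at boot is roughly the following; this is only a sketch using IMDSv2, not our actual implementation:

#!/bin/bash
# Sketch only: toggle GSP based on the instance type reported by IMDSv2.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_TYPE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-type)
case "$INSTANCE_TYPE" in
  g4dn.*)
    echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
    ;;
esac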

@chiragjn
Author

chiragjn commented Mar 22, 2024

the GSP feature cannot be disabled on the open-source kmod

Is there any documentation that mentions this?
I'm not able to understand why this is a limitation, because users on NVIDIA/open-gpu-kernel-modules#446 mention some success in disabling it.

My userdata script that tries to disable GSP is sadly now broken by the new release rolling out with the open-source kmod.

@cartermckinnon
Member

In my tests, the kmod param EnableGpuFirmware just doesn't have any effect with the open kmod. It doesn't seem to be used anywhere in the code, but it's 100% possible I'm misreading things: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/3bf16b890caa8fd6b5db08b5c2437b51c758ac9d/kernel-open/nvidia/nv.c#L131-L133

We intend to load the proprietary kmod on g5 types for the time being so that EnableGpuFirmware has the intended effect. That will ship in the next AMI release (the one after v20240315).
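
If you want to verify which flavour a node ended up with: as far as I know, the open module reports a Dual MIT/GPL license while the proprietary one reports NVIDIA.

modinfo -F license nvidia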

@cartermckinnon
Member

We intend to load the proprietary kmod on g5 types for the time being so that EnableGpuFirmware has the intended effect. That will ship in the next AMI release (the one after v20240315).

We've completed our rollout of this across all active k8s versions, so I'm going to close this issue. If you continue seeing this problem, please mention me here or open a case with AWS support.
