Option to disable GSP Firmware module for Nvidia GPUs #3817

Closed
chiragjn opened this issue Mar 12, 2024 · 3 comments
Labels
status/needs-info Further information is requested type/enhancement New feature or request

Comments

@chiragjn
Contributor

I had brought this up earlier in another thread, but am creating an issue to track this separately.
When GSP firmware is enabled, running dcgm-exporter renders the GPU unresponsive and all interactions with the GPU start timing out, leading to container creation failures or unresponsive GPU containers.

E.g., when trying to create a pod with GPU access on a g5.xlarge in an EKS 1.28 cluster running the Bottlerocket AMI (the node already has the dcgm-exporter DaemonSet running), we get:

Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=15676 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/reranker-a10/rootfs] nvidia-container-cli: initialization error: driver error: timed out: unknown

As such, the only working solution we have found is to disable GSP entirely. For AL2 I have done this using a user data init script, but with Bottlerocket I am afraid we don't have that option.
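For reference, one way to do this on AL2 is a user-data snippet along these lines (a minimal sketch; the modprobe file name is arbitrary):

# user-data sketch: persist the module option so the nvidia module loads with GSP off
echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
# if the module is already loaded at this point, it must be reloaded (or the node rebooted) for the option to apply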

See these threads for more details:
awslabs/amazon-eks-ami#1523
NVIDIA/open-gpu-kernel-modules#446

For AL2, the EKS team has decided to disable GSP as well; that work is in progress.

How to reproduce

  • Deploy the DCGM Exporter helm chart, targeting all GPU nodes in the cluster
  • Bring up g5.* nodes
  • Deploy a workload pod consuming a GPU
  • Either the pod will fail to create,
  • Or, if the pod manages to run, shell into the pod and try running nvidia-smi. Most likely it will struggle to produce output or will show ERR in several fields
  • Check dmesg on the node; the logs will contain XID 119 errors
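For example, the relevant errors can be checked on the node with something like:

dmesg -T | grep -i xid

Lines mentioning "NVRM: Xid ... 119" indicate GSP RPC timeouts.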

Sadly, this issue is not 100% reproducible and takes some recycling of nodes to encounter. One of the users in the above thread has reported a guaranteed way of triggering it: awslabs/amazon-eks-ami#1523 (comment), although I have not tested it.

What I'd like:

A kernel module setting to disable GSP

Any alternatives you've considered:

Building custom AMIs with the GSP firmware files removed

@chiragjn chiragjn added status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request labels Mar 12, 2024
@foersleo
Contributor

Thanks for bringing this up @chiragjn.

As has been discussed on the linked issue, the GSP firmware can be disabled through the module option NVreg_EnableGpuFirmware (as detailed in the driver documentation at https://download.nvidia.com/XFree86/Linux-x86_64/535.161.07/README/gsp.html).

Setting module options for Bottlerocket can be done through the user-data setting for the kernel command line and the reboot-to-reconcile option, as documented at https://bottlerocket.dev/en/os/1.19.x/api/settings/boot/.

To disable GSP, you will have to set the following options in your user data:

[settings.boot]
reboot-to-reconcile = true
[settings.boot.kernel-parameters]
"nvidia.NVreg_EnableGpuFirmware" = ["0"]

I have done a test with the following eksctl config to check in an A/B scenario: one nodegroup that boots the image "vanilla" (ng-bottlerocket-g4), and one nodegroup with the appropriate settings to disable GSP firmware (ng-bottlerocket-g4-nogsp):

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: bottlerocket-nvidia
  region: us-west-2
  version: '1.28'

nodeGroups:
  - name: ng-bottlerocket-g4
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    ami: ami-0afc36986e4122bb4
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
      allow: true
      publicKeyName: ec2_rsa
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"
  - name: ng-bottlerocket-g4-nogsp
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    ami: ami-0afc36986e4122bb4
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
      allow: true
      publicKeyName: ec2_rsa
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"
        boot:
          reboot-to-reconcile: true
          kernel-parameters:
            nvidia.NVreg_EnableGpuFirmware:
              - "0"

Instance g4dn.xlarge with GSP disabled:

bash-5.1# /usr/libexec/nvidia/tesla/bin/nvidia-smi -q | grep "GSP Firmware Version"
    GSP Firmware Version                  : N/A
bash-5.1# cat /proc/cmdline 
nvidia.NVreg_EnableGpuFirmware="0" [...]

Instance g4dn.xlarge without GSP disabled:

bash-5.1# /usr/libexec/nvidia/tesla/bin/nvidia-smi -q | grep "GSP Firmware Version"
    GSP Firmware Version                  : 535.161.07

Would this fix your issue or is there anything extra that you would need from Bottlerocket?

@foersleo foersleo added status/needs-info Further information is requested and removed status/needs-triage Pending triage or re-evaluation labels Mar 14, 2024
@chiragjn
Contributor Author

Oh amazing, didn't know about this
I'll try this out with Karpenter user data and report back by tomorrow
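For anyone else on Karpenter, I expect the same settings to go into the EC2NodeClass userData, roughly like this (untested sketch; names are illustrative and other required fields are omitted):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket-gpu-nogsp
spec:
  amiFamily: Bottlerocket
  # subnetSelectorTerms, securityGroupSelectorTerms, role, etc. omitted for brevity
  userData: |
    [settings.boot]
    reboot-to-reconcile = true
    [settings.boot.kernel-parameters]
    "nvidia.NVreg_EnableGpuFirmware" = ["0"]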

@chiragjn
Contributor Author

This works as expected! Thanks again :)
