Option to disable GSP Firmware module for Nvidia GPUs #3817

Closed
chiragjn opened this issue Mar 12, 2024 · 3 comments
Labels
status/needs-info Further information is requested type/enhancement New feature or request

Comments

@chiragjn
Contributor

I had brought this up earlier in another thread, but am creating an issue to track this separately.
When GSP firmware is enabled, running dcgm-exporter renders the GPU unresponsive and all interactions with the GPU start timing out, leading to container creation failures or unresponsive GPU containers.

E.g., when trying to create a pod with GPU access on a g5.xlarge in an EKS 1.28 cluster running the Bottlerocket AMI (the node already has the dcgm-exporter DaemonSet running), we get:

Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=15676 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/reranker-a10/rootfs] nvidia-container-cli: initialization error: driver error: timed out: unknown

As such, the only working solution we have found is to disable GSP entirely. For AL2 I have done this using a user data init script, but with Bottlerocket I am afraid we don't have that option.
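For reference, one way to do this on AL2 is a user-data snippet along these lines (a minimal sketch; the modprobe file name is arbitrary):

# user-data sketch: persist the module option so the nvidia module loads with GSP off
echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
# if the module is already loaded at this point, it must be reloaded (or the node rebooted) for the option to apply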

See these threads for more details:
awslabs/amazon-eks-ami#1523
NVIDIA/open-gpu-kernel-modules#446

For AL2, the EKS team has decided to disable GSP as well; that work is in progress.

How to reproduce

  • Deploy the DCGM Exporter helm chart, targeting all GPU nodes in the cluster
  • Bring up g5.* nodes
  • Deploy a workload pod consuming a GPU
  • Either the pod will fail to create,
  • Or, if the pod manages to run, shell into the pod and try running nvidia-smi. Most likely it will struggle to produce output or will show ERR in several fields
  • Check dmesg on the node; the logs will contain XID 119 errors
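For example, the relevant errors can be checked on the node with something like:

dmesg -T | grep -i xid

Lines mentioning "NVRM: Xid ... 119" indicate GSP RPC timeouts.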

Sadly, this issue is not 100% reproducible and takes some recycling of nodes to encounter. One of the users in the above thread has reported a guaranteed way of triggering it: awslabs/amazon-eks-ami#1523 (comment), although I have not tested it.

What I'd like:

A kernel module setting to disable GSP

Any alternatives you've considered:

Building custom AMIs with the GSP firmware files removed

@chiragjn chiragjn added status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request labels Mar 12, 2024
@foersleo
Contributor

Thanks for bringing this up @chiragjn.

As has been discussed on the linked issue, the GSP firmware can be disabled through the module option NVreg_EnableGpuFirmware (as detailed in the driver documentation at https://download.nvidia.com/XFree86/Linux-x86_64/535.161.07/README/gsp.html).

Setting module options for Bottlerocket can be done through the user-data setting for the kernel command line and the reboot-to-reconcile option, as documented at https://bottlerocket.dev/en/os/1.19.x/api/settings/boot/.

To disable GSP, you will have to set the following options in your user data:

[settings.boot]
reboot-to-reconcile = true
[settings.boot.kernel-parameters]
"nvidia.NVreg_EnableGpuFirmware" = ["0"]

I have done a test with the following eksctl config to check in an A/B scenario: one nodegroup that boots the image "vanilla" (ng-bottlerocket-g4), and one nodegroup with the appropriate settings to disable GSP firmware (ng-bottlerocket-g4-nogsp):

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: bottlerocket-nvidia
  region: us-west-2
  version: '1.28'

nodeGroups:
  - name: ng-bottlerocket-g4
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    ami: ami-0afc36986e4122bb4
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
      allow: true
      publicKeyName: ec2_rsa
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"
  - name: ng-bottlerocket-g4-nogsp
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    ami: ami-0afc36986e4122bb4
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
      allow: true
      publicKeyName: ec2_rsa
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"
        boot:
          reboot-to-reconcile: true
          kernel-parameters:
            nvidia.NVreg_EnableGpuFirmware:
              - "0"

Instance g4dn.xlarge with GSP disabled:

bash-5.1# /usr/libexec/nvidia/tesla/bin/nvidia-smi -q | grep "GSP Firmware Version"
    GSP Firmware Version                  : N/A
bash-5.1# cat /proc/cmdline 
nvidia.NVreg_EnableGpuFirmware="0" [...]

Instance g4dn.xlarge without GSP disabled:

bash-5.1# /usr/libexec/nvidia/tesla/bin/nvidia-smi -q | grep "GSP Firmware Version"
    GSP Firmware Version                  : 535.161.07

Would this fix your issue or is there anything extra that you would need from Bottlerocket?

@foersleo foersleo added status/needs-info Further information is requested and removed status/needs-triage Pending triage or re-evaluation labels Mar 14, 2024
@chiragjn
Contributor Author

Oh amazing, didn't know about this
I'll try this out with Karpenter user data and report back by tomorrow
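For anyone else on Karpenter, I expect the same settings to go into the EC2NodeClass userData, roughly like this (untested sketch; names are illustrative and other required fields are omitted):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket-gpu-nogsp
spec:
  amiFamily: Bottlerocket
  # subnetSelectorTerms, securityGroupSelectorTerms, role, etc. omitted for brevity
  userData: |
    [settings.boot]
    reboot-to-reconcile = true
    [settings.boot.kernel-parameters]
    "nvidia.NVreg_EnableGpuFirmware" = ["0"]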

@chiragjn
Contributor Author

This works as expected! Thanks again :)
