feat: add support for nvidia MIG #35

Merged
arnaldo2792 merged 3 commits into bottlerocket-os:develop from piyush-jena:nvidia-mig on Feb 6, 2025
@piyush-jena (Contributor) commented Feb 5, 2025

Issue number:

Related:

Description of changes:
Adds the nvidia-migmanager service and binary, which configure the instance for NVIDIA MIG.

Testing done:

  1. Instance joined the cluster
NAME                                           STATUS   ROLES    AGE   VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   15h   v1.29.5-eks-1109419
  1. Model Default:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
  1. Model Updates:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-partitioning-strategy="mig"
bash-5.1# apiclient apply <<EOF
[settings.kubelet-device-plugins.nvidia.mig.profile]
"a100.40gb"="1g.5gb"
"h100.80gb"="4"
EOF
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "mig",
        "device-sharing-strategy": "none",
        "mig": {
          "profile": {
            "a100.40gb": "1g.5gb",
            "h100.80gb": "4"
          }
        },
        "pass-device-specs": true
      }
    }
  }
}

After an instance reboot, kubectl describe node shows 56 GPUs.

  1. Bounded check:
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "hello"="1g.5gb"
> EOF
Failed to apply settings: Failed to PATCH settings from '-' to '/settings?tx=apiclient-apply-7NsnlaurtHEacSYL': Status 400 when PATCHing /settings?tx=apiclient-apply-7NsnlaurtHEacSYL: Json deserialize error: Unable to deserialize into NvidiaGPUModel: NVIDIA GPU Model must match '^([a-z])(\d+)\.(\d+)gb$', given: hello at line 1 column 62
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "a100.40gb"="2"
> EOF
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "a100.40gb"="5"
> EOF
Failed to apply settings: Failed to PATCH settings from '-' to '/settings?tx=apiclient-apply-GzUHB0axGlWNPzGw': Status 400 when PATCHing /settings?tx=apiclient-apply-GzUHB0axGlWNPzGw: Json deserialize error: Unable to deserialize into MIGProfile: MIG Profile must match '^[0-9]g\.\d+gb$', given: 5 at line 1 column 71
  1. Files generated:
bash-5.1# cat /etc/nvidia-migmanager/nvidia-migmanager.toml
device-partitioning-strategy = "mig"
profile = { "a100.40gb" = "1g.5gb", "h100.80gb" = "4" }
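
The bounded checks in the testing notes above can be sketched as plain string validation. This is a minimal sketch, not the PR's implementation: the function names `is_gpu_model` and `is_mig_profile` are hypothetical, and the only things taken from the transcript are the two patterns quoted in the error messages. The transcript also shows bare values such as "2" and "4" being accepted and "5" being rejected, so the real `MIGProfile` type evidently allows some integer shorthand whose rule is not shown here; the sketch does not model it.

```rust
// Hypothetical helpers: the real NvidiaGPUModel and MIGProfile types are
// not visible on this page. Only the two patterns come from the error
// messages in the transcript. Digits are treated as ASCII (an assumption).

/// Returns true if `s` is a non-empty run of ASCII digits.
fn is_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Mirrors '^([a-z])(\d+)\.(\d+)gb$', for example "a100.40gb".
fn is_gpu_model(s: &str) -> bool {
    let Some(rest) = s.strip_suffix("gb") else { return false };
    let Some((name, mem)) = rest.split_once('.') else { return false };
    let mut chars = name.chars();
    match chars.next() {
        // One lowercase letter, then digits, then '.', then digits.
        Some(c) if c.is_ascii_lowercase() => is_digits(chars.as_str()) && is_digits(mem),
        _ => false,
    }
}

/// Mirrors '^[0-9]g\.\d+gb$', for example "1g.5gb".
/// The integer shorthand seen in the transcript is NOT modeled here.
fn is_mig_profile(s: &str) -> bool {
    let Some(rest) = s.strip_suffix("gb") else { return false };
    let Some((slices, mem)) = rest.split_once("g.") else { return false };
    slices.len() == 1 && is_digits(slices) && is_digits(mem)
}

fn main() {
    for key in ["a100.40gb", "h100.80gb", "hello"] {
        println!("gpu model {key:?}: {}", is_gpu_model(key));
    }
    for value in ["1g.5gb", "5"] {
        println!("mig profile {value:?}: {}", is_mig_profile(value));
    }
}
```

Rejecting malformed keys and values when the setting is applied, as the 400 responses above do, keeps bad input from reaching the generated nvidia-migmanager.toml.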

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Comment thread packages/kmod-5.10-nvidia/kmod-5.10-nvidia.spec Outdated
Comment thread packages/kmod-5.10-nvidia/mig-nvidia-fabricmanager.service.drop-in.conf Outdated
Comment thread sources/nvidia-migmanager/src/main.rs Outdated
error::GpuModelSnafu
);

if pci_device_id.starts_with("0x20B0") {
Contributor

nit: maybe not worth doing, but you could repeat the serde alias approach and deserialize the enum variant from the PCI device ID.
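
One way to read this suggestion is to move the PCI device ID comparison out of the `if` chain and into the enum's own parsing logic. The sketch below uses only the standard library, with `FromStr` standing in for serde's `alias` attribute, so it is an approximation of the idea rather than the change that was merged. The enum name `GpuModel`, the variant `A100With40Gb`, and the pairing of that variant with `0x20B0` are assumptions for illustration; only the `0x20B0` prefix check appears in the diff context above.

```rust
use std::str::FromStr;

/// Hypothetical GPU model enum; the variant name is an assumption.
/// With serde, the variant could instead carry `#[serde(alias = "0x20B0")]`,
/// which matches the whole string rather than a prefix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GpuModel {
    A100With40Gb,
}

/// Table of PCI device ID prefixes. Only "0x20B0" comes from the diff;
/// its pairing with this variant is assumed for illustration.
const PCI_PREFIXES: &[(&str, GpuModel)] = &[("0x20B0", GpuModel::A100With40Gb)];

impl FromStr for GpuModel {
    type Err = String;

    /// Resolves a PCI device ID by prefix, mirroring the `starts_with`
    /// check shown in the diff context.
    fn from_str(pci_device_id: &str) -> Result<Self, Self::Err> {
        PCI_PREFIXES
            .iter()
            .find(|(prefix, _)| pci_device_id.starts_with(*prefix))
            .map(|(_, model)| *model)
            .ok_or_else(|| format!("unrecognized PCI device ID: {pci_device_id}"))
    }
}

fn main() {
    match "0x20B0".parse::<GpuModel>() {
        Ok(model) => println!("matched {model:?}"),
        Err(err) => println!("error: {err}"),
    }
    match "0xFFFF".parse::<GpuModel>() {
        Ok(model) => println!("matched {model:?}"),
        Err(err) => println!("error: {err}"),
    }
}
```

Keeping the ID table next to the enum means supporting another GPU is a one-line change, which is the main appeal of the alias approach.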

Comment thread sources/nvidia-migmanager/src/main.rs Outdated
@piyush-jena piyush-jena force-pushed the nvidia-mig branch 2 times, most recently from cb05c7e to 2957c9e Compare February 6, 2025 02:17
@piyush-jena
Contributor Author

Force push fixes all the comments.

Comment thread Twoliter.toml Outdated
Comment thread packages/nvidia-migmanager/nvidia-migmanager.spec Outdated
@piyush-jena
Contributor Author

Force push fixes the above comments.

Contributor

@bcressey bcressey left a comment

Nice! 🥇

Comment thread packages/nvidia-migmanager/nvidia-migmanager.spec Outdated
Comment thread packages/nvidia-migmanager/nvidia-migmanager.spec Outdated
@piyush-jena
Contributor Author

Fixed the above 2 comments

Version: 0.1
Release: 1%{?dist}
Epoch: 1
Summary: Tool to generate NVIDIA MIG Binary and config files
Contributor

nit:

Suggested change
- Summary: Tool to generate NVIDIA MIG Binary and config files
+ Summary: Tool to manage NVIDIA MIG and its config files

@arnaldo2792 arnaldo2792 merged commit 20a86ea into bottlerocket-os:develop Feb 6, 2025
@ginglis13 ginglis13 mentioned this pull request Feb 6, 2025