
Inject Nvidia GPUs using volume-mounts to isolate them to assigned pods #3718

Merged

Conversation

@chiragjn (Contributor) commented on Jan 18, 2024

Issue number: NA

Motivation
When using nodes with multiple GPUs (e.g. g4dn.12xlarge), the default way of using nvidia-container-toolkit and nvidia-device-plugin leads to a problem where GPUs are not exclusively isolated to the pods they are assigned to.
This is because nvidia-device-plugin by default relies on NVIDIA_VISIBLE_DEVICES to decide which GPU devices nvidia-container-toolkit / nvidia-container-cli will inject into the pod.

Most NVIDIA CUDA base images have NVIDIA_VISIBLE_DEVICES=all baked into them, which means a pod with such an image gets access to all GPU cards instead of exclusively getting the number requested in resources.limits, yet the kubelet only reports the number in resources.limits as allocated.

E.g., on a node with 4 GPUs:

Pod 1 requests (with NVIDIA_VISIBLE_DEVICES=all in image)

resources:
  limits:
    nvidia.com/gpu: 1

Pod 2 requests (with NVIDIA_VISIBLE_DEVICES=all in image)

resources:
  limits:
    nvidia.com/gpu: 2

In this scenario, both pods get access to all 4 cards, and the node will report:

nvidia.com/gpu
Allocated: 3
Free: 1

This isn't good because the deployed non-privileged pods are unaware of each other and expect exclusive access to the cards they requested.

References:

  1. Read list of GPU devices from volume mounts instead of NVIDIA_VISIBLE_DEVICES
  2. Preventing unprivileged access to GPUs in Kubernetes
  3. Device Plugin Docs

Description of changes:

We follow the guidelines in the above docs.

  1. For nvidia-device-plugin, set --device-list-strategy volume-mounts so that allocated devices are passed as volume mounts instead of relying on the value of NVIDIA_VISIBLE_DEVICES.
  2. Configure the toolkit to only accept devices as volume mounts when the pod is not privileged (see the sketch below).

NVIDIA_VISIBLE_DEVICES as an env var will still be honored for privileged pods.
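
For concreteness, here is a minimal sketch of the K8s-side toolkit configuration described above; the two key names are taken from the diff in this PR, while the surrounding comments are illustrative only:

```toml
# Sketch of the K8s-variant additions to /etc/nvidia-container-runtime/config.toml.
# Devices allocated by the kubelet are handed to the runtime hook as volume mounts
# (nvidia-device-plugin must run with --device-list-strategy=volume-mounts for this).
accept-nvidia-visible-devices-as-volume-mounts = true

# Ignore NVIDIA_VISIBLE_DEVICES from the image/environment for non-privileged pods;
# privileged pods still honor the env var.
accept-nvidia-visible-devices-envvar-when-unprivileged = false
```

With these two settings, a non-privileged pod whose image bakes in NVIDIA_VISIBLE_DEVICES=all only sees the GPUs the kubelet actually allocated to it.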

Testing done:

I would need help/advice testing this out on actual Bottlerocket nodes.

EDIT: I built a custom AMI for EKS 1.28 following the docs and was able to confirm these changes work as expected.

My employer has been running this config on AL2 nodes without any issues, ensuring correct bin packing per node. I am attaching the tests we did for different scenarios.

Tests on AL2

g4dn.12xlarge - 4 GPUs

| Privileged | Requested GPU count | env NVIDIA_VISIBLE_DEVICES | Assigned GPUs |
|------------|---------------------|----------------------------|------------------|
| No | 2 | all | 2 GPUs assigned |
| No | 2 | none | 2 GPUs assigned |
| No | 2 | void | 2 GPUs assigned |
| No | 2 | 0,2,3 | 2 GPUs assigned |
| No | 0 | all | No GPUs assigned |
| No | 0 | none | No GPUs assigned |
| No | 0 | void | No GPUs assigned |
| No | 0 | 0,2,3 | No GPUs assigned |
| Yes | 2 | all | All 4 assigned |
| Yes | 2 | none | All 4 assigned |
| Yes | 2 | void | All 4 assigned |
| Yes | 2 | 0,2,3 | All 4 assigned |
| Yes | 0 | all | All 4 assigned |
| Yes | 0 | none | All 4 assigned |
| Yes | 0 | void | No GPUs assigned |
| Yes | 0 | 0,2,3 | All 4 assigned |

As you can see, in non-privileged mode NVIDIA_VISIBLE_DEVICES is ignored entirely.
Some of the info in the Google Doc linked above is outdated, so it doesn't align with the results above.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@chiragjn (Contributor, author) commented:

@arnaldo2792 Would really appreciate your review or someone else's from the team who has worked on this :)

@chiragjn (Contributor, author) commented:

On second thought, since Bottlerocket is not EKS-exclusive, would it make sense to expose these as settings so users can configure them according to their needs?

@arnaldo2792 (Contributor) commented:

Thanks for the great catch @chiragjn! (Yet another, nice!) As I mentioned elsewhere, we are still thinking about what changes we want to ask for in this PR, since as it is, it could break the ECS variant. Once again, thanks for the contribution, and I'll reply back soon with the suggestions!

@chiragjn (Contributor, author) commented on Jan 24, 2024

FWIW, I was able to build a custom AMI with these changes for the 1.28-nvidia variant fairly easily.
Kudos for the great docs and setup.
I can now confirm these changes work as expected :)

Maybe, until the maintainers decide how to expose these in the Settings API, the container runtime config can be templated as follows:

{{#if K8s}}
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
{{/if}}

...

AFAIK, the device plugin is K8s-only, so the config can be safely changed.


Not relevant to this discussion directly:
Additionally I removed the GSP firmware files because some versions of the 535 driver cause XID 119 errors and make the whole GPU unresponsive when trying to run DCGM.

@arnaldo2792 (Contributor) commented:

re: suggested implementation

{{#if K8s}}

Unfortunately, we don't provide a handlebars helper to evaluate whether the current host is K8s or ECS. The solution that aligns best with out-of-tree builds (see #2669) is to have two different sub-packages for nvidia-container-toolkit, one per variant, and include the right one as needed while building the final image. But to accomplish this, we need more changes in other places.

Additionally I removed the GSP firmware files because some versions of 535 driver cause XID 119 errors and make the whole GPU unresponsive when trying to run DCGM

Thanks for the heads up! I read the threads and I'll contact the author of this comment, since we should align with what they are planning to do to fix this problem.

@chiragjn (Contributor, author) commented:

Just curious whether providing a handlebars helper for the variant name being built might help, and what kind of effort that might take. I'll be happy to give it a shot (with some guidance) if that is a good approach :)

@@ -1,3 +1,6 @@
accept-nvidia-visible-devices-as-volume-mounts = true
@arnaldo2792 (Contributor) commented on the diff:

Hey @chiragjn, we had an internal discussion about the changes in this PR. I'll provide diffs of examples of how to accomplish what we need and explanations of why they are needed to try to ease the learning curve.

The first thing is to have two configuration files:

  • nvidia-container-toolkit-config-ecs.toml: this should be the file as it is today
  • nvidia-container-toolkit-config-k8s.toml: this is the new file with the configurations as you have them here.

Once you have the two files, you need to update the nvidia-container-toolkit.spec file to add both sources as follows:

 Source0: https://%{goimport}/archive/v%{gover}/nvidia-container-toolkit-%{gover}.tar.gz
-Source1: nvidia-container-toolkit-config.toml
+Source1: nvidia-container-toolkit-config-k8s.toml
 Source2: nvidia-container-toolkit-tmpfiles.conf
 Source3: nvidia-oci-hooks-json
 Source4: nvidia-gpu-devices.rules
+Source5: nvidia-container-toolkit-config-ecs.toml

And install them like this under the %install section in the spec file:

-install -m 0644 %{S:1} %{buildroot}%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml
+install -m 0644 %{S:1} %{buildroot}%{_cross_factorydir}/etc/nvidia-container-runtime/
+install -m 0644 %{S:5} %{buildroot}%{_cross_factorydir}/etc/nvidia-container-runtime/

Having two files will allow us to conditionally include one or the other based on the variant information. But we still need a way to include either one; we will accomplish this by creating two sub-packages of nvidia-container-toolkit. You can do this by adding something similar to the following lines right after the last %description section in the spec:

 %description
 %{summary}.

+%package ecs
+Summary: Files specific for the ECS variants
+Requires: %{name}
+
+%description ecs
+%{summary}.
+
+%package k8s
+Summary: Files specific for the Kubernetes variants
+Requires: %{name}
+
+%description k8s
+%{summary}.
+
 %prep

This will create the two sub-packages: nvidia-container-toolkit-ecs and nvidia-container-toolkit-k8s. Notice the Requires: %{name} snippet in the diff; it guarantees that nvidia-container-toolkit is installed alongside nvidia-container-toolkit-<subpackage>. After this, you need to include the correct file per package in the %files section:

 %{_cross_templatedir}/nvidia-oci-hooks-json
-%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml
 %{_cross_tmpfilesdir}/nvidia-container-toolkit.conf
 %{_cross_udevrulesdir}/90-nvidia-gpu-devices.rules
+
+%files ecs
+%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-ecs.toml
+
+%files k8s
+%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-k8s.toml

The last change in the spec file is to create the actual configuration file that will be used by nvidia-container-runtime-hook. In Bottlerocket, we use the "factory" feature of tmpfiles.d to create certain files under /etc. For /etc/nvidia-container-runtime/config.toml, the source of the factory is the file at %{_cross_factorydir}/etc/nvidia-container-runtime/config.toml. Thus, to provide the file for the factory, you will create a link at this location that points to the correct configuration file per variant. You can do this in a %post install script for each sub-package as follows:

+%post ecs -p <lua>
+posix.link("%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-ecs.toml", "%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml")
+
+%post k8s -p <lua>
+posix.link("%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-k8s.toml", "%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml")
+

The %post scripts should be placed in between the %install and %files sections.
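
For context on the "factory" mechanism mentioned above, here is a hypothetical tmpfiles.d entry illustrating how a file under the factory tree ends up at /etc. The real entry lives in nvidia-container-toolkit-tmpfiles.conf, whose contents aren't shown in this PR, so treat the line below as an assumption about its shape rather than the actual file:

```
# Hypothetical tmpfiles.d line: with the source argument omitted, systemd-tmpfiles
# copies the file from the factory tree (/usr/share/factory/<path>) when the
# target /etc/nvidia-container-runtime/config.toml does not exist yet.
C /etc/nvidia-container-runtime/config.toml - - - -
```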

The last thing to glue the changes together is to update each *-nvidia variant to include the variant-specific nvidia-container-toolkit sub-package. You can do this by updating the variants/*-nvidia/Cargo.toml file as follows:

included-packages = [
     "ecs-agent",
 # NVIDIA support
     "ecs-gpu-init",
-    "nvidia-container-toolkit",
+    "nvidia-container-toolkit-ecs",
     "kmod-6.1-nvidia-tesla-535",
 ]

And that's it! If all this is too overwhelming, or you don't have the bandwidth to work on it, please let me know; I can take over your changes and drive the PR to completion 👍.

@chiragjn (Contributor, author) replied:

Thank you for the extremely precise diff; you practically did all the work 😅
Anyway, I have made the changes, built an AMI, tested it out, and can confirm they work as expected.

I also confirmed the changes using an admin container:

bash-5.1# pwd
/x86_64-bottlerocket-linux-gnu/sys-root/usr/share/factory/etc/nvidia-container-runtime
bash-5.1# ls -li
total 8
2325 -rw-r--r--. 2 root root 237 Jan 30 18:53 config.toml
2325 -rw-r--r--. 2 root root 237 Jan 30 18:53 nvidia-container-toolkit-config-k8s.toml

Just two questions:

  1. Do the sub-packages need to appear in Cargo.lock too? I tried cargo generate-lockfile but nothing changed.
  2. Should config.toml be a hard link or a soft link? If I read the docs correctly, posix.link creates a hard link by default.

@arnaldo2792 (Contributor) replied:

I forgot to answer your questions, sorry!

  1. No, they don't; included-packages is a field we use in the build system.
  2. A hard link should be OK.

@chiragjn chiragjn marked this pull request as draft January 30, 2024 18:04
@chiragjn chiragjn marked this pull request as ready for review January 30, 2024 20:35
@arnaldo2792 (Contributor) left a review:

Changes look great 🎉! Just one request: could you please squash your commits? We don't squash them when we merge the PR 😅. And FYI, it seems the key you used to sign the last commits wasn't uploaded to GitHub, so the commits show as Unverified.

On my end, the only thing left is testing the Kubernetes variants; I'll do that first thing tomorrow. I already validated the ECS variants and things look good.

@chiragjn (Contributor, author) commented on Feb 1, 2024

No worries, I have squashed and rebased :)

@@ -30,6 +30,7 @@ included-packages = [
# NVIDIA support
"ecs-gpu-init",
"nvidia-container-toolkit",
"nvidia-container-toolkit-ecs",
@bcressey (Contributor) commented on the diff:

nit: I prefer to list only the most specific set of leaf packages here - you could drop "nvidia-container-toolkit" everywhere since the -k8s or -ecs subpackage will pull that in by way of dependencies.
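
For illustration, here is roughly what the ECS variant's list ends up looking like after applying this suggestion; the surrounding entries are copied from the earlier diff, so treat it as a sketch rather than the exact final file:

```toml
# ECS variant's variants/*-nvidia/Cargo.toml (sketch): only the variant-specific
# sub-package is listed, since nvidia-container-toolkit is pulled in as its dependency.
included-packages = [
    "ecs-agent",
# NVIDIA support
    "ecs-gpu-init",
    "nvidia-container-toolkit-ecs",
    "kmod-6.1-nvidia-tesla-535",
]
```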

@chiragjn (Contributor, author) replied:

Done :)
Thanks for the review!

@bcressey (Contributor) commented on Feb 5, 2024

@chiragjn - thanks for the contribution! It looks pretty much ready to me, with a couple of minor nits that you could address if you have the cycles.

@arnaldo2792 (Contributor) commented:

I confirmed the PR fixes the problem, and I can't get all the devices when NVIDIA_VISIBLE_DEVICES is all:

k8s-1.23

  • With NVIDIA_VISIBLE_DEVICES=all and nvidia.com/gpu=1
Fri Feb  2 21:10:44 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   20C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
  • Without NVIDIA_VISIBLE_DEVICES and nvidia.com/gpu=2
Fri Feb  2 21:12:05 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   20C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   19C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

k8s-1.24

  • With NVIDIA_VISIBLE_DEVICES=all and nvidia.com/gpu=1
Fri Feb  2 21:51:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1D.0 Off |                    0 |
| N/A   22C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  • Without NVIDIA_VISIBLE_DEVICES and nvidia.com/gpu=2
Fri Feb  2 21:51:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1B.0 Off |                    0 |
| N/A   22C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:1C.0 Off |                    0 |
| N/A   22C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

k8s-1.25

  • With NVIDIA_VISIBLE_DEVICES=all and nvidia.com/gpu=1
nvidia-pr on Fedora ❯ k exec gpu-tests-2-87rnk -- nvidia-smi
Sat Feb  3 00:16:40 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   16C    P8               8W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  • Without NVIDIA_VISIBLE_DEVICES and nvidia.com/gpu=2
Sat Feb  3 00:17:15 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1C.0 Off |                    0 |
|  0%   16C    P8               9W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    Off | 00000000:00:1D.0 Off |                    0 |
|  0%   17C    P8               8W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

k8s-1.26

  • With NVIDIA_VISIBLE_DEVICES=all and nvidia.com/gpu=1
Mon Feb  5 18:36:42 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1D.0 Off |                    0 |
|  0%   16C    P8               9W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  • Without NVIDIA_VISIBLE_DEVICES and nvidia.com/gpu=2
Mon Feb  5 18:37:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1B.0 Off |                    0 |
|  0%   15C    P8               8W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    Off | 00000000:00:1C.0 Off |                    0 |
|  0%   15C    P8               9W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

k8s-1.27

  • With NVIDIA_VISIBLE_DEVICES=all and nvidia.com/gpu=1
Mon Feb  5 23:14:51 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1D.0 Off |                    0 |
| N/A   19C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  • Without NVIDIA_VISIBLE_DEVICES and nvidia.com/gpu=2
Mon Feb  5 23:15:24 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1C.0 Off |                    0 |
| N/A   20C    P8               8W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   20C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

k8s-1.28

  • With NVIDIA_VISIBLE_DEVICES=all and nvidia.com/gpu=1
Mon Feb  5 23:24:52 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   19C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  • Without NVIDIA_VISIBLE_DEVICES and nvidia.com/gpu=2
Mon Feb  5 23:25:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1B.0 Off |                    0 |
| N/A   19C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:1C.0 Off |                    0 |
| N/A   19C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

k8s-1.29

  • With NVIDIA_VISIBLE_DEVICES=all and nvidia.com/gpu=1
Tue Feb  6 01:24:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1C.0 Off |                    0 |
| N/A   20C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  • Without NVIDIA_VISIBLE_DEVICES and nvidia.com/gpu=2
Tue Feb  6 01:24:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1B.0 Off |                    0 |
| N/A   21C    P8              11W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   21C    P8              11W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

…ed pods

Create separate container toolkit config for ECS and K8s
Apply suggestions from code review
Drop `nvidia-container-toolkit` because it is now a transitive dependency

@bcressey (Contributor) left a review:

🚀

@arnaldo2792 arnaldo2792 merged commit 0da372a into bottlerocket-os:develop Feb 7, 2024
50 checks passed
@vyaghras vyaghras mentioned this pull request Feb 21, 2024